When Chance Meets Necessity: Part 2

On the implications of incorrect independence assumptions in statistics

Nariman Mammadli
4 min read · Oct 11, 2021

In Part 1, I wrote about the divide between Bayesians and frequentists in their approach to statistical problems. In this article, I explore the assumption of independence (part of the familiar i.i.d. assumption: independent and identically distributed), which is a necessary condition for the frequentist conceptualization of probability. The independence assumption treats data as a composition of logically independent pieces occurring at different frequencies, forming a frequency distribution. Under uncertainty, multiple independence assumptions are possible, each leading to a different frequency distribution. An incorrect independence assumption produces an incorrect frequency distribution, and therefore incorrect predictions. Below, I work through two example cases that demonstrate the problem and describe their solutions.

The Coin Flip Scam

What is the probability of observing heads after seeing "HTHTHTHTHTHT"? If the frequentist assumes that each flip is independent of the others, then the probability of heads on the next flip is 0.5 (Figure 1a). However, that answer does not satisfy intuition: the alternating pattern hints at a logical dependence across individual coin flips. The frequentist can take the intuition into account by assuming independence across non-overlapping pairs of flips instead of single flips, and reach the correct answer of 1.0 (Figure 1b).
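The two assumptions can be compared directly by counting frequencies over different units of independence. A minimal sketch, where the unit is either a single flip or a non-overlapping pair:

```python
# Compare the frequentist's prediction for the next flip under two
# different independence assumptions about the same sequence.
from collections import Counter

sequence = "HTHTHTHTHTHT"

# Assumption (a): every single flip is an independent draw.
singles = Counter(sequence)
p_heads_single = singles["H"] / len(sequence)

# Assumption (b): non-overlapping pairs of flips are the independent units.
pairs = Counter(sequence[i:i + 2] for i in range(0, len(sequence), 2))
# The sequence ends on a pair boundary, so the next flip starts a new pair;
# predict heads with the frequency of pairs that begin with "H".
p_heads_pair = sum(n for pair, n in pairs.items() if pair[0] == "H") / sum(pairs.values())

print(p_heads_single)  # 0.5
print(p_heads_pair)    # 1.0
```

Under assumption (a) the counts of H and T are equal, giving 0.5; under assumption (b) every observed pair is "HT", so a new pair begins with heads with frequency 1.0.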

Figure 1. (a) Assuming that each coin flip is independent leads to the wrong answer of 0.5. (b) Assuming that double flips are independent corrects the error and leads to the true answer.

The Emperor's Height

Edwin Jaynes [1] gives another example. Imagine we want to determine the emperor's height by conducting a poll. Suppose each person knows the emperor's height only with some plus-or-minus error. It seems reasonable — to frequentists — to ask thousands of people and average their answers, hoping that individual errors will cancel out and the true answer will emerge. Such a premise assumes logical independence across the answers given by different individuals.

What if the majority has never seen the emperor in real life, and rumours about his height have circulated for a long time? Then there is a popular bias about the matter. Some individuals may still hold some objective data, but their final answers will blend that data with the popular bias. Each answer therefore contains two types of error: random error and systematic error due to the bias. Averaging across the population will not eliminate the latter, because the answers are not independent, contrary to the original assumption. How can the scientist correct the second error? Prelec et al. [2] suggest an interesting method, known as the "surprisingly popular" rule, for extracting the popular bias: choose the answer that is more popular than people predict it to be. Applying this insight to our case, we would ask each participant two questions: 1) How do you think others will answer this question? 2) What is your answer? We now have two datasets, one per question. Suppose a minority of the population has seen the emperor, and this minority is aware of the popular bias. In that case, everyone, including the minority, will agree on the first question; on the second question, however, the minority's disagreement will show up (Figure 3a).
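The failure of averaging described above can be simulated directly. A minimal sketch, with illustrative numbers of my own choosing (true height, bias size, and noise scale are assumptions, not values from the article):

```python
# Simulate the polling premise: averaging cancels independent random error,
# but a bias shared across all respondents survives any sample size.
import random

random.seed(0)
true_height = 180.0   # assumed ground truth (illustrative)
popular_bias = 10.0   # rumour shifts every answer by the same amount

# Each answer = truth + shared bias + independent random noise.
answers = [true_height + popular_bias + random.gauss(0, 5) for _ in range(100_000)]
mean = sum(answers) / len(answers)
print(round(mean, 1))  # close to 190.0, not 180.0: the shared bias remains
```

No matter how many respondents we add, the sample mean converges to the biased value, which is exactly the systematic error that more data cannot fix.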

Figure 3. (a) If the minority knows the majority's bias, a difference emerges between the Q1 and Q2 distributions. (b) If the minority is unaware of the popular bias, the results of Q1 match the results of Q2, hiding the signal we had in (a).

However, if the minority were not aware of the popular bias despite being closer to the truth, their answer to Q1 would match their answer to Q2 (Figure 3b).
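The "surprisingly popular" rule itself is simple to state in code. A minimal sketch for a two-option question, with hypothetical survey fractions (the answer labels and numbers are illustrative assumptions):

```python
# The "surprisingly popular" rule of Prelec et al.: an answer wins if its
# actual popularity exceeds its predicted popularity.

# Q2 ("What is your answer?"): fraction of respondents giving each answer.
actual = {"180cm": 0.7, "165cm": 0.3}

# Q1 ("How will others answer?"): average predicted fraction for each answer.
predicted = {"180cm": 0.9, "165cm": 0.1}

# Pick the answer whose actual share most exceeds its predicted share.
surprisingly_popular = max(actual, key=lambda a: actual[a] - predicted[a])
print(surprisingly_popular)  # "165cm" — the informed minority's answer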

Conclusion

From these two examples, we can see how a wrong independence assumption leads to systematic errors that cannot be remedied simply by collecting more data; fixing them requires changing the assumption itself. Systematic errors, however, point us in the right direction, towards the true hypothesis. The popular bias about the emperor's height created a logical dependence across answers, which caused the systematic error; yet the presence of that bias and its nature are themselves pieces of knowledge the statistician is interested in. In subsequent posts, I will explore this idea further.

FAQ

  1. How can we systematically identify when the assumption of independence is violated in real-world datasets, beyond intuitive examples like coin flips or historical knowledge about the emperor's height? — Identifying violations typically involves statistical tests of independence, such as the chi-square test, along with exploratory data analysis to detect patterns or correlations that suggest dependencies.
  2. How does the process of adjusting for popular bias, as suggested by Prelec et al., perform in diverse fields of study, such as social sciences or epidemiology, where subjective perception and objective data intersect frequently? — The effectiveness of adjusting for popular bias using methods like those suggested by Prelec et al. varies by context. It often requires domain-specific knowledge to correctly interpret the biases and dependencies present in the data, suggesting a tailored approach for each field.
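The chi-square check mentioned in FAQ 1 can be applied to the coin-flip example by tabulating transitions between consecutive flips. A minimal pure-Python sketch (in practice one would use `scipy.stats.chi2_contingency`):

```python
# Test whether consecutive coin flips look independent by building a 2x2
# contingency table of (previous flip, next flip) transitions.
from collections import Counter

sequence = "HTHTHTHTHTHT"
transitions = Counter(zip(sequence, sequence[1:]))

symbols = "HT"
table = [[transitions[(r, c)] for c in symbols] for r in symbols]
row = [sum(r) for r in table]
col = [sum(x) for x in zip(*table)]
total = sum(row)

# Chi-square statistic: sum over cells of (observed - expected)^2 / expected,
# where expected counts assume independence of consecutive flips.
chi2 = sum(
    (table[i][j] - row[i] * col[j] / total) ** 2 / (row[i] * col[j] / total)
    for i in range(2) for j in range(2)
)
print(round(chi2, 2))  # a large value suggests the independence assumption fails
```

For the perfectly alternating sequence the statistic equals the number of transitions (11.0), far above the critical value for one degree of freedom, so single-flip independence would be rejected.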

References

[1] Jaynes, E. T., & Bretthorst, G. L. (2019). Section 8.9. In Probability Theory: The Logic of Science (pp. 257–259). Cambridge University Press.
[2] Prelec, D., Seung, H. & McCoy, J. A solution to the single-question crowd wisdom problem. Nature 541, 532–535 (2017). https://doi.org/10.1038/nature21054


Exploring the boundaries of artificial intelligence with a special interest in its applications in cybersecurity. linkedin.com/in/mammadlinariman