When Chance Meets Necessity: Part 1

How Bayesians and frequentists pursue knowledge in uncertainty

Nariman Mammadli
Jan 1, 2021

Unlike other branches of mathematics, which deal in exactitude, statistics deals with uncertainty. Despite being a branch of mathematics, it has attracted considerable conflict over how it should be applied and how its results should be interpreted. The word statistics derives from the Latin statisticum collegium ("council of state") and the Italian statista ("statesman" or "politician"). Fittingly, statistics has two "parties": frequentists and Bayesians. As with all opposing parties, frequentists and Bayesians hold different presuppositions about uncertainty. The former perceives it as an inherent property of external events; the latter perceives it as the product of one's lack of knowledge about those events. These two presuppositions suggest two distinct ways of doing statistics, and the conflict between them has far-reaching consequences for the life sciences, the social sciences, and artificial intelligence. In this series of posts, I will explore the depths of this conflict and discuss some scenarios, especially relevant to the life sciences and artificial intelligence, in which cooperation between the two becomes necessary in spite of their opposing preconceptions.

Setting the stage

The frequentist conceptualizes the probability of an event as its long-run frequency of occurrence. For example, if an event A has occurred 20% of the time so far, then the probability of A is 20%. The Bayesian understands the probability of an event as the degree to which it can be deduced from prior knowledge [1]. For example, if A can be deduced from prior knowledge with 20% certainty, then the probability of A is 20%. Being objective properties of the world, frequencies (usually) do not change, but degrees of certainty change as knowledge changes. In the remaining text, I will use the word 'frequency' (or 'frequent') to denote the frequentist viewpoint and reserve 'probability' (or 'probable') for the Bayesian.

I do not know; therefore, I am uncertain.

The Bayesian defines candidate hypotheses that claim to explain a given dataset. Each candidate is assigned a confidence level, or prior probability, derived from past knowledge and experience. The prior probabilities are then updated to incorporate the evidence found in the dataset at hand. This process is formalized in Bayes' theorem:

P(H|D) = P(D|H) · P(H) / P(D)

where:

P(H|D) is the posterior probability of hypothesis H given the data D; P(D|H) tells us how probable the data D are, given that H is true; P(H) is the prior probability of H; and P(D) is a normalizing constant. Besides encoding prior knowledge, P(H) also controls for overfitting: even if some H explains D perfectly, it is still not a good option if H's prior probability is very low (e.g., conspiracy theories).
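To make the update concrete, here is a minimal sketch in Python. The two hypotheses (a fair coin vs. a heads-biased coin), their priors, and the binomial likelihoods are my own illustrative assumptions, not taken from the text:

```python
from math import comb

def bayes_update(priors, likelihoods):
    """Return the posterior P(H|D) for each hypothesis H.

    priors:      dict hypothesis -> P(H)
    likelihoods: dict hypothesis -> P(D|H)
    """
    # P(D) = sum over H of P(D|H) * P(H): the normalizing constant
    evidence = sum(likelihoods[h] * priors[h] for h in priors)
    return {h: likelihoods[h] * priors[h] / evidence for h in priors}

def binom(k, n, p):
    """Probability of k successes in n trials with success rate p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Prior knowledge strongly favors the fair coin.
priors = {"fair": 0.9, "biased": 0.1}

# Data D: 8 heads in 10 tosses. P(D|H) under each hypothesis.
likelihoods = {"fair": binom(8, 10, 0.5), "biased": binom(8, 10, 0.8)}

posteriors = bayes_update(priors, likelihoods)
```

Note how the prior plays the overfitting-control role described above: the biased coin explains the data far better (higher P(D|H)), yet its low prior keeps the fair coin as the more credible hypothesis after the update.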

I see; therefore, I am certain.

To talk about frequencies in a given dataset, one counts the number of occurrences falling into pre-defined bins (see histogram). To do so, one must 'cut' the dataset into pieces and then compute a histogram over them. These pieces are assumed to be independent (see i.i.d.) and connected via the logical 'and' operator; 'independent repetitions of a random experiment' in the scientific literature denotes the same principle. For example, assume multiple repetitions of a coin toss experiment resulted in "HTHTHTHTHTHTHTHT?" and we are asked for the probability of heads on the next toss. The frequentist would reach the wrong answer of 0.5 if he cut the dataset into single tosses: H; T; H; T, and so on (Figure 1a). There is an alternating pattern, a logical chain, that binds the consecutive tosses. A better cut is every two tosses: HT; HT; HT, and so on (Figure 1b).

Figure 1. (a) Assuming that each coin flip is independent leads to the wrong answer of 0.5. (b) Assuming that double flips are independent corrects the error and leads to the true answer.
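The two cuts from Figure 1 can be sketched in a few lines of Python; the variable names are mine:

```python
from collections import Counter

data = "HTHTHTHTHTHTHTHT"  # the observed tosses from the example

# Cut (a): each single toss is assumed independent.
singles = Counter(data)  # 8 heads, 8 tails -> P(H) estimated as 0.5

# Cut (b): each pair of tosses is assumed independent.
pairs = Counter(data[i:i + 2] for i in range(0, len(data), 2))
# Every block is 'HT', so after the final 'T' the next toss must be 'H'.
```

Under cut (a) the histogram is flat and heads looks like a 50/50 bet; under cut (b) the histogram puts all its mass on 'HT', exposing the alternating pattern.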

The decision on how to cut the dataset goes hand in hand with finding the best hypothesis. Different independence assumptions result in different frequency distributions, and inference quality (e.g., precision, recall) varies with each distribution, as in the example above. The goal is to find the hypothesis, that is, the way of cutting the data, that maximizes P(D|H), subject to extra constraints that control overfitting (e.g., regularization, dropout). Note that the frequentist's conclusion is not posterior probabilities over candidate hypotheses but a single hypothesis that achieves this goal.
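This recipe — treat each cut as a hypothesis, estimate frequencies from the cut, and score the hypothesis by the likelihood P(D|H) it assigns to the data — can be sketched as follows. The helper names and the restriction to block sizes 1 and 2 are my own assumptions:

```python
from collections import Counter
from math import prod

data = "HTHTHTHTHTHTHTHT"

def likelihood(data, block):
    """P(D|H) when blocks of the given size are assumed independent."""
    chunks = [data[i:i + block] for i in range(0, len(data), block)]
    freq = Counter(chunks)
    n = len(chunks)
    # Each observed chunk is assigned its empirical frequency;
    # the i.i.d. assumption lets us multiply them together.
    return prod(freq[c] / n for c in chunks)

# Each candidate hypothesis is a block size (a way of cutting the data).
scores = {block: likelihood(data, block) for block in (1, 2)}
best = max(scores, key=scores.get)  # the single winning hypothesis
```

Cutting into pairs assigns the data likelihood 1 (every chunk is 'HT'), while cutting into singles assigns only 0.5 per toss, so the pairwise hypothesis wins. In a realistic setting the constraints mentioned above would penalize overly flexible cuts, which would otherwise always fit the data perfectly.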

Conclusion

I discussed how Bayesians and frequentists pursue knowledge under uncertainty. The Bayesian starts from multiple candidate hypotheses with prior probabilities and checks them against data to readjust those probabilities. The frequentist starts from data, frames it as logically independent repetitions of a random experiment, and finds the one hypothesis that minimizes some cost function.

The second difference lies in their approach to preventing overfitting. The Bayesian assigns prior credibility to proposed hypotheses based on their content; the frequentist does the same based on their form.

References

[1] Jaynes, E. T., & Bretthorst, G. L. (2019). Plausible Reasoning. In Probability theory: The logic of science (pp. 3–23). Cambridge: Cambridge Univ. Press.
