Machine learning
bayes
Jan 23, 2022     9 minutes read

1. Why yet another article about Bayes’ theorem and bayesian inference?

There seem to be enough of them on the net.

From my perspective, the articles available on the internet (by internet I mean Medium or towardsdatascience) are:

I will try to address all of the flaws above.

2. Gaining intuition

circles and areas

Probably the most common way of building intuition for anything is by showing a picture. Bayes’ theorem can be illustrated graphically as well, e.g. with two slightly overlapping circles:

We can see two areas: A and B, which overlap and have a common area: A.B. We may amuse ourselves with a short riddle:

Knowing I am in B, what is the probability that I am in A?

It may sound simple enough, and we denote it as conditional probability:

$$ P(A|B) = \frac{P(A\cap B)}{P(B)} $$

which means we are somewhere in the area denoted as A.B ($A \cap B$, in other words), but we also know that we are surely, 100%, in B, so the probability of being in B is one. Hence we adjust our perspective to being surely in B by dividing by P(B).

Personally, I find it easier to think of areas instead of probabilities. If the area of A.B is 2 and the area of B is 10, then the chance of being in A.B, given that we are in B, is 2/10 = 0.2.
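To make the area intuition concrete, here is a minimal Python sketch (the areas are made up for illustration):

```python
# Treating probabilities as ratios of areas (numbers made up for illustration).
area_B = 10.0        # area of circle B
area_A_and_B = 2.0   # area of the overlap A.B

# "Knowing I am in B, what is the probability that I am in A?"
p_A_given_B = area_A_and_B / area_B
print(p_A_given_B)   # 0.2
```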

But conditional probability is not Bayes’ theorem yet. We can rewrite the numerator using conditional probability:

$$ P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)} $$

which is a little more difficult to understand graphically, but still possible. Now the numerator is a funny expression showing that the common area can be written in terms of conditional probability. The best way to see it is to make a few drawings yourself (it took me maybe 3 hours until I actually felt it). One transformation turned out to be particularly useful, the moment I saw it in terms of circles in my head:

$$ P(A\cap B) = P(A|B) \cdot P(B) = P(B|A) \cdot P(A) $$

The intersection can be described in two symmetrical ways, by referring both to A and B and their conditional probabilities.
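A quick numeric check of that symmetry, again with hypothetical areas (a total area is added so that areas can be turned into probabilities):

```python
# Hypothetical areas: the whole picture, the two circles and their overlap.
total, area_A, area_B, area_AB = 20.0, 5.0, 10.0, 2.0

p_A, p_B, p_AB = area_A / total, area_B / total, area_AB / total
p_A_given_B = p_AB / p_B   # 0.2
p_B_given_A = p_AB / p_A   # 0.4

# Both products recover the same intersection probability P(A ∩ B) = 0.1.
assert abs(p_A_given_B * p_B - p_AB) < 1e-12
assert abs(p_B_given_A * p_A - p_AB) < 1e-12
```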

There is still one more refinement: we can rewrite the probability of B in terms of the conditional probabilities given A:

$$ P(B) = P(B|A) \cdot P(A) + P(B|\neg A) \cdot P(\neg A) = \Sigma_{A^*} P(B|A^*) \cdot P(A^*) $$ where $$ A^* \in \{ A, \neg A \} $$

which for the moment may seem to serve no purpose. It seems fairly obvious that P(B) is equal to the intersection of B and A plus the intersection of B and everything that is not A. In the case of circles this decomposition is not very useful (at least for me), but for bayesian inference it is crucial.
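The same decomposition in code, with hypothetical numbers for $P(A)$, $P(B|A)$ and $P(B|\neg A)$:

```python
# Hypothetical inputs: probability of A and the two conditionals of B.
p_A = 0.3
p_B_given_A = 0.8
p_B_given_not_A = 0.1

# Law of total probability: sum over A and not-A.
p_B = p_B_given_A * p_A + p_B_given_not_A * (1 - p_A)
print(p_B)   # 0.8 * 0.3 + 0.1 * 0.7 = 0.31
```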

Unfortunately, this “circle/area” intuition cannot be used to understand bayesian inference. Imagine applying Bayes’ theorem to model parameters $\theta$ and observed data $D$.

I don’t think it is possible to interpret the intersection of parameters and data as the common area of these two sets. So other visualizations must be used to understand it.

contingency table

A contingency table is often used in statistics to analyze the joint distribution of two random variables, e.g. $X$ and $Y$. It is useful because one can pretty easily see if the variables are correlated/dependent (the condition for independence is $P(X = x \cap Y = y) = P(X = x) \cdot P(Y = y)$ for all possible values $x$ and $y$).

In machine learning a contingency table is usually used for model validation in the form of a confusion matrix, with its derived metrics: precision/recall, sensitivity/specificity, etc. The elements of a confusion matrix can also be interpreted in terms of type I and type II errors (FP and FN, respectively).

As you can see, a contingency table can express basic concepts of machine learning (confusion matrix), statistics (hypothesis testing) and bayesian inference (marginal distributions), which helps you build new knowledge on concepts you already know.

There are two popular versions of a contingency table: with probabilities and with counts. The only difference, besides the obvious differences in interpretation, is that the latter is the former multiplied by n (the number of observations).
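As a minimal sketch of both uses, the snippet below takes a hypothetical 2x2 count table, converts it to probabilities, checks the independence condition and reads it as a confusion matrix (all counts are invented for illustration):

```python
import numpy as np

# Hypothetical 2x2 contingency table of counts; rows = Y (or "actual"),
# columns = X (or "predicted"), both in the order [1, 0].
counts = np.array([[10, 30],
                   [40, 20]])
n = counts.sum()

# Probability version of the same table and its marginals.
joint = counts / n            # P(Y = y, X = x)
p_x = joint.sum(axis=0)       # marginal P(X = x)
p_y = joint.sum(axis=1)       # marginal P(Y = y)

# Independence check: P(X = x, Y = y) == P(X = x) * P(Y = y) for every cell.
print("independent:", np.allclose(joint, np.outer(p_y, p_x)))

# Read as a confusion matrix: [[TP, FN], [FP, TN]], with class 1 as positive.
tp, fn = counts[0]
fp, tn = counts[1]
print("precision:", tp / (tp + fp), "recall:", tp / (tp + fn))
```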

A contingency table in its basic form, for two random variables, looks like this:

$$ \begin{array}{cc|cc|c} & & \pmb{X} & \\ & & 1 & 0 & \Sigma \\ \hline \pmb{Y} & 1 & 0.1 & 0.3 & 0.4 \\ & 0 & 0.4 & 0.2 & 0.6 \\ \hline & \Sigma & 0.5 & 0.5 & 1 \\ \end{array} $$

in the case of probabilities. Confusion matrices contain integer counts, not fractions. In the example above I used two random variables, $X$ and $Y$, but they can be named anything, e.g. $\theta$ and $D$:

$$ \begin{array}{cc|cc|c} & & \pmb{\theta} & \\ & & 1 & 0 & \Sigma \\ \hline \pmb{D} & 1 & 0.1 & 0.3 & 0.4 \\ & 0 & 0.4 & 0.2 & 0.6 \\ \hline & \Sigma & 0.5 & 0.5 & 1 \\ \end{array} $$

which will give us a feel of bayesian inference. Let’s see how we can derive the table above:

$$ \begin{array}{cc|cc|c} & & \pmb{\theta} & \\ & & \theta = 1 & \theta = 0 & \Sigma \\ \hline \pmb{D} & D = 1 & P(\theta = 1 \cap D = 1) & P(\theta = 0 \cap D = 1) & P(\theta = 1 \cap D = 1) + P(\theta = 0 \cap D = 1) \\ & D = 0 & P(\theta = 1 \cap D = 0) & P(\theta = 0 \cap D = 0) & P(\theta = 1 \cap D = 0) + P(\theta = 0 \cap D = 0) \\ \hline & \Sigma & P(\theta = 1 \cap D = 1) + & P(\theta = 0 \cap D = 1) + & 1 \\ & & + P(\theta = 1 \cap D = 0) & + P(\theta = 0 \cap D = 0) & \\ \end{array} $$

and let’s see if we can derive one of the rows (we could do the same for columns), as we are well equipped with our brand new knowledge of Bayes’ theorem.

$$ \begin{array}{cc|cc|c} & & \pmb{\theta} & \\ & & \theta = 1 & \theta = 0 & \Sigma \\ \hline \pmb{D} & D = 1 & P(D=1|\theta=1)\cdot P(\theta=1) & P(D=1|\theta=0)\cdot P(\theta=0) & \Sigma_{\theta^*}P(D=1|\theta=\theta^*) \cdot P(\theta=\theta^*) \\ \end{array} $$

We could obviously substitute the intersection probabilities with either of their symmetric conditional forms, because

$$ P(\theta \cap D) = P(\theta|D) \cdot P(D) = P(D|\theta) \cdot P(\theta) $$

but in the rightmost column, where we have the sum, it has a slightly nicer interpretation: $D$ is equal to 1, i.e. constant in the whole row, and the sum itself is called the marginal likelihood; in bayesian statistics it is often simply called the evidence and denoted as $P(D)$. As we can see, it must be equal to $P(D)$: since we sum over all the possible values of $\theta$, it is the probability of $D$ regardless of the value of $\theta$.
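A minimal sketch of this sum, using the joint probabilities from the 0/1 table above:

```python
import numpy as np

# Joint probabilities from the theta/D table above:
# rows = D in the order [1, 0], columns = theta in the order [1, 0].
joint = np.array([[0.1, 0.3],
                  [0.4, 0.2]])

prior = joint.sum(axis=0)          # P(theta): column sums -> [0.5, 0.5]
likelihood_D1 = joint[0] / prior   # P(D=1 | theta)        -> [0.2, 0.6]

# Evidence: sum over all possible values of theta in the D = 1 row.
evidence_D1 = np.sum(likelihood_D1 * prior)
print(evidence_D1)                 # 0.4, i.e. simply the row sum of the D = 1 row
```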

What we see in the table above is a specific case for variables which take the values 0 and 1. Sometimes we want to abstract away from the specific values of $\theta$ and $D$, as there may be many of them, even infinitely many. Then we can rewrite the table above as follows:

$$ \begin{array}{cc|ccc|c} & & & \pmb{\theta} & & \\ & & \dots & \text{some}~\theta & \dots & \Sigma \\ \hline & \dots & \dots & \dots & \dots & \dots \\ \pmb{D} & \text{some}~D & \dots & P(D|\theta)\cdot P(\theta) & \dots & \Sigma_{\theta^*} P(D|\theta^*) \cdot P(\theta^*) \\ & \dots & \dots & \dots & \dots & \dots \\ \hline & \Sigma & \dots & \Sigma_{D^*}P(\theta|D^*) \cdot P(D^*) & \dots & 1 \\ \end{array} $$

from which we can deduce a value

$$ \frac{P(D|\theta)\cdot P(\theta)}{\Sigma_{\theta^*} P(D|\theta^*) \cdot P(\theta^*)} $$

which is obviously equal to $P(\theta|D)$, as Bayes’ theorem states. Just as in the basic example with the $X$ and $Y$ random variables, the conditional probability of $\theta$ given $D$ is their intersection divided by the whole of $D$.
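The same division spelled out in code, continuing the 0/1 example from the table above:

```python
import numpy as np

# Same joint table as before: rows = D in [1, 0], columns = theta in [1, 0].
joint = np.array([[0.1, 0.3],
                  [0.4, 0.2]])

prior = joint.sum(axis=0)         # P(theta)
likelihood_D1 = joint[0] / prior  # P(D=1 | theta)

# Bayes' theorem: posterior over theta after observing D = 1.
posterior_D1 = likelihood_D1 * prior / np.sum(likelihood_D1 * prior)
print(posterior_D1)               # [0.25, 0.75], i.e. joint[0] / joint[0].sum()
```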

Now imagine that $\theta$ and $D$ have continuous distributions. In this case, in order to calculate the denominator of the equation above, we would have to add up infinitely many numbers. In mathematics there is even a special notation (abstraction) for that: the integral. Hence we can rewrite Bayes’ theorem for continuous variables:

$$ P(\theta|D) = \frac{P(D|\theta)\cdot P(\theta)}{\int d \theta^* ~ P(D|\theta^*) \cdot P(\theta^*)} $$

See? We’ve just exchanged $\Sigma_{\theta^*}$ for $\int d \theta^*$. (Computationally it is slightly more complicated than just replacing one symbol with another, since many posterior distributions are impossible to derive analytically, and numerical methods must be used, like MCMC.)
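To see what that exchange looks like numerically, here is a tiny grid-approximation sketch (a much simpler stand-in for MCMC; the coin-flip model, the uniform prior and the data are all my own made-up choices):

```python
import numpy as np

# Grid approximation of a continuous posterior: a Bernoulli parameter theta,
# a uniform prior, and (made-up) data of 7 heads in 10 coin flips.
theta = np.linspace(0.0, 1.0, 1001)
prior = np.ones_like(theta)                # uniform prior density on [0, 1]
likelihood = theta**7 * (1 - theta)**3     # P(D | theta), up to a constant factor

# The integral in the denominator becomes a sum over the grid points.
unnormalized = likelihood * prior
posterior = unnormalized / (unnormalized.sum() * (theta[1] - theta[0]))

print(theta[np.argmax(posterior)])         # posterior mode, ~0.7
```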

3. References

The best book on Bayes’ theorem and bayesian inference is Doing Bayesian Data Analysis by John K. Kruschke, and many of the examples I gave in this blogpost are my own interpretations of some chapters of this book.
