Bayes’ Theorem – Part 2

In the previous post, we saw Bayes’ theorem and a way to use it as a framework for updating mathematical models. We concluded by foreshadowing multiple updates of a model (or a belief) following multiple experiments. This is where we pick up now.

Assume that we have two experiments E_1 and E_2. In the example from the first part, these could be “The first student I ask is in the program focused on Models” and “The second student I ask is in the program”. What we want to compute is our belief a posteriori, after both replies:

    \begin{align*}P(H|E_1, E_2) &= \frac{P(E_2|H, E_1)\cdot P(H|E_1)}{P(E_2|H, E_1)\cdot P(H|E_1)+P(E_2|H^c, E_1)\cdot P(H^c|E_1)} \\& = \frac{P(E_2|H, E_1)}{P(E_2|H, E_1)\cdot P(H|E_1)+P(E_2|H^c, E_1)\cdot P(H^c|E_1)}\\&\phantom{=}\cdot \frac{P(E_1|H)\cdot P(H)}{P(E_1|H)\cdot P(H) + P(E_1|H^c)P(H^c)}.\end{align*}

We can see some iteration in the conditioning, but the expression is quite difficult to simplify unless we assume some kind of independence between the experiments. We will assume that E_1 and E_2 are conditionally independent given H (and given H^c), that is, P(E_1,E_2|H)=P(E_1|H)\cdot P(E_2|H) (and similarly for H^c). In other words, we are saying that in the world where H is true, the two experiments are independent. This is not the same as requiring the two experiments to be unconditionally independent. It is sensible to assume that, within each of the two worlds (H true, H false), the experiments do not influence one another; at the same time, without conditioning, knowing the result of the first experiment does give some information about the second (for example, E_1 makes H more likely, which in turn makes E_2 more likely).

Going back to our updating, we have, under the assumption of conditional independence, that

    \begin{align*}P(H|E_1, E_2) &= \frac{P(E_2|H)}{P(E_2|H)\cdot P(H|E_1)+P(E_2|H^c)\cdot P(H^c|E_1)}\cdot \frac{P(E_1|H)\cdot P(H)}{P(E_1|H)\cdot P(H) + P(E_1|H^c)P(H^c)}.\end{align*}
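As a sanity check, the two-step posterior can be computed numerically by applying the one-step update twice. Here is a minimal Python sketch; the prior and the likelihoods are made-up illustrative values, not data from the class example:

```python
# Two-step Bayesian update under conditional independence of E_1 and E_2.
# All numbers below are illustrative assumptions.
p_h = 0.5            # prior P(H)
p_e_given_h = 0.8    # P(E_i | H): chance of a "yes" if H holds (assumed)
p_e_given_hc = 0.3   # P(E_i | H^c): chance of a "yes" if H fails (assumed)

def update(prior, likelihood_h, likelihood_hc):
    """One application of Bayes' theorem for H against H^c."""
    numerator = likelihood_h * prior
    return numerator / (numerator + likelihood_hc * (1 - prior))

posterior_1 = update(p_h, p_e_given_h, p_e_given_hc)          # P(H | E_1)
posterior_2 = update(posterior_1, p_e_given_h, p_e_given_hc)  # P(H | E_1, E_2)
print(posterior_1, posterior_2)
```

The second call simply feeds the first posterior back in as the new prior, which is exactly the iterated structure of the formula above.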

Even though we have simplified it a little, the computation is still quite cumbersome, so we look for a different approach: we will consider odds instead of probabilities.

The odds are a measure of the relative probability of two events: if we believe the event A to be twice as likely as the event B, we will say that the odds of A against B, written A: B, are 2:1. If we know the two corresponding probabilities P(A) and P(B), then A: B=\frac{P(A)}{P(B)}, but the power of the odds is in the fact that we do not need to assign absolute probabilities to events to compute their odds. This is also their weakness because, in general, we are not able to recover absolute probabilities from relative ones.
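As a tiny Python sketch (with assumed probabilities): odds are just a ratio, and in the special case of H against H^c, where the two probabilities sum to 1, the ratio can be inverted back into a probability:

```python
# Odds as relative probabilities. The values are assumed for illustration.
p_a, p_b = 0.5, 0.25
odds_a_b = p_a / p_b   # A is twice as likely as B, so A:B = 2:1
print(odds_a_b)        # 2.0

# For H against H^c, P(H) + P(H^c) = 1 lets us recover the probability:
def odds_to_probability(odds):
    """P(H) from the odds H : H^c."""
    return odds / (1 + odds)

print(odds_to_probability(odds_a_b))  # odds 2:1 correspond to P(H) = 2/3
```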

We are now going to state Bayes’ theorem once more, in this new perspective.

Theorem (Bayes, odds form). Given three events A, B and C, with nonzero probability, the following identity holds:

    \[\frac{P(A|C)}{P(B|C)}=\frac{P(C|A)}{P(C|B)}\cdot \frac{P(A)}{P(B)}.\]

The proof of this result follows immediately from Bayes’ theorem in probability form. What is interesting here is that we are updating the odds: starting from the odds a priori, \frac{P(A)}{P(B)}, we obtain the odds a posteriori, after seeing C: \frac{P(A|C)}{P(B|C)}. In particular, the term P(C) that appeared previously now cancels out and disappears.

If we look at this result in terms of hypotheses and experiments, we can compare the relative probabilities (the odds) of two assumptions, without giving too much thought to all the possible hypotheses and their (non-zero) a priori probabilities. We will focus now on the very special case in which we have just H and its complement H^c against one another. This is special in particular because it is the case in which we are always able to recover the absolute probabilities (thanks to the fact that their sum is 1). We have

    \[\frac{P(H|E)}{P(H^c|E)} = \frac{P(E|H)}{P(E|H^c)}\cdot \frac{P(H)}{P(H^c)},\]

and if we run more than one experiment (again, assuming that E_1 and E_2 are conditionally independent given H and given H^c),

    \begin{align*}\frac{P(H|E_1, E_2)}{P(H^c|E_1,E_2)} & = \frac{P(E_2|H)}{P(E_2|H^c)}\cdot \frac{P(H|E_1)}{P(H^c|E_1)}\\& = \frac{P(E_2|H)}{P(E_2|H^c)}\cdot \frac{P(E_1|H)}{P(E_1|H^c)}\cdot\frac{P(H)}{P(H^c)},\end{align*}

which is much simpler than the form we had before. Notice in particular that if the experiments are just repetitions of the same one (for example, if we keep asking students in the class), the terms \frac{P(E_i|H)}{P(E_i|H^c)} are the same for all i.
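In code, the odds-form update is one multiplication per experiment. A minimal Python sketch, with the likelihoods again being illustrative assumptions:

```python
# Odds-form update: multiply the prior odds by one likelihood ratio
# per experiment. The probabilities are illustrative assumptions.
p_e_given_h, p_e_given_hc = 0.8, 0.3
likelihood_ratio = p_e_given_h / p_e_given_hc   # constant across repetitions

posterior_odds = 1.0          # prior odds P(H)/P(H^c), starting balanced
for _ in range(2):            # two "yes" answers
    posterior_odds *= likelihood_ratio

# H and H^c are complementary, so the odds convert back to a probability:
print(posterior_odds / (1 + posterior_odds))
```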

This means that if we keep repeating the same experiment (with a binary outcome) to test our hypothesis H, we just need to evaluate the prior odds and the likelihood ratios

    \[\frac{P(E_i|H)}{P(E_i|H^c)},\quad \frac{P(E_i^c|H)}{P(E_i^c|H^c)} ,\]

that come up in the consecutive updates.
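With both ratios in hand, a whole sequence of yes/no answers becomes a chain of multiplications. A small Python sketch, with assumed likelihoods and an assumed sequence of answers:

```python
# Repeated experiment with two likelihood ratios: one for a "yes" (E_i),
# one for a "no" (E_i^c). All probabilities are illustrative assumptions.
p_yes_given_h, p_yes_given_hc = 0.8, 0.3

ratio_yes = p_yes_given_h / p_yes_given_hc              # P(E|H) / P(E|H^c)
ratio_no = (1 - p_yes_given_h) / (1 - p_yes_given_hc)   # P(E^c|H) / P(E^c|H^c)

odds = 1.0  # balanced prior odds P(H) / P(H^c)
for answer in ["yes", "yes", "no", "yes"]:
    odds *= ratio_yes if answer == "yes" else ratio_no

print(odds, odds / (1 + odds))
```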

The last step to take is to get rid of the product because we humans prefer addition to multiplication. To do so we use a long-known trick: the logarithm. Taking the logarithm of the odds gives us the log-odds and an even nicer form of Bayes’ theorem.

Theorem (Bayes, log-odds form). Given three events A, B and C, with nonzero probability, the following identity holds:

    \[\log\left(\frac{P(A|C)}{P(B|C)}\right)=\log\left(\frac{P(C|A)}{P(C|B)}\right)+ \log\left(\frac{P(A)}{P(B)}\right).\]

This is true for any (reasonable) base of the logarithm, but of particular interest are the natural logarithm (base e), the decimal logarithm (base 10), and the binary logarithm (base 2). On top of transforming products into sums, thus making the theorem easier to use, in particular for quick computations paired with Fermi estimates, these two manipulations (odds and their logarithms) transform the usual [0,1] interval of probabilities into something much nicer.
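In log form, the whole update is a sum. A quick base-2 sketch (the likelihoods are illustrative assumptions):

```python
from math import log2

# Bayes in log-odds form: evidence adds up.
# The probabilities are illustrative assumptions.
p_e_given_h, p_e_given_hc = 0.8, 0.3

prior_log_odds = log2(0.5 / 0.5)                 # 0 bits: balanced prior
bits_per_yes = log2(p_e_given_h / p_e_given_hc)  # evidence of one "yes"

posterior_log_odds = prior_log_odds + 2 * bits_per_yes  # two "yes" answers

# Undo both transformations: log-odds -> odds -> probability.
posterior_odds = 2 ** posterior_log_odds
print(posterior_odds / (1 + posterior_odds))
```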

When we moved from probabilities to odds, we mapped the [0,1] interval to the positive half-line, making it clear that the extreme values are not accessible. However, this transformation is still not completely satisfactory: the values in (0,1), corresponding to (0,\frac{1}{2}) in the usual interval, are quite cramped in comparison to the half-line (1,+\infty), the image of (\frac{1}{2}, 1). Taking the logarithm makes the whole thing symmetric: we are now working on the whole real line, from -\infty to +\infty, and depending on the base we have chosen, we have nice representations of the odds and the probability (again, considering H against H^c):

[Figure: the new representation of probabilities, using logarithms in base 2.]
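The two maps (probability to log-odds and back) can be written down explicitly; here is a small sketch in base 2:

```python
from math import log2

def log_odds(p):
    """Map a probability in (0, 1) to the real line: log_2 of H : H^c."""
    return log2(p / (1 - p))

def probability(log_odds_value):
    """Inverse map: base-2 log-odds back to a probability."""
    odds = 2 ** log_odds_value
    return odds / (1 + odds)

# Symmetry around 1/2: complementary probabilities get opposite log-odds.
print(log_odds(0.8), log_odds(0.2))  # equal magnitude, opposite signs
print(probability(0.0))              # 0.5 sits exactly at 0
```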

Looking at this from Bayes’ perspective, we can think of the evidence we are accumulating, that is, the logarithm of the likelihood ratio \log\left(\frac{P(E|H)}{P(E|H^c)}\right), as measured in bits (or decibels if in base 10, or nats, sometimes called nits, if in base e). This is a first glimpse of the theory of information.
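The three units differ only by a constant factor, so converting between them is a one-liner (the likelihood ratio below is an assumed value):

```python
from math import log, log2, log10

# Evidence of a single observation, measured in the three units.
likelihood_ratio = 8 / 3   # assumed value of P(E|H) / P(E|H^c)

bits = log2(likelihood_ratio)            # base 2
decibels = 10 * log10(likelihood_ratio)  # base 10, scaled by 10
nats = log(likelihood_ratio)             # base e

print(bits, decibels, nats)
# 1 bit = 10*log10(2) ≈ 3.01 decibels = ln(2) ≈ 0.693 nats.
```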

There are still several things to discuss, regarding this new version of Bayes’ theorem, in particular with regard to this new representation. However, we will only mention them briefly here:

  • Even if the amount of information gained by seeing a certain result of an experiment is always the same (say, 2 bits), the influence it has on our overall belief depends on where we start. In particular, the marginal returns keep shrinking: evidence that confirms what we already believe is less and less valuable.
  • The further we are from 0 (the point where we are completely balanced between the two alternatives), the smaller the change any single piece of evidence produces: moving one bit to the right from 4 adds about 3% to our belief, moving one bit to the left from 4 removes about 5%, while moving one bit left (or right) from 0 removes (or adds) about 17%.
  • A positive and a negative result of an experiment do not necessarily cancel out, since the two terms are

        \[\log\left(\frac{P(E|H)}{P(E|H^c)}\right),\ \log\left(\frac{P(E^c|H)}{P(E^c|H^c)}\right).\]

    This was already easily seen with just the odds.
  • Going back to a previous remark: we are not allowed to assign infinite log-odds, but we can sit as far out towards infinity as we like. Combining this with the shrinking weight of evidence far from 0, we see that in order to neutralise an extreme belief we are going to need a lot of evidence against it.
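The percentages in the list above are easy to check; a small Python sketch in base 2:

```python
# How much one bit of evidence moves the probability, depending on the
# starting log-odds (base 2).
def probability(log_odds):
    odds = 2 ** log_odds
    return odds / (1 + odds)

print(probability(5) - probability(4))  # about +0.03: one bit right of 4
print(probability(3) - probability(4))  # about -0.05: one bit left of 4
print(probability(1) - probability(0))  # about +0.17: one bit right of 0
```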

This particular take on Bayes’ theorem is a staple of the so-called “rationality community”. There are several presentations of this subtler reading of Bayes’ theorem, in particular in log-odds form. One good place to start is Arbital, an unfortunately failed experiment that stemmed from LessWrong.

There is still a lot to say on Bayes’ theorem and Bayesian updating. Next time, however, we will move on to Shannon’s theory of communication.
