Alex Smola - Log-probabilities, semirings and floating point numbers

Here’s a trick/bug that is a) really well known in the research community, b) lots of beginners get it wrong nonetheless, c) simple unit tests may not detect it and d) it may be a really fatal bug in your code. You can even find it in large scale machine learning toolboxes.

Suppose you want to do mixture clustering and you happily compute \(p(x|y)\) and \(p(y)\), say by a mixture of Gaussians. Quite often you’ll need access to \(p(y|x)\) which can be computed via

\[p(y|x) = \frac{p(y,x)}{p(x)} = \frac{p(y,x)}{\sum_{y} p(y,x)}\]

Let’s assume that your code runs well in 2D but now you try it in 100 dimensions and it fails with a division by zero error. What’s going on? After all, your code is algebraically correct. Most likely you ignored the fact that floating point numbers have only a fixed precision and by exponentiating the argument of the Gaussian (recall, we have a normalization that is exponential in the number of dimensions) you encountered numerical underflow. What happened is that you were trying to store the relevant information in the exponent rather than the mantissa of a floating point number. On single precision you have 8 bit for the exponent and 23 for the mantissa and you just ran out of storage (the above image is courtesy of Wikipedia).

Here’s how you can pull things back into the mantissa - simply store log probabilities and operate with them. Hence instead of + and x you should use ‘log +’ and +.

\[\begin{aligned} \prod_i p_i & \rightarrow \sum_i \log p_i \\ \sum_i p_i & \rightarrow \pi + \log \exp \left(\log p_i - \pi \right) \end{aligned}\]

This holds for all \(\pi \in \mathbb{R}\), and thus in particular for \(\pi = \max_i \log p_i\). If we apply this transformation, we will not get numerical overflow since the largest element in the exponential is \(0\). At worst, we will get numerical underflow whenever \(\log p_i \ll \pi\), or equivalently whenever \(p_i \ll \max_j p_j\). As such, the above expression will never diverge (unless we have a comically large number of entries \(p_i\)). The numerical reason for this is that we moved the operation from the exponent into the mantissa where there’s a lot more space is available to store information.

In more general terms, what happened is that we replaced the operations + and x by two operations log + and + which act just in the same way as addition and multiplication. Aji and McEliece in 2000 formalized this as the Generalized Distributive Law. Systems that satisfy these operations are called commutative semirings. Some examples are:

Set union + and intersection x
The tropical semiring which uses min + and + x
Boolean AND x and OR +
Plain old calculus with addition + and multiplication x
The log + semiring with log + + and + x

Replacing these symbols in well known algorithms such as dynamic programming gives the forward-backward algorithm, shortest path calculations and others, but this is a story for another day.