Random numbers in constant storage

random numbers
hashing
Published June 15, 2012

Many algorithms require random number generators to work. For instance, locality sensitive hashing requires one to compute the random projection matrix \(P\) in order to compute the hashes \(z = P x\). Likewise, fast eigenvalue solvers for large matrices often rely on a random matrix, e.g. the work by Halko, Martinsson and Tropp, SIAM Review 2011, which assumes that at some point we multiply a matrix \(M\) by a matrix \(P\) with Gaussian random entries.

The problem with these methods is that if we want to perform this projection operation in many places, we need to distribute the matrix \(P\) to several machines. This is undesirable since a) it introduces another stage of synchronization between machines and b) it requires space to store the matrix \(P\) in the first place. The latter is often problematic since memory access can be much slower than computation, depending on the access pattern. The prime example is multiplication with a sparse matrix, which requires random memory access.

One way to circumvent this is to share the random seed and then recompute the random matrix from scratch. But this means that we’re critically relying on the implementation of a random number generator. Even worse, we still need to store the entire matrix. What if we could simply access any element of the matrix at will without overhead?

Here’s where hashing comes to the rescue. To motivate things, consider the case where the entries of \(P\) are all drawn from the uniform distribution \(U[0,1]\). For a hash function \(h\) with range \(\{0, \ldots, N-1\}\) simply set \(P[i,j] = h(i,j)/N\). Since hash functions map \((i,j)\) pairs to uniformly distributed, uncorrelated numbers in the range \(\{0, \ldots, N-1\}\), this essentially amounts to uniformly distributed random numbers that can be recomputed on the fly.
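To make this concrete, here is a minimal sketch of how such hashed uniform entries could be computed. It is not from the original post, and the choice of a 32-bit blake2b hash and the names `h` and `uniform_entry` are illustrative assumptions.

```python
# A minimal sketch: derive P[i, j] ~ U[0, 1] from a hash of the index pair,
# so no machine ever needs to store P. blake2b is an arbitrary choice here.
import hashlib
import struct

N = 2 ** 32  # size of the hash range {0, ..., N-1}

def h(*key) -> int:
    """Map an arbitrary index tuple to an integer in {0, ..., N-1}."""
    data = ",".join(map(str, key)).encode()
    return struct.unpack("<I", hashlib.blake2b(data, digest_size=4).digest())[0]

def uniform_entry(i: int, j: int) -> float:
    """P[i, j] ~ U[0, 1], recomputed on the fly instead of being stored."""
    return h(i, j) / N

# Any machine that agrees on the convention recovers the same entry.
print(uniform_entry(3, 7))
```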

A slightly more involved example is how to draw Gaussian random variables. We may e.g. resort to the Box-Muller transform, which shows how to convert two uniformly distributed random numbers into two Gaussians (the image on top is copied from the Wikipedia article). While somewhat wasteful (we consume two uniform numbers per Gaussian rather than one), we simply use two uniform hashes and compute

\[P[i,j] = \sqrt{-2 \log \frac{h(i,j,1)}{N}} \cos\left(2 \pi \frac{h(i,j,2)}{N}\right)\]

Since this is known to generate Gaussian random variables from uniform random variables, this will give us Gaussian-distributed hashes. Similar tricks work for other random variables. This means that things like Random Kitchen Sinks, Locality Sensitive Hashing, and related projection methods never really need to store the ‘random’ projection coefficients whenever memory is at a premium or whenever it would be too costly to synchronize the random numbers.
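Continuing the sketch above (reusing `h` and `N`), the hashed Box-Muller step might look as follows; the small shift that keeps the logarithm finite is an added guard, not part of the formula above.

```python
import math

def gaussian_entry(i: int, j: int) -> float:
    """P[i, j] ~ N(0, 1) from two uniform hashes of (i, j) via Box-Muller."""
    u1 = (h(i, j, 1) + 1) / (N + 1)   # uniform in (0, 1], keeps log() finite
    u2 = h(i, j, 2) / N               # uniform in [0, 1)
    return math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)
```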

Update: recently someone proposed using only zeros and ones in the initialization of a deep network with the express purpose of making things reproducible. The paper has a fair amount of analysis in it, but all of this can be made redundant simply by using hash functions instead. It requires at most sharing a ‘salt’ and an agreed-upon convention for referencing weights in the network (e.g. weight and coordinate). Presto, a deterministic initialization that is reproducible without the need for fancy math.
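As a hypothetical illustration of that convention (again building on the sketches above), the only shared state is a salt string and a naming scheme for each weight; the function name and signature are my own invention.

```python
def init_weight(salt: str, layer: str, i: int, j: int) -> float:
    """Reproducible Gaussian initializer keyed by (salt, layer, i, j)."""
    u1 = (h(salt, layer, i, j, 1) + 1) / (N + 1)
    u2 = h(salt, layer, i, j, 2) / N
    return math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)

# Anyone who shares the salt and naming scheme reproduces exactly the same weights.
w = init_weight("my-experiment", "layer1/weight", 0, 5)
```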