Hashing for Linear Functions

linear function
hashing
Published

March 1, 2011

This is the first of a few posts on hashing. It’s an incredibly powerful technique when working with discrete objects and sequences. And it’s also idiot-proof simple. I decided to post it after listening to a talk this morning where the speaker proposed something really complicated for what should be very simple.

A hash function is (in its ideal CS version) a mapping \(h\) from a domain \(X\) into a set \(\{0 \ldots N-1\}\) of integers such that \(h(x)\) is essentially random. A fast implementation is e.g. Murmur hash. Note that as long as the hash is good it does not matter which one you use (I have received the latter question at least twice from referees in the past. It’s about as meaningful as asking which random number generator was used for sampling). It does matter if you’re designing cryptographic algorithms but for most of machine learning the quality bar is a lot lower. When we want to compute inner products of the form

\[f(x) = \langle \phi(x), w \rangle = \sum_i \phi_i(x) w_i\]

we might run into a big problem: whenever the feature map is too high-dimensional we will not have enough memory to store \(w\). For instance, in the olden times (Eskin, Leslie and Noble, 2002)people would use substrings up to a certain order for protein classification. A second problem is that we might have to store a dictionary for computing the features of \(x\), e.g. when we have a bag of words representation. It is quite likely that the dictionary will take up much more memory than the actual parameter vector \(w\). On top of that, we need to write code to generate the dictionary. Instead of that, let’s be lazy and do the following:

\[f(x) = \sum_{i: \phi(x)_i \neq 0} \phi(x)_i w_{h(i)} \sigma(i)\]

Here \(h\) is a hash function with range small enough to fit \(w\) into memory and \(σ\) is a binary \(\{\pm 1\}\) hash function. This looks pretty simple. It’s essentially the sketch that Charikar, Chen and Farrach-Colton suggested in 2003 for data streams. But it’s also very powerful. In particular one can show the following: