Memory Latency, Hashing, Optimal Golomb Rulers and Feistel Networks

feistel network

July 15, 2011

In many problems involving hashing we want to look up a range of elements from a vector where the elements are indicated by a hash function \(h\). For instance, we might want to evaluate \(v[h(i,j)]\) for arbitrary \(i\) and for a range of \(j \in \{0, \ldots n-1\}\). This happens for matrix multiplication, multiclass classification, collaborative filtering, multitask learning and many related problems.

While this works just fine in terms of estimation performance, traversing all values of \(j\) leads to an algorithm which is horrible in terms of memory access patterns. Modern DRAM chips are much faster (over 10x) when it comes to reading values in sequence rather than when carrying out random reads. Update: memory latency hasn’t improved over the past decade. In fact, DDR5 latency can often be higher than DDR4 latency. Furthermore, random access destroys the benefit of a CPU cache. This leads to algorithms which are efficient in terms of their memory footprint, yet which are slow in terms of their actual runtime behavior. One way to address this is to bound the range of \(h(i,j)\) for different values of \(j\) via one of the following strategies.

Don’t hash the secondary key

Decompose \(h(i,j)=h(i)+j \mathop{\mathrm{mod}} N\). This is computationally very cheap, it has excellent sequential access properties but it leads to horrible collisions should there ever be two \(i\) and \(i'\) for which \(|h(i)−h(i')| \leq n\).

Small secondary hash

Decompose \(h(i,j)=h(i)+h'(j)\) where \(h'(j)\) has a small range of values. This leads to less catastrophic collisions for near-collisions \(|h(i) - h(i')| \leq n\). Nonetheless it is a bad idea since now we have a nontrivial probability of collision as soon as the range of \(h'(j)\) is less than \(n^2\) due to the birthday paradox. Moreover, for adjacent values \(h(i)\) and \(h(i')\) we will get many collisions.

Optimal Golomb ruler

Decompose \(h(i,j)=h(i)+g(j)\) where \(g(j)\) is an Optimal Golomb ruler. The latter is an increasing sequence of integers for which any pairwise distance occurs exactly once. In other words, the condition \(g(a)−g(b)=g(c)−g(d)\) implies that \(a=c\) and \(b=d\). For a more intuitive definition consider the conference room in the diagram above. Any room size can only be obtained in one manner. John Langford proposed this to address the problem. In fact, it solves the collision problem perfectly since there are a) no collisions for a fixed \(i\) and b) for neighboring values \(h(i)\) and \(h(i')\) we will get at most one collision (due to the Golomb ruler property). Alas, this only works up to \(n=28\) since finding an Optimal Golomb Ruler is hard. While it is curently unknown whether finding such rulers is actually NP hard, only \(n=27\) and \(n=28\) were discovered in the past decade.

Cryptographic hash

An alternative that works for larger \(n\) and that is sufficiently simple to compute is to use cryptography. After all, all we want is that the hash function \(h'(j)\) has small range and that it doesn’t have any self collisions or any systematic collisions. We can achieve this by encrypting \(j\) using the key \(i\) to generate an encrypted message of \(N\) possible values. In other words we use \(h(i,j)= h(i) + \mathop{\mathrm{crypt}}(j|i,N)\).

Since it is an encryption of \(j\), the mapping is invertible and we won’t have collisions for a given value of \(j\). Furthermore, for different \(i\) the encodings will be uncorrelated (after all, \(i\) is the key). Finally, we can control the range \(N>n\) simply by choosing the encryption algorithm. In this case the random memory access is of bounded range, hence the CPU cache will not suffer from too many misses. A particularly nice algorithm is the [Feist]el cipher]( which works as follows: define the iterative map

\[f(x,y)= (y,x \mathop{\mathrm{XOR}} h(y))\]

As always, \(h\) is a hash function. After 4 iterations \((x,y) \rightarrow f(x,y)\) we obtain an encoding of \((x,y)\) that is cryptographically hard. Now use \(x=i\) and \(y=j\) to obtain the desired result. This encoding can be effective whenever computation is a lot faster than memory latency.

PS: The title image is by Cmglee (CC BY-SA 3.0). It depicts a conference room with 10 different configurations, based on an Optimal Golomb ruler \([0, 2, 7, 8, 11]\).