Distributed synchronization with the distributed star

Tags: distributed synchronization, hashing

Published August 5, 2011

Here’s a simple synchronization paradigm between many computers that scales with the number of machines involved and keeps the cost per machine essentially constant, i.e. \(O(1)\) in the number of machines. For lack of a better name I’m going to call it the distributed star, since this is what the communication pattern looks like. It’s quite similar to how memcached stores its (key, value) pairs.

Assume you have \(n\) computers, each of which has a copy of a large parameter vector \(w\) (typically several GB), and we would like to keep these copies approximately synchronized.

A simple version would be to pause the computers occasionally, have them send their copies to a central node, and then hand back a consensus value. Unfortunately this takes \(O(|w| \log n)\) time if we aggregate on a tree, since the full vector must traverse each of the \(\log n\) levels (we can reduce this by streaming the data through, but that makes the code considerably trickier). Furthermore, we need to stop processing while we synchronize. The latter may not even be possible, and any local computation is likely to benefit from having the most up-to-date parameters.
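To see where the \(\log n\) comes from, here is a toy sketch of that tree-aggregation baseline (the plain-average consensus and all names are my own illustration, not the actual system): every level of the reduction tree moves the full vector once.

```python
import numpy as np

def tree_consensus(copies):
    """Pairwise tree reduction: merge the copies level by level, then
    average. The full vector crosses each of the ceil(log2 n) levels,
    which is where the O(|w| log n) cost comes from."""
    n = len(copies)
    while len(copies) > 1:
        merged = [copies[i] + copies[i + 1]           # pair up and combine
                  for i in range(0, len(copies) - 1, 2)]
        if len(copies) % 2:                           # odd machine out passes through
            merged.append(copies[-1])
        copies = merged
    return copies[0] / n                              # consensus = plain average

# e.g. 8 machines, each holding a slightly drifted copy of a 5-dim w
copies = [np.ones(5) + 0.01 * np.random.randn(5) for _ in range(8)]
w_consensus = tree_consensus(copies)
```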

Instead, we use the following: assume that we can break the parameter vector up into smaller (key, value) pairs that need synchronizing. We now have each computer send its local changes for each key to a central server, update the parameters there, and later receive information about the global changes. So far this algorithm looks stupid: after all, with \(n\) machines it would require \(O(|w| n)\) time to process, since the central server is the bottleneck. This is where the distributed star comes in. Instead of keeping all the data on a single server, we use the well-known distributed hashing trick and send each key to a machine \(m\) from a pool \(P\) of servers:

\[m(\mathrm{key},P) = \mathop{\mathrm{argmin}}_{m \in P} ~ h(\mathrm{key},m)\]

Here \(h\) is a hash function. Such a system spreads communication evenly and leads to an \(O(|w| n/|P|)\) load per server. In particular, if we make each of the computers involved in the local computation also a member of the pool, i.e. if we have \(n = |P|\), we get an \(O(|w|)\) cost per machine for keeping things synchronized, regardless of the number of machines involved.
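This is essentially rendezvous (highest-random-weight) hashing. A minimal sketch of the routing rule, where the md5-based \(h\) and the server names are stand-ins of mine:

```python
import hashlib

def h(key, server):
    """Hash of the (key, server) pair; md5 is an arbitrary stand-in."""
    return int.from_bytes(
        hashlib.md5(f"{key}:{server}".encode()).digest()[:8], "big")

def owner(key, pool):
    """The server responsible for `key`: the argmin of h over the pool,
    exactly as in the formula above."""
    return min(pool, key=lambda server: h(key, server))

pool = ["srv0", "srv1", "srv2", "srv3"]
print(owner("w[1024:2048]", pool))  # the same key always routes to the same server
```

Since the rule is deterministic, every machine can compute the owner of any key locally; there is no directory to keep in sync.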

Obvious simplifying assumptions: all machines are on the same switch, and the time to open a TCP/IP connection (we keep connections open after the first message) is negligible relative to the work of transmitting the data.
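Putting the pieces together under these assumptions, here is a minimal in-process sketch of one synchronization round; `Server`, `route`, `sync_step`, and the additive update rule are my own illustration (the real system spoke RPC, see below):

```python
import hashlib
from collections import defaultdict

def h(key, server):  # same stand-in hash as in the previous sketch
    return int.from_bytes(
        hashlib.md5(f"{key}:{server}".encode()).digest()[:8], "big")

class Server:
    """Toy in-process stand-in for one member of the pool P; a real
    deployment would talk to it over the network."""
    def __init__(self, name):
        self.name = name
        self.store = defaultdict(float)   # globally shared (key, value) state

    def push(self, key, delta):           # absorb one machine's local change
        self.store[key] += delta

    def pull(self, key):                  # hand back the current global value
        return self.store[key]

def route(key, pool):
    """Owner of `key`: the argmin of h over the pool, as above."""
    return min(pool, key=lambda s: h(key, s.name))

def sync_step(local_deltas, pool):
    """One round on one machine: ship each key's local change to its
    owner, then read the global value back. Each machine touches each of
    its keys once, and hashing spreads the server-side load over |P|."""
    for k, delta in local_deltas.items():
        route(k, pool).push(k, delta)
    return {k: route(k, pool).pull(k) for k in local_deltas}

# With n = |P| every worker doubles as a pool member, so the per-machine
# cost stays O(|w|) no matter how many machines participate.
pool = [Server(f"srv{i}") for i in range(4)]
latest = sync_step({"w[0:1024]": 0.1, "w[1024:2048]": -0.2}, pool)
```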

The reason I’m calling this a distributed star is that for each key we have a star communication topology; we just use a different star for each key. If anyone in systems knows what this thing is really called, I’d greatly appreciate feedback. Memcached uses the same setup; alas, it doesn’t offer versioned writes and callbacks, so we had to build our own system using ICE.

PS: thanks to the Hubble Space Telescope for the image of a star distribution in a bright cluster.