Lazy updates for generic regularization in SGD

Tags: sparsity, optimization

Published: November 12, 2010

Yesterday I wrote about how to do fast stochastic gradient descent updates for quadratic regularization. However, there are many more regularizers one might want to use for sparse updates of high-dimensional parameter vectors. In general the problem looks like this:

\[\mathop{\mathrm{minimize}}_x \sum_i f_i(x) + \Omega[x] \text{ where } \Omega[x] = \sum_j \omega[x_j]\]

Whenever the gradients in stochastic gradient descent are sparse and \(x\) is very high dimensional, it would be very wasteful to use the standard SGD updates, which touch every coordinate of \(x\) at every step. You could use very clever data structures for storing and updating the vectors; for instance, John Duchi et al. wrote a beautiful 2008 paper, Efficient Projections onto the \(\ell_1\)-Ball for Learning in High Dimensions, to address this. Or you could do something really simple …
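
For concreteness, two standard choices of such a separable regularizer would be a quadratic and an \(\ell_1\) penalty (with \(\lambda\) denoting the regularization weight):

\[\omega[x_j] = \frac{\lambda}{2} x_j^2 \qquad \text{or} \qquad \omega[x_j] = \lambda |x_j|.\]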

Novi Quadrianto and I ran into this problem when trying to compute an optimal tiering of webpages where we had a few billion pages that needed allocating over a bunch of storage systems. Each update in this case only dealt with a dozen or so pages, hence our gradients were very very sparse. Turns out, Bob Carpenter had a similar problem (and solution) for LingPipe. So we used this trick:

  1. Instead of storing \(x\), also store a time-stamp when a particular coordinate was addressed last.
  2. In between updates caused by \(f\), all updates of a coordinate \(j\) are simply shrinkage operations that depend only on the value \(x_j\), the regularizer \(\omega\), and the learning rate \(\eta_t\) at time \(t\). They look as follows: \(x_j \leftarrow x_j - \eta_t \omega'[x_j]\), or equivalently \(\Delta x_j/\eta_t = -\omega'[x_j]\).
  3. A naive solution is to update \(x_j\) only whenever the loss function touches it and then ‘replay’ all the updates that would have occurred in the meantime. But this is quite inefficient.
  4. Note that we’re dealing with a difference equation. We can approximate it by a differential equation in which time is rescaled by the learning rate: we use \(\eta[t,t'] := \sum_{i=t+1}^{t'} \eta_i\) to keep track of the accumulated changes. In this case we have \[x_j \leftarrow \zeta(\eta[t,t']) \text{ where } \partial_\eta \zeta(\eta) = -\omega'[\zeta(\eta)].\] Using the initialization \(\zeta(0) = x_j\) solves the problem (see the worked examples after this list).
  5. Now all you need to do is solve the differential equation once and you’re done. If you don’t want to keep the partial sums over \(\eta\) around (this can be costly if the updates are infrequent) you can use the further approximation \(\eta[t,t'] \approx \int_{t}^{t'}\eta_\tau \, d\tau\). This is easy to compute since we know the learning rate schedule.
  6. This saves additional gigabytes of memory when coordinates are touched only rarely, say every billion steps, since the time-stamp from step 1 is then all you need to keep per coordinate.
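
For instance, with the two regularizers from above the differential equation can be solved in closed form (a quick sketch, writing \(\eta := \eta[t,t']\) for the accumulated learning rate):

\[\omega[x] = \tfrac{\lambda}{2} x^2: \quad \zeta(\eta) = x_j e^{-\lambda \eta} \qquad \qquad \omega[x] = \lambda |x|: \quad \zeta(\eta) = \mathrm{sign}(x_j) \max\left(0, |x_j| - \lambda \eta\right)\]

That is, quadratic regularization becomes an exponential decay of the coordinate, and \(\ell_1\) a linear decay clipped at zero. For, say, a learning rate schedule \(\eta_\tau = \eta_0/\sqrt{\tau}\), the integral in step 5 is simply \(2\eta_0 (\sqrt{t'} - \sqrt{t})\).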

This algorithm obviously works for \(\ell_1\) regularization. But it works just as well for any other separable regularizer you could think of.
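
To make the bookkeeping concrete, here is a minimal Python sketch of the idea, assuming one of the two regularizers above with weight \(\lambda\) and an \(\eta_0/\sqrt{t}\) learning rate schedule (the class and variable names are mine, and it stores the running sum of learning rates explicitly rather than using the integral approximation from step 5):

```python
import math


class LazyRegularizedSGD:
    """Sparse SGD where the regularizer updates are applied lazily.

    For every coordinate we remember the running sum of learning rates at
    the moment it was last touched. When a coordinate reappears in a
    gradient we first apply the closed-form shrinkage zeta over the
    accumulated learning rate, then do the usual gradient step.
    """

    def __init__(self, lam=0.01, eta0=0.1, penalty="l1"):
        self.lam = lam          # regularization weight lambda
        self.eta0 = eta0        # base learning rate eta_0
        self.penalty = penalty  # "l1" or "l2"
        self.x = {}             # sparse parameter vector
        self.seen_at = {}       # cumulative learning rate at last touch
        self.cum_eta = 0.0      # running sum of learning rates
        self.t = 0              # step counter

    def _shrink(self, xj, eta):
        """Closed-form solution zeta(eta) with zeta(0) = xj."""
        if self.penalty == "l2":  # omega(x) = lam/2 * x^2 -> exponential decay
            return xj * math.exp(-self.lam * eta)
        # omega(x) = lam * |x| -> linear decay, clipped at zero
        return math.copysign(max(abs(xj) - self.lam * eta, 0.0), xj)

    def step(self, grad):
        """One SGD step for a sparse gradient given as {coordinate: value}."""
        self.t += 1
        eta_t = self.eta0 / math.sqrt(self.t)  # one common schedule
        self.cum_eta += eta_t
        for j, g in grad.items():
            xj = self.x.get(j, 0.0)
            # Replay all deferred shrinkage steps in one go.
            pending = self.cum_eta - self.seen_at.get(j, self.cum_eta)
            xj = self._shrink(xj, pending)
            # Ordinary gradient step on the loss term f_i.
            self.x[j] = xj - eta_t * g
            self.seen_at[j] = self.cum_eta

    def value(self, j):
        """Current value of coordinate j, with pending shrinkage applied."""
        xj = self.x.get(j, 0.0)
        pending = self.cum_eta - self.seen_at.get(j, self.cum_eta)
        return self._shrink(xj, pending)


# Only the touched coordinates cost any work per step.
opt = LazyRegularizedSGD(lam=0.001, penalty="l1")
opt.step({3: 0.7, 10: -1.2})
opt.step({10: 0.4, 42: 2.0})
print(opt.value(3), opt.value(10), opt.value(42))
```

Here the per-coordinate running sum of learning rates plays the role of the time-stamp from step 1; if that extra float per coordinate is too much, store the plain time-stamp instead and recover the accumulated learning rate via the integral approximation from step 5.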