Fast quadratic regularization for online learning

optimization
acceleration
regularization
trick
Published

January 15, 2010

After a few discussions within Yahoo I’ve decided to post a bunch of tricks here. A lot of these are well known. Some others might be new. They’re small hacks, too small to put into a paper on their own, unless we allow 1 page papers. But they can make a big difference to make your code go fast or do useful things. So here it goes:

Assume you want to train a model, parametrized by $$w$$ to minimize the objective

$L(w) = \sum_{i=1}^n l_i(w) + \frac{\lambda}{2}\|w^2\|$

In this case you could perform stochastic gradient updates of the form

$w \leftarrow w - \eta \left(\partial_w l_i(w) + \lambda w\right) = (1 - \eta \lambda) w - \eta \partial_w l_i(w)$

The problem here is that if the gradients are sparse, say 1000 features for a document while $$w$$ may have millions, maybe even billions of nonzero coefficients, this update is extremely slow: for each meaningful coefficient update we also update many more coefficients (up to billions) that just shrink summarily. An easy fix is to keep a coefficient, say $$c \in \mathbb{R}$$, which tracks the scale separately via $$w = c \bar{w}$$. This yields:

\begin{aligned}c \bar{w} \leftarrow & (1 - \eta \lambda) c \bar{w} - \eta \partial_{\bar{w}} l_i(c \bar{w}) \\ = & \underbrace{(1 - \eta \lambda)c}_{\mathrm{scalar~update}} \left(\bar{w} - \frac{\eta}{1 - \eta \lambda c} \partial_{\bar{w}} l_i(c \bar{w}) \right) \end{aligned}

In other words, we update the scaling coefficient $$c \leftarrow (1 - \eta \lambda)c$$ and perform only sparse updates on $$w$$. As far as I recall this trick can be found e.g. in Leon Bottou’s stochastic gradient solver, or in Shai Shalev-Schwartz’s Pegasos package. We also used this in the implementation of Kivinen, Williamson & Smola, 1998 on stochastic gradient descent for kernels.