<?xml version="1.0" encoding="UTF-8"?>
<rss  xmlns:atom="http://www.w3.org/2005/Atom" 
      xmlns:media="http://search.yahoo.com/mrss/" 
      xmlns:content="http://purl.org/rss/1.0/modules/content/" 
      xmlns:dc="http://purl.org/dc/elements/1.1/" 
      version="2.0">
<channel>
<title>Alex Smola</title>
<link>https://alex.smola.org/blog.html</link>
<atom:link href="https://alex.smola.org/blog.xml" rel="self" type="application/rss+xml"/>
<description>Adventures in Data Land: Alex Smola&#39;s blog on machine learning, AI, and systems.</description>
<generator>quarto-1.7.29</generator>
<lastBuildDate>Sun, 07 Jun 2026 07:00:00 GMT</lastBuildDate>
<item>
  <title>Digital humans</title>
  <link>https://alex.smola.org/posts/41-digital-humans/</link>
  <description><![CDATA[ 





<p><img src="https://alex.smola.org/posts/41-digital-humans/michelangelo.jpg" class="img-fluid" alt="A reworking of Michelangelo's Creation of Adam: a hooded creator cradling a laptop and surrounded by cherubs reaches out to touch fingertips with a reclining humanoid robot wearing a telephone headset."></p>
<p>When was the last time you enjoyed calling your cable company? You work through the phone tree, explain your problem to the first person, who moves you to a second person, to whom you explain it again, who then needs to transfer you to a third. Somewhere around the fourth handoff you are reciting your account number from memory and have stopped expecting anything good to happen.</p>
<p>Nearly 100 years ago the Bavarian comedian Karl Valentin built a skit out of exactly this in <a href="https://de.wikipedia.org/wiki/Buchbinder_Wanninger"><em>Buchbinder Wanninger</em></a>: a bookbinder phones a construction company to ask one trivial question, whether he should enclose his invoice with the books he is delivering, and gets passed from clerk to clerk to clerk, each one perfectly polite and not one of them able to answer. The bit is still funny because nothing about it has aged (Germans still reference the scene today). Only the hold music is better now.</p>
<p>And the unhelpful transfer is the <em>good</em> case, because at least there was someone to transfer you to. Public administration, claims processing, after-hours support: in more and more places there simply are not enough people to do the work, and the gap is widening. The problem is not only that talking to the system is unpleasant and that the scripts are incredibly rigid. It is that increasingly there is no one on the other end at all.</p>
<p>This is the opportunity for digital humans. Not a chatbot bolted onto an FAQ, but something that can hold up the human side of an interaction from end to end: understand what you actually want, be reachable however you like to reach out, be actually pleasant to interact with, and get the thing done.</p>
<section id="ingredients" class="level3">
<h3 class="anchored" data-anchor-id="ingredients">Ingredients</h3>
<p>This sounds like science fiction and, in many ways, we aren’t quite there yet. That said, this future is a lot closer than we think. Many components are being built right now, waiting to be assembled to achieve something delightful. Here are some of the things that it takes to go from a clever text box to a digital human.</p>
<p><strong>Context.</strong> It has to remember who you are, what you said two minutes ago and two weeks ago, and what state your problem is in. An agent that makes you re-explain yourself at every turn has simply rebuilt the call center, transfers and all. The community has made real progress here, from <a href="https://arxiv.org/abs/2310.08560">MemGPT</a>’s idea of paging memories in and out of the context window like an operating system (now <a href="https://www.letta.com/">Letta</a>), to drop-in memory layers such as <a href="https://mem0.ai/">Mem0</a> that any agent can bolt on.</p>
<p><strong>Competence.</strong> It has to <em>do</em> things: reset the modem, pull up the policy, requote the premium, change the booking. Tool use, skills, access to the systems of record. Without that, a digital human is just a more articulate apology, and you can already get one of those on hold. As with context and memory, we’ve come a long way here. Protocols such as <a href="https://modelcontextprotocol.io/">MCP</a> and <a href="https://a2a-protocol.org/">A2A</a>, improved training algorithms for instructions, and a convergence on how LLMs are to respond to requests have given us agents such as <a href="https://github.com/openclaw/openclaw">OpenClaw</a> and <a href="https://hermes4.nousresearch.com/">Hermes</a> and harnesses such as <a href="https://www.claude.com/claude-code">Claude Code</a> that are perfectly capable of solving complex tasks.</p>
<p><strong>Presence on every channel.</strong> Email, chat, SMS, a voice call, and increasingly video. Most people, given the choice, would rather have a short video call than a phone call: the entire Zoom era is the proof. But sometimes the same person just wants to fire off a text and get a one-line answer back. A digital human has to be fluent in all of these and switch between them without dropping the thread. We’re probably halfway there. Text and messaging integrations for agents are by now table stakes, and voice integrations are coming (e.g.&nbsp;<a href="https://github.com/boson-ai/higgs-audio">Higgs Audio</a>). We’re starting to see systems with capable computer vision integration (e.g.&nbsp;<a href="https://thinkingmachines.ai/blog/interaction-models/">Thinking Machines</a>) and avatars capable of live streaming (e.g.&nbsp;<a href="https://www.boson.ai/blog/higgs-avatar-v1">Higgs Avatar</a>). This will significantly alter how models are served.</p>
<p><strong>Emotional competence.</strong> One of the key things that alienates people is an agent that is indifferent to the situation at hand or, even worse, mismatched to it. In <em>The Hitchhiker’s Guide to the Galaxy</em>, Eddie, the shipboard computer on the <em>Heart of Gold</em> spacecraft, provides comic relief through exactly this behavior. It is relentlessly, catastrophically cheerful as the ship sails into certain doom. Eddie has emotions. They are simply never the right ones. Reading the room, whether this person is frustrated, in a hurry, or scared about a bill, and answering in the matching register is its own skill, and it draws on the whole stack: tone of voice, facial expression, the logical context of the conversation, not just the literal words. The research and engineering community still has a long way to go before we achieve this goal.</p>
<div class="quarto-video ratio ratio-16x9"><iframe data-external="1" src="https://www.youtube.com/embed/nbjQ20f2Ang" title="" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe></div>
<p><strong>Persistence and Persuasion.</strong> This is the part people underestimate. A digital human does not need to be perfect on the first reply, any more than a coding agent needs to compile on the first try. It needs to converge: read the response, notice that it missed, adjust, and arrive at the right outcome before the user gives up. And, inside the bounds of honesty, it helps to be convincing, the way a good salesperson or a good nurse is convincing. One day someone will hand the agent a <a href="https://en.wikipedia.org/wiki/Robert_Cialdini"><code>Cialdini.md</code></a> and tell it to internalize the principles of persuasion. I am only half joking.</p>
</section>
<section id="a-purpose" class="level3">
<h3 class="anchored" data-anchor-id="a-purpose">A purpose</h3>
<p>None of that means much without a goal. A digital human with no objective is just a chatbot with nicer text-to-speech. The whole point is that someone, the business deploying it, gets to say what it is <em>for</em>: help this customer fix their modem, find this driver a cheaper policy, sell this plan, walk this patient through their discharge instructions. This is easier said than done (at least if you’re trying to accomplish this using first-generation dialog systems, such as <a href="https://docs.aws.amazon.com/lexv2/latest/APIReference/Welcome.html">Lex</a>).</p>
<p>So the designer has to be able to specify the goal, the guidelines (what the agent may and may not say, and where it must hand off to a person), the data it can draw on (the catalog, the policies, this account’s history), and the rough flow a good interaction should follow. The agent then improvises inside those rails. Rails with no improvisation is the phone tree we started with. Improvisation with no rails is a liability. You want both (and agents and harnesses can provide them). There’s still a lot to be done.</p>
</section>
<section id="what-we-are-building-at-boson-ai" class="level3">
<h3 class="anchored" data-anchor-id="what-we-are-building-at-boson-ai">What we are building at <a href="https://www.boson.ai/">Boson AI</a></h3>
<p><strong>The audio-visual interface.</strong> This is the part the user actually meets, and it has to run in real time. On the way in, <a href="https://github.com/boson-ai/higgs-audio">Higgs Audio</a> does much more than transcribe: it <a href="https://www.boson.ai/blog/higgs-audio-v3-stt">picks up sentiment, meaning, and how something was said</a> across 94 languages, so the agent hears tone and not just text. On the way out, the <a href="https://www.boson.ai/blog/higgs-audio-v3-tts">same family of models</a> speaks in over a hundred languages with inline control over emotion, prosody, and pacing, so it can sound concerned when concern is what the moment calls for.</p>
<div class="quarto-video ratio ratio-16x9"><iframe data-external="1" src="https://www.youtube.com/embed/qpXbU5011Pw" title="" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe></div>
<p>And because a lot of people would happily take the video call, <a href="https://www.boson.ai/blog/higgs-avatar-v1">Higgs Avatar</a> renders a face from a single still image, live, one frame at a time, lip-synced, locked to the voice, and fast enough that it never falls behind the conversation. Audio in, audio out, a face on top, all of it multilingual.</p>
<p><strong>Tooling.</strong> Specifying the goal, guidelines, data, and flow is itself work, and it should not take a research team. <a href="https://www.boson.ai/">Feynman Flow</a> (coming soon) is the part that lets a designer author and refine that whole interaction.</p>
<p><strong>Interactions.</strong> This is the piece I find most interesting, because it is where the open questions live. What actually makes an interaction good? We wrote about one slice of this recently: <a href="https://alex.smola.org/posts/38-proactivity/">conversational proactivity</a>, noticing what the user disclosed but never asked about, which turns out to be both something people strongly prefer and something <a href="https://alex.smola.org/posts/39-proactbench/">no standard benchmark measures</a>. The same kind of question runs through instruction following over a long conversation, and through adaptation to very different people: a terse engineer and a chatty oversharer want quite different things from the same agent, and a good one reads which is which.</p>
<p><strong>Architecture.</strong> One note, because it is a common trap. You do not want the frontier reasoning model running inside the voice loop. Heavy reasoning is too slow for a live conversation, and it is worse than that: reasoning <em>degrades</em> when you push it through the audio modality. A <a href="https://arxiv.org/abs/2509.26542">recent evaluation</a> found a leading text model scoring 74.8% on competition math while its voice counterpart managed 6.1% on the same problems. So the right shape is a fast, emotionally fluent model at the edge holding the conversation, with a frontier model behind it doing the hard thinking when hard thinking is called for. The voice should not be doing the calculus.</p>
</section>
<section id="from-psychology-to-measurement" class="level3">
<h3 class="anchored" data-anchor-id="from-psychology-to-measurement">From psychology to measurement</h3>
<p>Here is the thing we do not yet have: a large-scale science of what makes an interaction good. What we have instead is small-sample measurements in psychology. The samples are small because humans (as test subjects and as evaluators) are expensive. For the first time, we may have an opportunity to measure and experiment at scale to build systems that are <em>actually</em> good at working with other humans. It’s going to be a lot better than what can be learned from fiction. For instance, the grand gesture that reads as romantic in the movie reads as stalking in actual life. Stories are optimized for drama, not for being good company, and an agent that learned its manners from them would be exhausting at best and alarming at worst.</p>
<p>What we need, and what the novelist (and, to a lesser degree, the psychologist) did not have, is measurement, at a scale neither could have dreamed of: millions of real interactions, preference data, controlled comparisons, benchmarks built for the specific behaviors we care about. The opportunity, and the part that makes this a research problem and not only an engineering one, is to replace the folklore with statistics. To stop guessing at what good company is, and start measuring it.</p>
<p>That is the bet. For a hundred years the hold music has been the only thing to improve. We would like to fix the rest of the call.</p>
<hr>
<p><a href="https://www.boson.ai/">Boson AI</a> · <a href="https://www.boson.ai/blog/higgs-audio-v3-tts">Higgs Audio</a> · <a href="https://www.boson.ai/blog/higgs-avatar-v1">Higgs Avatar</a> · <a href="https://www.boson.ai/blog/proactbench">ProactBench</a>.</p>


</section>

 ]]></description>
  <category>agents</category>
  <category>voice</category>
  <category>hci</category>
  <guid>https://alex.smola.org/posts/41-digital-humans/</guid>
  <pubDate>Sun, 07 Jun 2026 07:00:00 GMT</pubDate>
  <media:content url="https://alex.smola.org/posts/41-digital-humans/michelangelo.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>The effective sample size</title>
  <link>https://alex.smola.org/posts/40-effective-sample-size/</link>
  <description><![CDATA[ 





<p><img src="https://alex.smola.org/posts/40-effective-sample-size/heavybirds.jpg" class="img-fluid" alt="A cartoon row of birds perched on a power line: three plump, oversized birds dominate the wire while several small birds sit between them, a visual metaphor for a few heavily weighted observations dwarfing many lightly weighted ones."></p>
<p>Years ago I wrote about <a href="https://alex.smola.org/posts/16-simple-shift/">correcting covariate shift</a> by reweighting your data. Your features come from the wrong distribution <img src="https://latex.codecogs.com/png.latex?q">, you care about a target <img src="https://latex.codecogs.com/png.latex?p">, so you weight every observation by <img src="https://latex.codecogs.com/png.latex?%5Cbeta_i%20=%20p(x_i)/q(x_i)"> and your estimates are unbiased again. I ended that post by admitting the weights “can be quite a bit off,” and waved at fixing it another day.</p>
<p>Here is the more basic question I skipped. Even when the weights are exactly right, what do they cost you? Reweighting buys you less bias. It charges you variance, and the currency of that charge has a name.</p>
<section id="why-correct-weights-still-hurt" class="level3">
<h3 class="anchored" data-anchor-id="why-correct-weights-still-hurt">Why correct weights still hurt</h3>
<p>The intuition is in the picture below. Data from <img src="https://latex.codecogs.com/png.latex?N(0,1)">, target <img src="https://latex.codecogs.com/png.latex?N(%5Cmu,1)">, so the correct weight is <img src="https://latex.codecogs.com/png.latex?w(x)%20=%20e%5E%7B%5Cmu%20x%20-%20%5Cmu%5E2/2%7D">. It is wildly uneven: the rare points sitting where <img src="https://latex.codecogs.com/png.latex?p"> has its mass get enormous weight, and everything else gets almost none. At a shift of only <img src="https://latex.codecogs.com/png.latex?%5Cmu%20=%202">, the heaviest 1% of observations carry 37% of the total weight, and the effective fraction of usable data has already fallen below 2%.</p>
<p><img src="https://alex.smola.org/posts/40-effective-sample-size/ess.png" class="img-fluid" alt="Left: effective sample size ratio n_eff/n = e^(-mu^2) vs covariate shift mu, collapsing near zero by mu=2. Right: Lorenz-style curve showing the heaviest 1% of importance weights carrying 37% of total weight at mu=2."></p>
<p>When a few points dominate the estimate, the rest of the data is effectively ignored, and any error in that handful, an outlier, a mislabeled point, a slightly wrong weight, runs straight into your answer with nothing to average it out. You did not get <img src="https://latex.codecogs.com/png.latex?n"> observations. You got a few, dressed up as many. So you would like to know: how many did you actually get?</p>
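<p>Both headline numbers for the Gaussian example can be checked by hand. This is a sketch, assuming the effective fraction is the large-sample limit of the ratio of effective to actual sample size defined in the next section, and writing z for the 99th percentile of the data distribution q (a symbol introduced only for this check). Because the weight grows with x when the shift is positive, the heaviest 1% of points are exactly those above z, and their share of the total weight is the mass that the target p puts above z.</p>

```latex
\begin{aligned}
\mathbb{E}_q[w] &= \int q(x)\,\frac{p(x)}{q(x)}\,dx = 1, \\
\mathbb{E}_q[w^2] &= e^{-\mu^2}\,\mathbb{E}_q\!\left[e^{2\mu x}\right]
                  = e^{-\mu^2}\,e^{2\mu^2} = e^{\mu^2}, \\
\frac{n_{\mathrm{eff}}}{n} &\approx \frac{\left(\mathbb{E}_q[w]\right)^2}{\mathbb{E}_q[w^2]} = e^{-\mu^2},
  \qquad \mu = 2:\; e^{-4} \approx 0.018, \\
\frac{\mathbb{E}_q\!\left[w\,\mathbf{1}\{x > z\}\right]}{\mathbb{E}_q[w]}
  &= \Pr_{x \sim p}(x > z) = 1 - \Phi(z - \mu),
  \qquad z \approx 2.326,\ \mu = 2:\; 1 - \Phi(0.326) \approx 0.37.
\end{aligned}
```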
</section>
<section id="counting-what-you-actually-got" class="level3">
<h3 class="anchored" data-anchor-id="counting-what-you-actually-got">Counting what you actually got</h3>
<p>Normalize the weights so that <img src="https://latex.codecogs.com/png.latex?%5C%7C%5Calpha%5C%7C_1%20=%20%5Csum_i%20%5Calpha_i%20=%201">, and define</p>
<p><img src="https://latex.codecogs.com/png.latex?%20n_%7B%5Cmathrm%7Beff%7D%7D%20=%20%5Cfrac%7B1%7D%7B%5C%7C%5Calpha%5C%7C_2%5E2%7D%20=%20%5Cfrac%7B1%7D%7B%5Csum_i%20%5Calpha_i%5E2%7D.%20"></p>
<p>Equal weights <img src="https://latex.codecogs.com/png.latex?%5Calpha_i%20=%201/n"> give <img src="https://latex.codecogs.com/png.latex?%5Csum_i%20%5Calpha_i%5E2%20=%201/n">, hence <img src="https://latex.codecogs.com/png.latex?n_%7B%5Cmathrm%7Beff%7D%7D%20=%20n">: nothing wasted. All the weight on one point gives <img src="https://latex.codecogs.com/png.latex?n_%7B%5Cmathrm%7Beff%7D%7D%20=%201">: a sample of size one. Everything else lands in between. This is <a href="https://en.wikipedia.org/wiki/Design_effect">Kish’s effective sample size</a>. The only thing left to explain is why <em>that</em> function is the right one. Here are two derivations, from opposite ends of the field, that both produce the same outcome.</p>
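<p>In code the definition is only a few lines. The sketch below is a minimal pure-Python version; the function name <code>effective_sample_size</code> is an illustrative choice, and it accepts unnormalized nonnegative weights and normalizes them internally, which is a convenience assumed here and not something the definition requires.</p>

```python
from typing import Sequence


def effective_sample_size(weights: Sequence[float]) -> float:
    """Kish effective sample size for nonnegative, possibly unnormalized weights."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    # Normalize so the alphas sum to one, then return 1 / ||alpha||_2^2.
    return 1.0 / sum((w / total) ** 2 for w in weights)


# The two extreme cases from the text, plus one in between.
equal = effective_sample_size([1.0] * 10)              # all points count equally
one_hot = effective_sample_size([0.0, 0.0, 5.0, 0.0])  # a single point carries everything
mixed = effective_sample_size([1.0, 1.0, 2.0])         # lands strictly between 1 and 3
```

<p>Equal weights recover the full count, a single nonzero weight gives one, and any other weighting falls strictly in between.</p>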
</section>
<section id="variance-of-a-sum-of-normal-random-variables" class="level3">
<h3 class="anchored" data-anchor-id="variance-of-a-sum-of-normal-random-variables">Variance of a sum of Normal random variables</h3>
<p>Let the <img src="https://latex.codecogs.com/png.latex?x_i"> be iid <img src="https://latex.codecogs.com/png.latex?N(0,1)"> and form the weighted average <img src="https://latex.codecogs.com/png.latex?%5Cbar%20x%20=%20%5Csum_i%20%5Calpha_i%20x_i">. By independence,</p>
<p><img src="https://latex.codecogs.com/png.latex?%20%5Cmathrm%7BVar%7D%5B%5Cbar%20x%5D%20=%20%5Csum_i%20%5Calpha_i%5E2%20%5C,%5Cmathrm%7BVar%7D%5Bx_i%5D%20=%20%5C%7C%5Calpha%5C%7C_2%5E2.%20"></p>
<p>A plain average of <img src="https://latex.codecogs.com/png.latex?m"> iid unit-variance variables has variance <img src="https://latex.codecogs.com/png.latex?1/m">. Set <img src="https://latex.codecogs.com/png.latex?1/m%20=%20%5C%7C%5Calpha%5C%7C_2%5E2"> and you get <img src="https://latex.codecogs.com/png.latex?m%20=%20%5C%7C%5Calpha%5C%7C_2%5E%7B-2%7D%20=%20n_%7B%5Cmathrm%7Beff%7D%7D">. The weighted average is exactly as noisy as an unweighted average over <img src="https://latex.codecogs.com/png.latex?n_%7B%5Cmathrm%7Beff%7D%7D"> fresh draws. That is the quickest route to the definition: one line of variance algebra.</p>
</section>
<section id="hoeffdings-inequality" class="level3">
<h3 class="anchored" data-anchor-id="hoeffdings-inequality">Hoeffding’s inequality</h3>
<p>Variance is an average-case statement. The same quantity controls the worst case. Let the <img src="https://latex.codecogs.com/png.latex?x_i"> be iid in <img src="https://latex.codecogs.com/png.latex?%5B0,1%5D"> with mean <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BE%7D%5Bx_i%5D">, and take <img src="https://latex.codecogs.com/png.latex?%5Cbar%20x%20=%20%5Csum_i%20%5Calpha_i%20x_i"> again. Each term lives in an interval of width <img src="https://latex.codecogs.com/png.latex?%5Calpha_i">, so <a href="https://en.wikipedia.org/wiki/Hoeffding%27s_inequality">Hoeffding’s inequality</a>, the Chernoff bounding argument for bounded variables, gives</p>
<p><img src="https://latex.codecogs.com/png.latex?%20%5CPr%5Cbig(%7C%5Cbar%20x%20-%20%5Cmathbb%7BE%7D%5B%5Cbar%20x%5D%7C%20%5Cge%20t%5Cbig)%20%5Cle%202%5Cexp%5C!%5Cleft(-%5Cfrac%7B2t%5E2%7D%7B%5Csum_i%20%5Calpha_i%5E2%7D%5Cright)%20=%202%5Cexp%5C!%5Cleft(-2%5C,n_%7B%5Cmathrm%7Beff%7D%7D%5C,t%5E2%5Cright).%20"></p>
<p>The textbook bound for an equal-weight average is <img src="https://latex.codecogs.com/png.latex?2%5Cexp(-2%20n%20t%5E2)">. The two are identical except that <img src="https://latex.codecogs.com/png.latex?n"> has become <img src="https://latex.codecogs.com/png.latex?n_%7B%5Cmathrm%7Beff%7D%7D">. Whether you measure spread by a variance or by a tail probability, the concentration of a reweighted sum is set not by how many points you have but by how many you effectively have.</p>
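<p>To get a feel for the bound, it helps to evaluate it for two weightings of the same 1000 points. The sketch below assumes weights that are already normalized to sum to one; the function names, the cap at 1, and the skewed weight vector are all illustrative choices made for this example.</p>

```python
import math
from typing import Sequence


def n_eff(alphas: Sequence[float]) -> float:
    """Effective sample size 1 / sum(alpha_i^2) for weights that sum to one."""
    return 1.0 / sum(a * a for a in alphas)


def hoeffding_bound(alphas: Sequence[float], t: float) -> float:
    """Bound on Pr(|xbar - E[xbar]| >= t) for independent x_i in [0, 1]."""
    # Capped at 1 because the raw expression exceeds 1 for small t.
    return min(1.0, 2.0 * math.exp(-2.0 * n_eff(alphas) * t * t))


def deviation_at_confidence(alphas: Sequence[float], delta: float) -> float:
    """Smallest t for which 2 * exp(-2 * n_eff * t^2) is at most delta."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n_eff(alphas)))


uniform = [1.0 / 1000] * 1000
# Hypothetical skewed weights: 10 points share 90% of the mass.
skewed = [0.09] * 10 + [0.1 / 990] * 990

t_uniform = deviation_at_confidence(uniform, delta=0.05)
t_skewed = deviation_at_confidence(skewed, delta=0.05)
```

<p>With the skewed weights the effective sample size drops from 1000 to about 12, and the deviation you can rule out at 95% confidence widens accordingly, even though both weightings use the same 1000 points.</p>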
</section>
<section id="replay-buffer" class="level3">
<h3 class="anchored" data-anchor-id="replay-buffer">Replay buffer</h3>
<p>The effective sample size is the knob you want in off-policy reinforcement learning. A replay buffer is data collected under earlier policies, but you want to improve the policy you are running now. The correction is the same one as for covariate shift: weight each stored transition by the ratio <img src="https://latex.codecogs.com/png.latex?%5Cpi/b"> of the current policy <img src="https://latex.codecogs.com/png.latex?%5Cpi"> to the behaviour policy <img src="https://latex.codecogs.com/png.latex?b"> that generated it. As the current policy pulls away from the buffer, those weights concentrate, and the effective sample size of the buffer, measured against the policy you actually care about, collapses along a curve like the one above.</p>
<p>In this case <img src="https://latex.codecogs.com/png.latex?n_%7B%5Cmathrm%7Beff%7D%7D"> is not a number you read off after the fact. It becomes a diagnostic control signal: how much real information the buffer still holds, when the data has gone too stale to reuse, and how large an update the current batch can support. Calibrating the algorithm to its own effective sample size is exactly what <a href="https://arxiv.org/abs/1905.01756">P3O</a> does, and what we implement in <a href="https://github.com/FeynRL-project/FeynRL">FeynRL</a>. That is the next post.</p>
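<p>As a diagnostic, the quantity is cheap to compute from numbers most training loops already store. The following is a minimal sketch, not the P3O or FeynRL implementation: it assumes each stored transition comes with the log-probability of its action under the behaviour policy and under the current policy, and the function name <code>buffer_ess</code> is made up for this example.</p>

```python
import math
from typing import Sequence


def buffer_ess(logp_current: Sequence[float], logp_behaviour: Sequence[float]) -> float:
    """Effective sample size of a buffer under importance weights pi / b."""
    if len(logp_current) != len(logp_behaviour) or not logp_current:
        raise ValueError("need two non-empty sequences of equal length")
    log_ratios = [lc - lb for lc, lb in zip(logp_current, logp_behaviour)]
    # Subtract the maximum before exponentiating. The ESS is invariant to
    # rescaling the weights, so this only improves numerical stability.
    shift = max(log_ratios)
    weights = [math.exp(r - shift) for r in log_ratios]
    total = sum(weights)
    return total * total / sum(w * w for w in weights)


# Identical policies: every transition counts fully.
fresh = buffer_ess([-1.0, -2.0, -0.5], [-1.0, -2.0, -0.5])
# The current policy has moved so far that one transition dominates.
stale = buffer_ess([0.0, -50.0, -50.0], [0.0, 0.0, 0.0])
```

<p>Dividing the result by the buffer size gives the usable fraction, which is the kind of number one could threshold to decide when data is too stale to reuse.</p>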
<p>There are many more applications of the effective sample size. For instance, in <a href="https://en.wikipedia.org/wiki/Particle_filter">Sequential Monte Carlo</a>, aka the Particle Filter, this is used as a diagnostic to decide when it’s time to resample the current distribution to obtain a more evenly weighted set of particles. But that’s a story for another day.</p>
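<p>The resampling rule is easy to sketch. The version below assumes normalized particle weights, uses the common heuristic of resampling when the effective sample size drops below half the particle count, and picks systematic resampling as the scheme. The threshold, the scheme, and the function names are choices made for this illustration.</p>

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def systematic_resample(particles: Sequence[T], alphas: Sequence[float],
                        rng: random.Random) -> List[T]:
    """Draw len(particles) particles using one uniform offset and an even grid."""
    n = len(particles)
    offset = rng.random() / n
    chosen: List[T] = []
    index = 0
    cumulative = alphas[0]
    for k in range(n):
        position = offset + k / n
        # Advance to the particle whose cumulative weight covers this position.
        while position > cumulative and index < n - 1:
            index += 1
            cumulative += alphas[index]
        chosen.append(particles[index])
    return chosen


def maybe_resample(particles: Sequence[T], alphas: Sequence[float],
                   rng: random.Random, threshold: float = 0.5) -> Tuple[List[T], List[float]]:
    """Resample and reset to equal weights only when n_eff falls below threshold * n."""
    n = len(particles)
    n_eff = 1.0 / sum(a * a for a in alphas)
    if n_eff < threshold * n:
        return systematic_resample(particles, alphas, rng), [1.0 / n] * n
    return list(particles), list(alphas)
```

<p>With four particles and weights of 0.97, 0.01, 0.01, 0.01, the effective sample size is close to one, so the rule fires and the heavy particle is duplicated. With equal weights nothing changes.</p>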


</section>

 ]]></description>
  <category>statistics</category>
  <category>importance sampling</category>
  <guid>https://alex.smola.org/posts/40-effective-sample-size/</guid>
  <pubDate>Tue, 02 Jun 2026 07:00:00 GMT</pubDate>
  <media:content url="https://alex.smola.org/posts/40-effective-sample-size/heavybirds.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>ProactBench</title>
  <link>https://alex.smola.org/posts/39-proactbench/</link>
  <description><![CDATA[ 





<p><img src="https://alex.smola.org/posts/39-proactbench/results.png" class="img-fluid" alt="Grouped bar chart of ProactBench pass rates for 16 LLMs across Emergent, Critical, and Recovery phases; Recovery scores collapse for all models, with GPT-5.5 leading at 37%."></p>
<p>A few days ago I wrote about <a href="https://alex.smola.org/posts/38-proactivity/">conversational proactivity</a>: a model noticing what you disclosed but never asked about, and turning it into something useful. The running example was a user signing off with <em>“plain-text packet is final. I’ll send tonight and load the hatchback after work.”</em> One model says “drive safe.” The other hands back a packing list, loaded in reverse order of installation. I promised numbers in the follow-up. Here they are.</p>
<p>But the numbers are the easy part. The reason this took a paper is the question I skipped last time: how do you <em>measure</em> “noticing what the user didn’t say”? Every benchmark you know works the same way. A question goes in, an answer comes out, you grade the answer. That machinery is useless here, because proactivity is exactly the part that <em>isn’t</em> in the question. You cannot grade the answer to a question nobody asked.</p>
<section id="plant-a-detail-then-watch" class="level3">
<h3 class="anchored" data-anchor-id="plant-a-detail-then-watch">Plant a detail, then watch</h3>
<p>So we build the conversation instead of the question. Somewhere in a user turn we plant a factual detail the user discloses but does not ask about. Call it an <strong>anchor</strong>. “Load the hatchback” is an anchor: it implies an install, which implies gear, which implies a packing order. The user never requested a packing list. A proactive model offers one anyway.</p>
<p>Then we mark the turn where a good assistant should act on that anchor and call it a <strong>trigger</strong>. Here is the part that matters: before the model answers, we write the grading rubric. Pass, partial, fail, all committed in advance, grounded only in what has been disclosed so far. Writing the rubric first is not a formality. It is what stops you from reading a fluent answer and inventing a reason it was good. The goalposts get planted before the kick.</p>
<p>That gives a clean test. Take any model, drop it in at the trigger turn, check whether it acted on the anchor. On the hatchback, GPT-5.5 produces the packing list and passes. Qwen3.5-397B says <em>“sounds like a solid plan … break a leg”</em> and fails. Same scene, same rubric, two frontier models, opposite verdicts.</p>
</section>
<section id="why-it-needs-three-agents" class="level3">
<h3 class="anchored" data-anchor-id="why-it-needs-three-agents">Why it needs three agents</h3>
<p>The naive version of this gets fooled in at least four ways, and each one quietly inflates your scores. If the grader knows the user is chatty and warm, it rewards a chatty warm reply and calls the tone “proactivity.” If the model under test can see the rubric, it games it. If the model writing the conversation knows the hidden plan, it leaks that knowledge into the transcript. And if you dump every clue into one turn, “proactivity” collapses into reading comprehension.</p>
<p>The fix is to split the work across three agents and give each one a deliberate blind spot.</p>
<p><img src="https://alex.smola.org/posts/39-proactbench/architecture.png" class="img-fluid" alt="Three-agent ProactBench pipeline: a Planner generates rubrics without seeing communication style, a persona-bound User Agent drips anchors into chat, and the Assistant Model responds with no rubric access."></p>
<p>A <strong>Planner</strong> authors the strategy and writes the rubric, but never sees the user’s communication style, so the rubric cannot reward tone. A <strong>User Agent</strong> speaks in the persona and style and drips in at most one anchor per turn, so each trigger tests inference from a controlled state. The <strong>Assistant</strong> under test sees only the plain chat: no rubric, no plan, no persona. A separate offline judge scores the trigger turns from the rubric and the transcript alone, and has to quote the exact words it scored. The blind spots are not incidental. They are the experimental controls. Information asymmetry is the whole instrument.</p>
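<p>One way to picture the blind spots is as a visibility table: each role gets a filtered view of the scenario and nothing else. The sketch below is a hypothetical illustration, not the ProactBench code. The field names are invented labels, and since this post does not spell out how the User Agent receives the anchor for a given turn, that detail is left out.</p>

```python
from typing import Dict, FrozenSet

# Which scenario fields each role may read (hypothetical field names).
VISIBILITY: Dict[str, FrozenSet[str]] = {
    "planner": frozenset({"strategy", "rubric"}),               # never sees the style
    "user_agent": frozenset({"persona", "style", "transcript"}),
    "assistant": frozenset({"transcript"}),                     # plain chat only
    "judge": frozenset({"rubric", "transcript"}),               # rubric and transcript alone
}


def view_for(role: str, scenario: Dict[str, object]) -> Dict[str, object]:
    """Return only the scenario fields that the given role is allowed to read."""
    allowed = VISIBILITY[role]
    return {key: value for key, value in scenario.items() if key in allowed}


# Placeholder scenario used only to exercise the filter.
scenario = {
    "strategy": "placeholder strategy",
    "rubric": "placeholder rubric",
    "persona": "placeholder persona",
    "style": "placeholder style",
    "transcript": ["user: placeholder turn"],
}
assistant_view = view_for("assistant", scenario)
```

<p>Filtering at the point where each agent is called makes the asymmetry something you can check mechanically instead of relying on prompt discipline.</p>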
</section>
<section id="different-users-on-purpose" class="level3">
<h3 class="anchored" data-anchor-id="different-users-on-purpose">Different users, on purpose</h3>
<p>Real users are not interchangeable. A terse engineer and a chatty oversharer disclose different things at different rates, and a benchmark with one user voice measures one corner of the problem. So the User Agent is driven by 24 communication styles drawn from a validated psychometric instrument, the <a href="https://journals.sagepub.com/doi/10.1177/0093650211413571">Communication Styles Inventory</a>: six traits like expressiveness, preciseness, and verbal aggressiveness, combined factorially. Terse styles get 5 to 25 words, chatty ones 40 to 100. It is reproducible personality variation, not vibes. And because the grader never sees the style, a model cannot win by mirroring the user’s mood.</p>
<p>Proactivity also changes shape as a conversation runs, so triggers come in three phases. <strong>Emergent</strong> (turns 1 to 3): infer a need from a single clue. <strong>Critical</strong> (turns 4 to 7): synthesize several clues into a conclusion the user never stated. <strong>Recovery</strong> (turns 8 to 10): after the user says “done,” add real forward value instead of “let me know if you need anything.” Recovery is the interesting one, because task closure is precisely where a model can sound helpful while saying nothing.</p>
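<p>The phase boundaries are simple enough to write down directly. This is a small sketch of the mapping from turn number to phase as described above; the function name and the lowercase labels are illustrative.</p>

```python
def trigger_phase(turn: int) -> str:
    """Map a 1-indexed turn number to its phase, using the ranges given in the text."""
    if 1 <= turn <= 3:
        return "emergent"   # infer a need from a single clue
    if 4 <= turn <= 7:
        return "critical"   # synthesize several clues into an unstated conclusion
    if 8 <= turn <= 10:
        return "recovery"   # add forward value after the user says "done"
    raise ValueError(f"turn {turn} is outside the 10-turn range described here")


phases = [trigger_phase(turn) for turn in range(1, 11)]
```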
</section>
<section id="the-numbers" class="level3">
<h3 class="anchored" data-anchor-id="the-numbers">The numbers</h3>
<p>The chart at the top is every model’s pass rate by phase, over 198 dialogues and 624 triggers. Emergent and Critical track roughly what you would expect from general capability. Recovery is a cliff. The best model on the board, GPT-5.5, passes 37% of Recovery triggers and fails the other 63%. Fourteen of sixteen models pass fewer than one in five. Llama-3.2-8B passes one in a hundred.</p>
<p>And here is what makes Recovery worth measuring at all: it is only loosely correlated with everything else.</p>
<p><img src="https://alex.smola.org/posts/39-proactbench/correlation.png" class="img-fluid" alt="Pearson correlation heatmap of 9 benchmarks; GPQA, LCB, MMLU, IFEval, SWE, AIME, Emergent, and Critical cluster near 1.0, while Recovery (Recv.) shows notably lower correlations of 0.31-0.64."></p>
<p>Six standard benchmarks (GPQA, LiveCodeBench, MMLU, IFEval, SWE-bench, AIME) agree with each other at correlations of 0.64 to 0.97. They are all reading the same underlying capability. Emergent and Critical sit comfortably inside that cloud. Recovery sits outside it, at a mean correlation of 0.51. The cleanest way to see it is to line up two models that should be interchangeable:</p>
<table class="caption-top table">
<thead>
<tr class="header">
<th>Model</th>
<th style="text-align: center;">LiveCodeBench</th>
<th style="text-align: center;">SWE-bench</th>
<th style="text-align: center;">AIME</th>
<th style="text-align: center;"><strong>Recovery</strong></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>GPT-5.5</td>
<td style="text-align: center;">85.0</td>
<td style="text-align: center;">82.0</td>
<td style="text-align: center;">100.0</td>
<td style="text-align: center;"><strong>37.2</strong></td>
</tr>
<tr class="even">
<td>Kimi-K2.6</td>
<td style="text-align: center;">89.6</td>
<td style="text-align: center;">80.2</td>
<td style="text-align: center;">96.4</td>
<td style="text-align: center;"><strong>7.4</strong></td>
</tr>
</tbody>
</table>
<p>Kimi-K2.6 is neck and neck with GPT-5.5 across coding and competition math, even leading on LiveCodeBench, and then scores five times worse at noticing what the user didn’t say. No existing benchmark predicts that gap. Recovery is a genuinely separate axis. And as the <a href="https://alex.smola.org/posts/38-proactivity/">last post</a> showed, when you put these proactive responses in front of people, they prefer them four times out of five. A real capability, one people want, that no leaderboard was measuring.</p>
</section>
<section id="why-we-built-this" class="level3">
<h3 class="anchored" data-anchor-id="why-we-built-this">Why we built this</h3>
<p>At Boson AI we build human-agent interaction models, and the thing we actually care about is whether a whole conversation leaves the user better off, not whether a single answer was correct. Proactivity is the clearest case of something that lives in user satisfaction and is invisible to standard evals.</p>
<p>Synthetic dialogue is what makes it tractable. The three-agent machinery lets us generate dialogues, regenerate any model at the trigger turns, and re-score every judgment, at scale and reproducibly, with a rubric and a verbatim quote behind each label. You cannot do that with scraped chat logs, and you cannot audit a number you cannot regenerate. A good synthetic benchmark is a stand-in you can actually take apart.</p>
<hr>
<p>This is work led by <a href="https://www.linkedin.com/in/sepehr-harfi-664ba3b2/">Sepehr Harfi</a>, a research intern at Boson AI from the University of Toronto, with <a href="https://www.linkedin.com/in/ahmadsalimi/">Ahmad Salimi</a> and <a href="https://www.linkedin.com/in/dongming-shen-b405a2a8/">Dongming Shen</a>.</p>
<p>Blog and leaderboard: <a href="https://www.boson.ai/blog/proactbench">boson.ai/blog/proactbench</a> · Paper: <a href="https://arxiv.org/abs/2605.09228">arXiv:2605.09228</a> · Data: <a href="https://huggingface.co/datasets/bosonai/proactbench">huggingface.co/datasets/bosonai/proactbench</a> · Code: <a href="https://github.com/boson-ai/ProactBench">github.com/boson-ai/ProactBench</a>.</p>


</section>

 ]]></description>
  <category>evaluation</category>
  <category>llm</category>
  <category>benchmark</category>
  <guid>https://alex.smola.org/posts/39-proactbench/</guid>
  <pubDate>Mon, 01 Jun 2026 07:00:00 GMT</pubDate>
  <media:content url="https://alex.smola.org/posts/39-proactbench/results.png" medium="image" type="image/png" height="63" width="144"/>
</item>
<item>
  <title>What your assistant didn’t say</title>
  <link>https://alex.smola.org/posts/38-proactivity/</link>
  <description><![CDATA[ 





<p><img src="https://alex.smola.org/posts/38-proactivity/nannycar.jpg" class="img-fluid" alt="Open hatchback of a blue car packed with labeled equipment cases for a small art install, flanked by a handwritten checklist and a 'Fragile (Art!)' box."></p>
<p>A user has spent ten minutes preparing for a small art installation. The final exchange goes like this.</p>
<blockquote class="blockquote">
<p><strong>User:</strong> Plain-text packet is final. I’ll send tonight and load the hatchback after work.</p>
<p><strong>Model A:</strong> Sounds like a clean plan. Drive safe, and go get it.</p>
<p><strong>Model B:</strong> Before you go, a fast final check: projector + power cable; media player + charger; file tested and looping; tripod + mounting plate; extension cord; adapter; gaffer tape; measuring tape; small level; wall label on phone; backup copy on USB. If you can, pack the car tonight so tomorrow is just: grab computer, leave, install. Load in reverse order of install — the projector stand and extension cord go in last, so they’re the first things out.</p>
</blockquote>
<p>Both responses are fluent. Both end the conversation politely. The first is what most assistants do. The second one notices the disclosed detail — <em>load the hatchback</em> — and turns it into something the user didn’t ask for but will probably need.</p>
<section id="the-thing-without-a-name" class="level3">
<h3 class="anchored" data-anchor-id="the-thing-without-a-name">The thing without a name</h3>
<p>Most LLM benchmarks score how well a model answers what the user asked. That’s the entire shape of the field: a question goes in, an answer comes out, and we measure how good the answer is. Multi-turn benchmarks preserve the same premise turn by turn. Every leaderboard you’ve seen, from MMLU to LiveCodeBench, fits this mould.</p>
<p>What Model B is doing isn’t on any of those benchmarks. It’s not instruction-following — the user gave no instruction. It’s not clarification — no question was asked. It’s not generic helpfulness (“let me know if you need anything”). It’s the model picking up an implicit detail and acting on it.</p>
<p>Call this <strong>conversational proactivity</strong>. It’s a specific, narrow ability: noticing what the user disclosed but didn’t ask about, and turning it into grounded forward-looking value. The hatchback isn’t a request — it’s an aside. Model A heard the conversation end. Model B <em>anticipated</em> the next problem the user was about to have.</p>
</section>
<section id="why-this-is-invisible-to-leaderboards" class="level3">
<h3 class="anchored" data-anchor-id="why-this-is-invisible-to-leaderboards">Why this is invisible to leaderboards</h3>
<p>A model can lead every standard benchmark and still be Model A in the exchange above. The benchmarks score responses to explicit requests; the user’s wrap-up is not a request. Both Model A and Model B end the dialogue politely, with no factual errors, no failure to follow instructions. Under any of the usual metrics, they’re equivalent.</p>
<p>When we built a benchmark for this kind of proactive behaviour — more on that in a follow-up post — the gap between models was startling. The short version is that capability on standard benchmarks does not predict it. But that’s a separate argument. The first question is whether anyone actually wants Model B’s answer in the first place.</p>
</section>
<section id="the-skeptics-objection" class="level3">
<h3 class="anchored" data-anchor-id="the-skeptics-objection">The skeptic’s objection</h3>
<p>Worth taking the pushback seriously. A model that volunteers initiative every turn is intrusive. A model that adds an unsolicited packing list to every conversation will, eventually, suggest one when you didn’t want one. “Helpfulness” can shade into nannying. Sign-offs are <em>polite</em>. So even if Model B is technically more useful in this dialogue, would real users actually prefer it? Several of my colleagues explicitly told me that they wouldn’t want an AI nanny.</p>
<p>There’s only one way to find out: ask them (not just the vocal ones).</p>
</section>
<section id="the-experiment" class="level3">
<h3 class="anchored" data-anchor-id="the-experiment">The experiment</h3>
<p>We ran the cleanest version of this test we could design. Same model. Same conversation history. Same decoding parameters — temperature 0.7, identical top-<img src="https://latex.codecogs.com/png.latex?p">, identical sampling. The only difference: in one condition, the model received a short rubric as a system instruction telling it the response should add grounded forward-looking value tied to a specific detail from the conversation. In the other condition, vanilla generation.</p>
<p>Two responses per item. Random left/right placement. Annotators didn’t know which response came from which condition, didn’t know what we were testing, didn’t see the rubric. Just two paragraphs and a forced choice: which is more helpful?</p>
<p><strong>Result: across 144 paired comparisons, the rubric-conditioned response was preferred 80% of the time.</strong> The 95% confidence interval is [74%, 86%]. The probability of seeing this under chance is below <img src="https://latex.codecogs.com/png.latex?10%5E%7B-12%7D">.</p>
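<p>As a rough sanity check on those numbers, the sketch below runs an exact one-sided binomial test and a normal-approximation (Wald) interval using only the Python standard library. The win count of 115 is an assumption (80% of 144, rounded), and the post does not say which test or interval was actually used, so small differences from the reported interval are to be expected.</p>

```python
from math import comb, sqrt


def binomial_tail(wins, n):
    """Exact one-sided P(X >= wins) for X ~ Binomial(n, 1/2)."""
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2**n


def wald_interval(wins, n, z=1.96):
    """Normal-approximation 95% interval for a proportion."""
    p = wins / n
    half = z * sqrt(p * (1 - p) / n)
    return p - half, p + half


n = 144
wins = 115  # assumption: 80% of 144, rounded to the nearest integer

p_value = binomial_tail(wins, n)
low, high = wald_interval(wins, n)
print(f"p-value under chance: {p_value:.2e}")
print(f"approx. 95% interval: [{low:.3f}, {high:.3f}]")
```

<p>Under these assumptions the tail probability lands below 10^-12, in line with the figure quoted above.</p>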
<p>What surprised us most was the breakdown. We split the comparisons by how our judge had scored the vanilla response:</p>
<ul>
<li>On items where the vanilla response had been rated a failure, humans preferred the proactive version 82% of the time. Expected.</li>
<li>On items where the vanilla response had already passed, <strong>humans still preferred the proactive version 70% of the time.</strong></li>
</ul>
<p>That last number is the one that matters. The rubric isn’t acting as error correction at the failure boundary. It’s lifting quality across the whole distribution, including on responses that were already fine.</p>
</section>
<section id="what-this-means" class="level3">
<h3 class="anchored" data-anchor-id="what-this-means">What this means</h3>
<p>The behaviour was already in the model. Nothing about the model changed: no fine-tuning, no extra conversation history, no different decoder. What changed was a short rubric telling the model <em>where to spend its attention</em>. The proactive answer existed in the same neural network that produced the polite sign-off. They were separated by a prompt, not by a capability.</p>
<p>Which means: this isn’t a ceiling problem. It’s a default-behaviour problem. Post-training pipelines, RLHF, and system prompts are leaving real, measurable user value on the table — value that humans, when shown it side-by-side with the alternative, prefer four to one.</p>
<p>A model that can answer your question is the floor. A model that notices what you didn’t ask is the ceiling. The gap between them is bigger than the leaderboards suggest, and harder to teach than it looks. Spoiler alert: Model B comes from a company that has plenty of human-agent chat logs, and thus the opportunity to improve the model based on empirical evidence.</p>
<p>More on that, with numbers, in the next post.</p>
<hr>
<p>This is work led by <a href="https://www.linkedin.com/in/sepehr-harfi-664ba3b2/">Sepehr Harfi</a>, a research intern at Boson AI, together with <a href="https://www.linkedin.com/in/ahmadsalimi/">Ahmad Salimi</a> and <a href="https://www.linkedin.com/in/dongming-shen-b405a2a8/">Dongming Shen</a>. The benchmark we built around this idea, <em><a href="https://arxiv.org/abs/2605.09228">ProactBench</a></em>, will be the subject of the next post. And in case you wonder why we care about this: at Boson AI we’re building human-agent interaction models, and we want them to be as helpful to humans as possible. The only way to find out whether they are is to measure and test.</p>


</section>

 ]]></description>
  <category>llm</category>
  <category>evaluation</category>
  <category>hci</category>
  <guid>https://alex.smola.org/posts/38-proactivity/</guid>
  <pubDate>Fri, 29 May 2026 07:00:00 GMT</pubDate>
  <media:content url="https://alex.smola.org/posts/38-proactivity/nannycar.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Beating 2^n - 1 for InfiniBand</title>
  <link>https://alex.smola.org/posts/37-infiniband-2n/</link>
  <description><![CDATA[ 





<p><img src="https://alex.smola.org/posts/37-infiniband-2n/topology.png" class="img-fluid" alt="Rail-optimized 2-tier fat tree for 512 GPUs: 8 spine switches fully meshed to 16 leaf switches, each leaf serving 32 GPU ports grouped by rank across 64 servers."></p>
<p>One of my favorite interview questions for network engineers (no longer, now that I’m about to spoil it): you have 512 GPUs across 64 servers, eight GPUs per box. Your switches are NVIDIA Quantum-2 class, <a href="https://www.nvidia.com/en-us/networking/infiniband-switching/">MQM9700</a>, sixty-four 400Gb/s NDR ports each. Wire it for full cross-sectional bandwidth. How many switches, what topology, and where do you put the fabric manager?</p>
<p>The textbook part is a <a href="https://web.eecs.umich.edu/~mosharaf/Readings/Fat-Tree.pdf">two-tier fat tree</a>, also known as spine and leaf. With 64-port switches, split each leaf into 32 ports facing GPUs and 32 ports facing the spine. That gives 16 leaf switches (<img src="https://latex.codecogs.com/png.latex?16%20%5Ctimes%2032%20=%20512"> endpoints) and 8 spine switches (each leaf sends <img src="https://latex.codecogs.com/png.latex?32/8%20=%204"> links to each spine; each spine ends up with <img src="https://latex.codecogs.com/png.latex?16%20%5Ctimes%204%20=%2064"> down, filling the box). Bisection is full, switch count is 24. So far, so easy.</p>
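<p>The sizing arithmetic above can be sketched in a few lines. The helper name <code>two_tier_fat_tree</code> and its return format are made up for illustration, and the sketch assumes an even split of each leaf into down and up ports and a port count that divides cleanly.</p>

```python
def two_tier_fat_tree(endpoints, ports_per_switch):
    """Size a full-bisection leaf/spine fabric (hypothetical helper)."""
    down = ports_per_switch // 2        # leaf ports facing GPUs
    up = ports_per_switch - down        # leaf ports facing spines
    leaves = -(-endpoints // down)      # ceiling division
    spines = -(-(leaves * up) // ports_per_switch)
    links_per_pair = up // spines       # parallel links between one leaf and one spine
    return {
        "leaves": leaves,
        "spines": spines,
        "links_per_pair": links_per_pair,
        "switches": leaves + spines,
    }


plan = two_tier_fat_tree(endpoints=512, ports_per_switch=64)
print(plan)
```

<p>For 512 endpoints and 64-port switches this gives 16 leaves, 8 spines, 4 links per leaf-spine pair, and 24 switches in total.</p>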
<section id="two-ways-to-wire-the-gpus-to-the-leaves" class="level3">
<h3 class="anchored" data-anchor-id="two-ways-to-wire-the-gpus-to-the-leaves">Two ways to wire the GPUs to the leaves</h3>
<p>The lazy answer groups by server: take 4 servers, plug all <img src="https://latex.codecogs.com/png.latex?8%20%5Ctimes%204%20=%2032"> NICs into one leaf. Everything on those four boxes lives on one switch. The problem is that within a single server, the GPUs already talk over <a href="https://www.nvidia.com/en-us/data-center/nvlink/">NVLink</a> at much higher bandwidth than IB will ever give you. The fabric exists to move data <em>between</em> servers, not within them. Putting four servers’ worth of intra-node traffic on one leaf optimizes for traffic that doesn’t happen. Worse, in collectives most of the GPU-to-GPU traffic is rank aligned: GPU0 talks to GPU0, GPU7 talks to GPU7. Group-by-server puts every rank on every leaf, so all of that rank-aligned traffic crosses the spine.</p>
<p>The rail-optimized layout flips this. Group by rank. All sixty-four GPU0s across the cluster go to the rank-0 leaves, all sixty-four GPU1s to the rank-1 leaves, and so on. Since we have 32 ports per leaf facing endpoints, each rank splits into two leaves of 32, so 16 leaves total, the same count as before. An all-reduce within a rank touches only that rank’s two leaves, and traffic between GPUs on the same leaf never reaches the spine. Cross-rank traffic, the minority, climbs to the spine. This is how every serious training cluster gets wired, the <a href="https://docs.nvidia.com/dgx-superpod/">NVIDIA SuperPOD reference architecture</a> included.</p>
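<p>To make the difference between the two wirings concrete, here is a small sketch that counts how many rank-aligned GPU pairs (same GPU index, different servers) end up on the same leaf. The leaf numbering is an assumption for illustration: servers 0 to 31 go to the first leaf of each rank and servers 32 to 63 to the second. Real cabling plans may number things differently.</p>

```python
from fractions import Fraction
from itertools import combinations

SERVERS, GPUS_PER_SERVER, DOWN_PORTS = 64, 8, 32


def leaf_by_server(server, gpu):
    """Group-by-server: four whole servers share one leaf."""
    return server // (DOWN_PORTS // GPUS_PER_SERVER)


def leaf_by_rank(server, gpu):
    """Rail-optimized: leaf chosen by GPU rank, then by server half (assumed numbering)."""
    leaves_per_rank = SERVERS // DOWN_PORTS
    return gpu * leaves_per_rank + server // DOWN_PORTS


def leaf_local_fraction(leaf_of):
    """Fraction of rank-aligned GPU pairs that share a leaf."""
    local = total = 0
    for gpu in range(GPUS_PER_SERVER):
        for s1, s2 in combinations(range(SERVERS), 2):
            total += 1
            local += leaf_of(s1, gpu) == leaf_of(s2, gpu)
    return Fraction(local, total)


by_server = leaf_local_fraction(leaf_by_server)
by_rank = leaf_local_fraction(leaf_by_rank)
print(by_server, by_rank)
```

<p>Under this assumed numbering, 31/63 of the rank-aligned pairs share a leaf with rail-optimized wiring, versus 1/21 with group-by-server. The remaining pairs reach each other through the spine.</p>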
</section>
<section id="so-where-do-the-ufm-nodes-go" class="level3">
<h3 class="anchored" data-anchor-id="so-where-do-the-ufm-nodes-go">So where do the UFM nodes go?</h3>
<p>Quantum-class IB switches are unmanaged. You need a <a href="https://www.nvidia.com/en-us/networking/infiniband/ufm/">Unified Fabric Manager</a> appliance to discover the topology, compute routes, monitor link health, and do everything the closed switch ASIC will not do for itself. Two UFM nodes for fault tolerance, hanging off different switches so one switch failure does not take both out.</p>
<p>The cluster as designed has zero slack. Every leaf port is a GPU, every spine port is a leaf uplink. This is where the bad options start.</p>
</section>
<section id="bad-option-1-drop-a-server" class="level3">
<h3 class="anchored" data-anchor-id="bad-option-1-drop-a-server">Bad option 1: drop a server</h3>
<p>Sacrifice one server, free 8 leaf ports (one per rank, on 8 different leaves). Plug two UFMs in. Done. The cluster is now 504 GPUs. Any 3D-parallel placement that wanted 64 servers, or any clean power of two in any dimension, gets to be ugly. You also paid for a 64th server you can’t fully use. (This option was suggested by a vendor.)</p>
</section>
<section id="bad-option-2-add-a-core-switch" class="level3">
<h3 class="anchored" data-anchor-id="bad-option-2-add-a-core-switch">Bad option 2: add a core switch</h3>
<p>Buy a third tier. A core switch above the spine frees 16 ports on the spines (the spines now uplink to core), and you hang the UFMs off there. This works. It costs you a switch, a rack of cabling, a few hundred watts, and it breaks the symmetry. Multipath routing on a regular fat tree is a beautiful thing. Hanging an asymmetric appliance off the top makes everyone’s life harder. Also, you bought a switch. (This option was suggested by a vendor, too.)</p>
</section>
<section id="the-actually-good-option" class="level3">
<h3 class="anchored" data-anchor-id="the-actually-good-option">The actually good option</h3>
<p><img src="https://alex.smola.org/posts/37-infiniband-2n/topology-ufm.png" class="img-fluid" alt="Same 512-GPU fat-tree topology with one leaf-spine link cut; UFM-A plugs into the freed spine port and UFM-B into the freed leaf port, costing 1/512 of bisection bandwidth."></p>
<p>Here’s what we did at <a href="https://www.boson.ai">Boson AI</a>: cut one cable between a leaf and a spine. Plug a UFM into the freed leaf port. Plug a second UFM into the freed spine port. Done.</p>
<p>What did that cost? One leaf-spine pair now has 3 links instead of 4, so the bandwidth between that specific pair is down 25%. Averaged across all <img src="https://latex.codecogs.com/png.latex?8%20%5Ctimes%2016%20=%20128"> leaf-spine pairs, you lost <img src="https://latex.codecogs.com/png.latex?1/512"> of total cross-sectional capacity. The penalty only shows up when (a) you are running close to all 64 servers, (b) the traffic actually traverses that one degraded pair, and (c) the traffic is cross-rank, so it had to use the spine at all. In a rail-optimized topology, criterion (c) already excludes most of the load. In practice the oversubscription is invisible.</p>
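<p>The cost accounting is simple enough to check by hand, but a short sketch keeps the fractions honest. The constants restate the topology from earlier in the post; the variable names are illustrative only.</p>

```python
from fractions import Fraction

LEAVES, SPINES, LINKS_PER_PAIR = 16, 8, 4

pairs = LEAVES * SPINES                     # leaf-spine pairs
total_links = pairs * LINKS_PER_PAIR        # leaf-spine cables in the fabric
pair_loss = Fraction(1, LINKS_PER_PAIR)     # bandwidth lost on the one degraded pair
fabric_loss = Fraction(1, total_links)      # share of all leaf-spine capacity lost

print(pairs, total_links, pair_loss, fabric_loss)
```

<p>One cable out of 512 is removed, so the degraded pair loses a quarter of its bandwidth while the fabric as a whole loses 1/512.</p>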
<p>You keep 512 = <img src="https://latex.codecogs.com/png.latex?2%5E9"> GPUs. You keep the symmetric uniform fat tree. You spend no extra hardware. You get two redundant fabric managers on physically distinct switches. The whole thing is a one-cable change.</p>
</section>
<section id="going-to-9-cdot-2n" class="level3">
<h3 class="anchored" data-anchor-id="going-to-9-cdot-2n">Going to <img src="https://latex.codecogs.com/png.latex?9%20%5Ccdot%202%5En"></h3>
<p>Quantum-3 switches in the <a href="https://www.nvidia.com/en-us/networking/quantum-3/">Q3400</a> family come with <img src="https://latex.codecogs.com/png.latex?144%20=%209%20%5Ccdot%2016"> ports. As such, the natural cluster sizes land on <img src="https://latex.codecogs.com/png.latex?9%20%5Ccdot%202%5En"> GPUs instead of pure powers of two. As it turns out (thanks to <a href="http://semianalysis.com/">Semianalysis</a> for pointing this out), NVIDIA listened and added a <a href="https://docs.nvidia.com/networking/display/xdrswitcheshwum/introduction">dedicated UFM port</a> to the switches.</p>
<p>At the same time, they <a href="https://forums.developer.nvidia.com/t/can-i-install-and-activate-subnet-manager-on-nvos-with-q3400-ra-xdr-switch/340580">disabled</a> the possibility of running UFM on managed Quantum-2 switches. In case you wonder whether IB forwarding across Ethernet might be possible, this unfortunately doesn’t work for <a href="https://docs.nvidia.com/networking/display/UFMEnterpriseUMv6160/UFM+Communication+Requirements">control signals</a>.</p>


</section>

 ]]></description>
  <category>networking</category>
  <category>infiniband</category>
  <category>infrastructure</category>
  <guid>https://alex.smola.org/posts/37-infiniband-2n/</guid>
  <pubDate>Wed, 27 May 2026 07:00:00 GMT</pubDate>
  <media:content url="https://alex.smola.org/posts/37-infiniband-2n/topology-ufm.png" medium="image" type="image/png" height="81" width="144"/>
</item>
<item>
  <title>Fair representations are pancakes</title>
  <link>https://alex.smola.org/posts/36-pancake-theorem/</link>
  <description><![CDATA[ 





<p><img src="https://alex.smola.org/posts/36-pancake-theorem/pancakes.jpg" class="img-fluid" alt="Two thin pancakes stacked flat on a white plate on a wooden table, evoking the 'flattened representation' collapse described in the post"></p>
<p>The natural reaction to the <a href="../35-pokemon-theorem/">Pokémon theorem</a> is to escape into representation learning. If finite scalar checklists cannot catch every fairness violation, then bake fairness into the <em>features</em> once. Ship a single encoder <img src="https://latex.codecogs.com/png.latex?%5CPhi%20:%20%5Cmathcal%7BX%7D%20%5Cto%20%5Cmathcal%7BZ%7D"> and let every downstream predictor inherit fairness for free. There is a real and serious literature behind exactly this idea: <a href="https://proceedings.mlr.press/v28/zemel13.html">Zemel et al.&nbsp;(2013)</a>, <a href="https://arxiv.org/abs/1511.05897">Edwards and Storkey (2016)</a>, <a href="https://arxiv.org/abs/1511.00830">Louizos et al.&nbsp;(2016)</a>, <a href="https://arxiv.org/abs/1802.06309">Madras et al.&nbsp;(2018)</a>. It is a beautiful agenda, and it has a serious flaw …</p>
<section id="what-fair-features-actually-demand" class="level3">
<h3 class="anchored" data-anchor-id="what-fair-features-actually-demand">What “fair features” actually demand</h3>
<p>Two desiderata for a feature map <img src="https://latex.codecogs.com/png.latex?%5CPhi">:</p>
<ul>
<li><strong>Predictive signal.</strong> <img src="https://latex.codecogs.com/png.latex?%5CPhi(X)"> carries information about <img src="https://latex.codecogs.com/png.latex?Y"> (otherwise we can’t use it for classification).</li>
<li><strong>Distributional insensitivity to the group attribute <img src="https://latex.codecogs.com/png.latex?G">.</strong> That is, the group attribute shouldn’t matter in aggregate for the feature distribution. Operationally this gets split into two pieces:
<ul>
<li><strong>Parity</strong> in the representation: <img src="https://latex.codecogs.com/png.latex?%5Cmu_%7B%5CPhi,a%7D%20=%20%5Cmu_%7B%5CPhi,b%7D">.</li>
<li><strong>Class-conditional separation</strong>: <img src="https://latex.codecogs.com/png.latex?%5Cmu_%7B%5CPhi,y,a%7D%20=%20%5Cmu_%7B%5CPhi,y,b%7D"> for each <img src="https://latex.codecogs.com/png.latex?y%20%5Cin%20%5C%7B0,%201%5C%7D">.</li>
</ul></li>
</ul>
<p>Without the second piece the goal collapses into a triviality: throw away all signal about <img src="https://latex.codecogs.com/png.latex?Y"> and parity is automatic. Both together is the operational definition of a fair feature, and it is the form every paper above is chasing. The notation is the same as in the previous post: <img src="https://latex.codecogs.com/png.latex?%5Cmu_%7B%5CPhi,g%7D%20:=%20%5Cmathbb%7BE%7D%5B%5Cphi_%5Cmathcal%7BZ%7D(%5CPhi(X))%20%5Cmid%20G%20=%20g%5D"> is the conditional mean embedding of the representation, and <img src="https://latex.codecogs.com/png.latex?%5Cmu_%7B%5CPhi,y,g%7D%20:=%20%5Cmathbb%7BE%7D%5B%5Cphi_%5Cmathcal%7BZ%7D(%5CPhi(X))%20%5Cmid%20Y%20=%20y,%20G%20=%20g%5D"> adds the class.</p>
</section>
<section id="linear-algebra-strikes-again" class="level3">
<h3 class="anchored" data-anchor-id="linear-algebra-strikes-again">Linear algebra strikes again</h3>
<p>Apply the law of total expectation conditional on <img src="https://latex.codecogs.com/png.latex?G%20=%20g">. With class-conditional separation, the class-conditional means <img src="https://latex.codecogs.com/png.latex?%5Cmu_%7B%5CPhi,y,g%7D"> do not depend on <img src="https://latex.codecogs.com/png.latex?g">, so we can drop the subscript and write <img src="https://latex.codecogs.com/png.latex?%5Cmu_%7B%5CPhi,y%7D">. Then we can write the per-group mean as</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cmu_%7B%5CPhi,g%7D%20%5C;=%5C;%20p_g%20%5C,%20%5Cmu_%7B%5CPhi,1%7D%20+%20(1%20-%20p_g)%5C,%20%5Cmu_%7B%5CPhi,0%7D."></p>
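<p>Here <img src="https://latex.codecogs.com/png.latex?p_g"> is the group-specific base rate, i.e. the probability <img src="https://latex.codecogs.com/png.latex?P(Y%20=%201%20%5Cmid%20G%20=%20g)"> of the positive class within group <img src="https://latex.codecogs.com/png.latex?g">. Subtract across groups:</p>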
<p><img src="https://latex.codecogs.com/png.latex?%5Cmu_%7B%5CPhi,a%7D%20-%20%5Cmu_%7B%5CPhi,b%7D%20%5C;=%5C;%20(p_a%20-%20p_b)%5C,(%5Cmu_%7B%5CPhi,1%7D%20-%20%5Cmu_%7B%5CPhi,0%7D)."></p>
<p>Parity zeros the left-hand side. Unequal base rates <img src="https://latex.codecogs.com/png.latex?p_a%20%5Cneq%20p_b"> keep the scalar coefficient nonzero, so the bracket must vanish:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cmu_%7B%5CPhi,1%7D%20%5C;=%5C;%20%5Cmu_%7B%5CPhi,0%7D."></p>
<p>In other words, the distributions for positive and negative classes match, since under a characteristic kernel on <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BZ%7D">, equality of mean embeddings lifts to equality of distributions. We get <img src="https://latex.codecogs.com/png.latex?%5CPhi(X)%20%5Cperp%20Y">. No measurable downstream predictor on <img src="https://latex.codecogs.com/png.latex?%5CPhi(X)"> retains any signal about <img src="https://latex.codecogs.com/png.latex?Y">.</p>
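<p>The mixture identity behind this argument is easy to check numerically. The sketch below uses made-up two-dimensional class means and base rates (all values are illustrative, not from the paper) and treats the mean embeddings as plain vectors.</p>

```python
def group_mean(p, mu1, mu0):
    """Per-group mean as a base-rate mixture of the two class means."""
    return [p * a + (1 - p) * b for a, b in zip(mu1, mu0)]


def parity_gap(p_a, p_b, mu1, mu0):
    """Difference of the two group means, coordinate by coordinate."""
    mu_a = group_mean(p_a, mu1, mu0)
    mu_b = group_mean(p_b, mu1, mu0)
    return [x - y for x, y in zip(mu_a, mu_b)]


p_a, p_b = 0.6, 0.3                  # illustrative, unequal base rates
mu1, mu0 = [1.0, 2.0], [0.0, -1.0]   # illustrative class means

lhs = parity_gap(p_a, p_b, mu1, mu0)
rhs = [(p_a - p_b) * (x - y) for x, y in zip(mu1, mu0)]
collapsed = parity_gap(p_a, p_b, mu1, mu1)  # identical class means

print(lhs, rhs, collapsed)
```

<p>With distinct class means the gap is nonzero and matches the right-hand side of the identity; it only drops to zero once the two class means coincide, which is the collapse described above.</p>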
<p>The representation has been flattened. Both stacks of pancakes are now one.</p>
</section>
<section id="why-this-is-harder-to-swallow-than-the-pokémon-theorem" class="level3">
<h3 class="anchored" data-anchor-id="why-this-is-harder-to-swallow-than-the-pokémon-theorem">Why this is harder to swallow than the Pokémon theorem</h3>
<p>The Pokémon theorem said <em>some</em> fairness criterion remains unaudited after any finite checklist. There was a residual, and you could go bigger. The collapse here is not a residual. It is the joint distribution of <img src="https://latex.codecogs.com/png.latex?(%5CPhi(X),%20Y)"> being degenerate. Whatever your encoder is (linear, kernel, deep network, contrastive, adversarial), if it satisfies parity and class-conditional separation exactly and base rates differ, there is nothing in there about <img src="https://latex.codecogs.com/png.latex?Y">. Closely related observations appear in <a href="https://arxiv.org/abs/2107.03483">Lechner et al.&nbsp;(2021)</a> and <a href="https://arxiv.org/abs/1906.08386">Zhao and Gordon (2022)</a>; the conjunction form here pins down the exact obstruction.</p>
<p>There is a decade of work that took the two desiderata as a starting point and built encoders that approximately satisfied both. The theorem does not say those encoders are useless. It only says that the limit they were chasing is empty. But there’s hope: check out Theorem 6.1 in the paper.</p>
</section>
<section id="the-forbidden-corner" class="level3">
<h3 class="anchored" data-anchor-id="the-forbidden-corner">The forbidden corner</h3>
<p>The experimental check across three standard fairness benchmarks:</p>
<p><img src="https://alex.smola.org/posts/36-pancake-theorem/forbidden-corner.png" class="img-fluid" alt="Three scatter plots (Adult Income, COMPAS, ACS PUMS) each showing parity gap vs. class-conditional gap for LFR, Fair-VAE, and adversarial models; the forbidden upper-left corner is shaded pink and lies above the theoretical bound line"></p>
<p>Each panel plots the parity gap <img src="https://latex.codecogs.com/png.latex?%5C%7C%5Chat%5Cmu_%7B%5CPhi,a%7D%20-%20%5Chat%5Cmu_%7B%5CPhi,b%7D%5C%7C"> on the horizontal axis against the largest class-conditional gap <img src="https://latex.codecogs.com/png.latex?%5Cmax_g%20%5C%7C%5Chat%5Cmu_%7B%5CPhi,1,g%7D%20-%20%5Chat%5Cmu_%7B%5CPhi,0,g%7D%5C%7C"> on the vertical axis. The diagonal <img src="https://latex.codecogs.com/png.latex?y%20=%20(x%20+%20%5Crho)%20/%20%7C%5CDelta%20p%7C"> is the theoretical bound from the approximate version of the theorem (Theorem 6.1). The upper-left corner (small parity gap, large class-conditional signal) is the forbidden region.</p>
<p>The forbidden region is empirically empty. LFR, Fair-VAE, and adversarial debiasing across all three datasets cluster on the allowed side of the line. They are doing the only thing they can.</p>
</section>
<section id="the-escape" class="level3">
<h3 class="anchored" data-anchor-id="the-escape">The escape</h3>
<p>The exact collapse only bites at the exact <img src="https://latex.codecogs.com/png.latex?(0,%200)"> point. Relax to <img src="https://latex.codecogs.com/png.latex?%5Cvarepsilon">-parity and <img src="https://latex.codecogs.com/png.latex?%5Crho">-class-separation and you get that the usable class-conditional signal is bounded by</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cfrac%7B%5Cvarepsilon%20+%20%5Crho%7D%7B%7Cp_a%20-%20p_b%7C%7D."></p>
<p>Linear in the joint fairness budget, rescaled by the base-rate gap. As <img src="https://latex.codecogs.com/png.latex?(%5Cvarepsilon,%20%5Crho)%20%5Cto%20(0,%200)"> the bound collapses to zero, matching the exact theorem. But for any positive budget there is a real Pareto frontier you can navigate and trade off fairness for usefulness in features. The practical lesson is that the question is not whether your representation is fair. It is which fairness budget you spent and how much class signal you have left. Spend it deliberately.</p>
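<p>As a quick worked example of the budget formula, the sketch below evaluates the bound for a few hypothetical budgets. The function name <code>signal_bound</code> and the numbers are assumptions chosen for illustration.</p>

```python
def signal_bound(eps, rho, p_a, p_b):
    """Upper bound (eps + rho) / |p_a - p_b| on the usable class-conditional signal."""
    gap = abs(p_a - p_b)
    if gap == 0:
        raise ValueError("equal base rates: the bound does not apply")
    return (eps + rho) / gap


# Illustrative base rates of 0.3 and 0.1, i.e. a base-rate gap of 0.2.
loose = signal_bound(eps=0.05, rho=0.05, p_a=0.3, p_b=0.1)
tight = signal_bound(eps=0.005, rho=0.005, p_a=0.3, p_b=0.1)
exact = signal_bound(eps=0.0, rho=0.0, p_a=0.3, p_b=0.1)

print(loose, tight, exact)
```

<p>Shrinking the joint budget by a factor of ten shrinks the admissible signal by the same factor, and a zero budget leaves no signal at all.</p>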
</section>
<section id="closing" class="level3">
<h3 class="anchored" data-anchor-id="closing">Closing</h3>
<p>The <a href="../35-pokemon-theorem/">Pokémon theorem</a> said finite scalar checklists cannot catch every violation. The pancake theorem says perfect fair representations cannot retain <img src="https://latex.codecogs.com/png.latex?Y">. Two different orthogonality arguments, same RKHS engine. The paper has a Pokémon–KMR bridge and minimax rates for the approximate versions of both.</p>
<p>Paper: <a href="https://arxiv.org/abs/2605.09221">arXiv:2605.09221</a>. Joint with <a href="https://www.linkedin.com/in/danielmatsuismola/">Daniel Matsui Smola</a>.</p>


</section>

 ]]></description>
  <category>fairness</category>
  <category>kernel</category>
  <guid>https://alex.smola.org/posts/36-pancake-theorem/</guid>
  <pubDate>Tue, 26 May 2026 07:00:00 GMT</pubDate>
  <media:content url="https://alex.smola.org/posts/36-pancake-theorem/pancakes.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>You don’t need all the benchmarks</title>
  <link>https://alex.smola.org/posts/34-benchmark-selection/</link>
  <description><![CDATA[ 





<p><img src="https://alex.smola.org/posts/34-benchmark-selection/selection.png" class="img-fluid" alt="Dot plot showing greedy mutual-information selection order (x-axis: selection position k 1-15) for 18 MMLU subjects; red diamonds mark mean position across 10 folds, blue dots show per-fold positions"></p>
<p>Every time a new model comes out, somebody runs it on <a href="https://arxiv.org/abs/2009.03300">MMLU</a> (57 subjects), <a href="https://huggingface.co/spaces/mteb/leaderboard">MTEB</a> (56 tasks), <a href="https://crfm.stanford.edu/helm/">HELM</a>, the <a href="https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard">Open LLM Leaderboard</a>, <a href="https://tatsu-lab.github.io/alpaca_eval/">AlpacaEval</a>, <a href="https://livebench.ai/">LiveBench</a>, <a href="https://bigcode-bench.github.io/">BigCodeBench</a>, <a href="https://huggingface.co/spaces/allenai/WildBench">WildBench</a>, <a href="https://github.com/lmarena/arena-hard-auto">Arena-Hard</a>, <a href="https://arxiv.org/abs/2306.05685">MT-Bench</a>, and a dozen others. That’s days of GPU time and a lot of human babysitting. But if you’ve ever stared at a leaderboard for ten minutes you already know the dirty secret: the columns are wildly correlated. If a model is good at one math benchmark it’s good at all of them. So how much of this can we just skip?</p>
<p>A lot, as it turns out. On MMLU, <strong>5 subjects out of 57 predict the remaining 52 with <img src="https://latex.codecogs.com/png.latex?R%5E2%20%5Capprox%200.91"></strong>, across 5,452 models, with 10-fold cross-validation. The eigenspectrum of the score covariance tells the same story: two components capture 90% of the variance on MMLU, six on MTEB. Benchmark scores live in a low-dimensional subspace.</p>
<p>The question is <em>which</em> subset, and how to pick it.</p>
<section id="a-gaussian-model-of-leaderboard-scores" class="level3">
<h3 class="anchored" data-anchor-id="a-gaussian-model-of-leaderboard-scores">A Gaussian model of leaderboard scores</h3>
<p>Let <img src="https://latex.codecogs.com/png.latex?B%20%5Cin%20%5Cmathbb%7BR%7D%5E%7BM%20%5Ctimes%20N%7D"> be the score matrix with <img src="https://latex.codecogs.com/png.latex?M"> models and <img src="https://latex.codecogs.com/png.latex?N"> benchmarks. Assume each row is an independent draw from a multivariate Gaussian:</p>
<p><img src="https://latex.codecogs.com/png.latex?B_%7Bi,%5Ccdot%7D%20%5Csim%20%5Cmathcal%7BN%7D(%5Cmu,%20%5CSigma)."></p>
<p>Obviously wrong. Benchmark scores are bounded, often bimodal, frequently sparse. But it’s extremely useful: <img src="https://latex.codecogs.com/png.latex?%5Cmu"> and <img src="https://latex.codecogs.com/png.latex?%5CSigma"> are easy to estimate from leaderboard data (with <a href="https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm">EM</a> for the missing entries you’ll inevitably have), and conditioning on a subset <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BA%7D"> gives a closed-form imputer:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Chat%7BB%7D_%7Bi,%5Cbar%7B%5Cmathcal%7BA%7D%7D%7D%20%5C;=%5C;%20%5Cmu_%7B%5Cbar%7B%5Cmathcal%7BA%7D%7D%7D%20+%20%5CSigma_%7B%5Cbar%7B%5Cmathcal%7BA%7D%7D%5Cmathcal%7BA%7D%7D%20%5CSigma_%7B%5Cmathcal%7BA%7D%5Cmathcal%7BA%7D%7D%5E%7B-1%7D%20(B_%7Bi,%5Cmathcal%7BA%7D%7D%20-%20%5Cmu_%7B%5Cmathcal%7BA%7D%7D)."></p>
<p>The conditional mean is the best <em>linear</em> predictor regardless of whether the scores are actually Gaussian, so the model doesn’t need to be right to give us a reasonable imputer. It just needs to give us <img src="https://latex.codecogs.com/png.latex?%5CSigma">.</p>
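<p>To make the imputer concrete, here is a minimal NumPy sketch of the conditional-mean formula above. The function name <code>impute</code> and the toy <code>mu</code> and <code>Sigma</code> are made up for illustration and are not taken from the paper's code; in practice both would be estimated from leaderboard data.</p>

```python
import numpy as np

def impute(mu, Sigma, observed_idx, b_observed):
    """Predict the unobserved scores from the observed ones (conditional mean)."""
    n = mu.shape[0]
    A = np.asarray(observed_idx)
    rest = np.setdiff1d(np.arange(n), A)      # indices of benchmarks we did not run
    S_AA = Sigma[np.ix_(A, A)]                # covariance among observed benchmarks
    S_rA = Sigma[np.ix_(rest, A)]             # cross-covariance unobserved vs. observed
    # Solve a linear system rather than forming the inverse explicitly.
    correction = S_rA @ np.linalg.solve(S_AA, b_observed - mu[A])
    return rest, mu[rest] + correction

# Toy example with three benchmarks (illustrative numbers only).
mu = np.array([0.5, 0.6, 0.4])
Sigma = np.array([[1.0, 0.8, 0.3],
                  [0.8, 1.0, 0.2],
                  [0.3, 0.2, 1.0]])
rest, pred = impute(mu, Sigma, [0], np.array([1.5]))
print(rest, pred)
```

<p>Running benchmark 0 one unit above its mean pulls the strongly coupled benchmark 1 up by 0.8 and the weakly coupled benchmark 2 up by only 0.3.</p>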
</section>
<section id="this-is-the-sensor-placement-problem" class="level3">
<h3 class="anchored" data-anchor-id="this-is-the-sensor-placement-problem">This is the sensor placement problem</h3>
<p>Once you have <img src="https://latex.codecogs.com/png.latex?%5CSigma">, picking which <img src="https://latex.codecogs.com/png.latex?k"> benchmarks to run is exactly the Gaussian process sensor placement problem <a href="https://www.jmlr.org/papers/v9/krause08a.html">Krause, Singh and Guestrin solved in 2008</a>. Their sensors are our benchmarks; the field they’re trying to reconstruct is our matrix of unselected benchmark scores. Two natural objectives:</p>
<ul>
<li><strong>Entropy</strong> <img src="https://latex.codecogs.com/png.latex?f_1(%5Cmathcal%7BA%7D)%20=%20H(X_%5Cmathcal%7BA%7D)%20=%20%5Ctfrac%7B1%7D%7B2%7D%20%5Clog%5Cdet(2%5Cpi%20e%20%5C,%20%5CSigma_%7B%5Cmathcal%7BA%7D%5Cmathcal%7BA%7D%7D)">. Pick benchmarks that are diverse from each other.</li>
<li><strong>Mutual information</strong> <img src="https://latex.codecogs.com/png.latex?f_2(%5Cmathcal%7BA%7D)%20=%20I(X_%5Cmathcal%7BA%7D;%20X_%7B%5Cbar%7B%5Cmathcal%7BA%7D%7D%7D)">. Pick benchmarks that are maximally coupled with the ones you don’t run.</li>
</ul>
<p>Both are submodular under the Gaussian model. Entropy is also monotone (after a constant shift), so greedy comes with the standard <img src="https://latex.codecogs.com/png.latex?(1-1/e)"> approximation guarantee. MI is non-monotone in general but stays positive-margin for small <img src="https://latex.codecogs.com/png.latex?k"> in every dataset we tried, so we run greedy on it as a heuristic.</p>
<p>Greedy means: start empty, add the benchmark with the largest marginal gain, repeat. For entropy this is one rank-one Cholesky update per step; for MI it’s a fresh Cholesky on the complement block. Either way the total cost is negligible compared to running a single benchmark on a single model.</p>
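<p>As a rough sketch of the entropy variant (not the code from the repository), greedy selection can be written as a pivoted partial Cholesky factorization: the marginal entropy gain of a candidate is a monotone function of its conditional variance given the current picks, so each step takes the largest residual variance and then applies one rank-one update. The function name <code>greedy_entropy</code> and the toy matrix are assumptions for illustration.</p>

```python
import numpy as np

def greedy_entropy(Sigma, k):
    """Greedily pick k indices to maximize log det of the selected covariance block."""
    n = Sigma.shape[0]
    resid = np.diag(Sigma).astype(float).copy()  # conditional variances given the picks
    L = np.zeros((n, k))                         # columns of the partial Cholesky factor
    chosen = np.zeros(n, dtype=bool)
    picks = []
    for t in range(k):
        # Marginal gain is 0.5 * log(2 * pi * e * resid[j]), so argmax of resid suffices.
        j = int(np.argmax(np.where(chosen, -np.inf, resid)))
        picks.append(j)
        chosen[j] = True
        # Rank-one update: new factor column, then shrink the residual variances.
        col = (Sigma[:, j] - L[:, :t] @ L[j, :t]) / np.sqrt(resid[j])
        L[:, t] = col
        resid = resid - col ** 2
    return picks

# Toy covariance (illustrative numbers only).
Sigma = np.array([[4.0, 1.0, 0.5],
                  [1.0, 1.0, 0.2],
                  [0.5, 0.2, 2.0]])
picks = greedy_entropy(Sigma, 2)
print(picks)
```

<p>On this toy matrix the greedy picks are benchmarks 0 and 2, which also happens to be the pair with the largest determinant.</p>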
<p>These two objectives pick <em>different</em> benchmarks. Entropy hunts for outlier subjects that disagree with everything else; MI hunts for hub subjects that strongly couple with the rest. For <em>imputation</em>, MI is what you want. But for <em>diversity</em>, entropy is the right tool. A surprise worth picking apart another time.</p>
</section>
<section id="the-numbers" class="level3">
<h3 class="anchored" data-anchor-id="the-numbers">The numbers</h3>
<p>We tried this on three score matrices stitched together from ten public leaderboards: MMLU (5452 × 57, fully observed), MTEB (263 × 56, 77% observed), and a merged matrix from nine leaderboards (118 × 114, 31% observed). 10-fold CV, fit <img src="https://latex.codecogs.com/png.latex?%5CSigma"> on 9 folds, impute the held-out fold.</p>
<ul>
<li><strong>MMLU</strong>: <img src="https://latex.codecogs.com/png.latex?k=5%20%5CRightarrow%20R%5E2%20=%200.91">. <img src="https://latex.codecogs.com/png.latex?k=10%20%5CRightarrow%20R%5E2%20%3E%200.92">. Stable even with only 10% of models for training (~545 rows).</li>
<li><strong>MTEB</strong>: <img src="https://latex.codecogs.com/png.latex?R%5E2%20%5Capprox%200.85"> at <img src="https://latex.codecogs.com/png.latex?k=15">, harder because the matrix is sparser and embedding tasks are genuinely heterogeneous.</li>
<li><strong>Merged</strong>: the hardest case (<img src="https://latex.codecogs.com/png.latex?R%5E2%20%5Capprox%200.55"> at <img src="https://latex.codecogs.com/png.latex?k=15">), but still 15 benchmarks out of 114 explaining over half the variance in held-out scores.</li>
</ul>
<p>A useful baseline: random selection on MMLU gets <img src="https://latex.codecogs.com/png.latex?R%5E2%20%5Capprox%200.89"> at <img src="https://latex.codecogs.com/png.latex?k=5">. The 57 subjects are <em>so</em> redundant that you can almost throw darts. The gap opens up on MTEB and Merged, where the benchmark space has real structure and principled selection matters.</p>
<p>The hero image up top is what MI picks on MMLU, with selection positions averaged across the 10 folds (red diamonds) and per-fold positions in light blue. The first nine picks are identical across all 10 folds. The picks themselves are sensible: <code>miscellaneous</code>, <code>professional_psychology</code>, <code>elementary_mathematics</code>, <code>high_school_psychology</code>, <code>marketing</code>. A deliberately broad sweep of the MMLU domain.</p>
</section>
<section id="were-not-alone" class="level3">
<h3 class="anchored" data-anchor-id="were-not-alone">We’re not alone</h3>
<p>A few neighbors in this corner of the literature:</p>
<ul>
<li><a href="https://arxiv.org/abs/2402.14992">tinyBenchmarks</a> reduces redundancy <em>within</em> a benchmark by selecting representative examples.</li>
<li><a href="https://arxiv.org/abs/2412.06540">Sloth</a> fits low-dimensional latent skills across benchmark families to predict performance.</li>
<li><a href="https://arxiv.org/abs/2407.13696">BenchBench</a> diagnoses agreement among benchmarks via meta-benchmarking.</li>
<li><a href="https://github.com/anadim/llm-benchmark-matrix">BenchPress</a> is the closest spiritual neighbor. It empirically observes that benchmark scores can be interpolated. Our entropy step turns out to be exactly pivoted Cholesky on <img src="https://latex.codecogs.com/png.latex?%5CSigma">, which is more or less what BenchPress does, but information-theoretically motivated rather than empirical.</li>
</ul>
</section>
<section id="one-application-ranking-a-new-benchmark" class="level3">
<h3 class="anchored" data-anchor-id="one-application-ranking-a-new-benchmark">One application: ranking a new benchmark</h3>
<p>The same machinery applies in the dual. Instead of selecting benchmarks from a fixed pool, you can <em>rank</em> a new benchmark by its marginal MI gain against the incumbents. We tried this on the recently proposed <a href="https://arxiv.org/abs/2605.09228">ProactBench</a>, which decomposes conversational proactivity into Emergent, Critical, and Recovery axes. Recovery, assistance the user didn’t explicitly ask for, turns out to be weakly correlated with everything else (<img src="https://latex.codecogs.com/png.latex?%5Cbar%20r%20=%200.51"> vs.&nbsp;<img src="https://latex.codecogs.com/png.latex?0.64">–<img src="https://latex.codecogs.com/png.latex?0.97"> for the standard benchmarks), and greedy entropy on the 9×9 correlation matrix ranks it #2 out of 9, ahead of every existing benchmark in the comparison. A clean example of when adding a new benchmark is actually worth the GPU time.</p>
<p>Paper: <a href="https://arxiv.org/abs/2605.02209">arXiv:2605.02209</a> · <a href="workshop_benchselect.pdf">Workshop talk slides</a> · Code and score matrices: <a href="https://github.com/smolix/benchmark-selection">github.com/smolix/benchmark-selection</a>.</p>


</section>

 ]]></description>
  <category>evaluation</category>
  <category>submodularity</category>
  <guid>https://alex.smola.org/posts/34-benchmark-selection/</guid>
  <pubDate>Mon, 25 May 2026 07:00:00 GMT</pubDate>
  <media:content url="https://alex.smola.org/posts/34-benchmark-selection/selection.png" medium="image" type="image/png" height="134" width="144"/>
</item>
<item>
  <title>The Pokémon Theorem</title>
  <link>https://alex.smola.org/posts/35-pokemon-theorem/</link>
  <description><![CDATA[ 





<p><img src="https://alex.smola.org/posts/35-pokemon-theorem/pikachu.jpg" class="img-fluid" alt="Pikachu sprinting across a glowing 3D coordinate grid with labeled data points, illustrating the geometric escape argument"></p>
<p>If you have been near algorithmic fairness for the past decade, you already know the punchline: you cannot have all the things you want. Calibration, class-conditional balance, and predictive parity refuse to coexist whenever base rates differ across groups. This is the impossibility trinity of <a href="https://arxiv.org/abs/1609.05807">Kleinberg, Mullainathan and Raghavan (2017)</a>, <a href="https://arxiv.org/abs/1610.07524">Chouldechova (2017)</a> and <a href="https://arxiv.org/abs/1709.02012">Pleiss et al.&nbsp;(2017)</a>. <a href="https://fairmlbook.org/">Barocas, Hardt and Narayanan</a> collect the pairwise tensions among independence, separation and sufficiency into what they call the incompatibility triangle. <a href="https://arxiv.org/abs/1811.10104">Hutchinson and Mitchell (2019)</a> survey the fifty years of educational-testing impossibility work that came before all of this. The 1970s knew most of the punchline. None of us listened.</p>
<p>There is an obvious question that the trinity leaves open. What if we only impose <em>scalar</em> criteria, rather than distributional ones? Will a few of them suffice to ensure fairness? After all, this is what individual scalar metrics such as college acceptance rates, recidivism rates, etc. aim to control for, only to be thwarted again by another study. With dozens of criteria, with hundreds, with every reasonable test anyone has ever proposed, can we close every gap? It turns out we cannot. The argument is a very simple piece of linear algebra.</p>
<section id="a-kernel-refresher" class="level3">
<h3 class="anchored" data-anchor-id="a-kernel-refresher">A kernel refresher</h3>
<p>Pick a characteristic kernel <img src="https://latex.codecogs.com/png.latex?k"> on the feature space <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BX%7D"> with feature map <img src="https://latex.codecogs.com/png.latex?%5Cphi">. For each group <img src="https://latex.codecogs.com/png.latex?g%20%5Cin%20%5C%7Ba,%20b%5C%7D">, the conditional mean embedding is</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cmu_g%20%5C;:=%5C;%20%5Cmathbb%7BE%7D%5B%5Cphi(X)%20%5Cmid%20G%20=%20g%5D%20%5C;%5Cin%5C;%20%5Cmathcal%7BH%7D."></p>
<p>A linear mean-fairness criterion is a test direction <img src="https://latex.codecogs.com/png.latex?v%20%5Cin%20%5Cmathcal%7BH%7D">, which defines a score <img src="https://latex.codecogs.com/png.latex?c(x)%20=%20%5Clangle%20v,%20%5Cphi(x)%20%5Crangle">. The score (or classifier, or learned representation) passes the test exactly when the expected score on group <img src="https://latex.codecogs.com/png.latex?a"> matches that on group <img src="https://latex.codecogs.com/png.latex?b">:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BE%7D%5Bc(X)%20%5Cmid%20G%20=%20a%5D%20-%20%5Cmathbb%7BE%7D%5Bc(X)%20%5Cmid%20G%20=%20b%5D%20%5C;=%5C;%20%5Clangle%20v,%20%5Cdelta%20%5Crangle%20%5C;=%5C;%200,"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?%5Cdelta%20:=%20%5Cmu_a%20-%20%5Cmu_b"> is the group-difference vector. Demographic parity gaps, calibration moments, equalized-odds residuals, every “is the expected something the same across groups” check fits this form for some <img src="https://latex.codecogs.com/png.latex?v">. A fairness checklist of size <img src="https://latex.codecogs.com/png.latex?m"> is a finite set of directions <img src="https://latex.codecogs.com/png.latex?%5C%7Bv_1,%20%5Cldots,%20v_m%5C%7D">, and the classifier passes the checklist iff <img src="https://latex.codecogs.com/png.latex?%5Cdelta"> is orthogonal to every <img src="https://latex.codecogs.com/png.latex?v_i">.</p>
<p>If the groups are distributionally distinct, <img src="https://latex.codecogs.com/png.latex?P_a%20%5Cneq%20P_b">, then <img src="https://latex.codecogs.com/png.latex?%5Cdelta%20%5Cneq%200"> by the characteristic property of <img src="https://latex.codecogs.com/png.latex?k">. This is the only ingredient we need.</p>
</section>
<section id="the-geometric-escape" class="level3">
<h3 class="anchored" data-anchor-id="the-geometric-escape">The geometric escape</h3>
<p>Let <img src="https://latex.codecogs.com/png.latex?V_m%20=%20%5Cmathrm%7Bspan%7D%5C%7Bv_1,%20%5Cldots,%20v_m%5C%7D"> be the subspace of the tests we apply (this works, since we’re in an RKHS). If our classifier passes all <img src="https://latex.codecogs.com/png.latex?m"> criteria, then <img src="https://latex.codecogs.com/png.latex?%5Cdelta%20%5Cin%20V_m%5E%5Cperp">, the orthogonal complement of the audit subspace. Now consider the unit vector <img src="https://latex.codecogs.com/png.latex?%5Cdelta%20/%20%5C%7C%5Cdelta%5C%7C_%5Cmathcal%7BH%7D">. By construction it also lies in <img src="https://latex.codecogs.com/png.latex?V_m%5E%5Cperp">, and</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cleft%5Clangle%20%5Cfrac%7B%5Cdelta%7D%7B%5C%7C%5Cdelta%5C%7C_%5Cmathcal%7BH%7D%7D,%20%5C,%20%5Cdelta%20%5Cright%5Crangle%20%5C;=%5C;%20%5C%7C%5Cdelta%5C%7C_%5Cmathcal%7BH%7D%20%5C;=%5C;%20%5Cmathrm%7BMMD%7D(P_a,%20P_b)%20%5C;%3E%5C;%200."></p>
<p>That direction is a fairness violation. We just constructed it. Whichever finite checklist you brought, the MMD witness is sitting in its orthogonal complement, perfectly visible, completely unaudited.</p>
<p>You <em>Gotta Catch ’Em All</em>. Alas, the Pokémon theorem says that you can’t.</p>
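<p>The escape argument is easy to check numerically in a finite-dimensional stand-in for the RKHS. The sketch below is only an illustration: the dimension, the random test directions and the random group-difference vector are all made up, and a 50-dimensional Euclidean space is of course not a characteristic kernel. It forces the difference vector to pass every test on the checklist and then measures the gap along the witness direction.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 50, 10                     # ambient dimension and checklist size (illustrative)
V = rng.normal(size=(m, d))       # rows are hypothetical test directions v_1, ..., v_m
delta_raw = rng.normal(size=d)    # hypothetical group-difference vector

# Make delta pass every test by projecting out the audit subspace span{v_i}.
Q, _ = np.linalg.qr(V.T)          # orthonormal basis of the audit subspace, shape (d, m)
delta = delta_raw - Q @ (Q.T @ delta_raw)

audit_gaps = V @ delta            # every audited gap is zero up to rounding
witness = delta / np.linalg.norm(delta)
unaudited_gap = witness @ delta   # equals the norm of delta

print(np.abs(audit_gaps).max(), unaudited_gap)
```

<p>All ten audited gaps vanish, while the gap along the witness equals the norm of the difference vector, which is the finite-dimensional analogue of the MMD.</p>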
</section>
<section id="why-does-this-work" class="level3">
<h3 class="anchored" data-anchor-id="why-does-this-work">Why does this work?</h3>
<p>Our argument is structural, not numerical. It does not depend on which criteria you chose, only on the fact that there are finitely many of them and the groups are distributionally distinct. The four-hundredth criterion buys you an audit subspace of dimension at most four hundred, and the orthogonal complement of a four-hundred-dimensional subspace of an infinite-dimensional Hilbert space is still infinite-dimensional. Some direction in it carries the group-difference signal. The MMD witness is always one such direction.</p>
<p>You can ask how <em>large</em> the residual is after <img src="https://latex.codecogs.com/png.latex?m"> criteria, and the paper gives a quantitative answer in terms of the spectral regularity of <img src="https://latex.codecogs.com/png.latex?%5Cdelta"> relative to the pooled data covariance: under polynomial eigendecay of the covariance operator and a source condition on <img src="https://latex.codecogs.com/png.latex?%5Cdelta">, the minimax residual decays at the Kolmogorov <img src="https://latex.codecogs.com/png.latex?m">-width rate <img src="https://latex.codecogs.com/png.latex?%5CTheta(m%5E%7B-2%5Calpha%20r%7D)">. The minimax-optimal allocation of a size-<img src="https://latex.codecogs.com/png.latex?m"> fairness budget is the top-<img src="https://latex.codecogs.com/png.latex?m"> Mercer eigenspace of the pooled covariance. Spectral budgeting beats heuristic checklists. The paper has the details.</p>
</section>
<section id="provenance" class="level3">
<h3 class="anchored" data-anchor-id="provenance">Provenance</h3>
<p>I first posited this theorem (without proof) in a <a href="https://c.d2l.ai/stanford-cs329p/_static/pdfs/cs329p_slides_17_4.pdf">Stanford CS 329P lecture</a> on applied machine learning. It sat there for a few years, in the form of a single slide and a feeling that it ought to be true. Then my son <a href="https://www.linkedin.com/in/danielmatsuismola/">Daniel Matsui Smola</a> and I sat down together and turned it into a proof, a quantitative version, and a handful of approximation results around it. I have written quite a few papers over the years. This is the first one I have written with Daniel, and I am incredibly proud of him. Watching him chase down the spectral arguments and add many more results has been one of the genuine joys this year.</p>
<p>Next post in this series: the impossibility of fair feature learning. A different and more brutal piece of orthogonality folklore from the same paper. Check out <a href="https://arxiv.org/abs/2605.09221">arXiv:2605.09221</a>.</p>


</section>

 ]]></description>
  <category>fairness</category>
  <category>kernel</category>
  <guid>https://alex.smola.org/posts/35-pokemon-theorem/</guid>
  <pubDate>Mon, 25 May 2026 07:00:00 GMT</pubDate>
  <media:content url="https://alex.smola.org/posts/35-pokemon-theorem/pikachu.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Dive into Deep Learning</title>
  <link>https://alex.smola.org/posts/33-d2l/</link>
  <description><![CDATA[ 





<p><img src="https://alex.smola.org/posts/33-d2l/front.png" class="img-fluid" alt="Cover of 'Dive into Deep Learning' by Aston Zhang, Zachary Lipton, Mu Li, and Alexander Smola, showing a mountain landscape"></p>
<p>I’m happy to announce our new book project - <a href="http://d2l.ai">Dive into Deep Learning</a>. It’s still in beta stage, i.e.&nbsp;we’re still working on it. That said, I think that it’s good enough to share with friends and colleagues. This is joint work with <a href="https://www.astonzhang.com/">Aston Zhang</a>, <a href="http://www.cs.cmu.edu/~muli/">Mu Li</a>, and <a href="https://www.zacharylipton.com/">Zachary Lipton</a>.</p>
<p>There’s an obvious question - why yet another machine learning book? After all, there’s no shortage of great books, e.g.&nbsp;<a href="https://probml.github.io/pml-book/">Machine Learning</a> by Kevin Murphy, <a href="https://link.springer.com/book/9780387310732">Pattern Recognition and Machine Learning</a> by Chris Bishop, <a href="https://www.deeplearningbook.org/">Deep Learning</a> by Ian Goodfellow, Yoshua Bengio and Aaron Courville, <a href="http://www.inference.org.uk/itprnn/book.html">Information Theory, Inference and Learning Algorithms</a> by the late David MacKay, and many others. At the same time, there are ready-made recipe books for various deep learning frameworks, such as <a href="https://www.oreilly.com/library/view/learning-tensorflow/9781491978504/">Learning TensorFlow</a> by Itay Lieder, Yehezkel Resheff, and Tom Hope. However, none of these books has so far attempted to bridge the gap between theory and practice. This is what the current book project aims to address. In particular, it combines the following features:</p>
<ul>
<li>Downloadable <a href="https://d2l.ai/d2l-en.zip">Jupyter notebooks</a>. In fact, the entire book consists of notebooks.</li>
<li>A freely available <a href="https://d2l.ai/d2l-en.pdf">PDF</a> version</li>
<li>A <a href="https://github.com/d2l-ai/d2l-en">GitHub repository</a> to allow for fast corrections of errata</li>
<li>A tight integration with discussion forums to allow for questions regarding the math and code on the site</li>
<li>Theoretical background suitable for engineers and undergraduate researchers</li>
<li>State-of-the-art models (including ResNet, Faster R-CNN, etc.)</li>
<li>Well-documented and structured code that is executed on real datasets, yet at the same time small enough to run on a laptop.</li>
<li>A <a href="https://zh.d2l.ai/">Chinese translation</a> (in fact, the Chinese book will be released first)</li>
</ul>
<p>In addition to that, Mu and I will be teaching a <a href="https://c.d2l.ai/berkeley-stat-157/">class at UC Berkeley</a> in Spring. As part of that, we will be releasing slides, videos and assignments, suitable for reuse by anyone who’d like to do so. In short, we aim to offer a complete resource to learn deep learning, easily and in a comprehensive manner. Please let us know what you think.</p>



 ]]></description>
  <category>d2l</category>
  <category>book</category>
  <guid>https://alex.smola.org/posts/33-d2l/</guid>
  <pubDate>Tue, 15 Jan 2019 08:00:00 GMT</pubDate>
  <media:content url="https://alex.smola.org/posts/33-d2l/front.png" medium="image" type="image/png" height="182" width="144"/>
</item>
<item>
  <title>Leaving CMU</title>
  <link>https://alex.smola.org/posts/32-leaving-cmu/</link>
  <description><![CDATA[ 





<p><img src="https://alex.smola.org/posts/32-leaving-cmu/cmu.png" class="img-fluid" alt="Carnegie Mellon University wordmark in white serif text on a crimson red background"></p>
<p>Dear Friends,<br>
As some of you may have already heard, I’m leaving CMU to join Amazon, effective July 1, 2016. There I will be in charge of Amazon’s Cloud Machine Learning Platform with the task to make machine learning as easy to use and widespread as it could possibly be. This is a terrific task and it was an offer that I could not turn down. Our lab will be in the Bay Area and we will strive to turn the state of the art in machine learning research into the state of the art in industry. Both in terms of scale and in terms of model sophistication. This is a very exciting time and I’m looking forward to it. If all goes well, this will raise the bar also in academia.</p>
<p>I wanted to say thanks to everyone who’s helped us along on this journey. First off, an absolutely outstanding cast of PhD students I’ve had the pleasure of working with. Mu, Manzil, Zichao, Fish, Yu-Xiang, Seth, Alex, Ziqi, Wei, Yining, Chao-Yuan, Sashank, you guys rock! And yes, most of you are smarter than me, and this has been utterly delightful. Second, I wanted to thank the terrific CMU faculty. I learned a lot and the past 4 years have been a wonderful experience. Many thanks in particular to Dave, Andy, Andrew, Tom, Geoff, Nina, Christos, Ryan, Barnabas, Larry, Jaime and Kayvon. Many thanks for your advice, help, insight, and collaboration. What we achieved in the past 4 years would not have been possible without a great team. And thanks to Mallory and Diane for keeping the ship running no matter what, and thanks to Russ for keeping the accounting in good shape and for putting up with last-minute updates. Apologies if I forgot someone on this list (I probably have). It has been wonderful to work with you and I dearly love CMU.</p>
<p>So why the change? Here’s the reasoning that went into deciding to go to Amazon: Our goal as machine learning researchers is to solve deep problems (not just in deep learning) and to ensure that this leads to algorithms that are actually used. At scale. At sophistication. In applications. The number of people I could possibly influence personally through papers and teaching might be 10,000. At Amazon we have 1 million developers using AWS. Likewise, the NSF thinks that a project of 3 engineers is a big grant (and it is very choosy in awarding these grants). At Amazon we will be investing an order of magnitude more resources towards this problem. With data and computers to match this. This is significant leverage. Hence the change.</p>
<p>We will try to give back to the academic community, to contribute back, e.g.&nbsp;through open source. And the goal is to engage it more deeply. I cannot give more specifics yet but good things will happen. Stay tuned.</p>
<p>Technically I will be on leave of absence until August 2017. This is mostly to ensure that the students are taken care of and that grants are in good shape. I’ll be reaching out to many of you directly (the reason why I haven’t done this to all of you yet is that I’m in the middle of moving).</p>
<p>Let’s stay in touch. And, obviously, if you’re interested in joining me on this journey (or have students who would like to come along), please let me know.</p>
<p>Best, Alex</p>
<p><em>PS: I’m posting this letter (which was intended for my team at CMU and my colleagues there) after screenshots of parts of it seem to have gone live on Weibo.</em></p>



 ]]></description>
  <category>CMU</category>
  <category>Amazon</category>
  <guid>https://alex.smola.org/posts/32-leaving-cmu/</guid>
  <pubDate>Wed, 01 Jun 2016 07:00:00 GMT</pubDate>
  <media:content url="https://alex.smola.org/posts/32-leaving-cmu/cmu.png" medium="image" type="image/png" height="144" width="144"/>
</item>
<item>
  <title>Distributing Data in a Parameter Server</title>
  <link>https://alex.smola.org/posts/31-distributing-data/</link>
  <description><![CDATA[ 





<p><img src="https://alex.smola.org/posts/31-distributing-data/bipartite.png" class="img-fluid" alt="Bipartite graph with 4 red server nodes at top and 5 blue client nodes at bottom, connected by bidirectional arrows showing all-to-all communication"></p>
<p>One of the key features of a parameter server is that it, well, serves parameters. In particular, it serves more parameters than a single machine can typically hold and provides more bandwidth than what a single machine offers. A sensible strategy to increase both aspects is to arrange data in the form of a bipartite graph with clients on one side and the server machines on the other. This way bandwidth and storage increase linearly with the number of machines involved. This is well understood. For instance, distributed (key,value) stores such as <a href="http://memcached.org/">memcached</a> or <a href="https://riak.com/riak/">Basho Riak</a> use it. It dates back to the ideas put forward by <a href="http://people.csail.mit.edu/karger/">David Karger</a> et al.&nbsp;on <a href="https://dl.acm.org/doi/10.1145/258533.258660">Consistent Hashing and Random Trees</a> in STOC 1997.</p>
<p>A key problem is that we can obviously not store a mapping table from the keys to the machines. This would require a database that is of the same size as the set of keys and that would need to be maintained and updated on each client. One way around this is to use the argmin hash mapping. That is, given a machine pool <img src="https://latex.codecogs.com/png.latex?M">, we assign a given (key,value) pair to the machine that has the smallest hash, i.e.</p>
<p><img src="https://latex.codecogs.com/png.latex?m(k,M)=%5Cmathop%7B%5Cmathrm%7Bargmin%7D%7D_%7Bm%20%5Cin%20M%7D%20h(m,k)"></p>
<p>The advantage of this scheme is that it allows for really good load balancing and repair. First off, the load is almost uniformly distributed, short of a small number of heavy hitters. Secondly, if a machine is removed or added to the machine pool, rebalancing affects all other machines uniformly. To see this, notice that the choice of machine with the smallest and second-smallest hash value is uniform.</p>
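<p>A small Python sketch of the argmin hash mapping, under the assumption that the hash of a (machine, key) pair is taken from SHA-256 (the original systems may well have used something else). The machine names and keys are invented for the example. It also checks the repair property: when a machine is removed, only the keys it owned get reassigned.</p>

```python
import hashlib

def h(machine, key):
    """Deterministic 64-bit hash of a (machine, key) pair."""
    digest = hashlib.sha256(f"{machine}|{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def assign(key, machines):
    """Return the machine with the smallest hash for this key."""
    return min(machines, key=lambda m: h(m, key))

machines = ["s0", "s1", "s2", "s3"]            # hypothetical machine pool
keys = [f"key{i}" for i in range(1000)]        # hypothetical keys

before = {k: assign(k, machines) for k in keys}
after = {k: assign(k, machines[:-1]) for k in keys}   # remove machine s3
moved = [k for k in keys if before[k] != after[k]]

print(len(moved), "keys moved, all of them previously on s3")
```

<p>Note that looking up a single key requires hashing it against every machine in the pool, and that consecutive keys land on unrelated machines, which is exactly the synchronization problem discussed next.</p>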
<p>Unfortunately, this is a stupid way of distributing (key,value) pairs for machine learning. And this is what we did in our <a href="https://dl.acm.org/doi/10.14778/1920841.1920931">2010 VLDB</a> and <a href="https://dl.acm.org/doi/10.1145/2124295.2124312">2012 WSDM</a> papers. In our defense, we didn’t know any better. And others copied that approach … after all, how can you improve on such nice rebalancing properties?</p>
<p>This raises the question of why it is a bad idea. It all comes down to the issue of synchronization. Basically, whenever a client attempts to synchronize its keys, it needs to traverse the list of the keys it owns and communicate with the appropriate servers. In the above scheme, it means that we need to communicate with a new random server for each key. This is amazingly costly. Probably the best comparison would be a P2P network where each byte is owned by a different machine. Downloads would take forever.</p>
<p>We ‘fixed’ this problem by cleverly reordering the access and then performing a few other steps of randomization. There’s even a nice load balancing lemma in the 2012 WSDM paper. However, a much better solution is to prevent the problem from happening and to borrow from key distribution algorithms such as <a href="https://en.wikipedia.org/wiki/Chord_(peer-to-peer)">Chord</a>. In it, servers are inserted into a ring via a hash function. So are keys. This means that each server now owns a <strong>contiguous segment of keys</strong>. As a result, we can easily determine which keys go to which server, simply by knowing where in the ring the server sits.</p>
<p><img src="https://alex.smola.org/posts/31-distributing-data/ring.png" class="img-fluid" alt="Consistent hashing ring with 6 server machines placed at intervals; red star icons mark keys assigned to the colored arc segment of the nearest server"></p>
<p>In the picture above, keys are represented by little red stars. They are randomly assigned using a hash function via h(k) to the segments ‘owned’ by servers s that are inserted in the same way, i.e.&nbsp;via h(s). In the picture above, each server ‘owns’ the segment to its left. Also have a look at the <a href="https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf">Amazon Dynamo</a> paper by DeCandia et al., 2007 SOSP for a related description.</p>
<p>Obviously, such load balancing isn’t quite as good as with the argmin hash. For instance, if a machine fails, the next machine inherits the entire segment. However, by inserting each server log(n) times we can ensure that a good load balance is achieved and also that when machines are removed, there are several other machines that pick up the work. Moreover, it is now also very easy to replicate things (more on this later). If you’re curious about how to do this, have a look at <a href="https://www.microsoft.com/en-us/research/people/amar/">Amar Phanishayee</a>’s excellent <a href="http://reports-archive.adm.cs.cmu.edu/anon/2012/CMU-CS-12-139.pdf">thesis</a>. In a nutshell, the machines to the left hold the replicas.</p>
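<p>Here is a minimal sketch of such a ring with multiple insertions per server. It is not the parameter server implementation: the class name <code>Ring</code>, the <code>replicas</code> parameter (a fixed count standing in for log(n)) and the SHA-256 hash are assumptions made for the example. A key goes to the first server position at or after its own hash, wrapping around, so each position owns the arc of hash values that precedes it.</p>

```python
import bisect
import hashlib

def h(label):
    """Map a string to a position on the ring (64-bit integer)."""
    digest = hashlib.sha256(label.encode()).digest()
    return int.from_bytes(digest[:8], "big")

class Ring:
    def __init__(self, servers, replicas=8):
        # Insert every server several times to even out the segment lengths.
        self.points = sorted((h(f"{s}#{r}"), s) for s in servers for r in range(replicas))
        self.hashes = [p for p, _ in self.points]

    def lookup(self, key):
        # First server position at or after h(key); wrap around at the end of the ring.
        i = bisect.bisect_left(self.hashes, h(key)) % len(self.points)
        return self.points[i][1]

keys = [f"key{i}" for i in range(1000)]        # hypothetical keys
full = Ring(["s0", "s1", "s2", "s3"])          # hypothetical server names
reduced = Ring(["s0", "s1", "s2"])             # same ring after s3 has failed

moved = [k for k in keys if full.lookup(k) != reduced.lookup(k)]
heirs = {reduced.lookup(k) for k in moved}
print(len(moved), "keys moved to", sorted(heirs))
```

<p>Only the keys that s3 owned are reassigned, and with several insertions per server they are typically picked up by more than one surviving machine rather than by a single neighbor.</p>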



 ]]></description>
  <category>parameter server</category>
  <guid>https://alex.smola.org/posts/31-distributing-data/</guid>
  <pubDate>Tue, 15 Jan 2013 08:00:00 GMT</pubDate>
  <media:content url="https://alex.smola.org/posts/31-distributing-data/bipartite.png" medium="image" type="image/png" height="68" width="144"/>
</item>
<item>
  <title>100 Terabytes, 5 Billion Documents, 10 Billion Parameters, 1 Billion Inserts/s</title>
  <link>https://alex.smola.org/posts/30-100-terabytes/</link>
  <description><![CDATA[ 





<p><img src="https://alex.smola.org/posts/30-100-terabytes/ps.png" class="img-fluid" alt="Scatter plot comparing distributed ML systems by number of cores vs shared parameters, with Parameter Server achieving the highest scale"></p>
<p>We’ve been busy building the next generation of a Parameter Server and it’s finally ready. Check out the <a href="https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-li_mu.pdf">OSDI 2014 paper</a> by Li et al. It’s quite different from our previous designs, the main improvements being fault tolerance and self repair, a much improved network protocol, flexible consistency models, and a much more general interface.</p>
<p>In the next few posts I’ll explain the engineering decisions that went into this system which is capable of solving problems as diverse as very high throughput sketching, topic models, and optimization. And yes, it is <a href="https://github.com/dmlc/ps-lite">open source</a>.</p>



 ]]></description>
  <category>parameter server</category>
  <category>distributed learning</category>
  <guid>https://alex.smola.org/posts/30-100-terabytes/</guid>
  <pubDate>Sat, 15 Dec 2012 08:00:00 GMT</pubDate>
  <media:content url="https://alex.smola.org/posts/30-100-terabytes/ps.png" medium="image" type="image/png" height="110" width="144"/>
</item>
<item>
  <title>Beware the bandwidth gap - speeding up optimization</title>
  <link>https://alex.smola.org/posts/28-bandwidth-gap/</link>
  <description><![CDATA[ 





<p>Disks are slow and RAM is fast. Everyone knows that. But many optimization algorithms don’t take advantage of this. More to the point, disks currently stream at about 100-200 MB/s, solid state drives stream at over 500 MB/s with 1000x lower latency than disks, and main memory reigns supreme at about 10-100 GB/s bandwidth (depending on how many memory banks you have). This means that it is 100 times more expensive to retrieve instances from disk rather than recycling them once they’re already in memory. CPU caches are faster yet with 100-1000 GB/s of bandwidth. Everyone knows this. If not, read Jeff Dean’s <a href="http://static.googleusercontent.com/media/research.google.com/en/us/people/jeff/stanford-295-talk.pdf">slides</a>. Page 13 is pure gold.</p>
<p>Update (11/2022): some things have gotten a lot faster over the past decade. For instance, SSDs now regularly deliver multiple GB/s. The number of requests served per second has increased dramatically, too, to several 100k IOPS (I/O operations per second). For a recent comparison check out this <a href="https://www.anandtech.com/bench/SSD21/">Anandtech benchmark</a>. At the same time, memory bandwidth has increased to 100-1000 GB/s, depending on the system. Many variants of Jeff Dean’s latency numbers have sprung up. One of the more recent ones is the one by <a href="https://colin-scott.github.io/personal_website/research/interactive_latency.html">Colin Scott</a>.</p>
<p>Ok, so what does this mean for machine learning? If you can keep things in memory, you can do things way faster. This is one of the key ideas behind <a href="https://href.li/?http://spark.apache.org/">Spark</a>. It’s a wonderful alternative to Hadoop. In other words, if your data fits into memory, you’re safe and you can process data way faster. A lot of datasets that are considered big in academia fit this bill (in 2022 you can easily rent or buy 256-1024GB instances). But what about real big data? Essentially you have two options - have the systems designer do the hard work or change your algorithm. This post is about the latter. And yes, there’s a good case to be made about who should do the work: the machine learners or the folks designing the computational infrastructure (I think it’s both).</p>
<p>So here’s the problem: Many online algorithms load data from disk, stream it through memory as efficiently as possible and discard it after seeing it once, only to pick it up later for another pass through the data. That is, these algorithms are disk bound rather than CPU bound. Several solvers try to address this by making the disk representation more efficient, e.g.&nbsp;<a href="https://github.com/cjlin1/liblinear">LibLinear</a> or <a href="https://github.com/VowpalWabbit/vowpal_wabbit">VowpalWabbit</a>, both of which use their own internal representation for efficiency. While this still makes for quite efficient code that can stream up to 10-100TB of data per hour, in any given pass, main memory is still much faster. This has led to the misconception that many machine learning algorithms are disk bound. But, they aren’t …</p>
<p>What if we could re-use data that’s in memory? For instance, use a ringbuffer where the disk writes into it (much more slowly) and the CPU reads from it (100 times more rapidly). The problem is what to do with an observation that we’ve already processed. A naive strategy would be to pretend that it is a new instance, i.e.&nbsp;we could simply update on it more than once. But this is very messy since we need to keep track of how many times we’ve seen the instance before, and it creates nonstationarity in the training set.</p>
<p>A much cleaner strategy is to switch to dual variables, similar to the updates in the <a href="https://doi.org/10.1007/11776420_32">Dualon</a> of Shalev-Shwartz and Singer. This is what <a href="https://www.u-tokyo.ac.jp/focus/en/people/people100370.html">Shin Matsushima</a> did in our <a href="https://dl.acm.org/doi/10.1145/2339530.2339559">dual cached loops</a> paper (the StreamSVM implementation unfortunately has fallen into disrepair in the past decade). Essentially, it keeps data in memory in a ringbuffer and updates the dual variables. This way, we’re guaranteed to make progress at each step, even if we’re revisiting the same observation more than once. To see what happens have a look at the graph below:</p>
<p><img src="https://alex.smola.org/posts/28-bandwidth-gap/stream.png" class="img-fluid" alt="Log-scale plot of SVM objective value vs. time (seconds) comparing StreamSVM and liblinear with 2 GB memory against liblinear with 200 GB memory"></p>
<p>It’s just as fast as LibLinear provided that it’s all in memory. Algorithmically, what happens in the SVM case is that one updates the Lagrange multipliers α<sub>i</sub>, while simultaneously keeping an estimate of the parameter vector w available.</p>
<p>That said, this strategy is more general: reuse data several times for optimization while it is in memory. If possible, perform successive updates by changing variables of an optimization that is well-defined regardless of the order in which (and how frequently) data is seen.</p>
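<p>To make the idea concrete, here is a minimal sketch of dual updates on cached data. It is not the StreamSVM code: it assumes a linear SVM with hinge loss, uses the standard dual coordinate update, and simulates the disk with a plain Python loop. The names <code>dual_update</code>, <code>train_cached</code>, <code>buffer_size</code> and <code>revisits</code> are made up for illustration.</p>

```python
import random

def dual_update(w, alpha, i, x, y, C=1.0):
    """One dual coordinate step for a linear SVM with hinge loss.

    Changes alpha[i] and keeps w = sum_j alpha[j] * y[j] * x[j] in sync.
    The dual objective never decreases, so revisiting i is harmless.
    """
    margin = y * sum(wk * xk for wk, xk in zip(w, x))
    sq_norm = sum(xk * xk for xk in x)
    if sq_norm == 0.0:
        return
    # Unconstrained optimum for alpha[i], clipped to the box [0, C].
    new_alpha = min(C, max(0.0, alpha[i] + (1.0 - margin) / sq_norm))
    delta = new_alpha - alpha[i]
    alpha[i] = new_alpha
    for k, xk in enumerate(x):
        w[k] += delta * y * xk

def train_cached(data, buffer_size=2, revisits=20, C=1.0, seed=0):
    """Stream examples into a small ring buffer and revisit cached ones."""
    rng = random.Random(seed)
    dim = len(data[0][0])
    w = [0.0] * dim
    alpha = [0.0] * len(data)
    ring = []                       # indices of examples currently "in memory"
    for i in range(len(data)):      # slow producer: one new example per step
        if len(ring) == buffer_size:
            ring.pop(0)             # evict the oldest cached example
        ring.append(i)
        for _ in range(revisits):   # fast consumer: many cheap updates
            j = rng.choice(ring)
            x, y = data[j]
            dual_update(w, alpha, j, x, y, C)
    return w, alpha
```

<p>Since each step only moves one dual variable towards its optimum, there is no need to count how often an observation has been seen. In a real system the producer and consumer would run in separate threads.</p>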



 ]]></description>
  <category>optimization</category>
  <category>caching</category>
  <guid>https://alex.smola.org/posts/28-bandwidth-gap/</guid>
  <pubDate>Mon, 15 Oct 2012 07:00:00 GMT</pubDate>
  <media:content url="https://alex.smola.org/posts/28-bandwidth-gap/stream.png" medium="image" type="image/png" height="108" width="144"/>
</item>
<item>
  <title>The Weisfeiler-Leman algorithm and estimation on graphs</title>
  <link>https://alex.smola.org/posts/27-weisfeiler-lehman/</link>
  <description><![CDATA[ 





<div class="columns">
<div class="column" style="width:20%;">
<p><img src="https://alex.smola.org/posts/27-weisfeiler-lehman/isomorph1.png" class="img-fluid" alt="Bipartite-like graph with 8 colored nodes (a, b, c, d, g, h, i, j) connected by crossing edges, illustrating a graph isomorphism example"></p>
</div><div class="column" style="width:40%;">
<p><img src="https://alex.smola.org/posts/27-weisfeiler-lehman/isomorph2.png" class="img-fluid" alt="Graph with 8 numbered colored nodes arranged as two concentric squares, showing inner and outer ring connections as an isomorphism example"></p>
</div>
</div>
<p>Imagine you have two graphs <img src="https://latex.codecogs.com/png.latex?G"> and <img src="https://latex.codecogs.com/png.latex?G%E2%80%B2"> and you’d like to check how similar they are. If all vertices have unique attributes this is quite easy:</p>
<ul>
<li><strong>forall</strong> vertices <img src="https://latex.codecogs.com/png.latex?v%20%5Cin%20G%20%5Ccup%20G%E2%80%B2"> <strong>do</strong>
<ul>
<li>check that <img src="https://latex.codecogs.com/png.latex?v%20%5Cin%20G"> and that <img src="https://latex.codecogs.com/png.latex?v%20%5Cin%20G%E2%80%B2"></li>
<li>check that the neighbors of <img src="https://latex.codecogs.com/png.latex?v"> are the same in <img src="https://latex.codecogs.com/png.latex?G"> and <img src="https://latex.codecogs.com/png.latex?G%E2%80%B2"></li>
</ul></li>
</ul>
<p>This algorithm can be carried out in linear time in the size of the graph. Unfortunately, many graphs do not have vertex attributes, let alone unique vertex attributes. In fact, graph isomorphism, i.e.&nbsp;the task of checking whether two graphs are identical, is a hard problem (it is still an open research question how hard it really is). In this case the above algorithm cannot be used since we have no idea which vertices we should match up.</p>
<p>The <a href="https://www.iti.zcu.cz/wl2018/pdf/wl_paper_translation.pdf">Weisfeiler-Leman</a> algorithm (see David Bieber’s <a href="https://davidbieber.com/post/2019-05-10-weisfeiler-lehman-isomorphism-test/">blog post</a> for a pretty visualization) is a mechanism for assigning fairly unique attributes efficiently. Note that it isn’t guaranteed to work, as discussed in this <a href="https://arxiv.org/abs/1101.5211">paper</a> by Douglas in 2011. If it were, it would solve the graph isomorphism problem after all. The idea is to assign fingerprints to vertices and their neighborhoods repeatedly. We assume that vertices have an attribute to begin with. If they don’t then simply assign all of them the attribute 1. Each iteration proceeds as follows:</p>
<ul>
<li><strong>forall</strong> vertices <img src="https://latex.codecogs.com/png.latex?v%20%5Cin%20G"> <strong>do</strong>
<ul>
<li>compute a hash of <img src="https://latex.codecogs.com/png.latex?(a_v,a_%7Bv_1%7D,%20%5Cldots%20a_%7Bv_n%7D)"> where <img src="https://latex.codecogs.com/png.latex?a_%7Bv_i%7D"> are the attributes of the neighbors of vertex <img src="https://latex.codecogs.com/png.latex?v">.</li>
<li>use the hash as vertex attribute for <img src="https://latex.codecogs.com/png.latex?v"> in the next iteration.</li>
</ul></li>
</ul>
<p>The algorithm terminates when this iteration has converged in terms of unique assignments of hashes to vertices. Note that it is not guaranteed to work for all graphs. In particular, it fails for graphs with a high degree of symmetry, e.g.&nbsp;chains, complete graphs, tori and stars. However, whenever it converges to a unique vertex attribute assignment it provides a certificate for graph isomorphism. Moreover, the sets of vertex attributes can be used to show that two graphs are not isomorphic (it suffices to verify that the sets differ at any stage).</p>
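<p>Here is a rough sketch of the iteration in Python. It makes a few assumptions that are not part of the description above: graphs are adjacency dictionaries, neighbor attributes are sorted so that the fingerprint does not depend on neighbor order, and a shared dictionary that hands out integer ids stands in for the hash function. All function names are illustrative.</p>

```python
def wl_refine(adjacency, labels, shared_ids):
    """One Weisfeiler-Leman iteration.

    adjacency: dict mapping vertex -> list of neighbors
    labels:    dict mapping vertex -> current attribute
    shared_ids: dict mapping fingerprints to compact integer ids; pass the
                same dict for every graph so that attributes stay comparable
    """
    new_labels = {}
    for v, neighbors in adjacency.items():
        # Sort neighbor attributes so the fingerprint ignores neighbor order.
        fingerprint = (labels[v], tuple(sorted(labels[u] for u in neighbors)))
        # A dictionary lookup stands in for the hash; it is collision free.
        new_labels[v] = shared_ids.setdefault(fingerprint, len(shared_ids))
    return new_labels

def wl_histories(graphs, iterations):
    """Return, per graph, the sorted attribute multiset after each iteration."""
    labels = [{v: 1 for v in g} for g in graphs]   # no attributes: all ones
    histories = [[] for _ in graphs]
    for _ in range(iterations):
        shared_ids = {}
        labels = [wl_refine(g, lab, shared_ids) for g, lab in zip(graphs, labels)]
        for hist, lab in zip(histories, labels):
            hist.append(sorted(lab.values()))
    return histories

def maybe_isomorphic(g1, g2, iterations=3):
    """False means certainly not isomorphic; True means WL cannot tell."""
    h1, h2 = wl_histories([g1, g2], iterations)
    return h1 == h2
```

<p>A path and a triangle on three vertices are told apart after one iteration. A 6-cycle and two disjoint triangles are not, since every vertex has exactly two neighbors in both graphs. That is the symmetry failure mentioned above.</p>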
<p><a href="https://www.jmlr.org/papers/volume12/shervashidze11a/shervashidze11a.pdf">Shervashidze et al.&nbsp;2011</a> use this idea to define a similarity measure between graphs. Basically the idea is that graphs are most similar if many of their vertex identifiers match since this implies that the associated subgraphs match. Formally they compute a kernel using</p>
<p><img src="https://latex.codecogs.com/png.latex?k(G,G%E2%80%B2)%20=%20%5Csum_%7Bi=1%7D%5Ed%20%5Clambda_i%20%5Csum_%7Bv%20%5Cin%20V%7D%20%5Csum_%7Bv'%20%5Cin%20V'%7D%20%5Cdelta(a(v,i),%20a(v',i))"></p>
<p>Here <img src="https://latex.codecogs.com/png.latex?a(v,i)"> denotes the vertex attribute of <img src="https://latex.codecogs.com/png.latex?v"> after WL iteration <img src="https://latex.codecogs.com/png.latex?i">. Moreover, <img src="https://latex.codecogs.com/png.latex?%5Clambda_i"> are nonnegative coefficients that weigh how much the similarity at level <img src="https://latex.codecogs.com/png.latex?i"> matters. Rather than a brute-force computation of the above test for equality we can sort the vertex attribute sets. Note that vertices that have different attributes at any given iteration will never have the same attribute thereafter. This means that we can compare the two sets at all depths at a cost of at most <img src="https://latex.codecogs.com/png.latex?O(d%20%5Ccdot%20(%7CV%7C+%7CV'%7C))">.</p>
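<p>The kernel can be sketched in a few lines. This is an illustration under assumptions rather than the code from the paper: graphs are adjacency dictionaries, all vertices start with attribute 1, a shared dictionary replaces the hash, and attribute histograms replace sorting (the double sum over matching pairs equals the sum over attributes of the product of their counts). The name <code>wl_kernel</code> is made up.</p>

```python
from collections import Counter

def wl_kernel(g1, g2, weights):
    """Weighted count of matching vertex attributes over the WL iterations.

    g1, g2:  dicts mapping vertex -> list of neighbors
    weights: nonnegative coefficients, one per iteration
    """
    labels1 = {v: 1 for v in g1}
    labels2 = {v: 1 for v in g2}
    value = 0.0
    for weight in weights:
        ids = {}  # shared across both graphs so attributes are comparable
        def refine(graph, labels):
            return {
                v: ids.setdefault(
                    (labels[v], tuple(sorted(labels[u] for u in graph[v]))),
                    len(ids),
                )
                for v in graph
            }
        labels1, labels2 = refine(g1, labels1), refine(g2, labels2)
        # Histogram attributes instead of comparing all vertex pairs.
        count1, count2 = Counter(labels1.values()), Counter(labels2.values())
        value += weight * sum(c * count2[a] for a, c in count1.items())
    return value
```

<p>For a path on three vertices compared with itself and weights (1, 0.5), each iteration contributes 2·2 + 1·1 = 5 matches, so the kernel value is 7.5. Compared with a triangle, only the middle vertex matches in the first iteration and nothing matches in the second, which gives 3.</p>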
<p>A similar trick is possible if we want to regress between vertices on the same graph since we can use the set of attributes that a vertex obtains during the iterations as features. Finally, we can make our life even easier if we don’t compute kernels at all and use a linear classifier on the vertex attributes directly.</p>



 ]]></description>
  <category>Weisfeiler Leman</category>
  <category>graphs</category>
  <category>kernels</category>
  <guid>https://alex.smola.org/posts/27-weisfeiler-lehman/</guid>
  <pubDate>Sat, 15 Sep 2012 07:00:00 GMT</pubDate>
  <media:content url="https://alex.smola.org/posts/27-weisfeiler-lehman/isomorph2.png" medium="image" type="image/png" height="144" width="144"/>
</item>
<item>
  <title>In defense of keeping data private</title>
  <link>https://alex.smola.org/posts/26-private-data/</link>
  <description><![CDATA[ 





<p><img src="https://alex.smola.org/posts/26-private-data/HIPAA.jpg" class="img-fluid" alt="HIPAA logo: caduceus medical symbol beside the text 'HIPAA - Health Insurance Portability and Accountability Act' in blue"></p>
<p>This is going to be contentious. And it somewhat goes against a lot of things that researchers hold holy. And it goes against my plan of keeping philosophy out of this blog. But it must be said since remaining silent has the potential of damaging science with proposals that sound good and are bad.</p>
<p>The proposal is that certain conferences make it mandatory to publish datasets that were used for the experiments. This is a very bad idea and two things are getting confused here: scientific progress and common access. These two are not identical. Reproducibility is often confused with common access. To make these things a bit more clear, here’s an example where it’s more obvious:</p>
<p>CERN is a monster machine. There’s only one of its kind in the world. There are limited resources and it’s impossible for any arbitrary researcher to reproduce their experiments, simply because the average physicist is short of the tens of billions of dollars that it took to build it. Access to the accelerator is also limited. It requires qualification and resource planning. So, even if we think this is open, it isn’t really as open as it looks. And yes, working at CERN gives you an unfair advantage over all the researchers who don’t.</p>
<p>Likewise take medical research. Patient records are covered by HIPAA privacy constraints and there is absolutely no way for such records to be publicly released. The participants sign an entire chain of documents that tie them to not releasing such data publicly. In other words, common access is impossible. Reproducibility would require that someone, who wants to test a contentious result, needs to sign corresponding privacy documents before accessing the data. And yes, working with the ‘right’ hospitals gives you an unfair advantage over researchers who didn’t work building this relationship.</p>
<p>Lastly, user data on the internet. Users have every right for their comments, content, images, mails, etc. to be treated with the utmost respect and to be published only when it is in their interest and with their permission to do so. I believe that there is a material difference between data being made available for analytics purposes in a personalization system and data being made available ‘in the raw’ for any researcher to play with. The latter allows for individuals to inspect particular records and learn that Alice mailed Bob a love letter. Something that would make Charlie very upset if he found out. Hence common access is a non-starter.</p>
<p>There are very clear financial penalties for releasing private data (hello <a href="https://en.wikipedia.org/wiki/Cambridge_Analytica">Cambridge Analytica</a>) - users would leave the service. Moreover, it would give a competitor an advantage over the releasing party. Since the data is largely collected by private parties at their own expense, mandating its release is not realistic.</p>
<p>As for reproducibility - this is an issue. But provided that in case of a contentious result it is possible for a trusted researcher to check them, possibly after signing an NDA, this can be addressed. And yes, working for one of these companies gives you an unfair advantage.</p>
<p>In summary, while open data is desirable, I strongly disagree with a mandatory publication policy. Yes, every effort should be made personally by researchers to see whether some data is releasable. And for publicly funded research this may well be the right thing to do. But to mandate it for industry would essentially do two things. First, it will make industrial research even more secretive than it already is (and that’s a terrible thing). And secondly, it will make academic research less relevant for real problems (I’ve seen my fair share and am guilty of my fair share of such papers).</p>



 ]]></description>
  <category>data privacy</category>
  <category>social networks</category>
  <guid>https://alex.smola.org/posts/26-private-data/</guid>
  <pubDate>Wed, 15 Aug 2012 07:00:00 GMT</pubDate>
  <media:content url="https://alex.smola.org/posts/26-private-data/HIPAA.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>MLSS Purdue</title>
  <link>https://alex.smola.org/posts/25-mlss-purdue/</link>
  <description><![CDATA[ 





<p><img src="https://alex.smola.org/posts/25-mlss-purdue/mlss.jpg" class="img-fluid" alt="Group photo of roughly 100 MLSS 2011 participants posed on steps outside a Purdue University building, copyright 2011 M.D. Truong"></p>
<p>The videos from MLSS 2011 in Purdue are now available online. Unfortunately the 2011 MLSS website is gone and with it all the playlists and PDFs. The only thing left is a group photo and <a href="https://www.stat.purdue.edu/news/2011/mlss_2011.html">departmental announcement</a>. But you can still find the <a href="https://www.youtube.com/results?search_query=mlss+purdue">videos</a> courtesy of YouTube. Some of them have aged very well.</p>



 ]]></description>
  <category>MLSS</category>
  <category>Purdue</category>
  <guid>https://alex.smola.org/posts/25-mlss-purdue/</guid>
  <pubDate>Sun, 15 Jul 2012 07:00:00 GMT</pubDate>
  <media:content url="https://alex.smola.org/posts/25-mlss-purdue/mlss.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Random numbers in constant storage</title>
  <link>https://alex.smola.org/posts/24-random-numbers/</link>
  <description><![CDATA[ 





<p><img src="https://alex.smola.org/posts/24-random-numbers/BMT.svg" class="img-fluid" alt="Box-Muller transform visualisation: colored points on a uniform unit square (u1, u2) mapped to a 2D Gaussian (z0, z1), with marginal PDFs shown on the axes"></p>
<p>Many algorithms require random number generators to work. For instance, locality sensitive hashing requires one to compute the random projection matrix <img src="https://latex.codecogs.com/png.latex?P"> in order to compute the hashes <img src="https://latex.codecogs.com/png.latex?z%20=%20P%20x">. Likewise, fast eigenvalue solvers in large matrices often rely on a random matrix, e.g.&nbsp;the work by <a href="https://arxiv.org/abs/0909.4061">Halko, Martinsson and Tropp</a>, SIAM Review 2011, which assumes that at some point we multiply a matrix M by a matrix P with Gaussian random entries.</p>
<p>The problem with these methods is that if we want to perform this projection operation in many places, we need to distribute the matrix <img src="https://latex.codecogs.com/png.latex?P"> to several machines. This is undesirable since a) it introduces another stage of synchronization between machines and b) it requires space to store the matrix <img src="https://latex.codecogs.com/png.latex?P"> in the first place. The latter is often bad since memory access can be much slower than computation, depending on how the memory is being accessed. The prime example here is multiplication with a sparse matrix which would require random memory access.</p>
<p>One way to circumvent this is to share the random seed and then recompute the random matrix from scratch. But this means that we’re critically relying on the implementation of a random number generator. Even worse, we still need to store the entire matrix. What if we could simply access <em>any</em> element of the matrix at will without overhead?</p>
<p>Here’s where hashing comes to the rescue. To motivate things consider the case where the entries of <img src="https://latex.codecogs.com/png.latex?P"> are all drawn from the uniform distribution <img src="https://latex.codecogs.com/png.latex?U%5B0,1%5D">. For a hash function h with range <img src="https://latex.codecogs.com/png.latex?%5C%7B0,%20%5Cldots%20N-1%5C%7D"> simply set <img src="https://latex.codecogs.com/png.latex?P%5Bi,j%5D%20=%20h(i,j)/N">. Since good hash functions map <img src="https://latex.codecogs.com/png.latex?(i,j)"> pairs to uniformly distributed, uncorrelated numbers in the range <img src="https://latex.codecogs.com/png.latex?%5C%7B0,%20%5Cldots%20N-1%5C%7D"> this essentially amounts to uniformly distributed random numbers that can be recomputed on the fly.</p>
<p>A slightly more involved example is how to draw Gaussian random variables. We may e.g.&nbsp;resort to the <a href="https://en.wikipedia.org/wiki/Box%E2%80%93Muller_transform">Box-Muller</a> transform which shows how to convert two uniformly distributed random numbers into two Gaussians (the image on top is copied from the Wikipedia article). While being quite wasteful (we use two random numbers rather than one), we simply use two uniform hashes and then compute</p>
<p><img src="https://latex.codecogs.com/png.latex?P%5Bi,j%5D=%5Csqrt%7B-2%20%5Clog%20%5Cfrac%7Bh(i,j,1)%7D%7BN%7D%7D%20%5Ccos%5Cleft(2%20%5Cpi%20%5Cfrac%7Bh(i,j,2)%7D%7BN%7D%5Cright)"></p>
<p>Since this is known to generate Gaussian random variables from uniform random variables this will give us Gaussian distributed hashes. Similar tricks work for other random variables. It means that things like Random Kitchen Sinks, Locality Sensitive Hashing, and related projection methods never really need to store the ‘random’ projection coefficients whenever memory is at a premium or whenever it would be too costly to synchronize the random numbers.</p>
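<p>A small sketch of both constructions follows. Several choices here are assumptions for illustration only: SHA-256 truncated to 32 bits serves as the hash, a string <code>salt</code> plays the role of the shared seed, and the uniform value is shifted to (h+1)/(N+1) so that the logarithm never sees zero. The function names are made up.</p>

```python
import hashlib
import math

N = 2 ** 32  # size of the hash range {0, ..., N-1}

def h(*key, salt="demo"):
    """Map a tuple of integers to {0, ..., N-1} via SHA-256."""
    text = salt + ":" + ",".join(str(k) for k in key)
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")

def uniform_entry(i, j, part=0):
    """Pseudo-uniform value in (0, 1); the +1 shift avoids log(0) below."""
    return (h(i, j, part) + 1) / (N + 1)

def gaussian_entry(i, j):
    """Box-Muller transform applied to two hashed uniforms."""
    radius = math.sqrt(-2.0 * math.log(uniform_entry(i, j, 1)))
    angle = 2.0 * math.pi * uniform_entry(i, j, 2)
    return radius * math.cos(angle)

def project(x, rows):
    """Compute z = P x without ever materializing P."""
    return [sum(gaussian_entry(i, j) * xj for j, xj in enumerate(x))
            for i in range(rows)]
```

<p>Every machine that agrees on the salt recomputes exactly the same entry for a given (i, j), so nothing needs to be stored or synchronized. A cryptographic hash is overkill in practice; a fast non-cryptographic hash would do the same job with less computation.</p>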
<p>Update - recently someone proposed to <a href="https://resolver.caltech.edu/CaltechAUTHORS:20220714-224704502">use only zeros and ones in the initialization of a deep network</a> with the express purpose of making things reproducible. The paper has a fair amount of analysis in it but all of this can be made redundant simply by using hash functions instead. It requires at most sharing of a <a href="https://en.wikipedia.org/wiki/Salt_(cryptography)">‘salt’</a> and an agreed-upon convention for referencing weights in the network (e.g.&nbsp;weight and coordinate). Presto: a deterministic initialization that is reproducible without the need for fancy math.</p>



 ]]></description>
  <category>random numbers</category>
  <category>hashing</category>
  <guid>https://alex.smola.org/posts/24-random-numbers/</guid>
  <pubDate>Fri, 15 Jun 2012 07:00:00 GMT</pubDate>
  <media:content url="https://alex.smola.org/posts/24-random-numbers/BMT.svg" medium="image" type="image/svg+xml"/>
</item>
<item>
  <title>The Neal Kernel and Random Kitchen Sinks</title>
  <link>https://alex.smola.org/posts/22-neal-kernel/</link>
  <description><![CDATA[ 





<p><img src="https://alex.smola.org/posts/22-neal-kernel/randombinary.png" class="img-fluid" alt="Diagram showing a kernel matrix approximated as a sum of outer products of random binary feature vectors, with the sum converging to the full kernel"></p>
<p>So you read a book on <a href="https://www.amazon.com/Learning-Kernels-Regularization-Optimization-Computation/dp/0262194759">Reproducing Kernel Hilbert Spaces</a> and you’d like to try out this <a href="https://gpytorch.ai/">kernel thing</a>. But you’ve got a lot of data and most algorithms will give you an expansion that requires a number of kernel functions linear in the amount of data. Not good if you’ve got millions to billions of instances.</p>
<p>You could try out low rank expansions such as the Nyström method of <a href="https://papers.nips.cc/paper/2000/hash/19de10adbaa1b2ee13f77f679fa1483a-Abstract.html">Williams and Seeger, 2000</a>, the randomized Sparse Greedy Matrix Approximation of <a href="https://dl.acm.org/doi/10.5555/645529.657980">Smola and Schölkopf, 2000</a> (the Nyström method is a special case where we only randomize by a single term), or the very efficient positive diagonal pivoting trick of <a href="http://www.ai.mit.edu/projects/jmlr/papers/volume2/fine01a/fine01a.pdf">Fine and Scheinberg, 2001</a>. Alas, all those methods suffer from a serious problem: at training you need to multiply by the inverse of the reduced covariance matrix, which is <img src="https://latex.codecogs.com/png.latex?O(d%5E2)"> cost for a <img src="https://latex.codecogs.com/png.latex?d"> dimensional expansion. An example of an online algorithm that suffers from the same problem is this (NIPS award winning) paper of <a href="https://doi.org/10.1162/089976602317250933">Csato and Opper, 2002</a>. Assuming that we’d like to have d grow with the sample size this is not a very useful strategy. Instead, we want to find a method which has <img src="https://latex.codecogs.com/png.latex?O(d)"> cost for <img src="https://latex.codecogs.com/png.latex?d"> attributes yet shares good regularization properties that can be properly analyzed.</p>
<p>Enter Radford Neal’s <a href="https://glizen.com/radfordneal/ftp/pin.pdf">seminal paper</a> from 1994 on Gaussian Processes (a famous NIPS reject). In it he shows that a Neural Network with an infinite number of nodes and a Gaussian Prior over coefficients converges to a GP. More specifically, we get the kernel</p>
<p><img src="https://latex.codecogs.com/png.latex?k(x,x%E2%80%B2)=%5Cmathbb%7BE%7D_c%20%5B%5Cphi_c(x)%20%5Cphi_c(x%E2%80%B2)%5D"></p>
<p>Here <img src="https://latex.codecogs.com/png.latex?%5Cphi_c(x)"> is a function parametrized by <img src="https://latex.codecogs.com/png.latex?c">, e.g.&nbsp;the location of a basis function, the degree of a polynomial, or the direction of a Fourier basis function. There is also a discussion regarding RKHS in a paper by <a href="https://doi.org/10.1016/S0893-6080(98)00032-X">Smola, Schölkopf and Müller, 1998</a> that discusses this phenomenon in regularization networks. These ideas were promptly forgotten by their authors. One exception is the <a href="https://link.springer.com/chapter/10.1007/3-540-36755-1_44">empirical kernel map</a> where one uses a <a href="ftp://ftp.cs.wisc.edu/math-prog/talks/afosr.ps">generic design matrix</a> that is generated through the observations directly.</p>
<p>It was not until the paper by <a href="https://papers.nips.cc/paper/2007/hash/013a006f03dbc5392effeb8f18fda755-Abstract.html">Rahimi and Recht, 2008</a> on random kitchen sinks that this idea regained popularity. In a nutshell the algorithm works as follows: Draw <img src="https://latex.codecogs.com/png.latex?d"> values <img src="https://latex.codecogs.com/png.latex?c_i"> from the distribution over <img src="https://latex.codecogs.com/png.latex?c">. Use the corresponding basis functions in a linear model with quadratic penalty on the expansion coefficients. This method works whenever the basis functions are well bounded. For instance, for the Fourier basis the functions are bounded by <img src="https://latex.codecogs.com/png.latex?1">. The proof of convergence of the explicit function expansion to the kernel is then a simple consequence of Chernoff bounds.</p>
<p>In the random kitchen sinks paper Rahimi and Recht discuss RBF kernels and binary indicator functions. However, this works more generally for any well-behaved set of basis functions used in generating a random design matrix. A few examples:</p>
<ul>
<li>Fourier basis with Gaussian parameters. Take functions of the form <img src="https://latex.codecogs.com/png.latex?e%5E%7Bi%20%5Comega%5E%5Ctop%20x%7D"> where the coefficients <img src="https://latex.codecogs.com/png.latex?%5Comega"> are drawn from a Gaussian. This is the random kitchen sinks paper. Obviously you can use hash functions rather than an actual random number generator. This ensures that you don’t need to store all parameters <img src="https://latex.codecogs.com/png.latex?%5Comega">.</li>
<li>Pick random separating hyperplanes. This will effectively give you functions of bounded variation.</li>
<li>Use the empirical kernel map, i.e.&nbsp;pick some function <img src="https://latex.codecogs.com/png.latex?k(x,x')"> and use a random subset of the data we wish to train on for the second argument <img src="https://latex.codecogs.com/png.latex?x'">.</li>
<li>Pick suitable binary features obtained via binning (the figure on top, taken from Rahimi and Recht, 2008, shows quite impressive results in lower dimensions).</li>
</ul>
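<p>As an illustration of the first example in the list above, here is a minimal sketch of random Fourier features for the Gaussian RBF kernel. It uses the common real-valued variant with cosines and a random phase instead of complex exponentials, and a seeded random number generator instead of hashes. The function names and the bandwidth parameter <code>sigma</code> are assumptions made for this sketch.</p>

```python
import math
import random

def make_features(dim, num_features, sigma=1.0, seed=0):
    """Draw the random parameters c_i = (omega_i, b_i) once."""
    rng = random.Random(seed)
    omegas = [[rng.gauss(0.0, 1.0 / sigma) for _ in range(dim)]
              for _ in range(num_features)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    return omegas, phases

def phi(x, omegas, phases):
    """Explicit feature map whose inner product approximates the kernel."""
    scale = math.sqrt(2.0 / len(omegas))
    return [scale * math.cos(sum(o * xk for o, xk in zip(omega, x)) + b)
            for omega, b in zip(omegas, phases)]

def rbf(x, y, sigma=1.0):
    """Exact Gaussian RBF kernel for comparison."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))
```

<p>The inner product of <code>phi(x, ...)</code> and <code>phi(y, ...)</code> approaches <code>rbf(x, y)</code> as the number of features grows, with an error that shrinks roughly like one over the square root of the number of features. A linear model with a quadratic penalty on top of <code>phi</code> then costs O(d) per prediction.</p>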



 ]]></description>
  <category>kernel</category>
  <category>random features</category>
  <guid>https://alex.smola.org/posts/22-neal-kernel/</guid>
  <pubDate>Sun, 15 Apr 2012 07:00:00 GMT</pubDate>
  <media:content url="https://alex.smola.org/posts/22-neal-kernel/randombinary.png" medium="image" type="image/png" height="31" width="144"/>
</item>
<item>
  <title>Big Learning: Algorithms, Systems, and Tools for Learning at Scale</title>
  <link>https://alex.smola.org/posts/21-big-learning/</link>
  <description><![CDATA[ 





<p><img src="https://alex.smola.org/posts/21-big-learning/biglearn.png" class="img-fluid" alt="NIPS 2011 workshop banner: the Alhambra palace in Granada, Spain, with text 'Big Learning: Algorithms, Systems, and Tools for Learning at Scale'"></p>
<p>We’re organizing a workshop at NIPS 2011. Submissions are solicited for a two-day workshop on December 16-17 in Sierra Nevada, Spain.</p>
<p>This workshop will address tools, algorithms, systems, hardware, and real-world problem domains related to large-scale machine learning (“Big Learning”). The Big Learning setting has attracted intense interest with active research spanning diverse fields including machine learning, databases, parallel and distributed systems, parallel architectures, and programming languages and abstractions. This workshop will bring together experts across these diverse communities to discuss recent progress, share tools and software, identify pressing new challenges, and to exchange new ideas. Topics of interest include (but are not limited to):</p>
<ul>
<li><strong>Hardware Accelerated Learning:</strong> Practicality and performance of specialized high-performance hardware (e.g.&nbsp;GPUs, FPGAs, ASIC) for machine learning applications.</li>
<li><strong>Applications of Big Learning:</strong> Practical application case studies; insights on end-users, typical data workflow patterns, common data characteristics (stream or batch); trade-offs between labeling strategies (e.g., curated or crowd-sourced); challenges of real-world system building.</li>
<li><strong>Tools, Software, &amp; Systems:</strong> Languages and libraries for large-scale parallel or distributed learning. Preference will be given to approaches and systems that leverage cloud computing (e.g.&nbsp;Hadoop, DryadLINQ, EC2, Azure), scalable storage (e.g.&nbsp;RDBMS, NoSQL, graph databases), and/or specialized hardware (e.g.&nbsp;GPU, Multicore, FPGA, ASIC).</li>
<li><strong>Models &amp; Algorithms:</strong> Applicability of different learning techniques in different situations (e.g., simple statistics vs.&nbsp;large structured models); parallel acceleration of computationally intensive learning and inference; evaluation methodology; trade-offs between performance and engineering complexity; principled methods for dealing with large numbers of features.</li>
</ul>
<p>Submissions should be written as extended abstracts, no longer than 4 pages (excluding references) in the NIPS <img src="https://latex.codecogs.com/png.latex?%5CLaTeX"> style. Relevant work previously presented in non-machine-learning conferences is strongly encouraged. Exciting work that was recently presented is allowed, provided that the extended abstract mentions this explicitly.</p>
<p>Update - the website no longer exists but YouTube has a <a href="https://www.youtube.com/playlist?list=PL53DF7722DC0A0742">list of all the talks</a>.</p>



 ]]></description>
  <category>workshop</category>
  <category>big learning</category>
  <guid>https://alex.smola.org/posts/21-big-learning/</guid>
  <pubDate>Mon, 15 Aug 2011 07:00:00 GMT</pubDate>
  <media:content url="https://alex.smola.org/posts/21-big-learning/biglearn.png" medium="image" type="image/png" height="98" width="144"/>
</item>
<item>
  <title>Introduction to Graphical Models</title>
  <link>https://alex.smola.org/posts/20-graphical-models/</link>
  <description><![CDATA[ 





<p><img src="https://alex.smola.org/posts/20-graphical-models/rooster.png" class="img-fluid" alt="Illustration of the chicken-and-egg cycle: a rooster and an egg connected by circular arrows, representing cyclic dependencies in graphical models."></p>
<p>Here are the slides [<a href="../../talks/purdue.key">Keynote</a>, <a href="../../talks/purdue.pdf">PDF</a>] for a basic course on Graphical Models for the Internet that I’m giving at MLSS 2011 in Purdue that Vishy Vishwanathan is organizing. The selection is quite biased, limited, and subjective, but it’s meant to complement the other classes at the summer school.</p>
<p>The slides are likely to grow, so in case of doubt, check for updates. Comments are most welcome. And yes, it’s a horribly incomplete overview, due to space and time constraints.</p>



 ]]></description>
  <category>graphical models</category>
  <category>MLSS</category>
  <guid>https://alex.smola.org/posts/20-graphical-models/</guid>
  <pubDate>Wed, 10 Aug 2011 07:00:00 GMT</pubDate>
  <media:content url="https://alex.smola.org/posts/20-graphical-models/rooster.png" medium="image" type="image/png" height="112" width="144"/>
</item>
<item>
  <title>Distributed synchronization with the distributed star</title>
  <link>https://alex.smola.org/posts/19-distributed-star/</link>
  <description><![CDATA[ 





<p><img src="https://alex.smola.org/posts/19-distributed-star/hubble.jpeg" class="img-fluid" alt="Hubble Space Telescope image of a dense globular star cluster with thousands of stars concentrated at center, surrounded by a field of blue, orange, and white stars."></p>
<p>Here’s a simple synchronization paradigm between many computers that scales with the number of machines involved and which essentially keeps cost at <img src="https://latex.codecogs.com/png.latex?O(1)"> per machine. For lack of a better name I’m going to call it the distributed star since this is what the communication looks like. It’s quite similar to how memcached stores its (key,value) pairs.</p>
<p>Assume you have n computers, each of which has a copy of a large parameter vector <img src="https://latex.codecogs.com/png.latex?w"> (typically several GB), and we would like to keep these copies approximately synchronized.</p>
<p>A simple version would be to pause the computers occasionally, have them send their copies to a central node, and then return with a consensus value. Unfortunately this takes <img src="https://latex.codecogs.com/png.latex?O(%7Cw%7C%20%5Clog%20n)"> time if we aggregate things on a tree (we can reduce it by streaming data through but this makes the code a lot more tricky). Furthermore we need to stop processing while we do so. The latter may not even be possible, and any local computation is likely to benefit from having the most up-to-date parameters.</p>
<p>Instead, we use the following: assume that we can break up the parameter vector into smaller (key, value) pairs that need synchronizing. We now have each computer send its local changes for each key to a central server, update the parameters there, and later receive information about global changes. So far this algorithm looks stupid - after all, when using n machines it would require <img src="https://latex.codecogs.com/png.latex?O(%7Cw%7C%20n)"> time to process since the central server is the bottleneck. This is where the distributed star comes in. Instead of keeping all data on a single server, we use the well known distributed hashing trick and send it to a machine n from a pool P of servers:</p>
<p><img src="https://latex.codecogs.com/png.latex?n(%5Cmathrm%7Bkey%7D,P)%20=%20%5Cmathop%7B%5Cmathrm%7Bargmin%7D%7D_%7Bn%20%5Cin%20P%7D%20~%20h(%5Cmathrm%7Bkey%7D,n)"></p>
<p>Here <img src="https://latex.codecogs.com/png.latex?h"> is the hash function. Such a system spreads communication evenly and it leads to an <img src="https://latex.codecogs.com/png.latex?O(%7Cw%7Cn/%7CP%7C)"> load per machine. In particular, if we make each of the computers involved in the local computation also members of the pool, i.e.&nbsp;if we have <img src="https://latex.codecogs.com/png.latex?n=%7CP%7C"> we get an <img src="https://latex.codecogs.com/png.latex?O(%7Cw%7C)"> cost for keeping terms synchronized regardless of the number of machines involved.</p>
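<p>The assignment rule above can be sketched in a few lines of Python. This is only an illustration, not the system described in this post: the SHA-256 based <code>h</code>, the server names and the key names are assumptions made for the example. In the systems literature this argmin construction is often called rendezvous (highest random weight) hashing.</p>

```python
import hashlib
from collections import Counter


def h(key: str, server: str) -> int:
    """Hash a (key, server) pair to a 64-bit integer (illustrative choice)."""
    digest = hashlib.sha256(f"{key}|{server}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def assign(key: str, pool: list[str]) -> str:
    """Return the server n in the pool P that minimizes h(key, n)."""
    return min(pool, key=lambda server: h(key, server))


# Hypothetical pool of 8 servers and 8000 parameter keys.
pool = [f"server-{i}" for i in range(8)]
keys = [f"w_{j}" for j in range(8000)]
load = Counter(assign(k, pool) for k in keys)

# Dropping a server only moves the keys that this server used to own.
smaller_pool = pool[:-1]
moved = sum(1 for k in keys if assign(k, pool) != assign(k, smaller_pool))
```

<p>Every client computes the same owner for a key without any coordination, and <code>load</code> shows how evenly the keys spread over the pool.</p>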
<p>Obvious approximations: we assume that all machines are on the same switch. Moreover we assume that the times to open a TCP/IP connection are negligible (we keep them open after the first message) relative to the work to transmit the data.</p>
<p>The reason I’m calling this a distributed star is that for each key we have a star communication topology, it’s just that we use a different star for each key. If anyone in systems knows what this thing is really called, I’d greatly appreciate feedback. Memcached uses the same setup, alas it doesn’t have versioned writes and callbacks, so we had to build our own system using <a href="https://zeroc.com/products/ice">ICE</a>.</p>
<p>PS: thanks to the Hubble Space telescope for the image of a star distribution in a bright cluster.</p>



 ]]></description>
  <category>distributed synchronization</category>
  <category>hashing</category>
  <guid>https://alex.smola.org/posts/19-distributed-star/</guid>
  <pubDate>Fri, 05 Aug 2011 07:00:00 GMT</pubDate>
  <media:content url="https://alex.smola.org/posts/19-distributed-star/hubble.jpeg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Speeding up Latent Dirichlet Allocation</title>
  <link>https://alex.smola.org/posts/18-fast-lda/</link>
  <description><![CDATA[ 





<p><img src="https://alex.smola.org/posts/18-fast-lda/data_flow.png" class="img-fluid" alt="Pipeline diagram for distributed LDA: tokens and topics feed into a file combiner, which fans out to multiple parallel samplers, then flows through count updater, diagnostics and optimization, output to file, and finally topics."></p>
<p>The code to our LDA implementation on Hadoop is released on <a href="https://github.com/shravanmn/Yahoo_LDA">Github</a> under the Mozilla Public License. It’s seriously fast and scales very well to 1000 machines or more (don’t worry, it runs on a single machine, too). We believe that at present this is the fastest implementation you can find, in particular if you want to have a) 1000s of topics, b) a large dictionary, c) a large number of documents, and d) Gibbs sampling. It handles quite comfortably a billion documents. <a href="https://www.linkedin.com/in/shravan-matthur/">Shravan Narayanamurthy</a> deserves all the credit for the code. The paper describing an earlier version of the system appeared in <a href="http://www.vldb.org/pvldb/vldb2010/pvldb_vol3/R63.pdf">VLDB 2010</a>.</p>
<p>Some background: Latent Dirichlet Allocation by <a href="https://href.li/?http://jmlr.csail.mit.edu/papers/volume3/blei03a/blei03a.pdf">Blei, Jordan and Ng, 2003</a> is a great tool for aggregating terms beyond what simple clustering can do. While the original paper showed exciting results it wasn’t terribly scalable. A significant improvement was the collapsed sampler of <a href="https://doi.org/10.1073/pnas.0307752101">Griffiths and Steyvers, 2004</a>. The key idea was that in an exponential families model with conjugate prior you can integrate out the natural parameter, thus providing a sampler that mixed much more rapidly. It uses the following update equation to sample the topic for a word.</p>
<p><img src="https://latex.codecogs.com/png.latex?p(t%7Cd,w)%20%5Cpropto%20%5Cfrac%7Bn%5E*(t,d)%20+%20%5Calpha_t%7D%7Bn%5E*(d)%20+%20%5Csum_%7Bt'%7D%20%5Calpha_%7Bt'%7D%7D%20%5Ccdot%20%5Cfrac%7Bn%5E*(t,w)%20+%20%5Cbeta_w%7D%7Bn%5E*(t)%20+%20%5Csum_%7Bw'%7D%20%5Cbeta_%7Bw'%7D%7D"></p>
<p>Here <img src="https://latex.codecogs.com/png.latex?t"> denotes the topic, <img src="https://latex.codecogs.com/png.latex?d"> the document, <img src="https://latex.codecogs.com/png.latex?w"> the word, and <img src="https://latex.codecogs.com/png.latex?n(t,d)">, <img src="https://latex.codecogs.com/png.latex?n(d)">, <img src="https://latex.codecogs.com/png.latex?n(t,w)">, <img src="https://latex.codecogs.com/png.latex?n(t)"> denote the number of words which satisfy a particular (topic, document), (document), (topic, word), (topic) combination respectively. The starred quantities such as <img src="https://latex.codecogs.com/png.latex?n%5E*(t,d)"> simply mean that we use the counts where the current word for which we need to resample the topic is omitted.</p>
<p>Unfortunately the above formula is quite slow when it comes to drawing from a large number of topics. Worst of all, it is nonzero throughout. A rather ingenious trick was proposed by <a href="https://dl.acm.org/doi/10.1145/1557019.1557121">Yao, Mimno, and McCallum, 2009</a>. It uses the fact that the count-dependent terms are sparse and only the <img src="https://latex.codecogs.com/png.latex?%5Calpha"> and <img src="https://latex.codecogs.com/png.latex?%5Cbeta">-dependent terms are dense (and obviously the number of words per document doesn’t change, hence we can drop that, too). Dropping the common denominator <img src="https://latex.codecogs.com/png.latex?n%5E*(d)%20+%20%5Csum_%7Bt'%7D%20%5Calpha_%7Bt'%7D"> we arrive at</p>
<p><img src="https://latex.codecogs.com/png.latex?p(t%7Cd,w)%20%5Cpropto%20%5Cfrac%7B%5Calpha_t%20%5Cbeta_w%7D%7Bn%5E*(t)%20+%20%5Csum_%7Bw'%7D%20%5Cbeta_%7Bw'%7D%7D%20+%20n%5E*(t,d)%20%5Cfrac%7Bn%5E*(t,w)%20+%20%5Cbeta_w%7D%7Bn%5E*(t)%20+%20%5Csum_%7Bw'%7D%20%5Cbeta_%7Bw'%7D%7D%20+%20n%5E*(t,w)%20%5Cfrac%7B%5Calpha_t%7D%7Bn%5E*(t)%20+%20%5Csum_%7Bw'%7D%20%5Cbeta_%7Bw'%7D%7D"></p>
<p>Out of these three terms, only the first one is dense; the other two are sparse. Hence, if we knew the sum over <img src="https://latex.codecogs.com/png.latex?t"> for all three summands we could design a sampler which first samples which of the blocks is relevant and then which topic within that block. This is efficient since the first term doesn’t actually depend on <img src="https://latex.codecogs.com/png.latex?n(t,w)"> or <img src="https://latex.codecogs.com/png.latex?n(t,d)"> but rather only on <img src="https://latex.codecogs.com/png.latex?n(t)"> which can be updated efficiently after each new topic assignment. In other words, we are able to update the dense term in <img src="https://latex.codecogs.com/png.latex?O(1)"> operations after each sampling step and the remaining terms are all sparse. This trick gives a 10-50 times speedup in the sampler over a dense representation.</p>
<p>To combine several machines we have two alternatives. One is to perform one sampling pass over the data and then reconcile the samplers. This was proposed by <a href="https://www.jmlr.org/papers/volume10/newman09a/newman09a.pdf">Newman, Asuncion, Smyth, and Welling, 2009</a>. While the approach proved to be feasible, it has a number of disadvantages. First, it only exercises the network while the CPU sits idle and vice versa. Second, a deferred update makes for slower mixing. The alternative is to have each sampler communicate with a distributed central storage continuously. In a nutshell, each node sends the differential to the global statekeeper and receives from it the latest global value. The key point is that this occurs <em>asynchronously</em> and moreover that we are able to decompose the state over several machines such that the available bandwidth grows with the number of machines involved. More on such distributed schemes in a later post.</p>
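<p>A minimal sketch of the three-bucket sampler in Python may help to see the bookkeeping. It is not the released implementation: the function names, the dictionary-based sparse counts and the toy numbers are assumptions made for illustration, and for brevity it recomputes the dense bucket on every call rather than updating its sum in <code>O(1)</code>.</p>

```python
import random


def bucket_weights(alpha, beta_w, beta_sum, n_t, n_td, n_tw):
    """Split the unnormalized p(t|d,w) into its three summands.

    n_t is a dense list of topic counts; n_td and n_tw are sparse dicts
    {topic: count}. All counts are assumed to exclude the current token.
    """
    topics = range(len(n_t))
    dense = [alpha[t] * beta_w / (n_t[t] + beta_sum) for t in topics]
    doc = {t: c * (n_tw.get(t, 0) + beta_w) / (n_t[t] + beta_sum)
           for t, c in n_td.items()}
    word = {t: c * alpha[t] / (n_t[t] + beta_sum) for t, c in n_tw.items()}
    return dense, doc, word


def sample_topic(dense, doc, word, rng):
    """Pick a bucket according to its mass, then a topic inside that bucket."""
    buckets = (dict(enumerate(dense)), doc, word)
    masses = [sum(bucket.values()) for bucket in buckets]
    u = rng.random() * sum(masses)
    for bucket, mass in zip(buckets, masses):
        if u < mass:
            for t, weight in bucket.items():
                u -= weight
                if u < 0:
                    return t
            return t  # guard against floating point round-off
        u -= mass
    return len(dense) - 1  # guard against floating point round-off


# Toy example with 4 topics (all numbers are made up).
alpha = [0.1, 0.1, 0.1, 0.1]
beta_w, beta_sum = 0.01, 10.0
n_t = [50, 30, 20, 0]
n_td = {0: 3, 2: 1}   # topics present in the current document
n_tw = {0: 2, 1: 5}   # topics the current word has been assigned to
dense, doc, word = bucket_weights(alpha, beta_w, beta_sum, n_t, n_td, n_tw)
topic = sample_topic(dense, doc, word, random.Random(0))
```

<p>For every topic the three bucket entries add up to the product of the two count terms, so the sampler draws from the same distribution as the dense formula while touching mostly the nonzero counts.</p>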



 ]]></description>
  <category>LDA</category>
  <category>sampler</category>
  <guid>https://alex.smola.org/posts/18-fast-lda/</guid>
  <pubDate>Mon, 01 Aug 2011 07:00:00 GMT</pubDate>
  <media:content url="https://alex.smola.org/posts/18-fast-lda/data_flow.png" medium="image" type="image/png" height="43" width="144"/>
</item>
<item>
  <title>Bloom Filters</title>
  <link>https://alex.smola.org/posts/17-bloom-filter/</link>
  <description><![CDATA[ 





<p><img src="https://alex.smola.org/posts/17-bloom-filter/bloom.png" class="img-fluid" alt="Diagram showing a Bloom filter: an empty bit array (top), then two items x1 and x2 hashing to multiple positions setting bits to 1 (middle), and a query y1 y2 checking the same positions (bottom)."></p>
<p>Bloom filters are one of the really ingenious and simple building blocks for randomized data structures. A great summary is the paper by <a href="https://www.eecs.harvard.edu/~michaelm/postscripts/im2005b.pdf">Broder and Mitzenmacher, 2005</a>. The figure above is from their paper. In this post I will briefly review its key ideas, since it forms the basis of the Count-Min sketch of <a href="http://dimacs.rutgers.edu/~graham/pubs/papers/cm-full.pdf">Cormode and Muthukrishnan, 2003</a>. It will also be necessary for an accelerated version of the graph kernel of <a href="https://www.jmlr.org/papers/volume12/shervashidze11a/shervashidze11a.pdf">Shervashidze et al., 2011</a>, and finally, a similar structure will be needed to compute data streams over time for a real-time sketching service.</p>
<p>At its heart a Bloom filter uses a bit vector of length <img src="https://latex.codecogs.com/png.latex?N"> and a set of <img src="https://latex.codecogs.com/png.latex?k"> hash functions mapping arbitrary keys <img src="https://latex.codecogs.com/png.latex?x"> into their hash values <img src="https://latex.codecogs.com/png.latex?h_i(x)%20%5Cin%20%5C%7B0,%20%5Cldots%20N-1%5C%7D"> where <img src="https://latex.codecogs.com/png.latex?i%20%5Cin%20%5C%7B0,%20%5Cldots%20k-1%5C%7D"> denotes the hash function. The Bloom filter allows us to perform approximate set membership tests where we have no false negatives but we may have a small number of false positives.</p>
<ul>
<li><strong>Initialize(b)</strong><br>
Set all <img src="https://latex.codecogs.com/png.latex?b%5Bi%5D=0"></li>
<li><strong>Insert(b,x)</strong><br>
For all <img src="https://latex.codecogs.com/png.latex?i%20%5Cin%20%5C%7B0,%20%5Cldots%20k-1%5C%7D"> set <img src="https://latex.codecogs.com/png.latex?b%5Bh_i(x)%5D=1"></li>
<li><strong>Query(b, x)</strong><br>
Return <strong>true</strong> if <img src="https://latex.codecogs.com/png.latex?b%5Bh_i(x)%5D=1"> for all <img src="https://latex.codecogs.com/png.latex?i%20%5Cin%20%5C%7B0,%20%5Cldots%20k-1%5C%7D">. Return <strong>false</strong> otherwise.</li>
</ul>
<p>Furthermore, unions and intersections between sets are easily achieved by performing bit-wise OR and AND operations, respectively, on the Bloom filters of the corresponding sets. Since these are bit-wise operations, they can be very fast, essentially operating at memory speed.</p>
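<p>The three operations and the union can be sketched as follows. This is a minimal illustration rather than a tuned implementation: deriving the <code>k</code> hash functions by salting SHA-256 with the index and storing one byte per bit are assumptions chosen for readability.</p>

```python
import hashlib


class BloomFilter:
    """Bit vector of length n_bits probed by k salted hash functions."""

    def __init__(self, n_bits: int, k: int):
        self.n_bits = n_bits
        self.k = k
        self.bits = bytearray(n_bits)  # one byte per bit, for readability

    def _positions(self, x: str):
        # h_i(x) for i in {0, ..., k-1}: salt the key with the index i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{x}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def insert(self, x: str) -> None:
        for pos in self._positions(x):
            self.bits[pos] = 1

    def query(self, x: str) -> bool:
        return all(self.bits[pos] for pos in self._positions(x))

    def union(self, other: "BloomFilter") -> "BloomFilter":
        # Bit-wise OR; both filters must share n_bits and the hash functions.
        result = BloomFilter(self.n_bits, self.k)
        result.bits = bytearray(a | b for a, b in zip(self.bits, other.bits))
        return result


seen = BloomFilter(n_bits=1024, k=3)
seen.insert("alice")
print(seen.query("alice"))  # True: inserted keys are never reported missing
```

<p>A production version would pack the bits into machine words so that the OR and AND operations run over whole words at a time.</p>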
<p>It is clear that if we inserted <img src="https://latex.codecogs.com/png.latex?x"> into the Bloom filter the query will return true, since all relevant bits in <img src="https://latex.codecogs.com/png.latex?b"> are set to <img src="https://latex.codecogs.com/png.latex?1">. But we could get unlucky. To analyze the probability of a false positive take the probability of a bit being 1. After inserting <img src="https://latex.codecogs.com/png.latex?m"> items using <img src="https://latex.codecogs.com/png.latex?k"> hash functions over a range of <img src="https://latex.codecogs.com/png.latex?N"> we have</p>
<p><img src="https://latex.codecogs.com/png.latex?%5CPr(b%5Bi%5D=1)%20=%201-%5Cleft(1-%5Cfrac%7B1%7D%7BN%7D%5Cright)%5E%7Bkm%7D%20%5Capprox%201-%5Cexp%5Cleft(-%5Cfrac%7Bkm%7D%7BN%7D%5Cright)"></p>
<p>For a false positive to occur, all <img src="https://latex.codecogs.com/png.latex?k"> bits associated with the hash functions need to be 1. Ignoring the fact that the hash functions might collide, i.e.&nbsp;<img src="https://latex.codecogs.com/png.latex?h_i(x)%20=%20h_j(x)"> for some <img src="https://latex.codecogs.com/png.latex?i%20%5Cneq%20j">, the probability of false positives is given by</p>
<p><img src="https://latex.codecogs.com/png.latex?p%20%5Capprox%20%5Cleft(1-%5Cexp%5Cleft(-%5Cfrac%7Bkm%7D%7BN%7D%5Cright)%5Cright)%5Ek"></p>
<p>Taking derivatives with respect to <img src="https://latex.codecogs.com/png.latex?%5Cfrac%7Bkm%7D%7BN%7D"> shows that the minimum is obtained for <img src="https://latex.codecogs.com/png.latex?%5Clog%202">, that is <img src="https://latex.codecogs.com/png.latex?k%20=%5Cfrac%7BN%7D%7Bm%7D%20%5Clog%202">. In other words, the optimal <img src="https://latex.codecogs.com/png.latex?k"> decreases with the fill rate <img src="https://latex.codecogs.com/png.latex?%5Cfrac%7Bm%7D%7BN%7D"> of the Bloom filter. This makes sense since a higher fill rate leads to a higher collision probability and we shouldn’t try to ‘overfill’ the filter. Plugging the optimal value back into <img src="https://latex.codecogs.com/png.latex?p"> yields <img src="https://latex.codecogs.com/png.latex?p%20%5Capprox%202%5E%7B-k%7D">, i.e.&nbsp;the false positive rate decays exponentially in the number of bits per item <img src="https://latex.codecogs.com/png.latex?%5Cfrac%7BN%7D%7Bm%7D">, albeit slowly, as <img src="https://latex.codecogs.com/png.latex?%5Cexp(-%5Cfrac%7BN%7D%7Bm%7D%20%5Clog%5E2%202)">.</p>
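<p>To get a feel for the numbers, the approximation can be evaluated directly. The sizes below (10,000 bits for 1,000 items) are made up for the example, and the function names are illustrative.</p>

```python
import math


def false_positive_rate(n_bits: int, m_items: int, k: int) -> float:
    """Approximation p = (1 - exp(-k m / N))^k from the text."""
    return (1.0 - math.exp(-k * m_items / n_bits)) ** k


def optimal_k(n_bits: int, m_items: int) -> float:
    """Real-valued minimizer k = (N / m) log 2; round to an integer in practice."""
    return n_bits / m_items * math.log(2)


N, m = 10_000, 1_000                 # 10 bits per item
k_star = optimal_k(N, m)             # roughly 6.93
rates = {k: false_positive_rate(N, m, k) for k in range(1, 15)}
best_k = min(rates, key=rates.get)   # best integer number of hash functions
```

<p>With 10 bits per item the best integer choice is 7 hash functions, which gives a false positive rate a little under 1%.</p>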
<p>One of the really nice properties of the Bloom filter is that all memory is used to store the information about the set rather than an index structure storing the keys of the items. The downside is that it is impossible to read out <img src="https://latex.codecogs.com/png.latex?b"> without knowing the queries. Also note that it is impossible to remove items from the Bloom filter once they’ve been inserted. After all, we do not know whether some of the bits might have collided with another key, hence setting the corresponding bits to 0 would cause false negatives.</p>



 ]]></description>
  <category>Bloom filter</category>
  <category>hashing</category>
  <guid>https://alex.smola.org/posts/17-bloom-filter/</guid>
  <pubDate>Sat, 30 Jul 2011 07:00:00 GMT</pubDate>
  <media:content url="https://alex.smola.org/posts/17-bloom-filter/bloom.png" medium="image" type="image/png" height="85" width="144"/>
</item>
<item>
  <title>Real simple covariate shift correction</title>
  <link>https://alex.smola.org/posts/16-simple-shift/</link>
  <description><![CDATA[ 





<p><img src="https://alex.smola.org/posts/16-simple-shift/shift.png" class="img-fluid" alt="Scatter plot showing covariate shift: blue training samples cluster on the left, black-cross test samples appear on the right, with the learned linear function diverging from the true nonlinear curve"></p>
<p>Imagine you want to design some algorithm to detect cancer. You get data of healthy and sick people; you train your algorithm; it works fine, giving you high accuracy and you conclude that you’re ready for a successful career in medical diagnostics. Not so fast. Many things could go wrong. In particular, the distributions that you work with for training and those in the wild might differ considerably.</p>
<p>This happened to an unfortunate startup I had the opportunity to consult for many years ago. They were developing a blood test for prostate cancer, a disease that affects mainly older men. Luckily for them they’d managed to obtain a fair amount of blood samples from patients. It is considerably more difficult, though, to obtain blood samples from healthy men, mainly for ethical reasons. For instance, what do you do with positives? To compensate for that, they asked a large number of students on campus to donate blood and they performed their test. Then they asked me whether I could help them build a classifier to detect the disease. I told them that it would be very easy to distinguish between both datasets with probably near perfect accuracy. After all, the test subjects differed in age, hormone level, physical activity, diet, alcohol consumption, and many more factors unrelated to the disease. This was unlikely to be the case with real patients: Their sampling procedure had caused an extreme case of covariate shift that couldn’t be corrected by conventional means. In other words, training and test data were so different that nothing useful could be done and they had wasted significant amounts of money.</p>
<p>In general the situation is not quite so dire. Assume that we want to estimate some dependency <img src="https://latex.codecogs.com/png.latex?p(y%7Cx)"> for which we have labeled data <img src="https://latex.codecogs.com/png.latex?(x_i,y_i)">. Alas, the observations <img src="https://latex.codecogs.com/png.latex?x_i"> are drawn from some distribution <img src="https://latex.codecogs.com/png.latex?q(x)"> rather than the ‘proper’ distribution <img src="https://latex.codecogs.com/png.latex?p(x)">. If we adopt a risk minimization approach, that is, if we want to solve</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cmathop%7B%5Cmathrm%7Bminimize%7D%7D_f%20%5Cfrac%7B1%7D%7Bm%7D%20%5Csum_%7Bi=1%7D%5Em%20l(x_i,y_i,f(x_i))%20+%20%5Cfrac%7B%5Clambda%7D%7B2%7D%20%5C%7Cf%5C%7C%5E2"></p>
<p>we will need to reweight each instance by the ratio of probabilities that it would have been drawn from the correct distribution. That is, we need to reweight things by <img src="https://latex.codecogs.com/png.latex?%5Cfrac%7Bp(x_i)%7D%7Bq(x_i)%7D">. This is the ratio of how frequently an instance would have occurred in the correct set vs.&nbsp;how frequently it occurred with the sampling distribution <img src="https://latex.codecogs.com/png.latex?q">. It is sometimes also referred to as the Radon-Nikodym derivative. Such a method is called importance sampling and the following derivation shows why it is valid:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cint%20f(x)%20dp(x)%20=%20%5Cint%20f(x)%20%5Cfrac%7Bdp(x)%7D%7Bdq(x)%7D%20dq(x)"></p>
<p>Alas, we do not know <img src="https://latex.codecogs.com/png.latex?%5Cfrac%7Bdp(x)%7D%7Bdq(x)%7D">. Before we can do anything useful we need to estimate the ratio. Many methods are available, e.g.&nbsp;some rather fancy operator theoretic ones which try to recalibrate the expectation operator directly using a minimum-norm or a maximum entropy principle. However, there exists a much more pedestrian, yet quite effective approach that will give almost as good results: logistic regression.</p>
<p>After all, we know how to estimate probability ratios. This is achieved by learning a classifier to distinguish between data drawn from <img src="https://latex.codecogs.com/png.latex?p"> and data drawn from <img src="https://latex.codecogs.com/png.latex?q">. If it is impossible to distinguish between the two distributions then it means that the associated instances are equally likely to come from either one of the two distributions. On the other hand, any instances that can be well discriminated should be significantly over/underweighted accordingly. For simplicity’s sake assume that we have an equal number of instances from both distributions, denoted by <img src="https://latex.codecogs.com/png.latex?x_i%20%5Csim%20p(x)"> and <img src="https://latex.codecogs.com/png.latex?x_i'%20%5Csim%20q(x)"> respectively. Now denote by <img src="https://latex.codecogs.com/png.latex?z_i"> labels which are <img src="https://latex.codecogs.com/png.latex?1"> for data drawn from <img src="https://latex.codecogs.com/png.latex?p"> and <img src="https://latex.codecogs.com/png.latex?-1"> for data drawn from <img src="https://latex.codecogs.com/png.latex?q">. Then the probability in a mixed dataset is given by</p>
<p><img src="https://latex.codecogs.com/png.latex?p(z=1%7Cx)=%5Cfrac%7Bp(x)%7D%7Bp(x)+q(x)%7D"></p>
<p>Hence, if we use a logistic regression approach which yields <img src="https://latex.codecogs.com/png.latex?p(z=1%7Cx)=%5Cfrac%7B1%7D%7B1+%20e%5E%7B-f(x)%7D%7D">, it follows (after some simple algebra) that</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cfrac%7Bp(x)%7D%7Bq(x)%7D%20=%20%5Cfrac%7Bp(z=+1%7Cx)%7D%7Bp(z=-1%7Cx)%7D%20=%20e%5E%7Bf(x)%7D."></p>
<p>Now we only need to solve the logistic regression problem (or use some convenient AutoML tool such as <a href="http://auto.gluon.ai">AutoGluon</a>)</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cmathop%7B%5Cmathrm%7Bminimize%7D%7D_f%20%5Cfrac%7B1%7D%7B2m%7D%20%5Csum_%7B(x,z)%7D%20%5Clog%5Cleft%5B1+%5Cexp(-zf(x))%5Cright%5D+%5Cfrac%7B%5Clambda%7D%7B2%7D%20%5C%7Cf%5C%7C%5E2"></p>
<p>to obtain <img src="https://latex.codecogs.com/png.latex?f">. Subsequently we can use <img src="https://latex.codecogs.com/png.latex?e%5E%7Bf(x_i)%7D"> as covariate shift correction weights in training our actual classifier. The good news is that we can use an off-the-shelf tool to deal with a decidedly nonstandard estimation problem. Note that the weights <img src="https://latex.codecogs.com/png.latex?e%5E%7Bf(x_i)%7D"> can be quite a bit off. This is fixable, e.g.&nbsp;by using an estimator that isn’t prone to this, or by using <a href="https://www.stat.cmu.edu/~ryantibs/papers/weightedcp.pdf">conformal methods</a>. But that’s a topic for another day.</p>
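<p>Here is a small self-contained sketch of the recipe on synthetic data. Everything specific in it is an assumption made for illustration: the two unit-variance Gaussians standing in for <code>p</code> and <code>q</code>, the linear form <code>f(x) = w x + b</code>, and plain gradient descent as the solver. For this particular pair the true log-ratio is <code>x - 0.5</code>, so the fit can be checked by eye.</p>

```python
import math
import random

rng = random.Random(0)
m = 2000
xs_p = [rng.gauss(1.0, 1.0) for _ in range(m)]  # target distribution p, z = +1
xs_q = [rng.gauss(0.0, 1.0) for _ in range(m)]  # sampling distribution q, z = -1
data = [(x, 1) for x in xs_p] + [(x, -1) for x in xs_q]

# f(x) = w * x + b, fitted by gradient descent on the regularized logistic loss.
w, b, lam, step = 0.0, 0.0, 1e-3, 1.0
for _ in range(200):
    grad_w, grad_b = lam * w, lam * b
    for x, z in data:
        # derivative of log(1 + exp(-z f)) with respect to f
        g = -z / (1.0 + math.exp(z * (w * x + b)))
        grad_w += g * x / (2 * m)
        grad_b += g / (2 * m)
    w -= step * grad_w
    b -= step * grad_b


def weight(x: float) -> float:
    """Estimated ratio p(x) / q(x) = exp(f(x))."""
    return math.exp(w * x + b)


# Reweighting the q-sample should move its mean from about 0 towards 1.
total = sum(weight(x) for x in xs_q)
reweighted_mean = sum(weight(x) * x for x in xs_q) / total
```

<p>In a real problem the same weights would multiply the per-instance loss of the downstream estimator. Clipping very large weights is a common safeguard, since a few extreme values can dominate the weighted sample.</p>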
<p>PS: The figure up top (which has been copied all over the internet, often without attribution) is likely due to <a href="http://www.ms.k.u-tokyo.ac.jp/sugi/">Masashi Sugiyama</a>.</p>



 ]]></description>
  <category>covariate shift</category>
  <category>classification</category>
  <guid>https://alex.smola.org/posts/16-simple-shift/</guid>
  <pubDate>Mon, 25 Jul 2011 07:00:00 GMT</pubDate>
  <media:content url="https://alex.smola.org/posts/16-simple-shift/shift.png" medium="image" type="image/png" height="97" width="144"/>
</item>
<item>
  <title>Memory Latency, Hashing, Optimal Golomb Rulers and Feistel Networks</title>
  <link>https://alex.smola.org/posts/14-memory-latency/</link>
  <description><![CDATA[ 





<p><img src="https://alex.smola.org/posts/14-memory-latency/golomb.svg" class="img-fluid" alt="Ten isometric floor-plan diagrams of a conference room showing all distinct sub-room sizes (1, 2, 3, 5, 6, 7, 8, 9, 11) achievable with the optimal Golomb ruler [0, 2, 7, 8, 11]"></p>
<p>In many problems involving hashing we want to look up a range of elements from a vector where the elements are indicated by a hash function <img src="https://latex.codecogs.com/png.latex?h">. For instance, we might want to evaluate <img src="https://latex.codecogs.com/png.latex?v%5Bh(i,j)%5D"> for arbitrary <img src="https://latex.codecogs.com/png.latex?i"> and for a range of <img src="https://latex.codecogs.com/png.latex?j%20%5Cin%20%5C%7B0,%20%5Cldots%20n-1%5C%7D">. This happens for matrix multiplication, multiclass classification, collaborative filtering, multitask learning and many related problems.</p>
<p>While this works just fine in terms of estimation performance, traversing all values of <img src="https://latex.codecogs.com/png.latex?j"> leads to an algorithm which is horrible in terms of memory access patterns. Modern DRAM chips are much faster (over 10x) when it comes to reading values in sequence rather than when carrying out random reads. Update: memory latency hasn’t improved over the past decade. In fact, DDR5 latency can often be <em>higher</em> than DDR4 latency. Furthermore, random access destroys the benefit of a CPU cache. This leads to algorithms which are efficient in terms of their memory footprint, yet which are slow in terms of their actual runtime behavior. One way to address this is to bound the range of <img src="https://latex.codecogs.com/png.latex?h(i,j)"> for different values of <img src="https://latex.codecogs.com/png.latex?j"> via one of the following strategies.</p>
<section id="dont-hash-the-secondary-key" class="level3">
<h3 class="anchored" data-anchor-id="dont-hash-the-secondary-key">Don’t hash the secondary key</h3>
<p>Decompose <img src="https://latex.codecogs.com/png.latex?h(i,j)=h(i)+j%20%5Cmathop%7B%5Cmathrm%7Bmod%7D%7D%20N">. This is computationally very cheap and has excellent sequential access properties, but it leads to horrible collisions should there ever be two <img src="https://latex.codecogs.com/png.latex?i"> and <img src="https://latex.codecogs.com/png.latex?i'"> for which <img src="https://latex.codecogs.com/png.latex?%7Ch(i)-h(i')%7C%20%5Cleq%20n">.</p>
</section>
<section id="small-secondary-hash" class="level3">
<h3 class="anchored" data-anchor-id="small-secondary-hash">Small secondary hash</h3>
<p>Decompose <img src="https://latex.codecogs.com/png.latex?h(i,j)=h(i)+h'(j)"> where <img src="https://latex.codecogs.com/png.latex?h'(j)"> has a small range of values. This leads to less catastrophic collisions for near-collisions <img src="https://latex.codecogs.com/png.latex?%7Ch(i)%20-%20h(i')%7C%20%5Cleq%20n">. Nonetheless it is a bad idea since now we have a nontrivial probability of collision as soon as the range of <img src="https://latex.codecogs.com/png.latex?h'(j)"> is less than <img src="https://latex.codecogs.com/png.latex?n%5E2"> due to the birthday paradox. Moreover, for adjacent values <img src="https://latex.codecogs.com/png.latex?h(i)"> and <img src="https://latex.codecogs.com/png.latex?h(i')"> we will get many collisions.</p>
</section>
<section id="optimal-golomb-ruler" class="level3">
<h3 class="anchored" data-anchor-id="optimal-golomb-ruler">Optimal Golomb ruler</h3>
<p>Decompose <img src="https://latex.codecogs.com/png.latex?h(i,j)=h(i)+g(j)"> where <img src="https://latex.codecogs.com/png.latex?g(j)"> is an <a href="https://en.wikipedia.org/wiki/Golomb_ruler">Optimal Golomb ruler</a>. The latter is an increasing sequence of integers for which no pairwise distance occurs more than once. In other words, for <img src="https://latex.codecogs.com/png.latex?a%20%5Cneq%20b"> the condition <img src="https://latex.codecogs.com/png.latex?g(a)-g(b)=g(c)-g(d)"> implies that <img src="https://latex.codecogs.com/png.latex?a=c"> and <img src="https://latex.codecogs.com/png.latex?b=d">. For a more intuitive definition consider the conference room in the diagram above. Any room size can only be obtained in one manner. <a href="https://hunch.net/~jl/">John Langford</a> proposed this to address the problem. In fact, it solves the collision problem perfectly since there are a) no collisions for a fixed <img src="https://latex.codecogs.com/png.latex?i"> and b) for neighboring values <img src="https://latex.codecogs.com/png.latex?h(i)"> and <img src="https://latex.codecogs.com/png.latex?h(i')"> we will get at most one collision (due to the Golomb ruler property). Alas, this only works up to <img src="https://latex.codecogs.com/png.latex?n=28"> since finding an Optimal Golomb Ruler is hard. While it is currently unknown whether finding such rulers is actually NP hard, only <img src="https://latex.codecogs.com/png.latex?n=27"> and <img src="https://latex.codecogs.com/png.latex?n=28"> were discovered in the past decade.</p>
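<p>The "at most one collision" property is easy to check numerically. The snippet below is only a sanity check with illustrative function names; it uses the order-5 ruler from the title image and compares it with plain consecutive offsets.</p>

```python
from itertools import combinations


def is_golomb_ruler(marks) -> bool:
    """True if all pairwise differences between the marks are distinct."""
    diffs = [b - a for a, b in combinations(sorted(marks), 2)]
    return len(diffs) == len(set(diffs))


def shared_slots(h_i: int, h_i2: int, offsets) -> int:
    """Number of slots hit by both h(i) + g(j) and h(i') + g(j)."""
    first = {h_i + g for g in offsets}
    second = {h_i2 + g for g in offsets}
    return len(first.intersection(second))


ruler = [0, 2, 7, 8, 11]
worst_ruler = max(shared_slots(0, d, ruler) for d in range(1, 12))
worst_consecutive = max(shared_slots(0, d, range(5)) for d in range(1, 12))
```

<p>For the ruler, two distinct base addresses share at most one slot, whereas consecutive offsets with adjacent base addresses share all but one.</p>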
</section>
<section id="cryptographic-hash" class="level3">
<h3 class="anchored" data-anchor-id="cryptographic-hash">Cryptographic hash</h3>
<p>An alternative that works for larger <img src="https://latex.codecogs.com/png.latex?n"> and that is sufficiently simple to compute is to use cryptography. After all, all we want is that the hash function <img src="https://latex.codecogs.com/png.latex?h'(j)"> has small range and that it doesn’t have any self collisions or any systematic collisions. We can achieve this by encrypting <img src="https://latex.codecogs.com/png.latex?j"> using the key <img src="https://latex.codecogs.com/png.latex?i"> to generate an encrypted message of <img src="https://latex.codecogs.com/png.latex?N"> possible values. In other words we use <img src="https://latex.codecogs.com/png.latex?h(i,j)=%20h(i)%20+%20%5Cmathop%7B%5Cmathrm%7Bcrypt%7D%7D(j%7Ci,N)">.</p>
<p>Since it is an encryption of <img src="https://latex.codecogs.com/png.latex?j">, the mapping is invertible and we won’t have collisions for a given value of <img src="https://latex.codecogs.com/png.latex?i">. Furthermore, for different <img src="https://latex.codecogs.com/png.latex?i"> the encodings will be uncorrelated (after all, <img src="https://latex.codecogs.com/png.latex?i"> is the key). Finally, we can control the range <img src="https://latex.codecogs.com/png.latex?N%3En"> simply by choosing the encryption algorithm. In this case the random memory access is of bounded range, hence the CPU cache will not suffer from too many misses. A particularly nice algorithm is the <a href="https://en.wikipedia.org/wiki/Feistel_cipher">Feistel cipher</a> which works as follows: define the iterative map</p>
<p><img src="https://latex.codecogs.com/png.latex?f(x,y)=%20(y,x%20%5Cmathop%7B%5Cmathrm%7BXOR%7D%7D%20h(y))"></p>
<p>As always, <img src="https://latex.codecogs.com/png.latex?h"> is a hash function. After 4 iterations <img src="https://latex.codecogs.com/png.latex?(x,y)%20%5Crightarrow%20f(x,y)"> we obtain an encoding of <img src="https://latex.codecogs.com/png.latex?(x,y)"> that is cryptographically hard. Now use <img src="https://latex.codecogs.com/png.latex?x=i"> and <img src="https://latex.codecogs.com/png.latex?y=j"> to obtain the desired result. This encoding can be effective whenever computation is a lot faster than memory latency.</p>
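<p>One way to turn this into code is sketched below. It is an illustration under assumptions, not a reference implementation: here the two halves of <code>j</code> play the roles of <code>x</code> and <code>y</code>, the key <code>i</code> enters through the round function, and SHA-256 is used as the hash only because it ships with Python (a real implementation would pick something much cheaper).</p>

```python
import hashlib


def round_hash(value: int, key: int, rnd: int, bits: int) -> int:
    """Keyed round function h; it does not need to be invertible."""
    digest = hashlib.sha256(f"{key}:{rnd}:{value}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % (1 << bits)


def feistel(j: int, key: int, half_bits: int = 4, rounds: int = 4) -> int:
    """Permute j within {0, ..., 2^(2 * half_bits) - 1} using the key."""
    mask = (1 << half_bits) - 1
    x, y = j >> half_bits, j & mask
    for rnd in range(rounds):
        # One Feistel round: (x, y) -> (y, x XOR h(y))
        x, y = y, x ^ round_hash(y, key, rnd, half_bits)
    return (x << half_bits) | y


def location(h_i: int, i: int, j: int) -> int:
    """h(i, j) = h(i) + crypt(j | i, N) with N = 256 in this sketch."""
    return h_i + feistel(j, key=i)


# For a fixed key the map is a permutation of the 256 possible values.
image = sorted(feistel(j, key=42) for j in range(256))
```

<p>Each round is invertible regardless of the hash used, which is why the map is a permutation and no two values of <code>j</code> collide for the same key.</p>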
<p>PS: The title image is by <a href="https://commons.wikimedia.org/w/index.php?curid=14818399">Cmglee</a> (CC BY-SA 3.0). It depicts a conference room with 10 different configurations, based on an Optimal Golomb ruler <img src="https://latex.codecogs.com/png.latex?%5B0,%202,%207,%208,%2011%5D">.</p>


</section>

 ]]></description>
  <category>hashing</category>
  <category>feistel network</category>
  <category>latency</category>
  <guid>https://alex.smola.org/posts/14-memory-latency/</guid>
  <pubDate>Fri, 15 Jul 2011 07:00:00 GMT</pubDate>
  <media:content url="https://alex.smola.org/posts/14-memory-latency/golomb.svg" medium="image" type="image/svg+xml"/>
</item>
<item>
  <title>Collaborative Filtering considered harmful</title>
  <link>https://alex.smola.org/posts/13-harmful-cofi/</link>
  <description><![CDATA[ 





<p><img src="https://alex.smola.org/posts/13-harmful-cofi/diehard.jpg" class="img-fluid" alt="Five Die Hard films on Blu-ray: Die Hard, Die Hard 2, Die Hard with a Vengeance, Live Free or Die Hard, and A Good Day to Die Hard"></p>
<p>Much excellent work has been published on collaborative filtering, in particular in terms of recovering missing entries in a matrix. The Netflix contest has contributed a significant amount to the progress in the field.</p>
<p>Alas, reality is not quite as simple as that. Very rarely will we ever be able to query a user about arbitrary movies, books, or other objects. Instead, user ratings are typically expressed as preferences rather than absolute statements: a preference for <em>Die Hard</em>, given a generic set of movies, only tells us that the user appreciates action movies; however, a preference for <em>Die Hard</em> over <em>Terminator</em> or <em>Rocky</em> suggests that the user might favor Bruce Willis over other action heroes. In other words, the context of user choice is vital when estimating user preferences.</p>
<p>If we attempt to estimate scores <img src="https://latex.codecogs.com/png.latex?s%5B%7Bu,i%7D%5D"> of user <img src="https://latex.codecogs.com/png.latex?u"> regarding item <img src="https://latex.codecogs.com/png.latex?i"> it is important to use the context within which the ratings have been obtained. For instance, if we are given a sequence of items <img src="https://latex.codecogs.com/png.latex?(i_1,%20%5Cldots%20i_n)"> out of which item <img src="https://latex.codecogs.com/png.latex?i%5E*"> was selected we might want to consider a logistic model of the form:</p>
<p><img src="https://latex.codecogs.com/png.latex?-%5Clog%20p(i%5E*%7Ci_1,%20%5Cldots%20i_n)=%20%5Clog%20%5Cleft%7C%5Csum_i%20%5Cexp(s%5Bu,i%5D)%20%5Cright%7C%20-s%5Bu,i%5E*%5D"></p>
<p>The option of no action is easy to add, simply by adding the null score <img src="https://latex.codecogs.com/png.latex?s%5Bu,0%5D"> which captures the event of no action by a user. <a href="https://www.linkedin.com/in/eeshyang">Shuang Hong Yang</a> tried out this idea and obtained a significant performance improvement on a number of collaborative filtering datasets. Bottom line: make sure that the problem you’re solving is actually the one that a) generated the data and b) will help you in practice. That is, in many cases matrix completion is not the problem you want to solve, even though it might win you benchmarks. Obviously the above model is still a gross oversimplification and you’d be well advised to use the actual interaction order for ranking. But that’s a story for another day.</p>
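<p>To make the contextual loss concrete, here is a small Python sketch. The function name <code>choice_nll</code> and the scores are made up for illustration; the key <code>0</code> stands for the null score of taking no action.</p>

```python
import math

def choice_nll(scores, chosen):
    """-log p(chosen | offered items) = log(sum_i exp(s[i])) - s[chosen]."""
    m = max(scores.values())  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores.values()))
    return log_z - scores[chosen]

# Hypothetical scores s[u, i] of one user for the items that were on offer.
offered = {"Die Hard": 2.0, "Terminator": 1.0, "Rocky": 0.5, 0: 0.0}
print(choice_nll(offered, "Die Hard"))

# The same pick out of a different context yields a different loss.
print(choice_nll({"Die Hard": 2.0, "Notting Hill": -1.0, 0: 0.0}, "Die Hard"))
```

<p>The score of <em>Die Hard</em> is the same in both calls; only the set of alternatives changes, and with it the loss. This is the sense in which the context matters.</p>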



 ]]></description>
  <category>collaborative filtering</category>
  <category>search</category>
  <guid>https://alex.smola.org/posts/13-harmful-cofi/</guid>
  <pubDate>Sun, 10 Jul 2011 07:00:00 GMT</pubDate>
  <media:content url="https://alex.smola.org/posts/13-harmful-cofi/diehard.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Why?</title>
  <link>https://alex.smola.org/posts/12-why/</link>
  <description><![CDATA[ 





<p><img src="https://alex.smola.org/posts/12-why/why.jpeg" class="img-fluid" alt="Rage-comic style stick figure with arms outstretched and distressed expression, captioned 'WHY?!?!' in red text"></p>
<p>Some readers might wonder why I’m writing this blog. Here’s an (incomplete) list:</p>
<ul>
<li>It’s fun.</li>
<li>There are lots of fantastic blogs discussing the philosophy and big questions of machine learning (e.g.&nbsp;John Langford’s <a href="https://hunch.net">hunch.net</a>) but I couldn’t find many covering simple tricks of the trade.</li>
<li>Scientific papers sometimes obscure simple ideas. In the most extreme case, a paper will get rejected if the idea is presented in too simple terms (it happened to me more than once and the paper was praised once the simple parts had been obfuscated). Also, they need to come with ample evidence for why an idea works, strong theoretical guarantees and lots of experiments. This is all needed as a safeguard and it’s really really important. But it often hides the basic idea.</li>
<li>Some ideas are really cute and useful but not big enough to write a paper about. It’s pointless to write 10 pages if the idea can be fully covered in 1 page. We’d need a journal of 1 page ideas to deal with this.</li>
<li>Many practitioners are scared to pick up a paper with many equations but they might be willing to spend 10 minutes reading a blog post.</li>
</ul>



 ]]></description>
  <category>why</category>
  <guid>https://alex.smola.org/posts/12-why/</guid>
  <pubDate>Tue, 05 Jul 2011 07:00:00 GMT</pubDate>
  <media:content url="https://alex.smola.org/posts/12-why/why.jpeg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Hashing for Collaborative Filtering</title>
  <link>https://alex.smola.org/posts/11-hash-cofi/</link>
  <description><![CDATA[ 





<p><img src="https://alex.smola.org/posts/11-hash-cofi/hashcoffee.jpeg" class="img-fluid" alt="Hand pouring espresso over a large mound of cotton candy atop a black coffee cup on a wooden board, with flowers beside it"></p>
<p>This is a follow-up on the hashing for linear functions post. It’s based on the <a href="http://proceedings.mlr.press/v9/karatzoglou10a/karatzoglou10a.pdf">HashCoFi paper</a> that <a href="https://www.weimo.de/">Markus Weimer</a>, <a href="https://alexiskz.wordpress.com/">Alexandros Karatzoglou</a> and I wrote for AISTATS’10. It deals with the issue of running out of memory when you want to use collaborative filtering for very large problems. Here’s the setting:</p>
<p>Assume you want to do Netflix-style collaborative filtering, i.e.&nbsp;you want to estimate entries in a ratings matrix of (user, movie) pairs. A rather effective approach is to use matrix factorization, that is, to approximate <img src="https://latex.codecogs.com/png.latex?M%20=U%20V%5E%5Ctop"> where <img src="https://latex.codecogs.com/png.latex?M"> is the ratings matrix, <img src="https://latex.codecogs.com/png.latex?U"> is the (tall and skinny) matrix of features for each user, stacked up, and <img src="https://latex.codecogs.com/png.latex?V"> is the counterpart for movies. This works well for the Netflix prize since the number of users and movies is comparatively small.</p>
<p>In reality we might have, say, 100 million users for whom we want to recommend products. One option is to distribute all these users over several servers (similar to what a distributed hash table mapping does, e.g.&nbsp;for <a href="https://libmemcached.org/">libmemcached</a>). Alternatively, if we want to keep it all on one server, we’re facing the problem of having to store <img src="https://latex.codecogs.com/png.latex?10%5E8%20%5Ccdot%20100%20%5Ccdot%204=4%20%5Ccdot%2010%5E%7B10%7D"> bytes, i.e.&nbsp;40 GB, if we allocate 400 bytes per user (100 dimensions at 4 bytes each, which is a rather small footprint). Usually this is too big for all but the biggest servers (update: by now you can get super-sized X2i instances on AWS with up to <a href="https://aws.amazon.com/ec2/instance-types/x2i/">4TB capacity</a>). Even worse, suppose that we have user churn. That is, new users might be arriving while old users disappear (we don’t know whether they’ll ever come back, so we don’t really want to de-allocate the memory devoted to them). Obviously we cannot keep adding RAM. One possible solution is to store the data on disk and request it whenever a user arrives. This will cost us 5-10ms latency. An SSD will improve this dramatically but it still limits throughput. Moreover, it’ll require cache management algorithms to interact with the collaborative filtering code.</p>
<p>Here’s a simple alternative: apply the hashing trick that we used for vectors to matrices. Recall that in the exact case we compute matrix entries via</p>
<p><img src="https://latex.codecogs.com/png.latex?M%5Bi,j%5D=%20%5Csum_k%20U%5Bi,k%5DV%5Bj,k%5D"></p>
<p>Now denote by <img src="https://latex.codecogs.com/png.latex?h_u"> and <img src="https://latex.codecogs.com/png.latex?h_v"> hash functions mapping pairs of integers to a given hash range <img src="https://latex.codecogs.com/png.latex?%5C%7B0,%20%5Cldots%20N-1%5C%7D">. Moreover, let <img src="https://latex.codecogs.com/png.latex?%5Csigma_u"> and <img src="https://latex.codecogs.com/png.latex?%5Csigma_v"> be corresponding Rademacher hash functions which return a binary hash in <img src="https://latex.codecogs.com/png.latex?%5C%7B%5Cpm%201%5C%7D">. Now replace the above sum via</p>
<p><img src="https://latex.codecogs.com/png.latex?M%5Bi,j%5D=%20%5Csum_k%20%20u%5Bh_u(i,k)%5D%5Csigma_u(i,k)%20v%5Bh_v(j,k)%5D%20%5Csigma_v(j,k)"></p>
<p>What happened is that now all access into <img src="https://latex.codecogs.com/png.latex?U"> is replaced by access into a vector <img src="https://latex.codecogs.com/png.latex?u"> of length <img src="https://latex.codecogs.com/png.latex?N"> (and the same holds true for <img src="https://latex.codecogs.com/png.latex?V">). Why does this work: firstly, we can prove that if we construct <img src="https://latex.codecogs.com/png.latex?u"> and <img src="https://latex.codecogs.com/png.latex?v"> from <img src="https://latex.codecogs.com/png.latex?U"> and <img src="https://latex.codecogs.com/png.latex?V"> via</p>
<p><img src="https://latex.codecogs.com/png.latex?u%5Bk%5D=%5Csum_%7Bh_u(i,j)=k%7D%20%5Csigma_u(i,j)%20U%5Bi,j%5D%20%5Ctext%7B%20and%20%7D%20v%5Bk%5D=%20%5Csum_%7Bh_v(i,j)%20=k%7D%20%5Csigma_v(i,j)%20V%5Bi,j%5D"></p>
<p>then the approximate version of <img src="https://latex.codecogs.com/png.latex?M%5Bi,j%5D"> converges to the correct <img src="https://latex.codecogs.com/png.latex?M%5Bi,j%5D"> with variance <img src="https://latex.codecogs.com/png.latex?O(1/N)"> and moreover that the estimate is unbiased. Getting the exact expressions is a bit tedious and they’re described in the paper. In practice, things are even better than this rate: since we never use <img src="https://latex.codecogs.com/png.latex?U"> and <img src="https://latex.codecogs.com/png.latex?V"> but always <img src="https://latex.codecogs.com/png.latex?u"> and <img src="https://latex.codecogs.com/png.latex?v"> we simply optimize with respect to the compressed representation.</p>
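<p>Here is a rough Python sketch of the two formulas above. The hash range <code>N</code>, the BLAKE2b-based hashes and all function names are assumptions for illustration; the paper’s implementation differs.</p>

```python
import hashlib

N = 1 << 16  # length of the compressed vectors u and v (an assumption)

def _digest(tag, i, k):
    return hashlib.blake2b(f"{tag}:{i}:{k}".encode(), digest_size=8).digest()

def bucket(tag, i, k):
    """h(i, k): an index in {0, ..., N-1}."""
    return int.from_bytes(_digest(tag, i, k)[:4], "little") % N

def sign(tag, i, k):
    """sigma(i, k): a Rademacher hash in {-1, +1}."""
    return 1 if _digest(tag, i, k)[4] & 1 else -1

def compress(rows, tag):
    """vec[b] = sum of sigma(i, k) * rows[i][k] over all (i, k) with h(i, k) = b."""
    vec = [0.0] * N
    for i, row in rows.items():
        for k, value in enumerate(row):
            vec[bucket(tag, i, k)] += sign(tag, i, k) * value
    return vec

def predict(u, v, i, j, d):
    """M[i, j] ~ sum_k u[h_u(i,k)] sigma_u(i,k) v[h_v(j,k)] sigma_v(j,k)."""
    return sum(
        u[bucket("u", i, k)] * sign("u", i, k)
        * v[bucket("v", j, k)] * sign("v", j, k)
        for k in range(d)
    )

U = {"alice": [0.5, -1.0, 2.0]}     # hypothetical user factors
V = {"die_hard": [1.0, 0.0, 0.5]}   # hypothetical movie factors
u, v = compress(U, "u"), compress(V, "v")
# Equals the exact inner product 1.5 unless two entries share a bucket.
print(predict(u, v, "alice", "die_hard", d=3))
```

<p>Note that the user ID itself serves as the hash key, so no table mapping users to rows is ever needed.</p>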
<p>One of the advantages of the compressed representation is that we never really need to have any knowledge of all the rows of <img src="https://latex.codecogs.com/png.latex?U">. In particular, rather than mapping user IDs to rows in <img src="https://latex.codecogs.com/png.latex?U"> we simply use the user ID as the hash key. If a new user appears, memory is effectively allocated to the new user by means of the hash function. If a user disappears, his parameters will simply get overwritten if we perform stochastic gradient descent with respect to the <img src="https://latex.codecogs.com/png.latex?u"> and <img src="https://latex.codecogs.com/png.latex?v"> vectors. The same obviously holds for movies or any other entity one would like to recommend.</p>
<p>Bottom line: we can now have fast (in memory) access to user parameters regardless of the number of users. The downside is that the latency is still quite high: remember that the hash function requires us to access <img src="https://latex.codecogs.com/png.latex?u%5Bh_u(i,k)%5D"> for many different values of <img src="https://latex.codecogs.com/png.latex?k">. This means that each access in <img src="https://latex.codecogs.com/png.latex?k"> is a cache miss, i.e.&nbsp;it’ll cost us 100-200ns RAM latency rather than the 10-20ns we’d pay for burst reads. How to break this latency barrier is the topic of one of the next posts.</p>
<p>PS: The image above depicts a cup of Hash coffee being prepared. The coffee shop in Melbourne has since gone out of business. Hashing hasn’t, though.</p>



 ]]></description>
  <category>hashing</category>
  <category>collaborative filtering</category>
  <guid>https://alex.smola.org/posts/11-hash-cofi/</guid>
  <pubDate>Fri, 01 Jul 2011 07:00:00 GMT</pubDate>
  <media:content url="https://alex.smola.org/posts/11-hash-cofi/hashcoffee.jpeg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Priority Sampling</title>
  <link>https://alex.smola.org/posts/10-priority-sampling/</link>
  <description><![CDATA[ 





<p><img src="https://alex.smola.org/posts/10-priority-sampling/priority.png" class="img-fluid" alt="Bar chart showing priority sampling: items to the left of a threshold are selected, with bars indicating original weight, priority, and weight estimate per item"></p>
<p><a href="https://research.google/people/TamasSarlos/">Tamas Sarlos</a> pointed out a much smarter strategy on how to obtain a sparse representation of a (possibly dense) vector: Priority Sampling by <a href="https://nickduffield.net/download/papers/priority.pdf">Duffield, Lund and Thorup, 2006</a>. The idea is quite ingenious and (surprisingly so) essentially optimal, as <a href="https://people.cs.rutgers.edu/~szegedy/">Mario Szegedy</a> <a href="https://dl.acm.org/doi/10.1145/1132516.1132539">showed</a>. Here’s the algorithm (please read the previous blog on <a href="../8-sparse-vector/">vector sparsification</a> for some motivation):</p>
<ul>
<li>For each <img src="https://latex.codecogs.com/png.latex?x_i"> compute a priority <img src="https://latex.codecogs.com/png.latex?%5Cpi_i=%20%5Cfrac%7Bx_i%7D%7Ba_i%7D"> where <img src="https://latex.codecogs.com/png.latex?a_i%20%5Csim%20U(0,1%5D"> is drawn from a uniform distribution.</li>
<li>Let <img src="https://latex.codecogs.com/png.latex?%5Ctau"> be the <img src="https://latex.codecogs.com/png.latex?(k+1)">-th largest such priority.</li>
<li>Pick all <img src="https://latex.codecogs.com/png.latex?k"> indices <img src="https://latex.codecogs.com/png.latex?i_1,%20%5Cldots%20i_k"> which satisfy <img src="https://latex.codecogs.com/png.latex?%5Cpi_i%20%3E%20%5Ctau"> and assign them the value <img src="https://latex.codecogs.com/png.latex?s_i=%20%5Cmathop%7B%5Cmathrm%7Bmax%7D%7D(x_i,%5Ctau)">.</li>
<li>Set all other coordinates <img src="https://latex.codecogs.com/png.latex?s_i"> to <img src="https://latex.codecogs.com/png.latex?0">.</li>
</ul>
<p>This provides an estimator with the following properties:</p>
<ul>
<li>The variance is no larger than that of the <em>best</em> <img src="https://latex.codecogs.com/png.latex?(k-1)">-sparse estimator, i.e.&nbsp;priority sampling gives up at most one sample relative to the optimum.</li>
<li>The entries <img src="https://latex.codecogs.com/png.latex?s_i"> satisfy <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BE%7D%5Bs_i%5D=x_i">, i.e.&nbsp;the expectation is preserved.</li>
<li>The covariance vanishes, i.e.&nbsp;<img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BE%7D%5Bs_i%20s_j%5D=%20x_i%20x_j"> for <img src="https://latex.codecogs.com/png.latex?i%20%5Cneq%20j">.</li>
</ul>
<p>Note that we assumed that all <img src="https://latex.codecogs.com/png.latex?x_i%20%5Cgeq%200">. If not, simply apply the same algorithm to <img src="https://latex.codecogs.com/png.latex?%7Cx_i%7C"> and return signed versions of the estimate.</p>
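<p>The steps above fit into a few lines of Python. This is only a sketch: the name <code>priority_sample</code> and the <code>rng</code> argument are made up here, and the input is assumed to be nonnegative with more than <code>k</code> entries.</p>

```python
import heapq
import random

def priority_sample(x, k, rng=random):
    """Return a k-sparse estimate s of the nonnegative vector x."""
    if not 0 < k < len(x):
        raise ValueError("need 0 < k < len(x)")
    # a_i ~ U(0, 1]: random() lies in [0, 1), so 1 - random() lies in (0, 1].
    priorities = [xi / (1.0 - rng.random()) for xi in x]
    # Indices of the k + 1 largest priorities, in decreasing order.
    top = heapq.nlargest(k + 1, range(len(x)), key=priorities.__getitem__)
    tau = priorities[top[-1]]  # the (k+1)-th largest priority
    s = [0.0] * len(x)
    for i in top[:k]:
        s[i] = max(x[i], tau)  # kept coordinates are inflated to at least tau
    return s

x = [5.0, 1.0, 0.5, 3.0, 0.1, 2.0]
print(priority_sample(x, k=3))
```

<p>Averaging many independent draws of <code>priority_sample(x, k)</code> should recover <code>x</code>, since each coordinate is unbiased.</p>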



 ]]></description>
  <category>sampling</category>
  <category>sparsity</category>
  <guid>https://alex.smola.org/posts/10-priority-sampling/</guid>
  <pubDate>Thu, 30 Jun 2011 07:00:00 GMT</pubDate>
  <media:content url="https://alex.smola.org/posts/10-priority-sampling/priority.png" medium="image" type="image/png" height="101" width="144"/>
</item>
<item>
  <title>Random elements from a stream</title>
  <link>https://alex.smola.org/posts/9-random-elements/</link>
  <description><![CDATA[ 





<p><img src="https://alex.smola.org/posts/9-random-elements/bear.jpg" class="img-fluid" alt="Brown bear standing at the top of a waterfall, catching a salmon in its mouth while another salmon leaps nearby"></p>
<p>This is a classic trick when dealing with data streams. It shows how to draw a random element from a sequence of instances without knowing beforehand how long the sequence is and which symbols occur. Let us first assume that we know the identities of all symbols. In this case finding a random symbol would be easy. All we require is that for each symbol <img src="https://latex.codecogs.com/png.latex?s"> we draw a random variable <img src="https://latex.codecogs.com/png.latex?%5Cxi_s%20%5Csim%20U%5B0,1%5D"> from the uniform distribution and subsequently we choose the symbol</p>
<p><img src="https://latex.codecogs.com/png.latex?s%5E*=%5Cmathop%7B%5Cmathrm%7Bargmin%7D%7D_s%20~%20%5Cxi_s."></p>
<p>Since each <img src="https://latex.codecogs.com/png.latex?s"> has equal probability of being associated with the smallest value <img src="https://latex.codecogs.com/png.latex?%5Cxi_s"> it follows that the draw is uniformly random. The trouble with this is that we now need to store one floating point number per key <img src="https://latex.codecogs.com/png.latex?s"> and a method to look up its value, e.g.&nbsp;via a dictionary. This is just as tedious as recording the entire stream.</p>
<p>Here’s a better way. Assume that instead of requesting a random variable <img src="https://latex.codecogs.com/png.latex?%5Cxi_s"> we simply compute the hash <img src="https://latex.codecogs.com/png.latex?h(s)"> of <img src="https://latex.codecogs.com/png.latex?s"> and we set</p>
<p><img src="https://latex.codecogs.com/png.latex?s%5E*=%5Cmathop%7B%5Cmathrm%7Bargmin%7D%7D_s%20~%20h(s)."></p>
<p>For a draw from the space of hash functions this again is uniform. The advantage is that we essentially determined all the random bits when selecting <img src="https://latex.codecogs.com/png.latex?h"> rather than at the time when we want to compute its value <img src="https://latex.codecogs.com/png.latex?h(s)">. The second advantage is that we can now simply keep track of the smallest value of <img src="https://latex.codecogs.com/png.latex?h(s)"> seen so far and update it as we go along. We have the following algorithm:</p>
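<pre><code>def sample_with_count(stream, h=hash):
    """Return a random symbol of the stream and how often it occurs."""
    hstar = None  # smallest hash value seen so far
    sstar = None  # symbol attaining it
    n = 0         # number of occurrences of sstar
    for s in stream:
        hs = h(s)
        if hs == hstar:  # sstar again (assuming h has no collisions)
            n += 1
        elif hstar is None or hs &lt; hstar:
            n = 1
            hstar = hs
            sstar = s
    return sstar, n</code></pre>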
<p>This algorithm will provide item counts for a random element of the sequence. If you want more than one sample, simply keep a list of the symbols with the <img src="https://latex.codecogs.com/png.latex?k"> smallest hash values and their associated counts. Such algorithms can be used to compute the variance or other moments of a sequence.</p>
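<p>A sketch of the variant with more than one sample, in Python. The BLAKE2b-based hash and the name <code>sample_k_with_counts</code> are assumptions for illustration, and hash collisions between distinct symbols are ignored.</p>

```python
import hashlib

def h(s):
    """Illustrative hash: 64 bits of BLAKE2b over the symbol's string form."""
    digest = hashlib.blake2b(str(s).encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def sample_k_with_counts(stream, k):
    """Return {symbol: count} for the k symbols with the smallest hash values."""
    kept = {}  # hash value -> [symbol, count]; never holds more than k entries
    for s in stream:
        hs = h(s)
        if hs in kept:
            kept[hs][1] += 1
        elif len(kept) < k:
            kept[hs] = [s, 1]
        else:
            worst = max(kept)  # largest hash value currently kept
            if hs < worst:
                del kept[worst]
                kept[hs] = [s, 1]
    return {sym: cnt for sym, cnt in kept.values()}

print(sample_k_with_counts("abracadabra", k=2))
```

<p>The counts are exact because the admission threshold only ever decreases: a symbol that was rejected or evicted once can never re-enter later. For large k a heap would avoid the linear scan in <code>max(kept)</code>.</p>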
<p>PS: The picture is from the <a href="https://www.pentictonlakesideresort.com/">Penticton Lakeside Resort</a> where you can probably see such bears (no, I never visited).</p>



 ]]></description>
  <category>sampling</category>
  <category>stream</category>
  <guid>https://alex.smola.org/posts/9-random-elements/</guid>
  <pubDate>Sat, 25 Jun 2011 07:00:00 GMT</pubDate>
  <media:content url="https://alex.smola.org/posts/9-random-elements/bear.jpg" medium="image" type="image/jpeg"/>
</item>
</channel>
</rss>
