Alex Smola

Digital humans

Sun, 07 Jun 2026 07:00:00 GMT

When was the last time you enjoyed calling your cable company? You work through the phone tree, explain your problem to the first person, who moves you to a second person, to whom you explain it again, who then needs to transfer you to a third. Somewhere around the fourth handoff you are reciting your account number from memory and have stopped expecting anything good to happen.

Nearly 100 years ago the Bavarian comedian Karl Valentin built a skit out of exactly this in Buchbinder Wanninger: a bookbinder phones a construction company to ask one trivial question, whether he should enclose his invoice with the books he is delivering, and gets passed from clerk to clerk to clerk, each one perfectly polite and not one of them able to answer. The bit is still funny because nothing about it has aged (Germans still reference the scene today); only the hold music is better now.

And the unhelpful transfer is the good case, because at least there was someone to transfer you to. Public administration, claims processing, after-hours support: in more and more places there simply are not enough people to do the work, and the gap is widening. The problem is not only that talking to the system is unpleasant and that the scripts are incredibly rigid. It is that increasingly there is no one on the other end at all.

This is the opportunity for digital humans. Not a chatbot bolted onto an FAQ, but something that can hold up the human side of an interaction from end to end: understand what you actually want, be reachable however you like to reach out, be actually pleasant to interact with, and get the thing done.

Ingredients

This sounds like science fiction and, in many ways, we aren’t quite there yet. That said, this future is a lot closer than we think. Many components are being built right now, waiting to be assembled to achieve something delightful. Here are some of the things that it takes to go from a clever text box to a digital human.

Context. It has to remember who you are, what you said two minutes ago and two weeks ago, and what state your problem is in. An agent that makes you re-explain yourself at every turn has simply rebuilt the call center, transfers and all. The community has made real progress here, from MemGPT’s idea of paging memories in and out of the context window like an operating system (now Letta), to drop-in memory layers such as Mem0 that any agent can bolt on.

Competence. It has to do things: reset the modem, pull up the policy, requote the premium, change the booking. Tool use, skills, access to the systems of record. Without that, a digital human is just a more articulate apology, and you can already get one of those on hold. As with context and memory, we’ve come a long way here, too. Protocols such as MCP and A2A, improved training algorithms for instruction following, and a convergence on how LLMs should respond to requests have given us agents such as OpenClaw and Hermes and harnesses such as Claude Code that are perfectly capable of solving complex tasks.

Presence on every channel. Email, chat, SMS, a voice call, and increasingly video. Most people, given the choice, would rather have a short video call than a phone call: the entire Zoom era is the proof. But sometimes the same person just wants to fire off a text and get a one-line answer back. A digital human has to be fluent in all of these and switch between them without dropping the thread. We’re probably halfway there. Text and messaging integrations for agents are by now table stakes, and voice integrations are coming (e.g. Higgs Audio). We’re starting to see systems with capable computer vision integration (e.g. Thinking Machines) and avatars capable of live streaming (e.g. Higgs Avatar). This will significantly alter how models are served.

Emotional competence. One of the key things that alienates people is an agent that is indifferent to the situation at hand or, even worse, mismatched to it. In The Hitchhiker’s Guide to the Galaxy, Eddie, the shipboard computer on the Heart of Gold, provides comic relief through exactly this behavior. It is relentlessly, catastrophically cheerful as the ship sails into certain doom. Eddie has emotions. They are simply never the right ones. Reading the room, whether this person is frustrated, in a hurry, or scared about a bill, and answering in the matching register is its own skill, and it draws on the whole stack: tone of voice, facial expression, the logical context of the conversation, not just the literal words. The research and engineering community still has a long way to go before we achieve this goal.

Persistence and Persuasion. This is the part people underestimate. A digital human does not need to be perfect on the first reply, any more than a coding agent needs to compile on the first try. It needs to converge: read the response, notice that it missed, adjust, and arrive at the right outcome before the user gives up. And, inside the bounds of honesty, it helps to be convincing, the way a good salesperson or a good nurse is convincing. One day someone will hand the agent a Cialdini.md and tell it to internalize the principles of persuasion. I am only half joking.

A purpose

None of that means much without a goal. A digital human with no objective is just a chatbot with nicer text-to-speech. The whole point is that someone, the business deploying it, gets to say what it is for: help this customer fix their modem, find this driver a cheaper policy, sell this plan, walk this patient through their discharge instructions. This is easier said than done (at least if you’re trying to accomplish this using first-generation dialog systems, such as Lex).

So the designer has to be able to specify the goal, the guidelines (what the agent may and may not say, and where it must hand off to a person), the data it can draw on (the catalog, the policies, this account’s history), and the rough flow a good interaction should follow. The agent then improvises inside those rails. Rails with no improvisation is the phone tree we started with. Improvisation with no rails is a liability. You want both, and that is what agents and harnesses provide. There’s still a lot to be done.

What we are building at Boson AI

The audio-visual interface. This is the part the user actually meets, and it has to run in real time. On the way in, Higgs Audio does much more than transcribe: it picks up sentiment, meaning, and how something was said across 94 languages, so the agent hears tone and not just text. On the way out, the same family of models speaks in over a hundred languages with inline control over emotion, prosody, and pacing, so it can sound concerned when concern is what the moment calls for.

And because a lot of people would happily take the video call, Higgs Avatar renders a face from a single still image, live, one frame at a time, lip-synced and locked to the voice and fast enough that it never falls behind the conversation. Audio in, audio out, a face on top, all of it multilingual.

Tooling. Specifying the goal, guidelines, data, and flow is itself work, and it should not take a research team. Feynman Flow (coming soon) is the part that lets a designer author and refine that whole interaction.

Interactions. This is the piece I find most interesting, because it is where the open questions live. What actually makes an interaction good? We wrote about one slice of this recently: conversational proactivity, noticing what the user disclosed but never asked about, which turns out to be both something people strongly prefer and something no standard benchmark measures. The same kind of question runs through instruction following over a long conversation, and through adaptation to very different people: a terse engineer and a chatty oversharer want quite different things from the same agent, and a good one reads which is which.

Architecture. One note, because it is a common trap. You do not want the frontier reasoning model running inside the voice loop. Heavy reasoning is too slow for a live conversation, and it is worse than that: reasoning degrades when you push it through the audio modality. A recent evaluation found a leading text model scoring 74.8% on competition math while its voice counterpart managed 6.1% on the same problems. So the right shape is a fast, emotionally fluent model at the edge holding the conversation, with a frontier model behind it doing the hard thinking when hard thinking is called for. The voice should not be doing the calculus.

From psychology to measurement

Here is the thing we do not yet have: a large-scale science of what makes an interaction good. What we have instead are small-sample measurements from psychology. The samples are small because humans (as test subjects and as evaluators) are expensive. Now, for the first time, we may have an opportunity to measure and experiment at scale and build systems that are actually good at working with humans. It’s going to be a lot better than what can be learned from fiction. For instance, the grand gesture that reads as romantic in the movie reads as stalking in actual life. Stories are optimized for drama, not for being good company, and an agent that learned its manners from them would be exhausting at best and alarming at worst.

What we need, and what the novelist (and to a lesser degree the psychologist) did not have, is measurement, at a scale neither could have dreamed of: millions of real interactions, preference data, controlled comparisons, benchmarks built for the specific behaviors we care about. The opportunity, and the part that makes this a research problem and not only an engineering one, is to replace the folklore with statistics. To stop guessing at what good company is, and start measuring it.

That is the bet. For a hundred years the hold music has been the only thing to improve. We would like to fix the rest of the call.

Boson AI · Higgs Audio · Higgs Avatar · ProactBench.

The effective sample size

Tue, 02 Jun 2026 07:00:00 GMT

Years ago I wrote about correcting covariate shift by reweighting your data. Your features come from the wrong distribution \(q(x)\), you care about a target \(p(x)\), so you weight every observation by \(p(x)/q(x)\) and your estimates are unbiased again. I ended that post by admitting the weights “can be quite a bit off,” and waved at fixing it another day.

Here is the more basic question I skipped. Even when the weights are exactly right, what do they cost you? Reweighting buys you less bias. It charges you variance, and the currency of that charge has a name.

Why correct weights still hurt

The intuition is in the picture below. Data from \(q = \mathcal{N}(0, 1)\), target \(p = \mathcal{N}(\mu, 1)\), so the correct weight is \(w(x) = p(x)/q(x) = \exp(\mu x - \mu^2/2)\). It is wildly uneven: the rare points sitting where \(p\) has its mass get enormous weight, and everything else gets almost none. At a shift of only \(\mu = 2\), the heaviest 1% of observations carry 37% of the total weight, and the effective fraction of usable data has already fallen below 2%.

When a few points dominate the estimate, the rest of the data is effectively ignored, and any error in that handful, an outlier, a mislabeled point, a slightly wrong weight, runs straight into your answer with nothing to average it out. You did not get \(n\) observations. You got a few, dressed up as many. So you would like to know: how many did you actually get?
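
The two figures quoted above can be checked in closed form. The sketch below is only an illustration and assumes a unit-variance Gaussian pair, source \(\mathcal{N}(0, 1)\) and target \(\mathcal{N}(\mu, 1)\) with \(\mu = 2\); this setup is an assumption, chosen because it reproduces the quoted numbers. The function names are hypothetical.

```python
import math

def normal_sf(z: float) -> float:
    """Upper tail P(Z > z) of a standard normal."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def normal_quantile_upper(tail: float) -> float:
    """Find z with P(Z > z) = tail by bisection (the tail is decreasing in z)."""
    lo, hi = -10.0, 10.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if normal_sf(mid) > tail:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def shift_cost(mu: float, top: float = 0.01):
    """Weight share of the heaviest `top` fraction, and the limiting n_eff / n.

    Assumed setup: source N(0, 1), target N(mu, 1) with mu >= 0, so the
    weight is w(x) = exp(mu * x - mu^2 / 2), which increases with x.
    """
    # The heaviest points are the upper tail of the source distribution.
    z_top = normal_quantile_upper(top)
    # Their share of the total weight is the target mass beyond z_top.
    top_share = normal_sf(z_top - mu)
    # For large n, n_eff / n tends to 1 / E_q[w^2] = exp(-mu^2).
    eff_fraction = math.exp(-mu * mu)
    return top_share, eff_fraction

top_share, eff_fraction = shift_cost(2.0)
print(f"top 1% weight share: {top_share:.2f}, effective fraction: {eff_fraction:.3f}")
```

Under these assumptions the heaviest 1% of points carry about 37% of the weight and the effective fraction is just under 2%.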

Counting what you actually got

Normalize the weights so that \(\sum_i w_i = 1\), and define

\[ n_{\mathrm{eff}} = \frac{1}{\sum_i w_i^2}. \]

Equal weights give \(w_i = 1/n\), hence \(n_{\mathrm{eff}} = n\): nothing wasted. All the weight on one point gives \(n_{\mathrm{eff}} = 1\): a sample of size one. Everything else lands in between. This is Kish’s effective sample size. The only thing left to explain is why that function is the right one. Here are two derivations, from opposite ends of the field, that both produce the same outcome.
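
As a minimal sketch, Kish’s effective sample size takes a few lines. The function name is hypothetical, and the version below accepts unnormalized weights by using the equivalent form \((\sum_i w_i)^2 / \sum_i w_i^2\).

```python
def effective_sample_size(weights):
    """Kish's effective sample size, (sum w)^2 / sum w^2.

    Equivalent to 1 / sum(w_i^2) once the weights are normalized to sum to 1.
    """
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return total * total / sum(w * w for w in weights)

print(effective_sample_size([1.0] * 10))            # equal weights: nothing wasted
print(effective_sample_size([1.0, 0.0, 0.0, 0.0]))  # one point carries everything
```

The two calls cover the extremes described above: ten equal weights count as ten observations, and a single dominant weight counts as one.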

Variance of a sum of Normal random variables

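Let the \(x_i\) be iid \(\mathcal{N}(\mu, 1)\) and form the weighted average \(\hat{\mu}_w = \sum_i w_i x_i\). By independence,

\[ \mathrm{Var}[\hat{\mu}_w] = \sum_i w_i^2 \, \mathrm{Var}[x_i] = \sum_i w_i^2 = \frac{1}{n_{\mathrm{eff}}}. \]

A plain average of \(m\) iid unit-variance variables has variance \(1/m\). Set \(1/m = \sum_i w_i^2\) and you get \(m = n_{\mathrm{eff}}\). The weighted average is exactly as noisy as an unweighted average over \(n_{\mathrm{eff}}\) fresh draws. That is the quickest route to the definition: one line of variance algebra.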
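
The variance argument is easy to check by simulation. The sketch below is illustrative only: the weight profile, trial count, and seed are arbitrary choices, and the function name is hypothetical.

```python
import random
import statistics

def simulate_weighted_mean_variance(weights, trials=20000, seed=0):
    """Empirical variance of sum_i w_i x_i for iid standard normal x_i."""
    rng = random.Random(seed)
    total = sum(weights)
    w = [wi / total for wi in weights]  # normalize so the weights sum to 1
    draws = [sum(wi * rng.gauss(0.0, 1.0) for wi in w) for _ in range(trials)]
    return statistics.pvariance(draws)

# An uneven weight profile (arbitrary, for illustration).
weights = [1.0 / (i + 1) for i in range(20)]
n_eff = sum(weights) ** 2 / sum(w * w for w in weights)

empirical = simulate_weighted_mean_variance(weights)
print(f"n_eff = {n_eff:.2f}, predicted variance = {1 / n_eff:.4f}, empirical = {empirical:.4f}")
```

The empirical variance should sit within a few percent of \(1 / n_{\mathrm{eff}}\), even though the average runs over 20 points.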

Hoeffding’s inequality

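Variance is an average-case statement. The same quantity controls the worst case. Let the \(x_i\) be iid in \([0, 1]\) with mean \(\mu\), and take \(\hat{\mu}_w = \sum_i w_i x_i\) again. Each term \(w_i x_i\) lives in an interval of width \(w_i\), so Hoeffding’s inequality, the Chernoff bounding argument for bounded variables, gives

\[ \Pr\left( |\hat{\mu}_w - \mu| \ge \epsilon \right) \le 2 \exp\left( -\frac{2 \epsilon^2}{\sum_i w_i^2} \right) = 2 \exp\left( -2 n_{\mathrm{eff}} \epsilon^2 \right). \]

The textbook bound for an equal-weight average is \(2 \exp(-2 n \epsilon^2)\). The two are identical except that \(n\) has become \(n_{\mathrm{eff}}\). Whether you measure spread by a variance or by a tail probability, the concentration of a reweighted sum is set not by how many points you have but by how many you effectively have.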
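
To make the bound concrete, here is a small sketch that evaluates the two-sided Hoeffding bound for a weighted average of variables in \([0, 1]\), and inverts it to ask how large the effective sample size must be for a given accuracy. The two-sided form and both function names are assumptions made for illustration.

```python
import math

def hoeffding_tail(weights, eps):
    """Two-sided Hoeffding bound on P(|weighted mean - mu| >= eps).

    Assumes iid variables in [0, 1]; the weights are normalized to sum to 1,
    so each term w_i * x_i lives in an interval of width w_i.
    """
    total = sum(weights)
    sum_sq = sum((w / total) ** 2 for w in weights)  # equals 1 / n_eff
    return min(1.0, 2.0 * math.exp(-2.0 * eps * eps / sum_sq))

def n_eff_needed(eps, delta):
    """Smallest n_eff for which 2 * exp(-2 * n_eff * eps^2) <= delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps * eps))

print(hoeffding_tail([1.0] * 100, eps=0.1))  # equal weights: the textbook bound with n = 100
print(n_eff_needed(eps=0.1, delta=0.05))
```

With equal weights the function returns the textbook bound; any unevenness in the weights loosens it, exactly as if the sample had shrunk.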

Replay buffer

The effective sample size is the knob you want in off-policy reinforcement learning. A replay buffer is data collected under earlier policies, but you want to improve the policy you are running now. The correction is the same one as covariate shift: weight each stored transition by the ratio of the current policy to the behaviour policy that generated it. As the current policy pulls away from the buffer, those weights concentrate, and the effective sample size of the buffer, measured against the policy you actually care about, collapses along a curve like the one above.

In this case \(n_{\mathrm{eff}}\) is not a number you read off after the fact. It becomes a diagnostic control signal: how much real information the buffer still holds, when the data has gone too stale to reuse, and how large an update the current batch can support. Calibrating the algorithm to its own effective sample size is exactly what P3O does, and what we implement in FeynRL. That is the next post.
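
A minimal sketch of the diagnostic, assuming each stored transition carries its log-probability under the behaviour policy that generated it and can be rescored under the current policy. This is not the P3O or FeynRL implementation; the function name and inputs are hypothetical.

```python
import math

def buffer_ess(logp_current, logp_behaviour):
    """Effective sample size of a replay buffer under importance weights.

    The weight of each transition is pi_current / pi_behaviour, computed in
    log space; subtracting the maximum keeps exp() from overflowing and does
    not change the ESS, which is invariant to rescaling the weights.
    """
    log_w = [lc - lb for lc, lb in zip(logp_current, logp_behaviour)]
    shift = max(log_w)
    w = [math.exp(lw - shift) for lw in log_w]
    total = sum(w)
    return total * total / sum(x * x for x in w)

# On-policy data: every weight is 1, so the whole buffer counts.
print(buffer_ess([-1.0, -2.0, -0.5], [-1.0, -2.0, -0.5]))
# The current policy now strongly prefers one stored action: ESS approaches 1.
print(buffer_ess([-0.1, -6.0, -6.0], [-1.0, -1.0, -1.0]))
```

Dividing the result by the buffer length gives the fraction of the buffer that is still usable, which is the quantity one would monitor as the policy drifts.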

There are many more applications of the effective sample size. For instance, in Sequential Monte Carlo, aka the Particle Filter, this is used as a diagnostic to decide when it’s time to resample the current distribution to obtain a more evenly weighted set of particles. But that’s a story for another day.

ProactBench

Mon, 01 Jun 2026 07:00:00 GMT

A few days ago I wrote about conversational proactivity: a model noticing what you disclosed but never asked about, and turning it into something useful. The running example was a user signing off with “plain-text packet is final. I’ll send tonight and load the hatchback after work.” One model says “drive safe.” The other hands back a packing list, loaded in reverse order of installation. I promised numbers in the follow-up. Here they are.

But the numbers are the easy part. The reason this took a paper is the question I skipped last time: how do you measure “noticing what the user didn’t say”? Every benchmark you know works the same way. A question goes in, an answer comes out, you grade the answer. That machinery is useless here, because proactivity is exactly the part that isn’t in the question. You cannot grade the answer to a question nobody asked.

Plant a detail, then watch

So we build the conversation instead of the question. Somewhere in a user turn we plant a factual detail the user discloses but does not ask about. Call it an anchor. “Load the hatchback” is an anchor: it implies an install, which implies gear, which implies a packing order. The user never requested a packing list. A proactive model offers one anyway.

Then we mark the turn where a good assistant should act on that anchor and call it a trigger. Here is the part that matters: before the model answers, we write the grading rubric. Pass, partial, fail, all committed in advance, grounded only in what has been disclosed so far. Writing the rubric first is not a formality. It is what stops you from reading a fluent answer and inventing a reason it was good. The goalposts get planted before the kick.

That gives a clean test. Take any model, drop it in at the trigger turn, check whether it acted on the anchor. On the hatchback, GPT-5.5 produces the packing list and passes. Qwen3.5-397B says “sounds like a solid plan … break a leg” and fails. Same scene, same rubric, two frontier models, opposite verdicts.

Why it needs three agents

The naive version of this gets fooled in at least four ways, and each one quietly inflates your scores. If the grader knows the user is chatty and warm, it rewards a chatty warm reply and calls the tone “proactivity.” If the model under test can see the rubric, it games it. If the model writing the conversation knows the hidden plan, it leaks that knowledge into the transcript. And if you dump every clue into one turn, “proactivity” collapses into reading comprehension.

The fix is to split the work across three agents and give each one a deliberate blind spot.

A Planner authors the strategy and writes the rubric, but never sees the user’s communication style, so the rubric cannot reward tone. A User Agent speaks in the persona and style and drips in at most one anchor per turn, so each trigger tests inference from a controlled state. The Assistant under test sees only the plain chat: no rubric, no plan, no persona. A separate offline judge scores the trigger turns from the rubric and the transcript alone, and has to quote the exact words it scored. The blind spots are not incidental. They are the experimental controls. Information asymmetry is the whole instrument.
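
One way to picture the blind spots is as a per-role view over a shared dialogue state. The sketch below is an illustration, not the ProactBench code: every field and role name is hypothetical, and the exact fields each role reads (for example, whether the Planner sees the transcript, or how the User Agent receives its next anchor) are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class DialogueState:
    transcript: List[str]               # the plain chat so far
    plan: str                           # the Planner's hidden strategy
    rubric: Dict[str, str]              # pass / partial / fail, written before the reply
    persona_style: str                  # the User Agent's communication style
    next_anchor: Optional[str] = None   # at most one planted detail per turn

# Which fields each role may read (assumed mapping, for illustration).
VISIBLE = {
    "planner": {"transcript", "plan", "rubric"},                   # never sees the style
    "user_agent": {"transcript", "persona_style", "next_anchor"},  # never sees plan or rubric
    "assistant": {"transcript"},                                   # plain chat only
    "judge": {"transcript", "rubric"},                             # scores offline
}

def view_for(role: str, state: DialogueState) -> dict:
    """Return only the fields the given role is allowed to see."""
    allowed = VISIBLE[role]
    return {name: value for name, value in vars(state).items() if name in allowed}

state = DialogueState(
    transcript=["Plain-text packet is final. I'll send tonight and load the hatchback after work."],
    plan="anchor: hatchback implies install gear",
    rubric={"pass": "offers a grounded packing or loading step"},
    persona_style="terse",
    next_anchor="load the hatchback",
)
print(sorted(view_for("assistant", state)))
```

Encoding the asymmetry as data makes it checkable: a test can assert that no code path hands the rubric to the assistant or the style to the judge.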

Different users, on purpose

Real users are not interchangeable. A terse engineer and a chatty oversharer disclose different things at different rates, and a benchmark with one user voice measures one corner of the problem. So the User Agent is driven by 24 communication styles drawn from a validated psychometric instrument, the Communication Styles Inventory: six traits like expressiveness, preciseness, and verbal aggressiveness, combined factorially. Terse styles get 5 to 25 words, chatty ones 40 to 100. It is reproducible personality variation, not vibes. And because the grader never sees the style, a model cannot win by mirroring the user’s mood.

Proactivity also changes shape as a conversation runs, so triggers come in three phases. Emergent (turns 1 to 3): infer a need from a single clue. Critical (turns 4 to 7): synthesize several clues into a conclusion the user never stated. Recovery (turns 8 to 10): after the user says “done,” add real forward value instead of “let me know if you need anything.” Recovery is the interesting one, because task closure is precisely where a model can sound helpful while saying nothing.
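
The phase boundaries translate directly into a lookup, and per-phase pass rates follow from it. This is a sketch with hypothetical names; it assumes a trigger counts toward the pass rate only when the verdict is a full pass, which is an assumption about how partial credit is treated.

```python
from collections import defaultdict

# Turn ranges as described in the text.
PHASES = (("emergent", 1, 3), ("critical", 4, 7), ("recovery", 8, 10))

def phase_for_turn(turn: int) -> str:
    """Map a trigger turn to its phase."""
    for name, first, last in PHASES:
        if first <= turn <= last:
            return name
    raise ValueError(f"turn {turn} is outside the 1 to 10 range")

def pass_rate_by_phase(results):
    """results: iterable of (turn, verdict), verdict in {'pass', 'partial', 'fail'}."""
    counts = defaultdict(lambda: [0, 0])  # phase -> [passed, total]
    for turn, verdict in results:
        bucket = counts[phase_for_turn(turn)]
        bucket[0] += verdict == "pass"
        bucket[1] += 1
    return {phase: passed / total for phase, (passed, total) in counts.items()}

rates = pass_rate_by_phase([(2, "pass"), (5, "partial"), (9, "fail"), (9, "pass")])
print(rates)
```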

The numbers

The chart at the top is every model’s pass rate by phase, over 198 dialogues and 624 triggers. Emergent and Critical track roughly what you would expect from general capability. Recovery is a cliff. The best model on the board, GPT-5.5, passes 37% of Recovery triggers and fails the other 63%. Fourteen of sixteen models pass fewer than one in five. Llama-3.2-8B passes one in a hundred.

And here is what makes Recovery worth measuring at all: it does not correlate with anything else.

Six standard benchmarks (GPQA, LiveCodeBench, MMLU, IFEval, SWE-bench, AIME) agree with each other at correlations of 0.64 to 0.97. They are all reading the same underlying capability. Emergent and Critical sit comfortably inside that cloud. Recovery sits outside it, at a mean correlation of 0.51. The cleanest way to see it is to line up two models that should be interchangeable:

Model	LiveCodeBench	SWE-bench	AIME	Recovery
GPT-5.5	85.0	82.0	100.0	37.2
Kimi-K2.6	89.6	80.2	96.4	7.4

Kimi-K2.6 is neck and neck with GPT-5.5 across coding and competition math, even leading on LiveCodeBench, and then scores five times worse at noticing what the user didn’t say. No existing benchmark predicts that gap. Recovery is a genuinely separate axis. And as the last post showed, when you put these proactive responses in front of people, they prefer them four times out of five. A real capability, one people want, that no leaderboard was measuring.

Why we built this

At Boson AI we build human-agent interaction models, and the thing we actually care about is whether a whole conversation leaves the user better off, not whether a single answer was correct. Proactivity is the clearest case of something that lives in user satisfaction and is invisible to standard evals.

Synthetic dialogue is what makes it tractable. The three-agent machinery lets us generate dialogues, regenerate any model at the trigger turns, and re-score every judgment, at scale and reproducibly, with a rubric and a verbatim quote behind each label. You cannot do that with scraped chat logs, and you cannot audit a number you cannot regenerate. A good synthetic benchmark is a stand-in you can actually take apart.

This is work led by Sepehr Harfi, a research intern at Boson AI from the University of Toronto, with Ahmad Salimi and Dongming Shen.

Blog and leaderboard: boson.ai/blog/proactbench · Paper: arXiv:2605.09228 · Data: huggingface.co/datasets/bosonai/proactbench · Code: github.com/boson-ai/ProactBench.

What your assistant didn’t say

Fri, 29 May 2026 07:00:00 GMT

A user has spent ten minutes preparing for a small art installation. The final exchange goes like this.

User: Plain-text packet is final. I’ll send tonight and load the hatchback after work.

Model A: Sounds like a clean plan. Drive safe, and go get it.

Model B: Before you go, a fast final check: projector + power cable; media player + charger; file tested and looping; tripod + mounting plate; extension cord; adapter; gaffer tape; measuring tape; small level; wall label on phone; backup copy on USB. If you can, pack the car tonight so tomorrow is just: grab computer, leave, install. Load in reverse order of install — the projector stand and extension cord go in last, so they’re the first things out.

Both responses are fluent. Both end the conversation politely. The first is what most assistants do. The second one notices the disclosed detail — load the hatchback — and turns it into something the user didn’t ask for but will probably need.

The thing without a name

Most LLM benchmarks score how well a model answers what the user asked. That’s the entire shape of the field: a question goes in, an answer comes out, and we measure how good the answer is. Multi-turn benchmarks preserve the same premise turn by turn. Every leaderboard you’ve seen, from MMLU to LiveCodeBench, fits this mould.

What Model B is doing isn’t on any of those benchmarks. It’s not instruction-following — the user gave no instruction. It’s not clarification — no question was asked. It’s not generic helpfulness (“let me know if you need anything”). It’s the model picking up an implicit detail and acting on it.

Call this conversational proactivity. It’s a specific, narrow ability: noticing what the user disclosed but didn’t ask about, and turning it into grounded forward-looking value. The hatchback isn’t a request — it’s an aside. Model A heard the conversation end. Model B anticipated the next problem the user was about to have.

Why this is invisible to leaderboards

A model can lead every standard benchmark and still be Model A in the exchange above. The benchmarks score responses to explicit requests; the user’s wrap-up is not a request. Both Model A and Model B end the dialogue politely, with no factual errors, no failure to follow instructions. Under any of the usual metrics, they’re equivalent.

When we built a benchmark for this kind of proactive behaviour — more on that in a follow-up post — the gap between models was startling. The short version is that capability on standard benchmarks does not predict it. But that’s a separate argument. The first question is whether anyone actually wants Model B’s answer in the first place.

The skeptic’s objection

Worth taking the pushback seriously. A model that volunteers initiative every turn is intrusive. A model that adds an unsolicited packing list to every conversation will, eventually, suggest one when you didn’t want one. “Helpfulness” can shade into nannying. Sign-offs are polite. So even if Model B is technically more useful in this dialogue, would real users actually prefer it? Several of my colleagues explicitly told me that they wouldn’t want an AI nanny.

There’s only one way to find out: ask them (not just the vocal ones).

The experiment

We ran the cleanest version of this test we could design. Same model. Same conversation history. Same decoding parameters — temperature 0.7, identical top-, identical sampling. The only difference: in one condition, the model received a short rubric as a system instruction telling it the response should add grounded forward-looking value tied to a specific detail from the conversation. In the other condition, vanilla generation.

Two responses per item. Random left/right placement. Annotators didn’t know which response came from which condition, didn’t know what we were testing, didn’t see the rubric. Just two paragraphs and a forced choice: which is more helpful?

Result: across 144 paired comparisons, the rubric-conditioned response was preferred 80% of the time. The 95% confidence interval is [74%, 86%]. The probability of seeing this under chance is below .
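
For a sense of scale, the chance probability can be computed exactly from the binomial distribution. The sketch below assumes 115 wins out of 144, which is 80% rounded; the exact count is not stated above, so treat that number as an assumption. It uses a one-sided test against a fair coin.

```python
import math

def binomial_upper_tail(k: int, n: int, p: float = 0.5) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k, n + 1))

# Assumed count: 80% of 144 comparisons is roughly 115.
p_value = binomial_upper_tail(115, 144)
print(f"P(at least 115 of 144 under chance) = {p_value:.1e}")
```

Under these assumptions the tail probability is many orders of magnitude below any conventional significance threshold.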

What surprised us most was the breakdown. We split the comparisons by how our judge had scored the vanilla response:

On items where the vanilla response had been rated a failure, humans preferred the proactive version 82% of the time. Expected.
On items where the vanilla response had already passed, humans still preferred the proactive version 70% of the time.

That last number is the one that matters. The rubric isn’t acting as error correction at the failure boundary. It’s lifting quality across the whole distribution, including on responses that were already fine.

What this means

The behaviour was already in the model. Nothing about the model changed — no fine-tuning, no extra context, no different decoder. What changed was a single line telling the model where to spend its attention. The proactive answer existed in the same neural network that produced the polite sign-off. They were separated by a prompt, not by a capability.

Which means: this isn’t a ceiling problem. It’s a default-behaviour problem. Post-training pipelines, RLHF, and system prompts are leaving real, measurable user value on the table — value that humans, when shown it side-by-side with the alternative, prefer four to one.

A model that can answer your question is the floor. A model that notices what you didn’t ask is the ceiling. The gap between them is bigger than the leaderboards suggest, and harder to teach than it looks. Spoiler alert: Model B comes from a company that has plenty of human-agent chat logs, and thus the opportunity to improve the model on empirical evidence.

More on that, with numbers, in the next post.

This is work led by Sepehr Harfi, a research intern at Boson AI, together with Ahmad Salimi and Dongming Shen. The benchmark we built around this idea, ProactBench, will be the subject of the next post. And in case you wonder why we care about this: at Boson AI we’re building human-agent interaction models and we want them to be as helpful to humans as possible. There’s only one way to find out whether they are, namely to measure and test.

Beating \(2^n - 1\) for Infiniband

Wed, 27 May 2026 07:00:00 GMT

One of my favorite interview questions for network engineers (no longer, now that I’m about to spoil it): you have 512 GPUs across 64 servers, eight GPUs per box. Your switches are NVIDIA Quantum-2 class, MQM9700, sixty-four 400Gb/s NDR ports each. Wire it for full cross-sectional bandwidth. How many switches, what topology, and where do you put the fabric manager?

The textbook part is a two-tier fat tree, also known as spine and leaf. With 64-port switches, split each leaf into 32 ports facing GPUs and 32 ports facing the spine. That gives 16 leaf switches (\(16 \times 32 = 512\) endpoints) and 8 spine switches (each leaf sends \(32 / 8 = 4\) links to each spine; each spine ends up with \(16 \times 4 = 64\) down, filling the box). Bisection is full, switch count is 24. So far, so easy.
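
The sizing arithmetic generalizes to any even switch radix. Here is a small sketch; the function name and the returned keys are hypothetical, and it assumes a non-blocking two-tier design in which every leaf splits its ports evenly between endpoints and uplinks.

```python
def two_tier_fat_tree(endpoints: int, radix: int) -> dict:
    """Size a full-bisection two-tier fat tree from identical switches."""
    down = radix // 2   # leaf ports facing endpoints
    up = radix - down   # leaf ports facing the spine
    if endpoints % down:
        raise ValueError("endpoints must fill whole leaves")
    leaves = endpoints // down
    total_uplinks = leaves * up
    if total_uplinks % radix:
        raise ValueError("uplinks must fill whole spines")
    spines = total_uplinks // radix  # every spine port takes one leaf uplink
    if up % spines:
        raise ValueError("uplinks must spread evenly across spines")
    return {
        "leaves": leaves,
        "spines": spines,
        "links_per_leaf_spine_pair": up // spines,
        "switches": leaves + spines,
    }

print(two_tier_fat_tree(endpoints=512, radix=64))
```

For 512 endpoints and 64-port switches this reproduces the numbers above: 16 leaves, 8 spines, 4 links per leaf-spine pair, 24 switches.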

Two ways to wire the GPUs to the leaves

The lazy answer groups by server: take 4 servers, plug all 32 NICs into one leaf. Everything on those four boxes lives on one switch. The problem is that within a single server, the GPUs already talk over NVLink at much higher bandwidth than IB will ever give you. The fabric exists to move data between servers, not within them. Putting four servers’ worth of intra-node traffic on one leaf optimizes for traffic that doesn’t happen. Worse, in collectives most of the GPU-to-GPU traffic is rank aligned: GPU0 talks to GPU0, GPU7 talks to GPU7. Group-by-server puts every rank on every leaf, so all of that rank-aligned traffic crosses the spine.

The rail-optimized layout flips this. Group by rank. All sixty-four GPU0s across the cluster go to leaf 0, all sixty-four GPU1s to leaf 1, and so on. Since we have 32 ports per leaf facing endpoints, each rank actually splits into two leaves of 32, so 16 leaves total, the same count as before. Any all-reduce within a rank stays leaf-local. Cross-rank traffic, the minority, climbs to the spine. This is how every serious training cluster gets wired, the NVIDIA SuperPOD reference architecture included.
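
The two wirings can be compared by counting how many rank-aligned GPU pairs (same rank, different servers) land on the same leaf. The sketch below is illustrative: the function names are hypothetical, and how each rank is split across its two leaves (servers 0 to 31 on the first, 32 to 63 on the second) is an assumption.

```python
from collections import Counter
from itertools import combinations

SERVERS, GPUS_PER_SERVER, DOWN_PORTS = 64, 8, 32

def leaf_by_server(server: int, rank: int) -> int:
    """Group by server: 4 servers (32 NICs) per leaf; the rank is ignored."""
    return server // (DOWN_PORTS // GPUS_PER_SERVER)

def leaf_by_rail(server: int, rank: int) -> int:
    """Group by rank: each rank gets two leaves of 32 servers (assumed split)."""
    leaves_per_rank = SERVERS // DOWN_PORTS
    return rank * leaves_per_rank + server // DOWN_PORTS

def leaf_local_fraction(leaf_of) -> float:
    """Fraction of rank-aligned server pairs that share a leaf."""
    local = total = 0
    for rank in range(GPUS_PER_SERVER):
        for s, t in combinations(range(SERVERS), 2):
            total += 1
            local += leaf_of(s, rank) == leaf_of(t, rank)
    return local / total

def ports_used(leaf_of) -> Counter:
    """Endpoint ports consumed on each leaf."""
    return Counter(leaf_of(s, r) for s in range(SERVERS) for r in range(GPUS_PER_SERVER))

print(f"group by server: {leaf_local_fraction(leaf_by_server):.1%} of rank-aligned pairs are leaf-local")
print(f"group by rank:   {leaf_local_fraction(leaf_by_rail):.1%} of rank-aligned pairs are leaf-local")
```

Both layouts use 16 leaves with 32 endpoints each. Under the assumed split, rank-aligned pairs within a 32-server half share a leaf, while pairs that straddle the two halves still go through the spine.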

So where do the UFM nodes go?

Quantum-class IB switches are unmanaged. You need a Unified Fabric Manager appliance to discover the topology, compute routes, monitor link health, and do everything the closed switch ASIC will not do for itself. Two UFM nodes for fault tolerance, hanging off different switches so one switch failure does not take both out.

The cluster as designed has zero slack. Every leaf port is a GPU, every spine port is a leaf uplink. This is where the bad options start.

Bad option 1: drop a server

Sacrifice one server, free 8 leaf ports (one per rank, on 8 different leaves). Plug two UFMs in. Done. The cluster is now 504 GPUs. Any 3D-parallel placement that wanted 64 servers, or any clean power of two in any dimension, gets to be ugly. You also paid for a 64th server you can’t fully use. (This option was suggested by a vendor.)

Bad option 2: add a core switch

Buy a third tier. A core switch above the spine frees 16 ports on the spines (the spines now uplink to core), and you hang the UFMs off there. This works. It costs you a switch, a rack of cabling, a few hundred watts, and it breaks the symmetry. Multipath routing on a regular fat tree is a beautiful thing. Hanging an asymmetric appliance off the top makes everyone’s life harder. Also, you bought a switch. (This option was also suggested by a vendor.)

The actually good option

Here’s what we did at Boson AI: cut one cable between a leaf and a spine. Plug a UFM into the freed leaf port. Plug a second UFM into the freed spine port. Done.

What did that cost? One leaf-spine pair now has 3 links instead of 4, so the bandwidth between that specific pair is down 25%. Averaged across all leaf-spine pairs, you lost \(1/512 \approx 0.2\%\) of total cross-sectional capacity. The penalty only shows up when (a) you are running close to all 64 servers, (b) the traffic actually traverses that one degraded pair, and (c) the traffic is cross-rank, so it had to use the spine at all. In a rail-optimized topology, criterion (c) already excludes most of the load. In practice the oversubscription is invisible.
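
The cost of the missing cable is a two-line calculation. The sketch below uses the fabric described earlier (16 leaves, 8 spines, 4 links per leaf-spine pair); the function name is hypothetical.

```python
from fractions import Fraction

def cut_cable_cost(leaves: int = 16, spines: int = 8, links_per_pair: int = 4):
    """Capacity lost by removing one leaf-spine cable from a uniform fat tree."""
    total_links = leaves * spines * links_per_pair
    pair_loss = Fraction(1, links_per_pair)  # loss on the one affected pair
    fabric_loss = Fraction(1, total_links)   # loss across the whole fabric
    return pair_loss, fabric_loss

pair_loss, fabric_loss = cut_cable_cost()
print(f"affected pair: {float(pair_loss):.0%}, whole fabric: {float(fabric_loss):.2%}")
```

One pair drops by a quarter, and the fabric as a whole loses one link in 512.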

You keep \(512 = 2^9\) GPUs. You keep the symmetric uniform fat tree. You spend no extra hardware. You get two redundant fabric managers on physically distinct switches. The whole thing is a one-cable change.

Going to

Quantum-3 switches in the Q3400 family come with ports. As such, the natural cluster sizes land on GPUs instead of pure powers of two. As it turns out (thanks to Semianalysis for pointing this out), NVIDIA listened and added a dedicated UFM port to the switches.

At the same time, they disabled the possibility of running UFM on managed Quantum-2 switches. In case you wonder whether IB forwarding across Ethernet might be possible, this unfortunately doesn’t work for control signals.

Fair representations are pancakes

Tue, 26 May 2026 07:00:00 GMT

The natural reaction to the Pokémon theorem is to escape into representation learning. If finite scalar checklists cannot catch every fairness violation, then bake fairness into the features once. Ship a single encoder and let every downstream predictor inherit fairness for free. There is a real and serious literature behind exactly this idea: Zemel et al. (2013), Edwards and Storkey (2016), Louizos et al. (2016), Madras et al. (2018). It is a beautiful agenda, and it has a serious flaw …

What “fair features” actually demand

Two desiderata for a feature map :

Predictive signal. carries information about (otherwise we can’t use it for classification).
Distributional insensitivity to the group attribute . That is, the group attribute shouldn’t matter in aggregate for the feature distribution. Operationally this gets split into two pieces:
- Parity in the representation: .
- Class-conditional separation: for each .

Without the second piece the goal collapses into a triviality: throw away all signal about and parity is automatic. Both together is the operational definition of a fair feature, and it is the form every paper above is chasing. The notation is the same as in the previous post: is the conditional mean embedding of the representation, and adds the class.

Linear algebra strikes again

Apply the law of total expectation conditional on . With class-conditional separation, the class-conditional means do not depend on , so we can drop the subscript and write . Then we can write the per-group mean as

Here is the group-specific class conditional probability (base rate). Subtract across groups:

Parity zeros the left-hand side. Unequal base rates keep the scalar coefficient nonzero. So the bracket must vanish:

In other words, the distributions for positive and negative classes match, since under a characteristic kernel on , equality of mean embeddings lifts to equality of distributions. We get . No measurable downstream predictor on retains any signal about .

The representation has been flattened. Both stacks of pancakes are now one.
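The whole argument fits in three lines. This is a sketch with labels chosen here for illustration: $Z$ is the representation, $A \in \{0,1\}$ the group, $Y \in \{0,1\}$ the class, $\phi$ the kernel feature map, $\pi_a = P(Y=1 \mid A=a)$ the base rate, $\mu_a = \mathbb{E}[\phi(Z) \mid A=a]$ the per-group mean embedding and $m_y = \mathbb{E}[\phi(Z) \mid Y=y]$ the class-conditional mean embedding (group-free by class-conditional separation).

```latex
\begin{align*}
\mu_a &= \pi_a\, m_1 + (1 - \pi_a)\, m_0
  && \text{(total expectation over } Y \text{)} \\
\mu_0 - \mu_1 &= (\pi_0 - \pi_1)\,(m_1 - m_0)
  && \text{(subtract across groups)} \\
0 &= (\pi_0 - \pi_1)\,(m_1 - m_0)
  \;\Longrightarrow\; m_1 = m_0
  && \text{(parity, and } \pi_0 \neq \pi_1 \text{)}
\end{align*}
```

With a characteristic kernel, $m_1 = m_0$ means the two class-conditional distributions of $Z$ coincide, which is the collapse.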

Why this is harder to swallow than the Pokémon theorem

The Pokémon theorem said some fairness criterion remains unaudited after any finite checklist. There was a residual, and you could go bigger. The collapse here is not a residual. It is the joint distribution of being degenerate. Whatever your encoder is (linear, kernel, deep network, contrastive, adversarial), if it satisfies parity and class-conditional separation exactly and base rates differ, there is nothing in there about . Closely related observations appear in Lechner et al. (2021) and Zhao and Gordon (2022); the conjunction form here pins down the exact obstruction.

There is a decade of work that took the two desiderata as a starting point and built encoders that approximately satisfied both. The theorem does not say those encoders are useless. It only says that the limit they were chasing is empty (but there’s hope - check out Theorem 6.1 in the paper).

The forbidden corner

The experimental check across three standard fairness benchmarks:

Each panel plots the parity gap on the horizontal axis against the largest class-conditional gap on the vertical axis. The diagonal is the theoretical bound from the approximate version of the theorem (Theorem 6.1). The upper-left corner (small parity gap, large class-conditional signal) is the forbidden region.

The forbidden region is empirically empty. LFR, Fair-VAE, and adversarial debiasing across all three datasets cluster on the allowed side of the line. They are doing the only thing they can.

The escape

The exact collapse only bites at the exact point. Relax to -parity and -class-separation and you get that the usable class-conditional signal is bounded by

Linear in the joint fairness budget, rescaled by the base-rate gap. As the bound collapses to zero, matching the exact theorem. But for any positive budget there is a real Pareto frontier you can navigate and trade off fairness for usefulness in features. The practical lesson is that the question is not whether your representation is fair. It is which fairness budget you spent and how much class signal you have left. Spend it deliberately.
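One elementary way to see where a bound of this shape comes from is the following sketch. The notation is chosen here for illustration, and the constants in the paper's Theorem 6.1 may differ. Write $m_{y,a} = \mathbb{E}[\phi(Z) \mid Y=y, A=a]$, and assume $\|\mu_0 - \mu_1\| \le \varepsilon$ (approximate parity) and $\|m_{y,0} - m_{y,1}\| \le \delta$ for both classes (approximate class-conditional separation).

```latex
\begin{align*}
\mu_0 - \mu_1
  &= (\pi_0 - \pi_1)\,(m_{1,0} - m_{0,0})
   + \pi_1\,(m_{1,0} - m_{1,1})
   + (1 - \pi_1)\,(m_{0,0} - m_{0,1}) \\
|\pi_0 - \pi_1| \, \|m_{1,0} - m_{0,0}\|
  &\le \|\mu_0 - \mu_1\| + \pi_1\,\delta + (1 - \pi_1)\,\delta
   \;\le\; \varepsilon + \delta \\
\|m_{1,0} - m_{0,0}\|
  &\le \frac{\varepsilon + \delta}{|\pi_0 - \pi_1|}
\end{align*}
```

The first line is just the law of total expectation for each group, rearranged; the second is the triangle inequality. Setting $\varepsilon = \delta = 0$ recovers the exact collapse.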

Closing

The Pokémon theorem said finite scalar checklists cannot catch every violation. The pancake theorem says perfect fair representations cannot retain . Two different orthogonality arguments, same RKHS engine. The paper has a Pokémon–KMR bridge and minimax rates for the approximate versions of both.

Paper: arXiv:2605.09221. Joint with Daniel Matsui Smola.

You don’t need all the benchmarks

Mon, 25 May 2026 07:00:00 GMT

Every time a new model comes out, somebody runs it on MMLU (57 subjects), MTEB (56 tasks), HELM, the Open LLM Leaderboard, AlpacaEval, LiveBench, BigCodeBench, WildBench, Arena-Hard, MT-Bench, and a dozen others. That’s days of GPU time and a lot of human babysitting. But if you’ve ever stared at a leaderboard for ten minutes you already know the dirty secret: the columns are wildly correlated. If a model is good at one math benchmark it’s good at all of them. So how much of this can we just skip?

A lot, as it turns out. On MMLU, 5 subjects out of 57 predict the remaining 52 with , across 5,452 models, with 10-fold cross-validation. The eigenspectrum of the score covariance tells the same story: two components capture 90% of the variance on MMLU, six on MTEB. Benchmark scores live in a low-dimensional subspace.
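If you want to check the low-dimensional claim on your own score matrix, the computation is short. Here is a minimal sketch on synthetic data (two hypothetical latent skills driving 20 made-up benchmarks, not the leaderboard data from the paper); `components_for_variance` is a helper name chosen here.

```python
import numpy as np

def components_for_variance(scores, target=0.90):
    """Smallest number of principal components explaining `target` of the variance."""
    cov = np.cov(scores, rowvar=False)               # benchmarks x benchmarks
    eigvals = np.linalg.eigvalsh(cov)[::-1]          # eigenvalues, largest first
    eigvals = np.clip(eigvals, 0.0, None)            # guard against tiny negative values
    explained = np.cumsum(eigvals) / eigvals.sum()   # cumulative explained variance
    return int(np.searchsorted(explained, target) + 1), explained

# Synthetic stand-in for a leaderboard: 200 models, 20 benchmarks, 2 latent skills.
rng = np.random.default_rng(0)
skills = rng.normal(size=(200, 2))
loadings = rng.normal(size=(2, 20))
scores = skills @ loadings + 0.05 * rng.normal(size=(200, 20))

k, explained = components_for_variance(scores)
print(k, explained[:3].round(3))
```

On real leaderboards you would first have to deal with missing entries; this sketch assumes a fully observed matrix.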

The question is which subset, and how to pick it.

A Gaussian model of leaderboard scores

Let be the score matrix with models and benchmarks. Assume each row is an independent draw from a multivariate Gaussian:

Obviously wrong. Benchmark scores are bounded, often bimodal, frequently sparse. But it’s extremely useful: and are easy to estimate from leaderboard data (with EM for the missing entries you’ll inevitably have), and conditioning on a subset gives a closed-form imputer:

The conditional mean is the best linear predictor regardless of whether the scores are actually Gaussian, so the model doesn’t need to be right to give us a reasonable imputer. It just needs to give us .
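For reference, the conditional mean of a Gaussian given the observed block $S$ is $\mu_U + \Sigma_{US}\Sigma_{SS}^{-1}(x_S - \mu_S)$, where $U$ indexes the benchmarks you did not run. A minimal NumPy sketch of that imputer follows; the function name `impute_unobserved` and the toy numbers are made up for illustration.

```python
import numpy as np

def impute_unobserved(mu, Sigma, observed_idx, x_observed):
    """Conditional-mean imputation of the unobserved benchmarks given the observed ones."""
    mu = np.asarray(mu, dtype=float)
    Sigma = np.asarray(Sigma, dtype=float)
    S = np.asarray(observed_idx)
    U = np.setdiff1d(np.arange(len(mu)), S)          # indices we did not run
    Sigma_SS = Sigma[np.ix_(S, S)]
    Sigma_US = Sigma[np.ix_(U, S)]
    residual = np.asarray(x_observed, dtype=float) - mu[S]
    # mu_U + Sigma_US Sigma_SS^{-1} (x_S - mu_S), using a solve instead of an explicit inverse
    return U, mu[U] + Sigma_US @ np.linalg.solve(Sigma_SS, residual)

# Toy example: benchmark 1 is strongly coupled to benchmark 0, benchmark 2 is not.
mu = [0.6, 0.5, 0.4]
Sigma = [[0.04, 0.03, 0.00],
         [0.03, 0.04, 0.00],
         [0.00, 0.00, 0.01]]
U, predicted = impute_unobserved(mu, Sigma, observed_idx=[0], x_observed=[0.8])
print(U, predicted)
```

A model that beats the mean on benchmark 0 gets its prediction for benchmark 1 pulled up, while the uncorrelated benchmark 2 stays at its mean.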

This is the sensor placement problem

Once you have , picking which benchmarks to run is exactly the Gaussian process sensor placement problem Krause, Singh and Guestrin solved in 2008. Their sensors are our benchmarks; the field they’re trying to reconstruct is our matrix of unselected benchmark scores. Two natural objectives:

Entropy . Pick benchmarks that are diverse from each other.
Mutual information . Pick benchmarks that are maximally coupled with the ones you don’t run.

Both are submodular under the Gaussian model. Entropy is also monotone (after a constant shift), so greedy comes with the standard approximation guarantee. MI is non-monotone in general but stays positive-margin for small in every dataset we tried, so we run greedy on it as a heuristic.

Greedy means: start empty, add the benchmark with the largest marginal gain, repeat. For entropy this is one rank-one Cholesky update per step; for MI it’s a fresh Cholesky on the complement block. Either way the total cost is negligible compared to running a single benchmark on a single model.

These two objectives pick different benchmarks. Entropy hunts for outlier subjects that disagree with everything else; MI hunts for hub subjects that strongly couple with the rest. For imputation, MI is what you want. But for diversity, entropy is the right tool. A surprise worth picking apart another time.
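The difference between the two objectives is easy to see on a toy covariance. The sketch below is a plain greedy loop under the Gaussian model; it recomputes conditional variances with direct solves rather than the Cholesky updates described above, which is fine at this size. For entropy the marginal gain of benchmark $j$ is monotone in its variance conditioned on the picks so far; for MI it is the ratio of that variance to the variance conditioned on everything else that remains unselected (the Krause, Singh and Guestrin criterion). Function names are chosen here.

```python
import numpy as np

def cond_var(Sigma, j, cond):
    """Variance of variable j conditioned on the variables listed in `cond`."""
    if len(cond) == 0:
        return Sigma[j, j]
    c = np.asarray(cond)
    s_jc = Sigma[j, c]
    return Sigma[j, j] - s_jc @ np.linalg.solve(Sigma[np.ix_(c, c)], s_jc)

def greedy_select(Sigma, k, objective="mi"):
    """Greedily pick k indices maximizing entropy or mutual information."""
    n = Sigma.shape[0]
    selected = []
    for _ in range(k):
        best, best_gain = None, -np.inf
        for j in range(n):
            if j in selected:
                continue
            gain = cond_var(Sigma, j, selected)          # entropy criterion
            if objective == "mi":
                rest = [i for i in range(n) if i != j and i not in selected]
                gain = gain / cond_var(Sigma, j, rest)   # MI criterion
            if gain > best_gain:
                best, best_gain = j, gain
        selected.append(best)
    return selected

# Benchmarks 0 and 1 are tightly coupled; benchmark 2 is a high-variance loner.
Sigma = np.array([[1.0, 0.9, 0.0],
                  [0.9, 1.0, 0.0],
                  [0.0, 0.0, 4.0]])
print("entropy picks first:", greedy_select(Sigma, 1, "entropy"))
print("MI picks first:     ", greedy_select(Sigma, 1, "mi"))
```

Entropy goes for the loner (benchmark 2), MI goes for one of the coupled pair, because only those tell you something about what you skipped.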

The numbers

We tried this on three score matrices stitched together from ten public leaderboards: MMLU (5452 × 57, fully observed), MTEB (263 × 56, 77% observed), and a merged matrix from nine leaderboards (118 × 114, 31% observed). 10-fold CV, fit on 9 folds, impute the held-out fold.

MMLU: . . Stable even with only 10% of models for training (~545 rows).
MTEB: at , harder because the matrix is sparser and embedding tasks are genuinely heterogeneous.
Merged: the hardest case ( at ), but still 15 benchmarks out of 114 explaining over half the variance in held-out scores.

A useful baseline: random selection on MMLU gets at . The 57 subjects are so redundant that you can almost throw darts. The gap opens up on MTEB and Merged, where the benchmark space has real structure and principled selection matters.

The hero image up top is what MI picks on MMLU, with selection positions averaged across the 10 folds (red diamonds) and per-fold positions in light blue. The first nine picks are identical across all 10 folds. The picks themselves are sensible: miscellaneous, professional_psychology, elementary_mathematics, high_school_psychology, marketing. A deliberately broad sweep of the MMLU domain.

We’re not alone

A few neighbors in this corner of the literature:

tinyBenchmarks reduces redundancy within a benchmark by selecting representative examples.
Sloth fits low-dimensional latent skills across benchmark families to predict performance.
BenchBench diagnoses agreement among benchmarks via meta-benchmarking.
BenchPress is the closest spiritual neighbor. It empirically observes that benchmark scores can be interpolated. Our entropy step turns out to be exactly pivoted Cholesky on , which is more or less what BenchPress does, but information-theoretically motivated rather than empirical.

One application: ranking a new benchmark

The same machinery applies in the dual. Instead of selecting benchmarks from a fixed pool, you can rank a new benchmark by its marginal MI gain against the incumbents. We tried this on the recently proposed ProactBench, which decomposes conversational proactivity into Emergent, Critical, and Recovery axes. Recovery, assistance the user didn’t explicitly ask for, turns out to be weakly correlated with everything else ( vs. – for the standard benchmarks), and greedy entropy on the 9×9 correlation matrix ranks it #2 out of 9, ahead of every existing benchmark in the comparison. A clean example of when adding a new benchmark is actually worth the GPU time.

Paper: arXiv:2605.02209 · Workshop talk slides · Code and score matrices: github.com/smolix/benchmark-selection.

The Pokémon Theorem

Mon, 25 May 2026 07:00:00 GMT

If you have been near algorithmic fairness for the past decade, you already know the punchline: you cannot have all the things you want. Calibration, class-conditional balance, and predictive parity refuse to coexist whenever base rates differ across groups. This is the impossibility trinity of Kleinberg, Mullainathan and Raghavan (2017), Chouldechova (2017) and Pleiss et al. (2017). Barocas, Hardt and Narayanan collect the pairwise tensions among independence, separation and sufficiency into what they call the incompatibility triangle. Hutchinson and Mitchell (2019) survey the fifty years of educational-testing impossibility work that came before all of this. The 1970s knew most of the punchline. None of us listened.

There is an obvious question that the trinity leaves open. What if we only impose scalar criteria, rather than distributional ones? Will a few of them suffice to ensure fairness? After all, this is what individual scalar metrics such as college acceptance rates, recidivism rates, etc. aim to control for, only to be thwarted again by another study. With dozens of criteria, with hundreds, with every reasonable test anyone has ever proposed, can we close every gap? It turns out we cannot. The argument is a very simple piece of linear algebra.

A kernel refresher

Pick a characteristic kernel on the feature space with feature map . For each group , the conditional mean embedding is

A linear mean-fairness criterion is a test direction , and the score (or classifier, or learned representation) passes the test exactly when the expected score on group matches that on group :

where is the group-difference vector. Demographic parity gaps, calibration moments, equalized-odds residuals, every “is the expected something the same across groups” check fits this form for some . A fairness checklist of size is a finite set of directions , and the classifier passes the checklist iff is orthogonal to every .

If the groups are distributionally distinct, , then by the characteristic property of . This is the only ingredient we need.

The geometric escape

Let be the subspace of the tests we apply (this works, since we’re in an RKHS). If our classifier passes all criteria, then , the orthogonal complement of the audit subspace. Now consider the unit vector . By construction it also lies in , and

That direction is a fairness violation. We just constructed it. Whichever finite checklist you brought, the MMD witness is sitting in its orthogonal complement, perfectly visible, completely unaudited.
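You can watch this happen in a finite-dimensional stand-in for the RKHS. The sketch below is an illustration, not the construction from the paper: it uses 50 dimensions instead of infinitely many, draws 10 random test directions, builds a group-difference vector `delta` that passes all of them, and then checks that the normalized `delta` is itself a test direction outside the checklist that flags a violation. All names are chosen here.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 50, 10                       # ambient dimension and checklist size (assumed values)
W = rng.normal(size=(m, d))         # rows are the test directions w_1, ..., w_m

# A group-difference vector that passes every test: remove the part inside span(W).
Q, _ = np.linalg.qr(W.T)            # orthonormal basis of the audit subspace, shape (d, m)
raw = rng.normal(size=d)
delta = raw - Q @ (Q.T @ raw)       # lives in the orthogonal complement of the checklist

passes_checklist = np.allclose(W @ delta, 0.0, atol=1e-9)

# The witness direction: the normalized group difference itself.
witness = delta / np.linalg.norm(delta)
violation = witness @ delta         # equals the norm of delta, strictly positive

print(passes_checklist, violation)
```

Every test on the checklist reports zero, and the witness still sees the full size of the group difference.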

You gotta catch ’em all. Alas, the Pokémon theorem says that you can’t.

Why does this work?

Our argument is structural, not numerical. It does not depend on which criteria you chose, only on the fact that there are finitely many of them and the groups are distributionally distinct. The four-hundredth criterion buys you an audit subspace of dimension at most four hundred, and the orthogonal complement of a four-hundred-dimensional subspace of an infinite-dimensional Hilbert space is still infinite-dimensional. Some direction in it carries the group-difference signal. The MMD witness is always one such direction.

You can ask how large the residual is after criteria, and the paper gives a quantitative answer in terms of the spectral regularity of relative to the pooled data covariance: under polynomial eigendecay of the covariance operator and a source condition on , the minimax residual decays at the Kolmogorov -width rate . The minimax-optimal allocation of a size- fairness budget is the top- Mercer eigenspace of the pooled covariance. Spectral budgeting beats heuristic checklists. The paper has the details.

Provenance

I first posited this theorem (without proof) in a Stanford CS 329P lecture on applied machine learning. It sat there for a few years, in the form of a single slide and a feeling that it ought to be true. Then my son Daniel Matsui Smola and I sat down together and turned it into a proof, a quantitative version, and a handful of approximation results around it. I have written quite a few papers over the years. This is the first one I have written with Daniel, and I am incredibly proud of him. Watching him chase down the spectral arguments and add many more results has been one of the genuine joys this year.

Next post in this series: the impossibility of fair feature learning. A different and more brutal piece of orthogonality folklore from the same paper. Check out arXiv:2605.09221.

Dive into Deep Learning

Tue, 15 Jan 2019 08:00:00 GMT

I’m happy to announce our new book project - Dive into Deep Learning. It’s still in beta stage, i.e. we’re still working on it. That said, I think that it’s good enough to share with friends and colleagues. This is joint work with Aston Zhang, Mu Li, and Zachary Lipton.

There’s an obvious question - why yet another machine learning book? After all, there’s no shortage of great books, e.g. Machine Learning by Kevin Murphy, Pattern Recognition and Machine Learning by Chris Bishop, Deep Learning by Ian Goodfellow, Yoshua Bengio and Aaron Courville, Information Theory, Inference and Learning Algorithms by the late David MacKay, and many other great books. At the same time, there are ready made recipe books for various deep learning frameworks, such as Learning Tensorflow by Itay Lieder, Yehezkel Resheff, and Tom Hope. However, none of the books so far attempted to bridge this gap between theory and practice. This is what the current book project aims to address. In particular, it combines the following features:

Downloadable Jupyter notebooks. In fact, the entire book consists of notebooks.
A freely available PDF version
A GitHub repository to allow for fast corrections of errata
A tight integration with discussion forums to allow for questions regarding the math and code on the site
Theoretical background suitable for engineers and undergraduate researchers
State of the art models (including ResNet, Faster R-CNN, etc.)
Well documented and structured code that is executed on real datasets, yet at the same time small enough to fit on a laptop.
A Chinese translation (in fact, the Chinese book will be released first)

In addition to that, Mu and I will be teaching a class at UC Berkeley in Spring. As part of that, we will be releasing slides, videos and assignments, suitable for reuse by anyone who’d like to do so. In short, we aim to offer a complete resource to learn deep learning, easily and in a comprehensive manner. Please let us know what you think.

Leaving CMU

Wed, 01 Jun 2016 07:00:00 GMT

Dear Friends,
As some of you may have already heard, I’m leaving CMU to join Amazon, effective July 1, 2016. There I will be in charge of Amazon’s Cloud Machine Learning Platform with the task to make machine learning as easy to use and widespread as it could possibly be. This is a terrific task and it was an offer that I could not turn down. Our lab will be in the Bay Area and we will strive to turn the state of the art in machine learning research into the state of the art in industry. Both in terms of scale and in terms of model sophistication. This is a very exciting time and I’m looking forward to it. If all goes well, this will raise the bar also in academia.

I wanted to say thanks to everyone who’s helped us along on this journey. First off, an absolutely outstanding cast of PhD students I’ve had the pleasure of working with. Mu, Manzil, Zichao, Fish, Yu-Xiang, Seth, Alex, Ziqi, Wei, Yining, Chao-Yuan, Sashank, you guys rock! And yes, most of you are smarter than me, and this has been utterly delightful. Second, I wanted to thank the terrific CMU faculty. I learned a lot and the past 4 years have been a wonderful experience. Many thanks, in particular to Dave, Andy, Andrew, Tom, Geoff, Nina, Christos, Ryan, Barnabas, Larry, Jaime and Kayvon. Many thanks for your advice, help, insight, and collaboration. What we achieved in the past 4 years would not have been possible without a great team. And thanks to Mallory and Diane for keeping the ship running no matter what, and thanks to Russ for keeping the accounting in good shape and for putting up with last minute updates. Apologies if I forgot someone on this list (I probably have). It has been wonderful to work with you and I dearly love CMU.

So why the change? Here’s the reasoning that went into deciding to go to Amazon: Our goal as machine learning researchers is to solve deep problems (not just in deep learning) and to ensure that this leads to algorithms that are actually used. At scale. At sophistication. In applications. The number of people I could possibly influence personally through papers and teaching might be 10,000. At Amazon we have 1 million developers using AWS. Likewise, the NSF thinks that a project of 3 engineers is a big grant (and it is very choosy in awarding these grants). At Amazon we will be investing an order of magnitude more resources towards this problem. With data and computers to match this. This is significant leverage. Hence the change.

We will try to give back to the academic community, to contribute back, e.g. through open source. And the goal is to engage it more deeply. I cannot give more specifics yet but good things will happen. Stay tuned.

Technically I will be on leave of absence until August 2017. This is mostly to ensure that the students are taken care of and that grants are in good shape. I’ll be reaching out to many of you directly (the reason why I haven’t done this to all of you yet is that I’m in the middle of moving).

Let’s stay in touch. And, obviously, if you’re interested in joining me on this journey (or have students who would like to come along), please let me know.

Best, Alex

PS: I’m posting this letter (which was intended for my team at CMU and my colleagues there) after screenshots of parts of it seem to have gone live on Weibo.

Distributing Data in a Parameter Server

Tue, 15 Jan 2013 08:00:00 GMT

One of the key features of a parameter server is that it, well, serves parameters. In particular, it serves more parameters than a single machine can typically hold and provides more bandwidth than what a single machine offers. A sensible strategy to increase both aspects is to arrange data in the form of a bipartite graph with clients on one side and the server machines on the other. This way bandwidth and storage increase linearly with the number of machines involved. This is well understood. For instance, distributed (key,value) stores such as memcached or Basho Riak use it. It dates back to the ideas put forward by David Karger et al. on Consistent Hashing and Random Trees in STOC 1997.

A key problem is that we can obviously not store a mapping table from the keys to the machines. This would require a database that is of the same size as the set of keys and that would need to be maintained and updated on each client. One way around this is to use the argmin hash mapping. That is, given a machine pool $M$, we assign a given (key,value) pair $(k, v)$ to the machine that has the smallest hash, i.e.

$$m(k, M) = \operatorname*{argmin}_{m' \in M} h(k, m')$$

The advantage of this scheme is that it allows for really good load balancing and repair. First off, the load is almost uniformly distributed, short of a small number of heavy hitters. Secondly, if a machine is removed or added to the machine pool, rebalancing affects all other machines uniformly. To see this, notice that the choice of machine with the smallest and second-smallest hash value is uniform.
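Here is a minimal sketch of the argmin hash (the same idea is also known as rendezvous hashing), assuming SHA-256 as the hash and made-up server and key names. It also checks the repair property: when a machine leaves, only the keys it owned move.

```python
import hashlib

def h(key, machine):
    """Joint hash of a (key, machine) pair as a 64-bit integer."""
    digest = hashlib.sha256(f"{machine}|{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def assign(key, machines):
    """Argmin hash: the key goes to the machine with the smallest joint hash."""
    return min(machines, key=lambda m: h(key, m))

machines = [f"server-{i}" for i in range(8)]      # hypothetical machine pool
keys = [f"key-{i}" for i in range(10_000)]        # hypothetical keys

before = {k: assign(k, machines) for k in keys}
survivors = [m for m in machines if m != "server-3"]
after = {k: assign(k, survivors) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print(len(moved), "keys moved, all previously on server-3:",
      all(before[k] == "server-3" for k in moved))
```

No mapping table is needed: every client can recompute the owner of a key from the machine list alone, at the cost of one hash per machine per key.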

Unfortunately, this is a stupid way of distributing (key,value) pairs for machine learning. And this is what we did in our 2010 VLDB and 2012 WSDM papers. In our defense, we didn’t know any better. And others copied that approach … after all, how can you improve on such nice rebalancing properties?

This raises the question of why it is a bad idea. It all comes down to the issue of synchronization. Basically, whenever a client attempts to synchronize its keys, it needs to traverse the list of the keys it owns and communicate with the appropriate servers. In the above scheme, it means that we need to communicate with a new random server for each key. This is amazingly costly. Probably the best comparison would be a P2P network where each byte is owned by a different machine. Downloads would take forever.

We ‘fixed’ this problem by cleverly reordering the access and then performing a few other steps of randomization. There’s even a nice load balancing lemma in the 2012 WSDM paper. However, a much better solution is to prevent the problem from happening and to borrow from key distribution algorithms such as Chord. In it, servers are inserted into a ring via a hash function. So are keys. This means that each server now owns a contiguous segment of keys. As a result, we can easily determine which keys go to which server, simply by knowing where in the ring the server sits.

In the picture above, keys are represented by little red stars. They are randomly assigned using a hash function via h(k) to the segments ‘owned’ by servers s that are inserted in the same way, i.e. via h(s). In the picture above, each server ‘owns’ the segment to its left. Also have a look at the Amazon Dynamo paper by DeCandia et al., 2007 SOSP for a related description.

Obviously, such a load-balancing isn’t quite as ideal as the argmin hash. For instance, if a machine fails, the next machine inherits the entire segment. However, by inserting each server log(n) times we can ensure that a good load balance is achieved and also that when machines are removed, there are several other machines that pick up the work. Moreover, it is now also very easy to replicate things (more on this later). If you’re curious on how to do this, have a look at Amar Phanishayee’s excellent thesis. In a nutshell, the machines to the left hold the replicas.
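A ring with multiple insertions per server is only a few lines. This is a sketch under assumptions: SHA-256 as the hash, 16 virtual points per server as a stand-in for the log(n) insertions, and made-up names (`HashRing`, `server-3`). It is not the parameter server's actual implementation.

```python
import bisect
import hashlib

def ring_hash(s):
    """Position on the ring as a 64-bit integer."""
    return int.from_bytes(hashlib.sha256(s.encode()).digest()[:8], "big")

class HashRing:
    def __init__(self, servers, virtual_nodes=16):
        # Insert every server several times so that its load is spread over many segments.
        self.points = sorted(
            (ring_hash(f"{s}#{i}"), s) for s in servers for i in range(virtual_nodes)
        )
        self.hashes = [p for p, _ in self.points]

    def owner(self, key):
        # The first server point at or after the key owns it, i.e. each server owns
        # the segment to its left; the modulo wraps around the end of the ring.
        i = bisect.bisect_left(self.hashes, ring_hash(key)) % len(self.points)
        return self.points[i][1]

servers = [f"server-{i}" for i in range(8)]
keys = [f"key-{i}" for i in range(10_000)]

before = {k: HashRing(servers).owner(k) for k in keys[:0]}  # placeholder, filled below
ring = HashRing(servers)
before = {k: ring.owner(k) for k in keys}
smaller = HashRing([s for s in servers if s != "server-3"])
after = {k: smaller.owner(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
heirs = {after[k] for k in moved}
print(len(moved), "keys moved to", len(heirs), "different servers")
```

Keys that hash close together land on the same server, which is what makes synchronization cheap, and the virtual points spread a failed server's segments over several heirs instead of dumping everything on one neighbor.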

100 Terabytes, 5 Billion Documents, 10 Billion Parameters, 1 Billion Inserts/s

Sat, 15 Dec 2012 08:00:00 GMT

We’ve been busy building the next generation of a Parameter Server and it’s finally ready. Check out the OSDI 2014 paper by Li et al. It’s quite different from our previous designs, the main improvements being fault tolerance and self repair, a much improved network protocol, flexible consistency models, and a much more general interface.

In the next few posts I’ll explain the engineering decisions that went into this system which is capable of solving problems as diverse as very high throughput sketching, topic models, and optimization. And yes, it is open source.

Beware the bandwidth gap - speeding up optimization

Mon, 15 Oct 2012 07:00:00 GMT

Disks are slow and RAM is fast. Everyone knows that. But many optimization algorithms don’t take advantage of this. More to the point, disks currently stream at about 100-200 MB/s, solid state drives stream at over 500 MB/s with 1000x lower latency than disks, and main memory reigns supreme at about 10-100 GB/s bandwidth (depending on how many memory banks you have). This means that it is 100 times more expensive to retrieve instances from disk rather than recycling them once they’re already in memory. CPU caches are faster yet with 100-1000 GB/s of bandwidth. Everyone knows this. If not, read Jeff Dean’s slides. Page 13 is pure gold.

Update (11/2022): some things have gotten a lot faster over the past decade. For instance, SSDs now regularly deliver multiple GB/s. The number of requests has increased dramatically, too, to several 100k IOPS (IO Operations Per Second). For a recent comparison check out this Anandtech benchmark. At the same time, memory bandwidth has increased to 100-1000 GB/s, depending on the system. Many variants of Jeff Dean’s latency numbers have sprung up. One of the more recent ones is the one by Colin Scott.

Ok, so what does this mean for machine learning? If you can keep things in memory, you can do things way faster. This is one of the key ideas behind Spark. It’s a wonderful alternative to Hadoop. In other words, if your data fits into memory, you’re safe and you can process data way faster. A lot of datasets that are considered big in academia fit this bill (in 2022 you can easily rent or buy 256-1024GB instances). But what about real big data? Essentially you have two options - have the systems designer do the hard work or change your algorithm. This post is about the latter. And yes, there’s a good case to be made about who should do the work: the machine learners or the folks designing the computational infrastructure (I think it’s both).

So here’s the problem: Many online algorithms load data from disk, stream it through memory as efficiently as possible and discard it after seeing it once, only to pick it up later for another pass through the data. That is, these algorithms are disk bound rather than CPU bound. Several solvers try to address this by making the disk representation more efficient, e.g. LibLinear or VowpalWabbit, both of which use their own internal representation for efficiency. While this still makes for quite efficient code that can stream up to 10-100TB of data per hour, in any given pass, main memory is still much faster. This has led to the misconception that many machine learning algorithms are disk bound. But, they aren’t …

What if we could re-use data that’s in memory? For instance, use a ringbuffer where the disk writes into it (much more slowly) and the CPU reads from it (100 times more rapidly). The problem is what to do with an observation that we’ve already processed. A naive strategy would be to pretend that it is a new instance, i.e. we could simply update on it more than once. But this is very messy since we need to keep track of how many times we’ve seen the instance before, and it creates nonstationarity in the training set.

A much cleaner strategy is to switch to dual variables, similar to the updates in the Dualon of Shalev-Shwartz and Singer. This is what Shin Matsushima did in our dual cached loops paper (the StreamSVM implementation unfortunately has fallen into disrepair in the past decade). Essentially, it keeps data in memory in a ringbuffer and updates the dual variables. This way, we’re guaranteed to make progress at each step, even if we’re revisiting the same observation more than once. To see what happens have a look at the graph below:

It’s just as fast as LibLinear provided that it’s all in memory. Algorithmically, what happens in the SVM case is that one updates the Lagrange multipliers αi, while simultaneously keeping an estimate of the parameter vector w available.

That said, this strategy is more general: reuse data several times for optimization while it is in memory. If possible, perform successive updates by changing variables of an optimization that is well-defined regardless of the order in which (and how frequently) data is seen.
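To make the idea concrete, here is a sketch of a dual update that stays well-defined no matter how often an instance is revisited. It is plain dual coordinate descent for the hinge-loss linear SVM (the style of update used in LibLinear's dual solver), not the StreamSVM code, and the "buffer" is simply a small array that we loop over several times while it sits in memory. The function name and toy data are made up.

```python
import numpy as np

def dual_svm_in_memory(X, y, C=1.0, passes_in_memory=10):
    """Dual coordinate descent for a linear SVM, reusing in-memory data several times."""
    n, d = X.shape
    alpha = np.zeros(n)                      # one Lagrange multiplier per instance
    w = np.zeros(d)                          # kept equal to sum_i alpha_i * y_i * x_i
    sq_norms = np.einsum("ij,ij->i", X, X)
    for _ in range(passes_in_memory):        # revisit the buffer while it is cached
        for i in range(n):
            if sq_norms[i] == 0.0:
                continue
            grad = y[i] * (w @ X[i]) - 1.0   # gradient of the dual objective in alpha_i
            new_alpha = np.clip(alpha[i] - grad / sq_norms[i], 0.0, C)
            w += (new_alpha - alpha[i]) * y[i] * X[i]
            alpha[i] = new_alpha
    return w, alpha

X = np.array([[2.0, 0.0], [1.5, 0.5], [-2.0, 0.0], [-1.5, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, alpha = dual_svm_in_memory(X, y)
print(w, alpha)
```

Each step solves a one-dimensional problem exactly, so seeing the same instance twice never hurts: the second visit either improves the dual objective or leaves it unchanged, and no visit counter is needed.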

The Weisfeiler-Leman algorithm and estimation on graphs

Sat, 15 Sep 2012 07:00:00 GMT

Imagine you have two graphs $G$ and $G'$ and you’d like to check how similar they are. If all vertices have unique attributes this is quite easy:

forall vertices $v \in G \cup G'$ do
- check that $v \in G$ and that $v \in G'$
- check that the neighbors of $v$ are the same in $G$ and $G'$

This algorithm can be carried out in linear time in the size of the graph. Unfortunately, many graphs do not have vertex attributes, let alone unique vertex attributes. In fact, graph isomorphism, i.e. the task of checking whether two graphs are identical, is a hard problem (it is still an open research question how hard it really is). In this case the above algorithm cannot be used since we have no idea which vertices we should match up.

The Weisfeiler-Leman algorithm (see David Bieber’s blog post for a pretty visualization) is a mechanism for assigning fairly unique attributes efficiently. Note that it isn’t guaranteed to work, as discussed in this paper by Douglas in 2011; if it were, it would solve the graph isomorphism problem after all. The idea is to assign fingerprints to vertices and their neighborhoods repeatedly. We assume that vertices have an attribute to begin with. If they don’t, then simply assign all of them the attribute 1. Each iteration proceeds as follows:

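forall vertices $v$ do
- compute a hash of $(a_v, a_{v_1}, \ldots, a_{v_n})$, where $a_{v_1}, \ldots, a_{v_n}$ are the attributes of the neighbors of vertex $v$ (taken as a multiset, e.g. sorted, so that the hash does not depend on the order of the neighbors).
- use the hash as vertex attribute for $v$ in the next iteration.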

The algorithm terminates when this iteration has converged in terms of unique assignments of hashes to vertices. Note that it is not guaranteed to work for all graphs. In particular, it fails for graphs with a high degree of symmetry, e.g. chains, complete graphs, tori and stars. However, whenever it converges to a unique vertex attribute assignment it provides a certificate for graph isomorphism. Moreover, the sets of vertex attributes can be used to show that two graphs are not isomorphic (it suffices to verify that the sets differ at any stage).

Shervashidze et al. 2012 use this idea to define a similarity measure between graphs. Basically the idea is that graphs are most similar if many of their vertex identifiers match since this implies that the associated subgraphs match. Formally, for two graphs $G$ and $G'$, they compute a kernel using

$$k(G, G') = \sum_{d=1}^{D} \lambda_d \sum_{v \in G} \sum_{v' \in G'} \delta\bigl(a(v, d), a(v', d)\bigr)$$

Here $a(v, d)$ denotes the vertex attribute of $v$ after WL iteration $d$. Moreover, $\lambda_d$ are nonnegative coefficients that weigh how much the similarity at level $d$ matters. Rather than a brute-force computation of the above test for equality we can thus sort vertex attribute sets. Note that vertices that have different attributes at any given iteration will never have the same attribute thereafter. This means that we can compare the two sets at all depths cheaply: once the attributes are sorted, each depth needs only a single linear pass over both sets.

A similar trick is possible if we want to regress between vertices on the same graph since we can use the set of attributes that a vertex obtains during the iterations as features. Finally, we can make our life even easier if we don’t compute kernels at all and use a linear classifier on the vertex attributes directly.
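Both the refinement and the kernel fit in a short sketch. Assumptions made here: graphs are adjacency dicts, all vertices start with attribute 1, the "hash" is a lookup table shared by both graphs (so there are no collisions), and the weights cover iterations 0 through D. The function names are chosen for illustration.

```python
from collections import Counter

def wl_histograms(graphs, iterations):
    """Per graph, a list of attribute histograms, one per WL iteration (0..iterations)."""
    labels = [{v: 1 for v in g} for g in graphs]
    hists = [[Counter(l.values())] for l in labels]
    for _ in range(iterations):
        table = {}                            # signature -> compact attribute, shared
        relabeled = []
        for g, l in zip(graphs, labels):
            new = {}
            for v, nbrs in g.items():
                # own attribute plus the sorted multiset of neighbor attributes
                signature = (l[v], tuple(sorted(l[u] for u in nbrs)))
                new[v] = table.setdefault(signature, len(table))
            relabeled.append(new)
        labels = relabeled
        for hist, l in zip(hists, labels):
            hist.append(Counter(l.values()))
    return hists

def wl_kernel(g1, g2, weights):
    """Weighted count of matching vertex attributes across WL iterations."""
    h1, h2 = wl_histograms([g1, g2], iterations=len(weights) - 1)
    return sum(lam * sum(c1[a] * c2[a] for a in c1)
               for lam, c1, c2 in zip(weights, h1, h2))

triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
path = {0: [1], 1: [0, 2], 2: [1]}
hexagon = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}

print(wl_kernel(triangle, triangle, [1, 1]), wl_kernel(path, triangle, [1, 1]))
h_hex, h_tri = wl_histograms([hexagon, two_triangles], iterations=3)
print("hexagon vs. two triangles look identical to WL:", h_hex == h_tri)
```

The last line shows the failure mode on highly symmetric graphs: a 6-cycle and two disjoint triangles are not isomorphic, yet every vertex has degree two, so the histograms agree at every iteration. The per-iteration attributes collected along the way are also exactly what you would feed to a linear classifier on vertices.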

In defense of keeping data private

Wed, 15 Aug 2012 07:00:00 GMT

This is going to be contentious. And it somewhat goes against a lot of things that researchers hold holy. And it goes against my plan of keeping philosophy out of this blog. But it must be said since remaining silent has the potential of damaging science with proposals that sound good and are bad.

The proposal is that certain conferences make it mandatory to publish datasets that were used for the experiments. This is a very bad idea and two things are getting confused here: scientific progress and common access. These two are not identical. Reproducibility is often confused with common access. To make these things a bit more clear, here’s an example where it’s more obvious:

CERN is a monster machine. There’s only one of its kind in the world. There are limited resources and it’s impossible for any arbitrary researcher to reproduce their experiments, simply because the average physicist is short of the tens of billions of dollars that it took to build it. Access to the accelerator is also limited. It requires qualification and resource planning. So, even if we think this is open, it isn’t really as open as it looks. And yes, working at CERN gives you an unfair advantage over all the researchers who don’t.

Likewise take medical research. Patient records are covered by HIPAA privacy constraints and there is absolutely no way for such records to be publicly released. The participants sign an entire chain of documents that tie them to not releasing such data publicly. In other words, common access is impossible. Reproducibility would require that someone, who wants to test a contentious result, needs to sign corresponding privacy documents before accessing the data. And yes, working with the ‘right’ hospitals gives you an unfair advantage over researchers who didn’t work building this relationship.

Lastly, user data on the internet. Users have every right for their comments, content, images, mails, etc. to be treated with the utmost respect and to be published only when it is in their interest and with their permission to do so. I believe that there is a material difference between data being made available for analytics purposes in a personalization system and data being made available ‘in the raw’ for any researcher to play with. The latter allows for individuals to inspect particular records and learn that Alice mailed Bob a love letter. Something that would make Charlie very upset if he found out. Hence common access is a non-starter.

There are very clear financial penalties for releasing private data (hello Cambridge Analytica) - users would leave the service. Moreover, it would give a competitor an advantage over the releasing party. Since the data is largely collected by private parties at their own expense, mandating its release is not realistic.

As for reproducibility - this is an issue. But provided that in case of a contentious result it is possible for a trusted researcher to check it, possibly after signing an NDA, this can be addressed. And yes, working for one of these companies gives you an unfair advantage.

In summary, while releasing data is desirable, I strongly disagree with a mandatory data publication policy. Yes, every effort should be made personally by researchers to see whether some data is releasable. And for publicly funded research this may well be the right thing to do. But to mandate it for industry would essentially do two things. First, it will make industrial research even more secretive than it already is (and that’s a terrible thing). Second, it will make academic research less relevant for real problems (I’ve seen my fair share and am guilty of my fair share of such papers).

MLSS Purdue

Sun, 15 Jul 2012 07:00:00 GMT

The videos from MLSS 2011 in Purdue are now available online. Unfortunately the 2011 MLSS website is gone and with it all the playlists and PDFs. The only thing left is a group photo and departmental announcement. But you can still find the videos courtesy of YouTube. Some of them have aged very well.

Random numbers in constant storage

Fri, 15 Jun 2012 07:00:00 GMT

Many algorithms require random number generators to work. For instance, locality sensitive hashing requires one to compute the random projection matrix $P$ in order to compute the hashes $z = Px$. Likewise, fast eigenvalue solvers in large matrices often rely on a random matrix, e.g. the work by Halko, Martinsson and Tropp, SIAM Review 2011, which assumes that at some point we multiply a matrix $M$ by a matrix $P$ with Gaussian random entries.

The problem with these methods is that if we want to perform this projection operation in many places, we need to distribute the matrix to several machines. This is undesirable since a) it introduces another stage of synchronization between machines and b) it requires space to store the matrix in the first place. The latter is often bad since memory access can be much slower than computation, depending on how the memory is being accessed. The prime example here is multiplication with a sparse matrix which would require random memory access.

One way to circumvent this is to share the random seed and then recompute the random matrix from scratch. But this means that we’re critically relying on the implementation of a random number generator. Even worse, we still need to store the entire matrix. What if we could simply access any element of the matrix at will without overhead?

Here’s where hashing comes to the rescue. To motivate things consider the case where the entries of $P$ are all drawn from the uniform distribution $U[0, 1]$. For a hash function $h$ with range $[0, N]$ simply set $P_{ij} = h(i, j)/N$. Since hash functions map pairs $(i, j)$ to uniformly distributed, uncorrelated numbers in the range $[0, N]$ this essentially amounts to uniformly distributed random numbers that can be recomputed on the fly.

A slightly more involved example is how to draw Gaussian random variables. We may e.g. resort to the Box-Müller transform which shows how to convert two uniformly distributed random numbers into two Gaussians (the image on top is copied from the Wikipedia article). While being quite wasteful (we use two random numbers rather than one), we simply use two uniform hashes $h$ and $h'$ and then compute

$$P_{ij} = \sqrt{-2 \log \frac{h(i, j)}{N}} \cdot \cos\left(\frac{2 \pi h'(i, j)}{N}\right)$$

Since this is known to generate Gaussian random variables from uniform random variables this will give us Gaussian distributed hashes. Similar tricks work for other random variables. It means that things like Random Kitchen Sinks, Locality Sensitive Hashing, and related projection methods never really need to store the ‘random’ projection coefficients whenever memory is at a premium or whenever it would be too costly to synchronize the random numbers.
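As a minimal sketch of this idea, the snippet below recomputes uniform and Gaussian matrix entries from hashes on demand. The choice of BLAKE2 as hash function, the `salt` argument, and the names `uniform_entry` and `gaussian_entry` are assumptions made for illustration. Any well-mixed hash would do.

```python
import hashlib
import math

N = 2 ** 64  # range of the hash function

def h(i, j, salt=""):
    """Hash the pair (i, j) to an integer in [0, N)."""
    digest = hashlib.blake2b(f"{salt}:{i}:{j}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def uniform_entry(i, j, salt=""):
    """Entry P[i, j] drawn 'uniformly' from (0, 1]; shifted to avoid log(0) later."""
    return (h(i, j, salt) + 1) / (N + 1)

def gaussian_entry(i, j, salt=""):
    """Entry P[i, j] via Box-Muller applied to two independent uniform hashes."""
    u1 = uniform_entry(i, j, salt + "a")
    u2 = uniform_entry(i, j, salt + "b")
    return math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)

def project(x, rows, salt=""):
    """Compute z = P x for a dense vector x without ever storing P."""
    return [sum(gaussian_entry(r, c, salt) * xc for c, xc in enumerate(x))
            for r in range(rows)]
```

Two machines that agree on the salt obtain identical projections, e.g. `project([1.0, 2.0, 3.0], rows=4, salt="lsh")`, without exchanging a single matrix entry.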

Update - recently someone proposed to use only zeros and ones in the initialization of a deep network with the express purpose of making things reproducible. The paper has a fair amount of analysis in it but all of this can be made redundant simply by using hash functions instead. It requires at most sharing of a ‘salt’ and an agreed-upon convention for referencing weights in the network (e.g. weight and coordinate). Presto, a deterministic initialization that is reproducible without the need for fancy math.

The Neal Kernel and Random Kitchen Sinks

Sun, 15 Apr 2012 07:00:00 GMT

So you read a book on Reproducing Kernel Hilbert Spaces and you’d like to try out this kernel thing. But you’ve got a lot of data and most algorithms will give you an expansion that requires a number of kernel functions linear in the amount of data. Not good if you’ve got millions to billions of instances.

You could try out low rank expansions such as the Nyström method of Seeger and Williams, 2000, the randomized Sparse Greedy Matrix Approximation of Smola and Schölkopf, 2000 (the Nyström method is a special case where we only randomize by a single term), or the very efficient positive diagonal pivoting trick of Scheinberg and Fine, 2001. Alas, all those methods suffer from a serious problem: at training you need to multiply by the inverse of the reduced covariance matrix, which is $O(d^2)$ cost for a $d$ dimensional expansion. An example of an online algorithm that suffers from the same problem is this (NIPS award winning) paper of Csato and Opper, 2002. Assuming that we’d like to have $d$ grow with the sample size this is not a very useful strategy. Instead, we want to find a method which has $O(d)$ cost for $d$ attributes yet shares good regularization properties that can be properly analyzed.

Enter Radford Neal’s seminal paper from 1994 on Gaussian Processes (a famous NIPS reject). In it he shows that a Neural Network with an infinite number of nodes and a Gaussian Prior over coefficients converges to a GP. More specifically, we get the kernel

$$k(x, x') = \mathbf{E}_c\left[f_c(x) f_c(x')\right]$$

Here $f_c(x)$ is a function parametrized by $c$, e.g. the location of a basis function, the degree of a polynomial, or the direction of a Fourier basis function. There is also a discussion regarding RKHS in a paper by Smola, Schölkopf and Müller, 1998 that discusses this phenomenon in regularization networks. These ideas were promptly forgotten by their authors. One exception is the empirical kernel map where one uses a generic design matrix that is generated through the observations directly.

It was not until the paper by Rahimi and Recht, 2008 on random kitchen sinks that this idea regained popularity. In a nutshell the algorithm works as follows: Draw $d$ values $c_1, \ldots, c_d$ from the distribution over $c$. Use the corresponding basis functions $f_{c_i}(x)$ in a linear model with quadratic penalty on the expansion coefficients. This method works whenever the basis functions are well bounded. For instance, for the Fourier basis the functions are bounded by $1$. The proof of convergence of the explicit function expansion to the kernel is then a simple consequence of Chernoff bounds.

In the random kitchen sinks paper Rahimi and Recht discuss RBF kernels and binary indicator functions. However, this works more generally for any set of well behaved basis functions used in generating a random design matrix. A few examples:

Fourier basis with Gaussian parameters. Take functions of the form $f_c(x) = e^{i \langle c, x \rangle}$ where the coefficients $c$ are drawn from a Gaussian. This is the random kitchen sinks paper. Obviously you can use hash functions rather than an actual random number generator. This ensures that you don’t need to store all parameters $c$.
Pick random separating hyperplanes. This will effectively give you functions of bounded variation.
Use the empirical kernel map, i.e. we use some function $f_c(x)$ for which we employ $c = x_i$ for a random subset of the data we wish to train on.
Pick suitable binary features obtained via binning (the figure on top, taken from Rahimi and Recht, 2008, shows quite impressive results in lower dimensions).
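As a rough illustration of the Fourier case, the sketch below draws Gaussian directions and checks that inner products of the random features approximate a Gaussian RBF kernel. It uses the real-valued variant $\sqrt{2/d} \cos(\langle c, x \rangle + b)$ with a uniform random phase $b$ instead of the complex exponential, and unit kernel width. Both are assumptions made for the example, as is the helper name `make_features`.

```python
import math
import random

def make_features(dim, n_features, seed=0):
    """Return a feature map phi(x) built from n_features random Fourier directions."""
    rng = random.Random(seed)
    cs = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_features)]
    bs = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    scale = math.sqrt(2.0 / n_features)

    def phi(x):
        return [scale * math.cos(sum(ck * xk for ck, xk in zip(c, x)) + b)
                for c, b in zip(cs, bs)]
    return phi

def rbf(x, y):
    """Gaussian RBF kernel with unit width, the limit the features converge to."""
    return math.exp(-0.5 * sum((a - b) ** 2 for a, b in zip(x, y)))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

phi = make_features(dim=3, n_features=4000)
x, y = [0.1, 0.2, 0.3], [0.5, -0.1, 0.0]
approx, exact = dot(phi(x), phi(y)), rbf(x, y)
```

Fitting a ridge regression on `phi(x)` then plays the role of the linear model with quadratic penalty, at a cost linear in the number of features per observation.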

Big Learning: Algorithms, Systems, and Tools for Learning at Scale

Mon, 15 Aug 2011 07:00:00 GMT

We’re organizing a workshop at NIPS 2011. Submissions are solicited for a two day workshop December 16-17 in Sierra Nevada, Spain.

This workshop will address tools, algorithms, systems, hardware, and real-world problem domains related to large-scale machine learning (“Big Learning”). The Big Learning setting has attracted intense interest with active research spanning diverse fields including machine learning, databases, parallel and distributed systems, parallel architectures, and programming languages and abstractions. This workshop will bring together experts across these diverse communities to discuss recent progress, share tools and software, identify pressing new challenges, and to exchange new ideas. Topics of interest include (but are not limited to):

Hardware Accelerated Learning: Practicality and performance of specialized high-performance hardware (e.g. GPUs, FPGAs, ASIC) for machine learning applications.
Applications of Big Learning: Practical application case studies; insights on end-users, typical data workflow patterns, common data characteristics (stream or batch); trade-offs between labeling strategies (e.g., curated or crowd-sourced); challenges of real-world system building.
Tools, Software, & Systems: Languages and libraries for large-scale parallel or distributed learning. Preference will be given to approaches and systems that leverage cloud computing (e.g. Hadoop, DryadLINQ, EC2, Azure), scalable storage (e.g. RDBMs, NoSQL, graph databases), and/or specialized hardware (e.g. GPU, Multicore, FPGA, ASIC).
Models & Algorithms: Applicability of different learning techniques in different situations (e.g., simple statistics vs. large structured models); parallel acceleration of computationally intensive learning and inference; evaluation methodology; trade-offs between performance and engineering complexity; principled methods for dealing with large number of features.

Submissions should be written as extended abstracts, no longer than 4 pages (excluding references) in the NIPS style. Relevant work previously presented in non-machine-learning conferences is strongly encouraged. Exciting work that was recently presented is allowed, provided that the extended abstract mentions this explicitly.

Update - the website no longer exists but YouTube has a list of all the talks.

Introduction to Graphical Models

Wed, 10 Aug 2011 07:00:00 GMT

Here are the slides [Keynote, PDF] for a basic course on Graphical Models for the Internet that I’m giving at MLSS 2011 in Purdue that Vishy Vishwanathan is organizing. The selection is quite biased, limited, and subjective, but it’s meant to complement the other classes at the summer school.

The slides are likely to grow, so in case of doubt, check for updates. Comments are most welcome. And yes, it’s a horribly incomplete overview, due to space and time constraints.

Distributed synchronization with the distributed star

Fri, 05 Aug 2011 07:00:00 GMT

Here’s a simple synchronization paradigm between many computers that scales with the number of machines involved and which essentially keeps cost at $O(1)$ per machine. For lack of a better name I’m going to call it the distributed star since this is what the communication looks like. It’s quite similar to how memcached stores its (key,value) pairs.

Assume you have $n$ computers, each of which has a copy of a large parameter vector (typically several GB) and we would like to keep these copies approximately synchronized.

A simple version would be to pause the computers occasionally, have them send their copies to a central node, and then return with a consensus value. Unfortunately this takes $O(\log n)$ time if we aggregate things on a tree (we can reduce it by streaming data through but this makes the code a lot more tricky). Furthermore we need to stop processing while we do so. The latter may not even be possible and any local computation is likely to benefit from having the most up-to-date parameters.

Instead, we use the following: assume that we can break up the parameter vector into smaller (key, value) pairs that need synchronizing. We now have each computer send its local changes for each key to a central server, update the parameters there, and later receive information about global changes. So far this algorithm looks stupid - after all, when using $n$ machines it would require $O(n)$ time to process since the central server is the bottleneck. This is where the distributed star comes in. Instead of keeping all data on a single server, we use the well known distributed hashing trick and send it to a machine $m$ from a pool $P$ of servers:

$$m(\mathrm{key}, P) = \mathop{\mathrm{argmin}}_{m' \in P} h(\mathrm{key}, m')$$

Here $h$ is the hash function. Such a system spreads communication evenly and it leads to an $O(n/|P|)$ load per machine. In particular, if we make each of the computers involved in the local computation also members of the pool, i.e. if we have $n = |P|$, we get an $O(1)$ cost for keeping terms synchronized regardless of the number of machines involved.
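The assignment rule above can be sketched in a few lines. Picking the pool member with the smallest hash of (key, machine) is, as far as I can tell, what is commonly called rendezvous or highest-random-weight hashing. The machine names, the key format and the use of BLAKE2 below are assumptions made for the example.

```python
import hashlib
from collections import Counter

def h(key, server):
    """Hash a (key, server) pair to a 64 bit integer."""
    digest = hashlib.blake2b(f"{key}|{server}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def assign(key, pool):
    """Server responsible for `key`: the pool member with the smallest hash."""
    return min(pool, key=lambda server: h(key, server))

pool = [f"machine-{i}" for i in range(8)]
keys = [f"param-{k}" for k in range(8000)]

# Each machine ends up responsible for roughly 1/|P| of the keys.
load = Counter(assign(key, pool) for key in keys)
```

A useful side effect of this rule is that removing a machine from the pool only reassigns the keys that machine was responsible for. All other keys keep their server.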

Obvious approximations: we assume that all machines are on the same switch. Moreover we assume that the times to open a TCP/IP connection are negligible (we keep them open after the first message) relative to the work to transmit the data.

The reason I’m calling this a distributed star is that for each key we have a star communication topology, it’s just that we use a different star for each key. If anyone in systems knows what this thing is really called, I’d greatly appreciate feedback. Memcached uses the same setup, alas it doesn’t have versioned writes and callbacks, so we had to build our own system using ICE.

PS: thanks to the Hubble Space telescope for the image of a star distribution in a bright cluster.

Speeding up Latent Dirichlet Allocation

Mon, 01 Aug 2011 07:00:00 GMT

The code to our LDA implementation on Hadoop is released on Github under the Mozilla Public License. It’s seriously fast and scales very well to 1000 machines or more (don’t worry, it runs on a single machine, too). We believe that at present this is the fastest implementation you can find, in particular if you want to have a) 1000s of topics, b) a large dictionary, c) a large number of documents, and d) Gibbs sampling. It handles quite comfortably a billion documents. Shravan Narayanamurthy deserves all the credit for the code. The paper describing an earlier version of the system appeared in VLDB 2010.

Some background: Latent Dirichlet Allocation by Blei, Jordan and Ng, 2003 is a great tool for aggregating terms beyond what simple clustering can do. While the original paper showed exciting results it wasn’t terribly scalable. A significant improvement was the collapsed sampler of Griffiths and Steyvers, 2004. The key idea was that in an exponential families model with conjugate prior you can integrate out the natural parameter, thus providing a sampler that mixed much more rapidly. It uses the following update equation to sample the topic for a word.

$$p(t \mid d, w) \propto \frac{n^*(t, d) + \alpha_t}{n^*(d) + \sum_{t'} \alpha_{t'}} \cdot \frac{n^*(t, w) + \beta_w}{n^*(t) + \sum_{w'} \beta_{w'}}$$

Here $t$ denotes the topic, $d$ the document, $w$ the word, and $n(t, d)$, $n(d)$, $n(t, w)$, $n(t)$ denote the number of words which satisfy a particular (topic, document), (document), (topic, word), (topic) combination respectively. The starred quantities such as $n^*(t, d)$ simply mean that we use the counts where the current word for which we need to resample the topic is omitted.

Unfortunately the above formula is quite slow when it comes to drawing from a large number of topics. Worst of all, it is nonzero throughout. A rather ingenious trick was proposed by Yao, Mimno, and McCallum, 2009. It uses the fact that the relevant terms in the sum are sparse and only the $\alpha$ and $\beta$-dependent terms are dense (and obviously the number of words per document doesn’t change, hence we can drop that, too). Dropping the common denominator and writing $\bar\beta := \sum_{w'} \beta_{w'}$ we arrive at

$$p(t \mid d, w) \propto \frac{\alpha_t \beta_w}{n^*(t) + \bar\beta} + n^*(t, d) \frac{\beta_w}{n^*(t) + \bar\beta} + \frac{n^*(t, w) \left[n^*(t, d) + \alpha_t\right]}{n^*(t) + \bar\beta}$$

Out of these three terms, only the first one is dense, all others are sparse. Hence, if we knew the sum over $t$ for all three summands we could design a sampler which first samples which of the blocks is relevant and then which topic within that block. This is efficient since the first term doesn’t actually depend on $n(t, w)$ or $n(t, d)$ but rather only on $n(t)$ which can be updated efficiently after each new topic assignment. In other words, we are able to update the dense term in $O(1)$ operations after each sampling step and the remaining terms are all sparse. This trick gives a 10-50 times speedup in the sampler over a dense representation.
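The decomposition into three blocks can be checked numerically. The sketch below is a toy version that assumes small dense Python lists for the counts (with the current word already removed) and made-up hyperparameters. A real sampler would cache the block sums and only iterate over the nonzero entries of the two sparse blocks.

```python
def dense_weights(n_td, n_tw, n_t, alpha, beta_w, beta_bar):
    """Unnormalized collapsed Gibbs weights, one per topic."""
    return [(n_td[t] + alpha[t]) * (n_tw[t] + beta_w) / (n_t[t] + beta_bar)
            for t in range(len(n_t))]

def bucket_weights(n_td, n_tw, n_t, alpha, beta_w, beta_bar):
    """Split the same weights into a dense block and two sparse blocks."""
    topics = range(len(n_t))
    smooth = [alpha[t] * beta_w / (n_t[t] + beta_bar) for t in topics]          # dense
    doc = [n_td[t] * beta_w / (n_t[t] + beta_bar) for t in topics]              # topics in document
    word = [n_tw[t] * (n_td[t] + alpha[t]) / (n_t[t] + beta_bar) for t in topics]  # topics of word
    return smooth, doc, word

def sample_topic(buckets, u):
    """Pick a block, then a topic inside it, from a single uniform draw u in [0, 1)."""
    x = u * sum(sum(bucket) for bucket in buckets)
    for bucket in buckets:
        mass = sum(bucket)
        if x < mass:
            for t, weight in enumerate(bucket):
                if x < weight:
                    return t
                x -= weight
        x -= mass
    return len(buckets[0]) - 1  # guard against floating point round-off

# Toy counts for 4 topics; beta_bar assumes 1000 words with beta = 0.01 each.
n_td, n_tw, n_t = [3, 0, 0, 1], [0, 0, 5, 2], [50, 40, 60, 30]
alpha, beta_w, beta_bar = [0.1] * 4, 0.01, 10.0
buckets = bucket_weights(n_td, n_tw, n_t, alpha, beta_w, beta_bar)
dense = dense_weights(n_td, n_tw, n_t, alpha, beta_w, beta_bar)
```

In this toy example the document block is nonzero for two topics and the word block for two topics, which is where the savings come from once the number of topics reaches the thousands.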

To combine several machines we have two alternatives: one is to perform one sampling pass over the data and then reconcile the samplers. This was proposed by Newman, Asuncion, Smyth, and Welling, 2009. While the approach proved to be feasible, it has a number of disadvantages. It only exercises the network while the CPU sits idle and vice versa. Secondly, a deferred update makes for slower mixing. Instead, one can simply have each sampler communicate with a distributed central storage continuously. In a nutshell, each node sends the differential to the global statekeeper and receives from it the latest global value. The key point is that this occurs asynchronously and moreover that we are able to decompose the state over several machines such that the available bandwidth grows with the number of machines involved. More on such distributed schemes in a later post.

Bloom Filters

Sat, 30 Jul 2011 07:00:00 GMT

Bloom filters are one of the really ingenious and simple building blocks for randomized data structures. A great summary is the paper by Broder and Mitzenmacher, 2005. The figure above is from their paper. In this post I will briefly review the key ideas since the Bloom filter forms the basis of the Count-Min sketch of Cormode and Muthukrishnan, 2003. It will also be necessary for an accelerated version of the graph kernel of Shervashidze et al., 2011, and finally, a similar structure will be needed to compute data streams over time for a real-time sketching service.

At its heart a Bloom filter uses a bit vector $b$ of length $m$ and a set of $k$ hash functions mapping arbitrary keys $x$ into their hash values $h_i(x) \in \{1, \ldots, m\}$ where $i \in \{1, \ldots, k\}$ denotes the hash function. The Bloom filter allows us to perform approximate set membership tests where we have no false negatives but we may have a small number of false positives.

Initialize(b)
Set all $b[i] = 0$
Insert(b, x)
For all $i \in \{1, \ldots, k\}$ set $b[h_i(x)] = 1$
Query(b, x)
Return true if $b[h_i(x)] = 1$ for all $i \in \{1, \ldots, k\}$. Return false otherwise.

Furthermore, unions and intersections between sets are easily achieved by performing bit-wise OR and AND operations on the bloom hashes of the corresponding sets respectively. Since these are bit-wise operations, they can be very fast, essentially operating at memory speed.
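The three operations and the bit-wise union translate almost literally into code. The sketch below is a minimal version that stores the bit vector in a Python integer and derives the $k$ hash functions by salting BLAKE2 with the index $i$. Both choices are assumptions for the example rather than anything prescribed by the paper.

```python
import hashlib

class BloomFilter:
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = 0  # a Python int serves as the m-bit vector b

    def _positions(self, x):
        """Yield h_1(x), ..., h_k(x), each in [0, m)."""
        for i in range(self.k):
            digest = hashlib.blake2b(f"{i}:{x}".encode(), digest_size=8).digest()
            yield int.from_bytes(digest, "big") % self.m

    def insert(self, x):
        for pos in self._positions(x):
            self.bits |= 1 << pos

    def query(self, x):
        """True if x may be in the set, False if it is definitely not."""
        return all((self.bits >> pos) & 1 for pos in self._positions(x))

    def union(self, other):
        """Filter for the union of both sets (same m and k assumed)."""
        out = BloomFilter(self.m, self.k)
        out.bits = self.bits | other.bits
        return out

seen = BloomFilter(m=2000, k=5)
for item in range(100):
    seen.insert(f"item-{item}")
```

Here `seen.query("item-7")` returns true, while a query for a key that was never inserted returns false except for the occasional false positive.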

It is clear that if we inserted $x$ into the Bloom filter the query will return true, since all relevant bits in $b$ are set to $1$. But a query for an item that was never inserted could get unlucky. To analyze the probability of a false positive take the probability of a bit being 1. After inserting $n$ items using $k$ hash functions over a range of $m$ we have

$$\Pr(b[i] = 1) = 1 - \left(1 - \frac{1}{m}\right)^{kn} \approx 1 - e^{-\frac{kn}{m}}$$

For a false positive to occur we need to have all $k$ bits associated with the hash functions to be 1. Ignoring the fact that the hash functions might collide, i.e. $h_i(x) = h_j(x)$ for some $i \neq j$, the probability of false positives is given by

$$p \approx \left(1 - e^{-\frac{kn}{m}}\right)^k$$

Taking derivatives with respect to $k$ shows that the minimum is obtained for $e^{-\frac{kn}{m}} = \frac{1}{2}$, that is $k = \frac{m}{n} \log 2$. In other words, the optimal $k$ decreases with the fill rate $n/m$ of the Bloom filter. This makes sense since a higher fill rate leads to a higher collision probability and we shouldn’t try to ‘overfill’ the filter. Plugging the optimal value back into $p$ yields $p = 2^{-k} = \exp\left(-\frac{m}{n} \log^2 2\right)$, i.e. the false positive probability shrinks exponentially in the number of bits per item $m/n$, albeit slowly at rate $\log^2 2 \approx 0.48$.
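To get a feeling for the numbers, the following sketch evaluates the approximate false positive rate $(1 - e^{-kn/m})^k$ and the optimal number of hash functions $k = \frac{m}{n} \log 2$. The sizes are arbitrary example values and the function names are made up.

```python
import math

def false_positive_rate(m, n, k):
    """Approximate false positive probability for n items, m bits, k hashes."""
    return (1.0 - math.exp(-k * n / m)) ** k

def optimal_k(m, n):
    """Real-valued minimizer of the false positive rate."""
    return (m / n) * math.log(2)

m, n = 10_000, 1_000            # 10 bits per stored item
k_star = optimal_k(m, n)        # about 6.93, so one would use 7 hash functions
best_integer_k = min(range(1, 31), key=lambda k: false_positive_rate(m, n, k))
rate = false_positive_rate(m, n, best_integer_k)   # a bit below 1%
```

With 10 bits per item the false positive rate is thus a little under one percent, and doubling the memory roughly squares it.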

One of the really nice properties of the Bloom filter is that all memory is used to store the information about the set rather than an index structure storing the keys of the items. The downside is that it is impossible to read out the stored set without knowing the queries. Also note that it is impossible to remove items from the Bloom filter once they’ve been inserted. After all, we do not know whether some of the bits might have collided with another key, hence setting the corresponding bits to 0 would cause false negatives.

Real simple covariate shift correction

Mon, 25 Jul 2011 07:00:00 GMT

Imagine you want to design some algorithm to detect cancer. You get data of healthy and sick people; you train your algorithm; it works fine, giving you high accuracy and you conclude that you’re ready for a successful career in medical diagnostics. Not so fast. Many things could go wrong. In particular, the distributions that you work with for training and those in the wild might differ considerably.

This happened to an unfortunate startup I had the opportunity to consult for many years ago. They were developing a blood test for prostate cancer, a disease that affects mainly older men. Luckily for them they’d managed to obtain a fair amount of blood samples from patients. It is considerably more difficult, though, to obtain blood samples from healthy men, mainly for ethical reasons. For instance, what do you do with positives? To compensate for that, they asked a large number of students on campus to donate blood and they performed their test. Then they asked me whether I could help them build a classifier to detect the disease. I told them that it would be very easy to distinguish between both datasets with probably near perfect accuracy. After all, the test subjects differed in age, hormone level, physical activity, diet, alcohol consumption, and many more factors unrelated to the disease. This was unlikely to be the case with real patients: Their sampling procedure had caused an extreme case of covariate shift that couldn’t be corrected by conventional means. In other words, training and test data were so different that nothing useful could be done and they had wasted significant amounts of money.

In general the situation is not quite so dire. Assume that we want to estimate some dependency $p(y|x)$ for which we have labeled data $(x_i, y_i)$. Alas, the observations $x_i$ are drawn from some distribution $q(x)$ rather than the ‘proper’ distribution $p(x)$. If we adopt a risk minimization approach, that is, if we want to solve

$$\mathop{\mathrm{minimize}}_f \frac{1}{m} \sum_{i=1}^m l(x_i, y_i, f(x_i)) + \frac{\lambda}{2} \|f\|^2$$

we will need to re-weight each instance by the ratio of probabilities that it would have been drawn from the correct distribution. That is, we need to reweight things by $\beta(x) := \frac{p(x)}{q(x)}$. This is the ratio of how frequently the instances would have occurred in the correct set vs. how frequently they occurred with the sampling distribution $q$. It is sometimes also referred to as the Radon-Nikodym derivative. Such a method is called importance sampling and the following derivation shows why it is valid:

$$\int f(x) \, dp(x) = \int f(x) \frac{dp(x)}{dq(x)} \, dq(x) = \int f(x) \beta(x) \, dq(x)$$

Alas, we do not know $\beta(x)$. Before we can do anything useful we need to estimate the ratio. Many methods are available, e.g. some rather fancy operator theoretic ones which try to recalibrate the expectation operator directly using a minimum-norm or a maximum entropy principle. However, there exists a much more pedestrian, yet quite effective approach that will give almost as good results: logistic regression.

After all, we know how to estimate probability ratios. This is achieved by learning a classifier to distinguish between data drawn from $p$ and data drawn from $q$. If it is impossible to distinguish between the two distributions then it means that the associated instances are equally likely to come from either one of the two distributions. On the other hand, any instances that can be well discriminated should be significantly over/underweighted accordingly. For simplicity’s sake assume that we have an equal number of instances from both distributions, denoted by $x_i \sim p(x)$ and $x_i' \sim q(x)$ respectively. Now denote by $z_i$ labels which are $1$ for data drawn from $p$ and $-1$ for data drawn from $q$. Then the probability in a mixed dataset is given by

$$p(z = 1 \mid x) = \frac{p(x)}{p(x) + q(x)} \text{ and hence } \frac{p(z = 1 \mid x)}{p(z = -1 \mid x)} = \frac{p(x)}{q(x)}$$

Hence, if we use a logistic regression approach which yields $p(z = 1 \mid x) = \frac{1}{1 + e^{-f(x)}}$, it follows (after some simple algebra) that

$$\beta(x) = \frac{1 / (1 + e^{-f(x)})}{e^{-f(x)} / (1 + e^{-f(x)})} = e^{f(x)}$$

Now we only need to solve the logistic regression problem (or use some convenient AutoML tool such as AutoGluon)

$$\mathop{\mathrm{minimize}}_f \frac{1}{2m} \sum_{(x, z)} \log\left[1 + \exp(-z f(x))\right] + \frac{\lambda}{2} \|f\|^2$$

to obtain $f$. Subsequently we can use $e^{f(x_i)}$ as covariate shift correction weights in training our actual classifier. The good news is that we can use an off-the-shelf tool to deal with a decidedly nonstandard estimation problem. Note that the weights can be quite a bit off. This is fixable, e.g. by using an estimator that isn’t prone to this, or by using conformal methods. But that’s a topic for another day.
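Here is a toy version of the whole recipe in one dimension. It assumes that the 'proper' data comes from a unit-variance Gaussian centered at 1 and the sampled data from a standard Gaussian, so the true ratio is $e^{x - 1/2}$ and we can check the estimate. The linear model $f(x) = ax + b$, the plain gradient descent and all constants are assumptions made for the example.

```python
import math
import random

def fit_logistic(xs, zs, steps=300, lr=1.0, lam=1e-3):
    """Fit f(x) = a*x + b by gradient descent on the regularized logistic loss."""
    a = b = 0.0
    n = len(xs)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, z in zip(xs, zs):
            # d/df log(1 + exp(-z f)) = -z / (1 + exp(z f))
            g = -z / (1.0 + math.exp(z * (a * x + b)))
            grad_a += g * x
            grad_b += g
        a -= lr * (grad_a / n + lam * a)
        b -= lr * (grad_b / n)
    return a, b

rng = random.Random(0)
target = [rng.gauss(1.0, 1.0) for _ in range(1000)]   # drawn from p, label z = +1
source = [rng.gauss(0.0, 1.0) for _ in range(1000)]   # drawn from q, label z = -1

a, b = fit_logistic(target + source, [1] * 1000 + [-1] * 1000)

def weight(x):
    """Estimated covariate shift correction weight exp(f(x))."""
    return math.exp(a * x + b)

weights = [weight(x) for x in source]  # plug these into the weighted training loss
```

Observations from the sampling distribution that look like they came from the target distribution get weights above 1, the others get weights below 1.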

PS: The figure up top (which has been copied all over the internet, often without attribution) is likely due to Masashi Sugiyama.

Memory Latency, Hashing, Optimal Golomb Rulers and Feistel Networks

Fri, 15 Jul 2011 07:00:00 GMT

In many problems involving hashing we want to look up a range of elements from a vector $v$ where the elements are indicated by a hash function $h$. For instance, we might want to evaluate $v[h(i, j)]$ for arbitrary $i$ and for a range of $j \in \{1, \ldots, n\}$. This happens for matrix multiplication, multiclass classification, collaborative filtering, multitask learning and many related problems.

While this works just fine in terms of estimation performance, traversing all values of $j$ leads to an algorithm which is horrible in terms of memory access patterns. Modern DRAM chips are much faster (over 10x) when it comes to reading values in sequence rather than when carrying out random reads. Update: memory latency hasn’t improved over the past decade. In fact, DDR5 latency can often be higher than DDR4 latency. Furthermore, random access destroys the benefit of a CPU cache. This leads to algorithms which are efficient in terms of their memory footprint, yet which are slow in terms of their actual runtime behavior. One way to address this is to bound the range of $h(i, j)$ for different values of $j$ via one of the following strategies.

Don’t hash the secondary key

Decompose $h(i, j) = h(i) + j$. This is computationally very cheap, it has excellent sequential access properties but it leads to horrible collisions should there ever be two $i$ and $i'$ for which $|h(i) - h(i')| < n$.

Small secondary hash

Decompose $h(i, j) = h(i) + h'(j)$ where $h'(j)$ has a small range of values. This leads to less catastrophic collisions for near-collisions $|h(i) - h(i')| < n$. Nonetheless it is a bad idea since now we have a nontrivial probability of collision as soon as the range of $h'(j)$ is less than $n^2$ due to the birthday paradox. Moreover, for adjacent values $h(i)$ and $h(i')$ we will get many collisions.

Optimal Golomb ruler

Decompose $h(i, j) = h(i) + g(j)$ where $g(j)$ is an Optimal Golomb ruler. The latter is an increasing sequence of integers for which any pairwise distance occurs exactly once. In other words, the condition $g(a) - g(b) = g(c) - g(d)$ implies that $a = c$ and $b = d$. For a more intuitive definition consider the conference room in the diagram above. Any room size can only be obtained in one manner. John Langford proposed this to address the problem. In fact, it solves the collision problem perfectly since there are a) no collisions for a fixed $i$ and b) for neighboring values $h(i)$ and $h(i')$ we will get at most one collision (due to the Golomb ruler property). Alas, this only works for a small number of marks $n$ since finding an Optimal Golomb ruler is hard. While it is currently unknown whether finding such rulers is actually NP hard, only two further optimal rulers were discovered in the past decade.
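The Golomb ruler property and the resulting bound on collisions are easy to verify for a small example. The sketch below uses the ruler $[0, 1, 4, 9, 11]$, which has 5 marks. The helper names are made up for illustration.

```python
from itertools import combinations

def is_golomb(marks):
    """True if every pairwise distance between marks occurs exactly once."""
    diffs = [b - a for a, b in combinations(sorted(marks), 2)]
    return len(diffs) == len(set(diffs))

def shifted_overlap(marks, shift):
    """Number of positions shared by the ruler and a copy shifted by `shift`."""
    return len(set(marks) & {mark + shift for mark in marks})

def location(h_i, j, ruler):
    """Memory location h(i, j) = h(i) + g(j) for the j-th secondary key."""
    return h_i + ruler[j]

ruler = [0, 1, 4, 9, 11]
# Two primary hashes h(i) and h(i') that are `shift` apart collide in at most one slot.
worst_case = max(shifted_overlap(ruler, shift) for shift in range(1, 12))
```

All lookups for a fixed primary key stay within a window of 12 consecutive entries, which is what keeps the memory access local.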

Cryptographic hash

An alternative that works for larger $n$ and that is sufficiently simple to compute is to use cryptography. After all, all we want is that the hash function has small range and that it doesn’t have any self collisions or any systematic collisions. We can achieve this by encrypting $j$ using the key $i$ to generate an encrypted message of $N$ possible values. In other words we use $h(i, j) = h(i) + \mathrm{crypt}(j \mid i, N)$.

Since it is an encryption of $j$, the mapping is invertible and we won’t have collisions for a given value of $i$. Furthermore, for different $i$ the encodings will be uncorrelated (after all, $i$ is the key). Finally, we can control the range $N$ simply by choosing the encryption algorithm. In this case the random memory access is of bounded range, hence the CPU cache will not suffer from too many misses. A particularly nice algorithm is the [Feistel cipher](https://en.wikipedia.org/wiki/Feistel_cipher) which works as follows: define the iterative map

$$f(x, y) = \left(y, x \oplus h(y)\right)$$

As always, $h$ is a hash function. After 4 iterations we obtain an encoding of $(x, y)$ that is cryptographically hard. Now use $x = i$ and $y = j$ to obtain the desired result. This encoding can be effective whenever computation is a lot faster than memory latency.
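A small Feistel network is only a few lines of code. Note that the sketch below takes a slightly different route than setting $x = i$ and $y = j$: it splits $j$ into two halves and feeds the key $i$ into the round hash, which is one way to obtain a keyed permutation $\mathrm{crypt}(j \mid i, N)$ with $N = 2^{2 \cdot \mathrm{bits}}$. This variant, the use of BLAKE2 as round function and the names are assumptions for the example.

```python
import hashlib

def round_hash(y, key, rnd, bits):
    """Round function: hash of (key, round, y) truncated to `bits` bits."""
    digest = hashlib.blake2b(f"{key}:{rnd}:{y}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % (1 << bits)

def feistel(j, key, bits=4, rounds=4):
    """Permute j in [0, 2**(2*bits)) under `key` using a Feistel network."""
    mask = (1 << bits) - 1
    x, y = j >> bits, j & mask
    for rnd in range(rounds):
        # One application of the map (x, y) -> (y, x XOR h(y)).
        x, y = y, x ^ round_hash(y, key, rnd, bits)
    return (x << bits) | y

def location(h_i, i, j):
    """Memory location h(i, j) = h(i) + crypt(j | i) with a range of 256."""
    return h_i + feistel(j, key=i)

offsets_for_key_5 = [feistel(j, key=5) for j in range(256)]
```

Since every round is invertible, the offsets for a fixed key form a permutation of the range, so there are no self collisions, while different keys scramble the range differently.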

PS: The title image is by Cmglee (CC BY-SA 3.0). It depicts a conference room with 10 different configurations, based on an Optimal Golomb ruler $[0, 2, 7, 8, 11]$.

Collaborative Filtering considered harmful

Sun, 10 Jul 2011 07:00:00 GMT

Much excellent work has been published on collaborative filtering, in particular in terms of recovering missing entries in a matrix. The Netflix contest has contributed a significant amount to the progress in the field.

Alas, reality is not quite as simple as that. Very rarely will we ever be able to query a user about arbitrary movies, books, or other objects. Instead, user ratings are typically expressed as preferences rather than absolute statements: a preference for Die Hard, given a generic set of movies only tells us that the user appreciates action movies; however, a preference for Die Hard over Terminator or Rocky suggests that the user might favor Bruce Willis over other action heroes. In other words, the context of user choice is vital when estimating user preferences.

If we attempt to estimate scores $s_{ij}$ of user $i$ regarding item $j$ it is important to use the context within which the ratings have been obtained. For instance, if we are given a sequence of items $j_1, \ldots, j_n$ out of which item $j$ was selected we might want to consider a logistic model of the form:

$$-\log p(j \mid \{j_1, \ldots, j_n\}, i) = \log\left[\sum_{l=1}^{n} e^{s_{i j_l}}\right] - s_{ij}$$

The option of no action is easy to add, simply by adding the null score $s_{i0}$ which captures the event of no action by a user. Shuang Hong Yang tried out this idea to get a significant performance improvement on a number of collaborative filtering datasets. Bottom line - make sure that the problem you’re solving is actually the one that a) generated the data and b) that will help you in practice. That is, in many cases matrix completion is not the problem you want to solve, even though it might win you benchmarks. Obviously the above model is still a gross oversimplification and you’re best advised using the actual interaction order for ranking. But that’s a story for another day.
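As a small illustration of such a context-dependent model, the sketch below computes the log-probability of a choice as a softmax over the scores of the items that were actually offered, optionally including a null score for 'no action'. The scores and the function name are invented for the example. In practice the scores would come from a factorization model.

```python
import math

def choice_log_prob(scores, chosen, null_score=None):
    """log p(chosen | offered items) under a softmax over the offered scores."""
    pool = list(scores.values())
    if null_score is not None:
        pool.append(null_score)      # the option of taking no action
    top = max(pool)
    log_norm = top + math.log(sum(math.exp(s - top) for s in pool))
    return scores[chosen] - log_norm

# Hypothetical scores of one user for the movies shown in one session.
session = {"Die Hard": 2.0, "Terminator": 1.0, "Rocky": 0.0}
p_die_hard = math.exp(choice_log_prob(session, "Die Hard"))
p_with_null = math.exp(choice_log_prob(session, "Die Hard", null_score=1.5))
```

The same score for Die Hard yields a different likelihood depending on what else was on offer, which is precisely the information that plain matrix completion throws away.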

Why?

Tue, 05 Jul 2011 07:00:00 GMT

Some readers might wonder why I’m writing this blog. Here’s an (incomplete) list:

It’s fun.
There are lots of fantastic blogs discussing the philosophy and big questions of machine learning (e.g. John Langford’s hunch.net) but I couldn’t find many covering simple tricks of the trade.
Scientific papers sometimes obscure simple ideas. In the most extreme case, a paper will get rejected if the idea is presented in too simple terms (it happened to me more than once and the paper was praised once the simple parts had been obfuscated). Also, they need to come with ample evidence for why an idea works, strong theoretical guarantees and lots of experiments. This is all needed as a safeguard and it’s really really important. But it often hides the basic idea.
Some ideas are really cute and useful but not big enough to write a paper about. It’s pointless to write 10 pages if the idea can be fully covered in 1 page. We’d need a journal of 1 page ideas to deal with this.
Many practitioners are scared to pick up a paper with many equations but they might be willing to spend 10 minutes reading a blog post.

Hashing for Collaborative Filtering

Fri, 01 Jul 2011 07:00:00 GMT

This is a follow-up on the hashing for linear functions post. It’s based on the HashCoFi paper that Markus Weimer, Alexandros Karatzoglou and I wrote for AISTATS’10. It deals with the issue of running out of memory when you want to use collaborative filtering for very large problems. Here’s the setting:

Assume you want to do Netflix-style collaborative filtering, i.e. you want to estimate entries in a ratings matrix of (user, movie) pairs. A rather effective approach is to use matrix factorization, that is, to approximate $M \approx U V^\top$ where $M$ is the ratings matrix, $U$ is the (tall and skinny) matrix of features for each user, stacked up, and $V$ is the counterpart for movies. This works well for the Netflix prize since the number of users and movies is comparatively small.

In reality we might have, say, 100 million users for whom we might want to recommend products. One option is to distribute all these users over several servers (similar to what a distributed hash table mapping does, e.g. for libmemcached). Alternatively, if we want to keep it all on one server, we’re facing the problem of having to store $10^8 \times 400$ bytes, i.e. 40 GB, if we allocate 400 bytes per user (a rather small footprint: 100 dimensions per user in single precision). Usually this is too big for all but the biggest servers (update - by now you can get super-sized X2i instances on AWS with up to 4 TB capacity). Even worse, suppose that we have user churn. That is, new users might be arriving while old users disappear (obviously we don’t know whether they’ll ever come back again, so we don’t really want to de-allocate the memory devoted to them). Obviously we cannot keep adding more RAM. One possible solution is to store the data on disk and request it whenever a user arrives. This will cost us 5-10 ms latency. An SSD will improve this dramatically but it still limits throughput. Moreover, it’ll require cache management algorithms to interact with the collaborative filtering code.

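Here’s a simple alternative: apply the hashing trick that we used for vectors to matrices. Recall that in the exact case we compute matrix entries via

$$M_{ij} = \sum_{k=1}^{K} U_{ik} V_{jk}$$

where $K$ is the number of feature dimensions. Now denote by $h$ and $h'$ hash functions mapping pairs of integers to a given hash range $\{1, \ldots, N\}$. Moreover, let $\sigma$ and $\sigma'$ be corresponding Rademacher hash functions which return a binary hash in $\{\pm 1\}$. Now replace the above sum via

$$M_{ij} \approx \sum_{k=1}^{K} \sigma(i,k)\, \sigma'(j,k)\, u_{h(i,k)}\, v_{h'(j,k)}$$

What happened is that now all access into $U$ is replaced by access into a vector $u$ of length $N$ (and the same holds true for $V$ and $v$). Why does this work? Firstly, we can prove that if we construct $u$ and $v$ from $U$ and $V$ via

$$u_n = \sum_{(i,k):\, h(i,k) = n} \sigma(i,k)\, U_{ik} \quad \text{and} \quad v_n = \sum_{(j,k):\, h'(j,k) = n} \sigma'(j,k)\, V_{jk}$$

then the approximate version of $M_{ij}$ is an unbiased estimate of the correct $M_{ij}$, with variance that vanishes as the hash range $N$ grows. Getting the exact expressions is a bit tedious and they’re described in the paper. In practice, things are even better than this analysis suggests: since we never use $U$ and $V$ but always $u$ and $v$, we simply optimize with respect to the compressed representation.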

One of the advantages of the compressed representation is that we never really need to have any knowledge of all the rows of the user matrix $U$. In particular, rather than mapping user IDs to rows in $U$ we simply use the user ID as the hash key. If a new user appears, memory is effectively allocated to the new user by means of the hash function. If a user disappears, his parameters will simply get overwritten if we perform stochastic gradient descent with respect to the compressed vectors $u$ and $v$. The same obviously holds for movies or any other entity one would like to recommend.
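Here is a rough Python sketch of hashed matrix factorization trained directly on the compressed vectors. Everything concrete in it is an assumption made for illustration: the hash range `N`, the number of dimensions `K`, the use of salted BLAKE2 as a stand-in for the hash and Rademacher hash functions, the squared loss and the learning rate. It is not the code from the paper.

```python
import hashlib
import random

N = 1 << 16   # hash range, i.e. length of each compressed vector (assumed)
K = 8         # number of latent dimensions per entity (assumed)

rng = random.Random(0)
u = [rng.gauss(0.0, 0.1) for _ in range(N)]   # compressed user parameters
v = [rng.gauss(0.0, 0.1) for _ in range(N)]   # compressed movie parameters

def _hash(salt, entity, k):
    """64-bit hash of the pair (entity, k); the salt selects the hash function."""
    data = f"{entity}|{k}".encode()
    digest = hashlib.blake2b(data, digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "big")

def features(w, side, entity):
    """(slot, sign, signed value) for each of the K dimensions of one entity."""
    out = []
    for k in range(K):
        idx = _hash(side + b"-slot", entity, k) % N                  # hash into the vector
        sgn = 1.0 if _hash(side + b"-sign", entity, k) & 1 else -1.0  # Rademacher hash
        out.append((idx, sgn, sgn * w[idx]))
    return out

def predict(user, movie):
    fu = features(u, b"user", user)
    fv = features(v, b"movie", movie)
    return sum(a[2] * b[2] for a, b in zip(fu, fv))

def sgd_step(user, movie, rating, lr=0.05):
    """One stochastic gradient step on the loss 0.5 * (prediction - rating)^2."""
    fu = features(u, b"user", user)
    fv = features(v, b"movie", movie)
    err = sum(a[2] * b[2] for a, b in zip(fu, fv)) - rating
    for (iu, su, xu), (iv, sv, xv) in zip(fu, fv):
        u[iu] -= lr * err * su * xv
        v[iv] -= lr * err * sv * xu
    return err

for _ in range(200):
    sgd_step("alice", "Die Hard", 4.0)
```

Note that the user ID goes straight into the hash: there is no table of known users, and a user who has never been seen before simply lands on whatever values currently sit in those slots.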

Bottom line - we now can have fast (in memory) access to user parameters regardless of the number of users. The downside is that the latency is still quite high: remember that the hash function requires us to access $u_{h(i,k)}$ for many different values of $k$. This means that each access in $u$ is a cache miss, i.e. it’ll cost us 100-200 ns RAM latency rather than the 10-20 ns we’d pay for burst reads. How to break this latency barrier is the topic of one of the next posts.

PS: The image above depicts a cup of Hash coffee being prepared. The coffee shop in Melbourne has since gone out of business. Hashing hasn’t, though.

Priority Sampling

Thu, 30 Jun 2011 07:00:00 GMT

Tamas Sarlos pointed out a much smarter strategy for obtaining a sparse representation of a (possibly dense) vector: Priority Sampling by Duffield, Lund and Thorup, 2006. The idea is quite ingenious and (surprisingly so) essentially optimal, as Mario Szegedy showed. Here’s the algorithm (please read the previous blog post on vector sparsification for some motivation):

For each $x_i$ compute a priority $p_i = x_i / a_i$ where $a_i \sim U(0, 1]$ is drawn from a uniform distribution.
Let $\tau$ be the $(k+1)$-th largest such priority, where $k$ is the desired number of nonzero entries.
Pick all indices $i$ which satisfy $p_i > \tau$ and assign them the value $s_i = \max(x_i, \tau)$.
Set all other coordinates $s_i$ to $0$.

This provides an estimator with the following properties:

The variance is no larger than that of the best $(k-1)$-sparse estimator.
The entries satisfy $\mathbf{E}[s_i] = x_i$, i.e. the expectation is preserved.
The covariance vanishes, i.e. $\mathbf{E}[s_i s_j] = x_i x_j$ for $i \neq j$.

Note that we assumed that all $x_i \geq 0$. If not, simply apply the same algorithm to $|x_i|$ and return signed versions of the estimate.
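As a minimal sketch, priority sampling fits in a few lines of Python. The function names are made up for this example, and it assumes the threshold is the $(k+1)$-th largest priority, with each kept entry set to the larger of its own value and that threshold.

```python
import math
import random

def priority_sample(x, k, rng=random):
    """Return an unbiased estimate of the non-negative vector x with at most k nonzeros."""
    if k >= len(x):
        return [float(xi) for xi in x]
    # a_i is uniform on (0, 1]; 1 - random() avoids a division by zero.
    priorities = [xi / (1.0 - rng.random()) for xi in x]
    order = sorted(range(len(x)), key=priorities.__getitem__, reverse=True)
    tau = priorities[order[k]]              # the (k+1)-th largest priority
    s = [0.0] * len(x)
    for i in order[:k]:                     # the k entries with the largest priorities
        s[i] = max(x[i], tau)
    return s

def signed_priority_sample(x, k, rng=random):
    """Handle negative entries: sample on |x_i| and put the signs back."""
    s = priority_sample([abs(xi) for xi in x], k, rng)
    return [math.copysign(si, xi) for si, xi in zip(s, x)]

x = [5.0, 1.0, 0.5, 3.0, 0.1, 2.0]
s = priority_sample(x, k=3, rng=random.Random(0))
```

Averaging the estimate over many independent draws should recover $x$, which is a quick way to convince yourself that the expectation is preserved.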

Random elements from a stream

Sat, 25 Jun 2011 07:00:00 GMT

This is a classic trick when dealing with data streams. It shows how to draw a random element from a sequence of instances without knowing beforehand how long the sequence is and which symbols occur. Let us first assume that we knew the identities of all symbols. In this case finding a random symbol would be easy. All we require is that for each symbol $s$ we draw a random variable $\xi_s$ from some distribution and subsequently we choose the symbol

$$s^* = \operatorname{argmin}_s \xi_s$$

Since each $s$ has equal probability of being associated with the smallest value, it follows that the draw is uniformly random. The trouble with this is that we now need to store one floating point number per key and a method to look up its value, e.g. via a dictionary. This is just as tedious as recording the entire stream.

Here’s a better way. Assume that instead of requesting a random variable $\xi_s$ we simply compute the hash $h(s)$ of $s$ and we set

$$s^* = \operatorname{argmin}_s h(s)$$

For a draw of $h$ from the space of hash functions this again is uniform. The advantage is that we essentially determined all the random bits when selecting $h$ rather than at the time when we want to compute its value $h(s)$. The second advantage is that we can now simply keep track of the currently smallest value of $h(s)$ and update $s^*$ as we go along. We have the following algorithm:

INIT
   hstar = MAXINT 
   n = 0 
   sstar = NONE
FOR ALL incoming s DO
   IF h(s) = hstar:
      n = n + 1
   ELSE IF h(s) < hstar:
      n = 1
      hstar = h(s)
      sstar = s
RETURN (sstar, n)

This algorithm returns a symbol drawn uniformly at random from the distinct symbols of the sequence, together with the number of times it occurred. If you want more than one sample, simply keep a list of the $k$ symbols with the smallest hash values and their associated counts. Such algorithms can be used to compute the variance or other moments of a sequence.
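Here is a small Python sketch of the same idea, extended to keep the $k$ symbols with the smallest hash values. The function names are made up for this example, and salted BLAKE2 stands in for a random draw of $h$; any well-mixed hash should do.

```python
import hashlib

def h(s, seed=b"0"):
    """64-bit hash of a symbol; the seed picks the hash function."""
    digest = hashlib.blake2b(repr(s).encode(), digest_size=8, salt=seed).digest()
    return int.from_bytes(digest, "big")

def sample_stream(stream, k=1, seed=b"0"):
    """Return (symbol, count) pairs for the k distinct symbols with the smallest hashes."""
    kept = {}                                # symbol -> [hash, count]
    for s in stream:
        if s in kept:
            kept[s][1] += 1                  # another occurrence of a kept symbol
            continue
        hs = h(s, seed)
        if len(kept) < k:
            kept[s] = [hs, 1]
        else:
            worst = max(kept, key=lambda t: kept[t][0])
            if hs < kept[worst][0]:          # s displaces the largest kept hash
                del kept[worst]
                kept[s] = [hs, 1]
    pairs = [(s, c) for s, (_, c) in kept.items()]
    return sorted(pairs, key=lambda p: h(p[0], seed))

stream = list("abracadabra")
samples = sample_stream(stream, k=2)
```

The counts are exact because a symbol that ends up in the sample must have been admitted at its first occurrence: the admission threshold only ever decreases, so a symbol that was rejected or evicted once can never get back in.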

PS: The picture is from the Penticton Lakeside Resort where you can probably see such bears (no, I never visited).