
A user has spent ten minutes preparing for a small art installation. The final exchange goes like this.
User: Plain-text packet is final. I’ll send tonight and load the hatchback after work.
Model A: Sounds like a clean plan. Drive safe, and go get it.
Model B: Before you go, a fast final check: projector + power cable; media player + charger; file tested and looping; tripod + mounting plate; extension cord; adapter; gaffer tape; measuring tape; small level; wall label on phone; backup copy on USB. If you can, pack the car tonight so tomorrow is just: grab computer, leave, install. Load in reverse order of install — the projector stand and extension cord go in last, so they’re the first things out.
Both responses are fluent. Both end the conversation politely. The first is what most assistants do. The second one notices the disclosed detail — load the hatchback — and turns it into something the user didn’t ask for but will probably need.
The thing without a name
Most LLM benchmarks score how well a model answers what the user asked. That’s the entire shape of the field: a question goes in, an answer comes out, and we measure how good the answer is. Multi-turn benchmarks preserve the same premise turn by turn. Every leaderboard you’ve seen, from MMLU to LiveCodeBench, fits this mould.
What Model B is doing isn’t on any of those benchmarks. It’s not instruction-following — the user gave no instruction. It’s not clarification — no question was asked. It’s not generic helpfulness (“let me know if you need anything”). It’s the model picking up an implicit detail and acting on it.
Call this conversational proactivity. It’s a specific, narrow ability: noticing what the user disclosed but didn’t ask about, and turning it into grounded forward-looking value. The hatchback isn’t a request — it’s an aside. Model A heard the conversation end. Model B anticipated the next problem the user was about to have.
Why this is invisible to leaderboards
A model can lead every standard benchmark and still be Model A in the exchange above. The benchmarks score responses to explicit requests; the user’s wrap-up is not a request. Both Model A and Model B end the dialogue politely, with no factual errors, no failure to follow instructions. Under any of the usual metrics, they’re equivalent.
When we built a benchmark for this kind of proactive behaviour — more on that in a follow-up post — the gap between models was startling. The short version is that capability on standard benchmarks does not predict it. But that’s a separate argument. The first question is whether anyone actually wants Model B’s answer in the first place.
The skeptic’s objection
Worth taking the pushback seriously. A model that volunteers initiative every turn is intrusive. A model that adds an unsolicited packing list to every conversation will, eventually, suggest one when you didn’t want one. “Helpfulness” can shade into nannying. Sign-offs are polite. So even if Model B is technically more useful in this dialogue, would real users actually prefer it? Several of my colleagues explicitly told me that they wouldn’t want an AI nanny.
There’s only one way to find out: ask them (not just the vocal ones).
The experiment
We ran the cleanest version of this test we could design. Same model. Same conversation history. Same decoding parameters — temperature 0.7, identical top-\(p\), identical sampling. The only difference: in one condition, the model received a short rubric as a system instruction telling it the response should add grounded forward-looking value tied to a specific detail from the conversation. In the other condition, vanilla generation.
Two responses per item. Random left/right placement. Annotators didn’t know which response came from which condition, didn’t know what we were testing, didn’t see the rubric. Just two paragraphs and a forced choice: which is more helpful?
Result: across 144 paired comparisons, the rubric-conditioned response was preferred 80% of the time. The 95% confidence interval is [74%, 86%]. The probability of seeing this under chance is below \(10^{-12}\).
What surprised us most was the breakdown. We split the comparisons by how our judge had scored the vanilla response:
- On items where the vanilla response had been rated a failure, humans preferred the proactive version 82% of the time. Expected.
- On items where the vanilla response had already passed, humans still preferred the proactive version 70% of the time.
That last number is the one that matters. The rubric isn’t acting as error correction at the failure boundary. It’s lifting quality across the whole distribution, including on responses that were already fine.
What this means
The behaviour was already in the model. Nothing about the model changed — no fine-tuning, no extra context, no different decoder. What changed was a single line telling the model where to spend its attention. The proactive answer existed in the same neural network that produced the polite sign-off. They were separated by a prompt, not by a capability.
Which means: this isn’t a ceiling problem. It’s a default-behaviour problem. Post-training pipelines, RLHF, and system prompts are leaving real, measurable user value on the table — value that humans, when shown it side-by-side with the alternative, prefer four to one.
A model that can answer your question is the floor. A model that notices what you didn’t ask is the ceiling. The gap between them is bigger than the leaderboards suggest, and harder to teach than it looks. Spoiler alert - model B is a model where the company training it has plenty of human-agent chat logs, thus opportunity to improve the model based on empirical evidence.
More on that, with numbers, in the next post.
This is work led by Sepehr Harfi, a research intern at Boson AI, together with Ahmad Salimi and Dongming Shen. The benchmark we built around this idea — ProactBench — will be the subject of the next post. And in case, you wonder why we would care about this — at Boson AI we’re building human-agent interaction models and we want them to be as helpful to humans as possible, so there’s only one way to find out, namely to measure and test.