Digital humans – Alex Smola

When was the last time you enjoyed calling your cable company? You work through the phone tree, explain your problem to the first person, who moves you to a second person, to whom you explain it again, who then needs to transfer you to a third. Somewhere around the fourth handoff you are reciting your account number from memory and have stopped expecting anything good to happen.

Nearly 100 years ago the Bavarian comedian Karl Valentin built a skit out of exactly this in Buchbinder Wanninger: a bookbinder phones a construction company to ask one trivial question, whether he should enclose his invoice with the books he is delivering, and gets passed from clerk to clerk to clerk, each one perfectly polite and not one of them able to answer. The bit is still funny because nothing about it has aged, and Germans still reference the scene today. Only the hold music is better now.

And the unhelpful transfer is the good case, because at least there was someone to transfer you to. Public administration, claims processing, after-hours support: in more and more places there simply are not enough people to do the work, and the gap is widening. The problem is not only that talking to the system is unpleasant and that the scripts are incredibly rigid. It is that increasingly there is no one on the other end at all.

This is the opportunity for digital humans. Not a chatbot bolted onto an FAQ, but something that can hold up the human side of an interaction from end to end: understand what you actually want, be reachable however you like to reach out, be actually pleasant to interact with, and get the thing done.

Ingredients

This sounds like science fiction and, in many ways, we aren’t quite there yet. That said, this future is a lot closer than we think. Many components are being built right now, waiting to be assembled to achieve something delightful. Here are some of the things that it takes to go from a clever text box to a digital human.

Context. It has to remember who you are, what you said two minutes ago and two weeks ago, and what state your problem is in. An agent that makes you re-explain yourself at every turn has simply rebuilt the call center, transfers and all. The community has made real progress here, from MemGPT’s idea of paging memories in and out of the context window like an operating system (now Letta), to drop-in memory layers such as Mem0 that any agent can bolt on.
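To make the paging idea concrete, here is a minimal sketch of a two-tier memory: a small window standing in for the context, and an archive for everything that has been paged out. This is a toy under stated assumptions, not how MemGPT, Letta, or Mem0 are implemented. The class name `PagedMemory`, its fields, and the keyword-match recall are all hypothetical choices made for illustration; a real system would use embeddings or a database instead of substring search.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class PagedMemory:
    """Toy two-tier memory (hypothetical): in-context window plus archive."""
    window_size: int = 3
    window: deque = field(default_factory=deque)  # what the model sees right now
    archive: list = field(default_factory=list)   # everything that was paged out

    def remember(self, fact: str) -> None:
        # New facts enter the window; the oldest fact is paged out when it is full.
        self.window.append(fact)
        while len(self.window) > self.window_size:
            self.archive.append(self.window.popleft())

    def recall(self, keyword: str) -> list:
        # Page archived facts that mention the keyword back into the window.
        hits = [f for f in self.archive if keyword.lower() in f.lower()]
        for fact in hits:
            self.archive.remove(fact)
            self.remember(fact)
        return hits
```

The point of the design is that nothing is forgotten, only moved: what you said two weeks ago sits in the archive until the conversation makes it relevant again.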

Competence. It has to do things: reset the modem, pull up the policy, requote the premium, change the booking. That means tool use, skills, and access to the systems of record. Without that, a digital human is just a more articulate apology, and you can already get one of those on hold. As with context and memory, we’ve come a long way here, too. Protocols such as MCP and A2A, improved training algorithms for instruction following, and a convergence on how LLMs should respond to requests have given us agents such as OpenClaw and Hermes and harnesses such as Claude Code that are perfectly capable of solving complex tasks.
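Stripped of protocol details, tool use comes down to a registry of callable actions and a dispatcher that executes whatever structured call the model emits. The sketch below assumes a made-up call format (a dict with `tool` and `args` keys) and two hypothetical tools, `reset_modem` and `change_booking`. It is not the MCP or A2A wire format, just the general shape.

```python
from typing import Callable, Dict

# Hypothetical registry mapping tool names to plain Python functions.
TOOLS: Dict[str, Callable[..., str]] = {}


def tool(fn: Callable[..., str]) -> Callable[..., str]:
    """Register a function under its own name so the agent can call it."""
    TOOLS[fn.__name__] = fn
    return fn


@tool
def reset_modem(account_id: str) -> str:
    # A real tool would call the provisioning system here.
    return f"modem reset queued for account {account_id}"


@tool
def change_booking(booking_id: str, new_date: str) -> str:
    # A real tool would write to the booking system of record.
    return f"booking {booking_id} moved to {new_date}"


def dispatch(call: dict) -> str:
    """Run a structured tool call; return an error string the model can read."""
    name = call.get("tool")
    if name not in TOOLS:
        return f"error: unknown tool {name!r}"
    try:
        return TOOLS[name](**call.get("args", {}))
    except TypeError as exc:  # missing or unexpected arguments
        return f"error: bad arguments for {name}: {exc}"
```

Returning errors as text instead of raising is deliberate: the agent can read the message, adjust the call, and try again.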

Presence on every channel. Email, chat, SMS, a voice call, and increasingly video. Most people, given the choice, would rather have a short video call than a phone call: the entire Zoom era is the proof. But sometimes the same person just wants to fire off a text and get a one-line answer back. A digital human has to be fluent in all of these and switch between them without dropping the thread. We’re probably halfway there. Text and messaging integrations for agents are by now table stakes, and voice integrations are coming (e.g. Higgs Audio). We’re starting to see systems with capable computer vision integration (e.g. Thinking Machines) and avatars that can stream live (e.g. Higgs Avatar). Real-time audio and video will significantly alter how models are served.

Emotional competence. Few things alienate humans faster than an agent that is indifferent to the situation at hand or, even worse, emotionally mismatched. In The Hitchhiker’s Guide to the Galaxy, Eddie, the shipboard computer on the Heart of Gold spacecraft, provides comic relief through exactly this behavior. It is relentlessly, catastrophically cheerful as the ship sails into certain doom. Eddie has emotions. They are simply never the right ones. Reading the room, whether this person is frustrated, in a hurry, or scared about a bill, and answering in the matching register is its own skill, and it draws on the whole stack: tone of voice, facial expression, the logical context of the conversation, not just the literal words. The research and engineering community still has a long way to go before we achieve this goal.

[Video: Eddie in action, emotionally engaged and utterly mismatched.]

Persistence and Persuasion. This is the part people underestimate. A digital human does not need to be perfect on the first reply, any more than a coding agent needs to compile on the first try. It needs to converge: read the response, notice that it missed, adjust, and arrive at the right outcome before the user gives up. And, inside the bounds of honesty, it helps to be convincing, the way a good salesperson or a good nurse is convincing. One day someone will hand the agent a Cialdini.md and tell it to internalize the principles of persuasion. I am only half joking.

A purpose

None of that means much without a goal. A digital human with no objective is just a chatbot with nicer text-to-speech. The whole point is that someone, the business deploying it, gets to say what it is for: help this customer fix their modem, find this driver a cheaper policy, sell this plan, walk this patient through their discharge instructions. This is easier said than done — anyone who tried to build it with first-generation dialog systems such as Lex carries the scars.

So the designer has to be able to specify the goal, the guidelines (what the agent may and may not say, and where it must hand off to a person), the data it can draw on (the catalog, the policies, this account’s history), and the rough flow a good interaction should follow. The agent then improvises inside those rails. Rails with no improvisation is the phone tree we started with. Improvisation with no rails is a liability. You want both, and modern agents and harnesses finally make both possible, though there’s still a lot to be done.
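One way to picture the rails is as a small declarative spec that the agent's draft replies are checked against. The sketch below is an assumption made for illustration, not the format of any particular product: `InteractionSpec`, its field names, and the substring-based `check_reply` are all hypothetical, and a real guardrail would use a classifier where this uses keyword matching.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InteractionSpec:
    """Hypothetical designer-authored spec: goal, guidelines, data, flow."""
    goal: str
    forbidden_phrases: tuple = ()  # things the agent may not say
    handoff_topics: tuple = ()     # topics that must go to a person
    data_sources: tuple = ()       # what the agent may draw on
    flow: tuple = ()               # rough order of a good interaction


def check_reply(spec: InteractionSpec, user_message: str, draft_reply: str) -> str:
    """Return 'handoff', 'blocked', or 'ok' for one conversational turn."""
    text = user_message.lower()
    if any(topic in text for topic in spec.handoff_topics):
        return "handoff"  # the rails say a human takes over here
    reply = draft_reply.lower()
    if any(phrase in reply for phrase in spec.forbidden_phrases):
        return "blocked"  # the draft must be regenerated
    return "ok"           # the agent is free to improvise


modem_spec = InteractionSpec(
    goal="help this customer fix their modem",
    forbidden_phrases=("guarantee",),
    handoff_topics=("cancel my account", "lawyer"),
    data_sources=("account history", "modem manuals"),
    flow=("greet", "diagnose", "fix", "confirm"),
)
```

Everything the check does not forbid is left to the agent, which is the division of labor the paragraph above describes.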

What we are building at Boson AI

The audio-visual interface. This is the part the user actually meets, and it has to run in real time. On the way in, Higgs Audio does much more than transcribe: it picks up sentiment, meaning, and how something was said across 94 languages, so the agent hears tone and not just text. On the way out, the same family of models speaks in over a hundred languages with inline control over emotion, prosody, and pacing, so it can sound concerned when concern is what the moment calls for.

[Audio sample: what that sounds like.]

And because a lot of people would happily take the video call, Higgs Avatar renders a face from a single still image, live, one frame at a time, lip-synced and locked to the voice and fast enough that it never falls behind the conversation. Audio in, audio out, a face on top, all of it multilingual.

Tooling. Specifying the goal, guidelines, data, and flow is itself work, and it should not take a research team. Feynman Flow (coming soon) is the part that lets a designer author and refine that whole interaction.

Interactions. This is the piece I find most interesting, because it is where the open questions live. What actually makes an interaction good? We wrote about one slice of this recently: conversational proactivity, noticing what the user disclosed but never asked about, which turns out to be both something people strongly prefer and something no standard benchmark measures. The same kind of question runs through instruction following over a long conversation, and through adaptation to very different people: a terse engineer and a chatty oversharer want quite different things from the same agent, and a good one reads which is which.

Architecture. One note, because it is a common trap. You do not want the frontier reasoning model running inside the voice loop. Heavy reasoning is too slow for a live conversation, and it is worse than that: reasoning degrades when you push it through the audio modality. A recent evaluation found a leading text model scoring 74.8% on competition math while its voice counterpart managed 6.1% on the same problems. So the right shape is a fast, emotionally fluent model at the edge holding the conversation, with a frontier model behind it doing the hard thinking when hard thinking is called for. The voice should not be doing the calculus.
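The split can be sketched as a thin router: the fast model always owns the spoken turn, and the frontier model is consulted only when the request looks hard. Everything here is a stand-in. `needs_deep_reasoning` is a hypothetical keyword heuristic, and both models are passed in as plain functions from prompt to text so the example runs without any real model behind it.

```python
from typing import Callable

Model = Callable[[str], str]  # stand-in type: prompt in, text out


def needs_deep_reasoning(utterance: str) -> bool:
    """Hypothetical heuristic; a real system would use a learned classifier."""
    hard_markers = ("calculate", "compare", "prove", "how much would")
    return any(marker in utterance.lower() for marker in hard_markers)


def make_voice_agent(fast_model: Model, frontier_model: Model) -> Model:
    """Build a responder in which the fast model always holds the conversation."""

    def respond(utterance: str) -> str:
        if needs_deep_reasoning(utterance):
            # The hard thinking happens in text, outside the voice loop.
            answer = frontier_model(utterance)
            # The fast model only phrases the result for the listener.
            return fast_model(f"Say this warmly and briefly: {answer}")
        return fast_model(utterance)

    return respond
```

In a live deployment the frontier call would presumably run asynchronously while the fast model keeps the conversation going; the synchronous version above only shows who is responsible for what.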

From psychology to measurement

Here is the thing we do not yet have: a large-scale science of what makes an interaction good. What we have instead are small-sample measurements from psychology, small because humans, as test subjects and as evaluators, are expensive. Now, for the first time, we may have the opportunity to measure and experiment at scale and to build systems that are actually good at working with humans. That is going to be a lot better than what can be learned from fiction. For instance, the grand gesture that reads as romantic in the movie reads as stalking in actual life. Stories are optimized for drama, not for being good company, and an agent that learned its manners from them would be exhausting at best and alarming at worst.

What we need — and what the novelist, and to a lesser degree the psychologist, never had — is measurement at a scale neither could have dreamed of: millions of real interactions, preference data, controlled comparisons, benchmarks built for the specific behaviors we care about. The opportunity, and the part that makes this a research problem and not only an engineering one, is to replace the folklore with statistics. To stop guessing at what good company is, and start measuring it.

That is the bet. For a hundred years the hold music has been the only thing to improve. We would like to fix the rest of the call.

Boson AI · Higgs Audio · Higgs Avatar · ProactBench.