Efficiency in LLMs

MLSS
LLMs
inference
Published

June 13, 2026

Outline for the MLSS 26 at Columbia University

Next week I am teaching a tutorial on efficient LLM inference at the Machine Learning Summer School 2026 in NYC, hosted this year at Columbia University. The slides are below. There are about 150 of them, which sounds small, given how far the field has come.

It’s a good opportunity to review the exciting convergent evolution of models, hardware, and algorithms for serving efficiency. Be prepared for a deep dive into chips and bandwidth, but also randomized algorithms and architectures. My goal was to write a practitioner’s guide in six parts. The running example throughout is Qwen3, both the dense 8B and the 30B-A3B mixture of experts, at a 40k token context.

It is teaching first, so diagrams beat equations and equations beat walls of text, with source references to lots of papers. All numbers were verified in June 2026 (and are probably wrong by December).

Slides: Efficiency in LLMs (PDF, 47 MB) · Machine Learning Summer School 2026, Columbia.