Raising MXNet from the Attic

Apache projects don’t really die. They go to the Attic, and I mean that literally: the Apache Attic is a real place, the shelf where retired projects are filed away with a polite note that nobody is maintaining them anymore. MXNet went up there in 2023. Its last real release was 1.9, the promised 2.0 never shipped, and it shambled on as a zombie until the lights went out.

That should have been the end of it. MXNet lost the framework war over half a decade ago; if you are starting something new in 2026 you reach for PyTorch or JAX and you don’t look back. But “don’t use it for new things” is not the same as “nobody needs it to run.” A whole edition of Dive into Deep Learning (the book a lot of people have used for teaching, and that I spent a considerable amount of time writing) is written against MXNet (the other editions use PyTorch, JAX and TensorFlow). So are course notebooks, research repos, and a long tail of internal tools, all frozen against an API that quietly assumes the world still looks like 2021.

And the world stopped looking like 2021. The GPUs are Blackwell now, and they want CUDA 13. More to the point, there’s no combination of drivers that would let you run the last MXNet release on non-ancient hardware. Mac laptops run on Apple Silicon now, rather than the Intel CPUs they shipped with before 2020. So you end up in the worst possible spot: legacy code, modern hardware, and a dead framework wedged in between that refuses to build on either platform.

So I dug it back up. The result is the zombie you see on the project page — shambling, a little gray, and absolutely not something you should build a new house on. The README says it best: use at your own risk. But if you have an old library you need to keep alive, this might save your week.

Reviving the corpse

Here is the thing about reanimating something dead: it does not file bug reports. A live framework has maintainers who notice when a result drifts. A dead one just… compiles, mostly. Imports, usually. And then hands you back a tensor full of NaNs, or the wrong dtype, or — my personal favorite — the right answer on small inputs and garbage on large ones, so everything looks healthy right up until the moment it doesn’t.

On the other hand, there’s a significant advantage in working on a corpse: nobody cares if you screw up. You can perform surgery and the worst damage you can do is end up with an unusable library, and nobody will be upset. That makes it a perfect experimental subject for automatic code generation, featuring Claude and Codex (but mostly Claude). I wanted to see whether it’s possible to reanimate MXNet using agentic coding, without writing a single line of code myself (though with plenty of instructions to the agents). So this is as much a summary of the outcomes as of how I got there.

Just like with any dead piece of code, lots of things had stopped working and the rest had moved on (or died, too). CUDA 13. ONNX. cuDNN. oneDNN. Python. And then there were lots of old tests that had been commented out for being flaky even before that — some as early as 2020. Fortunately for me, most of the tests still worked, or were at least present. That let me start porting, using the tests as a way to drive out errors. A few of the patterns that worked:

Start with a small footprint. Port the core first, with the optional modules — ONNX, oneDNN, and friends — switched off, so there’s less surface to debug at once.
Get the unit tests green before anything else.
Re-enable the skipped tests. Many had been disabled years earlier as “flaky,” but flaky usually turned out to mean a real bug that had been hidden rather than fixed — so turning them back on surfaced genuine problems.
Re-run the full suite after every change, even the ones that look trivial. MXNet is full of side effects, and they surface in the least related places.
Drive it with real application code. D2L was a good stand-in, and it caught a few overzealous API “fixes” that Codex had introduced and the unit tests had missed.

Once the codebase had settled a bit, the goal shifted to actually stabilizing it. As a result, several long-standing cases where MXNet and PyTorch disagreed on convergence in experiments simply went away, and experimental stability improved. Some of the strategies I used:

GitHub issues and PRs are a great source of bugs. Agents are good at triaging and crawling them — but don’t trust the triage, verify it.
For each bug, prove it first. Write code that demonstrates the bug is real, turn that into a unit test, and only then fix it.
Run bug scans, repeatedly, with both Claude and Codex. They find different things.
Keep an issue list. After each triage I had the agents write down what they found — severity, diagnosis, difficulty to fix — and then pointed other agents at the list.
Hunt by category. Once the first scans settled, I aimed Claude specifically at concurrency and scheduling, memory management, and leaks. It surely didn’t catch all of them, but there are far fewer now.
Where there’s smoke, there’s fire. Engineers make mistakes in patterns, so a bug in one place makes it likely its siblings are lurking elsewhere. Agents were great at hunting those down — add them to the list and repeat.
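The issue list can be as simple as a small record per bug. The sketch below is one possible shape, not the format actually used in this project: the field names, the severity scale, the 1-to-5 difficulty scale, and the sample entries' ratings are all assumptions. It also encodes the "prove it first" rule, in that an issue only becomes eligible for fixing once a failing test has reproduced it.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


@dataclass
class Issue:
    title: str
    severity: Severity
    diagnosis: str
    difficulty: int           # 1 = trivial ... 5 = hard (hypothetical scale)
    reproduced: bool = False  # True once a failing test proves the bug is real


def next_issues(issues, limit=3):
    """Pick verified issues, worst first; among equals, easiest first."""
    verified = [issue for issue in issues if issue.reproduced]
    verified.sort(key=lambda issue: (-issue.severity, issue.difficulty))
    return verified[:limit]


# Sample entries; the ratings are invented for illustration.
ISSUES = [
    Issue(".prod() shortcut fails", Severity.HIGH,
          "shortcut points at a renamed operator", 1, reproduced=True),
    Issue("GPU op hangs on cold start", Severity.CRITICAL,
          "lock held across a driver call", 4, reproduced=True),
    Issue("max-pooling returns NaN", Severity.HIGH,
          "accumulator initialized to -inf", 2, reproduced=True),
    Issue("suspected leak in scheduler", Severity.CRITICAL,
          "agent triage only, not yet verified", 3, reproduced=False),
]

for issue in next_issues(ISSUES):
    print(issue.severity.name, issue.title)
```

The unverified entry stays on the list but is never handed out for fixing, which keeps an agent's unconfirmed triage from turning into speculative patches.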

What surprised me is how independently agents can work now — for hours, even days, at a stretch. That was convenient: I have a day job as Boson’s CEO and didn’t want to sink much time into this beyond learning the tools. Claude’s permission system is genuinely excellent, and it rarely went astray — the inglorious exception being the long parade of PRs it took before the GitHub CI was finally clean. (It had its moments, too: at one point Claude started filing root-cause updates in the voice of a struggling engineering manager, reporting glorious progress on narrowing down the cause, report after report, without ever actually solving it, until it gave up entirely.) Codex was more temperamental. It was often the stronger of the two at fixing a specific bug, but its judgement could be poor: I had to throw away an entire night’s session in which it had started changing data types to fp64, quietly introducing awful performance regressions in the process.

Here are a few of the bugs, out of hundreds:

The notebook that died on a single method call. The naive-Bayes chapter crashed because someone had renamed the operator behind .prod() years ago and never updated the shortcut. One line. Classic zombie: the muscle was there, the nerve was cut.
The RNN that worked until the batch got big. Past a batch size of about 512, a whole cluster of recurrent-network notebooks blew up with “bad number of inputs.” It turned out the math library was miscounting its own arguments near a magic threshold. The fix was to stop trusting its fast path above 256 inputs and fall back to the boring, correct one.
The op that hung one cold start in eight. A single GPU operation would, roughly an eighth of the time on a fresh process, deadlock forever — and worse, it did so while holding Python’s global lock, so you couldn’t even Ctrl-C your way out. A lock held across a driver call, waiting on a thing that was waiting on the lock.
The Mac that returned NaN from a pooling layer. Apple Silicon has no NVIDIA GPU and, it turns out, runs a different code path for almost everything, which is a wonderful way to discover bugs nobody on Linux had ever hit. Max-pooling initialized its accumulator to negative infinity instead of the most negative finite value, and a convolution read from a buffer it forgot to zero. Both quietly poisoned the output.
The version landmine. One release of NVIDIA’s linear-algebra library worked on small matrices and returned NOT_INITIALIZED on large ones; a later one fixed that and added a crash instead. Reviving software on bleeding-edge silicon means half the job is pinning your dependencies to the one version that is neither broken in the old way nor the new.
The test that was red for two weeks. The nightly build had been failing every night, and the cause was one line that re-raised an exception the program had already caught and dealt with. The framework was, in effect, getting startled by its own ghost.
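The RNN fix follows a common pattern: keep the fast path, but only inside the range where it is trusted, and fall back to a plain implementation everywhere else. The sketch below illustrates that pattern only. `fast_sum`, `reference_sum`, and `guarded_sum` are hypothetical stand-ins rather than MXNet functions, and the simulated failure above 512 inputs is an assumption chosen to mimic the symptom described above. Only the 256 threshold comes from the text.

```python
FAST_PATH_MAX_INPUTS = 256  # cut-off mentioned in the post


def fast_sum(values):
    """Hypothetical stand-in for a fused library kernel with a buggy limit."""
    if len(values) > 512:
        # Simulated failure mode; the real error came from inside the library.
        raise ValueError("bad number of inputs")
    return sum(values)


def reference_sum(values):
    """Slow but straightforward path: accumulate one input at a time."""
    total = 0.0
    for value in values:
        total += value
    return total


def guarded_sum(values):
    """Use the fast path only where it is trusted, otherwise fall back."""
    if len(values) <= FAST_PATH_MAX_INPUTS:
        return fast_sum(values)
    return reference_sum(values)


print(guarded_sum([1.0] * 100))  # small input: fast path
print(guarded_sum([1.0] * 600))  # large input: reference path
```

Setting the cut-off well below the point where failures were observed trades a little speed for a margin of safety, which seems a reasonable bargain for a library whose job is to keep old code running.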

By the numbers

A sizable number of engineers at Amazon once worked on MXNet. The goal was to bring it to version 2.0 and then let it die a peaceful death. It never reached that milestone. I’m still amazed that over a number of nights, with fairly little hands-on work, I managed to get it running on modern hardware, and on CUDA 13, no less. Not everything is supported: TVM, DGL, and lots of downstream libraries are out, partly because they also rode off into the sunset of obsolescence. That said, here’s a summary of how much was changed.

The revival is one month of mostly-overnight work — 16 May to 17 June 2026 — built directly on top of MXNet’s very last commit (a dependency bump dated 26 January 2023; the project never committed anything again). In that month it took 98 pull requests and 879 commits, touched about 950 files, and added roughly 70,000 lines while removing 9,000 — around 113,000 lines of total churn once you count the back-and-forth. The single biggest pile of new code, more than 30,000 lines, went into tests — more than the engine, the Python bindings, or anything else. That is not an accident; re-teaching a zombie is mostly a matter of writing down what it is no longer allowed to forget.

For scale, I ran the same measurement over Apache’s own final stretch — from 2020 until the lights went out in early 2023, about three years of work by 185 contributors:

	Apache MXNet, 2020 → 2023	This fork, ~1 month
People	185	1 (plus agents)
Commits	1,397	879
Files touched	4,791	~950
Lines of churn	~950,000	~113,000

So in roughly a month, one person pointing agents at the corpse racked up about 60% of the commits, and an eighth of the line-churn, that 185 engineers contributed over three years — and then the result actually booted on a Blackwell GPU under CUDA 13, which the original 2.0 effort never lived to see. (For the pedants: a good chunk of Apache’s churn back then was demolition — clearing out old APIs ahead of the 2.0 that never shipped — whereas almost all of the fork’s lines are fixes and tests. And the oneDNN v3 port doesn’t even show up in these counts, since oneDNN lives in a submodule.)
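For anyone who wants to check the ratios, the arithmetic takes only a few lines. The figures are copied from the table above; the variable names are just for illustration.

```python
# Figures from the comparison table.
apache = {"people": 185, "commits": 1397, "files": 4791, "churn": 950_000}
fork = {"people": 1, "commits": 879, "files": 950, "churn": 113_000}

commit_share = fork["commits"] / apache["commits"]
churn_share = fork["churn"] / apache["churn"]
file_share = fork["files"] / apache["files"]

print(f"commits: {commit_share:.0%}")  # 63%
print(f"churn:   {churn_share:.0%}")   # 12%
print(f"files:   {file_share:.0%}")    # 20%
```

The commit share comes out near 63%, which the text rounds down to about 60%, and the churn share of roughly 12% is close to one eighth (12.5%).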

Epilogue

I want to be clear about what this is and isn’t. It is not a comeback. You should not write new software against MXNet; the framework war is long over and it lost. But if you are holding an old codebase that the rest of the world has left behind, the zombie will carry it across to modern hardware, and that is exactly what it was raised to do. Check it out at github.com/smolix/mxnet. The API docs are here.

If you want the autopsy — what actually had to be cut, stitched, and replaced to get a 2021-era framework running on 2026 silicon — that’s the next post. Stay tuned.