[Non record] Mercury in Retrograde - text diffusion model #1778
Open
simon-marcus wants to merge 2 commits into openai:main
Mercury in Retrograde
This is a non-record submission for the Parameter Golf request for text diffusion. Instead of taking a leading model and then sprinkling a little text diffusion on top for fun, we aimed to make text diffusion unmistakably present in the training objective and in the evaluation story, then to report, with as much honesty as possible, what happened when that idea was squeezed into a 16MB artifact and a 10-minute training budget. The result was a failure for val_bpb, but I'm calling it a win for educational golfing. The model acquired a real diffusion-like interface. It learned to revise whole spans in parallel. It learned to infill. It learned a speed-quality knob. It also learned, with a kind of stubborn little diligence, how to be wrong at extremely high throughput.
Why Text Diffusion Was Worth Trying
Autoregressive language models do one thing with almost tyrannical consistency: they predict the next token, then the next, then the next, each one conditioned on the whole left context and blind to everything on the right. This is a fantastically strong way to model text distributionally, but it is also sequential by construction. Diffusion models tempt us with another picture. Instead of writing the sentence one token at a time, they begin from corruption and iteratively denoise, revising many positions at once, sometimes all of them. In images this has already redrawn the map.

In text it has been harder, but not for lack of serious attempts: we took some inspiration from, e.g., Diffusion-LM, which made the case that diffusion objectives could be useful for controllable text generation. Later papers suggested that text diffusion is not merely a curiosity but something that may become more competitive with the right scaling laws, formulations, and systems.
A diffusion language model should, in principle, generate multiple tokens in parallel, revise its own mistakes, handle infill and other any-order generation tasks naturally, expose an explicit speed-vs-quality tradeoff through the number of denoising rounds, and perhaps, if the stars are properly aligned, do all this much faster than ordinary left-to-right decoding.
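The denoise-in-rounds interface described above can be sketched in a few lines. This is a toy illustration, not our training code; MASK, toy_denoiser, and parallel_decode are our own illustrative names, and the "model" is a trivial stand-in.

```python
import random

MASK = -1  # hypothetical sentinel for a masked/corrupted position


def toy_denoiser(tokens):
    """Stand-in for the model: propose a token for every masked slot.

    A real diffusion LM would emit a distribution per position; here we
    just guess a fixed high-frequency token (0) to keep the sketch runnable.
    """
    return [0 if t == MASK else t for t in tokens]


def parallel_decode(length, rounds, keep_per_round):
    """Coarse-to-fine decoding: start fully masked, commit a few of the
    denoiser's proposals each round, and leave the rest masked for revision.

    `rounds` is the speed-quality knob: fewer rounds means more tokens
    committed in parallel per pass, at the cost of less revision.
    """
    seq = [MASK] * length
    for _ in range(rounds):
        proposal = toy_denoiser(seq)
        masked = [i for i, t in enumerate(seq) if t == MASK]
        for i in random.sample(masked, min(keep_per_round, len(masked))):
            seq[i] = proposal[i]  # commit this position
    return seq
```

The point of the sketch is only the control flow: every pass looks at the whole sequence at once, which is where the parallelism (and the revision ability) comes from.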
That is also roughly how Inception’s Mercury announcement presents the idea: a coarse-to-fine model that modifies multiple tokens in parallel, can serve as a drop-in replacement for Transformer-based LLM infrastructure, and gets much of its practical force not only from modeling but from the surrounding systems stack. (That last clause matters a lot.)
Why It Probably Does Not Work Here
The short version is that Parameter Golf is a peculiarly hostile habitat for text diffusion.
This challenge rewards compression quality on FineWeb through exact val_bpb, under an extremely tight artifact budget and a brutal training wall clock. This environment is almost offensively well-suited to small autoregressive models. AR spends nearly all of its capacity on one job: estimating the next-token distribution as faithfully as possible. A compact diffusion model, by contrast, has to learn something more baroque. It must model text well enough to know what should go in a corrupted position, while also learning how to denoise under a corruption process, while also staying stable under iterative refinement, while also paying for the fact that its natural interface is no longer the exact objective by which the challenge ranks submissions.

This becomes worse, not better, when the model is tiny. A large diffusion model may have enough slack capacity to learn both language statistics and denoising dynamics, and maybe even some graceful error correction. A small model, trained briefly, tends instead to discover a grimly economical compromise: guess common tokens, repeat punctuation, repair locally if possible, and otherwise retreat into the high-frequency regions of the distribution. Which is, to be fair, a distributional strategy of a sort. It just is not the sort you want.
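For reference, a bits-per-byte metric like val_bpb is just a token-level loss renormalized by the byte length of the evaluated text. A minimal sketch (the function name is ours):

```python
import math


def bits_per_byte(mean_nll_nats_per_token, n_tokens, n_bytes):
    """Convert a mean token-level NLL (in nats) to bits per byte.

    Total nats become bits via division by ln(2); normalizing by the byte
    length of the text makes models with different tokenizers comparable,
    which is why bpb-style metrics are used for compression challenges.
    """
    total_bits = mean_nll_nats_per_token * n_tokens / math.log(2)
    return total_bits / n_bytes
```

A model whose mean loss is ln(2) nats per token, on text with one token per byte, scores exactly 1.0 bpb.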
There is also the systems issue. Inception's public framing is not merely "diffusion is better." It is "diffusion plus an optimized inference engine plus batching plus kernels plus scale is fast enough to matter." We did not reproduce that stack here. We reproduced, on purpose, the modeling gesture: parallel coarse-to-fine text denoising inside the challenge’s compact Transformer setup. That makes this a useful scientific negative result rather than a failed product launch. It isolates the part we could test.
What We Actually Tried
Phase 1: naive hybrid text diffusion
We began with a modest premise: keep the ordinary causal LM machinery, then add text-diffusion-style corruption objectives and see whether some measured amount of denoising helps without wrecking BPB.
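One plausible shape for that added corruption objective is span masking: corrupt a contiguous span and ask the model to reconstruct it. This is a sketch under our own assumptions; we are guessing that names like td_span15 and td_span30 denote span fractions, and corrupt_span is an illustrative helper, not the submitted code.

```python
import random


def corrupt_span(tokens, mask_id, span_frac):
    """Diffusion-style span corruption: replace a contiguous span covering
    roughly `span_frac` of the sequence with mask tokens.

    Returns the corrupted sequence plus a position->original-token map,
    which is what a denoising loss would be computed against.
    """
    n = len(tokens)
    span = max(1, int(n * span_frac))
    start = random.randrange(0, n - span + 1)
    corrupted = list(tokens)
    targets = {}
    for i in range(start, start + span):
        targets[i] = corrupted[i]
        corrupted[i] = mask_id
    return corrupted, targets
```

A hybrid objective would then mix the ordinary next-token loss on clean text with a reconstruction loss on these masked positions.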
Early 180-second ladder:
- td_control
- td_span15
- td_span30
- td_prog15
- td_prog30

The immediate lesson was not subtle. Even gentle diffusion-style corruption moved BPB in the wrong direction. That did not kill the project, but it did force a conceptual fork. If we optimized only for BPB, the best "text diffusion" model would simply be the one with the least text diffusion in it, which is not what we wanted to chase here.
Phase 2: late auxiliary losses and corruption sweeps
We then tried to make diffusion less destructive by deferring or demoting it. The denoising objective was pushed later in training, or weakened, or made more selective.
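"Pushed later, or weakened" amounts to a weight schedule on the auxiliary loss. A minimal sketch; the mapping from run names like td_late05 to a start fraction of 0.5 is our guess, and aux_weight is an illustrative name:

```python
def aux_weight(step, total_steps, start_frac, max_weight):
    """Late auxiliary schedule for a denoising loss.

    The auxiliary loss is off for the first `start_frac` of training, then
    ramps linearly up to `max_weight` by the final step, so early training
    is pure next-token prediction.
    """
    start = start_frac * total_steps
    if step < start:
        return 0.0
    return max_weight * (step - start) / max(1.0, total_steps - start)
```

The total loss at each step would then be `lm_loss + aux_weight(step, ...) * denoise_loss`.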
Late auxiliary ladder:
- td_control
- td_late02
- td_late05
- td_late05p
- td_late10

Corruption sweep:
- td_control
- td_cons05
- td_hyb05
- td_hyb05m
- td_unif05
- td_unif10

These runs were important precisely because they sort of worked, in the narrow sense that they did not completely explode. But that turned out to be the problem. They were not diffusion-native enough to be interesting. They preserved more of the AR model because they were, in spirit, still AR models with a denoising side hustle.
Phase 3: make diffusion explicit, then see what survives
At that point the project changed from "can we smuggle in a little diffusion without hurting BPB" to "what is the most informative compact text-diffusion submission we can make, even if the BPB is worse."
That led to the Mercury-style branch: a direct denoising mode, mixed continuation/infill masking, self-conditioning, progressive hybrid corruption, and parallel refinement metrics logged on purpose. If the model was going to lose on the main challenge metric, it should at least lose in a way that teaches something specific.
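Of those ingredients, self-conditioning is the least self-explanatory, so here is the control flow as a toy sketch (our own names and stand-in model; the real implementation would wrap the first pass in torch.no_grad and feed the estimate in as an extra input channel):

```python
import random


def toy_model(corrupted, estimate):
    """Stand-in denoiser: keeps visible tokens, and for masked slots copies
    the previous estimate if one is provided, else guesses token 0."""
    out = []
    for i, t in enumerate(corrupted):
        if t != -1:
            out.append(t)
        elif estimate is not None:
            out.append(estimate[i])
        else:
            out.append(0)
    return out


def self_cond_forward(corrupted, p_selfcond=0.5):
    """Self-conditioning: with probability `p_selfcond`, run a first pass
    without gradients and condition the trained second pass on its output,
    so the model learns to refine its own predictions across rounds."""
    estimate = None
    if random.random() < p_selfcond:
        estimate = toy_model(corrupted, None)  # no-grad pass in practice
    return toy_model(corrupted, estimate)
```

The payoff at inference time is that each refinement round can see what the previous round believed, rather than starting from scratch.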
Initial Mercury screens, 180 seconds on 1xH100:
- mercury_mask25
- mercury_hybrid25
- mercury_uniform50
- mercury_hybrid50p
- mercury_auxlate

This is where the shape of the problem finally became clear. The AR-heavy fallback was numerically healthier but aesthetically boring. The diffusion-native Mercury variants were much worse on BPB, but at least they behaved like diffusion models rather than as causal models wearing a fake mustache.
We then tried to improve the Mercury path without betraying it. Those attempts produced the following sequence.
Self-conditioning and task-alignment screens, 180 seconds:
- mercury_uniform50
- mercury_hybrid50p
- mercury_uniform50_suffixsc
- mercury_hybrid50p_mixsc
- mercury_auxlate

Longer 600-second Mercury screen:
- mercury_hybrid50p_mixsc
- mercury_hybrid37p5_mixsc
- mercury_hybrid50p_cont90

Fine local search around the best recipe, 180 seconds:
- mercury_hybrid35_mixsc
- mercury_hybrid37p5_cont85
- mercury_hybrid3125_mixsc
- mercury_hybrid35_cont85
- mercury_hybrid37p5_mixsc

And the explicit two-step formulation, which is worth recording because it failed so cleanly:
- mercury_hybrid37p5_2step
- mercury_hybrid37p5_2step_cont90

This was not merely slow. It was unstable. The runs reached only about 25 steps in 180 seconds and produced catastrophically bad BPB almost immediately. In other words, the obvious "make it more diffusion-like by literally unrolling more denoising" move was, in this setting, precisely the wrong move.
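The slowness is structural: an explicitly unrolled two-step objective doubles the forward passes per optimizer step. A sketch of the shape (our own names; the real version would carry gradients through both passes, which is also where the instability comes from):

```python
def two_step_unrolled(denoise_fn, corrupted, recorrupt_fn):
    """Explicit two-step diffusion training step: denoise, re-corrupt the
    estimate, denoise again. Two full forward passes per training example,
    so wall-clock step count roughly halves under a fixed time budget.
    """
    first_estimate = denoise_fn(corrupted)
    second_input = recorrupt_fn(first_estimate)
    return denoise_fn(second_input)
```

Under a 180-second budget, halving the step count is exactly the wrong trade.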
For completeness, we also explored anchored/block diffusion.
Anchored/block ladder, 180 seconds:
- ab_control
- ab_anchor32late
- ab_anchor32
- ab_anchor64
- ab_block32

Anchors helped relative to naked block diffusion. Delaying the objective helped much more. None of it was healthy enough to become the submission.
How We Landed On The Final Recipe
The final submission recipe is mercury_hybrid35_mixsc, and it is here because it satisfied the one criterion that became more important than vanity: it made diffusion visible without making the whole model nonsensical. In concrete terms, it keeps the hybrid corruption objective (at a 35% rate) and the mixed self-conditioning pass from the Mercury-style branch.

We did not choose it because it had the absolute best number from every exploratory run. We chose it because it was the best compromise between two competing obligations: staying recognizably diffusion-native, and keeping BPB from collapsing entirely.
This is, in a way, the entire retrograde story. The more diffusion we added in the naive sense, the worse BPB became. The more we retreated back toward AR, the less interesting the submission became. The final recipe is the point on that curve where the model is still recognizably about diffusion, yet not so broken that the whole experiment collapses into pure parody.
Final 8xH100 Result
Final 8xH100 SXM runs used seeds 1337, 42, and 2026. The submitted train_gpt.py defaults to this recipe.

This is much worse than the official naive AR baseline. That sentence should not be hidden in a footnote or apologetically coughed into a sleeve. It is the primary result. In this challenge setting, compact text diffusion simply does not beat compact autoregression on val_bpb.

All three artifacts fit under the decimal 16,000,000-byte cap. The largest logged total was 15,677,283 bytes.
What The Model Is Good At, If "Good" Is Used Carefully
Our experiment exposes diffusion-native behavior clearly enough to inspect.
Three-seed mean parallel_eval results from the 8x logs:

These accuracies are awful. They are also informative. They tell us that the model did learn the interface. It can denoise in parallel. It can infill. It can spend additional refinement steps to trade speed against slightly different accuracy. What it did not learn was how to make that interface semantically robust under this budget. But it did reveal where some of the tradeoff was happening.
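The shape of such a parallel-infill evaluation can be sketched as follows (a toy under our own naming; the real parallel_eval runs the trained model, and MASK stands in for the tokenizer's mask id):

```python
import random

MASK = -1  # hypothetical sentinel; the real eval would use a tokenizer mask id


def parallel_infill_accuracy(denoise_fn, token_ids, mask_frac, rounds):
    """Mask a random fraction of positions, run a fixed number of parallel
    refinement rounds, and score token accuracy on the masked slots only.

    Sweeping `rounds` is what exposes the speed-quality knob: fewer rounds
    means faster decoding and, usually, worse reconstruction.
    """
    n = len(token_ids)
    masked = set(random.sample(range(n), max(1, int(n * mask_frac))))
    seq = [MASK if i in masked else t for i, t in enumerate(token_ids)]
    for _ in range(rounds):
        seq = denoise_fn(seq)
    hits = sum(seq[i] == token_ids[i] for i in masked)
    return hits / len(masked)
```

Scoring only the masked slots matters: the visible tokens are given, so counting them would inflate accuracy.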
Matched AR vs Mercury Decode Benchmark
To make the speed-quality tradeoff legible, I also ran a matched 1xH100 decode benchmark using the actual seed-2026 8x checkpoint and the official naive baseline checkpoint.

Setup:

Highlights from decode_benchmark.md:
- AR baseline: 1518.79 tokens/sec
- Mercury: 0.0400 token accuracy at 33315.36 tokens/sec (21.94x AR throughput)
- Mercury: 0.0400 token accuracy at 52423.93 tokens/sec (34.52x AR throughput)
- Mercury: 0.0400 token accuracy at 147729.24 tokens/sec (97.27x AR continuation throughput)

The model is breathtakingly fast and spectacularly parallel, but very, very incorrect. The diffusion-native interface is real, but at this scale the diffusion-native model is not very good at all.
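The speedup multiples are simple ratios against the AR baseline, which makes the numbers easy to sanity-check. A sketch (function names are ours):

```python
def throughput(n_tokens, seconds):
    """Decode throughput in tokens per second."""
    return n_tokens / seconds


def speedup_over_ar(model_tps, ar_tps):
    """Throughput multiple over the AR baseline, e.g. 33315.36 tok/s
    against the baseline's 1518.79 tok/s gives roughly 21.94x."""
    return model_tps / ar_tps
```

Checking the logged figures this way confirms they are internally consistent: each Mercury throughput divided by 1518.79 reproduces the reported multiple.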
Reproduction
The submitted train_gpt.py defaults to the final mercury_hybrid35_mixsc recipe. On the official RunPod image with cached SP1024 FineWeb data:

Repeat with SEED=42 and SEED=2026 to reproduce the three-seed set. Included logs:
- train_seed1337.log
- train_seed42.log
- train_seed2026.log

The script writes:
- final_model.pt
- final_model.int8.ptz
- val_loss and val_bpb
- parallel_eval metrics
Compliance
- The submission lives under records/track_non_record_16mb.
- train_gpt.py is self-contained and runnable from this folder.
- The artifact is packaged with int8+zlib, and all logged artifacts are under 16,000,000 bytes.
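The int8+zlib size check can be illustrated in a few lines. This is a sketch of the packaging idea only; the real script also stores quantization scales, shapes, and other metadata, so its blobs are somewhat larger than this toy suggests.

```python
import zlib


def artifact_size_ok(float_weights, cap=16_000_000):
    """int8-quantize a flat list of float weights, zlib-compress the bytes,
    and check the compressed blob fits under the decimal byte cap."""
    scale = max(abs(w) for w in float_weights) / 127.0 or 1.0
    quantized = bytes(int(round(w / scale)) & 0xFF for w in float_weights)
    blob = zlib.compress(quantized, 9)
    return len(blob), len(blob) < cap
```

Because zlib compresses repetitive int8 patterns well, the compressed size, not the raw parameter count, is what the 16,000,000-byte cap actually constrains.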