Four self-directed rounds on a production GPU cluster. The autonomously produced model placed 8th of ~4,000 on NVIDIA's Nemotron Reasoning Challenge — one point behind the top human team.
The same autonomous system has since post-trained the 120B and 550B Nemotron models end-to-end — evidence the loop closes at that scale too. (No public human baseline exists there yet, so we report it as infrastructure evidence, not a competitiveness claim.)
Notes — NVIDIA Nemotron Reasoning Challenge; public leaderboard standing as of 2026-06-01. Rank 8 of ~4,000 entries.
The score isn't the point. Mid-campaign, the loop detected that its internal dev metric had stopped tracking external performance on the visibly weakest domain: candidates pushed the metric to record highs without moving the real target. It revised its own search policy in response. It stopped asking for a higher dev score and instead asked for interventions that might lower the now-misleading proxy while improving the external target. We treat this as direct, auditable evidence that a scaled autonomous loop can produce discovery, not only optimisation: it detected that the measurement frame itself had become misleading and changed what counted as evidence.
After the human authored the initial substrate, the loop ran unattended. The leaderboard score improved monotonically across four rounds and converged one point below the best human team.
Most public autonomous-research demos run at roughly GPT-2-class (~124M) budgets, where an experiment takes minutes and a failed run is free to retry. That "retry cheaply" assumption is what closes the loop. At frontier post-training it breaks: one trial is a multi-week run on an H200 cluster, and the moves that are free at small scale become prohibitive.
| Element | GPT-2-class (~124M) | Frontier post-training (30B) | Gap |
|---|---|---|---|
| hypothesis | Narrow space: mostly architecture & optimiser coefficients. | Wide design space: synthetic-data construction, SFT/RL choice, loss design, schedules, data mixture, checkpoint selection. | ~10× |
| execution | Minutes per run; each experiment fixed to a ~5-minute budget. | Multi-week H200 runs over a full training stack, data pipeline, and vLLM evaluation. | ~10³× |
| strategy | Feedback in minutes, so broad sweeping and hill-climbing are affordable. | Feedback per trial after hours/days, noisy and distribution-shifted — budget must be allocated across a few hypotheses. | ~10²× |
| infra | Single PyTorch script, single GPU, one metric. | Multi-H200 orchestration, persistent storage, checkpoint management, eval harness, failure recovery. | ~10²× |
Ratios are illustrative order-of-magnitude estimates, not measurements.
It chose to 2× upsample one domain (5.9% → 11.1% of the mix), a move the prevailing cannibalisation prior predicts should degrade other domains. It didn't: the change also lifted performance on off-target domains. This is a verifiable result the operator's prior would have excluded, the kind of outcome a within-prior optimiser is not built to produce.
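The 5.9% → 11.1% figure is consistent with simple renormalisation: doubling one domain while leaving every other domain untouched grows the whole mix, so the share less than doubles. The sketch below assumes upsampling means repeating that domain's samples; the function name `upsampled_share` is illustrative and does not come from the A-Evolve stack.

```python
def upsampled_share(share: float, factor: float) -> float:
    """Share of the mix held by one domain after weighting it by `factor`.

    Assumes all other domains keep their original sample counts, so the
    total mix grows and the result is smaller than `factor * share`.
    """
    if not 0.0 <= share <= 1.0:
        raise ValueError("share must be in [0, 1]")
    if factor <= 0:
        raise ValueError("factor must be positive")
    boosted = factor * share
    # Renormalise over the enlarged mix: boosted domain plus everything else.
    return boosted / (boosted + (1.0 - share))


new_share = upsampled_share(0.059, 2.0)
print(f"{new_share:.1%}")  # prints 11.1%
```

In closed form this is 2p / (1 + p) with p = 0.059, which gives roughly 0.111.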
Early rounds optimised the internal dev score. Later rounds drove that score to record highs (dev 0.93–0.94, with a weak domain lifted from ~0.65 to 0.82), yet the external target didn't move. The loop concluded that the easiest proxy to improve was no longer the causal bottleneck and inverted its objective: it began asking for interventions that might lower the proxy while improving the real target.
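One way to picture the check described above is a simple decoupling test over a window of rounds: the proxy has risen while the external target has not. This is a minimal sketch under stated assumptions, not the loop's actual detection logic, which this page does not specify. The function name, the thresholds, and the target values in the example are all hypothetical placeholders.

```python
from typing import Sequence


def proxy_decoupled(
    dev: Sequence[float],
    target: Sequence[float],
    min_dev_gain: float = 0.01,     # hypothetical threshold
    max_target_gain: float = 0.0,   # hypothetical threshold
) -> bool:
    """True when the dev proxy rose over the window but the target did not."""
    if len(dev) != len(target) or len(dev) < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    dev_gain = dev[-1] - dev[0]
    target_gain = target[-1] - target[0]
    return dev_gain >= min_dev_gain and target_gain <= max_target_gain


# Dev endpoints echo the weak-domain lift quoted above (~0.65 to 0.82).
# The target values are made-up placeholders, not campaign results.
dev_scores = [0.65, 0.74, 0.82]
target_scores = [0.50, 0.50, 0.50]

if proxy_decoupled(dev_scores, target_scores):
    objective = "improve external target; lowering dev is acceptable"
else:
    objective = "raise dev score"
print(objective)
```

A production check would also need to account for evaluation noise, for example by comparing gains against run-to-run variance rather than fixed thresholds.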
No "retry cheaply." Each of the four rounds spawned 8 identical full-stack agents that edited recipes, launched real GPU jobs, debugged failures, and selected checkpoints — and the campaign still converged monotonically. We report this as infrastructure evidence that the loop can be made to close at this scale.
We first built what looked right: specialised data / training / eval agents handing off mid-states, like a human research team. It did not scale. Building on mid-states compounds unobserved variance and corrupts the very signal that selection depends on.
The configuration that worked is the inverse. One immutable, human-audited substrate that every round re-forks and no winner overwrites. Memory-free, identical workers instead of specialised roles. A bounded meta-agent that can only rewrite the next round's search policy — never the substrate it runs on. One principle ties them together:
Asymmetric freedom. Zero degrees of freedom on the axes that must stay invariant for trials to remain comparable — substrate, evaluation, baseline. Maximal freedom on the one axis where exploration pays — which semantic mutation to attempt. Freeze what must hold; spend all the freedom where it counts.
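The shape of that configuration can be sketched in a few lines. This is an illustrative skeleton, assuming the substrate and policy can be modelled as frozen records; every class, field, and function name here is hypothetical and is not taken from the A-Evolve codebase. The toy worker and meta-agent are placeholders for agents that would edit recipes and launch real jobs.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Substrate:
    """Frozen axes: recipe, evaluation, baseline. Field names are illustrative."""
    recipe: str
    eval_suite: str
    baseline_score: float


@dataclass(frozen=True)
class SearchPolicy:
    """The only object the meta-agent may rewrite between rounds."""
    instruction: str


@dataclass(frozen=True)
class Trial:
    mutation: str
    score: float


Worker = Callable[[Substrate, SearchPolicy, int], Trial]
MetaAgent = Callable[[SearchPolicy, List[Trial]], SearchPolicy]


def campaign(substrate: Substrate, policy: SearchPolicy, worker: Worker,
             meta_agent: MetaAgent, rounds: int = 4,
             n_workers: int = 8) -> List[List[Trial]]:
    history: List[List[Trial]] = []
    for _ in range(rounds):
        # Every worker forks the same substrate and sees no earlier trials.
        trials = [worker(substrate, policy, i) for i in range(n_workers)]
        history.append(trials)
        new_policy = meta_agent(policy, trials)
        if not isinstance(new_policy, SearchPolicy):
            raise TypeError("meta-agent may only return a SearchPolicy")
        policy = new_policy  # the substrate is never reassigned
    return history


def toy_worker(substrate: Substrate, policy: SearchPolicy, worker_id: int) -> Trial:
    # Placeholder: a real worker would mutate a recipe and run training.
    return Trial(f"{policy.instruction}#{worker_id}", substrate.baseline_score)


def toy_meta_agent(policy: SearchPolicy, trials: List[Trial]) -> SearchPolicy:
    # Placeholder: a real meta-agent would read the trials before rewriting.
    return SearchPolicy(policy.instruction + "+")


base = Substrate("recipe-v0", "eval-v0", 0.0)
history = campaign(base, SearchPolicy("explore"), toy_worker, toy_meta_agent)
print(len(history), len(history[0]))  # prints 4 8
```

Freezing the dataclasses means an attempt to overwrite the substrate raises an error instead of silently changing what later trials are compared against, while the mutation string stays free for each worker to choose.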
Full mechanics (the immutable substrate, memory-free workers, the constitutionally bounded meta-agent, and the design ablations) are in the tech report.
We'd rather you find these here than discover them later. The result is real and auditable, and it is narrow.
Today, humans build AI in three stages. The path to recursive self-improvement is to hand each stage to an autonomous AI researcher — and let the output make the next round of research better.
A-Evo Lab studies self-evolving agents under one thesis — AI-as-researcher: frontier agents and models play the researcher in the loop that builds better AI. We're building an autonomous researcher for each of the three stages, on one shared stack — A-Evolve — so the whole lab iterates fast. The 30B result on this page is the post-training node.
@techreport{shi2026aevolvetraining,
  title       = {A-Evolve-Training: Autonomous Post-Training of a 30B Model},
  author      = {Shi, Zhan and He, Bing and Sang, Yisi and Lu, Hanqing},
  institution = {A-EVO Lab, Amazon},
  year        = {2026},
  note        = {NVIDIA Nemotron-Reasoning Challenge; public leaderboard standing as of 2026-06-01.}
}