A-Evolve-Training · Autonomous post-training of a 30B model

The result

Closing the gap to the top human, round by round

After the human authored the initial substrate, the loop ran unattended. The leaderboard score never regressed across four rounds and finished at 0.86, one point below the best human team (0.87).

Figure: public leaderboard (LB) and internal dev score by round for the autonomous loop, with the top human score (0.87) as a reference line. Round 0 is the human-authored substrate. Note round 3: dev spikes to 0.93 while LB holds at 0.85; the loop drove its own proxy to record highs without moving the real target. In round 4 it changed strategy and recovered external progress (LB 0.86).

Why this is a different problem

Autonomous ML research has been demonstrated — at GPT-2 scale

Most public autonomous-research demos run at roughly GPT-2-class (~124M) budgets, where an experiment takes minutes and a failed run is free to retry. That "retry cheaply" assumption is what closes the loop. At frontier post-training it breaks: one trial is a multi-week run on an H200 cluster, and the moves that are free at small scale become prohibitive.

Element	GPT-2-class (~124M)	Frontier post-training (30B)	Gap
hypothesis	Narrow space: mostly architecture & optimiser coefficients.	Wide design space: synthetic-data construction, SFT/RL choice, loss design, schedules, data mixture, checkpoint selection.	~10×
execution	Minutes per run; each experiment fixed to a ~5-minute budget.	Multi-week H200 runs over a full training stack, data pipeline, and vLLM evaluation.	~10³×
strategy	Feedback in minutes, so broad sweeping and hill-climbing are affordable.	Feedback per trial after hours/days, noisy and distribution-shifted — budget must be allocated across a few hypotheses.	~10²×
infra	Single PyTorch script, single GPU, one metric.	Multi-H200 orchestration, persistent storage, checkpoint management, eval harness, failure recovery.	~10²×

Ratios are illustrative order-of-magnitude estimates, not measurements.

What closing the loop produced

The score is the headline. These are the reason it matters.

01 · Discovery, not just optimization

The loop found a data-mix move the operator's prior said should hurt.

It chose to 2× upsample one domain (5.9% → 11.1% of the mix), a move the prevailing cannibalisation prior predicts should degrade the other domains. It didn't: the change also lifted performance on domains other than the upsampled one. This is a verifiable result that the operator's prior would have excluded, the kind of outcome a within-prior optimiser is not built to produce.
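The 5.9% → 11.1% shift is what a 2× upsample produces once the mix is renormalised. A minimal sketch of that arithmetic, assuming "2× upsample" means doubling that domain's sample count while every other domain keeps its count (the function name is ours, not the authors'):

```python
def share_after_upsample(share: float, factor: float) -> float:
    """Mix share of one domain after multiplying its sample count by `factor`.

    Assumption: all other domains keep their counts, so the total grows
    and every share is renormalised against the new total.
    """
    if not 0.0 <= share <= 1.0:
        raise ValueError("share must be a fraction in [0, 1]")
    if factor < 0:
        raise ValueError("factor must be non-negative")
    upsampled = share * factor
    rest = 1.0 - share  # combined share of the untouched domains
    return upsampled / (upsampled + rest)


new_share = share_after_upsample(0.059, 2.0)
print(f"{new_share:.1%}")  # 11.1%
```

Under the same assumption, every other domain's share shrinks by the common factor 1/1.059, which is the dilution the cannibalisation prior is concerned with.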

02 · It revised what counts as evidence

Mid-campaign, the loop stopped trusting its own proxy metric.

Early rounds optimized the internal dev score. Later rounds drove that score to record highs (dev 0.93–0.94, a weak domain lifted from ~0.65 to 0.82) — yet the external target didn't move. The loop concluded the easiest proxy to improve was no longer the causal bottleneck, and inverted its objective: it began asking for interventions that might lower the proxy while improving the real target.
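A check for this kind of proxy/target decoupling can be sketched in a few lines. This is an illustration only, not the loop's actual rule: the function name, the threshold `eps`, and the example scores are hypothetical and are not campaign data.

```python
def proxy_decoupled(dev: list[float], lb: list[float], eps: float = 0.005) -> list[int]:
    """Return the round indices where the proxy (dev) rose by more than
    `eps` while the external target (lb) did not.

    `dev` and `lb` hold one score per round, in round order.
    """
    if len(dev) != len(lb):
        raise ValueError("dev and lb must have one score per round")
    flagged = []
    for r in range(1, len(dev)):
        dev_gain = dev[r] - dev[r - 1]
        lb_gain = lb[r] - lb[r - 1]
        if dev_gain > eps and lb_gain <= eps:
            flagged.append(r)
    return flagged


# Synthetic scores for illustration (NOT the campaign's numbers).
print(proxy_decoupled(dev=[0.70, 0.80, 0.90], lb=[0.60, 0.65, 0.65]))  # [2]
```

A flagged round is a signal to stop treating gains on the proxy as evidence, which is the shift described above.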

03 · The loop closes where retrying isn't free

Convergence under a cost structure orders of magnitude harsher.

No "retry cheaply." Each of the four rounds spawned 8 identical full-stack agents that edited recipes, launched real GPU jobs, debugged failures, and selected checkpoints — and the campaign still converged monotonically. We report this as infrastructure evidence that the loop can be made to close at this scale.

System design · the one idea

The obvious design failed. The opposite one worked.

We first built what looked right: specialised data, training, and eval agents handing off mid-states, like a human research team. It did not scale: building on mid-states compounds unobserved variance and corrupts the very signal that selection depends on.

The configuration that worked is the inverse. One immutable, human-audited substrate that every round re-forks and no winner overwrites. Memory-free, identical workers instead of specialised roles. A bounded meta-agent that can only rewrite the next round's search policy — never the substrate it runs on. One principle ties them together:

Asymmetric freedom. Zero degrees of freedom on the axes that must stay invariant for trials to remain comparable — substrate, evaluation, baseline. Maximal freedom on the one axis where exploration pays — which semantic mutation to attempt. Freeze what must hold; spend all the freedom where it counts.
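The shape of this design can be sketched in code. What follows is a toy illustration under our own assumptions, not the A-Evolve implementation: every class, function, and field name is hypothetical, and `evaluate` stands in for the real training and evaluation stack.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Substrate:
    """Human-audited starting point. Frozen: no round can overwrite it."""
    recipe: tuple[tuple[str, str], ...]  # hypothetical (key, value) pairs


@dataclass(frozen=True)
class SearchPolicy:
    """The one thing the meta-agent may rewrite between rounds."""
    mutations: tuple[str, ...]


def run_worker(substrate: Substrate, mutation: str) -> dict[str, str]:
    """Memory-free worker: forks the substrate and applies one mutation.

    It receives no history and no other worker's state.
    """
    recipe = dict(substrate.recipe)  # private fork, substrate is untouched
    recipe["mutation"] = mutation
    return recipe


def run_round(substrate: Substrate, policy: SearchPolicy,
              evaluate: Callable[[dict[str, str]], float]) -> list[tuple[str, float]]:
    """Every trial starts from the same substrate, so scores stay comparable."""
    return [(m, evaluate(run_worker(substrate, m))) for m in policy.mutations]


def next_policy(results: list[tuple[str, float]]) -> SearchPolicy:
    """Bounded meta-agent: returns a new search policy and nothing else.

    Toy rule (an assumption): keep the better-scoring half of the mutations.
    """
    ranked = sorted(results, key=lambda r: r[1], reverse=True)
    keep = ranked[: max(1, len(ranked) // 2)]
    return SearchPolicy(mutations=tuple(name for name, _ in keep))
```

The point of the sketch is the type boundary: `next_policy` has no parameter through which it could reach the substrate, and `Substrate` rejects assignment, so the invariants hold by construction rather than by convention.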

Full mechanics (the immutable substrate, memory-free workers, the constitutionally bounded meta-agent, and the design ablations) are in the tech report.

Limitations — read before you quote us

What this is not evidence of

We'd rather you find these here than discover them later. The result is real and auditable, and it is narrow.

n=1 task

One benchmark. A single public leaderboard is the external anchor. Whether the loop produces out-of-prior findings elsewhere is open.

n=1 base

One base model. A single 30B Nemotron. We don't yet know if the behaviour is a property of the loop or of this checkpoint.

infra only

120B & 550B are infrastructure evidence only. The same system post-trains them end-to-end, but with no public human baseline at that scale, this shows the loop closes there — not that its output is competitive.

The thesis · what comes next

An AI researcher for every stage of building AI.

Today, humans build AI in three stages. The path to recursive self-improvement is to hand each stage to an autonomous AI researcher — and let the output make the next round of research better.

A-Evo Lab studies self-evolving agents under one thesis — AI-as-researcher: frontier agents and models play the researcher in the loop that builds better AI. We're building an autonomous researcher for each of the three stages, on one shared stack — A-Evolve — so the whole lab iterates fast. The 30B result on this page is the post-training node.

Tech report (PDF) · X · GitHub (A-Evolve framework)

Cite

@techreport{shi2026aevolvetraining,
  title       = {A-Evolve-Training: Autonomous Post-Training
                 of a 30B Model},
  author      = {Shi, Zhan and He, Bing and Sang, Yisi and Lu, Hanqing},
  institution = {A-EVO Lab, Amazon},
  year        = {2026},
  note        = {NVIDIA Nemotron-Reasoning Challenge; public
                 leaderboard standing as of 2026-06-01.}
}