AI & AUTOMATION · 2026-06-27

Autoresearch: The AI Agent That Runs Its Own ML Experiments Overnight — and Why It Matters

Autoresearch is Andrej Karpathy's experiment in letting an AI agent do science on its own: it rewrites the training code, trains a model for five minutes, checks if the result improved, keeps or discards the change, and repeats ~100 times while you sleep. Here's what it actually is, how the autonomous loop works, why its constraints are so clever, and what it tells us about the near future of engineering work.

Autoresearch is an experiment by Andrej Karpathy that hands an AI agent a small but real LLM training setup and lets it do science on its own overnight. The agent rewrites the training code, trains a model for a fixed five minutes, checks whether the result got better, keeps the change or throws it away, and repeats — roughly a hundred times while you sleep. You wake up to a log of experiments and, often, a better model than the one you went to bed with. It matters because it's the clearest working glimpse yet of something the field has been circling for years: an AI that doesn't just answer questions but runs the research loop itself.

What autoresearch actually is

Under the hood, autoresearch is a stripped-down, single-GPU version of nanochat — Karpathy's minimal language-model training stack. The whole repo comes down to three files that matter. prepare.py handles the one-time setup (downloading data, training a tokenizer) and never changes. train.py holds the entire model, optimizer, and training loop — and this is the only file the agent is allowed to touch. The third file, program.md, is the interesting one: it's the plain-English instruction set that tells the agent how to behave, and it's the only file the human edits. You're no longer tuning the model by hand. You're writing the operating instructions for the thing that tunes the model.

How the autonomous loop works, step by step

1. The human writes program.md, the baseline instructions that frame how the agent should experiment.
2. The agent reads it and proposes a concrete change to train.py: a new architecture tweak, a different optimizer setting, a batch-size change.
3. It trains the model for a fixed five-minute budget, measured in wall-clock time and excluding startup and compilation.
4. It measures the result with a single metric, val_bpb (validation bits per byte), where lower is better.
5. If the change improved the metric, the agent keeps it; if not, it reverts. Either way, it logs the outcome.
6. It repeats, unattended: about 12 experiments an hour, roughly 100 across a single night.
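
The steps above can be sketched as a simple keep-or-revert loop. This is a minimal illustration, not the code in the autoresearch repo: propose_change and train_and_eval are hypothetical stand-ins for the agent's edit to train.py and the five-minute training run, and the toy scores below are made up so the sketch runs without a GPU.

```python
from typing import Callable, List, Tuple


def research_loop(
    baseline: str,
    propose_change: Callable[[str, int], str],
    train_and_eval: Callable[[str], float],
    n_experiments: int,
) -> Tuple[str, float, List[dict]]:
    """Greedy keep-or-revert search over versions of a training script."""
    best_code = baseline
    best_bpb = train_and_eval(baseline)  # score the starting point first
    log = []
    for i in range(n_experiments):
        candidate = propose_change(best_code, i)  # stand-in for the agent editing train.py
        bpb = train_and_eval(candidate)           # stand-in for the fixed-budget run
        kept = bpb < best_bpb                     # lower val_bpb is better
        if kept:
            best_code, best_bpb = candidate, bpb  # keep the change
        # Otherwise best_code is left untouched, which is the "revert".
        log.append({"experiment": i, "val_bpb": bpb, "kept": kept})
    return best_code, best_bpb, log


# Toy stand-ins (purely illustrative): each "script" maps to a pretend val_bpb.
FAKE_SCORES = {"base": 1.00, "base+a": 0.97, "base+a+b": 0.99, "base+a+c": 0.95}


def toy_propose(code: str, step: int) -> str:
    return code + "+" + "abc"[step]


def toy_eval(code: str) -> float:
    return FAKE_SCORES[code]


best_code, best_bpb, log = research_loop("base", toy_propose, toy_eval, 3)
```

In this toy run the first and third changes are kept and the second is reverted, because its score is worse than the best seen so far. The real system differs in everything that matters (the agent, the GPU, the log format), but the control flow has this shape.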

Why a human researcher and an agent aren't doing the same job

Aspect	Human ML researcher	Autoresearch agent
Throughput	A few experiments a day	~12 per hour, ~100 overnight
Working hours	While awake	Unattended, 24/7
Comparability	Varies run to run	Fixed 5-min budget, vocab-independent metric
The human's role	Run every experiment	Design the 'research org' in program.md
Bottleneck	Researcher's time and focus	Compute, and the quality of your instructions

The constraints are the clever part

It would be easy to dismiss the five-minute cap as a toy limitation. It's the opposite: it's what makes the whole thing work. Because every run gets exactly the same time budget, any two experiments are directly comparable no matter what the agent changed: a bigger model, a smaller batch, a different attention pattern. The fixed budget also means the search is steered toward the best model your specific hardware can train in five minutes, not the best model in the abstract. And val_bpb is chosen deliberately: bits per byte is normalized by the raw byte count of the text rather than the token count, so it is independent of vocabulary size and the agent can't cheat the metric by swapping tokenizers. One file to edit, one clock, one honest number.
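
To make the vocabulary-independence point concrete, here is one common way to compute bits per byte from a summed cross-entropy loss. It is a sketch of the general definition, not code taken from the repo; the function name and the two tokenizers with their loss values are invented for illustration.

```python
import math


def bits_per_byte(total_loss_nats: float, total_bytes: int) -> float:
    """Convert cross-entropy summed over all tokens (in nats) to bits per byte."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    total_bits = total_loss_nats / math.log(2)  # nats -> bits
    return total_bits / total_bytes             # normalize by raw text length


# Two hypothetical tokenizers scoring the same 1,000-byte text.
# Coarse tokenizer: 250 tokens at a mean loss of 2.8 nats per token.
# Fine tokenizer:   500 tokens at a mean loss of 1.4 nats per token.
coarse = bits_per_byte(250 * 2.8, 1000)
fine = bits_per_byte(500 * 1.4, 1000)
```

The per-token losses differ by a factor of two, which would make one model look far better on a per-token metric, yet both come out at the same bits per byte (about 1.01) because the total information needed to encode the text is the same.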

The shift isn't that the agent writes the code. It's that you stop programming the model and start programming the researcher.

Where it's a glimpse of the future — and where it isn't magic

The honest framing is that autoresearch is a deliberately narrow sandbox, and the default program.md is intentionally bare-bones. Left alone, an agent optimizing a single number will happily find a degenerate shortcut, plateau on a local trick, or repeat a mistake it already made. None of that is a flaw in the idea — it's the point. The skill that decides whether the overnight log is full of real progress or noise has moved up a level: it's in how you write the instructions, what guardrails you set, and which agents you put in the loop. The work doesn't disappear. It changes altitude.
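
As one example of what a guardrail could look like, the sketch below covers two of those failure modes: repeating an edit that was already tried, and treating a noise-level change in val_bpb as progress. Everything here is hypothetical, including the ExperimentGuard class and the 0.002 threshold; nothing like it is claimed to exist in the autoresearch repo, and the right threshold would depend on how noisy your own runs are.

```python
import hashlib


class ExperimentGuard:
    """Hypothetical guardrail: skip repeated edits and ignore noise-level gains."""

    def __init__(self, min_gain: float = 0.002):
        self.min_gain = min_gain  # assumed noise floor for val_bpb, tune per setup
        self.seen = set()         # fingerprints of scripts already tried

    def is_new(self, code: str) -> bool:
        """Return True the first time a script is seen, False on repeats."""
        fingerprint = hashlib.sha256(code.encode("utf-8")).hexdigest()
        if fingerprint in self.seen:
            return False
        self.seen.add(fingerprint)
        return True

    def is_real_gain(self, best_bpb: float, new_bpb: float) -> bool:
        """Accept only improvements larger than the assumed noise floor."""
        return (best_bpb - new_bpb) > self.min_gain


guard = ExperimentGuard(min_gain=0.002)
first_try = guard.is_new("lr = 3e-4")       # True: not seen before
repeat_try = guard.is_new("lr = 3e-4")      # False: identical script
tiny_gain = guard.is_real_gain(0.9500, 0.9495)   # False: below the threshold
solid_gain = guard.is_real_gain(0.9500, 0.9400)  # True: clearly above it
```

A check like this does nothing about degenerate shortcuts, which usually need a human reading the log or a second, independent metric.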

The same idea, applied to engineering

What I find compelling is that autoresearch is the same pattern I've been building toward in my own engineering work: a closed loop where the AI acts, the tool reports the truth about whether it worked, and the system improves instead of guessing. That's exactly the philosophy behind my self-improving SolidWorks MCP server: write a macro, run it in live CAD, read the real errors, fix, and remember the fix forever. Autoresearch does it for model training; the MCP loop does it for AI-driven CAD and simulation. Different domains, identical principle: verification plus memory beats one-shot generation every time.

For businesses, the takeaway isn't 'replace your engineers with overnight agents.' It's that the highest-leverage automation now looks like a loop, not a script — something that runs, checks its own work, and compounds. That's the kind of custom automation I build, and if you have a repetitive, measurable process that a closed loop could grind down while your team sleeps, let's talk about it.

Want to try autoresearch yourself?

The project is open source under an MIT license. It targets a single NVIDIA GPU (Karpathy tested on an H100), but the community has already produced forks for macOS, Windows RTX, and AMD, and you can scale the defaults down to train tiny models on far smaller hardware using a low-entropy dataset like TinyStories. Point a coding agent at program.md, disable its permission prompts, and tell it to kick off an experiment — then read the log in the morning. It's the cheapest way I know to feel where agentic engineering is actually heading.

Frequently Asked Questions

What is autoresearch?

Autoresearch is an open-source experiment by Andrej Karpathy in which an AI agent autonomously runs machine-learning experiments: it edits a small LLM training script, trains a model for a fixed five minutes, measures whether the result improved, keeps or discards the change, and repeats around 100 times overnight without human intervention.

Who created autoresearch?

Autoresearch was created by Andrej Karpathy and released publicly in early 2026. It is built on a simplified, single-GPU version of his nanochat language-model training code.

How does the autoresearch agent improve the model?

The agent edits only one file, train.py, which contains the model, optimizer, and training loop. It proposes a change, trains for a fixed five-minute budget, and evaluates the result with val_bpb (validation bits per byte). If the metric improves, it keeps the change; if not, it reverts. The human edits program.md, the instructions that guide the agent.

What is val_bpb and why is it used?

val_bpb is validation bits per byte: a measure of how well the model predicts held-out text, where lower is better. It is used because it is independent of vocabulary size, so results stay comparable across architectural and tokenizer changes and the agent cannot game the metric by changing how the text is tokenized.

Do I need an H100 GPU to run autoresearch?

No. It was tested on a single NVIDIA H100, but community forks support macOS, Windows RTX, and AMD hardware. You can also scale the defaults down — smaller model depth, shorter sequence length, and a low-entropy dataset such as TinyStories — to train small models on modest hardware.

Is the AI really doing research on its own?

Within tight constraints, yes — it generates, tests, and selects experiments unattended. But it is not autonomous in the bigger sense: the human still designs the 'research org' by writing program.md, setting the guardrails, and judging whether the overnight results represent real progress or just metric-gaming.
