WARNING
This article requires an NVIDIA GPU (tested on H100) to run properly.
Imagine this: while you sleep, an AI is busy tuning parameters, training models, and running experiments. You wake up to a stack of experimental logs and a better model—this is autoresearch. It is a minimalist autonomous research framework where an AI Agent independently modifies GPT training scripts, runs experiments, evaluates results, and iterates automatically. All you need to do is watch.
Target Audience
- Developers interested in autonomous AI Research Agents
- Students wanting to understand the LLM training process from an engineering perspective
- Anyone curious to see how AI can improve itself
Core Dependencies and Environment
- Python 3.10+
- NVIDIA GPU (tested on H100)
- uv package manager
Project Structure
autoresearch/
├── prepare.py # Fixed constants, data prep, tokenizer (do not modify)
├── train.py # Model, optimizer, training loop (Agent modifies this)
├── program.md # Agent instructions (Human modifies this)
├── pyproject.toml # Dependencies
└── results.tsv # Experimental records (auto-generated)
The entire repository is intentionally minimalist—only three files truly matter. This design keeps the experiment scope controllable and makes diffs easy to review.
Step-by-Step Tutorial
Step 1: Install uv
First, install the uv package manager:
curl -LsSf https://astral.sh/uv/install.sh | sh
TIP
On Windows, you can use winget install astral-sh.uv or download the installer from the official website.
Step 2: Install Dependencies
Clone the repository and install dependencies:
git clone https://github.com/karpathy/autoresearch.git
cd autoresearch
uv sync
This step installs the required packages, including PyTorch.
Step 3: Prepare Data and Tokenizer
Run the one-time data preparation script:
uv run prepare.py
This will download training data and train a BPE tokenizer, taking about 2 minutes. Data will be saved in ~/.cache/autoresearch/.
Step 4: Run a Baseline Training
Let's run a manual training first to establish a baseline:
uv run train.py
Training runs for exactly 5 minutes (excluding startup and compilation time). The evaluation metric is val_bpb (bits per byte on the validation set)—the lower, the better.
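val_bpb normalizes the model's loss by raw bytes rather than tokens, so results stay comparable even if the tokenizer changes. A minimal sketch of the conversion, assuming the trainer reports mean per-token cross-entropy in nats (the exact formula in train.py may differ):

```python
import math

def bits_per_byte(mean_loss_nats: float, total_tokens: int, total_bytes: int) -> float:
    """Convert mean per-token cross-entropy (nats) to bits per byte.

    Total information = mean_loss_nats * total_tokens nats;
    divide by ln(2) to get bits, then by the raw byte count.
    """
    total_bits = mean_loss_nats * total_tokens / math.log(2)
    return total_bits / total_bytes

# Hypothetical numbers: 2.9 nats/token, ~4.2 bytes per token on average
print(bits_per_byte(2.9, total_tokens=1000, total_bytes=4200))
```

Note the sanity check: a mean loss of exactly ln(2) nats per token on one-byte tokens is 1.0 bits per byte.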
After running, you will see output similar to this:
---
val_bpb: 0.997900
training_seconds: 300.1
total_seconds: 325.9
peak_vram_mb: 45060.2
mfu_percent: 39.80
total_tokens_M: 499.6
num_steps: 953
num_params_M: 50.3
depth: 8
This baseline is our starting point. Now, hand it over to the AI.
Step 5: Connect Your AI Agent
Create a new branch for the experiment:
git checkout -b autoresearch/mar15
Initialize the results file:
echo -e "commit\tval_bpb\tmemory_gb\tstatus\tdescription" > results.tsv
Now start your AI Agent (Claude, Codex, etc., work; remember to disable all unnecessary permissions):
claude
Then tell it this:
Hi have a look at program.md and let's kick off a new experiment! let's do the setup first.
The Agent will read program.md, understand the experimental workflow, and begin autonomously iterating on train.py.
Step 6: Understanding the Experimental Loop
The Agent will continuously loop through these steps:
- Modify `train.py` — Change architecture, tune hyperparameters, swap optimizers, etc.
- Git commit — Save the change.
- Run experiment — `uv run train.py > run.log 2>&1`
- Read results — `grep "^val_bpb:\|^peak_vram_mb:" run.log`
- Log to `results.tsv` — Record the experiment.
- Decision:
  - val_bpb decreased → Keep the commit and continue.
  - val_bpb worsened → `git reset` and try a different approach.
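The grep-and-decide steps above can be sketched in Python. This is a minimal sketch, not the Agent's actual mechanism (it works through shell commands), and the metric-line format is assumed from the sample output earlier:

```python
import re
import subprocess

def parse_metrics(log_text: str) -> dict:
    """Pull 'key: value' metric lines (e.g. val_bpb, peak_vram_mb) out of run.log."""
    metrics = {}
    for m in re.finditer(r"^(\w+):\s*([\d.]+)$", log_text, flags=re.MULTILINE):
        metrics[m.group(1)] = float(m.group(2))
    return metrics

def decide(new_bpb: float, best_bpb: float) -> str:
    """Keep the commit only if val_bpb improved; otherwise discard and reset."""
    return "keep" if new_bpb < best_bpb else "discard"

def run_experiment() -> dict:
    """Run one 5-minute training and return its metrics (not executed here)."""
    with open("run.log", "w") as log:
        subprocess.run(["uv", "run", "train.py"], stdout=log, stderr=subprocess.STDOUT)
    with open("run.log") as log:
        return parse_metrics(log.read())
```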
TIP
The Agent runs fully automatically; don't ask "should I continue"—just let it run. Each experiment takes about 5 minutes, so it can run roughly 12 times per hour.
Core Instruction Analysis for program.md
program.md is the brain of your AI researcher. Key points include:
- Fixed Time Budget — Always 5 minutes to ensure experiments are comparable.
- Single File Modification — Only `train.py` can be changed.
- No New Dependencies — Only packages already in `pyproject.toml` are allowed.
- Simplicity Principle — Minor improvements that add significant complexity are not worth keeping.
The first experiment should always be the baseline to establish a benchmark.
Output Format
Experimental results are recorded in results.tsv, separated by tabs:
commit val_bpb memory_gb status description
a1b2c3d 0.997900 44.0 keep baseline
b2c3d4e 0.993200 44.2 keep increase LR to 0.04
c3d4e5f 1.005000 44.0 discard switch to GeLU activation
Note: Use tabs, not commas—commas can cause issues within descriptions.
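If you'd rather append rows programmatically than with echo, the tab-separated format maps directly onto Python's csv module. A sketch, with column names taken from the header above:

```python
import csv

def log_result(path: str, commit: str, val_bpb: float, memory_gb: float,
               status: str, description: str) -> None:
    """Append one experiment row to results.tsv using real tab delimiters."""
    with open(path, "a", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow([commit, f"{val_bpb:.6f}", f"{memory_gb:.1f}",
                         status, description])

log_result("results.tsv", "b2c3d4e", 0.9932, 44.2, "keep", "increase LR to 0.04")
```

Using the csv module with `delimiter="\t"` also sidesteps quoting headaches if a description ever contains a comma.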
Troubleshooting
1. Data Not Ready
Error: FileNotFoundError cannot find cache files.
Solution: Run uv run prepare.py first to download data and train the tokenizer.
2. Out of Memory (OOM)
Error: CUDA out of memory
Solution: The Agent should detect this and mark it as "crash" in results.tsv. Common fixes:
- Reduce `TOTAL_BATCH_SIZE` in `train.py`.
- Reduce the number of model layers (decrease `DEPTH`).
- Use a shorter sequence length.
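One way to cut memory without changing the effective batch size is to shrink the per-step micro-batch and compensate with gradient accumulation. A sketch of the arithmetic, assuming `TOTAL_BATCH_SIZE` is measured in tokens (how train.py actually defines it may differ):

```python
def grad_accum_steps(total_batch_tokens: int, micro_batch: int, seq_len: int) -> int:
    """Micro-batches to accumulate so that
    micro_batch * seq_len * steps == total_batch_tokens."""
    tokens_per_micro = micro_batch * seq_len
    assert total_batch_tokens % tokens_per_micro == 0, "sizes must divide evenly"
    return total_batch_tokens // tokens_per_micro

# Halving the micro-batch doubles the accumulation steps:
# VRAM drops, the optimizer math stays the same.
print(grad_accum_steps(2**19, micro_batch=32, seq_len=1024))
print(grad_accum_steps(2**19, micro_batch=16, seq_len=1024))
```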
3. Training Timeout
Problem: A run takes more than 10 minutes.
Solution: Manually kill the process. Mark the experiment as "discard" and roll back.
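Rather than killing runaway runs by hand, the Agent's launch step can enforce a wall-clock cap. A hedged sketch (the 10-minute figure comes from above; the command is illustrative):

```python
import subprocess

def run_with_timeout(cmd: list[str], timeout_s: int = 600) -> tuple[bool, str]:
    """Run cmd, returning (finished_in_time, combined_output).

    On timeout the child is killed and the caller should mark the
    experiment "discard" and roll back.
    """
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True,
                              timeout=timeout_s)
        return True, proc.stdout + proc.stderr
    except subprocess.TimeoutExpired:
        return False, ""

# Example: ok, log = run_with_timeout(["uv", "run", "train.py"], timeout_s=600)
```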
4. Agent Not Working
Problem: The Agent reads files but doesn't run experiments.
Solution: Ensure the Agent is started in the correct directory and has permissions to execute shell commands.
5. Always Worse Than Baseline
Problem: All experimental results are similar or worse than the baseline.
Solution:
- Try more aggressive architectural changes (attention mechanisms, activation functions).
- Adjust the learning rate—the default might not be optimal.
- Try different optimizers (e.g., Muon vs. AdamW).
6. Platform Incompatibility
Error: "No CUDA GPUs available"
Solution: This code requires an NVIDIA GPU. For other platforms, use these forks:
- autoresearch-macos — macOS
- autoresearch-mlx — macOS (Apple Silicon)
- autoresearch-win-rtx — Windows
Advanced Directions
Running on Smaller GPUs
If you don't have an H100, try these adjustments:
- Use the TinyStories dataset (narrower domain, effective for small models).
- Reduce `vocab_size` from 8192 to 2048 or even 256 (byte-level).
- Reduce `MAX_SEQ_LEN` in `prepare.py` to 256 or lower.
- Reduce `TOTAL_BATCH_SIZE` to `2**14` (approx. 16K).
- Set `WINDOW_PATTERN` to "L" for simple attention.
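Pulled together, the small-GPU adjustments above might look like this block of constants. This is a sketch only; the exact names come from the list above, but the surrounding defaults are assumptions about prepare.py and train.py:

```python
# prepare.py
VOCAB_SIZE = 2048         # down from 8192; 256 would be pure byte-level
MAX_SEQ_LEN = 256         # shorter sequences shrink attention memory sharply

# train.py
TOTAL_BATCH_SIZE = 2**14  # ~16K tokens per optimizer step
DEPTH = 6                 # fewer layers if memory is still tight (baseline was 8)
WINDOW_PATTERN = "L"      # simple attention, no alternating window pattern
```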
Further Scaling
Once familiar, you can:
- Modify `program.md` to include multiple Agents with different roles.
- Try different datasets.
- Incorporate more complex evaluation metrics.
- Implement multi-GPU training (requires significant changes to `prepare.py`).
Related Resources
- Official Repository
- Karpathy's Announcement Tweet
- nanochat — Parent project supporting more platforms.
- Neural Networks "Dummy's Guide"