WARNING
This article requires an NVIDIA GPU (tested on H100) to run properly.
Imagine this: while you sleep, an AI is busy tuning parameters, training models, and running experiments. You wake up to a stack of experimental logs and a better model—this is autoresearch. It is a minimalist autonomous research framework where an AI Agent independently modifies GPT training scripts, runs experiments, evaluates results, and iterates automatically. All you need to do is watch.
Target Audience
- Developers interested in autonomous AI Research Agents
- Students wanting to understand the LLM training process from an engineering perspective
- Anyone curious to see how AI can improve itself
Core Dependencies and Environment
- Python 3.10+
- NVIDIA GPU (tested on H100)
- uv package manager
Project Structure
autoresearch/
├── prepare.py # Fixed constants, data prep, tokenizer (do not modify)
├── train.py # Model, optimizer, training loop (Agent modifies this)
├── program.md # Agent instructions (Human modifies this)
├── pyproject.toml # Dependencies
└── results.tsv # Experimental records (auto-generated)
The entire repository is intentionally minimalist—only three files truly matter. This design keeps the experiment scope controllable and makes diffs easy to review.
Step-by-Step Tutorial
Step 1: Install uv
First, install the uv package manager:
curl -LsSf https://astral.sh/uv/install.sh | sh
TIP
On Windows, you can use winget install astral-sh.uv or download the installer from the official website.
Step 2: Install Dependencies
Clone the repository and install dependencies:
git clone https://github.com/karpathy/autoresearch.git
cd autoresearch
uv sync
This step installs the required packages, including PyTorch.
Step 3: Prepare Data and Tokenizer
Run the one-time data preparation script:
uv run prepare.py
This will download training data and train a BPE tokenizer, taking about 2 minutes. Data will be saved in ~/.cache/autoresearch/.
Step 4: Run a Baseline Training
Let's run a manual training first to establish a baseline:
uv run train.py
Training runs for exactly 5 minutes (excluding startup and compilation time). The evaluation metric is val_bpb (bits per byte on the validation set)—the lower, the better.
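val_bpb normalizes the model's loss by raw bytes rather than tokens, so results stay comparable even if the tokenizer changes. A minimal sketch of the conversion, assuming the trainer reports mean per-token cross-entropy in nats (the exact formula in train.py may differ):

```python
import math

def bits_per_byte(mean_loss_nats: float, total_tokens: int, total_bytes: int) -> float:
    """Convert mean per-token cross-entropy (nats) to bits per byte.

    Total information = mean_loss_nats * total_tokens nats;
    divide by ln(2) to get bits, then by the raw byte count.
    """
    total_bits = mean_loss_nats * total_tokens / math.log(2)
    return total_bits / total_bytes

# Hypothetical numbers: 2.9 nats/token, ~4.2 bytes per token on average
print(bits_per_byte(2.9, total_tokens=1000, total_bytes=4200))
```

Note the sanity check: a mean loss of exactly ln(2) nats per token on one-byte tokens is 1.0 bits per byte.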
After running, you will see output similar to this:
---
val_bpb: 0.997900
training_seconds: 300.1
total_seconds: 325.9
peak_vram_mb: 45060.2
mfu_percent: 39.80
total_tokens_M: 499.6
num_steps: 953
num_params_M: 50.3
depth: 8
This baseline is our starting point. Now, hand it over to the AI.
Step 5: Connect Your AI Agent
Create a new branch for the experiment:
git checkout -b autoresearch/mar15
Initialize the results file:
echo -e "commit\tval_bpb\tmemory_gb\tstatus\tdescription" > results.tsv
Now start your AI Agent (Claude, Codex, etc., work; remember to disable all unnecessary permissions):
claude
Then tell it this:
Hi have a look at program.md and let's kick off a new experiment! let's do the setup first.
The Agent will read program.md, understand the experimental workflow, and begin autonomously iterating on train.py.
Step 6: Understanding the Experimental Loop
The Agent will continuously loop through these steps:
- Modify `train.py` — Change architecture, tune hyperparameters, swap optimizers, etc.
- Git commit — Save the change.
- Run experiment — `uv run train.py > run.log 2>&1`
- Read results — `grep "^val_bpb:\|^peak_vram_mb:" run.log`
- Log to `results.tsv` — Record the experiment.
- Decision:
  - val_bpb decreased → Keep the commit and continue.
  - val_bpb worsened → `git reset` and try a different approach.
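The grep-and-decide steps above can be sketched in Python. This is a minimal sketch, not the Agent's actual mechanism (it works through shell commands), and the metric-line format is assumed from the sample output earlier:

```python
import re
import subprocess

def parse_metrics(log_text: str) -> dict:
    """Pull 'key: value' metric lines (e.g. val_bpb, peak_vram_mb) out of run.log."""
    metrics = {}
    for m in re.finditer(r"^(\w+):\s*([\d.]+)$", log_text, flags=re.MULTILINE):
        metrics[m.group(1)] = float(m.group(2))
    return metrics

def decide(new_bpb: float, best_bpb: float) -> str:
    """Keep the commit only if val_bpb improved; otherwise discard and reset."""
    return "keep" if new_bpb < best_bpb else "discard"

def run_experiment() -> dict:
    """Run one 5-minute training and return its metrics (not executed here)."""
    with open("run.log", "w") as log:
        subprocess.run(["uv", "run", "train.py"], stdout=log, stderr=subprocess.STDOUT)
    with open("run.log") as log:
        return parse_metrics(log.read())
```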
TIP
The Agent runs fully automatically; don't ask "should I continue"—just let it run. Each experiment takes about 5 minutes, so it can run roughly 12 times per hour.
Core Instruction Analysis for program.md
program.md is the brain of your AI researcher. Key points include:
- Fixed Time Budget — Always 5 minutes to ensure experiments are comparable.
- Single File Modification — Only `train.py` can be changed.
- No New Dependencies — Only packages already in `pyproject.toml` are allowed.
- Simplicity Principle — Minor improvements that add significant complexity are not worth keeping.
The first experiment should always be the baseline to establish a benchmark.
Output Format
Experimental results are recorded in results.tsv, separated by tabs:
commit val_bpb memory_gb status description
a1b2c3d 0.997900 44.0 keep baseline
b2c3d4e 0.993200 44.2 keep increase LR to 0.04
c3d4e5f 1.005000 44.0 discard switch to GeLU activation
Note: Use tabs, not commas—commas can cause issues within descriptions.
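If you'd rather append rows programmatically than with echo, the tab-separated format maps directly onto Python's csv module. A sketch, with column names taken from the header above:

```python
import csv

def log_result(path: str, commit: str, val_bpb: float, memory_gb: float,
               status: str, description: str) -> None:
    """Append one experiment row to results.tsv using real tab delimiters."""
    with open(path, "a", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow([commit, f"{val_bpb:.6f}", f"{memory_gb:.1f}",
                         status, description])

log_result("results.tsv", "b2c3d4e", 0.9932, 44.2, "keep", "increase LR to 0.04")
```

Using the csv module with `delimiter="\t"` also sidesteps quoting headaches if a description ever contains a comma.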
Troubleshooting
1. Data Not Ready
Error: FileNotFoundError cannot find cache files.
Solution: Run uv run prepare.py first to download data and train the tokenizer.
2. Out of Memory (OOM)
Error: CUDA out of memory
Solution: The Agent should detect this and mark it as "crash" in results.tsv. Common fixes:
- Reduce `TOTAL_BATCH_SIZE` in `train.py`.
- Reduce the number of model layers (decrease `DEPTH`).
- Use a shorter sequence length.
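One way to cut memory without changing the effective batch size is to shrink the per-step micro-batch and compensate with gradient accumulation. A sketch of the arithmetic, assuming `TOTAL_BATCH_SIZE` is measured in tokens (how train.py actually defines it may differ):

```python
def grad_accum_steps(total_batch_tokens: int, micro_batch: int, seq_len: int) -> int:
    """Micro-batches to accumulate so that
    micro_batch * seq_len * steps == total_batch_tokens."""
    tokens_per_micro = micro_batch * seq_len
    assert total_batch_tokens % tokens_per_micro == 0, "sizes must divide evenly"
    return total_batch_tokens // tokens_per_micro

# Halving the micro-batch doubles the accumulation steps:
# VRAM drops, the optimizer math stays the same.
print(grad_accum_steps(2**19, micro_batch=32, seq_len=1024))
print(grad_accum_steps(2**19, micro_batch=16, seq_len=1024))
```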
3. Training Timeout
Problem: A run takes more than 10 minutes.
Solution: Manually kill the process. Mark the experiment as "discard" and roll back.
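Rather than killing runaway runs by hand, the Agent's launch step can enforce a wall-clock cap. A hedged sketch (the 10-minute figure comes from above; the command is illustrative):

```python
import subprocess

def run_with_timeout(cmd: list[str], timeout_s: int = 600) -> tuple[bool, str]:
    """Run cmd, returning (finished_in_time, combined_output).

    On timeout the child is killed and the caller should mark the
    experiment "discard" and roll back.
    """
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True,
                              timeout=timeout_s)
        return True, proc.stdout + proc.stderr
    except subprocess.TimeoutExpired:
        return False, ""

# Example: ok, log = run_with_timeout(["uv", "run", "train.py"], timeout_s=600)
```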
4. Agent Not Working
Problem: The Agent reads files but doesn't run experiments.
Solution: Ensure the Agent is started in the correct directory and has permissions to execute shell commands.
5. Always Worse Than Baseline
Problem: All experimental results are similar or worse than the baseline.
Solution:
- Try more aggressive architectural changes (attention mechanisms, activation functions).
- Adjust the learning rate—the default might not be optimal.
- Try different optimizers (e.g., Muon vs. AdamW).
6. Platform Incompatibility
Error: "No CUDA GPUs available"
Solution: This code requires an NVIDIA GPU. For other platforms, use these forks:
- autoresearch-macos — macOS
- autoresearch-mlx — macOS (Apple Silicon)
- autoresearch-win-rtx — Windows
Advanced Directions
Running on Smaller GPUs
If you don't have an H100, try these adjustments:
- Use the TinyStories dataset (narrower domain, effective for small models).
- Reduce `vocab_size` from 8192 to 2048 or even 256 (byte-level).
- Reduce `MAX_SEQ_LEN` in `prepare.py` to 256 or lower.
- Reduce `TOTAL_BATCH_SIZE` to `2**14` (approx. 16K).
- Set `WINDOW_PATTERN` to "L" for simple attention.
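Pulled together, the small-GPU adjustments above might look like this block of constants. This is a sketch only; the exact names come from the list above, but the surrounding defaults are assumptions about prepare.py and train.py:

```python
# prepare.py
VOCAB_SIZE = 2048         # down from 8192; 256 would be pure byte-level
MAX_SEQ_LEN = 256         # shorter sequences shrink attention memory sharply

# train.py
TOTAL_BATCH_SIZE = 2**14  # ~16K tokens per optimizer step
DEPTH = 6                 # fewer layers if memory is still tight (baseline was 8)
WINDOW_PATTERN = "L"      # simple attention, no alternating window pattern
```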
Further Scaling
Once familiar, you can:
- Modify `program.md` to include multiple Agents with different roles.
- Try different datasets.
- Incorporate more complex evaluation metrics.
- Implement multi-GPU training (requires significant changes to `prepare.py`).
Related Resources
- Official Repository
- Karpathy's Announcement Tweet
- nanochat — Parent project supporting more platforms.
- Neural Networks "Dummy's Guide"