Autoresearch: A 5-Minute Guide to AI-Automated GPT Training

March 15, 2026

WARNING

This article requires an NVIDIA GPU (tested on H100) to run properly.

Imagine this: while you sleep, an AI is busy tuning parameters, training models, and running experiments. You wake up to a stack of experiment logs and a better model. This is autoresearch: a minimalist autonomous research framework in which an AI Agent independently modifies a GPT training script, runs experiments, evaluates the results, and iterates, fully automatically. All you have to do is watch.

Target Audience

  • Developers interested in autonomous AI Research Agents
  • Students wanting to understand the LLM training process from an engineering perspective
  • Anyone curious to see how AI can improve itself

Core Dependencies and Environment

  • Python 3.10+
  • NVIDIA GPU (tested on H100)
  • uv package manager

Project Structure

autoresearch/
├── prepare.py      # Fixed constants, data prep, tokenizer (do not modify)
├── train.py        # Model, optimizer, training loop (Agent modifies this)
├── program.md      # Agent instructions (Human modifies this)
├── pyproject.toml  # Dependencies
└── results.tsv     # Experimental records (auto-generated)

The entire repository is intentionally minimalist—only three files truly matter. This design keeps the experiment scope controllable and makes diffs easy to review.

Step-by-Step Tutorial

Step 1: Install uv

First, install the uv package manager:

curl -LsSf https://astral.sh/uv/install.sh | sh

TIP

On Windows, you can use winget install astral-sh.uv or download the installer from the official website.

Step 2: Install Dependencies

Clone the repository and install dependencies:

git clone https://github.com/karpathy/autoresearch.git
cd autoresearch
uv sync

This step installs the required packages, including PyTorch.

Step 3: Prepare Data and Tokenizer

Run the one-time data preparation script:

uv run prepare.py

This will download training data and train a BPE tokenizer, taking about 2 minutes. Data will be saved in ~/.cache/autoresearch/.
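
If you want to sanity-check the cache before training, a small helper like the following works (illustrative only; the path follows the article, and the file names inside the cache may vary):

```python
from pathlib import Path

def cache_dir() -> Path:
    """Where prepare.py stores the data and tokenizer, per the article."""
    return Path.home() / ".cache" / "autoresearch"

def data_ready() -> bool:
    """True once prepare.py has populated the cache directory."""
    d = cache_dir()
    return d.is_dir() and any(d.iterdir())

if not data_ready():
    print("Cache empty: run `uv run prepare.py` first.")
```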

Step 4: Run a Baseline Training

Let's run a manual training first to establish a baseline:

uv run train.py

Training runs for exactly 5 minutes (excluding startup and compilation time). The evaluation metric is val_bpb (bits per byte on the validation set)—the lower, the better.

After running, you will see output similar to this:

---
val_bpb:          0.997900
training_seconds: 300.1
total_seconds:    325.9
peak_vram_mb:     45060.2
mfu_percent:      39.80
total_tokens_M:   499.6
num_steps:        953
num_params_M:     50.3
depth:            8

This baseline is our starting point. Now, hand it over to the AI.

Step 5: Connect Your AI Agent

Create a new branch for the experiment:

git checkout -b autoresearch/mar15

Initialize the results file:

echo -e "commit\tval_bpb\tmemory_gb\tstatus\tdescription" > results.tsv

Now start your AI Agent (Claude, Codex, and similar tools all work; grant it only the permissions it actually needs):

claude

Then tell it this:

Hi have a look at program.md and let's kick off a new experiment! let's do the setup first.

The Agent will read program.md, understand the experimental workflow, and begin autonomously iterating on train.py.

Step 6: Understanding the Experimental Loop

The Agent will continuously loop through these steps:

  1. Modify train.py — Change architecture, tune hyperparameters, swap optimizers, etc.
  2. Git commit — Save the change.
  3. Run experiment — uv run train.py > run.log 2>&1
  4. Read results — grep "^val_bpb:\|^peak_vram_mb:" run.log
  5. Log to results.tsv — Record the experiment.
  6. Decision:
    • val_bpb decreased → Keep the commit and continue.
    • val_bpb worsened → git reset and try a different approach.
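
In code, one iteration of that loop might look like the following sketch (the Agent runs these commands itself; `parse_metrics` and `decide` are illustrative helpers, not part of the repository):

```python
import re
import subprocess

def parse_metrics(log_text: str) -> dict:
    """Pull val_bpb and peak_vram_mb out of run.log, as in step 4."""
    pattern = re.compile(r"^(val_bpb|peak_vram_mb):\s*([\d.]+)", re.MULTILINE)
    return {k: float(v) for k, v in pattern.findall(log_text)}

def decide(new_bpb: float, best_bpb: float) -> str:
    """Keep the commit only if val_bpb improved; lower is better."""
    return "keep" if new_bpb < best_bpb else "discard"

def run_once() -> dict:
    """Step 3: run training and capture stdout/stderr into run.log."""
    with open("run.log", "w") as log:
        subprocess.run(["uv", "run", "train.py"], stdout=log, stderr=subprocess.STDOUT)
    with open("run.log") as log:
        return parse_metrics(log.read())
```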

TIP

The Agent runs fully automatically; don't ask "should I continue"—just let it run. Each experiment takes about 5 minutes, so it can run roughly 12 times per hour.

Core Instruction Analysis for program.md

program.md is the brain of your AI researcher. Key points include:

  • Fixed Time Budget — Always 5 minutes to ensure experiments are comparable.
  • Single File Modification — Only train.py can be changed.
  • No New Dependencies — Only packages already in pyproject.toml are allowed.
  • Simplicity Principle — Minor improvements that add significant complexity are not worth keeping.

The first experiment should always be the baseline to establish a benchmark.

Output Format

Experimental results are recorded in results.tsv, separated by tabs:

commit	val_bpb	memory_gb	status	description
a1b2c3d	0.997900	44.0	keep	baseline
b2c3d4e	0.993200	44.2	keep	increase LR to 0.04
c3d4e5f	1.005000	44.0	discard	switch to GeLU activation

Note: Use tabs, not commas—commas can cause issues within descriptions.
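
A tab-safe way to append a row is the standard csv module with a tab delimiter (a sketch; the column order matches the header above):

```python
import csv

def log_result(path: str, commit: str, val_bpb: float, memory_gb: float,
               status: str, description: str) -> None:
    """Append one experiment row to results.tsv with real tab separators."""
    with open(path, "a", newline="") as f:
        csv.writer(f, delimiter="\t").writerow(
            [commit, f"{val_bpb:.6f}", f"{memory_gb:.1f}", status, description]
        )
```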

Troubleshooting

1. Data Not Ready

Error: FileNotFoundError cannot find cache files.

Solution: Run uv run prepare.py first to download data and train the tokenizer.

2. Out of Memory (OOM)

Error: CUDA out of memory

Solution: The Agent should detect this and mark it as "crash" in results.tsv. Common fixes:

  • Reduce TOTAL_BATCH_SIZE in train.py.
  • Reduce the number of model layers (decrease DEPTH).
  • Use a shorter sequence length.
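
In train.py these fixes amount to shrinking a few constants. The names below follow the article; the values are hypothetical examples, not the repository's defaults:

```python
# Hypothetical memory-saving edits in train.py (example values only)
TOTAL_BATCH_SIZE = 2**17   # smaller global batch cuts activation memory
DEPTH = 6                  # fewer transformer blocks: fewer params, less VRAM
MAX_SEQ_LEN = 512          # shorter sequences shrink the attention footprint
```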

3. Training Timeout

Problem: A run takes more than 10 minutes.

Solution: Manually kill the process. Mark the experiment as "discard" and roll back.

4. Agent Not Working

Problem: The Agent reads files but doesn't run experiments.

Solution: Ensure the Agent is started in the correct directory and has permissions to execute shell commands.

5. Always Worse Than Baseline

Problem: All experimental results are similar or worse than the baseline.

Solution:

  • Try more aggressive architectural changes (attention mechanisms, activation functions).
  • Adjust the learning rate—the default might not be optimal.
  • Try different optimizers (e.g., Muon vs. AdamW).

6. Platform Incompatibility

Error: "No CUDA GPUs available"

Solution: This code requires an NVIDIA GPU; on other platforms, look for a community fork that targets your hardware.

Advanced Directions

Running on Smaller GPUs

If you don't have an H100, try these adjustments:

  1. Use the TinyStories dataset (narrower domain, effective for small models).
  2. Reduce vocab_size from 8192 to 2048 or even 256 (byte-level).
  3. Reduce MAX_SEQ_LEN in prepare.py to 256 or lower.
  4. Reduce TOTAL_BATCH_SIZE to 2**14 (approx. 16K).
  5. Set WINDOW_PATTERN to "L" for simple attention.
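
Items 2 through 5 boil down to a handful of constant changes. The names follow the article (per the list above, MAX_SEQ_LEN and the vocabulary size live in prepare.py, so re-run it after changing them); the values simply mirror the suggestions:

```python
# Illustrative small-GPU constants (values mirror the list above)
VOCAB_SIZE = 2048          # down from 8192; 256 gives byte-level tokens
MAX_SEQ_LEN = 256          # shorter context window
TOTAL_BATCH_SIZE = 2**14   # about 16K tokens per step
WINDOW_PATTERN = "L"       # simple attention everywhere
```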

Further Scaling

Once familiar, you can:

  • Modify program.md to include multiple Agents with different roles.
  • Try different datasets.
  • Incorporate more complex evaluation metrics.
  • Implement multi-GPU training (requires significant changes to prepare.py).