Getting Started with OpenMythos: An Open-Source Implementation of the Claude Mythos Recurrent Reasoning Architecture

April 26, 2026

TIP

GitHub: kyegomez/OpenMythos | PyPI: open-mythos | License: MIT

Project Overview

In April 2026, Anthropic released Claude Mythos—the next-generation reasoning model in the Claude family—showing unprecedented capabilities on tasks such as software engineering and network security (especially discovering and exploiting zero-day vulnerabilities). However, Mythos is currently only made available in a limited way through “Project Glasswing” to a small set of defensive security research partners, and it is far from being fully released to the public.

OpenMythos is the community’s open-source response to this closed architecture: independent developer Kye Gomez reconstructed Mythos’s core architecture from first principles based on publicly available papers and research literature. It is not an official version, and it does not involve leaked weights—it is a theoretical reconstruction designed to help researchers and developers experiment and validate similar ideas.

Its core breakthrough is the Recurrent-Depth Transformer (RDT): the same Transformer weights are executed up to 16 times (configurable) within a single forward pass. Instead of stacking dozens of independently parameterized layers like conventional models, it iteratively performs “thinking” in a continuous latent space. A recurrent model with 770M parameters can reach roughly the quality of a standard 1.3B-parameter Transformer: comparable capability with roughly 60% of the parameters.


Difficulty / Duration / Takeaways

Beginner-friendly: about 30 minutes. You’ll run your first OpenMythos forward pass and learn the essence of recurrent reasoning, the principle behind LTI-stable injection, and how a MoE FFN can widen the model without increasing the number of parameters activated per token.


Target Audience

Developers with a basic understanding of LLM architectures and 1–5 years of experience who want to dig deeper into the “recurrent reasoning” mechanism and its engineering implementation. If you’re interested in any of the following, OpenMythos is a great starting point:

  • Why “more recurrences = deeper reasoning” is actually supported by theory
  • How LTI dynamical constraints prevent recurrent training from diverging
  • How MoE combined with shared recurrent weights finds a new balance between parameter count and compute

Core Dependencies and Environment

  • Python 3.10+
  • PyTorch 2.0+ (CUDA support can accelerate inference)
  • Optional: flash-attn >= 2.8.3 (speeds up GQA attention; IO-optimal)
  • Minimum hardware: a CPU can run the toy demo; full training requires a GPU
# CPU only
pip install open-mythos

# with Flash Attention 2 (requires CUDA + build tools)
pip install "open-mythos[flash]"

Full Project Structure

open_mythos/
├── main.py              # Core: OpenMythos class, MythosConfig, and all architecture components
├── moda.py              # Modules such as MoE / LoRA
├── tokenizer.py         # MythosTokenizer (wraps openai/gpt-oss-20b)
├── variants.py          # Predefined scales from mythos_1b to mythos_1t
├── docs/
│   ├── open_mythos.md   # Complete API reference
│   └── datasets.md      # Training dataset selection recommendations
training/
└── 3b_fine_web_edu.py   # 3B single-GPU / multi-GPU training script
examples/
├── moda_example.py          # MoE FFN + LoRA Adapter demo
└── variants_example.py     # Parameter comparisons across multiple scales
tests/
├── test_main.py             # Unit tests for core modules
├── bench_vs_transformer.py  # Benchmark comparison vs standard Transformer
└── small_benchmark.py       # Small-scale performance benchmark

Step-by-Step Instructions

Step 1 — Install

pip install open-mythos

WARNING

Compiling flash-attn requires the CUDA Toolkit and nvcc. If installation fails in an environment without nvcc, OpenMythos will automatically fall back to the standard PyTorch attention implementation. This does not affect correctness—only speed.

Step 2 — Build the Configuration

All hyperparameters are passed via MythosConfig. A few key fields to understand:

from open_mythos.main import MythosConfig

cfg = MythosConfig(
    vocab_size=1000,        # Vocabulary size; demo uses a small vocab
    dim=256,                # Hidden dimension
    n_heads=8,              # Number of Query heads
    n_kv_heads=2,          # Number of KV heads (GQA: fewer KV heads saves VRAM)
    max_seq_len=128,        # Maximum sequence length (RoPE precompute upper bound)
    max_loop_iters=4,       # Loop depth T; can be increased for inference
    prelude_layers=1,       # Number of standard Transformer layers in the Prelude block
    coda_layers=1,          # Number of standard Transformer layers in the Coda block
    attn_type="mla",        # "mla" or "gqa" (see Step 5)
    n_experts=8,            # Total number of MoE routing experts
    n_shared_experts=1,     # Number of always-on experts (activated for every token)
    n_experts_per_tok=2,    # Top-K experts activated per token
    expert_dim=64,          # Hidden dimension per expert
    lora_rank=8,            # Rank for the depth LoRA adapter
)

All MythosConfig fields have default values; calling MythosConfig() with no arguments yields a moderately sized preset.

Step 3 — Initialize the Model and Run a Forward Pass

import torch
from open_mythos.main import OpenMythos, MythosConfig

cfg = MythosConfig()
model = OpenMythos(cfg)

# Count parameters
total = sum(p.numel() for p in model.parameters())
print(f"Parameters: {total:,}")

# Forward pass: input_ids shape (B, T), logits shape (B, T, vocab_size)
ids = torch.randint(0, cfg.vocab_size, (2, 16))   # batch=2, seq=16
logits = model(ids, n_loops=4)                    # 4-loop recurrent inference
print(f"Logits shape: {logits.shape}")            # torch.Size([2, 16, 32000])

n_loops controls the loop depth. The default value comes from cfg.max_loop_iters, and you can increase it during inference. This is the key RDT feature: depth extrapolation—train with N recurrences, but run with N+k at inference to handle more complex problems.
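To see depth extrapolation mechanically, sweep n_loops on the toy model from above (the weights are untrained, so the logits are meaningless, but the shapes and compute behavior are real):

# Same weights, deeper computation: extra loops add compute, not parameters
for n in (2, 4, 8, 16):
    logits = model(ids, n_loops=n)
    print(f"n_loops={n:2d} -> logits shape {tuple(logits.shape)}")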

Step 4 — Autoregressive Generation

# Generate new tokens (up to 8); loop depth is 8
out = model.generate(ids, max_new_tokens=8, n_loops=8)
print(f"Generated shape: {out.shape}")  # torch.Size([2, 24])

The generate method maintains a KV cache internally. The first forward pass processes the full prompt; each subsequent step decodes one token at a time. temperature controls sampling randomness, and top_k limits the sampling pool:

out = model.generate(
    ids,
    max_new_tokens=64,
    n_loops=16,
    temperature=0.8,   # Lower = more deterministic
    top_k=40,          # 0 disables
)

Step 5 — Choose an Attention Mechanism: MLA vs GQA

# Multi-head Latent Attention (default; recommended for long contexts)
cfg_mla = MythosConfig(
    attn_type="mla",
    kv_lora_rank=512,      # Cached KV latent dimension (smaller = less VRAM)
    q_lora_rank=1536,      # Q compression dimension
    qk_rope_head_dim=64,   # RoPE-enabled per-head dimension
    qk_nope_head_dim=128,  # Per-head dimension without RoPE
    v_head_dim=128,
)

# Grouped Query Attention (VRAM-friendly; efficient with flash-attn)
cfg_gqa = MythosConfig(
    attn_type="gqa",
    n_kv_heads=4,         # Fewer than n_heads; KV cache shrinks by n_heads/n_kv_heads
)

Tradeoffs: MLA compresses KV into low-rank latent vector caches, reducing KV-cache VRAM usage by roughly 10–20×, but K/V must be reconstructed from the latent at each step (one extra linear projection). GQA caches full KV heads directly, and with Flash Attention 2 it achieves IO-optimal performance. At production scale, MLA saves more VRAM; during development, GQA is easier to debug.
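As a back-of-the-envelope check on this tradeoff, here is a rough per-token cache-size estimate. The head counts below are illustrative production-scale values rather than OpenMythos defaults, and the project’s exact cache layout may differ:

# Per-token, per-layer KV cache in bytes, assuming bf16 (2 bytes/element)
n_heads, head_dim = 64, 128        # hypothetical production-scale values
n_kv_heads = 8                     # GQA: KV heads shared across query-head groups
kv_lora_rank, rope_dim = 512, 64   # MLA dims from the config above

mha = 2 * n_heads * head_dim * 2       # full K and V for every head
gqa = 2 * n_kv_heads * head_dim * 2    # K and V for the shared KV heads only
mla = (kv_lora_rank + rope_dim) * 2    # one compressed latent + the RoPE key part

print(f"MHA {mha} B, GQA {gqa} B, MLA {mla} B per token")
# MHA 32768 B, GQA 4096 B, MLA 1152 B: MLA is roughly 28x smaller than full MHA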

Step 6 — Verify Recurrent Stability: Spectral Radius Check

The most common problem during RDT training is recurrent divergence—hidden states grow exponentially with each loop. OpenMythos guarantees stability by construction via LTI-stable injection. The verification method is to check the spectral radius ρ(A):

# Get the diagonal of the discretized state matrix A_discrete
A = model.recurrent.injection.get_A()           # shape (dim,)
# A is diagonal, so its eigenvalues are simply its entries
rho = A.abs().max().item()

print(f"Spectral radius ρ(A) = {rho:.4f}")
assert rho < 1.0, f"Unstable: ρ(A) = {rho:.4f} >= 1"

WARNING

If you find rho >= 1 during custom training, do not try to manually tweak parameters to “fix” it. This indicates your LTI injection parametrization has been bypassed. Check whether you accidentally directly assigned injection.log_A or injection.log_dt.
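For intuition about why the constraint holds by construction, here is a minimal sketch of one such parametrization, an SSM-style discretization consistent with the log_A / log_dt names above (the project’s actual LTIInjection may differ in detail):

import torch
import torch.nn as nn

class StableLTIInjection(nn.Module):
    """Diagonal LTI update h_next = A * h + B * x with |A| < 1 by construction."""
    def __init__(self, dim: int):
        super().__init__()
        # Unconstrained parameters; stability comes from the mapping in get_A
        self.log_A = nn.Parameter(torch.zeros(dim))            # log of decay rate a > 0
        self.log_dt = nn.Parameter(torch.full((dim,), -3.0))   # log of step size dt > 0

    def get_A(self) -> torch.Tensor:
        # Discretizing dh/dt = -a*h gives A = exp(-a * dt); with a, dt > 0 every
        # entry lies in (0, 1), so rho(A) < 1 for any parameter values
        return torch.exp(-torch.exp(self.log_A) * torch.exp(self.log_dt))

Directly assigning injection.log_A or replacing get_A bypasses exactly this mapping, which is why the warning above tells you to check those fields first.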

Step 7 — Use Predefined Scale Variants

Don’t want to handcraft MythosConfig? variants.py provides presets from 1B to 1T:

from open_mythos import mythos_1b, mythos_3b, mythos_10b, mythos_50b, mythos_100b

# Each variant function returns a MythosConfig
cfg = mythos_3b()
model = OpenMythos(cfg)
print(f"Variant 3B: {sum(p.numel() for p in model.parameters()):,} params")
Variant      | dim  | Experts | Loop Depth | Context | Max Output
mythos_1b    | 2048 | 64      | 16         | 4k      | 4k
mythos_3b    | 3072 | 64      | 16         | 4k      | 4k
mythos_10b   | 4096 | 128     | 24         | 8k      | 4k
mythos_100b  | 8192 | 256     | 32         | 1M      | 128k
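To compare presets without allocating real memory, you can build each one on PyTorch’s meta device and count parameters (a sketch; it assumes the model constructor does not pin tensors to a specific device):

import torch
from open_mythos import mythos_1b, mythos_3b, mythos_10b
from open_mythos.main import OpenMythos

# Meta-device tensors carry shapes but no storage, so even large presets are cheap
for make_cfg in (mythos_1b, mythos_3b, mythos_10b):
    with torch.device("meta"):
        model = OpenMythos(make_cfg())
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{make_cfg.__name__}: {n_params / 1e9:.2f}B params")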

Step 8 — Run the Training Script

The project includes a training script for the 3B model on the FineWeb-Edu dataset:

# Single GPU
python training/3b_fine_web_edu.py

# Multi GPU (auto-detect GPU count)
torchrun --nproc_per_node=$(python -c "import torch; print(torch.cuda.device_count())") \
    training/3b_fine_web_edu.py

TIP

By default, training uses the sample-10BT subset (about 10B tokens), which is suitable for quick validation. To train on the larger sample-100BT subset, change the dataset_name parameter in the script to "sample-100BT".

Key training configuration:

# training/3b_fine_web_edu.py (key parameters)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
# Precision: H100/A100 use bfloat16; older GPUs use float16 + GradScaler
# Scheduler: 2000-step warmup → cosine decay
# Tokenizer: openai/gpt-oss-20b via MythosTokenizer
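For reference, a minimal reconstruction of that warmup → cosine schedule with LambdaLR, using the optimizer from the excerpt above (my own sketch, not the script’s exact code; total_steps is a placeholder):

import math
from torch.optim.lr_scheduler import LambdaLR

warmup_steps, total_steps = 2000, 100_000   # total_steps is illustrative

def lr_lambda(step: int) -> float:
    if step < warmup_steps:
        return step / warmup_steps           # linear warmup to the peak LR
    # Cosine decay from the peak LR down to zero over the remaining steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = LambdaLR(optimizer, lr_lambda)   # call scheduler.step() every optimizer step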

Common Issues Troubleshooting

1. VRAM Explosion: The Loop Count × Batch Size Trap

Unlike a standard Transformer where layers are independent, the RDT recurrent stage accumulates the KV state for all tokens into the cache at each loop step. With 16 loops × sequence length 4096 × batch size 8, the memory pressure can be several times that of a normal model.

Solution: validate first with a smaller n_loops (e.g., 4) and increase it gradually once training converges. As you raise the loop count, scale the batch size down proportionally rather than keeping it fixed.
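A quick sanity check on the arithmetic (my own estimate with hypothetical dimensions, assuming bf16 K/V; the real cache layout may differ):

# Each loop step appends K/V for every token, so the recurrent-stage cache
# scales with n_loops on top of batch size and sequence length.
n_loops, seq_len, batch = 16, 4096, 8
n_kv_heads, head_dim, bytes_per_elem = 8, 128, 2   # hypothetical dims, bf16

kv_bytes = 2 * n_loops * seq_len * batch * n_kv_heads * head_dim * bytes_per_elem
print(f"Recurrent KV cache ≈ {kv_bytes / 2**30:.1f} GiB")   # ≈ 2.0 GiB per cached layer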

2. Missing Flash Attention but Want Speedup

When installing flash-attn fails (common on Windows + CUDA setups), OpenMythos automatically falls back to PyTorch’s native attention. This fallback does not affect correctness, only speed. In small-batch scenarios, the gap is usually not large.

3. Spectral Radius Check Fails: Training Diverges

If you changed the parametrization of LTIInjection, or froze the wrong layers during fine-tuning such that the injection parameters are broken, you may see ρ(A) >= 1. Do not simply clip parameters—reset and start over from the default initialization. The LTI constraint relies on the parametrized structure, not post-hoc correction.

4. ACT Early Stopping Leads to Too Few Actual Loops

act_threshold=0.99 means that each position exits early after accumulating 99% probability mass. If you find that the actual loop count is far lower than n_loops on certain tasks, check the token difficulty distribution in the data. Tokens that are too easy may trigger halting early, preventing harder tokens from receiving sufficient depth.
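The halting rule itself is simple. Here is a generic ACT-style sketch of how effective depth follows from per-loop halting probabilities (illustrative, not the project’s internal code):

import torch

def effective_depths(halt_probs: torch.Tensor, act_threshold: float = 0.99) -> torch.Tensor:
    """halt_probs: (n_loops, seq_len) halting probability per loop and position."""
    cum = halt_probs.cumsum(dim=0)                  # cumulative halting mass per position
    depth = (cum < act_threshold).sum(dim=0) + 1    # loops before crossing, plus the halting loop
    return depth.clamp(max=halt_probs.shape[0])     # never-halting positions use all loops

# An easy token puts its mass on loop 1 and exits; a harder one uses full depth
probs = torch.tensor([[0.995, 0.10],
                      [0.004, 0.30],
                      [0.001, 0.60]])               # (3 loops, 2 positions)
print(effective_depths(probs))                      # tensor([1, 3])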

5. MoE Routing Crashes: Some Experts Never Activate

DeepSeek-style auxiliary-loss-free load balancing depends on dynamic adjustment of router_bias. If you train from scratch without invoking this adjustment logic, the router will quickly collapse to activating only a small number of experts (typically the first 1–2). In this project, MoEFFN.forward includes a placeholder for bias adjustment, but a custom training script needs to call the bias update logic periodically.
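A minimal sketch of the kind of periodic bias update this refers to, in the spirit of loss-free balancing (the function name and update rule are illustrative; check MoEFFN for the project’s real hook):

import torch

@torch.no_grad()
def update_router_bias(router_bias: torch.Tensor, expert_load: torch.Tensor,
                       step_size: float = 1e-3) -> None:
    """router_bias: (n_experts,) bias added to router logits for top-k selection.
    expert_load: (n_experts,) fraction of tokens routed to each expert recently."""
    target = 1.0 / expert_load.numel()              # perfectly balanced share
    # Overloaded experts get nudged down, underloaded ones up
    router_bias -= step_size * torch.sign(expert_load - target)

In loss-free balancing the bias typically influences only expert selection, not how expert outputs are weighted, so calling an update like this every few hundred steps rebalances load without perturbing the loss.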

6. Poor Generation Quality: How to Set Temperature and top_k

Scenario                     | temperature  | top_k   | Notes
Code / Math                  | 0.3–0.5      | 10–20   | Low randomness
Creative writing             | 0.7–0.9      | 0 (off) | High diversity
Debug / inference validation | 0.0 (greedy) | 1       | Most deterministic

Further Reading / Advanced Directions

Core Papers

Paper                                             | What It Solves
Loop, Think & Generalize (2025)                   | How recurrent Transformers perform implicit reasoning in latent space
Parcae (Prairie et al., 2026)                     | Scaling laws that guarantee recurrent stability via LTI constraints
DeepSeekMoE (Dai et al., 2024)                    | Fine-grained MoE with shared-expert routing design
DeepSeek-V2 MLA (2024)                            | The KV compression principle of Multi-head Latent Attention
Relaxed Recursive Transformers (Bae et al., 2024) | How LoRA adapters improve recurrent expressiveness without adding too many parameters

Advanced Directions

1. Custom LoRA Adapter: Currently lora_rank is globally uniform. You can change it to per-layer configuration so different recurrent depths use different adaptation strengths.

2. ACT Threshold Scanning: The default 0.99 is an empirical value. Run a grid search over 0.8–0.999—you’ll find that for some tasks, 0.95 can outperform 0.99 (early stopping avoids overthinking).

3. Depth Extrapolation Experiments: Train with a fixed n_loops, then sweep n_loops at inference and plot a curve of “number of recurrences → downstream task accuracy.” In theory, the curve should rise and then saturate, with the remaining gap shrinking roughly exponentially in the loop count (the fixed-point convergence rate set by ρ(A) < 1); this is one of RDT’s most important provable properties.

4. Compare with a Standard Transformer: Use bench_vs_transformer.py to compare the recurrent model against a standard model on compositional/systematic generalization tasks under the same parameter budget. This is the setting where RDT’s advantage is most pronounced.

Updated April 26, 2026