Ponytail: Curing AI Agents’ “Overengineering Illness” and Cutting 54% of Code in Real Tests

Repository: https://github.com/DietrichGebert/ponytail

Low effort: get started in 10 minutes, finish your first benchmark within an hour—so you can finally fix your AI coding “runaway” habit.

Preface

Have you ever experienced this—ask Claude Code to build a date picker. It takes itself very seriously: installs flatpickr, writes a wrapper component, adds a CSS file, then starts discussing time zone handling… and hands you 287 lines.

What you meant to ask for was: “I just need an input box where I can pick a date.” What you got was: “A date-picker framework.”

That’s the AI Agent “overengineering illness”: models are trained to look “professional,” so they automatically add abstraction layers, configuration options, error handling, and test coverage—when what you actually want might be nothing more than a simple <input type="date">.

Ponytail is here to cure it. Open-sourced by Dietrich Gebert, its core idea in one sentence: “The best code is the code that never gets written.”

Ponytail isn’t a model and it’s not an IDE plugin. It’s a rule set in the style of a “lazy senior engineer.” It gives your AI a six-step ladder:

1. Does this actually need to be built?          → No: skip (YAGNI)
2. Can the standard library do it?              → Use it
3. Can a native platform feature do it?        → Use it
4. Can an already-installed dependency do it? → Use it
5. Can it be done in one line?                 → Write one line
6. If all else fails: write the smallest amount of code that works

In real experiments on tiangolo’s full-stack-fastapi-template repository, with 12 feature tickets, n=4, and Haiku 4.5, Ponytail produced this results sheet:

vs no-rules baseline	LOC	tokens	cost	time	safe
Ponytail	-54%	-22%	-20%	-27%	100%
Bare “YAGNI + one-line” prompt	-33%	-14%	-21%	-30%	95%
caveman (cramped prompt)	-20%	+7%	+3%	+2%	100%

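Ponytail delivers the largest reduction in LOC and tokens, and it is the only approach that lowers every metric while keeping safety at 100%. The bare prompt is slightly ahead on cost and time but drops to 95% safe, and caveman actually raises tokens, cost, and time. The date picker goes from 404 lines to 23 lines; the color picker goes from 287 lines to 23 lines, because both directly use the browser’s built-in <input type="date"> and <input type="color">.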

It supports 14 AI programming tools: Claude Code, Codex, Cursor, Windsurf, Cline, Copilot CLI, Aider, Kiro, Zed, CodeWhale, OpenCode, Pi, Gemini CLI, and OpenClaw. Today, we’ll install this rule set into Claude Code and cure your Agent’s “overengineering illness” in 30 seconds.

Target audience

Developers with 1–5 years of experience writing code with AI Agents day-to-day
People who feel that “AI code is too long” and “it drags in a bunch of unused dependencies”
Teams who want consistent coding style, but don’t want to lock everything down with rigid ESLint rules
Small business owners / Tech Leads who care about AI programming costs—and hate token bills

Core dependencies and environment

Node.js 18+ (required: Ponytail lifecycle hooks run on Node; if you use nvm, make sure it’s on the PATH for non-interactive shells)
A supported AI Agent (Claude Code for the demo)
An LLM API key (demo uses Defapi for Claude Haiku 4.5 at half price)
Optional: Python 3.10+ (used when running benchmarks with pandas)
Optional: Git (clone the repo)

TIP

About API key costs: Ponytail itself is open-source and free. But to get your AI Agent running, you’ll need to spend tokens. If you also care about the bill, I strongly recommend Defapi—it offers official half-price Claude, GPT, and Gemini models, with a completely OpenAI / Anthropic–compatible API. Just swap the base URL. The tutorial below shows how to switch.

Full project structure

ponytail/
├── AGENTS.md                  # Core rules (the "lazy philosophy" read by all agents)
├── README.md / README.es.md   # Trilingual README (EN/Chinese/ES)
├── package.json               # Defines the pi-agent package
├── commands/                  # 6 slash commands
│   ├── ponytail.toml          # /ponytail [lite|full|ultra|off]
│   ├── ponytail-review.toml   # /ponytail-review (cuts the current diff)
│   ├── ponytail-audit.toml    # /ponytail-audit (scans the entire repo)
│   ├── ponytail-debt.toml     # /ponytail-debt (collects ponytail: comments)
│   ├── ponytail-gain.toml     # /ponytail-gain (looks at benchmark results)
│   └── ponytail-help.toml     # /ponytail-help
├── skills/                    # 6 skills (mirrors of the commands)
│   ├── ponytail/              # Main rules
│   └── ponytail-review/ ...   # The other 5
├── hooks/                     # Claude / Codex lifecycle hooks
│   ├── ponytail-config.js     # Mode parsing (env + config.json)
│   └── ponytail-instructions.js
├── ponytail-mcp/              # MCP server adapter (for MCP-only hosts)
│   ├── index.js
│   ├── instructions.js
│   └── test/
├── examples/                  # 12 real "overengineering vs one-line" comparisons
│   ├── date-picker.md / color-picker.md   # web built-ins
│   ├── deep-clone.md          # structuredClone
│   ├── debounce.md
│   ├── email-validation.md    # 75 lines → 3 lines
│   └── ... 12 total
├── benchmarks/                # promptfoo + agentic benchmarks
│   ├── promptfooconfig.yaml   # single-round benchmark config
│   ├── benchmark-local.py     # agentic real-repo benchmark
│   ├── agentic/               # 12 ticket scripts
│   └── results/2026-06-18-agentic.md  # Complete data
├── docs/
│   ├── agent-portability.md   # Which agent loads which file
│   └── platform-native.md
├── .openclaw/                 # OpenClaw skill adapter (auto-generated)
├── .cursor/ .windsurf/        # Cursor / Windsurf rule files
├── .clinerules/               # Cline rules
├── .kiro/steering/            # Kiro rules
└── tests/                     # Tests for rule consistency

Step-by-step tutorial

Step 1: Install Ponytail into Claude Code

In Claude Code, Ponytail is a plugin marketplace item—you can set it up in 30 seconds.

# Add Ponytail repo to your plugin marketplace list
/plugin marketplace add DietrichGebert/ponytail

# Install the main skill (one-time setup)
/plugin install ponytail@ponytail

After installation, open a new session. The startup text will display the current mode (default is full). You’ll see output like this:

Ponytail v0.1.0  [full]  Lazy senior dev mode active
1. Need to build?  2. Stdlib?  3. Platform?  4. Installed dep?
5. One line?       6. Minimum that works.

WARNING

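nvm / Nix users note: Claude Code’s lifecycle hooks run in a non-interactive shell, and Node must be on the PATH of that shell. If you use nvm, make sure it is initialized in a startup file that non-interactive shells actually read: zsh reads ~/.zshenv for every shell, while ~/.zshrc is read only by interactive ones, and bash normally reads ~/.bashrc only for interactive shells. Being able to run node -v in your current terminal is not enough.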

If you want to change the strength, input:

/ponytail lite     # light mode (protects steps 1–2)
/ponytail full     # default
/ponytail ultra    # aggressive mode (no abstraction layers at all)
/ponytail off      # off
/ponytail          # no parameters = show current mode

You can also persist it:

# default to ultra (add this line to your shell profile to make it permanent)
export PONYTAIL_DEFAULT_MODE=ultra

Or write a config file at ~/.config/ponytail/config.json (on Windows: %APPDATA%\ponytail\config.json):

{ "defaultMode": "ultra" }
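To make the interplay between the environment variable and the config file concrete, here is a minimal Python sketch of one plausible lookup order. It is not Ponytail's actual implementation (that lives in hooks/ponytail-config.js); the precedence shown (environment variable first, then config file, then the documented default of full) and the function name resolve_mode are assumptions for illustration.

```python
import json
import os
from pathlib import Path

VALID_MODES = {"lite", "full", "ultra", "off"}


def resolve_mode(env=None, config_path=None):
    """Return the effective mode: env var, then config file, then 'full'.

    ASSUMPTION: this precedence is illustrative, not confirmed by Ponytail.
    """
    env = os.environ if env is None else env

    # 1. Environment variable (assumed to win over the config file).
    mode = env.get("PONYTAIL_DEFAULT_MODE", "").strip().lower()
    if mode in VALID_MODES:
        return mode

    # 2. Config file, by default ~/.config/ponytail/config.json.
    if config_path is None:
        config_path = Path.home() / ".config" / "ponytail" / "config.json"
    try:
        data = json.loads(Path(config_path).read_text(encoding="utf-8"))
    except (OSError, ValueError):
        data = {}  # Missing or malformed file: fall through to the default.
    if isinstance(data, dict):
        mode = str(data.get("defaultMode", "")).strip().lower()
        if mode in VALID_MODES:
            return mode

    # 3. Documented default.
    return "full"
```

Calling resolve_mode() with no arguments reads the real environment and the default config path, so it can double as a quick check of which mode a new session would likely start in.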

Step 2: Run a counterexample comparison

Let’s look at the result directly. Prepare two identical prompts:

Prompt A (Ponytail off): first run /ponytail off, then ask

Add a color picker to the settings page

You’ll get an answer of about 287 lines: install a react-color library (or create a custom 5-file component), add prop validation, add onChange throttling, add an accessible label, and wire up CSS variables.

Prompt B (Ponytail on): run /ponytail full first, and ask the exact same thing again.

You’ll get:

// ponytail: browser has one
<input type="color" />

One line. Done.

Ponytail’s AGENTS.md contains this core rule snippet (verbatim from the original):

Before writing any code, stop at the first rung that holds:
1. Does this need to be built at all? (YAGNI)
2. Does the standard library already do this? Use it.
3. Does a native platform feature cover it? Use it.
4. Does an already-installed dependency solve it? Use it.
5. Can this be one line? Make it one line.
6. Only then: write the minimum code that works.

Notice the last line: “Only then”—it doesn’t mean “don’t write,” it means “go through the first 5 steps first.”
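The "stop at the first rung that holds" behavior is easy to picture as an ordered chain of checks. The sketch below is only an illustration of that ordering; the Ticket class and its boolean fields are hypothetical stand-ins for judgments the model makes, not part of Ponytail.

```python
from dataclasses import dataclass


@dataclass
class Ticket:
    """Hypothetical answers to the ladder's questions for one request."""
    needed: bool = True
    stdlib_covers: bool = False
    platform_covers: bool = False
    installed_dep_covers: bool = False
    fits_one_line: bool = False


def first_rung(ticket: Ticket) -> str:
    """Return the first rung that holds, checking them strictly in order."""
    if not ticket.needed:
        return "1: do not build it (YAGNI)"
    if ticket.stdlib_covers:
        return "2: use the standard library"
    if ticket.platform_covers:
        return "3: use the native platform feature"
    if ticket.installed_dep_covers:
        return "4: use the already-installed dependency"
    if ticket.fits_one_line:
        return "5: write one line"
    return "6: write the minimum code that works"


# A color picker is covered by the browser's <input type="color">.
print(first_rung(Ticket(platform_covers=True)))  # 3: use the native platform feature
```

Because the checks run in order, a request that could be solved by both a native feature and a one-liner still resolves to rung 3: earlier rungs always win.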

Step 3: Use Defapi to get the bill down by half

Ponytail reduces code volume, but to run the AI Agent itself, you still need to burn tokens for Claude / GPT. Defapi provides official half-price Claude / GPT / Gemini with a fully compatible API.

We switch to Defapi and run a benchmark:

Step 3.1: Get a Defapi Key

Go to defapi.org, register, grab a key (it starts with dk-), and write it into Ponytail’s .env:

# in the root directory of the ponytail repo
cat > .env <<'EOF'
ANTHROPIC_API_KEY=dk-Your-Defapi-Key
ANTHROPIC_BASE_URL=https://api.defapi.org
EOF

Step 3.2: Verify with curl first

curl -s https://api.defapi.org/api/v1/messages \
  -H "Authorization: Bearer dk-Your-Defapi-Key" \
  -H "content-type: application/json" \
  -d '{
    "model": "anthropic/claude-haiku-4.5",
    "max_tokens": 256,
    "messages": [
      {"role": "user", "content": "Describe the word ponytail using an emoji"}
    ]
  }'

Response:

{
  "id": "msg_01H...",
  "role": "assistant",
  "content": [{"type": "text", "text": "🦄 (it should never happen)"}],
  "usage": {"input_tokens": 22, "output_tokens": 12}
}
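If you would rather script this check than paste curl commands, the same request can be built with Python's standard library. This is a sketch that copies the URL, headers, and body from the curl call above; the helper names build_request and send are made up for this example, and whether Defapi accepts exactly this shape is taken from the tutorial rather than verified here.

```python
import json
import urllib.request

# Copied from the curl example; treat it as an assumption about Defapi.
DEFAPI_MESSAGES_URL = "https://api.defapi.org/api/v1/messages"


def build_request(api_key, prompt, model="anthropic/claude-haiku-4.5", max_tokens=256):
    """Build (but do not send) the same POST request as the curl example."""
    body = {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        DEFAPI_MESSAGES_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "content-type": "application/json",
        },
        method="POST",
    )


def send(request, timeout=30):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Usage (needs a real key and network access):
#   reply = send(build_request("dk-Your-Defapi-Key", "Describe the word ponytail using an emoji"))
#   print(reply["content"][0]["text"])
```

Keeping request construction separate from sending makes it possible to inspect the payload without spending any tokens.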

Step 3.3: Point promptfoo to Defapi

Ponytail ships with promptfoo benchmarks. Edit benchmarks/promptfooconfig.yaml:

providers:
  - id: anthropic:messages:anthropic/claude-haiku-4.5
    config:
      baseURL: https://api.defapi.org
      apiKey: ${ANTHROPIC_API_KEY}

Run:

npx promptfoo@latest eval -c benchmarks/promptfooconfig.yaml

TIP

Defapi key facts:

Compatible with v1/chat/completions (OpenAI protocol)
Compatible with v1/messages (Anthropic protocol)
Compatible with v1beta/models/ (Gemini protocol)
The same dk- key works across Claude / GPT / Gemini
Price examples (USD per million input / output tokens): Claude Sonnet 4.5 official $3 / $15, Defapi $1.5 / $7.5; Claude Haiku 4.5 official $1 / $5, Defapi $0.5 / $2.5

Money saved (estimated from Ponytail benchmark: 12 tickets × n=4 = 48 runs):

Model	Official cost / month	Defapi cost / month	Saved
Claude Sonnet 4.5	~$60	~$30	$30
Claude Haiku 4.5	~$20	~$10	$10
Claude Opus 4.5	~$250	~$125	$125

Ponytail cuts costs by 20%, and Defapi halves what is left. The two reductions multiply rather than add: 0.8 × 0.5 = 0.4, so your real bill drops to about 40% of the original (a 60% saving, or roughly a 2.5x discount).
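Stacked discounts compound multiplicatively, which is easy to get wrong in your head. A small worked calculation, using the 20% and 50% figures quoted in this tutorial (the function name is just for illustration):

```python
def combined_cost_factor(*reductions):
    """Multiply out successive fractional reductions (0.20 means 'cut by 20%')."""
    factor = 1.0
    for reduction in reductions:
        factor *= 1.0 - reduction
    return factor


factor = combined_cost_factor(0.20, 0.50)  # Ponytail, then Defapi
print(f"remaining bill: {factor:.0%}")      # remaining bill: 40%
print(f"total saving:   {1 - factor:.0%}")  # total saving:   60%
print(f"discount:       {1 / factor:.1f}x") # discount:       2.5x
```

The same function works for any number of stacked savings, for example a further reduction from prompt caching if your provider offers one.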

Step 4: Run /ponytail-review to cut the current diff

Guarding code at write-time with Ponytail isn’t enough on its own: you probably already have a backlog of overengineered code. That’s where the review command comes in.

# in a repo with changes
/ponytail-review

It reviews only the current git diff, and outputs in a fixed format:

L12: delete unused `cache` parameter; no caller passes it
L34: stdlib Array.prototype.sort is stable since ES2019; drop `lodash.orderby`
L88: native `URLSearchParams` covers this; remove custom `parseQuery`
L102: yagni `BaseRepository` has one implementation; inline it
L150: shrink loop into `arr.filter(x => x.active).map(x => x.id)`
---
Net removable: 47 lines, 1 dependency

Tag types:

delete: dead code / speculative features
stdlib: reimplementing the standard library
native: work that a platform-native feature (or an already-installed dependency) already covers
yagni: an abstraction layer with only one implementation
shrink: same logic, fewer lines

The last line tells you the “net removable lines”—that’s your tech debt metric.

If the output is “Lean already. Ship.”, your code is already lean enough and you can merge with confidence.
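Because the review output follows a fixed line format, it is straightforward to post-process, for instance to count findings per tag. The parser below is a sketch built only from the sample output shown in this tutorial; the exact grammar Ponytail emits may differ, and the Finding type is a name invented here.

```python
import re
from collections import Counter
from typing import NamedTuple


class Finding(NamedTuple):
    line: int   # line number in the diff, e.g. 12 for "L12:"
    tag: str    # delete / stdlib / native / yagni / shrink
    note: str   # the rest of the line


# ASSUMPTION: format inferred from the sample output above.
FINDING_RE = re.compile(r"^L(\d+): (delete|stdlib|native|yagni|shrink) (.+)$")


def parse_findings(report: str) -> list[Finding]:
    """Extract tagged findings; ignore the separator and summary lines."""
    findings = []
    for raw in report.splitlines():
        match = FINDING_RE.match(raw.strip())
        if match:
            findings.append(Finding(int(match.group(1)), match.group(2), match.group(3)))
    return findings


def count_by_tag(findings: list[Finding]) -> Counter:
    """Count how many findings carry each tag."""
    return Counter(f.tag for f in findings)
```

Feeding it the five-line sample above would yield one finding per tag, which makes it easy to spot whether a codebase's debt is mostly dead code or mostly reinvented standard library.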

Step 5: Run /ponytail-audit to scan the entire repo

Review focuses on the diff; audit focuses on the whole tree.

/ponytail-audit

The output looks similar, but it’s sorted by “what can be cut the most”:

delete src/utils/cache.ts (412 lines) — only used in 1 place; inline
stdlib src/utils/deep-clone.ts — use structuredClone
native src/components/DatePicker/ (287 lines) — <input type="date">
yagni src/repositories/BaseRepository.ts (180 lines) — 1 impl, inline
shrink src/api/users.ts:42-78 — same logic, 60 → 18 lines
---
Net removable: 1,247 lines, 4 dependencies

Practical advice: run review before audit. Review refactors the diff and you merge it; audit then helps you prioritize the next cleanup wave.

Step 6: Enable it in other Agents

The core advantage of Ponytail is: one set of rules, works everywhere. It provides adaptation files for every major AI programming tool:

Codex (CLI mode)

codex plugin marketplace add DietrichGebert/ponytail
codex
# Open /plugins → select Ponytail → install
# Open /hooks → trust two lifecycle hooks → open a new thread

Cursor

Copy .cursor/rules/ponytail.mdc into your project:

cp .cursor/rules/ponytail.mdc ~/your-project/.cursor/rules/

Or install globally:

cp .cursor/rules/ponytail.mdc ~/.cursor/rules/

Windsurf

cp .windsurf/rules/ponytail.md ~/.codeium/windsurf/memories/

GitHub Copilot CLI

copilot plugin marketplace add DietrichGebert/ponytail
copilot plugin install ponytail@ponytail

OpenClaw (if you’re already using it)

# The most elegant one-liner
clawhub install ponytail

Or copy manually:

cp -r .openclaw/skills/ponytail ~/.openclaw/skills/

Gemini CLI

gemini extensions install https://github.com/DietrichGebert/ponytail

Pi / Aider / Kiro / Zed / CodeWhale

These agents directly read AGENTS.md:

# project-level
cp AGENTS.md ~/your-project/AGENTS.md

# global (pi / Aider / CodeWhale can all recognize it)
cp AGENTS.md ~/.pi/AGENTS.md

Complete mapping table (from docs/agent-portability.md):

Agent	Load method	Supports /ponytail commands
Claude Code	marketplace	✅
Codex	marketplace + hooks	✅
OpenCode	plugin + opencode.json	✅
OpenClaw	clawhub	✅
Gemini CLI	extension	✅
Pi	`pi install`	✅
Copilot CLI	plugin	✅ (with `ponytail:` namespace)
Cursor	.cursor/rules/	❌ (read-only rules)
Windsurf	.windsurf/rules/	❌
Cline	.clinerules/	❌
Kiro	.kiro/steering/	❌
Aider	AGENTS.md	❌
Zed	AGENTS.md	❌
CodeWhale	AGENTS.md	❌
GitHub Copilot (editor)	.github/copilot-instructions.md	❌

Step 7: Run your own benchmark

Ponytail’s real data isn’t dreamed up—it’s generated by running benchmarks/benchmark-local.py. You can also pick 5 of your own real tasks and reproduce the experiment.

Step 7.1: Prepare prompts

Edit benchmarks/prompts.json (or use the provided 5 prompts):

[
  { "id": "date-picker", "task": "Add a date picker to the settings page" },
  { "id": "color-picker", "task": "Add a color picker to the settings page" },
  { "id": "email-validate", "task": "Write a Python function that validates email addresses" },
  { "id": "deep-clone", "task": "Deep clone this object: {sample}" },
  { "id": "debounce", "task": "Write a debounce function in JavaScript" }
]
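Before spending tokens on a run, it can help to sanity-check the prompt file. The following sketch validates the structure shown above and fills in the {sample} placeholder; how the real benchmark substitutes {sample} is not documented in this tutorial, so the plain string replacement here is an assumption, as is the function name load_prompts.

```python
import json


def load_prompts(text: str, sample: str = '{"a": [1, 2, {"b": 3}]}') -> list[dict]:
    """Parse prompts JSON, check ids and tasks, and expand {sample}."""
    prompts = json.loads(text)
    if not isinstance(prompts, list):
        raise ValueError("prompts file must contain a JSON array")

    seen = set()
    cleaned = []
    for entry in prompts:
        prompt_id = entry.get("id")
        task = entry.get("task")
        if not prompt_id or not task:
            raise ValueError(f"entry needs both 'id' and 'task': {entry!r}")
        if prompt_id in seen:
            raise ValueError(f"duplicate prompt id: {prompt_id}")
        seen.add(prompt_id)
        # ASSUMPTION: {sample} is a plain-text placeholder.
        cleaned.append({"id": prompt_id, "task": task.replace("{sample}", sample)})
    return cleaned
```

Duplicate ids are rejected because results are reported per prompt, and two entries with the same id would be indistinguishable in the output.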

Step 7.2: Run two arms for comparison

# baseline: nothing added
npx promptfoo@latest eval -c benchmarks/promptfooconfig.yaml

# ponytail: add the skill plugin arm
PONYTAIL_DEFAULT_MODE=full npx promptfoo@latest eval -c benchmarks/promptfooconfig.yaml

Step 7.3: Check the results

The run writes benchmarks/output.json, which records the following for each prompt under each arm:

loc — code line count
tokens — total tokens
cost — USD
time — end-to-end time
passed_safety — whether it passes the safety tests (input validation, error handling, a11y)
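To turn those per-prompt fields into a "vs baseline" table like the one in the preface, you can aggregate per arm and compute percentage changes. The sketch below assumes output.json is a flat list of records, each with an arm name plus the five fields listed above; that layout and the "arm" key are guesses for illustration, so adapt the loading step to the real file.

```python
from collections import defaultdict

METRICS = ("loc", "tokens", "cost", "time")


def summarize(records, baseline="baseline"):
    """Return percent change vs the baseline arm, plus safety pass rate, per arm."""
    totals = defaultdict(lambda: defaultdict(float))
    safety = defaultdict(list)
    for record in records:
        arm = record["arm"]  # ASSUMPTION: each record names its arm
        for metric in METRICS:
            totals[arm][metric] += record[metric]
        safety[arm].append(bool(record["passed_safety"]))

    if baseline not in totals:
        raise ValueError(f"no records for baseline arm {baseline!r}")
    base = totals[baseline]

    summary = {}
    for arm, arm_totals in totals.items():
        if arm == baseline:
            continue
        row = {
            metric: round(100 * (arm_totals[metric] - base[metric]) / base[metric], 1)
            for metric in METRICS
        }
        row["safe"] = round(100 * sum(safety[arm]) / len(safety[arm]), 1)
        summary[arm] = row
    return summary
```

Negative numbers mean the arm used less than baseline. Totals are summed across prompts before the percentage is taken, so one very long prompt weighs more than several short ones; averaging per-prompt percentages instead is an equally defensible choice.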

In most cases, the ponytail arm reduces LOC by 50–80% compared to baseline.

Step 8: Run ponytail-mcp (advanced)

If your AI host can only use MCP (for example, some desktop apps), Ponytail also has an MCP server adapter:

cd ponytail-mcp
npm install
node index.js    # start the stdio MCP server

Add this to your host’s MCP configuration (if the host is not launched from the repo root, use an absolute path to index.js):

{
  "mcpServers": {
    "ponytail": {
      "command": "node",
      "args": ["ponytail-mcp/index.js"]
    }
  }
}

It exposes:

Prompt ponytail: returns rule text, optionally with a mode parameter (lite / full / ultra)
Tool ponytail_instructions: same as above, but includes structuredContent for code-execution style hosts

WARNING

MCP mode is manually triggered: if you select it once in the prompt menu, it takes effect once. It is not automatically injected every turn. If you need always-on behavior, use the plugin mode in Claude Code / Codex, not MCP.

Troubleshooting FAQs

Q1: No response after install—startup text isn’t showing?

99% of the time: Node isn’t on the PATH in a non-interactive shell. Verify with:

# Run this in a new shell (simulate non-interactive)
bash -lc 'node -v'      # bash
zsh -lc 'node -v'       # zsh

If you get command not found, initialize nvm in a file those shells actually read (~/.bash_profile or ~/.profile for bash, ~/.zshenv or ~/.zprofile for zsh), or install a system-wide Node directly:

# macOS
brew install node    # versioned formulae such as node@20 are keg-only and are not linked onto the PATH

# Windows
winget install OpenJS.NodeJS.LTS
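The same check can be scripted, which is handy when onboarding several machines. This sketch shells out exactly as the commands above do and then verifies the Node 18+ requirement; the function names are invented for this example, and it assumes the chosen shell binary exists on the machine.

```python
import re
import subprocess


def parse_node_major(version_output: str):
    """Return the major version from `node -v` output like 'v20.11.1', else None."""
    match = re.match(r"v(\d+)\.", version_output.strip())
    return int(match.group(1)) if match else None


def node_major_in_login_shell(shell: str = "bash"):
    """Run `node -v` via `<shell> -lc`, mirroring the manual check above."""
    try:
        result = subprocess.run(
            [shell, "-lc", "node -v"],
            capture_output=True,
            text=True,
            timeout=15,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None  # Shell missing or hung: treat as "Node not found".
    if result.returncode != 0:
        return None
    return parse_node_major(result.stdout)


def meets_requirement(major, minimum: int = 18) -> bool:
    """True when a Node major version was found and is at least `minimum`."""
    return major is not None and major >= minimum
```

A call such as meets_requirement(node_major_in_login_shell("zsh")) returning False points to the same PATH problem described above, without needing to open a fresh terminal.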

Q2: “I really need that 120-line cache class”—what if it’s a hard requirement?

Two options:

# Temporarily turn it off
/ponytail off

# Or allow it locally: say it explicitly in the prompt
"Build a 120-line cache class, ignore ponytail for this task"

Ponytail is a rule set, not handcuffs. With an explicit override, the model will comply.

Q3: Does Ponytail conflict with ESLint / Prettier?

No—no conflict, different responsibilities:

Ponytail: controls “should we write this at all” (whether an abstraction is needed, whether a dependency should be added, whether a wrapper should exist)
ESLint: controls “is the code correct” (naming, style, unused variables)
Prettier: controls “is it nicely formatted” (indentation, semicolons, line breaks)

Best results come from enabling all three. Ponytail works upstream, deciding how much code gets written in the first place; ESLint / Prettier handle the details downstream.

Q4: How do teams standardize the rules?

AGENTS.md is a repository-level file—just commit it to git:

# in your team repository root
curl -o AGENTS.md https://raw.githubusercontent.com/DietrichGebert/ponytail/main/AGENTS.md
git add AGENTS.md
git commit -m "chore: adopt ponytail team-wide coding rules"

All agents that read AGENTS.md (CodeWhale, Aider, Zed, Pi, Kiro, Codex extension) will automatically follow.

Q5: Does Ponytail degrade safety tests?

No. This is Ponytail’s most critical benchmark metric. In the comparison table, baseline, caveman, and Ponytail all pass 100% of the safety checks; only the bare “YAGNI + one-line” prompt drops to 95%.

Ponytail’s AGENTS.md includes a dedicated section:

Not lazy about: input validation at trust boundaries, error handling that prevents data loss, security, accessibility, the calibration real hardware needs.

Translation: laziness depends on where it is applied. Save effort on business logic and UI plumbing; do not skip input validation, error fallbacks, security, or a11y.

Q6: How do I run it in CI?

Extract the /ponytail-review logic into a standalone script (Ponytail repo’s benchmarks/correctness.test.js provides a reference implementation), then:

# .github/workflows/ponytail.yml
name: ponytail
on: [pull_request]
jobs:
  review:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: node scripts/ponytail-review.js origin/main
        env:
          ANTHROPIC_API_KEY: ${{ secrets.DEFAPI_KEY }}
          ANTHROPIC_BASE_URL: https://api.defapi.org

A failing check on a PR means Ponytail still sees code in the current diff that can be trimmed further.
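How the pass/fail decision is made depends on your script. One simple policy is to fail when the "Net removable" count exceeds a budget. The sketch below shows that policy in Python purely as an illustration: scripts/ponytail-review.js in the workflow is your own script, and the budget parameter and function names here are assumptions, not Ponytail features.

```python
import re

# Matches the summary line, e.g. "Net removable: 1,247 lines, 4 dependencies".
NET_RE = re.compile(r"^Net removable: ([\d,]+) lines?", re.MULTILINE)


def removable_lines(report: str) -> int:
    """Return the removable-line count reported by a review, 0 if already lean."""
    if "Lean already. Ship." in report:
        return 0
    match = NET_RE.search(report)
    if match is None:
        raise ValueError("no 'Net removable' summary found in review output")
    return int(match.group(1).replace(",", ""))


def exit_code(report: str, budget: int = 0) -> int:
    """Return 1 (fail the job) when removable lines exceed the budget, else 0."""
    return 1 if removable_lines(report) > budget else 0
```

A non-zero budget lets a team adopt the check gradually: start generous, then tighten it as the backlog shrinks. Raising an error on an unparseable report keeps a malformed model response from silently passing the gate.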

Q7: MCP mode vs always-on mode—what should I choose?

Look at your host:

Host type	Recommended mode
Claude Code / Codex	always-on (plugin + hook)
OpenCode	always-on (plugin)
Cursor / Windsurf / Cline	always-on (rules files)
Gemini CLI	always-on (extension)
Pi / Aider / Zed	always-on (AGENTS.md)
Desktop apps that only offer MCP prompt menus	MCP (manual trigger)
Fully custom agent framework	MCP + tool mode