Natural selection for prompts, code, and text, powered by LLMs.
Feed it a seed file and fitness criteria. It breeds better versions through intelligent mutation, scores them, and keeps the winners. Repeat until it plateaus or hits your target score.
Works on anything text-based (prompts, code, configs, copy, schemas): if an LLM can judge it, AutoPrompt can evolve it.
```
GEN 0 (seed): 3.2/10 - generic and vague
GEN 1/10 █·· 5.8/10 (+2.6) [42s] - added structure and constraints
GEN 2/10 ·█· 7.1/10 (+1.3) [38s] - defined tone and examples
GEN 3/10 █·· 8.4/10 (+1.3) [45s] - added edge case handling
GEN 4/10 ··· 8.4/10 (=)    [41s]
GEN 5/10 ·█· 9.2/10 (+0.8) [39s] - refined voice constraints

STOP: target score 9.0 reached (9.2)
```
You need one of these CLI tools installed:
- Claude Code: the `claude` CLI
- Codex: the `codex` CLI
- Ollama: run local models (Qwen, Llama, Mistral, etc.)
No API keys needed. No pip install. Just Python 3.10+ and an LLM.
```bash
git clone https://github.com/usmanmughalji/AutoPrompt.git
cd AutoPrompt

# evolve a prompt
python3 autoprompt.py examples/prompt-optimizer/seed.txt \
    examples/prompt-optimizer/criteria.md \
    --target 9.0

# evolve code (with benchmark)
python3 autoprompt.py examples/code-optimizer/seed.py \
    examples/code-optimizer/criteria.md \
    -b "python3 examples/code-optimizer/bench.py {file}"
```

That's it. Output lands in `seed_evolved.txt` (or `seed_evolved.py`).
```bash
# use qwen3.5 (default: 9b)
python3 autoprompt.py examples/prompt-optimizer/seed.txt \
    examples/prompt-optimizer/criteria.md \
    -e ollama --target 9.0

# pick a specific model
python3 autoprompt.py examples/prompt-optimizer/seed.txt \
    examples/prompt-optimizer/criteria.md \
    -e ollama -m qwen3.5:27b

# works with any ollama model
python3 autoprompt.py seed.txt criteria.md -e ollama -m llama3.2:3b
python3 autoprompt.py seed.txt criteria.md -e ollama -m qwen2.5-coder:14b
```

Fully offline. No API keys. No tokens. Just your GPU.
```
┌───────────────────────────────────────────────────┐
│                                                   │
│   seed file ──► mutate (LLM) ──► N variants       │
│                                      │            │
│                             benchmark (optional)  │
│                                      │            │
│                    judge (LLM) ──► scores         │
│                                      │            │
│                     keep best ──► repeat          │
│                                                   │
└───────────────────────────────────────────────────┘
```
- Seed: your starting file (prompt, code, whatever)
- Criteria: a markdown file describing what "better" means
- Mutate: the LLM generates N variations, each trying a different strategy
- Benchmark (optional): run a script to test each mutation (for code)
- Judge: the LLM scores each mutation against your criteria (0-10)
- Select: keep the highest scorer, feed it back into the mutate step
- Stop: when the target score is hit, patience runs out, or generations are done
The LLM learns from history: it sees what worked and what flopped in previous generations, so mutations get smarter over time.
Optimize system prompts, few-shot examples, chain-of-thought templates.

```bash
python3 autoprompt.py my-prompt.txt criteria.md --target 9.0 --patience 3
```

Evolve algorithms, functions, or scripts with optional benchmarks.

```bash
python3 autoprompt.py solver.py criteria.md -b "python3 bench.py {file}"
```

Marketing copy, email templates, documentation: anything with quality criteria.

```bash
python3 autoprompt.py landing-page.md criteria.md -g 5
```

YAML configs, SQL queries, regex patterns: if it's text and has a "better", evolve it.

```bash
python3 autoprompt.py config.yaml criteria.md -e codex
```

| Flag | Description | Default |
|---|---|---|
| `-g, --generations` | Max generations to run | `10` |
| `-n, --population` | Mutations per generation | `3` |
| `-b, --bench` | Benchmark command (`{file}` = candidate path) | None |
| `-e, --engine` | LLM backend: `claude`, `codex`, or `ollama` | `claude` |
| `-m, --model` | Ollama model name (ignored for claude/codex) | `qwen3.5:9b` |
| `--target` | Stop when score reaches this value | None |
| `--patience` | Stop after N generations with no improvement | None |
| `--timeout` | Stop after N seconds total | None |
| `--reasoning` | Codex reasoning effort: `low`, `medium`, or `high` | `medium` |
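The `-b` command runs once per candidate, with `{file}` replaced by the candidate's path. The exact protocol lives in `autoprompt.py`, but a benchmark script in the spirit of the bundled `bench.py` might look like this (hypothetical sketch, assuming the candidate exposes a `sort()` function and that a non-zero exit code marks a failed candidate):

```python
import importlib.util
import random
import sys
import time

def run_bench(path: str) -> int:
    """Load the candidate module at `path`, verify its sort(), time it.
    Returns a shell-style exit code: 0 = pass, 1 = fail (assumed convention)."""
    spec = importlib.util.spec_from_file_location("candidate", path)
    mod = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(mod)

    data = [random.randrange(10_000) for _ in range(2_000)]
    start = time.perf_counter()
    result = mod.sort(list(data))        # pass a copy so retries stay fair
    elapsed = time.perf_counter() - start

    if result != sorted(data):
        print("FAIL: incorrect output")
        return 1                         # broken candidate
    print(f"OK: sorted {len(data)} ints in {elapsed * 1000:.1f} ms")
    return 0

if __name__ == "__main__":
    sys.exit(run_bench(sys.argv[1]))
```

Deterministic checks like this keep the judge honest: correctness comes from the benchmark, and the LLM only scores the subjective parts.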
AutoPrompt stops early when it makes sense:
```bash
# stop when good enough
python3 autoprompt.py seed.txt criteria.md --target 8.5

# stop when stuck
python3 autoprompt.py seed.txt criteria.md --patience 3

# stop after 5 minutes
python3 autoprompt.py seed.txt criteria.md --timeout 300

# combine them
python3 autoprompt.py seed.txt criteria.md --target 9.0 --patience 3 --timeout 600
```

The criteria file is a markdown file that tells the LLM what "better" means. This is the most important part: good criteria = good evolution.
```markdown
# Fitness Criteria: [What You're Evolving]

## Goal
One sentence describing the ideal output.

## Constraints
- Hard rules that must be followed
- Things that are NOT allowed
- Format requirements

## What "better" means (in priority order)
1. **Most important thing** - why it matters
2. **Second priority** - why it matters
3. **Third priority** - why it matters

## Scoring Guide
- 0-2: terrible (describe what this looks like)
- 3-4: below average
- 5-6: decent
- 7-8: good (describe what this looks like)
- 9-10: exceptional (describe what this looks like)
```

The scoring guide is key: it anchors the LLM's judgment so scores stay consistent across generations.
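As a concrete (purely illustrative) instance of that template, a criteria file for evolving a code-review prompt might read:

```markdown
# Fitness Criteria: Code Review Prompt

## Goal
A system prompt that makes the model produce focused, actionable code reviews.

## Constraints
- Every comment must cite the specific line or function it refers to
- No generic praise ("looks good!") without substance
- Output must be a bulleted list

## What "better" means (in priority order)
1. **Actionability** - every comment proposes a concrete fix
2. **Precision** - comments point at exact code, not vibes
3. **Brevity** - nothing a senior reviewer would skip

## Scoring Guide
- 0-2: vague instructions, no output format
- 3-4: format defined but no prioritization
- 5-6: decent structure, weak edge-case guidance
- 7-8: clear priorities and constraints, minor gaps
- 9-10: a reviewer could apply it verbatim with no open questions
```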
Evolves a generic blog post prompt into a production-quality system prompt. No benchmark needed: the LLM judges prompt quality directly.
```bash
python3 autoprompt.py examples/prompt-optimizer/seed.txt \
    examples/prompt-optimizer/criteria.md \
    --target 9.0 --patience 3
```

Evolves a bubble sort into a fast hybrid sorting algorithm. Uses `bench.py` to verify correctness and measure speed.
```bash
python3 autoprompt.py examples/code-optimizer/seed.py \
    examples/code-optimizer/criteria.md \
    -b "python3 examples/code-optimizer/bench.py {file}" \
    --target 8.0
```

- Start with a bad seed: the worse the starting point, the more dramatic the improvement. Makes for better demos too.
- Be specific in criteria: "write well" is useless. "Use active voice, keep sentences under 20 words, include one concrete example per paragraph" is useful.
- Use benchmarks for code: LLM-as-judge works for subjective quality, but for code you want deterministic correctness checks.
- Set patience: `--patience 3` prevents wasting tokens when the LLM has plateaued.
- More population = more exploration: `-n 5` tries more strategies per generation but costs more tokens.
- Check the history: the LLM learns from previous generations. If it keeps trying the same thing, your criteria might be ambiguous.
```
AutoPrompt/
├── autoprompt.py              # the entire engine (~300 lines)
├── examples/
│   ├── prompt-optimizer/      # evolve prompts
│   │   ├── seed.txt           # starting prompt
│   │   └── criteria.md        # what makes a good prompt
│   └── code-optimizer/        # evolve code
│       ├── seed.py            # starting code (bubble sort)
│       ├── criteria.md        # what makes good sorting code
│       └── bench.py           # correctness + speed benchmark
├── LICENSE
└── README.md
```
One file. Zero dependencies. Stdlib only.
Found a bug? Have a cool criteria file? PRs welcome.
MIT. Do whatever you want with it.