A tiny model that teaches itself to get better at anything with a test. On your laptop. 6GB RAM. No cloud. No teacher model. No human feedback.
We discovered that a 0.8B parameter model can meaningfully improve itself by learning from its own failures, using only 6GB of RAM on a MacBook Air. We proved it on code. The technique works anywhere you can automatically verify the output.
```shell
pip install -e .
tinyforge --model models/mlx-q4-qwen35-08b --quick
```

The model tries to solve coding problems. It fails. It sees exactly what failed (which test, which input, what it expected versus what it got). It tries again. When it finds a better solution, we extract the weak-to-strong pair and train the model on it.
This is self-play for code. The "game" is passing tests.
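Concretely, the heart of the loop is a verifier that turns a failing attempt into specific feedback. A minimal sketch (illustrative, not the tinyforge API):

```python
def run_tests(func, tests):
    """Run a candidate against test cases; return (passed_count, failure strings)."""
    failures = []
    for t in tests:
        got = func(*t["input"])
        if got != t["expected"]:
            # Specific evidence: which input, what was expected, what we got.
            failures.append(
                f"input={t['input']} expected={t['expected']!r} got={got!r}"
            )
    return len(tests) - len(failures), failures

# A weak candidate: misses the n % 15 case.
def fizzbuzz(n):
    if n % 3 == 0: return "Fizz"
    if n % 5 == 0: return "Buzz"
    return str(n)

tests = [
    {"input": [3], "expected": "Fizz"},
    {"input": [5], "expected": "Buzz"},
    {"input": [15], "expected": "FizzBuzz"},
]
passed, failures = run_tests(fizzbuzz, tests)
# The failure strings are what gets fed back into the next attempt's prompt.
```

The point of the string format is that it names the exact input and the exact mismatch, not just a score.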
All results are on fresh holdout slices the model never saw during training.
Fresh HumanEval slice 40-47:
| Setup | Public Tests | Hidden HumanEval+ |
|---|---|---|
| Base model, single-pass | 16/50 | 2/8 |
| Repair-trained, single-pass | 28/50 | 4/8 |
| Base model + evolutionary search | 42/50 | 5/8 |
| Repair-trained + evolutionary search | 44/50 | 6/8 |
Single-pass performance improved 75% (16 → 28) just from training on 13 self-generated repair pairs.
Fresh HumanEval slice 56-63 (overnight run):
| Setup | Public Tests | Hidden HumanEval+ |
|---|---|---|
| Base adapter + feedback loop | 42/58 | 4/8 |
| Repair-trained + feedback loop | 47/58 | 5/8 |
The trained model doesn't just write better code; it becomes a better participant in the repair loop. It learns how to fix code when told what's wrong.
On the hardest unseen slice:

| Setup | Public Tests | Hidden HumanEval+ |
|---|---|---|
| Base model, single-pass | 28/70 | 0/8 |
| Repair-trained + search + feedback | 53/70 | 3/8 |
From 0/8 to 3/8 on the hardest unseen problems. Not magic. But real.
| Discovery | Details |
|---|---|
| Fine-tune on a laptop | QLoRA + LoRA rank 4 on 4-bit Qwen3.5-0.8B. Peak: 6.5-13GB. Runs on a MacBook Air M4. |
| Repair > final-answer training | "Here's broken code, fix it" works better than "here's the problem, here's the answer." Small models learn fixing patterns, not memorized solutions. |
| Evolutionary search = biggest jump | Generate multiple attempts, test them, keep the best, mutate using failure feedback. This is where most gains come from. |
| Specific failure feedback matters | `input=[15] expected='FizzBuzz' got='Fizz'` helps more than `3/6 tests passed`. Evidence > scores. |
| Model learns to be a repair partner | Biggest overnight finding: the trained model writes much better code inside the feedback loop. It learned how to respond to test failures. |
| More compute ≠ better results | Wider search, colder sampling, extra generations: we tried them. They didn't reliably help. |
- macOS with Apple Silicon (M1/M2/M3/M4)
- 8GB+ RAM (16GB+ recommended)
- Python 3.10+
```shell
git clone https://github.com/ranausmanai/tinyforge.git
cd tinyforge
pip install -e .
pip install mlx-lm pyyaml
```

Download a small quantized model (~400MB):

```shell
python -c "from mlx_lm import load; load('mlx-community/Qwen2.5-Coder-0.5B-Instruct-4bit')"
```

Or convert the model we trained on:
```shell
python -m mlx_lm.convert \
  --hf-path Qwen/Qwen3.5-0.8B \
  --mlx-path models/mlx-q4-qwen35-08b \
  -q --q-bits 4
```

```shell
# Quick demo (~5-10 minutes, 5 tasks)
tinyforge --model models/mlx-q4-qwen35-08b --quick

# Full run (~45-60 minutes, 20 tasks)
tinyforge --model models/mlx-q4-qwen35-08b
```

What you'll see:

```
tinyforge
A model teaching itself to code better.
───────────────────────────────────────
BEFORE (single-pass baseline)
[1/5] fizzbuzz ............ 4/6
[2/5] valid_parens ........ 2/5
...
EVOLVING (search + feedback)
[1/5] fizzbuzz ............ 6/6 ✓ (gen=2)
...
TRAINING on 8 self-generated repair pairs
Mixing with 40 rehearsal samples...
Training for 20 iterations...
AFTER (trained model + search)
[1/5] fizzbuzz ............ 6/6 ✓
...
Before: 12/26 █████░░░░░ 46%
Search: 22/26 ████████░░ 85%
After:  24/26 █████████░ 92%
```
To run with the trained adapter:

```shell
tinyforge --model models/mlx-q4-qwen35-08b --adapter outputs/biolora6g/grow
```

For each coding task:
- **Seed** → generate `population_size` candidate solutions
- **Evaluate** → run each candidate against test cases in a sandboxed subprocess
- **Select** → keep the top `elites` unique solutions
- **Mutate** → generate new candidates using failure feedback from the best parents
- **Repeat** for `generations` rounds
- **Return** the best solution found
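The steps above can be sketched as follows, with hypothetical `generate` and `score` callables standing in for the model call and the sandboxed test run:

```python
def evolve(seed_prompt, generate, score,
           population_size=4, elites=2, generations=3):
    """Generate-test-select-mutate loop (simplified sketch).

    generate(prompt) -> candidate code string
    score(code) -> (tests_passed, failure_feedback)
    """
    population = [generate(seed_prompt) for _ in range(population_size)]
    best = None
    for _ in range(generations):
        # Evaluate unique candidates and rank by tests passed.
        scored = sorted(((score(c), c) for c in set(population)),
                        key=lambda pair: pair[0][0], reverse=True)
        parents = scored[:elites]
        if best is None or parents[0][0][0] > best[0][0]:
            best = parents[0]
        # Mutate: re-prompt with each parent's code and its failure feedback.
        population = [
            generate(f"{seed_prompt}\n\nPrevious attempt:\n{code}\n"
                     f"Failures:\n{feedback}")
            for (passed, feedback), code in parents
            for _ in range(max(1, population_size // elites))
        ]
    return best[1]
```

With real components, `score` would return the number of passing tests plus the concrete failure strings used to build the mutation prompt.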
After evolution, we pair weak solutions with strong ones:
```
Weak (3/6 tests):                     Strong (6/6 tests):
def fizzbuzz(n):                      def fizzbuzz(n):
    if n % 3 == 0: return "Fizz"          if n % 15 == 0: return "FizzBuzz"
    if n % 5 == 0: return "Buzz"          if n % 3 == 0: return "Fizz"
    return str(n)                         if n % 5 == 0: return "Buzz"
                                          return str(n)
```
The training prompt includes the weak code + test failures. The training target is the strong code. The model learns: "when you see this kind of failure, fix it like this."
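Assembled into a training example, a pair might look like this (the prompt template below is illustrative; the real one lives in `extract.py`):

```python
def make_repair_example(task_prompt, weak_code, failures, strong_code):
    """Pack a weak-to-strong pair into a prompt/completion pair for training."""
    prompt = (
        f"{task_prompt}\n\n"
        f"This attempt fails some tests:\n{weak_code}\n"
        f"Failures:\n" + "\n".join(failures) + "\n\n"
        "Fix the code."
    )
    return {"prompt": prompt, "completion": strong_code}

example = make_repair_example(
    "Write fizzbuzz(n).",
    'def fizzbuzz(n):\n    if n % 3 == 0: return "Fizz"\n    ...',
    ["input=[15] expected='FizzBuzz' got='Fizz'"],
    'def fizzbuzz(n):\n    if n % 15 == 0: return "FizzBuzz"\n    ...',
)
```

The completion is only ever the strong code; the weak code and its failures live in the prompt, so the model learns the mapping from evidence to fix.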
| Parameter | Value |
|---|---|
| Method | MLX LoRA (4-bit quantized base + low-rank adapters) |
| LoRA rank | 4 (extremely small; adds minimal parameters) |
| Optimizer | Adafactor (memory-efficient) |
| Data | Self-generated repair pairs + rehearsal data (prevents forgetting) |
| Iterations | 40 (~2-3 minutes) |
| Peak memory | 6.5-13GB depending on sequence length |
Create a JSONL file with your own tasks:
```json
{"id": "add", "type": "code", "prompt": "Write a function add(a, b) that returns the sum", "entry_point": "add", "tests": [{"input": [1, 2], "expected": 3}, {"input": [-1, 1], "expected": 0}]}
```

Then run:

```shell
tinyforge --model my-model --tasks my_tasks.jsonl
```

See `examples/` for more task formats.
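A rough sketch of how a task like this can be verified in a separate process (tinyforge's `verify.py` handles this for real; the harness below is a simplified illustration, and a bare subprocess is only weak sandboxing):

```python
import json
import subprocess
import sys

def verify(candidate_code, task, timeout=5):
    """Run candidate code plus the task's tests in a fresh Python process.

    Returns a list of booleans, one per test, or [] if the harness crashed.
    """
    harness = (
        candidate_code
        + "\nimport json as _json\n"
        + "_results = []\n"
        + f"for _t in {json.dumps(task['tests'])}:\n"
        + "    try:\n"
        + f"        _got = {task['entry_point']}(*_t['input'])\n"
        + "        _results.append(_got == _t['expected'])\n"
        + "    except Exception:\n"
        + "        _results.append(False)\n"
        + "print(_json.dumps(_results))\n"
    )
    proc = subprocess.run([sys.executable, "-c", harness],
                          capture_output=True, text=True, timeout=timeout)
    return json.loads(proc.stdout) if proc.returncode == 0 else []

task = {"id": "add", "entry_point": "add",
        "tests": [{"input": [1, 2], "expected": 3},
                  {"input": [-1, 1], "expected": 0}]}
results = verify("def add(a, b):\n    return a + b", task)
```

Running the candidate in a child process means an infinite loop or crash in generated code can't take down the main loop; the `timeout` bounds runaway candidates.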
```
src/finetuneqwen/
  tinyforge/                        # The self-improvement engine
    cli.py                          # Main CLI & demo experience
    evolve.py                       # Evolutionary search
    verify.py                       # Test verification (sandboxed)
    extract.py                      # Repair pair extraction
    builtin_tasks.py                # 20 built-in coding problems
    task.py                         # Task definition
  # Original research pipeline
  humaneval_evolutionary_eval.py    # HumanEval-specific evolutionary eval
  export_humaneval_repair_pairs.py  # HumanEval repair data extraction
  train_qlora.py                    # PyTorch QLoRA training
  prepare_data.py                   # Dataset preparation
configs/                            # MLX training configs
scripts/                            # Shell scripts for experiments
data/                               # Training data (repair pairs, rehearsal)
examples/                           # Example task files
results/                            # Evaluation summaries
```
```shell
# Evolutionary search with base adapter
bash scripts/50_humaneval_evo.sh \
  models/mlx-q4-qwen35-08b \
  outputs/biolora6g/grow \
  results/humaneval_repair_v1 \
  8 40 4 3 2 0.6 1337 base_evo_40_47 1

# With repair adapter
bash scripts/50_humaneval_evo.sh \
  models/mlx-q4-qwen35-08b \
  outputs/humaneval_repair_v1/adapter \
  results/humaneval_repair_v1 \
  8 40 4 3 2 0.6 1337 repair_evo_40_47 1
```

Build the repair dataset:

```shell
bash scripts/60_humaneval_build_repair_v1.sh
```

Train on it:

```shell
python -m mlx_lm.lora \
  --model models/mlx-q4-qwen35-08b \
  --data data/humaneval_repair_v1/mlx \
  --config configs/biolora6g/humaneval_repair_v1_qwen35_08b.yaml \
  --adapter-path outputs/humaneval_repair_v1/adapter \
  --resume-adapter-file outputs/biolora6g/grow/adapters.safetensors
```

- This is not a magic one-shot model. The gains come from the system (model + search + feedback), not from the model alone.
- Tested on HumanEval slices, not full benchmarks. Results are real but scoped.
- Transfer is stronger on harder problems and weaker on easy ones (where the base model is already decent).
- A 0.8B model will never match GPT-4. That's not the point.
If a 0.8B model can teach itself to code better on a laptop with 6GB RAM, what happens when you run this technique on a 7B model? A 70B model? With a real GPU cluster?
The technique is the contribution, not the model size.
Self-play works for Go (AlphaGo). Self-play works for math (AlphaProof). We're showing it works for code, on hardware anyone can afford.
The core loop is simple: generate something → check if it's good → learn from what went wrong → try again. We proved it on code. But it works anywhere you can measure whether the output is right or wrong.
| Domain | What's the "test"? |
|---|---|
| SQL queries | Run the query, check if the results match |
| Math proofs | Verify the proof mechanically (Lean, Coq, SymPy) |
| Chip design (Verilog/VHDL) | Simulate the circuit against test benches |
| Drug molecules | Check properties via physics simulation |
| Security testing | Did the generated input crash the software? |
| Robotics | Run the policy in simulation, measure the score |
| Data pipelines / ETL | Input data in, expected data out; compare |
And in domains where the "test" is a real-world metric:

| Domain | What's the "test"? |
|---|---|
| Ad copy | A/B test it: did people click? Click rate is automatic feedback |
| Email outreach | Did they open? Did they reply? Open/reply rate is the verifier |
| SEO content | Did it rank? Did it get traffic? Google tells you |
| Product descriptions | Conversion rate: did people buy? |
| Social media posts | Engagement. Likes, shares, saves: all measurable |
| Teaching / tutoring | Did the student get the next question right? |
| Headlines & hooks | Scroll-stop rate. You already track this |
The idea: if you can score the output, even roughly, the model can learn from its own mistakes. No labeled dataset needed. No expert in the loop. The model makes its own training data by failing and fixing.
- A task: what you want the model to produce
- A way to check: any script/metric that says "good" or "bad" (and ideally, why)
- A small model: even 0.8B parameters is enough to start
That's it. The model teaches itself. On your laptop. Your data never leaves your machine.
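In code, that contract is just one function: map an output to a score and, ideally, a reason. A minimal sketch with a hypothetical `run_query` hook standing in for whatever executes the output in your domain:

```python
from typing import Callable, NamedTuple

class Verdict(NamedTuple):
    score: float   # higher is better
    feedback: str  # why it scored that way; fed back into the next prompt

# Any scoring function with this shape can drive the loop:
# run SQL, check a proof, simulate a circuit, compare pipeline output...
Verifier = Callable[[str], Verdict]

def sql_row_count_verifier(expected_rows: int, run_query) -> Verifier:
    """Toy verifier: score a SQL query by whether it returns the expected
    number of rows. `run_query` is a hypothetical hook that executes the
    query against a test database and returns the rows."""
    def check(query: str) -> Verdict:
        rows = run_query(query)
        ok = len(rows) == expected_rows
        return Verdict(
            1.0 if ok else 0.0,
            "ok" if ok else f"expected {expected_rows} rows, got {len(rows)}",
        )
    return check
```

The feedback string matters as much as the score: "expected 2 rows, got 3" gives the model something concrete to fix, where a bare 0.0 does not.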
| Phase | Peak Memory | Time (MacBook Air M4) |
|---|---|---|
| Evolutionary search (8 tasks) | ~4GB | ~10 min |
| Repair pair extraction | ~100MB | seconds |
| LoRA training (40 iters) | 6.5-13GB | ~3 min |
| Full tinyforge demo (5 tasks) | ~10GB | ~8 min |
MIT. Do whatever you want with it.
If you use this technique in your work:
```bibtex
@misc{tinyforge2025,
  title  = {tinyforge: Self-improving tiny language models through test-driven repair training},
  author = {Usman Muhammad},
  year   = {2025},
  url    = {https://github.com/ranausmanai/tinyforge}
}
```

Built on a MacBook Air M4 with 24GB RAM.
No GPUs were harmed in the making of this project.