🔥 tinyforge

A tiny model that teaches itself to get better at anything with a test. On your laptop. 6GB RAM. No cloud. No teacher model. No human feedback.


We discovered that a 0.8B-parameter model can meaningfully improve itself by learning from its own failures, using only 6GB of RAM on a MacBook Air. We proved it on code. The technique works anywhere you can automatically verify the output.


pip install -e .
tinyforge --model models/mlx-q4-qwen35-08b --quick

🧠 What this actually does

The model tries to solve coding problems. It fails. It sees exactly what failed (which test, what input, what it expected vs what it got). It tries again. When it finds a better solution, we extract the weak-to-strong pair and train the model on it.

This is self-play for code. The "game" is passing tests.
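To make the verification step concrete, here is a minimal sketch of collecting failure evidence. This is illustrative only: the repo's actual `verify.py` runs candidates in a sandboxed subprocess, and `run_tests` is a name invented for this example.

```python
def run_tests(func, tests):
    """Run a candidate function against test cases and collect failure evidence."""
    failures = []
    for case in tests:
        try:
            got = func(*case["input"])
        except Exception as exc:  # a crash is also evidence
            failures.append(f"input={case['input']} raised {exc!r}")
            continue
        if got != case["expected"]:
            # Specific evidence: which input, what was expected, what we got.
            failures.append(
                f"input={case['input']} expected={case['expected']!r} got={got!r}"
            )
    return failures  # empty list means all tests passed

# A weak fizzbuzz that checks 3 before 15:
def fizzbuzz(n):
    if n % 3 == 0: return "Fizz"
    if n % 5 == 0: return "Buzz"
    return str(n)

evidence = run_tests(fizzbuzz, [
    {"input": [3], "expected": "Fizz"},
    {"input": [15], "expected": "FizzBuzz"},
])
print(evidence)  # ["input=[15] expected='FizzBuzz' got='Fizz'"]
```

That evidence string is exactly what gets fed back to the model on the next attempt.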


tinyforge architecture

📊 Key Results

All results are on fresh holdout slices the model never saw during training.

🔧 The repair training effect

Fresh HumanEval slice 40-47:

| Setup | Public Tests | Hidden HumanEval+ |
| --- | --- | --- |
| Base model, single-pass | 16/50 | 2/8 |
| Repair-trained, single-pass | 28/50 | 4/8 |
| Base model + evolutionary search | 42/50 | 5/8 |
| Repair-trained + evolutionary search | 44/50 | 6/8 |

📈 Single-pass improved 75% (16 → 28) just from training on 13 self-generated repair pairs.

🔄 The feedback loop effect

Fresh HumanEval slice 56-63 (overnight run):

| Setup | Public Tests | Hidden HumanEval+ |
| --- | --- | --- |
| Base adapter + feedback loop | 42/58 | 4/8 |
| Repair-trained + feedback loop | 47/58 | 5/8 |

💡 The trained model doesn't just write better code; it becomes a better participant in the repair loop. It learns how to fix code when told what's wrong.

πŸ”οΈ Hard problems (slice 72-79)

| Setup | Public Tests | Hidden HumanEval+ |
| --- | --- | --- |
| Base model, single-pass | 28/70 | 0/8 |
| Repair-trained + search + feedback | 53/70 | 3/8 |

From 0/8 to 3/8 on the hardest unseen problems. Not magic. But real.


🔬 What We Discovered

| Discovery | Details |
| --- | --- |
| 🖥️ Fine-tune on a laptop | QLoRA + LoRA rank 4 on 4-bit Qwen3.5-0.8B. Peak: 6.5-13GB. Runs on a MacBook Air M4. |
| 🔧 Repair > final-answer training | "Here's broken code, fix it" works better than "here's the problem, here's the answer." Small models learn fixing patterns, not memorized solutions. |
| 🧬 Evolutionary search = biggest jump | Generate multiple attempts, test them, keep the best, mutate using failure feedback. This is where most gains come from. |
| 🎯 Specific failure feedback matters | `input=[15] expected='FizzBuzz' got='Fizz'` helps more than `3/6 tests passed`. Evidence > scores. |
| 🤝 Model learns to be a repair partner | Biggest overnight finding: the trained model writes much better code inside the feedback loop. It learned how to respond to test failures. |
| ⚠️ More compute ≠ better results | Wider search, colder sampling, extra generalization: we tried them. They didn't reliably help. |

⚡ Quick Start

Requirements

  • 🍎 macOS with Apple Silicon (M1/M2/M3/M4)
  • 💾 8GB+ RAM (16GB+ recommended)
  • 🐍 Python 3.10+

Install

git clone https://github.com/ranausmanai/tinyforge.git
cd tinyforge
pip install -e .
pip install mlx-lm pyyaml

Download a base model

# Quantized Qwen2.5-Coder-0.5B-Instruct (smallest, ~400MB)
python -c "from mlx_lm import load; load('mlx-community/Qwen2.5-Coder-0.5B-Instruct-4bit')"

Or convert the model we trained on:

python -m mlx_lm.convert \
  --hf-path Qwen/Qwen3.5-0.8B \
  --mlx-path models/mlx-q4-qwen35-08b \
  -q --q-bits 4

Run

# ⚡ Quick demo (~5-10 minutes, 5 tasks)
tinyforge --model models/mlx-q4-qwen35-08b --quick

# 🔥 Full run (~45-60 minutes, 20 tasks)
tinyforge --model models/mlx-q4-qwen35-08b

What you'll see

🔥 tinyforge
   A model teaching itself to code better.

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

📝 BEFORE (single-pass baseline)
   [1/5] fizzbuzz ............ 4/6
   [2/5] valid_parens ........ 2/5
   ...

🧬 EVOLVING (search + feedback)
   [1/5] fizzbuzz ............ 6/6  ✓  (gen=2)
   ...

🎓 TRAINING on 8 self-generated repair pairs
   Mixing with 40 rehearsal samples...
   Training for 20 iterations...

🏁 AFTER (trained model + search)
   [1/5] fizzbuzz ............ 6/6  ✓
   ...

   Before:  12/26  ████░░░░░░  46%
   Search:  22/26  ████████░░  85%
   After:   24/26  █████████░  92%

Use with your own adapter

tinyforge --model models/mlx-q4-qwen35-08b --adapter outputs/biolora6g/grow

βš™οΈ How It Works

🧬 The evolutionary loop

For each coding task:

  1. 🌱 Seed – Generate `population_size` candidate solutions
  2. 🧪 Evaluate – Run each candidate against test cases in a sandboxed subprocess
  3. 🏆 Select – Keep the top `elites` unique solutions
  4. 🔀 Mutate – Generate new candidates using failure feedback from the best parents
  5. 🔁 Repeat for `generations` rounds
  6. ✅ Return the best solution found
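The steps above can be sketched in a few lines, where `generate` (the model call) and `run_tests` (the verifier returning failure evidence) are hypothetical stand-ins for the real components:

```python
def evolve(task, generate, run_tests, population_size=4, elites=2, generations=3):
    """Seed -> evaluate -> select -> mutate, repeated for a few generations."""
    # 1. Seed: initial candidates from the bare task prompt.
    population = [generate(task["prompt"]) for _ in range(population_size)]
    for _ in range(generations):
        # 2. Evaluate: fewer failing tests is better; dedupe identical code.
        scored = sorted((len(run_tests(code, task["tests"])), code)
                        for code in set(population))
        # 6. Return the best solution as soon as every test passes.
        if scored[0][0] == 0:
            return scored[0][1]
        # 3. Select: keep the top `elites` unique solutions as parents.
        parents = scored[:elites]
        # 4. Mutate: regenerate using each parent's failure feedback.
        population = [generate(task["prompt"], parent=code,
                               feedback=run_tests(code, task["tests"]))
                      for _, code in parents
                      for _ in range(population_size // elites)]
    # Out of generations: return the least-failing candidate found.
    return min((len(run_tests(c, task["tests"])), c) for c in population)[1]
```

The real `evolve.py` adds details (temperature schedules, dedup heuristics), but the selection pressure is the same: test results are the fitness function.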

🔧 Repair pair extraction

After evolution, we pair weak solutions with strong ones:

Weak (3/6 tests):                    Strong (6/6 tests):
def fizzbuzz(n):                     def fizzbuzz(n):
    if n % 3 == 0: return "Fizz"         if n % 15 == 0: return "FizzBuzz"
    if n % 5 == 0: return "Buzz"         if n % 3 == 0: return "Fizz"
    return str(n)                        if n % 5 == 0: return "Buzz"
                                         return str(n)

The training prompt includes the weak code + test failures. The training target is the strong code. The model learns: "when you see this kind of failure, fix it like this."
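Assembling such a pair might look like this sketch; the field names and prompt wording here are illustrative, not the repo's exact schema:

```python
def make_repair_pair(task_prompt, weak_code, failures, strong_code):
    """Training prompt = task + weak code + failure evidence; target = strong code."""
    prompt = (
        f"{task_prompt}\n\n"
        f"This attempt fails some tests:\n{weak_code}\n\n"
        "Failing tests:\n" + "\n".join(failures) + "\n\n"
        "Fix the code."
    )
    return {"prompt": prompt, "completion": strong_code}

pair = make_repair_pair(
    "Write fizzbuzz(n).",
    "def fizzbuzz(n):\n    if n % 3 == 0: return 'Fizz'\n    # (misses 15)",
    ["input=[15] expected='FizzBuzz' got='Fizz'"],
    "def fizzbuzz(n):\n    if n % 15 == 0: return 'FizzBuzz'\n    # ...",
)
```

Because the weak code and its failures sit in the prompt, gradient updates reward the *transformation* from broken to fixed, not just the final answer.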

🎓 Training

| Parameter | Value |
| --- | --- |
| Method | MLX LoRA (4-bit quantized base + low-rank adapters) |
| LoRA rank | 4 (extremely small, adds minimal parameters) |
| Optimizer | Adafactor (memory-efficient) |
| Data | Self-generated repair pairs + rehearsal data (prevents forgetting) |
| Iterations | 40 (~2-3 minutes) |
| Peak memory | 6.5-13GB depending on sequence length |

πŸ“ Custom Tasks

Create a JSONL file with your own tasks:

{"id": "add", "type": "code", "prompt": "Write a function add(a, b) that returns the sum", "entry_point": "add", "tests": [{"input": [1, 2], "expected": 3}, {"input": [-1, 1], "expected": 0}]}
tinyforge --model my-model --tasks my_tasks.jsonl

See examples/ for more task formats.
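To sanity-check a task in this format before a run, one way to mimic sandboxed verification is to execute the candidate plus its tests in a child Python process with a timeout. This is a sketch, not the repo's actual `verify.py`:

```python
import json
import subprocess
import sys
import tempfile

def check_task(task, candidate_code, timeout=5):
    """Run candidate + tests in a child process; non-zero exit or timeout = fail."""
    tests_literal = repr(json.dumps(task["tests"]))  # JSON embedded as a Python literal
    harness = "\n".join([
        candidate_code,
        "import json",
        "for case in json.loads(%s):" % tests_literal,
        "    assert %s(*case['input']) == case['expected']" % task["entry_point"],
    ])
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(harness)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0

task = {"id": "add", "type": "code", "entry_point": "add",
        "prompt": "Write a function add(a, b) that returns the sum",
        "tests": [{"input": [1, 2], "expected": 3}, {"input": [-1, 1], "expected": 0}]}
print(check_task(task, "def add(a, b):\n    return a + b"))  # True
```

Running in a separate process means an infinite loop or crash in generated code can't take down the search loop itself.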


πŸ“ Project Structure

src/finetuneqwen/
  tinyforge/                          # 🔥 The self-improvement engine
    cli.py                            # Main CLI & demo experience
    evolve.py                         # Evolutionary search
    verify.py                         # Test verification (sandboxed)
    extract.py                        # Repair pair extraction
    builtin_tasks.py                  # 20 built-in coding problems
    task.py                           # Task definition

  # 🔬 Original research pipeline
  humaneval_evolutionary_eval.py      # HumanEval-specific evolutionary eval
  export_humaneval_repair_pairs.py    # HumanEval repair data extraction
  train_qlora.py                      # PyTorch QLoRA training
  prepare_data.py                     # Dataset preparation

configs/                              # MLX training configs
scripts/                              # Shell scripts for experiments
data/                                 # Training data (repair pairs, rehearsal)
examples/                             # Example task files
results/                              # Evaluation summaries

πŸ” Reproduce Our Results

Best result (repair-v1, slice 40-47)

# Evolutionary search with base adapter
bash scripts/50_humaneval_evo.sh \
  models/mlx-q4-qwen35-08b \
  outputs/biolora6g/grow \
  results/humaneval_repair_v1 \
  8 40 4 3 2 0.6 1337 base_evo_40_47 1

# With repair adapter
bash scripts/50_humaneval_evo.sh \
  models/mlx-q4-qwen35-08b \
  outputs/humaneval_repair_v1/adapter \
  results/humaneval_repair_v1 \
  8 40 4 3 2 0.6 1337 repair_evo_40_47 1

Build repair training data

bash scripts/60_humaneval_build_repair_v1.sh

Train repair adapter

python -m mlx_lm.lora \
  --model models/mlx-q4-qwen35-08b \
  --data data/humaneval_repair_v1/mlx \
  --config configs/biolora6g/humaneval_repair_v1_qwen35_08b.yaml \
  --adapter-path outputs/humaneval_repair_v1/adapter \
  --resume-adapter-file outputs/biolora6g/grow/adapters.safetensors

🤔 Honest Limitations

  • 🚫 This is not a magic one-shot model. The gains come from the system (model + search + feedback), not from the model alone.
  • πŸ“ Tested on HumanEval slices, not full benchmarks. Results are real but scoped.
  • πŸ“ˆ Transfer is stronger on harder problems and weaker on easy ones (where the base model is already decent).
  • 🐜 A 0.8B model will never match GPT-4. That's not the point.

🌍 Why This Matters

If a 0.8B model can teach itself to code better on a laptop with 6GB RAM, what happens when you run this technique on a 7B model? A 70B model? With a real GPU cluster?

The technique is the contribution, not the model size.

Self-play works for Go (AlphaGo). Self-play works for math (AlphaProof). We're showing it works for code, on hardware anyone can afford.


🌐 Beyond Code – Where This Technique Applies

The core loop is simple: generate something → check if it's good → learn from what went wrong → try again. We proved it on code. But it works anywhere you can measure whether the output is right or wrong.

For engineers & researchers

| Domain | What's the "test"? |
| --- | --- |
| 🗃️ SQL queries | Run the query, check if the results match |
| 🔒 Math proofs | Verify the proof mechanically (Lean, Coq, SymPy) |
| ⚡ Chip design (Verilog/VHDL) | Simulate the circuit against test benches |
| 🧪 Drug molecules | Check properties via physics simulation |
| 🔐 Security testing | Did the generated input crash the software? |
| 🤖 Robotics | Run the policy in simulation, measure the score |
| 📊 Data pipelines / ETL | Input data in, expected data out, compare |

For marketers, creators & everyone else

| Domain | What's the "test"? |
| --- | --- |
| 📣 Ad copy | A/B test it. Did people click? Click rate is automatic feedback |
| 📧 Email outreach | Did they open? Did they reply? Open/reply rate = the verifier |
| 🔎 SEO content | Did it rank? Did it get traffic? Google tells you |
| 🛒 Product descriptions | Conversion rate. Did people buy? |
| 📱 Social media posts | Engagement. Likes, shares, saves are all measurable |
| 🎓 Teaching / tutoring | Did the student get the next question right? |
| ✍️ Headlines & hooks | Scroll-stop rate. You already track this |

The idea: if you can score the output, even roughly, the model can learn from its own mistakes. No labeled dataset needed. No expert in the loop. The model makes its own training data by failing and fixing.
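In its most generic form, one step of that loop could look like the following sketch, where `generate` and `score` are whatever your domain provides (a model call and any numeric scorer; both names are illustrative):

```python
def self_improve_step(generate, score, prompt, attempts=4):
    """Generate candidates, score them, and emit a weak->strong training pair."""
    candidates = [generate(prompt) for _ in range(attempts)]
    ranked = sorted(candidates, key=score, reverse=True)  # higher score = better
    strong, weak = ranked[0], ranked[-1]
    # Only a genuine improvement becomes training data.
    if score(strong) > score(weak):
        return {"prompt": prompt, "weak": weak, "strong": strong}
    return None
```

Swap `score` for a SQL result comparison, a proof checker, or a click-through rate and the same step applies unchanged.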

🔑 What you need

  1. A task – what you want the model to produce
  2. A way to check – any script/metric that says "good" or "bad" (and ideally, why)
  3. A small model – even 0.8B parameters is enough to start

That's it. The model teaches itself. On your laptop. Your data never leaves your machine.


📦 Resource Envelope

| Phase | Peak Memory | Time (MacBook Air M4) |
| --- | --- | --- |
| 🧬 Evolutionary search (8 tasks) | ~4GB | ~10 min |
| 🔧 Repair pair extraction | ~100MB | seconds |
| 🎓 LoRA training (40 iters) | 6.5-13GB | ~3 min |
| 🔥 Full tinyforge demo (5 tasks) | ~10GB | ~8 min |

📄 License

MIT – do whatever you want with it.


📖 Citation

If you use this technique in your work:

@misc{tinyforge2025,
  title   = {tinyforge: Self-improving tiny language models through test-driven repair training},
  author  = {Usman Muhammad},
  year    = {2025},
  url     = {https://github.com/ranausmanai/tinyforge}
}


Built with 🔥 on a MacBook Air M4 with 24GB RAM

No GPUs were harmed in the making of this project.
