This project implements a complete pipeline for training a Paul Graham essay generator using:
- Supervised Fine-Tuning (SFT) on real Paul Graham essays
- Synthetic DPO data generation using multiple LLMs
- Direct Preference Optimization (DPO) for alignment
```text
SmolLM3-3B → [SFT] → SmolGraham → [DPO Data Gen] → dpo_pairs.csv → [DPO] → SmolGraham-DPO
```
Install dependencies:
```bash
pip install -r requirements.txt
```

Set up API keys for DPO data generation:

```bash
export OPENAI_API_KEY="your_openai_key"
export OPENROUTER_API_KEY="your_openrouter_key"  # Optional
```
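These keys are read at runtime by the data generation script. The sketch below shows one plausible client setup, assuming the `openai` Python SDK and OpenRouter's OpenAI-compatible endpoint; the actual code in `generate_dpo_data.py` may differ:

```python
# Assumed client setup; generate_dpo_data.py may wire this up differently.
import os
from openai import OpenAI

openai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# OpenRouter exposes an OpenAI-compatible API, so the same SDK can be reused
openrouter_key = os.environ.get("OPENROUTER_API_KEY")  # optional
openrouter_client = (
    OpenAI(api_key=openrouter_key, base_url="https://openrouter.ai/api/v1")
    if openrouter_key
    else None
)
```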
Train the base model on real Paul Graham essays:

```bash
python sft.py
```

This will:
- Load the SmolLM3-3B base model
- Fine-tune on Paul Graham essays from Hugging Face
- Save the SFT model to `SmolGraham/`
- Test with a sample generation
Output: `SmolGraham/` directory containing the SFT model
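For orientation, here is a minimal sketch of what an SFT step like this might look like using TRL's `SFTTrainer` and the hyperparameters listed later in this README; the dataset ID and text column are placeholders, not the project's actual values:

```python
# Hypothetical SFT sketch; dataset ID and column name are placeholders.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("your-username/paul-graham-essays", split="train")  # placeholder dataset

config = SFTConfig(
    output_dir="SmolGraham",
    num_train_epochs=2,
    per_device_train_batch_size=8,
    learning_rate=1e-5,
    max_seq_length=1024,        # named max_length in newer TRL versions
    dataset_text_field="text",  # assumes the dataset stores essays in a "text" column
    report_to="tensorboard",
)

trainer = SFTTrainer(
    model="HuggingFaceTB/SmolLM3-3B",  # SmolLM3-3B base model
    args=config,
    train_dataset=dataset,
)
trainer.train()
trainer.save_model("SmolGraham")
```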
Create synthetic preference pairs using multiple LLMs:
```bash
python generate_dpo_data.py
```

This will:
- Load random Wikipedia topics
- Generate essays using multiple models (GPT-3.5, GPT-4, Qwen, Mistral)
- Use GPT-4 to judge which essays are better
- Create preference pairs for DPO training
- Save to `dpo_pairs.csv`
Output: `dpo_pairs.csv` with columns:
- `topic`: Essay topic
- `prompt`: Generation prompt
- `chosen`: Preferred essay
- `rejected`: Less preferred essay
- `chosen_model` / `rejected_model`: Source models
- `judgment_reasoning`: Why one essay was preferred
- `confidence`: Judgment confidence (1-10)
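As a quick sanity check (not part of the provided scripts), you can inspect the generated pairs with pandas and see how many rows would survive the confidence filter applied in the DPO step:

```python
import pandas as pd

# Load the generated preference pairs
pairs = pd.read_csv("dpo_pairs.csv")
print(pairs.columns.tolist())
print(f"Total pairs: {len(pairs)}")

# Preview how many pairs pass the confidence >= 6 filter used during DPO training
high_conf = pairs[pairs["confidence"] >= 6]
print(f"High-confidence pairs: {len(high_conf)}")
print(high_conf[["topic", "chosen_model", "rejected_model"]].head())
```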
Further tune the SFT model using preference learning:
```bash
python dpo.py
```

This will:
- Load the SFT model from `SmolGraham/`
- Load preference pairs from `dpo_pairs.csv`
- Filter for high-confidence judgments (confidence ≥ 6)
- Train using DPO to align with preferences
- Save the final model to `SmolGraham-DPO/`
- Test with sample generations
Output: `SmolGraham-DPO/` directory containing the final DPO-aligned model
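A rough sketch of what this step could look like with TRL's `DPOTrainer` is shown below; argument names in the actual `dpo.py` may differ, and the column handling is an assumption based on the CSV schema above:

```python
# Hypothetical DPO sketch; the real dpo.py may differ in details.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Load the SFT model and tokenizer produced by sft.py
model = AutoModelForCausalLM.from_pretrained("SmolGraham")
tokenizer = AutoTokenizer.from_pretrained("SmolGraham")

# Load preference pairs and keep only high-confidence judgments
dataset = load_dataset("csv", data_files="dpo_pairs.csv", split="train")
dataset = dataset.filter(lambda row: row["confidence"] >= 6)
dataset = dataset.select_columns(["prompt", "chosen", "rejected"])  # columns DPOTrainer expects

config = DPOConfig(
    output_dir="SmolGraham-DPO",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    learning_rate=5e-7,
    beta=0.1,             # KL penalty coefficient
    loss_type="sigmoid",  # standard DPO loss
    report_to="tensorboard",
)

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,  # called `tokenizer` in older TRL versions
)
trainer.train()
trainer.save_model("SmolGraham-DPO")
```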
- Base Model (`SmolLM3-3B`): General language model
- SFT Model (`SmolGraham`): Specialized for Paul Graham's writing style
- DPO Model (`SmolGraham-DPO`): Aligned to prefer higher-quality outputs
SFT training:
- 2 epochs, batch size 8, learning rate 1e-5
- Max sequence length: 1024 tokens
- Uses real Paul Graham essays from Hugging Face
DPO data generation:
- Configurable number of topics (default: 10 for testing)
- Multiple LLM providers (OpenAI + OpenRouter)
- Parallel processing with 50 workers (see the sketch after this list)
- GPT-4 as preference judge
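The 50-worker setting suggests a thread-pool pattern along these lines; this is an illustrative sketch rather than the actual contents of `generate_dpo_data.py`, the topics, prompt, and model list are placeholders, and GPT-4 judging of the resulting essay pairs is omitted for brevity:

```python
# Illustrative parallel-generation sketch; not the actual generate_dpo_data.py.
import os
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def generate_essay(topic: str, model: str) -> str:
    """Ask one model to write a Paul Graham-style essay on a topic."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": f"Write a Paul Graham essay about {topic}"}],
    )
    return response.choices[0].message.content

topics = ["startup ideas", "procrastination"]  # placeholder topics
models = ["gpt-3.5-turbo", "gpt-4"]            # placeholder model list

with ThreadPoolExecutor(max_workers=50) as pool:
    futures = {
        (topic, model): pool.submit(generate_essay, topic, model)
        for topic in topics
        for model in models
    }
    essays = {key: future.result() for key, future in futures.items()}
```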
DPO training:
- 1 epoch, batch size 2, learning rate 5e-7
- Beta = 0.1 (KL penalty coefficient)
- Filters for confidence ≥ 6 judgments
- Uses sigmoid DPO loss (see the sketch after this list)
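For reference, the sigmoid DPO loss is the standard formulation from the DPO paper: the negative log-sigmoid of beta times the difference between the policy and reference log-probability ratios of the chosen and rejected completions. A self-contained sketch (not code from `dpo.py`):

```python
import torch
import torch.nn.functional as F

def dpo_sigmoid_loss(
    policy_chosen_logps: torch.Tensor,
    policy_rejected_logps: torch.Tensor,
    ref_chosen_logps: torch.Tensor,
    ref_rejected_logps: torch.Tensor,
    beta: float = 0.1,
) -> torch.Tensor:
    """-log sigmoid(beta * (policy log-ratio - reference log-ratio)), averaged over the batch."""
    policy_logratio = policy_chosen_logps - policy_rejected_logps
    ref_logratio = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_logratio - ref_logratio)).mean()
```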
Both SFT and DPO training log to TensorBoard:
```bash
tensorboard --logdir=SmolGraham/runs      # SFT metrics
tensorboard --logdir=SmolGraham-DPO/runs  # DPO metrics
```

After training, use the final model:
```python
from transformers import pipeline, AutoTokenizer

# Load the DPO-trained model
pipe = pipeline("text-generation", model="SmolGraham-DPO")
tokenizer = AutoTokenizer.from_pretrained("SmolGraham-DPO")

# Generate an essay
messages = [{"role": "user", "content": "Write a Paul Graham essay about startup ideas"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
output = pipe(prompt, max_new_tokens=500, do_sample=True, temperature=0.7)
print(output[0]["generated_text"])
```

- GPU recommended for training (especially DPO, which also keeps a reference model in memory)
- DPO data generation requires API keys and may incur costs
- Start with small datasets for testing (adjust `num_topics` in `generate_dpo_data.py`)
- Models are saved in Hugging Face format and can be uploaded to the Hub
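For example, to upload the final model (assuming you have run `huggingface-cli login`; the repo ID below is a placeholder):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("SmolGraham-DPO")
tokenizer = AutoTokenizer.from_pretrained("SmolGraham-DPO")

# Replace with your own username/repo name
model.push_to_hub("your-username/SmolGraham-DPO")
tokenizer.push_to_hub("your-username/SmolGraham-DPO")
```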