A minimal-from-scratch implementation of Direct Preference Optimization (DPO) training for causal language models.
This repository contains a small experimental pipeline that:
- Loads a pretrained causal LM and a frozen reference model using Hugging Face Transformers.
- Loads a preference dataset (Dahoas/full-hh-rlhf), filters for stronger preferences, and tokenizes prompt/completion pairs (see the sketch after this list).
- Implements the core DPO loss and a lightweight training loop that fine-tunes a subset of model parameters.
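As an illustration of the data step, here is a minimal sketch of loading and filtering the dataset. The heuristic inside `is_strong_preference` is hypothetical; the actual predicate lives in `src/data.py`:

```python
from datasets import load_dataset

def is_strong_preference(example):
    # Hypothetical heuristic: drop pairs where the chosen and rejected
    # completions are (near-)identical, since they carry little signal.
    # The repository's actual predicate may use a different criterion.
    return example["chosen"].strip() != example["rejected"].strip()

# Dahoas/full-hh-rlhf provides prompt/chosen/rejected fields.
dataset = load_dataset("Dahoas/full-hh-rlhf", split="train")
dataset = dataset.filter(is_strong_preference)
```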
This project is intended as an educational reference and starting point for experimenting with preference-based alignment.
Direct Preference Optimization (DPO) is a method for aligning language models with human preferences without requiring reinforcement learning. It works by fine-tuning a pretrained language model using pairs of preferred and non-preferred completions, optimizing a loss function that encourages the model to assign higher probabilities to preferred outputs.
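Concretely, for a prompt with a preferred completion and a rejected one, DPO maximizes the log-sigmoid of a margin that compares the policy's log-probability ratios against a frozen reference model. A minimal sketch of the loss, assuming per-sequence log-probabilities have already been computed (the repository's actual implementation is in `src/dpo.py`):

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss from per-sequence log-probabilities (each shape (batch,)).

    beta controls how strongly the policy is pulled away from the
    reference model; 0.1 is a common default.
    """
    # Implicit rewards: how much more likely each completion is under
    # the policy than under the frozen reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the chosen-vs-rejected margin to be large and positive.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```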
The training experiments were tracked using Weights & Biases (wandb).
To ensure that the model trains correctly, we logged the following metrics (a logging sketch follows the list):
- Training loss over time
- Validation accuracy on held-out preference pairs
- The log-probability margin between the chosen and rejected completions (to verify that the model assigns higher likelihood to the preferred answer)
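A minimal sketch of how such metrics can be computed and sent to wandb; the metric keys and helper names here are illustrative, not the exact ones used in `main.py`:

```python
import wandb

def preference_accuracy(chosen_logps, rejected_logps):
    # Fraction of preference pairs where the chosen completion gets a
    # higher log-probability than the rejected one.
    return (chosen_logps > rejected_logps).float().mean().item()

def log_step(loss, chosen_logps, rejected_logps):
    wandb.log({
        "train/loss": loss.item(),
        # Positive margin: the policy prefers the chosen completion.
        "train/logprob_diff": (chosen_logps - rejected_logps).mean().item(),
        "train/accuracy": preference_accuracy(chosen_logps, rejected_logps),
    })
```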

- `main.py` - Entrypoint script. Loads models/tokenizer, prepares data loaders, runs a training loop with DPO loss, logs metrics to Weights & Biases (wandb), and evaluates accuracy on held-out data.
- `src/dpo.py` - Implementation of the DPO loss computation.
- `src/data.py` - Dataset loading and filtering (uses `datasets.load_dataset("Dahoas/full-hh-rlhf")`). Includes a custom `is_strong_preference` filter.
- `src/utils.py` - Tokenization helpers, a padding collate function, and a `compute_logprob` utility that converts model logits into average log-probabilities over completions.
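For orientation, a sketch of what a `compute_logprob`-style utility typically looks like; the exact signature and masking convention in `src/utils.py` may differ:

```python
import torch
import torch.nn.functional as F

def compute_logprob(logits, labels, completion_mask):
    """Average log-probability of the completion tokens.

    logits:          (batch, seq, vocab) model outputs
    labels:          (batch, seq) target token ids
    completion_mask: (batch, seq) 1 for completion tokens, 0 elsewhere
    """
    # Shift so that logits at position t predict the token at t + 1.
    logits = logits[:, :-1, :]
    labels = labels[:, 1:]
    mask = completion_mask[:, 1:].float()
    log_probs = F.log_softmax(logits, dim=-1)
    token_logps = torch.gather(log_probs, 2, labels.unsqueeze(-1)).squeeze(-1)
    # Average over completion tokens only, ignoring prompt and padding.
    return (token_logps * mask).sum(-1) / mask.sum(-1).clamp(min=1.0)
```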
Install requirements:

```bash
python -m pip install -r requirements.txt
```

Running training (example):

```bash
python main.py --cache_dir /path/to/cache --model_name Gensyn/Qwen2.5-0.5B-Instruct --max_samples 1000 --batch_size 2
```

Notes:
- The script will pick `cuda` if available, otherwise `cpu`.
- `main.py` currently enables gradient updates only for parameters whose names include `model.layers.23` and freezes the rest; this is a simple way to limit fine-tuning to a small subset of model weights (see the sketch below).
- The default `model_name` is `Gensyn/Qwen2.5-0.5B-Instruct`, but you can substitute any compatible causal LM from Hugging Face.
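A sketch of that freezing step, assuming standard Hugging Face parameter naming (for Qwen2.5-0.5B, `model.layers.23` is the final transformer block):

```python
# Freeze everything except parameters in the final transformer block.
for name, param in model.named_parameters():
    param.requires_grad = "model.layers.23" in name

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")
```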

