dpo-from-scratch

A minimal, from-scratch implementation of Direct Preference Optimization (DPO) training for causal language models.

This repository contains a small experimental pipeline that:

  • Loads a pretrained causal LM and a frozen reference model using Hugging Face Transformers (a short loading sketch follows this overview).
  • Loads a preference dataset (Dahoas/full-hh-rlhf), filters for stronger preferences, and tokenizes prompt/completion pairs.
  • Implements the core DPO loss and a lightweight training loop that fine-tunes a subset of model parameters.

This project is intended as an educational reference and starting point for experimenting with preference-based alignment.
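As a rough illustration of the first step above, the snippet below loads a trainable policy model and a frozen reference copy with the Transformers library. This is a minimal sketch, not the exact code in main.py; the model name simply matches the repository default.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Gensyn/Qwen2.5-0.5B-Instruct"  # repository default

tokenizer = AutoTokenizer.from_pretrained(model_name)
policy = AutoModelForCausalLM.from_pretrained(model_name)     # fine-tuned with DPO
reference = AutoModelForCausalLM.from_pretrained(model_name)  # kept frozen

reference.eval()
for p in reference.parameters():
    p.requires_grad_(False)

device = "cuda" if torch.cuda.is_available() else "cpu"
policy.to(device)
reference.to(device)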

Explanation of DPO

Direct Preference Optimization (DPO) is a method for aligning language models with human preferences without reinforcement learning. It fine-tunes a pretrained language model directly on pairs of preferred and dispreferred completions, optimizing a loss that pushes the model to assign higher probability (relative to a frozen reference model) to the preferred completion than to the dispreferred one.

DPO loss formula
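For reference, the figure presumably renders the standard DPO objective from Rafailov et al. (2023). For a prompt x with preferred completion y_w and dispreferred completion y_l:

\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \left( \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right) \right]

where \pi_\theta is the policy being trained, \pi_{\mathrm{ref}} is the frozen reference model, \sigma is the logistic sigmoid, and \beta controls how far the policy may drift from the reference.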

DPO training diagram
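The core loss is short to write down in PyTorch. The sketch below is illustrative only and assumes per-example completion log-probabilities have already been computed; the actual implementation lives in src/dpo.py and its function names and signature may differ.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """All inputs are per-example completion log-probabilities (1-D tensors)."""
    # Log-ratios of the policy vs. the frozen reference model.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps

    # -log sigmoid(beta * margin): pushes the chosen completion's margin up.
    logits = beta * (chosen_logratio - rejected_logratio)
    loss = -F.logsigmoid(logits).mean()

    # Fraction of pairs where the chosen completion currently wins.
    accuracy = (logits > 0).float().mean()
    return loss, accuracy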

Logging

The training experiments were tracked using Weights & Biases (wandb).

To verify that the model trains correctly, we logged the following metrics:

  • Training loss over time (plot)
  • Validation accuracy on held-out preference pairs (plot)
  • Model log-difference, to check that the log-probability of the wanted answer stays above that of the unwanted one (plot)
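These metrics can be sent to wandb with ordinary wandb.log calls. The sketch below uses illustrative key names and dummy values rather than the exact keys used in main.py.

import wandb

run = wandb.init(project="dpo-from-scratch", mode="offline")  # offline mode needs no API key

for step in range(3):  # stand-in for the real training loop
    wandb.log({
        "train/loss": 0.69,    # DPO loss on the current batch
        "val/accuracy": 0.50,  # fraction of held-out pairs where the chosen completion wins
        "log_diff": 0.0,       # mean gap between chosen and rejected log-probabilities
    }, step=step)

run.finish()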

Contents

  • main.py - Entrypoint script. Loads models/tokenizer, prepares data loaders, runs a training loop with DPO loss, logs metrics to Weights & Biases (wandb), and evaluates accuracy on held-out data.
  • src/dpo.py - Implementation of the DPO loss computation.
  • src/data.py - Dataset loading and filtering (uses datasets.load_dataset("Dahoas/full-hh-rlhf")). Includes a custom is_strong_preference filter.
  • src/utils.py - Tokenization helpers, padding collate function, and a compute_logprob utility that converts model logits into average log-probabilities over completions.
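To make the last item concrete, a compute_logprob-style helper can be sketched as follows. This is an assumption about its shape, not the exact code in src/utils.py: it masks out prompt and padding tokens and averages the token log-probabilities over the completion.

import torch

def compute_logprob(model, input_ids, attention_mask, completion_mask):
    """completion_mask is 1 on completion tokens, 0 on prompt and padding tokens."""
    logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
    # Shift so that logits at position t predict the token at position t+1.
    logits = logits[:, :-1, :]
    labels = input_ids[:, 1:]
    mask = completion_mask[:, 1:].float()

    logprobs = torch.log_softmax(logits, dim=-1)
    token_logprobs = logprobs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)

    # Average over completion tokens only.
    return (token_logprobs * mask).sum(dim=-1) / mask.sum(dim=-1).clamp(min=1)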

Quick start

Install requirements:

python -m pip install -r requirements.txt

Running training (example):

python main.py --cache_dir /path/to/cache --model_name Gensyn/Qwen2.5-0.5B-Instruct --max_samples 1000 --batch_size 2

Notes:

  • The script will pick cuda if available, otherwise cpu.
  • main.py currently enables gradient updates only for parameters whose names include model.layers.23 and freezes the rest; this is a simple way to limit fine-tuning to a small subset of model weights (a sketch follows these notes).
  • The default model_name is Gensyn/Qwen2.5-0.5B-Instruct but you can substitute any compatible causal LM model from Hugging Face.
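The name-based freezing mentioned in the second note can be reproduced in a couple of lines, continuing the loading sketch earlier (the exact logic in main.py may differ):

# Only the parameters of transformer block 23 receive gradients.
for name, param in policy.named_parameters():
    param.requires_grad_("model.layers.23" in name)

# Optional sanity check: count the trainable parameters.
trainable = sum(p.numel() for p in policy.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")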
