feat: add ORPO trainer support #594
Conversation
Hi eghuzefa, thank you for your PR. Did you happen to format everything? It's hard to review the actual change right now. Can you format only the files that you changed?
Hey, yes I did. Two files were primarily added, and the following file was changed to add the trainer export.
I will go through it again, fix all these conflicts, and reformat my changes by tomorrow.
Hi eghuzefa, please let me know when this PR is ready for review. Right now it still has formatting issues.
Thanks for the review! I rebased 2 days ago on the latest upstream/main, and today made the following changes to match the codebase patterns:

- Updated ORPO to be compatible with recent upstream API changes: common.get_per_token_logps() now returns a single value instead of a tuple.
- Renamed the _prepare_inputs parameter from training_input to input_data to match the base class signature.
- Removed the __init__.py files from both tunix/rl/orpo/ and tests/rl/orpo/ to exactly match the structure used by DPO and GRPO (both export directly from tunix/__init__.py without intermediate __init__.py files).

All 6 tests pass, pyink formatting is clean, and pylint gives 9.78/10 with only minor non-blocking docstring warnings.

Could you please clarify what specific formatting issues you're seeing? I've run the pre-commit hooks and everything passes. Are there specific files or lines that need attention, or perhaps a different formatting tool I should use?
Hey @wang2yn84, yeah you're right — I ran pre-commit on the whole repo. I’ll revert the formatting-only changes and keep just the relevant ones. |
Hi @wang2yn84, I have fixed it now, kindly check. |
Thank you @eghuzefa, it's much better now! A couple of questions here:
1. ORPO is a variation of DPO; can you move it to sft/?
2. We are refactoring the logic now to abstract the core algorithm. Before that, can you merge the algorithm directly into dpo? Most of the added code should be similar to DPO, except for some additional configs and the loss function.
Done! I've merged ORPO into DPO in sft/. Both algorithms now share the same trainer, with an algorithm="orpo" config and no reference model. I kept the dpo folder name to minimize changes; happy to rename it to something more generic like preference or po if you prefer, but I wanted to avoid unnecessary refactoring for now. All tests passing ✓
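Roughly, usage looks like the sketch below (sketch only; the module path, class name, and field names are illustrative and may differ in the final code):

```python
# Illustrative sketch only: module path, class, and field names are
# assumptions based on this thread, not the verified tunix API.
from tunix.sft.dpo import dpo_trainer  # assumed location after the merge into sft/

# Selecting ORPO via the shared DPO trainer config; no reference model is
# configured or loaded when algorithm="orpo".
orpo_config = dpo_trainer.DpoTrainingConfig(  # assumed class name
    algorithm="orpo",   # "dpo" (default) or "orpo"
    lambda_orpo=0.1,    # ORPO-only: weight of the odds-ratio term
)
```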
Thank you so much for the support! Looks great! |
Do you have a Gmail account? Can we chat a bit more about your contribution?
One last thing, can you squash your commits into 1? |
hey, sure, here it is, m.huzefa1993@gmail.com |
Apologies for the noise! |
No worries. I saw some GitHub workflow test failures last time; they need to be fixed before we can merge.
Are you available on Google Chat?
Summary
This PR implements ORPO (Odds Ratio Preference Optimization) as a memory-efficient alternative to DPO for preference tuning. ORPO achieves approximately 50% memory savings by eliminating the need for a reference model during training.
Motivation
ORPO provides the same preference-learning capabilities as DPO while requiring forward passes through only a single model (no frozen reference model), making it well suited to resource-constrained environments and large-model training.
Paper: https://arxiv.org/abs/2403.07691
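For context, the objective from the paper pairs the usual SFT negative log-likelihood on the chosen response with an odds-ratio penalty weighted by $\lambda$ (the `lambda_orpo` hyperparameter below):

$$
\mathcal{L}_{\text{ORPO}} = \mathbb{E}_{(x, y_w, y_l)}\left[\mathcal{L}_{\text{SFT}} + \lambda \cdot \mathcal{L}_{\text{OR}}\right],
\quad
\mathcal{L}_{\text{OR}} = -\log \sigma\!\left(\log \frac{\text{odds}_\theta(y_w \mid x)}{\text{odds}_\theta(y_l \mid x)}\right),
\quad
\text{odds}_\theta(y \mid x) = \frac{P_\theta(y \mid x)}{1 - P_\theta(y \mid x)}
$$

Both terms depend only on the policy $\theta$, so no frozen reference model (and no second set of forward passes through it) is needed, which is where the memory savings come from.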
Changes
Core Implementation
- `tunix/rl/orpo/orpo_trainer.py`: Complete ORPO trainer implementation (516 lines)
  - `ORPOTrainer` class inheriting from `PeftTrainer`
  - `ORPOTrainingConfig` dataclass with `lambda_orpo` hyperparameter
  - `orpo_loss_fn` implementing odds ratio-based preference optimization
- `tunix/rl/orpo/__init__.py`: Module exports
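As a rough illustration of how the loss fits together, here is a minimal sketch, assuming the trainer works with mean per-token log-probabilities per example; the function name, signature, and metric definitions are illustrative rather than the exact 516-line implementation:

```python
import jax
import jax.numpy as jnp


def orpo_loss_sketch(chosen_logps, rejected_logps, lambda_orpo=0.1):
  """Illustrative ORPO loss.

  chosen_logps / rejected_logps: mean per-token log-probs (assumed < 0), shape [batch].
  """
  # log(odds(y|x)) = log p - log(1 - p), kept in log space for stability.
  log_odds_chosen = chosen_logps - jnp.log1p(-jnp.exp(chosen_logps))
  log_odds_rejected = rejected_logps - jnp.log1p(-jnp.exp(rejected_logps))
  log_odds_ratio = log_odds_chosen - log_odds_rejected

  # Odds-ratio term: -log sigmoid(log odds ratio).
  or_loss = -jax.nn.log_sigmoid(log_odds_ratio)

  # SFT term: NLL of the chosen responses; note there is no reference model.
  sft_loss = -chosen_logps

  loss = jnp.mean(sft_loss + lambda_orpo * or_loss)

  # Metric names mirror what the PR reports; exact definitions may differ.
  metrics = {
      "rewards/chosen": jnp.mean(chosen_logps),
      "rewards/rejected": jnp.mean(rejected_logps),
      "rewards/margin": jnp.mean(chosen_logps - rejected_logps),
      "rewards/accuracy": jnp.mean((chosen_logps > rejected_logps).astype(jnp.float32)),
      "odds_ratio": jnp.mean(log_odds_ratio),
  }
  return loss, metrics
```

The only model-dependent inputs are the policy's own log-probabilities, which is what lets the ORPO path drop the reference model that DPO needs.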
Testing
- `tests/rl/orpo/orpo_trainer_test.py`: Comprehensive test suite
Integration
- `tunix/__init__.py`: Added ORPO exports to main package API
Key Features
- Built on the existing `PeftTrainer` infrastructure
- Training metrics logged: `rewards/chosen`, `rewards/rejected`, `rewards/margin`, `rewards/accuracy`, `odds_ratio`