DPO: Direct Preference Optimization - A reference implementation for training language models using preference data without an explicit reward model.
## Understanding DPO
DPO (Direct Preference Optimization) is a method for training language models on preference data without fitting an explicit reward model. It fine-tunes the policy on contrastive pairs of responses (chosen vs. rejected), using a frozen reference model to keep the policy from drifting too far, and is used primarily for language model alignment.
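As a rough illustration of the objective, the following is a minimal sketch of the DPO loss in PyTorch, assuming per-sequence log-probabilities for the policy and a frozen reference model have already been computed; the function and variable names are illustrative, not taken from the repository.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Minimal DPO objective: make the policy prefer the chosen response
    over the rejected one, relative to a frozen reference model.

    All inputs are summed per-sequence log-probabilities of shape (batch,).
    """
    # Implicit rewards are the log-ratios of policy vs. reference probabilities.
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps

    # DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio)).
    logits = chosen_logratios - rejected_logratios
    return (-F.logsigmoid(beta * logits)).mean()
```

The `beta` hyperparameter (exposed as `loss.beta` in the training commands below) controls how strongly the policy is penalized for drifting away from the reference model.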
## Key Features of the DPO Project
The DPO project includes the following key features:
- Support for the original DPO loss, "conservative" DPO, and IPO (see the loss-variant sketch after this list).
- A two-stage training pipeline involving supervised fine-tuning (SFT) followed by preference learning.
- Multi-GPU support with BasicTrainer, FSDPTrainer, and TensorParallelTrainer.
- Accelerated training through mixed precision (bfloat16) and activation checkpointing.
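The loss variants listed above differ only in how the chosen-vs-rejected log-ratio gap is turned into a loss. The sketch below shows common formulations of conservative DPO (label smoothing for noisy preference labels) and IPO; it is an illustrative summary rather than code copied from `trainers.py`, so consult that file for the exact implementation.

```python
import torch
import torch.nn.functional as F

def preference_loss(logits: torch.Tensor, beta: float,
                    label_smoothing: float = 0.0,
                    loss_name: str = "dpo") -> torch.Tensor:
    """Map the preference 'logit' to a loss.

    logits = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected),
    i.e. the gap between the chosen and rejected log-ratios, shape (batch,).
    """
    if loss_name == "dpo":
        # Conservative DPO: with label_smoothing = eps > 0, the preference label
        # is assumed to be flipped with probability eps; eps = 0 is vanilla DPO.
        losses = (-F.logsigmoid(beta * logits) * (1 - label_smoothing)
                  - F.logsigmoid(-beta * logits) * label_smoothing)
    elif loss_name == "ipo":
        # IPO: a squared-error objective that regresses the gap toward 1/(2*beta)
        # instead of pushing it toward infinity.
        losses = (logits - 1 / (2 * beta)) ** 2
    else:
        raise ValueError(f"unknown loss: {loss_name}")
    return losses.mean()
```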
## Functionality of the DPO Project
The DPO project provides the following functionality:
- `train.py`: The main entry script for SFT or DPO training with a command-line interface.
- `trainers.py`: Implementation of trainer classes supporting multi-GPU logic.
- `utils.py`: Utility functions shared across multiple files.
- `preference_datasets.py`: Logic for loading SFT and DPO preference datasets, with built-in support for Anthropic-HH, Stanford Human Preferences (SHP), and StackExchange, plus a hook for adding custom datasets (sketched below).
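As a concrete illustration of the custom-dataset hook, below is a hypothetical loader in the prompt-keyed format that the built-in loaders in `preference_datasets.py` use (a list of responses per prompt, (chosen, rejected) index pairs, and an SFT target). The function name, data, and field names here are assumptions for illustration; verify the exact contract against the repository before adapting it.

```python
from typing import Dict, List

def get_xyz(split: str) -> Dict[str, Dict]:
    """Hypothetical loader for a custom preference dataset named 'xyz'.

    Returns a dict keyed by prompt, where each entry holds:
      - 'responses': candidate responses seen for that prompt
      - 'pairs': (chosen_idx, rejected_idx) index pairs into 'responses'
      - 'sft_target': the response to imitate during SFT
    """
    # Toy in-memory example; a real loader would read files or a Hugging Face
    # dataset and respect the requested split ('train' or 'test').
    raw = [
        {"prompt": "How do I sort a list in Python?",
         "chosen": "Use the built-in sorted() function.",
         "rejected": "Rewrite quicksort by hand every time."},
    ]
    data: Dict[str, Dict] = {}
    for row in raw:
        entry = data.setdefault(
            row["prompt"], {"responses": [], "pairs": [], "sft_target": ""})
        responses: List[str] = entry["responses"]
        chosen_idx, rejected_idx = len(responses), len(responses) + 1
        responses.extend([row["chosen"], row["rejected"]])
        entry["pairs"].append((chosen_idx, rejected_idx))
        entry["sft_target"] = row["chosen"]
    return data
```

Once such a loader is in place, the dataset can be selected on the command line with `datasets=[xyz]`, as shown in the usage examples below.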
## Usage of the DPO Project
The DPO project can be used for training through command-line examples:
- **SFT Example**: `python -u train.py model=pythia69 datasets=[hh] loss=sft exp_name=anthropic_dpo_pythia69 gradient_accumulation_steps=2 batch_size=64 eval_batch_size=32 trainer=FSDPTrainer sample_during_eval=false`
- **DPO Example**: `python -u train.py model=pythia69 datasets=[hh] loss=dpo loss.beta=0.1 model.archive=/path/to/checkpoint/from/sft/step-XXXX/policy.pt exp_name=anthropic_dpo_pythia69 gradient_accumulation_steps=2 batch_size=32 eval_batch_size=32 trainer=FSDPTrainer sample_during_eval=false`
- **Custom Datasets**: Users can update `preference_datasets.py` and pass custom datasets via `datasets=[xyz]`.
## Accessing the DPO Project Code
The DPO project code is hosted on [GitHub](https://github.com/eric-mitchell/direct-preference-optimization), where users can access the source code, documentation, and examples.
## Benefits of DPO Over Traditional RLHF Methods
DPO offers several benefits over traditional RLHF (Reinforcement Learning from Human Feedback) methods:
- Simplified training process by eliminating the need for an explicit reward model.
- More stable and computationally cheaper training than PPO-based pipelines, since no on-policy sampling loop or reward-model rollouts are needed.
- Performance that matches or exceeds PPO-based RLHF on controlling generated sentiment, summarization quality, and single-turn dialogue response quality.
### Citation sources:
- [DPO: Direct Preference Optimization](https://github.com/eric-mitchell/direct-preference-optimization) - Official URL
Updated: 2025-03-28