
Commit e09c67c

docs: Merge Multi-Adapter RL and NPP sections from main
Incorporated content from PR #4436 (Multi-Adapter RL Training) and the NPP section that were added to main after this PR branch was created. Changes:

- Added Multi-Adapter RL Training subsection under the PPO trainer section
- Added Naive Pipeline Parallelism (NPP) subsection under Multi-GPU Training
- Maintained consistent formatting with the rewritten documentation style

Resolves merge conflict between the PR #4421 complete rewrite and the additions from PR #4436 that were merged to main.
1 parent 075f4d3 commit e09c67c

File tree: 1 file changed (+125, -0)

docs/source/peft_integration.md

@@ -338,6 +338,96 @@ trainer.train()
</hfoption>
</hfoptions>

### Proximal Policy Optimization (PPO)

#### Multi-Adapter RL Training
You can use a single base model with multiple PEFT adapters for the entire PPO algorithm, including retrieving reference logits, computing active logits, and calculating rewards. Because only one copy of the base model is kept in memory, this approach enables memory-efficient RL training.

> [!WARNING]
> This feature is experimental and convergence has not been extensively tested. We encourage the community to share feedback and report any issues.

**Requirements**

Install PEFT and, optionally, bitsandbytes if you want to quantize the base model (8-bit or 4-bit):

```bash
pip install peft bitsandbytes
```

**Training Workflow**

The multi-adapter approach requires three stages:

1. **Supervised Fine-Tuning (SFT)**: Train a base model on your target domain (e.g., the IMDB dataset) using `SFTTrainer`
2. **Reward Model Training**: Train a reward model adapter using PEFT and `RewardTrainer` (see the [reward modeling example](https://github.com/huggingface/trl/tree/main/examples/scripts/reward_modeling.py) and the sketch below)
3. **PPO Training**: Fine-tune new adapters with PPO, using the reward adapter

> [!IMPORTANT]
> Use the same base model (architecture and weights) for stages 2 and 3.
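
To make stage 2 concrete, here is a minimal sketch of training a reward adapter with `RewardTrainer`. The dataset (`trl-lib/ultrafeedback_binarized`), the output directory, and the LoRA hyperparameters are illustrative assumptions, and the exact `RewardTrainer` keyword names vary between TRL versions; the reward modeling example linked above is the authoritative reference.

```python
from datasets import load_dataset
from peft import LoraConfig, TaskType
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardConfig, RewardTrainer

model_name = "huggyllama/llama-7b"

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # llama tokenizers ship without a pad token

# A single-logit sequence-classification head acts as the reward head
reward_model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)
reward_model.config.pad_token_id = tokenizer.pad_token_id

# LoRA adapter for the reward model (illustrative hyperparameters)
reward_lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
)

# Example preference dataset with chosen/rejected pairs
preference_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

trainer = RewardTrainer(
    model=reward_model,
    args=RewardConfig(output_dir="llama-7b-rm-adapter"),
    processing_class=tokenizer,  # named `tokenizer` in older TRL releases
    train_dataset=preference_dataset,
    peft_config=reward_lora_config,
)
trainer.train()
trainer.push_to_hub()  # publish the adapter so it can be passed as `reward_adapter` below
```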

**Basic Usage**

After training your reward adapter and pushing it to the Hub:

```python
from peft import LoraConfig
from trl import AutoModelForCausalLMWithValueHead, PPOTrainer

model_name = "huggyllama/llama-7b"
rm_adapter_id = "trl-lib/llama-7b-hh-rm-adapter"

# Configure PPO adapter
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# Load model with reward adapter
model = AutoModelForCausalLMWithValueHead.from_pretrained(
    model_name,
    peft_config=lora_config,
    reward_adapter=rm_adapter_id,
)

trainer = PPOTrainer(model=model, ...)
```

In your training loop, compute rewards using:

```python
rewards = trainer.model.compute_reward_score(**inputs)
```
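
For orientation, the snippet below sketches how `compute_reward_score` might slot into a manual loop built on the legacy `PPOTrainer` step API. The `tokenizer`, the `query` column, `generation_kwargs`, and the reward indexing (taking the last-token score) are assumptions borrowed from TRL's multi-adapter example script, not a fixed recipe.

```python
generation_kwargs = {"max_new_tokens": 32, "do_sample": True, "top_k": 0, "top_p": 1.0}

for batch in trainer.dataloader:
    query_tensors = batch["input_ids"]

    # Generate responses with the currently active policy adapter
    response_tensors = trainer.generate(query_tensors, return_prompt=False, **generation_kwargs)
    batch["response"] = tokenizer.batch_decode(response_tensors)

    # Score the full query + response texts with the reward adapter
    texts = [q + r for q, r in zip(batch["query"], batch["response"])]
    inputs = tokenizer(texts, return_tensors="pt", padding=True).to(trainer.accelerator.device)
    raw_rewards = trainer.model.compute_reward_score(**inputs)
    rewards = [raw_rewards[i, -1, 1] for i in range(len(raw_rewards))]  # last-token score; indexing depends on the reward head

    # One PPO optimization step on this batch
    stats = trainer.step(query_tensors, response_tensors, rewards)
```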

**Advanced Features**

**Multiple Policy Adapters**

You can train multiple adapters on the same base model for different policies. Control which adapter to activate using the `ppo_adapter_name` argument:

```python
adapter_name_policy_1 = "policy_1"
rewards = trainer.model.compute_reward_score(**inputs, ppo_adapter_name=adapter_name_policy_1)
```

**Quantized Base Models**

For memory-efficient training, load the base model in 8-bit or 4-bit while keeping the adapters in float32:

```python
from transformers import BitsAndBytesConfig

model = AutoModelForCausalLMWithValueHead.from_pretrained(
    model_name,
    peft_config=lora_config,
    reward_adapter=rm_adapter_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
)
```
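
The prose above also mentions 4-bit loading. A sketch of that variant, reusing the names from the previous snippet, might look like the following; the NF4 quantization type and bfloat16 compute dtype are illustrative choices:

```python
import torch
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 while weights stay 4-bit
)

model = AutoModelForCausalLMWithValueHead.from_pretrained(
    model_name,
    peft_config=lora_config,
    reward_adapter=rm_adapter_id,
    quantization_config=quantization_config,
)
```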

## QLoRA: Quantized Low-Rank Adaptation

QLoRA combines 4-bit quantization with LoRA to enable fine-tuning of very large models on consumer hardware. This technique can reduce memory requirements by up to 4x compared to standard LoRA.
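
A rough sketch of the combination, with an illustrative model name and hyperparameters: the frozen base model is loaded in 4-bit NF4 via bitsandbytes, and trainable LoRA adapters sit on top of it.

```python
import torch
from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization for the frozen base weights
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B",  # illustrative model
    quantization_config=bnb_config,
)

# The LoRA adapters stay in higher precision and are the only trainable weights,
# e.g. when passed as `peft_config` to a TRL trainer such as SFTTrainer
lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, bias="none", task_type="CAUSAL_LM")
```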
@@ -687,6 +777,41 @@ accelerate launch trl/scripts/sft.py \
--lora_r 32
```

### Naive Pipeline Parallelism (NPP) for Large Models

For very large models (>60B parameters), TRL supports Naive Pipeline Parallelism (NPP), which splits the model and its adapters across multiple GPUs: each GPU holds a slice of the layers, and activations and gradients are communicated between GPUs at the slice boundaries. Both `int8`-quantized and full-precision base models are supported.

![NPP](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/trl-npp.png)

**How to Use NPP**

Load your model with a custom `device_map` to split it across multiple devices:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig

# LoRA configuration for the adapters (same settings as the earlier example)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# Create custom device map (see accelerate documentation)
device_map = {
    "model.embed_tokens": 0,
    "model.layers.0": 0,
    # ... distribute layers across GPUs
    "lm_head": 0,  # Must be on GPU 0
}

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    device_map=device_map,
    peft_config=lora_config,
)
```

> [!IMPORTANT]
> - Keep the `lm_head` module on the first GPU (device 0) to avoid errors
> - See this [tutorial on device maps](https://github.com/huggingface/blog/blob/main/accelerate-large-models.md) for proper configuration
> - Run training scripts directly (not with `accelerate launch`): `python script.py`
> - Data Parallelism is not yet supported with NPP
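
Writing a full device map for a 70B model by hand is tedious. One way to generate it, sketched here under assumed per-GPU memory budgets, is to use accelerate's `init_empty_weights` and `infer_auto_device_map` helpers and then pin the `lm_head` entry to GPU 0 as required above.

```python
from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "meta-llama/Llama-2-70b-hf"

# Build the model structure without allocating any weights
config = AutoConfig.from_pretrained(model_id)
with init_empty_weights():
    empty_model = AutoModelForCausalLM.from_config(config)

# Let accelerate split the layers given per-GPU memory budgets (assumed values)
device_map = infer_auto_device_map(
    empty_model,
    max_memory={0: "70GiB", 1: "70GiB"},
    no_split_module_classes=["LlamaDecoderLayer"],
)
device_map["lm_head"] = 0  # keep lm_head on GPU 0, per the note above
```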
## Resources

### TRL Examples and Notebooks
