docs: Merge Multi-Adapter RL and NPP sections from main
Incorporated content from PR #4436 (Multi-Adapter RL Training) and the NPP
section that were added to main after this PR branch was created.
Changes:
- Added Multi-Adapter RL Training subsection under PPO trainer section
- Added Naive Pipeline Parallelism (NPP) subsection under Multi-GPU Training
- Maintained consistent formatting with the rewritten documentation style
Resolves the merge conflict between the complete rewrite in PR #4421 and the
additions from PR #4436 that were merged to main.
docs/source/peft_integration.md (+125 lines: 125 additions, 0 deletions)
@@ -338,6 +338,96 @@ trainer.train()
</hfoption>
</hfoptions>

### Proximal Policy Optimization (PPO)

#### Multi-Adapter RL Training

You can use a single base model with multiple PEFT adapters for the entire PPO algorithm - including retrieving reference logits, computing active logits, and calculating rewards. This approach is useful for memory-efficient RL training.

> [!WARNING]
> This feature is experimental and convergence has not been extensively tested. We encourage the community to share feedback and report any issues.

**Requirements**

Install PEFT and optionally bitsandbytes for 8-bit models:

```bash
pip install peft bitsandbytes
```

**Training Workflow**

The multi-adapter approach requires three stages:

1. **Supervised Fine-Tuning (SFT)**: Train a base model on your target domain (e.g., IMDB dataset) using `SFTTrainer`
2. **Reward Model Training**: Train a reward model adapter using PEFT and `RewardTrainer` (see the [reward modeling example](https://github.com/huggingface/trl/tree/main/examples/scripts/reward_modeling.py))
3. **PPO Training**: Fine-tune new adapters using PPO with the reward adapter

> [!IMPORTANT]
> Use the same base model (architecture and weights) for stages 2 & 3.

**Basic Usage**

After training your reward adapter and pushing it to the Hub:

```python
from peft import LoraConfig
from trl import AutoModelForCausalLMWithValueHead, PPOTrainer

model_name = "huggyllama/llama-7b"
rm_adapter_id = "trl-lib/llama-7b-hh-rm-adapter"

# Configure PPO adapter
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# Load model with reward adapter
model = AutoModelForCausalLMWithValueHead.from_pretrained(
    model_name,
    peft_config=lora_config,
    reward_adapter=rm_adapter_id,
)
```
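The model now carries both the trainable PPO adapter and the reward adapter. A minimal sketch of scoring text with it (the tokenizer setup and the scored string are illustrative assumptions; the `PPOTrainer` construction and the PPO loop itself are omitted):

```python
import torch
from transformers import AutoTokenizer

# Hypothetical scoring step: tokenize a query/response pair and score it with
# the reward adapter that was attached via `reward_adapter=` above.
tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer("The prompt followed by a generated response.", return_tensors="pt")

with torch.no_grad():
    rewards = model.compute_reward_score(**inputs)
```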
You can train multiple adapters on the same base model for different policies. Control which adapter to activate using the `ppo_adapter_name` argument:
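
The sketch below is illustrative (it reuses `model` and `inputs` from the previous example, and the adapter name `"policy_1"` is a hypothetical placeholder):

```python
# Score with the reward adapter, then re-activate the policy adapter named
# "policy_1" instead of the default PPO adapter.
rewards = model.compute_reward_score(**inputs, ppo_adapter_name="policy_1")
```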
QLoRA combines 4-bit quantization with LoRA to enable fine-tuning of very large models on consumer hardware. This technique can reduce memory requirements by up to 4x compared to standard LoRA.
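
A minimal QLoRA-style sketch with `transformers` and `peft` (the checkpoint and LoRA hyperparameters are illustrative; the resulting model can then be handed to a TRL trainer):

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load the base model in 4-bit NF4 precision (QLoRA-style quantization).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b",  # illustrative checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach trainable LoRA adapters on top of the frozen 4-bit weights.
peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
```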
### Naive Pipeline Parallelism (NPP) for Large Models

For very large models (>60B parameters), TRL supports Naive Pipeline Parallelism (NPP), which distributes the model and adapters across multiple GPUs. Activations and gradients are communicated across GPUs, and both `int8` and other data types are supported.
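
In practice this comes down to loading the model with a `device_map` that spreads it across the available GPUs; a minimal sketch (the checkpoint is illustrative and the 8-bit config is optional):

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# device_map="auto" spreads the layers across all visible GPUs; activations
# then flow between devices during the forward and backward passes.
model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b",  # illustrative; NPP targets much larger checkpoints
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # optional int8 loading
)
```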