
Commit e09c67c

docs: Merge Multi-Adapter RL and NPP sections from main
Incorporated content from PR #4436 (Multi-Adapter RL Training) and the NPP section that were added to main after this PR branch was created. Changes:

- Added Multi-Adapter RL Training subsection under the PPO trainer section
- Added Naive Pipeline Parallelism (NPP) subsection under Multi-GPU Training
- Maintained consistent formatting with the rewritten documentation style

Resolves merge conflict between the PR #4421 complete rewrite and the additions from PR #4436 that were merged to main.
1 parent 075f4d3 commit e09c67c

File tree: 1 file changed (+125, -0)

docs/source/peft_integration.md

@@ -338,6 +338,96 @@ trainer.train()
</hfoption>
</hfoptions>

### Proximal Policy Optimization (PPO)

#### Multi-Adapter RL Training
You can use a single base model with multiple PEFT adapters for the entire PPO algorithm, including retrieving reference logits, computing active logits, and calculating rewards. Because only one copy of the base model is kept in memory, this approach enables memory-efficient RL training.

> [!WARNING]
> This feature is experimental and convergence has not been extensively tested. We encourage the community to share feedback and report any issues.

**Requirements**

Install PEFT and, optionally, bitsandbytes if you want to quantize the base model (8-bit or 4-bit):

```bash
pip install peft bitsandbytes
```

**Training Workflow**

The multi-adapter approach requires three stages:

1. **Supervised Fine-Tuning (SFT)**: Train a base model on your target domain (e.g., the IMDB dataset) using `SFTTrainer`
2. **Reward Model Training**: Train a reward model adapter using PEFT and `RewardTrainer` (see the [reward modeling example](https://github.com/huggingface/trl/tree/main/examples/scripts/reward_modeling.py) and the sketch below)
3. **PPO Training**: Fine-tune new adapters with PPO, using the reward adapter

> [!IMPORTANT]
> Use the same base model (architecture and weights) for stages 2 and 3.
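
To make stage 2 concrete, here is a minimal sketch of training a reward adapter with `RewardTrainer`. The dataset (`trl-lib/ultrafeedback_binarized`), the output directory, and the LoRA hyperparameters are illustrative assumptions, and the exact `RewardTrainer` keyword names vary between TRL versions; the reward modeling example linked above is the authoritative reference.

```python
from datasets import load_dataset
from peft import LoraConfig, TaskType
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardConfig, RewardTrainer

model_name = "huggyllama/llama-7b"

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # llama tokenizers ship without a pad token

# A single-logit sequence-classification head acts as the reward head
reward_model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)
reward_model.config.pad_token_id = tokenizer.pad_token_id

# LoRA adapter for the reward model (illustrative hyperparameters)
reward_lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
)

# Example preference dataset with chosen/rejected pairs
preference_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

trainer = RewardTrainer(
    model=reward_model,
    args=RewardConfig(output_dir="llama-7b-rm-adapter"),
    processing_class=tokenizer,  # named `tokenizer` in older TRL releases
    train_dataset=preference_dataset,
    peft_config=reward_lora_config,
)
trainer.train()
trainer.push_to_hub()  # publish the adapter so it can be passed as `reward_adapter` below
```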

**Basic Usage**

After training your reward adapter and pushing it to the Hub:

```python
from peft import LoraConfig
from trl import AutoModelForCausalLMWithValueHead, PPOTrainer

model_name = "huggyllama/llama-7b"
rm_adapter_id = "trl-lib/llama-7b-hh-rm-adapter"

# Configure PPO adapter
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# Load model with reward adapter
model = AutoModelForCausalLMWithValueHead.from_pretrained(
    model_name,
    peft_config=lora_config,
    reward_adapter=rm_adapter_id,
)

trainer = PPOTrainer(model=model, ...)
```

In your training loop, compute rewards using:

```python
rewards = trainer.model.compute_reward_score(**inputs)
```
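
For orientation, the snippet below sketches how `compute_reward_score` might slot into a manual loop built on the legacy `PPOTrainer` step API. The `tokenizer`, the `query` column, `generation_kwargs`, and the reward indexing (taking the last-token score) are assumptions borrowed from TRL's multi-adapter example script, not a fixed recipe.

```python
generation_kwargs = {"max_new_tokens": 32, "do_sample": True, "top_k": 0, "top_p": 1.0}

for batch in trainer.dataloader:
    query_tensors = batch["input_ids"]

    # Generate responses with the currently active policy adapter
    response_tensors = trainer.generate(query_tensors, return_prompt=False, **generation_kwargs)
    batch["response"] = tokenizer.batch_decode(response_tensors)

    # Score the full query + response texts with the reward adapter
    texts = [q + r for q, r in zip(batch["query"], batch["response"])]
    inputs = tokenizer(texts, return_tensors="pt", padding=True).to(trainer.accelerator.device)
    raw_rewards = trainer.model.compute_reward_score(**inputs)
    rewards = [raw_rewards[i, -1, 1] for i in range(len(raw_rewards))]  # last-token score; indexing depends on the reward head

    # One PPO optimization step on this batch
    stats = trainer.step(query_tensors, response_tensors, rewards)
```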

**Advanced Features**

**Multiple Policy Adapters**

You can train multiple adapters on the same base model for different policies. Control which adapter to activate using the `ppo_adapter_name` argument:

```python
adapter_name_policy_1 = "policy_1"
rewards = trainer.model.compute_reward_score(**inputs, ppo_adapter_name=adapter_name_policy_1)
```

**Quantized Base Models**

For memory-efficient training, load the base model in 8-bit or 4-bit while keeping the adapters in float32:

```python
from transformers import BitsAndBytesConfig

model = AutoModelForCausalLMWithValueHead.from_pretrained(
    model_name,
    peft_config=lora_config,
    reward_adapter=rm_adapter_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
)
```
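
The prose above also mentions 4-bit loading. A sketch of that variant, reusing the names from the previous snippet, might look like the following; the NF4 quantization type and bfloat16 compute dtype are illustrative choices:

```python
import torch
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 while weights stay 4-bit
)

model = AutoModelForCausalLMWithValueHead.from_pretrained(
    model_name,
    peft_config=lora_config,
    reward_adapter=rm_adapter_id,
    quantization_config=quantization_config,
)
```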

## QLoRA: Quantized Low-Rank Adaptation

QLoRA combines 4-bit quantization with LoRA to enable fine-tuning of very large models on consumer hardware. This technique can reduce memory requirements by up to 4x compared to standard LoRA.
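
A rough sketch of the combination, with an illustrative model name and hyperparameters: the frozen base model is loaded in 4-bit NF4 via bitsandbytes, and trainable LoRA adapters sit on top of it.

```python
import torch
from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization for the frozen base weights
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B",  # illustrative model
    quantization_config=bnb_config,
)

# The LoRA adapters stay in higher precision and are the only trainable weights,
# e.g. when passed as `peft_config` to a TRL trainer such as SFTTrainer
lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, bias="none", task_type="CAUSAL_LM")
```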
@@ -687,6 +777,41 @@ accelerate launch trl/scripts/sft.py \
--lora_r 32
```

### Naive Pipeline Parallelism (NPP) for Large Models

For very large models (>60B parameters), TRL supports Naive Pipeline Parallelism (NPP), which splits the model and its adapters across multiple GPUs: each GPU holds a slice of the layers, and activations and gradients are communicated between GPUs at the slice boundaries. Both `int8`-quantized and full-precision base models are supported.

![NPP](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/trl-npp.png)

**How to Use NPP**

Load your model with a custom `device_map` to split it across multiple devices:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig

# LoRA configuration for the adapters (same settings as the earlier example)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# Create custom device map (see accelerate documentation)
device_map = {
    "model.embed_tokens": 0,
    "model.layers.0": 0,
    # ... distribute layers across GPUs
    "lm_head": 0,  # Must be on GPU 0
}

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    device_map=device_map,
    peft_config=lora_config,
)
```

> [!IMPORTANT]
> - Keep the `lm_head` module on the first GPU (device 0) to avoid errors
> - See this [tutorial on device maps](https://github.com/huggingface/blog/blob/main/accelerate-large-models.md) for proper configuration
> - Run training scripts directly (not with `accelerate launch`): `python script.py`
> - Data Parallelism is not yet supported with NPP
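
Writing a full device map for a 70B model by hand is tedious. One way to generate it, sketched here under assumed per-GPU memory budgets, is to use accelerate's `init_empty_weights` and `infer_auto_device_map` helpers and then pin the `lm_head` entry to GPU 0 as required above.

```python
from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "meta-llama/Llama-2-70b-hf"

# Build the model structure without allocating any weights
config = AutoConfig.from_pretrained(model_id)
with init_empty_weights():
    empty_model = AutoModelForCausalLM.from_config(config)

# Let accelerate split the layers given per-GPU memory budgets (assumed values)
device_map = infer_auto_device_map(
    empty_model,
    max_memory={0: "70GiB", 1: "70GiB"},
    no_split_module_classes=["LlamaDecoderLayer"],
)
device_map["lm_head"] = 0  # keep lm_head on GPU 0, per the note above
```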
## Resources

### TRL Examples and Notebooks
