Add trajectory-level advantage estimation to reduce turn bias #445
base: main
Conversation
Commits: "fix: overwrite verl other adv algorithms" · "Remove unnecessary comments" · "fix some bugs" · "fix some bugs"
@zzjweb please read the following Contributor License Agreement (CLA). If you agree with the CLA, please reply with the following information.
Contribution License Agreement: This Contribution License Agreement ("Agreement") is agreed to by the party signing below ("You"),
Pull request overview
This PR implements trajectory-level advantage estimation to address turn bias in multi-turn scenarios. The changes introduce a deduplication mechanism using (data_id, rollout_id) pairs to ensure each trajectory is counted only once when computing baseline statistics, controlled by the new compute_mean_std_cross_all_data parameter.
Key Changes:
- Added trajectory-level deduplication logic to GRPO, GRPO_PASSK, REINFORCE++_BASELINE, and RLOO advantage estimators
- Created a unified `compute_advantage` function in `trainer.py` to centralize advantage computation logic
- Introduced the `compute_mean_std_cross_all_data` parameter to control normalization behavior (a minimal sketch of the deduplication idea follows this list)
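A minimal sketch of the deduplication idea described above, assuming `(data_id, rollout_id)` pairs identify trajectories; the helper name and signature here are illustrative, not the PR's actual code:

```python
# Illustrative only: count each trajectory's outcome score once when building
# baseline statistics, even if the trajectory spans several turns (samples).
from collections import defaultdict

def dedup_scores_by_trajectory(scores, data_ids, rollout_ids):
    """Group scores per prompt, keeping one entry per (data_id, rollout_id)."""
    id2score = defaultdict(list)
    seen_pairs = set()
    for score, data_id, rollout_id in zip(scores, data_ids, rollout_ids):
        pair = (data_id, rollout_id)
        if pair in seen_pairs:
            continue  # later turns of an already-seen trajectory are skipped
        seen_pairs.add(pair)
        id2score[data_id].append(score)
    return id2score
```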
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| agentlightning/verl/trainer.py | Adds unified compute_advantage entry point that handles all advantage estimators and passes trajectory identification parameters |
| agentlightning/verl/core_algos.py | Implements trajectory-aware versions of GRPO, GRPO_PASSK, REINFORCE++_BASELINE, and RLOO with deduplication logic using seen_pairs set |
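The unified entry point described in the trainer.py row might look roughly like the following. This is a hedged sketch based only on the parameters visible in the diff below; `ADV_ESTIMATORS`, the `data` dict, and the estimator signatures are assumptions, not the PR's actual code or VeRL's registry API:

```python
# Hypothetical sketch of a unified compute_advantage dispatcher.
from typing import Any, Callable, Dict

ADV_ESTIMATORS: Dict[str, Callable[..., Any]] = {}  # e.g. {"grpo": <estimator fn>, ...}

def compute_advantage(data: Dict[str, Any], adv_estimator: str, num_repeat: int = 1,
                      norm_adv_by_std_in_grpo: bool = True,
                      compute_mean_std_cross_all_data: bool = True,
                      config: Any = None):
    """Look up the estimator and forward trajectory identification fields."""
    # num_repeat is kept for signature parity with the diff; unused in this sketch.
    estimator_fn = ADV_ESTIMATORS[adv_estimator]
    return estimator_fn(
        token_level_rewards=data["token_level_rewards"],
        response_mask=data["response_mask"],
        index=data["index"],            # data_id per sample
        traj_index=data["traj_index"],  # rollout_id per sample
        norm_adv_by_std_in_grpo=norm_adv_by_std_in_grpo,
        compute_mean_std_cross_all_data=compute_mean_std_cross_all_data,
        config=config,
    )
```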
```python
    num_repeat: int = 1,
    norm_adv_by_std_in_grpo: bool = True,
    compute_mean_std_cross_all_data: bool = True,
    config: Any = None,
```
Copilot AI · Jan 6, 2026
The type hint Any is used but not imported. Add Any to the import statement on line 11: from typing import Any, Dict, Tuple, Type
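For reference, the corrected import line as given in the comment above (the other imported names are taken from the comment itself):

```python
# agentlightning/verl/trainer.py: add `Any` to the existing typing import.
from typing import Any, Dict, Tuple, Type
```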
agentlightning/verl/trainer.py (Outdated)
```python
    compute_mean_std_cross_all_data = self.config.algorithm.get(
        "compute_mean_std_cross_all_data", True
    )
```
Copilot AI · Jan 6, 2026
There is trailing whitespace after the closing parenthesis on this line. Remove it to keep the code clean.
| """ | ||
| Compute advantage for GRPO, operating only on Outcome reward | ||
| (with only one scalar reward for each response). | ||
| Args: | ||
| token_level_rewards: `(torch.Tensor)` | ||
| shape is (bs, response_length) | ||
| response_mask: `(torch.Tensor)` | ||
| shape is (bs, response_length) | ||
| norm_adv_by_std_in_grpo: (bool) | ||
| whether to scale the GRPO advantage. | ||
| If True, the advantage is scaled by the std, as in the original GRPO. | ||
| If False, the advantage is not scaled, as in Dr.GRPO (https://arxiv.org/abs/2503.20783). | ||
| compute_mean_std_cross_all_data: bool | ||
| If True (more stable), the mean and std are computed across all data in the batch. | ||
| If False (i.e., standard episode-level adv), the mean and std are computed across N trajectories. | ||
Copilot AI · Jan 6, 2026
The docstring is missing documentation for the parameters index and traj_index, which are essential to the trajectory-level deduplication feature. Add documentation for these parameters to explain their roles in identifying trajectories.
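A possible addition, modeled on the reviewer's suggested REINFORCE++ docstring further down; the wording is illustrative, not the PR author's:

```python
        index: `(np.ndarray)`
            Array of prompt/data identifiers used to group trajectories when
            computing group-level baselines; shape: (bs,).
        traj_index: `(np.ndarray)`
            Array of trajectory identifiers (e.g., rollout IDs), combined with
            `index` for trajectory-level deduplication; shape: (bs,).
```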
```python
    Compute advantage for RF++-baseline (https://arxiv.org/abs/2501.03262), operating only on Outcome reward
    (with only one scalar reward for each response).
    Args:
        token_level_rewards: `(torch.Tensor)`
            shape: (bs, response_length)
        response_mask: `(torch.Tensor)`
            shape: (bs, response_length)
    Returns:
        advantages: `(torch.Tensor)`
            shape: (bs, response_length)
        Returns: `(torch.Tensor)`
            shape: (bs, response_length)
```
Copilot AI · Jan 6, 2026
The docstring is missing documentation for the parameters index, traj_index, and reward_baselines which are part of the function signature. Add documentation for these parameters to clarify their purpose.
Suggested change:
```python
    Compute advantage for RF++-baseline (https://arxiv.org/abs/2501.03262), operating only on outcome reward
    (with only one scalar reward for each response).
    Args:
        token_level_rewards: `(torch.Tensor)`
            Per-token rewards for each response; shape: (bs, response_length).
        response_mask: `(torch.Tensor)`
            Binary mask indicating valid tokens; shape: (bs, response_length).
        index: `(np.ndarray)`
            Array of prompt or data identifiers used to group trajectories for
            computing per-prompt baselines; shape: (bs,).
        traj_index: `(np.ndarray)`
            Array of trajectory identifiers (e.g., rollout IDs) used together with
            `index` for trajectory-level deduplication; shape: (bs,).
        reward_baselines: `(torch.Tensor)`
            Baseline reward values associated with each sample; shape typically
            broadcastable to (bs,) or (bs, response_length). Currently not used in
            this implementation but kept for API compatibility.
        epsilon: `(float)`
            Small constant for numerical stability in normalization operations.
        compute_mean_std_cross_all_data: `(bool)`
            If True, compute normalization statistics across all data; if False,
            respect trajectory-level deduplication when aggregating scores.
        config: `Optional[Any]`
            Optional configuration object; currently unused.
        **kwargs:
            Additional keyword arguments for compatibility with other estimators.
    Returns:
        advantages: `(torch.Tensor)`
            Advantage values per token; shape: (bs, response_length).
        returns: `(torch.Tensor)`
            Return values per token; shape: (bs, response_length).
```
| """ | ||
| Compute advantage for RLOO based on https://arxiv.org/abs/2402.14740 | ||
| Args: | ||
| token_level_rewards: `(torch.Tensor)` | ||
| shape: (bs, response_length) | ||
| response_mask: `(torch.Tensor)` | ||
| shape: (bs, response_length) | ||
| Returns: | ||
| advantages: `(torch.Tensor)` | ||
| shape: (bs, response_length) | ||
| Returns: `(torch.Tensor)` | ||
| shape: (bs, response_length) | ||
| """ |
Copilot AI · Jan 6, 2026
The docstring is missing documentation for the parameters index and traj_index which are part of the function signature. Add documentation for these parameters to explain their role in the RLOO algorithm.
```python
    if len(id2score[idx]) == 1:
        id2mean[idx] = torch.tensor(0.0)
    elif len(id2score[idx]) > 1:
        id2mean[idx] = torch.mean(torch.tensor(id2score[idx]))
```
Copilot AI · Jan 6, 2026
Converting a list of tensors to a single tensor using torch.tensor(id2score[idx]) may not work correctly. Use torch.stack(id2score[idx]) instead to properly stack the tensor list, consistent with how it's done in the GRPO function at line 125.
Suggested change:
```python
        id2mean[idx] = torch.mean(torch.stack(id2score[idx]))
```
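A quick toy illustration of why the review prefers torch.stack here (the values are made up and independent of the PR code):

```python
import torch

# A list of scalar score tensors, as accumulated in id2score.
scores = [torch.tensor(0.5), torch.tensor(1.0), torch.tensor(0.0)]

# torch.stack is the documented way to combine a sequence of tensors; it
# preserves dtype/device and works regardless of element shape.
mean = torch.mean(torch.stack(scores))
print(mean)  # tensor(0.5000)

# torch.tensor(scores) instead re-copies the values element by element and
# raises an error if any element holds more than one value, so stack is safer.
```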
```python
    if len(id2score[idx]) == 1:
        id2mean[idx] = torch.tensor(0.0)
    elif len(id2score[idx]) > 1:
        id2mean[idx] = torch.mean(torch.tensor(id2score[idx]))
```
Copilot AI · Jan 6, 2026
Converting a list of tensors to a single tensor using torch.tensor(id2score[idx]) may not work correctly. Use torch.stack(id2score[idx]) instead to properly stack the tensor list, consistent with how it's done in the GRPO function at line 125.
Suggested change:
```python
        id2mean[idx] = torch.mean(torch.stack(id2score[idx]))
```
Copilot AI · Jan 6, 2026
Import of 'compute_gae_advantage_return' is not used.
Suggested change:
```python
# Create a benign alias so this import is recognized as used while
# still allowing external code to import `compute_gae_advantage_return`
# directly from this module.
compute_gae_advantage_return_fn = compute_gae_advantage_return
```
aac7c26 to 70798e4 (Compare)
Problem
Agent-lightning inherits VeRL's default advantage estimation, which assumes each batch sample is independent. In multi-turn scenarios, this causes turn-level bias: trajectories with more turns contribute more to baseline statistics (mean/std), leading to biased advantage estimation and inefficient optimization.
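To make the bias concrete, here is a toy numerical illustration; the numbers and variable names are hypothetical, not taken from the PR:

```python
import torch

# Two trajectories for the same prompt: A has 3 turns, B has 1 turn.
# Each turn appears as a separate batch sample carrying its trajectory's
# outcome reward.
rewards = torch.tensor([1.0, 1.0, 1.0,   # trajectory A, 3 turns, reward 1.0
                        0.0])            # trajectory B, 1 turn,  reward 0.0

# Per-sample baseline (VeRL default): A's reward is counted three times.
biased_mean = rewards.mean()                   # 0.75, pulled toward the longer trajectory

# Trajectory-level baseline (this PR, compute_mean_std_cross_all_data=False):
# each (data_id, rollout_id) pair contributes exactly once.
dedup_mean = torch.tensor([1.0, 0.0]).mean()   # 0.50, both trajectories weighted equally

print(biased_mean.item(), dedup_mean.item())
```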
Solution
Implements trajectory-level deduplication using `(data_id, rollout_id)` pairs. Set `trainer.compute_mean_std_cross_all_data=False` to ensure each trajectory is counted only once when computing baselines. In `agentlightning.verl.core_algos`, we re-register part of VeRL's `adv_estimator_fn` implementations to integrate the new trajectory-level deduplication logic.
Example Configuration
Control the normalization behavior via the `compute_mean_std_cross_all_data` parameter (a hedged usage sketch follows this list):
- `compute_mean_std_cross_all_data=True` (default): cross-all-data normalization, more stable but still counts each turn
- `compute_mean_std_cross_all_data=False`: trajectory-level normalization, each trajectory counted only once, which eliminates the bias
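A sketch of toggling the flag; the exact config path is an assumption based on the trainer.py diff above, which reads it from `config.algorithm` with a default of `True`:

```python
from omegaconf import OmegaConf

# Hypothetical algorithm config: only compute_mean_std_cross_all_data is the
# flag introduced by this PR; the other keys are illustrative.
config = OmegaConf.create({
    "algorithm": {
        "adv_estimator": "grpo",
        # True  -> cross-all-data normalization (default, counts every turn)
        # False -> trajectory-level normalization (each trajectory counted once)
        "compute_mean_std_cross_all_data": False,
    }
})

# The trainer reads the flag with a safe default, as in the diff above.
flag = config.algorithm.get("compute_mean_std_cross_all_data", True)
print(flag)  # False
```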
Implementation
Affected algorithms: GRPO, GRPO_PASSK, REINFORCE++_BASELINE, RLOO
Files modified:
- `agentlightning/verl/core_algos.py`: trajectory-aware advantage estimators (a minimal sketch of the approach follows below)
- `agentlightning/verl/trainer.py`: unified `compute_advantage` entry point
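For reference, a self-contained sketch of the kind of trajectory-aware, GRPO-style outcome baseline this PR describes. It is not the PR's actual code; the function and variable names are illustrative:

```python
from collections import defaultdict
import torch

def grpo_outcome_advantages(scores, index, traj_index,
                            compute_mean_std_cross_all_data=True,
                            norm_adv_by_std_in_grpo=True, epsilon=1e-6):
    """scores: (bs,) outcome reward per sample (turn); index/traj_index: per-sample ids."""
    id2score = defaultdict(list)
    seen_pairs = set()
    for i in range(len(scores)):
        pair = (index[i], traj_index[i])
        if compute_mean_std_cross_all_data or pair not in seen_pairs:
            # With cross-all-data stats every turn contributes; otherwise each
            # (data_id, rollout_id) trajectory contributes exactly once.
            id2score[index[i]].append(scores[i])
            seen_pairs.add(pair)

    id2mean, id2std = {}, {}
    for idx, group in id2score.items():
        if len(group) == 1:
            id2mean[idx], id2std[idx] = torch.tensor(0.0), torch.tensor(1.0)
        else:
            stacked = torch.stack(group)  # stack, not torch.tensor, per the review above
            id2mean[idx], id2std[idx] = stacked.mean(), stacked.std()

    advantages = torch.empty(len(scores))
    for i in range(len(scores)):
        adv = scores[i] - id2mean[index[i]]
        if norm_adv_by_std_in_grpo:
            adv = adv / (id2std[index[i]] + epsilon)
        advantages[i] = adv
    return advantages

# Toy usage: one prompt, trajectory A spans 3 turns, trajectory B spans 1 turn.
scores = torch.tensor([1.0, 1.0, 1.0, 0.0])
index = ["prompt0"] * 4
traj_index = ["A", "A", "A", "B"]
print(grpo_outcome_advantages(scores, index, traj_index,
                              compute_mean_std_cross_all_data=False))
```

The only difference between the two modes in this sketch is which samples feed the group mean/std; advantages are still assigned to every turn.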