
[bug] KL Divergence explodes to billions in GRPO training #664

@gitlost-murali

Description


Hello,

🐛 Describe the bug

I'm seeing very high KL divergence values on the main branch when running

python -m apps.grpo.main --config apps/grpo/qwen3_1_7b.yaml

with the following settings:

group_size: 8
local_batch_size: 2 # per-device batch size
max_req_tokens: 1024
max_res_tokens: 16384

(complete config file: qwen3_1_7b_kl_checks.yaml)

Here are the logged metrics, including the KL divergence values:

grpo_loss/kl_divergence_mean: 1,568,987.96
grpo_loss/kl_divergence_max: 6,821,680,128.0
grpo_loss/policy_gradient_loss: -0.198
grpo_loss/total_loss: 1.77
buffer/sample/avg_sampled_policy_age: 1.0 

Setting beta=1e-6 suppresses these high values in the total loss, but something seems really off.
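For reference, here is a minimal sketch of how values like these can arise, assuming the loss uses the standard k3 KL estimator (exp(log_ratio) - log_ratio - 1) with a beta-weighted penalty; I haven't verified this against the repo's actual grpo_loss implementation, and the function names below are just illustrative:

import torch

def k3_kl_penalty(logprobs: torch.Tensor, ref_logprobs: torch.Tensor) -> torch.Tensor:
    # k3 estimator: exp(log_ratio) - log_ratio - 1, always >= 0.
    # It is exponential in the per-token log-prob gap, so a gap of ~23 nats on a
    # single token already yields exp(23) ~= 1e10, i.e. a per-token max in the billions.
    log_ratio = ref_logprobs - logprobs
    return torch.exp(log_ratio) - log_ratio - 1.0

def total_loss(pg_loss: torch.Tensor, logprobs: torch.Tensor,
               ref_logprobs: torch.Tensor, beta: float = 1e-6) -> torch.Tensor:
    # Hypothetical combination: beta scales the KL penalty, so beta=1e-6 turns a
    # KL mean of ~1.6e6 into a contribution of only ~1.6 to the total loss, which
    # is why the total loss stays small even while the KL itself explodes.
    return pg_loss + beta * k3_kl_penalty(logprobs, ref_logprobs).mean()

In other words, the small beta hides the problem in the total loss, but the underlying log-prob gap between the trained policy and the reference looks far too large.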

Versions

Tested on the latest main at commit e940fd89ad9b262e690988f2f1a5ee6cb0d25574

Config: apps/grpo/qwen3_1_7b.yaml with 16k response length and batch size 2

Step: 9
torch: 2.9.0+cu128
torchtitan: 0.2.0
vllm: 0.10.1.dev0+g6d8d0a24c.d20251219
