[large tensor] fix CUDA extensions int64 overflow for large tensor dimensions #561
zrr1999 wants to merge 5 commits into PaddlePaddle:develop
Conversation
Pull request overview
This PR hardens the custom CUDA operators for the large-tensor / large-dimension case: it promotes selected indices and counts from `int` to `int64_t`, and adds `INT_MAX` bound checks plus 0-size fast returns ahead of several kernel launchers, to avoid integer overflow and invalid kernel launches.
Changes:
- Switch element counts and loop indices in several kernels/helpers to `int64_t`, reducing overflow risk at large sizes
- Add `INT_MAX` upper-bound checks and 0-size early returns for several operators, avoiding illegal configurations and invalid launches
- Slightly adjust some branches to avoid unnecessary kernel launches
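The two guards the PR combines (0-size early return, and a 64-bit size product checked against `INT_MAX` before narrowing) can be sketched host-side as follows. This is a minimal illustration, not the PR's actual code; the function name `ok_to_launch` and the boolean simplification of `PD_CHECK` are assumptions:

```cpp
#include <climits>
#include <cstdint>

// A minimal sketch of the guard pattern added before kernel launches:
// compute the size product in int64_t, check it against INT_MAX before any
// narrowing cast, and skip the launch entirely for empty inputs.
bool ok_to_launch(int64_t rows, int64_t cols) {
    if (rows == 0 || cols == 0) return false;       // 0-size: skip the launch
    int64_t total = rows * cols;                    // 64-bit multiply, no wrap
    return total <= static_cast<int64_t>(INT_MAX);  // safe to cast to int later
}
```

Note that the buggy pattern this replaces (`int total = rows * cols;` with `int` operands) is not merely wrong for large sizes — signed overflow is undefined behavior in C++, so the widened multiply must happen before, not after, the product exceeds `INT_MAX`.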
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| `src/paddlefleet/_extensions/utils.h` | Switch the memcpy helpers' parameters/indices to `int64_t`; tidy the declarations |
| `src/paddlefleet/_extensions/tokens_zip_unique_add.cu` | Switch the in-kernel loop index to `int64_t` to avoid overflow when `hidden_size` is large |
| `src/paddlefleet/_extensions/tokens_zip_prob.cu` | Add `INT_MAX` checks for `num_expert`/`topk` and compute the grid etc. via `int64_t` intermediates |
| `src/paddlefleet/_extensions/tokens_unzip_slice.cu` | Compute the loop stride in `int64_t`; return early when there are 0 rows |
| `src/paddlefleet/_extensions/tokens_unzip_gather.cu` | Tidy the scale-shape reading logic, fill in the `quanted_hidden_size` computation for the no-scale case, and skip kernel launches with no tokens |
| `src/paddlefleet/_extensions/swiglu_kernel.cu` | Add a 0-size early return and a `rows * input_dim <= INT_MAX` check |
| `src/paddlefleet/_extensions/router_metadata.cu` | Add a `num_tokens * K <= INT_MAX` check and explicit casts in some places |
| `src/paddlefleet/_extensions/fuse_swiglu_scale.cu` | Add 0-size early returns and `rows * hidden2 <= INT_MAX` checks to forward/backward |
| `src/paddlefleet/_extensions/fuse_stack_transpose_fp8_quant.cu` | Guard `grid.x` against `INT_MAX` and introduce `int64_t` intermediates |
| `src/paddlefleet/_extensions/filter_scores.cu` | Add `INT_MAX` checks for `total_elements`/`total_valid`; compute `grid_size` in int64 before a safe cast |
| `src/paddlefleet/_extensions/count_cumsum.cu` | Make the 128-bit load/store support a wider index type, switch local loop indices to `int64_t`, add an `N == 0` early return |
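The "compute the grid in `int64_t`, then narrow" pattern described for `filter_scores.cu` can be sketched as a small host helper. This is an illustrative stand-in, not the PR's code: `compute_grid_size` is a hypothetical name and the `PD_CHECK` is modeled with a plain `assert`:

```cpp
#include <cassert>
#include <climits>
#include <cstdint>

// Sketch of the safe grid-size computation: the element count is validated
// against INT_MAX (mirroring the PD_CHECK the PR adds), the ceiling division
// is done in 64-bit, and only then is the result narrowed to int.
int compute_grid_size(int64_t total_elements, int block_size) {
    assert(total_elements <= static_cast<int64_t>(INT_MAX));
    int64_t grid = (total_elements + block_size - 1) / block_size;  // ceil div, 64-bit
    return static_cast<int>(grid);  // safe: grid <= total_elements <= INT_MAX
}
```

The check on `total_elements` is what makes the final cast provably safe: the grid count can never exceed the element count for a positive block size.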
```diff
  int64_t zipped_rows = zipped_expertwise_rowmap_shape[0];
- int num_expert = zipped_expertwise_rowmap_shape[1];
- int topk = dispatched_indices_shape[1];
- PD_CHECK(unzipped_probs.size() == num_expert);
+ int64_t num_expert = zipped_expertwise_rowmap_shape[1];
+ int64_t topk = dispatched_indices_shape[1];
+ PD_CHECK(num_expert <= static_cast<int64_t>(std::numeric_limits<int>::max()),
+          "num_expert must be <= INT_MAX for tokens_zip_prob.");
+ PD_CHECK(topk <= static_cast<int64_t>(std::numeric_limits<int>::max()),
+          "topk must be <= INT_MAX for tokens_zip_prob.");
+ PD_CHECK(unzipped_probs.size() == static_cast<size_t>(num_expert),
+          "unzipped_probs.size() must equal num_expert.");
+ int num_expert_int = static_cast<int>(num_expert);
+ int topk_int = static_cast<int>(topk);

  auto zipped_probs =
      paddle::empty({zipped_rows, topk}, dtype, unzipped_probs[0].place());
```
After checking `unzipped_probs.size() == num_expert`, `tokens_zip_prob_impl` still accesses `unzipped_probs[0]` directly to obtain the place; when `num_expert == 0` (and `unzipped_probs` is empty) that access is out of bounds, and the subsequent rowmap indexing does not hold either. Consider explicitly requiring `num_expert > 0` (and `topk > 0`, if applicable), or returning an empty-shaped `zipped_probs` early in the 0-expert / 0-topk case.
wanghuancoder
left a comment
LGTM. The PR follows these principles: 1) where an ENFORCE/CHECK can catch the case, don't switch to int64; 2) where it can't be caught, switch to int64 — the impact on kernel performance appears limited.
Got it
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
b44d6e9
/re-run all-failed

/re-run all-failed

/re-run all-failed
Fixes int32 overflow issues in the CUDA extensions when supporting large tensors. The changes span several `.cu` files and `utils.h`:

**count_cumsum.cu**
- `load_128_bits`/`store_128_bits` gain an `IdxT` template parameter to support `int64_t` indexing
- `N_vec` and `i` switched to `int64_t` to avoid overflow for large `N`
- early return when `N == 0`

**filter_scores.cu**
- `gridDim.x * blockDim.x` uses `static_cast<int64_t>` to avoid overflow
- `grid_size` computed in `int64_t` first, then converted to `int`
- new `PD_CHECK`s that `total_elements`/`total_valid` do not exceed `INT_MAX`

**fuse_stack_transpose_fp8_quant.cu**
- `grid_x` computed in `int64_t`
- `PADDLE_ENFORCE_LE` guarantees `grid.x <= INT_MAX`

**fuse_swiglu_scale.cu / swiglu_kernel.cu**
- early return when `rows == 0` or `hidden_size == 0`
- checks that `rows * hidden2` / `rows * input_dim <= INT_MAX`

**router_metadata.cu**
- `PADDLE_ENFORCE_LE` replaces a TODO, checking `num_tokens * K <= INT_MAX`

**tokens_unzip_gather.cu**
- tidied the `quanted_hidden_size` logic

**tokens_unzip_slice.cu**
- `static_cast<int64_t>(blockDim.x) * gridDim.x`
- early return when `total_zipped_rows == 0`

**tokens_zip_prob.cu**
- `num_expert` and `topk` switched to `int64_t`, with `INT_MAX` checks added
- `total_items` and the grid computation use `int64_t` before converting to `int`

**tokens_zip_unique_add.cu**
- loop index switched to `int64_t` to avoid `hidden_size` overflow

**utils.h**
- `num_elements` of `unrolled_memcpy`, `vectorized_memcpy`, and `try_vectorized_memcpy` changed to `int64_t`
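The `IdxT` widening described for `count_cumsum.cu`'s 128-bit load/store helpers can be sketched with a host-side stand-in. The `uint128_like` struct and this `load_128_bits` are assumptions for illustration — the real helpers are device code — but they show the point of templating the index type: with `IdxT = int64_t` the byte-offset multiply cannot wrap for large buffers:

```cpp
#include <cstdint>
#include <cstring>

// Host-side model of a 128-bit vector load whose index type is a template
// parameter, so the byte offset (vec_index * 16) is computed in 64-bit.
struct uint128_like { uint64_t lo, hi; };

template <typename IdxT>
uint128_like load_128_bits(const char* base, IdxT vec_index) {
    uint128_like out;
    // Widen before multiplying: safe even when vec_index * 16 > INT_MAX.
    std::memcpy(&out, base + static_cast<int64_t>(vec_index) * 16, 16);
    return out;
}
```

Keeping the index a template parameter (rather than hard-coding `int64_t`) lets call sites that are known to be small stay on 32-bit arithmetic, which can matter for register pressure in tight CUDA loops.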