
[large tensor] fix CUDA extensions int64 overflow for large tensor dimensions #561

Open
zrr1999 wants to merge 5 commits into PaddlePaddle:develop from zrr1999:large-tensor/extensions

Conversation

@zrr1999
Member

@zrr1999 zrr1999 commented Feb 12, 2026

This PR mainly fixes int32 overflow issues in the CUDA extensions when supporting large tensors; the changes span several .cu files and utils.h:

  1. count_cumsum.cu

    • load_128_bits / store_128_bits gain an IdxT template parameter to support int64_t indexing
    • Loop variables N_vec and i changed to int64_t to avoid overflow for large N
    • Added an early return for N == 0
  2. filter_scores.cu

    • The loop stride gridDim.x * blockDim.x now uses static_cast<int64_t> to avoid overflow
    • grid_size is computed in int64_t first, then converted to int
    • Added PD_CHECKs that total_elements / total_valid do not exceed INT_MAX
  3. fuse_stack_transpose_fp8_quant.cu

    • grid_x is computed in int64_t
    • Added PADDLE_ENFORCE_LE to guarantee grid.x <= INT_MAX
  4. fuse_swiglu_scale.cu / swiglu_kernel.cu

    • Early return when rows == 0 or hidden_size == 0
    • Added checks that rows * hidden2 / rows * input_dim <= INT_MAX
  5. router_metadata.cu

    • Replaced a TODO with PADDLE_ENFORCE_LE checking num_tokens * K <= INT_MAX
  6. tokens_unzip_gather.cu

    • Adjusted the quanted_hidden_size logic
    • Skip the kernel launch when an expert has no tokens
  7. tokens_unzip_slice.cu

    • Loop stride uses static_cast<int64_t>(blockDim.x) * gridDim.x
    • Early return when total_zipped_rows == 0
  8. tokens_zip_prob.cu

    • num_expert and topk changed to int64_t, with INT_MAX checks added
    • total_items and the grid computation use int64_t before converting to int
  9. tokens_zip_unique_add.cu

    • Loop index and stride changed to int64_t to avoid hidden_size overflow
  10. utils.h

    • The num_elements parameter of unrolled_memcpy / vectorized_memcpy / try_vectorized_memcpy changed to int64_t
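The core bug the items above fix can be shown on the host. In the old grid-stride pattern, `gridDim.x * blockDim.x` is multiplied in 32-bit arithmetic before any widening, so for large launches the stride silently loses its high bits; the fix widens one operand first. A minimal sketch, with hypothetical launch values (the function names are ours, not from the PR):

```cpp
#include <cstdint>

// Old pattern: the product is done in 32-bit arithmetic and wraps before
// the (too late) widening to int64_t.
int64_t stride_32bit(unsigned grid_x, unsigned block_x) {
  return static_cast<int64_t>(grid_x * block_x);  // 32-bit multiply, truncated
}

// Fixed pattern, as the PR does with static_cast<int64_t>(gridDim.x) * blockDim.x:
// widening one operand forces the multiply into 64-bit arithmetic.
int64_t stride_64bit(unsigned grid_x, unsigned block_x) {
  return static_cast<int64_t>(grid_x) * block_x;
}
```

For example, with a hypothetical `gridDim.x` of 8,000,000 and `blockDim.x` of 1024 the true stride is 8,192,000,000, which exceeds both UINT_MAX and INT_MAX; the 32-bit product wraps to a much smaller value.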

Copilot AI review requested due to automatic review settings February 12, 2026 09:30
@zrr1999 zrr1999 changed the title from "update" to "[large tensor] fix extensions issues" Feb 12, 2026
Contributor

Copilot AI left a comment


Pull request overview

This PR hardens the custom CUDA operators for large-tensor / large-dimension scenarios: it upgrades some indices and counts from int to int64_t, and adds INT_MAX bounds checks and 0-size fast returns in front of several kernel launchers, to avoid integer overflow and invalid kernel launches.

Changes:

  • Changed element counts and loop indices in several kernels/helper functions to int64_t, reducing overflow risk at large sizes
  • Added INT_MAX upper-bound checks and 0-size early returns for several operators, avoiding invalid configurations and launches
  • Made minor adjustments to some branches to reduce unnecessary kernel launches
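The launcher-side pattern described above (0-size fast return, then an INT_MAX bound before narrowing an int64_t count to the int a kernel expects) can be sketched as host code. This is illustrative only: `check_le` stands in for PD_CHECK / PADDLE_ENFORCE_LE, and the function name and block size are our assumptions, not the PR's actual code:

```cpp
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <string>

// Stand-in for PD_CHECK / PADDLE_ENFORCE_LE: fail loudly if v > bound.
inline void check_le(int64_t v, int64_t bound, const std::string& what) {
  if (v > bound) throw std::runtime_error(what + " exceeds INT_MAX");
}

// Hypothetical launcher skeleton; returns the grid size that would be
// launched, or 0 if the launch is skipped.
int launch_elementwise(int64_t total_elements, int block_size = 256) {
  if (total_elements == 0) return 0;  // 0-size fast return: no kernel launch
  check_le(total_elements, std::numeric_limits<int>::max(), "total_elements");
  // Compute the grid in int64_t first; narrow only after the bound check.
  int64_t grid = (total_elements + block_size - 1) / block_size;
  check_le(grid, std::numeric_limits<int>::max(), "grid");
  int grid_size = static_cast<int>(grid);
  // kernel<<<grid_size, block_size>>>(...) would go here in the real launcher.
  return grid_size;
}
```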

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.

File Description
src/paddlefleet/_extensions/utils.h memcpy helper parameters/indices changed to int64_t; function declaration formatting adjusted
src/paddlefleet/_extensions/tokens_zip_unique_add.cu in-kernel loop index changed to int64_t to avoid overflow when hidden_size is large
src/paddlefleet/_extensions/tokens_zip_prob.cu added INT_MAX checks for num_expert / topk and switched to int64_t intermediates for the grid computation
src/paddlefleet/_extensions/tokens_unzip_slice.cu loop stride computed in int64_t, with an early return when there are 0 rows
src/paddlefleet/_extensions/tokens_unzip_gather.cu tidied the scale-shape reading logic, filled in the quanted_hidden_size computation when there is no scale, and skipped kernel launches for experts with no tokens
src/paddlefleet/_extensions/swiglu_kernel.cu added a 0-size early return and a rows * input_dim <= INT_MAX check
src/paddlefleet/_extensions/router_metadata.cu added a num_tokens * K <= INT_MAX check and explicit casts in some places
src/paddlefleet/_extensions/fuse_swiglu_scale.cu forward/backward add a 0-size early return and a rows * hidden2 <= INT_MAX check
src/paddlefleet/_extensions/fuse_stack_transpose_fp8_quant.cu INT_MAX upper-bound guard on grid.x, using an int64_t intermediate
src/paddlefleet/_extensions/filter_scores.cu added INT_MAX checks for total_elements / total_valid; grid_size computed in int64 then safely cast
src/paddlefleet/_extensions/count_cumsum.cu 128-bit load/store support wider index types, local loop index changed to int64_t, new N == 0 early return
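Several of the per-file changes above amount to the same in-kernel fix: the grid-stride loop's index and stride become int64_t so the loop can walk tensors with more than INT_MAX elements. A host-side simulation of one "thread" of such a loop (names and values are illustrative, not from the PR):

```cpp
#include <cstdint>

// Simulates one thread of a grid-stride loop over n elements: start at
// thread_id, advance by the total number of threads (stride), and count
// how many elements this thread touches. With int indices, i += stride
// could overflow past INT_MAX before reaching n; int64_t makes the loop
// well-defined for large element counts.
int64_t count_visited(int64_t thread_id, int64_t stride, int64_t n) {
  int64_t visited = 0;
  for (int64_t i = thread_id; i < n; i += stride) ++visited;
  return visited;
}
```

With `n` above 2^31 (e.g. `n = 3 * 2^31` and a stride of `2^31`), the int64_t index steps through 0, 2^31, and 2^32 without wrapping; a 32-bit signed index would overflow on the first increment.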

Comment on lines 69 to 83
  int64_t zipped_rows = zipped_expertwise_rowmap_shape[0];
- int num_expert = zipped_expertwise_rowmap_shape[1];
- int topk = dispatched_indices_shape[1];
- PD_CHECK(unzipped_probs.size() == num_expert);
+ int64_t num_expert = zipped_expertwise_rowmap_shape[1];
+ int64_t topk = dispatched_indices_shape[1];
+ PD_CHECK(num_expert <= static_cast<int64_t>(std::numeric_limits<int>::max()),
+          "num_expert must be <= INT_MAX for tokens_zip_prob.");
+ PD_CHECK(topk <= static_cast<int64_t>(std::numeric_limits<int>::max()),
+          "topk must be <= INT_MAX for tokens_zip_prob.");
+ PD_CHECK(unzipped_probs.size() == static_cast<size_t>(num_expert),
+          "unzipped_probs.size() must equal num_expert.");
+ int num_expert_int = static_cast<int>(num_expert);
+ int topk_int = static_cast<int>(topk);

  auto zipped_probs =
      paddle::empty({zipped_rows, topk}, dtype, unzipped_probs[0].place());


Copilot AI Feb 12, 2026


tokens_zip_prob_impl still accesses unzipped_probs[0] to obtain the place right after checking unzipped_probs.size() == num_expert; when num_expert == 0 (and unzipped_probs is empty) this reads out of bounds, and the subsequent rowmap indexing is invalid as well. Consider explicitly requiring num_expert > 0 (and topk > 0, where applicable), or returning an empty-shaped zipped_probs early in the 0-expert / 0-topk case.
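The guard this review comment asks for is a check-before-access early return. A minimal sketch, with `FakeTensor` standing in for paddle::Tensor and the function name chosen for illustration (none of these names are from the PR):

```cpp
#include <cstdint>
#include <utility>
#include <vector>

using FakeTensor = std::vector<float>;  // stand-in for paddle::Tensor

// Returns {launched, result}. When there are no experts (or no topk slots),
// return an empty result before touching probs[0]; probs may be empty, so
// the equivalent of unzipped_probs[0].place() must not be evaluated.
std::pair<bool, FakeTensor> zip_probs(const std::vector<FakeTensor>& probs,
                                      int64_t num_expert, int64_t topk) {
  if (num_expert == 0 || topk == 0) {
    return {false, FakeTensor{}};  // empty-shaped output, no kernel work
  }
  // Past this point it is safe to inspect probs[0] and index the rowmap.
  return {true, probs[0]};
}
```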

@zrr1999 zrr1999 changed the title from "[large tensor] fix extensions issues" to "[large tensor] fix CUDA extensions int64 overflow for large tensor dimensions" Feb 12, 2026
wanghuancoder
wanghuancoder previously approved these changes Feb 13, 2026

@wanghuancoder wanghuancoder left a comment


LGTM. The principle behind the changes in this PR: 1) where an ENFORCE/CHECK guard can catch the case, do not switch to int64; 2) where it cannot be guarded, switch to int64; the impact on kernel performance looks limited at a glance.

@zrr1999
Member Author

zrr1999 commented Feb 13, 2026

LGTM. The principle behind the changes in this PR: 1) where an ENFORCE/CHECK guard can catch the case, do not switch to int64; 2) where it cannot be guarded, switch to int64; the impact on kernel performance looks limited at a glance.

Received.

From00
From00 previously approved these changes Feb 13, 2026
Collaborator

@From00 From00 left a comment


LGTM

risemeup1
risemeup1 previously approved these changes Feb 27, 2026
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@zrr1999 zrr1999 dismissed stale reviews from risemeup1, From00, and wanghuancoder via b44d6e9 February 27, 2026 03:16
@zrr1999
Member Author

zrr1999 commented Feb 27, 2026

/re-run all-failed

@zrr1999
Member Author

zrr1999 commented Feb 27, 2026

/re-run all-failed

@zrr1999
Member Author

zrr1999 commented Feb 28, 2026

/re-run all-failed
