Fix trtllm-gen attention illegal memory access #2002
Conversation
Summary of Changes (Gemini Code Assist): Hello @Tom-Zheng, I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed: this pull request addresses a critical illegal memory access bug affecting the trtllm-gen attention kernels.
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
Walkthrough: The `paged_run` wrapper in the `TrtllmGenDecodeModule` now passes `float_workspace_buffer` instead of `int_workspace_buffer` to the `trtllm_paged_attention_decode` kernel, changing the workspace buffer data source for the decode computation.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~8 minutes
Code Review
This pull request addresses a critical illegal memory access bug in the trtllm-gen attention kernels by correctly using float_workspace_buffer instead of the uninitialized int_workspace_buffer in flashinfer/decode.py. The change is correct and effectively resolves the issue for the decode path.
However, a similar vulnerability exists in flashinfer/prefill.py where int_workspace_buffer is also used for the trtllm-gen backend's paged attention. To ensure comprehensive stability and prevent similar memory errors, it is highly recommended to apply the same fix to the prefill path in this pull request.
Actionable comments posted: 0
🧹 Nitpick comments (2)
flashinfer/decode.py (2)
1928-1936: Op name mismatch: ragged_run vs paged_run. The custom op registers as `..._ragged_run` (line 1928) while the fake op registers as `..._paged_run` (line 2004). This inconsistency can confuse tooling/tests and complicate mock paths.
Unify the op name (ragged_run or paged_run) across both decorators for clarity and easier stubbing; a naming sketch follows below.
Also applies to: 2004-2012
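As an illustration of the suggested alignment, here is a minimal sketch using PyTorch's generic `torch.library.custom_op` API. This is not flashinfer's actual registration helper, and the op name `mylib::paged_run` is hypothetical; the point is that the real implementation and its fake variant share one fully-qualified name.

```python
import torch

# Hypothetical example (not flashinfer's decorators): the real op and the fake
# op are registered under the SAME fully-qualified name, so tests and mocks
# resolve a single identifier.
@torch.library.custom_op("mylib::paged_run", mutates_args=())
def paged_run(q: torch.Tensor) -> torch.Tensor:
    # Placeholder computation standing in for the real kernel launch.
    return q.clone()

@paged_run.register_fake
def _(q: torch.Tensor) -> torch.Tensor:
    # The fake op only describes output metadata for tracing/compilation.
    return torch.empty_like(q)
```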
1988-2002: Fix is applied correctly; capacity guard is optional but recommended. The critical change from int_workspace_buffer to float_workspace_buffer (line 1991) is confirmed in the trtllm-gen decode path. This prevents illegal memory access with the kernel's zero-initialized workspace.
The optional capacity guard suggestion is good practice to fail fast on buffer mismatches:
```python
ws_cap = float_workspace_buffer.numel() * float_workspace_buffer.element_size()
assert ws_cap >= workspace_size, f"buffer {ws_cap} bytes < required {workspace_size} bytes"
```
Minor note: the custom op is registered as `"..._ragged_run"` (line 1930) but the fake op uses `"..._paged_run"` (line 1935); consider aligning the names if applicable.
Thanks for spotting the bug and working on the bugfix!
/bot run
[CANCELED] Pipeline #37527615: canceled
…benchmark (#2018)

## 📌 Description
[tests/attention/test_trtllm_gen_attention.py](https://github.com/flashinfer-ai/flashinfer/blob/v0.5.0rc2/tests/attention/test_trtllm_gen_attention.py#L1021-L1076) was failing and was therefore marked xfail. PR #2002 fixed the underlying root cause, so this PR removes the `xfail` marker so that these long-seqlen cases stay covered moving forward.

Additionally, PR #2002 revealed a bug in the microbenchmark script: [trtllm_batch_decode_with_kv_cache](https://github.com/flashinfer-ai/flashinfer/blob/v0.5.0rc2/flashinfer/decode.py#L2082-L2083) explicitly requires the workspace to be zeroed before first use:
```
workspace_buffer : torch.Tensor. Must be initialized to 0 for its first use.
```
while the microbenchmark code does not zero it out, causing undefined behavior such as IMAs that depend on the order in which backends are tested. This PR fixes the issue by explicitly calling `workspace_buffer.zero_()` between testing different backends (see the sketch after this description).

## 🔍 Related Issues

## 🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

### ✅ Pre-commit Checks
- [x] I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues.

## 🧪 Tests
- [ ] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes

## Summary by CodeRabbit
* **Bug Fixes**
  * Improved stability of performance benchmarks by properly resetting the workspace buffer between backend invocations.
* **Tests**
  * Enabled a previously skipped test for long sequence length handling.
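As a rough illustration of the benchmark fix described above, the loop shape would be something like the following; `run_backend` and the buffer size are placeholders, not the actual microbenchmark code.

```python
import torch

def run_backend(backend: str, workspace: torch.Tensor) -> None:
    # Stand-in for launching a decode/prefill benchmark with the given backend;
    # pretend the kernel leaves bookkeeping state behind in the workspace.
    workspace[:16].fill_(1)

# Shared workspace; size is an illustrative assumption.
workspace_buffer = torch.zeros(128 * 1024 * 1024, dtype=torch.uint8)

for backend in ("trtllm-gen", "fa2"):
    # Reset between backends so the "must be initialized to 0 for its first use"
    # contract holds and results do not depend on backend ordering.
    workspace_buffer.zero_()
    run_backend(backend, workspace_buffer)
```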
📌 Description
This PR fixes an illegal memory access in the trtllm-gen attention kernels. It changes the workspace buffer from `int_workspace_buffer` to `float_workspace_buffer`. `int_workspace_buffer` is a fixed-size buffer that is not initialized to zero, so it should not be used here.
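To illustrate why an uninitialized buffer is unsafe here (a generic PyTorch example, not flashinfer code): `torch.empty` returns arbitrary memory contents, while `torch.zeros` satisfies a "must start at zero" contract.

```python
import torch

# torch.empty gives uninitialized storage, analogous to the fixed-size
# int_workspace_buffer; a kernel that assumes zeroed bookkeeping fields can
# read garbage offsets from it and access memory out of bounds.
uninitialized = torch.empty(1024, dtype=torch.uint8)

# torch.zeros satisfies the "initialized to 0 for its first use" contract,
# analogous to how float_workspace_buffer is expected to be prepared.
zeroed = torch.zeros(1024, dtype=torch.uint8)

print(int(zeroed.sum()))  # 0: safe for kernels that require a zeroed workspace
```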
🔍 Related Issues
Issue #1928
🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks
- I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- I have installed the hooks with `pre-commit install`.
- I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues.
🧪 Tests
- All tests are passing (`unittest`, etc.).
Reviewer Notes
Summary by CodeRabbit