
Conversation


@Tom-Zheng Tom-Zheng commented Oct 29, 2025

📌 Description

This PR fixes an illegal memory access in the trtllm-gen attention kernels. It changes the workspace buffer passed to the kernels from int_workspace_buffer to float_workspace_buffer; int_workspace_buffer is a fixed-size buffer that is not initialized to zero, so it should not be used here.
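
For illustration only (buffer sizes and allocation details below are assumptions, not FlashInfer's actual defaults), the difference between the two buffers looks roughly like this:

```python
import torch

# float_workspace_buffer: caller-provided, sized for the kernel's needs, and
# expected to be zero-initialized before its first use.
float_workspace_buffer = torch.zeros(
    128 * 1024 * 1024, dtype=torch.uint8, device="cuda"
)

# int_workspace_buffer: a small, fixed-size scratch buffer whose contents are
# undefined; passing it to trtllm_paged_attention_decode lets the kernel read
# stale metadata and touch memory it does not own.
int_workspace_buffer = torch.empty(
    8 * 1024 * 1024, dtype=torch.uint8, device="cuda"
)
```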

🔍 Related Issues

Issue #1928

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • Bug Fixes
    • Fixed memory allocation in the decode module to improve computation accuracy and stability during text generation.

@gemini-code-assist

Summary of Changes

Hello @Tom-Zheng, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical illegal memory access bug in the trtllm-gen attention kernels. The fix is a targeted change to the workspace buffer used by the paged_run function, switching from an unsuitable integer buffer to the appropriate float buffer. This prevents memory corruption caused by an improperly managed buffer and keeps these high-performance kernels stable and correct.

Highlights

  • Memory Access Fix: Resolved an illegal memory access issue within the trtllm-gen attention kernels, which was causing instability.
  • Workspace Buffer Change: Switched the workspace buffer used in the paged_run function from int_workspace_buffer to float_workspace_buffer to ensure proper initialization and sizing.
  • Root Cause Addressed: The int_workspace_buffer was identified as problematic due to its fixed size and uninitialized state, leading to the memory access errors.


coderabbitai bot commented Oct 29, 2025

Walkthrough

The paged_run wrapper in the TrtllmGenDecodeModule now passes float_workspace_buffer instead of int_workspace_buffer to the trtllm_paged_attention_decode kernel, changing the workspace buffer data source for the decode computation.

Changes

Cohort / File(s): Workspace buffer parameter update — flashinfer/decode.py
Change Summary: Modified paged_run to pass float_workspace_buffer (instead of int_workspace_buffer) as workspace parameter to trtllm_paged_attention_decode

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

  • Verify that float_workspace_buffer is the correct buffer type for trtllm_paged_attention_decode computation requirements
  • Confirm this change resolves the underlying issue (e.g., data type mismatch, memory layout compatibility)
  • Check for any related calls or dependencies that may be affected by this buffer type swap

Poem

🐰 A buffer swap, precise and neat,
From int to float, the math complete,
The workspace flows with proper grace,
Computation finds its rightful place! ✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 0.00%, which is below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (2 passed)
  • Title Check — ✅ Passed: The title "Fix trtllm-gen attention illegal memory access" directly and clearly summarizes the main change: the changeset modifies the trtllm-gen decode module to use float_workspace_buffer instead of int_workspace_buffer to fix an illegal memory access issue. The title is concise, specific, and accurate.
  • Description Check — ✅ Passed: The PR description follows the required template and includes all key sections: a Description explaining the fix (changing from int_workspace_buffer to float_workspace_buffer and why), a Related Issues section linking issue #1928, and a completed Pull Request Checklist with pre-commit checks and tests marked as done.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request addresses a critical illegal memory access bug in the trtllm-gen attention kernels by correctly using float_workspace_buffer instead of the uninitialized int_workspace_buffer in flashinfer/decode.py. The change is correct and effectively resolves the issue for the decode path.

However, a similar vulnerability exists in flashinfer/prefill.py where int_workspace_buffer is also used for the trtllm-gen backend's paged attention. To ensure comprehensive stability and prevent similar memory errors, it is highly recommended to apply the same fix to the prefill path in this pull request.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
flashinfer/decode.py (2)

1928-1936: Op name mismatch: ragged_run vs paged_run.

Custom op registers as ..._ragged_run (Line 1928) while the fake op registers as ..._paged_run (Line 2004). This inconsistency can confuse tooling/tests and complicate mock paths.

Unify the op name (ragged_run or paged_run) across both decorators for clarity and easier stubbing.

Also applies to: 2004-2012
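
For context, a minimal hypothetical sketch of keeping the real op and its fake registration under a single name (the namespace and op name are illustrative, not FlashInfer's actual registrations, and it assumes PyTorch 2.4+ `torch.library.custom_op`):

```python
import torch

# Register the real op and its fake (meta) implementation under ONE name so that
# tracing, torch.compile, and test stubs all resolve the same operator.
@torch.library.custom_op("demo::paged_run", mutates_args=())
def paged_run(q: torch.Tensor) -> torch.Tensor:
    # Stand-in for the actual attention kernel call.
    return q.clone()

@paged_run.register_fake
def _(q: torch.Tensor) -> torch.Tensor:
    # Fake op: only describes output shape/dtype; no real computation.
    return torch.empty_like(q)
```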


1988-2002: Fix is applied correctly; capacity guard is optional but recommended.

The critical change from int_workspace_buffer to float_workspace_buffer (line 1991) is confirmed in the trtllm-gen decode path. This prevents illegal memory access with the kernel's zero-initialized workspace.

The optional capacity guard suggestion is good practice to fail fast on buffer mismatches:

ws_cap = float_workspace_buffer.numel() * float_workspace_buffer.element_size()
assert ws_cap >= workspace_size, f"buffer {ws_cap} bytes < required {workspace_size} bytes"

Minor note: Custom op registered as "..._ragged_run" (line 1930) but fake op uses "..._paged_run" (line 1935)—consider aligning names if applicable.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bb6b620 and a28664f.

📒 Files selected for processing (1)
  • flashinfer/decode.py (1 hunks)


@yzh119 yzh119 left a comment


Thanks for spotting the bug and working on the bugfix!


yzh119 commented Oct 29, 2025

/bot run

@flashinfer-bot

GitLab MR !98 has been created, and the CI pipeline #37527615 is currently running. I'll report back once the pipeline job completes.

@flashinfer-bot

[CANCELED] Pipeline #37527615: canceled

@yzh119 yzh119 merged commit 6a962ef into flashinfer-ai:main Oct 29, 2025
4 checks passed
yzh119 pushed a commit that referenced this pull request Nov 2, 2025
…benchmark (#2018)


## 📌 Description


[tests/attention/test_trtllm_gen_attention.py](https://github.com/flashinfer-ai/flashinfer/blob/v0.5.0rc2/tests/attention/test_trtllm_gen_attention.py#L1021-L1076)
was failing and therefore marked xfail.

PR #2002 fixed the underlying root cause. This PR therefore removes the
`xfail` marker so that these long-seqlen cases stay covered going
forward.

Additionally, PR #2002 revealed a bug in the microbenchmark script where
[trtllm_batch_decode_with_kv_cache](https://github.com/flashinfer-ai/flashinfer/blob/v0.5.0rc2/flashinfer/decode.py#L2082-L2083)
explicitly requires the workspace to be zeroed before first use:
```
    workspace_buffer : torch.Tensor. Must be initialized to 0 for its first use.
        workspace
```
while the microbenchmark code did not zero it out, causing undefined
behavior such as IMAs that depend on the order in which backends are tested.
This PR fixes the issue by explicitly calling
`workspace_buffer.zero_()` between backends.
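
A minimal sketch of the benchmark-side fix (the loop, backend names, and the
`run_decode_benchmark` call are placeholders, not the actual microbenchmark code):

```python
import torch

# Shared workspace for all backends; trtllm-gen kernels require it to be
# zero-initialized on first use.
workspace_buffer = torch.zeros(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")

for backend in ("trtllm-gen", "fa2"):  # backend names illustrative
    workspace_buffer.zero_()           # reset so no backend sees stale state
    # run_decode_benchmark(backend, workspace_buffer)  # hypothetical call
```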



## 🔍 Related Issues


## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [ ] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes




## Summary by CodeRabbit

* **Bug Fixes**
  * Improved stability of performance benchmarks by properly resetting the workspace buffer between backend invocations.

* **Tests**
  * Enabled previously skipped test for long sequence length handling.

