fix(tokenizer): fix byte-fallback incremental decode dropping CJK characters #710
CatherineSue wants to merge 2 commits into `main`
Conversation
…byte-fallback correctness
The previous byte-length comparison (`new_text.len() > prefix_text.len()`)
silently dropped characters when \u{FFFD} resolved to a real character of
the same UTF-8 byte length (e.g., 3-byte FFFD → 3-byte CJK). This caused
two compounding bugs:
1. **Correctness**: CJK characters, symbols, and other 3-byte UTF-8
characters from byte-fallback tokenizers were never emitted.
2. **Performance**: Because characters were never detected as "new text",
prefix_offset never advanced, causing the decode window to grow
unbounded — O(N²) total work on long generations.
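The length-equality trap can be reproduced without a tokenizer. A minimal sketch (strings illustrative only) of why the old check sees "no new text":

```rust
fn main() {
    // Decode window before the final byte-fallback token arrives:
    // the three bytes of 中 are incomplete, so lossy decoding
    // yields a replacement character.
    let prefix_text = "prefix\u{FFFD}";
    // After the last byte token, the same window decodes cleanly.
    let new_text = "prefix中";

    // Both U+FFFD and 中 occupy 3 UTF-8 bytes, so the old check
    // `new_text.len() > prefix_text.len()` is false.
    assert_eq!("\u{FFFD}".len(), 3);
    assert_eq!("中".len(), 3);
    assert!(!(new_text.len() > prefix_text.len())); // bug: nothing emitted
    assert_ne!(new_text, prefix_text); // yet the content did change
}
```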
Fix: compare actual byte content to find the divergence point instead of
comparing lengths. This correctly detects FFFD→character transitions
regardless of byte length, causing offsets to advance naturally every 2-4
tokens during byte-fallback sequences.
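A minimal sketch of the content-based comparison (illustrative only; the PR's actual helper may differ in naming and details):

```rust
// Return the text newly produced between two decodes of the same
// window: skip the common byte prefix, then back up to a UTF-8
// character boundary so the emitted slice is valid.
fn new_text_after_divergence<'a>(prefix_text: &str, new_text: &'a str) -> &'a str {
    let mut i = prefix_text
        .bytes()
        .zip(new_text.bytes())
        .take_while(|(a, b)| a == b)
        .count();
    // i == 0 is always a boundary, so this cannot underflow.
    while !new_text.is_char_boundary(i) {
        i -= 1;
    }
    &new_text[i..]
}

fn main() {
    // 3-byte U+FFFD resolving to 3-byte CJK: same length, new content.
    assert_eq!(new_text_after_divergence("prefix\u{FFFD}", "prefix中"), "中");
    // Ordinary append still works.
    assert_eq!(new_text_after_divergence("hello", "hello world"), " world");
    // No change: nothing new to emit.
    assert_eq!(new_text_after_divergence("abc", "abc"), "");
}
```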
Also bounds the initial prefix_offset in Sequence::with_tokens using the
same INITIAL_INCREMENTAL_DETOKENIZATION_OFFSET (5) used by DecodeStream,
matching the HuggingFace TGI / vLLM convention.
Signed-off-by: Chang Su <chang.s.su@oracle.com>
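The initial-offset bounding described above can be sketched as follows (constant name from the PR; the surrounding constructor shape is hypothetical):

```rust
const INITIAL_INCREMENTAL_DETOKENIZATION_OFFSET: usize = 5;

// Hypothetical sketch of bounding the initial window in a
// Sequence-like constructor: start the prefix at most 5 tokens
// behind the read position instead of at 0, so the first decode
// window stays small even for long prompts.
fn initial_offsets(num_prompt_tokens: usize) -> (usize, usize) {
    let read_offset = num_prompt_tokens;
    let prefix_offset =
        read_offset.saturating_sub(INITIAL_INCREMENTAL_DETOKENIZATION_OFFSET);
    (prefix_offset, read_offset)
}

fn main() {
    assert_eq!(initial_offsets(100), (95, 100));
    assert_eq!(initial_offsets(3), (0, 3)); // short prompts clamp to 0
}
```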
Code Review
This pull request provides a crucial fix for incremental decoding in byte-fallback tokenizers, addressing a correctness issue where CJK characters and other multi-byte sequences were silently dropped, and a performance issue leading to O(N²) tokenizer work. The solution correctly identifies newly decoded text using a byte-content comparison instead of a byte-length comparison, which was the root cause of the problem. The introduction of INITIAL_INCREMENTAL_DETOKENIZATION_OFFSET also aligns the Sequence initialization with established practices in similar projects.
The changes are well-implemented across sequence.rs and stream.rs, ensuring consistent behavior. The addition of a dedicated ByteFallbackTokenizer mock and a comprehensive suite of 7 new regression tests is excellent, thoroughly validating the fix for various scenarios including CJK characters, 4-byte emojis, and offset advancement. This significantly improves the robustness and reliability of the tokenizer component.
Actionable comments posted: 1
Inline comments:
In `@crates/tokenizer/src/sequence.rs`:
- Around lines 129-157: `append_token` currently treats any trailing U+FFFD as an incomplete UTF-8 sequence (using `token_ids`, `read_offset`, `prefix_offset`, and `tokenizer.decode`), but there is no finalize/flush path that can ever emit a legitimate replacement character. Add a `flush()` (or `finalize()`) method on `Sequence` that decodes the remaining buffered tokens (`prefix_offset..read_offset`, or `prefix_offset..`) via `tokenizer.decode(skip_special_tokens)` and returns that text, so an explicit U+FFFD (bytes `EF BF BD`) at stream end is emitted. Update `append_token` (and its docs) so a trailing U+FFFD is suppressed only when `flush()` will later be called; in particular, do not permanently advance `read_offset` in a way that makes the replacement character unrecoverable. Finally, call the new `flush()` wherever sequences are consumed so end-of-stream replacement characters are produced correctly.
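A sketch of the suggested `flush()`, with a lossy byte-token decoder standing in for `tokenizer.decode` (all names and the simplified state are hypothetical):

```rust
// Stand-in for a byte-fallback tokenizer: each token is one raw
// byte, and decoding is UTF-8 lossy, as in real byte-fallback.
fn decode(tokens: &[u8]) -> String {
    String::from_utf8_lossy(tokens).into_owned()
}

struct Sequence {
    token_ids: Vec<u8>,
    prefix_offset: usize,
}

impl Sequence {
    // Emit whatever is still buffered at end-of-stream, including a
    // legitimate trailing U+FFFD that append_token deferred.
    fn flush(&mut self) -> String {
        let text = decode(&self.token_ids[self.prefix_offset..]);
        self.prefix_offset = self.token_ids.len();
        text
    }
}

fn main() {
    // A dangling lead byte (0xE4) can never complete: incremental
    // decoding defers it, but flush must finally emit the U+FFFD.
    let mut seq = Sequence { token_ids: vec![b'h', b'i', 0xE4], prefix_offset: 2 };
    assert_eq!(seq.flush(), "\u{FFFD}");
    assert_eq!(seq.flush(), ""); // idempotent afterwards
}
```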
📒 Files selected for processing (3)
- crates/tokenizer/src/sequence.rs
- crates/tokenizer/src/stream.rs
- crates/tokenizer/src/tests.rs
Align Sequence offset management with DecodeStream: only advance read_offset on successful text emission, not unconditionally. This enables a flush() method that recovers legitimate replacement characters deferred by the trailing-FFFD check at end-of-stream. Signed-off-by: Chang Su <chang.s.su@oracle.com>
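The offset rule from this commit can be sketched against simplified state (the exact field handling here is an assumption, not the PR's code):

```rust
// Advance the offsets only when text was actually emitted;
// otherwise keep the window intact so a later flush() can still
// recover a deferred trailing U+FFFD.
struct Offsets {
    prefix_offset: usize,
    read_offset: usize,
}

fn on_decode(offsets: &mut Offsets, num_tokens: usize, emitted: Option<&str>) {
    if emitted.is_some() {
        offsets.prefix_offset = offsets.read_offset;
        offsets.read_offset = num_tokens;
    }
    // On None: leave both offsets untouched. Advancing
    // unconditionally was the old behavior that made a
    // recovering flush() impossible.
}

fn main() {
    let mut o = Offsets { prefix_offset: 0, read_offset: 3 };
    on_decode(&mut o, 4, None); // incomplete UTF-8: defer
    assert_eq!((o.prefix_offset, o.read_offset), (0, 3));
    on_decode(&mut o, 4, Some("中")); // successful emission: advance
    assert_eq!((o.prefix_offset, o.read_offset), (3, 4));
}
```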
Description
Problem
Incremental decoding in `Sequence::append_token()` and `DecodeStream::step()` uses a byte-length comparison (`new_text.len() > prefix_text.len()`) to detect newly decoded text. This silently fails for byte-fallback tokenizers (SentencePiece with `byte_fallback: true`) when `\u{FFFD}` resolves to a real character of the same UTF-8 byte length.

For example, the CJK character `中` is 3 UTF-8 bytes, the same as `\u{FFFD}`. When three byte-fallback tokens complete a CJK character, the decoded text changes from `"prefix\u{FFFD}"` to `"prefix中"` but the byte length stays identical, so the length check returns false and the character is never emitted.

This causes two compounding bugs:
1. **Correctness**: characters completed via byte-fallback are never emitted.
2. **Performance**: `prefix_offset` never advances, causing the decode window `[prefix_offset..]` to grow unbounded, i.e. O(N²) total tokenizer work on long generations.

Scope & Practical Impact
Investigation of the GLM-5 tokenizer (`zai-org/GLM-5-FP8`) confirms:

- It uses GPT-2-style ByteLevel BPE (`byte_fallback: false`), NOT SentencePiece byte-fallback.
- No `<0xE4>`-style byte tokens exist in the 154,820-entry vocabulary.
- Decoding can never produce `\u{FFFD}`, because GPT-2's byte-level encoding maps all 256 bytes to valid Unicode codepoints.

The bug path cannot trigger with GLM-5 or any GPT-2/ByteLevel BPE tokenizer. It only affects SentencePiece tokenizers with `byte_fallback: true` (e.g., LLaMA, Gemma) when generating characters absent from the vocabulary. Manual end-to-end testing with GLM-4.5-Air (which shares the same tokenizer family) on `origin/main` confirmed no issues with 30k+ streamed CJK tokens.

The original PR #696 reported performance degradation at ~32k tokens with GLM models, but this is more likely attributable to backend/engine-level issues (KV cache pressure, attention scaling, scheduling) than to the tokenizer's incremental decode path.
Solution
Replace the byte-length comparison with a byte-content comparison: find where `new_text` actually diverges from `prefix_text` by comparing bytes, then emit everything after the divergence point. This correctly detects `\u{FFFD}` → character transitions regardless of byte length, causing offsets to advance naturally every 2-4 tokens during byte-fallback sequences.

Also adds `Sequence::flush()` (matching `DecodeStream::flush()`) to recover any text deferred by the trailing-FFFD check at end-of-stream, and aligns `read_offset` management so it only advances on successful emission.

Changes
- `sequence.rs`: Replaced the byte-length comparison in `append_token()` with byte-content divergence detection. Added a `flush()` method. `read_offset` now only advances on successful emission (matching `DecodeStream`).
- `stream.rs`: Applied the same byte-content comparison fix to `DecodeStream::step()`.
- `tests.rs`: Added a `ByteFallbackTokenizer` mock (simulates real byte-fallback behavior via `from_utf8_lossy`) and 10 regression tests covering CJK characters, consecutive CJK, 4-byte emoji, offset advancement, flush behavior, DecodeStream byte-fallback, and prefill offset bounding.

Test Plan

- `cargo +nightly fmt --all -- --check`
- `cargo clippy -p llm-tokenizer --all-targets --all-features -- -D warnings`
- `cargo test -p llm-tokenizer` (110 tests pass, including 10 new regression tests)
- Manual end-to-end test script (GLM-4.5-Air, 32k tokens, streaming CJK)
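The core idea behind the `ByteFallbackTokenizer` mock can be illustrated in isolation (a sketch only; the real mock in `tests.rs` may differ):

```rust
fn main() {
    // Byte-fallback emits 中 as three single-byte tokens.
    let byte_tokens: [u8; 3] = [0xE4, 0xB8, 0xAD];
    let mut buf: Vec<u8> = Vec::new();
    let mut decodes: Vec<String> = Vec::new();
    for b in byte_tokens {
        buf.push(b);
        // from_utf8_lossy simulates real byte-fallback decoding
        // of a partial UTF-8 sequence.
        decodes.push(String::from_utf8_lossy(&buf).into_owned());
    }
    // Partial sequences decode to a replacement character; only
    // the final byte token yields the real character.
    assert_eq!(decodes[0], "\u{FFFD}");
    assert_eq!(decodes[2], "中");
}
```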
Sample output (GLM-4.5-Air, 30k+ chunks of streamed CJK, truncated)
Checklist
- `cargo +nightly fmt` passes
- `cargo clippy --all-targets --all-features -- -D warnings` passes