fix(tokenizer): fix byte-fallback incremental decode dropping CJK characters#710

Open
CatherineSue wants to merge 2 commits into main from fix/tokenizer-incremental-decode-correctness

Conversation


@CatherineSue CatherineSue commented Mar 10, 2026

Description

Problem

Incremental decoding in Sequence::append_token() and DecodeStream::step() uses a byte-length comparison (new_text.len() > prefix_text.len()) to detect newly decoded text. This silently fails for byte-fallback tokenizers (SentencePiece with byte_fallback: true) when \u{FFFD} resolves to a real character of the same UTF-8 byte length.

For example, a CJK character such as 中 occupies 3 UTF-8 bytes — the same as \u{FFFD}. When three byte-fallback tokens complete the character, the decoded text changes from "prefix\u{FFFD}" to "prefix中", but the byte length stays identical, so the length check returns false and the character is never emitted.

This causes two compounding bugs:

  1. Correctness: CJK characters, 3-byte symbols, and other same-length resolutions are silently dropped from streamed output.
  2. Performance: Because no text is detected, prefix_offset never advances, causing the decode window [prefix_offset..] to grow unbounded — O(N²) total tokenizer work on long generations.
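The failure mode comes down to byte lengths alone. The helper below reproduces the old length-based check for illustration (the function name is ours, not the crate's):

```rust
/// The old detection heuristic, reproduced for illustration: new text
/// is "detected" only when the decoded output grows in byte length.
fn detected_by_length(prefix_text: &str, new_text: &str) -> bool {
    new_text.len() > prefix_text.len()
}
```

Since "\u{FFFD}" encodes as EF BF BD and "中" as E4 B8 AD — both three bytes — the check returns false at exactly the moment the replacement character resolves.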

Scope & Practical Impact

This is a latent, narrow edge case — a defensive improvement, not a production-critical fix.

Investigation of the GLM-5 tokenizer (zai-org/GLM-5-FP8) confirms:

  • GLM-5 uses GPT-2 style ByteLevel BPE (byte_fallback: false), NOT SentencePiece byte-fallback.
  • No <0xE4>-style byte tokens exist in the 154,820-entry vocabulary.
  • 4,216 common CJK characters have dedicated tokens (e.g., "中" = token 98322).
  • 16,776 rarer CJK characters are handled via BPE sub-word merges of GPT-2 byte-encoded characters — which never produce \u{FFFD} because GPT-2's byte-level encoding maps all 256 bytes to valid Unicode codepoints.
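To see why ByteLevel BPE never produces \u{FFFD}, here is a sketch of GPT-2's byte-to-unicode table (a reimplementation of the well-known `bytes_to_unicode` idea, not GLM-5's actual code):

```rust
use std::collections::HashMap;

/// Sketch of GPT-2's byte-to-unicode mapping: printable bytes map to
/// themselves, every other byte is shifted to a codepoint >= 256, so
/// all 256 byte values decode to valid characters and U+FFFD can never
/// appear in intermediate text.
fn bytes_to_unicode() -> HashMap<u8, char> {
    let printable: Vec<u8> = (b'!'..=b'~')
        .chain(0xA1..=0xAC)
        .chain(0xAE..=0xFF)
        .collect();
    let mut map = HashMap::new();
    let mut shift = 0u32;
    for b in 0..=255u8 {
        if printable.contains(&b) {
            map.insert(b, char::from_u32(b as u32).unwrap());
        } else {
            map.insert(b, char::from_u32(256 + shift).unwrap());
            shift += 1;
        }
    }
    map
}
```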

The bug path cannot trigger with GLM-5 or any GPT-2/ByteLevel BPE tokenizer. It only affects SentencePiece tokenizers with byte_fallback: true (e.g., LLaMA, Gemma) when generating characters absent from the vocabulary. Manual end-to-end testing with GLM-4.5-Air (which shares the same tokenizer family) on origin/main confirmed no issues with 30k+ streamed CJK tokens.

The original PR #696 reported performance degradation at ~32k tokens with GLM models, but this is more likely attributable to backend/engine-level issues (KV cache pressure, attention scaling, scheduling) rather than the tokenizer's incremental decode path.

Solution

Replace byte-length comparison with byte-content comparison — find where new_text actually diverges from prefix_text by comparing bytes, then emit everything after the divergence point. This correctly detects \u{FFFD} → character transitions regardless of byte length, causing offsets to advance naturally every 2-4 tokens during byte-fallback sequences.
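A minimal sketch of the content-based approach (illustrative names; the PR's actual code lives in `append_token()` and `step()`):

```rust
/// Byte-wise common prefix between the previous and current decode
/// output, backed off to a UTF-8 char boundary so slicing stays valid.
fn divergence_point(prefix_text: &str, new_text: &str) -> usize {
    let mut i = prefix_text
        .as_bytes()
        .iter()
        .zip(new_text.as_bytes())
        .take_while(|(a, b)| a == b)
        .count();
    while !new_text.is_char_boundary(i) {
        i -= 1;
    }
    i
}

/// Emit everything after the divergence point, deferring a trailing
/// U+FFFD in case later tokens complete the character.
fn incremental_delta<'a>(prefix_text: &str, new_text: &'a str) -> Option<&'a str> {
    let delta = &new_text[divergence_point(prefix_text, new_text)..];
    if delta.is_empty() || delta.ends_with('\u{FFFD}') {
        None
    } else {
        Some(delta)
    }
}
```

For "prefix\u{FFFD}" → "prefix中" the divergence sits at byte 6, so "中" is emitted even though the byte lengths match.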

Also adds Sequence::flush() (matching DecodeStream::flush()) to recover any text deferred by the trailing-FFFD check at end-of-stream, and aligns read_offset management so it only advances on successful emission.
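The deferral/flush contract can be sketched as follows (an illustrative shape, not the crate's actual API): step-wise emission withholds a chunk that still ends in U+FFFD, and flush releases it at end of stream, where the replacement character is genuine.

```rust
/// Sketch of the deferral + flush contract.
struct Pending {
    deferred: String,
}

impl Pending {
    /// Withhold output while it still ends in U+FFFD: later tokens may
    /// complete the character.
    fn step(&mut self, delta: &str) -> Option<String> {
        self.deferred.push_str(delta);
        if self.deferred.ends_with('\u{FFFD}') {
            None
        } else {
            Some(std::mem::take(&mut self.deferred))
        }
    }

    /// At end of stream no more tokens can arrive, so a deferred
    /// trailing U+FFFD is emitted as-is.
    fn flush(&mut self) -> Option<String> {
        if self.deferred.is_empty() {
            None
        } else {
            Some(std::mem::take(&mut self.deferred))
        }
    }
}
```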

Changes

  • sequence.rs: Replaced byte-length comparison in append_token() with byte-content divergence detection. Added flush() method. read_offset now only advances on successful emission (matching DecodeStream).
  • stream.rs: Applied same byte-content comparison fix to DecodeStream::step().
  • tests.rs: Added ByteFallbackTokenizer mock (simulates real byte-fallback behavior via from_utf8_lossy) and 10 regression tests covering CJK characters, consecutive CJK, 4-byte emoji, offset advancement, flush behavior, DecodeStream byte-fallback, and prefill offset bounding.
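The mock's core idea, per the description above, can be sketched like this (illustrative names): treat each token as raw bytes and decode lossily, so partially accumulated characters surface as \u{FFFD} exactly as with SentencePiece byte-fallback.

```rust
/// Sketch in the spirit of the PR's ByteFallbackTokenizer mock:
/// concatenate each token's raw bytes and decode with from_utf8_lossy,
/// so an incomplete UTF-8 tail becomes U+FFFD until the remaining
/// bytes arrive.
fn lossy_decode(tokens: &[Vec<u8>]) -> String {
    let bytes: Vec<u8> = tokens.iter().flat_map(|t| t.iter().copied()).collect();
    String::from_utf8_lossy(&bytes).into_owned()
}
```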

Test Plan

  • cargo +nightly fmt --all -- --check
  • cargo clippy -p llm-tokenizer --all-targets --all-features -- -D warnings
  • cargo test -p llm-tokenizer (110 tests pass, including 10 new regression tests)
Manual end-to-end test script (GLM-4.5-Air, 32k tokens, streaming CJK)
import json
import requests

BASE_URL = "http://localhost:3002/v1"
API_KEY = "test_api_key"
MODEL = "/raid/models/zai-org/GLM-4.5-Air"  # adjust to your GLM4.5 model path

url = f"{BASE_URL}/chat/completions"
headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {API_KEY}",
}

payload = {
    "model": MODEL,
    "messages": [
        {"role": "system", "content": "You are a helpful AI assistant. Always respond in Chinese."},
        {"role": "user", "content": "请详细介绍中国历史上所有主要朝代,包括每个朝代的建立者、首都、主要成就、文化贡献和灭亡原因。请尽可能详细地回答。"},
    ],
    "temperature": 0,
    "max_tokens": 32000,
    "ignore_eos": True,
    "stream": True,
}

resp = requests.post(url, headers=headers, json=payload, timeout=600, stream=True)
resp.raise_for_status()

total_text = ""
chunk_count = 0

for line in resp.iter_lines():
    line = line.decode("utf-8")
    if not line.startswith("data: "):
        continue
    data = line[len("data: "):]
    if data == "[DONE]":
        break

    chunk = json.loads(data)
    if chunk.get("usage"):
        print(f"\n\nUsage: {chunk['usage'].get('completion_tokens')} completion tokens")

    choices = chunk.get("choices", [])
    if not choices:
        continue

    content = choices[0].get("delta", {}).get("content", "")
    if content:
        total_text += content
        chunk_count += 1
        print(content, end="", flush=True)

print(f"\n\nTotal chunks: {chunk_count}, Total chars: {len(total_text)}")
Sample output (GLM-4.5-Air, 30k+ chunks of streamed CJK, truncated)
**中央军队**
- **禁军**:
  - 南衙十六卫:负责京城防卫
  - 北衙禁军:负责保卫皇帝安全,包括羽林军、龙武军、神武军等
  -神策军:中后期成为最强大的禁军,由宦官控制
- **其他中央军队**:
  - 京城卫戍部队:负责京城治安
  - 皇家卫队:负责保卫皇宫安全

**地方军队**
- **府兵**:
  -府兵是唐朝初期的主要军事力量
  -府兵分为卫和所,卫是高级军事单位,所是基层军事单位
  -府兵平时务农,战时从军
- **藩镇军队**:
  -藩镇军队是唐朝中后期的主要军事力量
  -藩镇军队由节度使招募和管理
  -藩镇军队长期驻守在边疆或重要地区

...

Total chunks: 30717, Total chars: 48851
Checklist
  • cargo +nightly fmt passes
  • cargo clippy --all-targets --all-features -- -D warnings passes
  • (Optional) Documentation updated

…byte-fallback correctness

The previous byte-length comparison (`new_text.len() > prefix_text.len()`)
silently dropped characters when \u{FFFD} resolved to a real character of
the same UTF-8 byte length (e.g., 3-byte FFFD → 3-byte CJK). This caused
two compounding bugs:

1. **Correctness**: CJK characters, symbols, and other 3-byte UTF-8
   characters from byte-fallback tokenizers were never emitted.
2. **Performance**: Because characters were never detected as "new text",
   prefix_offset never advanced, causing the decode window to grow
   unbounded — O(N²) total work on long generations.

Fix: compare actual byte content to find the divergence point instead of
comparing lengths. This correctly detects FFFD→character transitions
regardless of byte length, causing offsets to advance naturally every 2-4
tokens during byte-fallback sequences.

Also bounds the initial prefix_offset in Sequence::with_tokens using the
same INITIAL_INCREMENTAL_DETOKENIZATION_OFFSET (5) used by DecodeStream,
matching the HuggingFace TGI / vLLM convention.

Signed-off-by: Chang Su <chang.s.su@oracle.com>
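A sketch of that bounding (the constant name and value 5 come from the commit message; the surrounding function is illustrative):

```rust
/// Same value used by DecodeStream, per the TGI / vLLM convention.
const INITIAL_INCREMENTAL_DETOKENIZATION_OFFSET: usize = 5;

/// Initial (prefix_offset, read_offset) for a sequence created from an
/// existing prompt: start the decode window at most 5 tokens back so
/// the first incremental decode does not scan the whole prompt.
fn initial_offsets(num_tokens: usize) -> (usize, usize) {
    let prefix = num_tokens.saturating_sub(INITIAL_INCREMENTAL_DETOKENIZATION_OFFSET);
    (prefix, num_tokens)
}
```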
@CatherineSue CatherineSue requested a review from slin1237 as a code owner March 10, 2026 20:04
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses critical correctness and performance issues in the tokenizer's incremental decoding logic, particularly affecting byte-fallback tokenizers like SentencePiece, GLM, and LLaMA. Previously, a byte-length comparison could silently drop CJK characters and other multi-byte sequences, leading to incorrect output and unbounded growth of the decode window. The changes replace this with a more robust byte-content comparison and ensure proper bounding of the initial decode offset, guaranteeing accurate and efficient streaming of generated text.

Highlights

  • Incremental Decoding Fix: Implemented a byte-content comparison mechanism in incremental decoding functions (append_token() and step()) to correctly handle byte-fallback tokenizers, resolving issues with CJK characters and other multi-byte sequences that were previously dropped.
  • Performance Optimization: Introduced INITIAL_INCREMENTAL_DETOKENIZATION_OFFSET and applied it to bound the prefix_offset in Sequence::with_tokens(), preventing O(N²) performance degradation due to an unbounded decode window.
  • Test Coverage: Added a comprehensive suite of 7 new regression tests, including a ByteFallbackTokenizer mock, to ensure the robustness and correctness of the incremental decoding fixes and offset bounding.
Changelog
  • crates/tokenizer/src/sequence.rs
    • Defined INITIAL_INCREMENTAL_DETOKENIZATION_OFFSET constant.
    • Updated Sequence::with_tokens() to initialize prefix_offset using the new constant.
    • Refactored Sequence::append_token() to use byte-content divergence detection instead of byte-length comparison for incremental text decoding.
  • crates/tokenizer/src/stream.rs
    • Modified DecodeStream::step() to employ byte-content comparison for detecting new text chunks.
  • crates/tokenizer/src/tests.rs
    • Introduced ByteFallbackTokenizer to simulate byte-fallback behavior for testing.
    • Added test_sequence_byte_fallback_cjk_character to verify CJK character decoding.
    • Added test_sequence_byte_fallback_consecutive_cjk to test multiple consecutive CJK characters.
    • Added test_sequence_byte_fallback_4byte_emoji to confirm 4-byte emoji decoding.
    • Added test_sequence_byte_fallback_offsets_advance to ensure prefix_offset advances correctly.
    • Added test_decode_stream_byte_fallback_cjk for DecodeStream CJK handling.
    • Added test_decode_stream_byte_fallback_consecutive_cjk for DecodeStream consecutive CJK handling.
    • Added test_sequence_with_tokens_bounded_prefix_offset to verify the initial prefix offset bounding.
Activity
  • No human activity (comments, reviews, etc.) has been recorded for this pull request yet.


coderabbitai bot commented Mar 10, 2026

📝 Walkthrough

Walkthrough

Refactors incremental detokenization to use content-based divergence detection (byte-wise common-prefix with UTF‑8 boundary backing) and explicit U+FFFD handling; adds Sequence::flush and extensive byte-fallback tokenizer tests.

Changes

  • Sequence / Stream logic (crates/tokenizer/src/sequence.rs, crates/tokenizer/src/stream.rs): Adds INITIAL_INCREMENTAL_DETOKENIZATION_OFFSET; replaces length-based heuristics with content-based divergence detection between prefix and full windows; treats trailing U+FFFD specially; updates prefix_offset/read_offset semantics; adds Sequence::flush.
  • Tests / Byte-fallback mock (crates/tokenizer/src/tests.rs): Adds a ByteFallbackTokenizer mock and extensive regression tests covering ASCII and multi-byte UTF-8 sequences, incremental decoding, prefix_offset bounds, DecodeStream stepping, and flush/fallback edge cases.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

tokenizer

Poem

🐰 I nibble bytes and chase the trail,
Finding where the boundaries fail.
With careful hops I split and flush,
No more broken UTF‑8 mush—
Tokens hum, the stream prevails. ✨

🚥 Pre-merge checks: ✅ 3 passed
  • Title check: Passed. The title accurately describes the main fix: addressing a byte-fallback incremental decode bug where CJK characters were being dropped. This directly aligns with the primary change addressing the byte-length comparison issue.
  • Docstring Coverage: Passed. Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.
  • Description Check: Passed. Check skipped because CodeRabbit's high-level summary is enabled.



@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request provides a crucial fix for incremental decoding in byte-fallback tokenizers, addressing a correctness issue where CJK characters and other multi-byte sequences were silently dropped, and a performance issue leading to O(N²) tokenizer work. The solution correctly identifies newly decoded text using a byte-content comparison instead of a byte-length comparison, which was the root cause of the problem. The introduction of INITIAL_INCREMENTAL_DETOKENIZATION_OFFSET also aligns the Sequence initialization with established practices in similar projects.

The changes are well-implemented across sequence.rs and stream.rs, ensuring consistent behavior. The addition of a dedicated ByteFallbackTokenizer mock and a comprehensive suite of 7 new regression tests is excellent, thoroughly validating the fix for various scenarios including CJK characters, 4-byte emojis, and offset advancement. This significantly improves the robustness and reliability of the tokenizer component.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/tokenizer/src/sequence.rs`:
- Around lines 129-157: Sequence::append_token() treats any trailing U+FFFD as an incomplete UTF-8 sequence, but there is no finalize/flush path that ever emits a legitimate replacement character. Add a finalize() (or flush()) method on Sequence that decodes the remaining buffered tokens (from prefix_offset..read_offset, or prefix_offset..) via tokenizer.decode(skip_special_tokens) and returns that text, so that EF BF BD or an explicit U+FFFD at stream end is emitted. Update append_token() (and its docs) so a trailing U+FFFD is suppressed only when finalize() will be called — that is, do not permanently advance read_offset such that a replacement character can never be produced — and invoke the new finalize() wherever sequences are consumed so end-of-stream replacement characters are produced correctly.


📥 Commits

Reviewing files that changed from the base of the PR and between 7962745 and 33eb448.

📒 Files selected for processing (3)
  • crates/tokenizer/src/sequence.rs
  • crates/tokenizer/src/stream.rs
  • crates/tokenizer/src/tests.rs

Align Sequence offset management with DecodeStream: only advance
read_offset on successful text emission, not unconditionally. This
enables a flush() method that recovers legitimate replacement
characters deferred by the trailing-FFFD check at end-of-stream.

Signed-off-by: Chang Su <chang.s.su@oracle.com>