fix(providers): UTF-8 streaming handles multi-byte characters correctly #68
Fix UTF-8 streaming in OpenAI provider 🎭
tl;dr
Fixed a bug where emoji and Chinese characters disappear in streaming responses. It's actually pretty fetch! 👑
(Scroll down for technical details)
The Problem
The OpenAI provider silently loses multi-byte UTF-8 characters when they're split across Server-Sent Event (SSE) chunk boundaries.
Current behavior:
What happens when 🎭 (U+1F3AD = [F0 9F 8E AD]) splits across chunks:
- [F0 9F 8E] (incomplete) → fails UTF-8 validation → entire chunk skipped
- [AD ...] (orphaned byte) → corrupted or lost

Silent corruption - no error thrown, characters just disappear.
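The failure mode can be reproduced in isolation. The sketch below is not the provider's code: decode_chunks_naively is a hypothetical stand-in that assumes each chunk is validated on its own and skipped when validation fails, as described above.

```rust
/// Hypothetical stand-in for per-chunk decoding (not the actual openai.rs code):
/// every chunk is validated independently and dropped if it is not valid UTF-8.
fn decode_chunks_naively(chunks: &[&[u8]]) -> String {
    let mut out = String::new();
    for chunk in chunks {
        // A chunk that ends in the middle of a multi-byte character, or that
        // starts with a leftover continuation byte, fails validation as a whole.
        if let Ok(text) = std::str::from_utf8(chunk) {
            out.push_str(text);
        }
        // On Err the chunk is skipped and nothing is reported to the caller.
    }
    out
}
```

Feeding it the chunks [48 69 20 F0 9F 8E] and [AD 21] ("Hi 🎭!" split inside the emoji) returns an empty string: both chunks fail validation, so the surrounding ASCII text is lost along with the emoji.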
The Solution
Use the existing decode_utf8_streaming() utility that the Anthropic provider already uses correctly (see anthropic.rs:404).
The decode_utf8_streaming() function is already in your codebase at streaming.rs:14-32.
Testing
Added 10 comprehensive tests in crates/g3-providers/tests/streaming_utf8_test.rs, covering split byte sequences such as [F0 9F 8E AD] (🎭) and [E4 B8 AD] (中).
All tests pass. Zero regressions in full test suite (634 tests).
Impact
Fixes:
anthropic.rs (consistency improvement)
Performance:
Compatibility:
Why From GB?
GB (Glitter Bomb) is a theatrical fork of G3 that adds Mean Girls-inspired personas to code review. We found this bug because Regina 👑 and Gretchen 💖 kept losing their emoji in streaming responses.
We're not just taking from G3 - we're giving back. This fix has zero GB-specific code. It's a pure improvement that benefits the entire G3 ecosystem.
GB maintains full compatibility with G3 by:
Changes
Modified:
crates/g3-providers/src/openai.rs (9 lines)
- decode_utf8_streaming
- byte_buffer: Vec<u8>
Created:
crates/g3-providers/tests/streaming_utf8_test.rs (215 lines)
Total: 9 production lines changed (minimal change principle)
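For readers unfamiliar with the buffering idea behind this change, here is a minimal sketch. It is an illustration only, not the project's decode_utf8_streaming(): the Utf8StreamDecoder type and its push method are hypothetical names, and replacing genuinely invalid bytes with U+FFFD is an assumption about the desired behavior. Only the byte_buffer: Vec<u8> field name is taken from the change list above.

```rust
/// Hypothetical streaming decoder (illustrative; not the G3 implementation).
struct Utf8StreamDecoder {
    /// Trailing bytes of a character whose remaining bytes have not arrived yet.
    byte_buffer: Vec<u8>,
}

impl Utf8StreamDecoder {
    fn new() -> Self {
        Self { byte_buffer: Vec::new() }
    }

    /// Appends `chunk` to the buffer and returns all text that is complete so far.
    fn push(&mut self, chunk: &[u8]) -> String {
        self.byte_buffer.extend_from_slice(chunk);
        let mut out = String::new();
        loop {
            // Find the longest valid prefix and classify what follows it.
            let (valid, invalid_len) = match std::str::from_utf8(&self.byte_buffer) {
                Ok(_) => (self.byte_buffer.len(), None),
                Err(err) => (err.valid_up_to(), err.error_len()),
            };
            // The first `valid` bytes were just validated, so this cannot fail.
            let prefix = std::str::from_utf8(&self.byte_buffer[..valid])
                .expect("prefix was validated above");
            out.push_str(prefix);
            match invalid_len {
                // Nothing left, or only an incomplete sequence at the end:
                // keep the tail for the next chunk and return.
                None => {
                    self.byte_buffer.drain(..valid);
                    return out;
                }
                // Bytes that can never become valid (assumption: replace with
                // U+FFFD, skip them, and keep decoding the rest of the buffer).
                Some(len) => {
                    out.push('\u{FFFD}');
                    self.byte_buffer.drain(..valid + len);
                }
            }
        }
    }
}
```

With this sketch, pushing [48 69 20 F0 9F 8E] returns "Hi " and holds back three bytes; pushing [AD 21] afterwards returns "🎭!". The design choice is that an incomplete trailing sequence is treated as "not yet decodable" rather than as an error, which is what distinguishes it from validating each chunk on its own.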
Verification
Tests:
$ cargo test --all
634 tests passed, 0 failed
Clippy:
Build:
Scope:
Checklist
About This Contribution
This fix was developed using edge-agentic methodology with formal QA review:
Quality metrics:
About GB (Glitter Bomb)
GB is a theatrical fork of G3 that demonstrates that personality and professional code quality can coexist. We maintain full G3 compatibility while adding 8 Mean Girls-inspired personas for code review.
What makes GB different:
Our commitment: Contribute improvements back to G3. This UTF-8 fix is the first of hopefully many contributions.
GB Team - Where Theatricality Meets Technical Excellence 🎭
P.S. - If emoji in PR descriptions aren't your thing, that's totally valid! The code is solid either way. We just like to have fun while we work. 💖