Fix Gemma byte-fallback detokenization for UTF-8 output #635
Open
psymon-ai wants to merge 1 commit into
Conversation
Byte-fallback tokens such as <0xEA><0xB9><0xB0> were surfaced as literal
strings in decoded output instead of being reassembled into UTF-8.
For Gemma family tokenizers configured with byte_fallback=true:
- SPTokenizer batch decode now post-processes the result so any
byte-fallback remnants are aggregated and UTF-8 decoded.
- BPETokenizer decode does the same on both the metaspace-replace
branch and the byte-level branch.
- BPETokenizer streaming single-token decode short-circuits a
<0xHH> piece to a single raw byte, matching SPTokenizer behavior
so streaming callers can accumulate complete UTF-8 sequences.
Invalid or incomplete byte spans (for example a stream ending mid
multi-byte codepoint) are preserved as their original <0xHH> literals,
so byte-fallback debug information is never silently dropped.
Verified locally on a 134-sample Korean Gemma-4 audio eval where this
single change moves the on-device CER from 4.638% to 3.949% on the INT8
build and from 5.545% to 4.856% on the INT4 build, by repairing one
sample whose ground truth contains the Korean syllable 깰 (kkae).
Signed-off-by: psymon <[email protected]>
Summary
Fix detokenization of Gemma byte-fallback tokens such as `<0xEA><0xB9><0xB0>` so that contiguous byte pieces are reassembled into their intended UTF-8 codepoints instead of surfacing as literal `<0xHH>` strings in the final text.
The bug
When Gemma 4 generates Korean (and other scripts whose codepoints are not in the SentencePiece vocabulary as single pieces), it emits byte-fallback tokens. Cactus's current decoders surface these as their raw piece string. Example observed Gemma-4 audio transcription:
The byte span `<0xEA><0xB9><0xB0>` is valid UTF-8 for 깰, so the intended text is:
This affects any language whose Gemma tokens fall through to byte fallback: Korean, Polish, Japanese kana that fall outside the trained vocabulary, and so on.
Root cause
`SPTokenizer::decode` only reassembled bytes inside its batch branch. The single-token branch returned a raw byte, which works for streaming through the existing call site at `cactus/ffi/cactus_transcribe.cpp:248` (`final_text += piece`). However, the SP batch branch did not run any post-aggregation that would catch byte-fallback pieces that already passed through `postprocess_text` unchanged in unusual call orders.

`BPETokenizer::decode` did not handle byte fallback in either the `REPLACE_METASPACE` branch or the byte-level branch. Both branches just concatenated the literal `id_to_token_[id]` strings, so `<0xHH>` came through verbatim. For Gemma 4 this is the active path, because the BPE vs SP detector picks BPE whenever `merges` is present.

`BPETokenizer::decode` had no byte-fallback short-circuit, so streaming callers would accumulate literal `<0xHH>` strings instead of raw bytes.
The fix
A small shared helper in `cactus/engine/engine_tokenizer.cpp`, `reassemble_byte_fallback`, scans the decoded text for contiguous runs of the `<0xHH>` pattern, decodes the accumulated bytes as UTF-8 when the run is well-formed, and preserves the original literal form when the run is invalid or truncated. This means partial multi-byte sequences at the end of a generation (cut off by `max_tokens`, for example) are not silently dropped; the literal `<0xHH>` remains visible for debugging.

Applied at the end of:

- `SPTokenizer::decode` batch path
- `BPETokenizer::decode` `REPLACE_METASPACE` branch
- `BPETokenizer::decode` byte-level branch

Both decoders only run the helper when `runtime_config_.byte_fallback` is `true`, so tokenizers that do not declare byte fallback are unaffected.

Plus a single-token short-circuit at the top of `BPETokenizer::decode` (mirroring the existing SPTokenizer behavior) that returns a raw byte when the only token is a byte-fallback piece. This is what makes streaming work: per-token calls accumulate into a `std::string` byte buffer that ends up as valid UTF-8 by stream end.
Tests
`tests/test_byte_fallback_detokenize.cpp` covers:

- `parse_byte_fallback_piece` accepts uppercase and lowercase hex
- `<0xEA><0xB9><0xB0>` → 깰
- `<0xC4><0x85><0xC4><0x99>` → ąę
- `<0x20>` → space
Numerical impact
134-sample Korean Gemma-4 audio eval on a Galaxy S10e (real on-device run, not desktop):
The improvement comes from repairing sample 31 of that set, whose reference contains the Korean syllable 깰, which Gemma's tokenizer represents via byte fallback. The same bug also affected an FP16 first-32-sample run, where the absolute CER moved from 5.236% to 2.352% after the same fix.
DCO
A `Signed-off-by` line is included on the commit.
Out of scope
The `BPETokenizer` byte-level branch may still benefit from per-stream pending-byte tracking when called token-by-token without the short-circuit (a very unusual call pattern). Not addressed here because the documented streaming pattern uses single-token decode, which the short-circuit covers.
Build verification (Android arm64-v8a, Galaxy S10e)
Cross-compiled via NDK r27, deployed to a real Galaxy S10e (Exynos 9820, arm64-v8a).
All 10 cases pass. The fix was also validated end-to-end via a Python port
of the C++ helper applied to existing Cactus eval JSON, moving Korean
Gemma-4 audio CER from 4.638% to 3.949% on INT8 and 5.545% to 4.856% on INT4
across 134 samples.