
Fix Gemma byte-fallback detokenization for UTF-8 output#635

Open
psymon-ai wants to merge 1 commit into cactus-compute:main from psymon-ai:fix/gemma-byte-fallback-detokenize

Conversation

@psymon-ai

Summary

Fix detokenization of Gemma byte-fallback tokens such as <0xEA> so that contiguous byte pieces are reassembled into their intended UTF-8 codepoints instead of surfacing as literal <0xHH> strings in the final text.

The bug

When Gemma 4 generates Korean (and other scripts whose codepoints are not in the SentencePiece vocabulary as single pieces), it emits byte-fallback tokens. Cactus's current decoders surface these as their raw piece strings. An example observed in Gemma-4 audio transcription:

잠 <0xEA><0xB9><0xB0> 수 있는 영상 틀어 줄래?

The byte span <0xEA><0xB9><0xB0> is valid UTF-8 for 깰 (U+AE70), so the intended text is:

잠 깰 수 있는 영상 틀어 줄래?

This affects any language whose Gemma tokens fall through to byte fallback - Korean, Polish, Japanese kana outside the trained vocabulary, and so on.

Root cause

  • SPTokenizer::decode only reassembled bytes inside its batch branch. The single-token branch returns a raw byte, which works for streaming through the existing call site at cactus/ffi/cactus_transcribe.cpp:248 (final_text += piece). The SP batch branch, however, ran no post-aggregation step, so byte-fallback literals that had already passed through postprocess_text unchanged (possible in unusual call orders) surfaced verbatim.
  • BPETokenizer::decode did not handle byte-fallback in either the REPLACE_METASPACE branch or the byte-level branch. Both branches just concatenated the literal id_to_token_[id] strings, so <0xHH> came through verbatim. For Gemma 4 this is the active path because the BPE vs SP detector picks BPE whenever merges is present.
  • The single-token streaming path through BPETokenizer::decode had no byte-fallback short-circuit, so streaming callers would accumulate literal <0xHH> strings instead of bytes.

The fix

A small shared helper in cactus/engine/engine_tokenizer.cpp:

bool parse_byte_fallback_piece(std::string_view piece, uint8_t* out_byte);
std::string reassemble_byte_fallback(const std::string& text);

reassemble_byte_fallback scans the decoded text for contiguous runs of the <0xHH> pattern, decodes the accumulated bytes as UTF-8 when the run is well-formed, and preserves the original literal form when the run is invalid or truncated. This means partial multi-byte sequences at the end of a generation (cut off by max_tokens for example) are not silently dropped - the literal <0xHH> remains visible for debugging.

Applied at the end of:

  • SPTokenizer::decode batch path
  • BPETokenizer::decode REPLACE_METASPACE branch
  • BPETokenizer::decode byte-level branch

Both decoders only run the helper when runtime_config_.byte_fallback is true, so tokenizers that do not declare byte fallback are unaffected.

Plus a single-token short-circuit at the top of BPETokenizer::decode (mirroring the existing SPTokenizer behavior) that returns a raw byte when the only token is a byte-fallback piece. This is what makes streaming work: per-token calls accumulate into a std::string byte buffer that ends up as valid UTF-8 by stream end.

Tests

tests/test_byte_fallback_detokenize.cpp covers:

  • parse_byte_fallback_piece accepts uppercase and lowercase hex
  • Rejects malformed sizes, non-hex chars, truncated patterns
  • Korean 3-byte reassembly (<0xEA><0xB9><0xB0>)
  • Polish 2-byte reassembly (<0xC4><0x85><0xC4><0x99> → ąę)
  • ASCII via byte fallback (<0x20> → space)
  • Invalid partial spans preserve literal form
  • Trailing incomplete bytes preserve literal form
  • Multiple runs separated by regular text
  • Empty and passthrough strings

Numerical impact

134-sample Korean Gemma-4 audio eval on a Galaxy S10e (real on-device run, not desktop):

| Build | Before     | After      |
|-------|------------|------------|
| INT8  | 4.638% CER | 3.949% CER |
| INT4  | 5.545% CER | 4.856% CER |

The improvement comes from repairing sample 31 of that set, whose reference contains the Korean syllable 깰, which Gemma's tokenizer represents via byte fallback. The same bug also affected an FP16 first-32-sample run, where the absolute CER moved from 5.236% to 2.352% after the same fix.

DCO

Signed-off-by line is included on the commit.

Out of scope

  • The BPETokenizer byte-level branch may still benefit from per-stream pending-byte tracking when called token-by-token without the short-circuit (very unusual call pattern). Not addressed here because the documented streaming pattern uses single-token decode, which the short-circuit covers.
  • Other detokenizer normalization issues (NFC/NFD, special whitespace handling) are unchanged.

Build verification (Android arm64-v8a, Galaxy S10e)

Cross-compiled via NDK r27, deployed to a real Galaxy S10e (Exynos 9820, arm64-v8a).

$ adb shell /data/local/tmp/test_byte_fallback_detokenize

╔══════════════════════════════════════════════════════════════════════════════════════╗
║ Running Byte-fallback detokenize                                                     ║
╚══════════════════════════════════════════════════════════════════════════════════════╝
✓ PASS │ parse_piece_valid        
✓ PASS │ parse_piece_rejects      
✓ PASS │ korean_three_byte_reassembly
✓ PASS │ polish_two_byte_reassembly
✓ PASS │ ascii_byte_fallback_space
✓ PASS │ invalid_partial_preserves_literal
✓ PASS │ trailing_incomplete_preserved
✓ PASS │ empty_and_passthrough    
✓ PASS │ two_runs_separated_by_text
✓ PASS │ lowercase_hex_accepted   
────────────────────────────────────────────────────────────────────────────────────────
✓ All 10 tests passed!

All 10 cases pass. The fix was also validated end-to-end via a Python port
of the C++ helper applied to existing Cactus eval JSON, moving Korean
Gemma-4 audio CER from 4.638% to 3.949% on INT8 and 5.545% to 4.856% on INT4
across 134 samples.

Byte-fallback tokens such as <0xEA><0xB9><0xB0> were surfaced as literal
strings in decoded output instead of being reassembled into UTF-8.

For Gemma family tokenizers configured with byte_fallback=true:
  - SPTokenizer batch decode now post-processes the result so any
    byte-fallback remnants are aggregated and UTF-8 decoded.
  - BPETokenizer decode does the same on both the metaspace-replace
    branch and the byte-level branch.
  - BPETokenizer streaming single-token decode short-circuits a
    <0xHH> piece to a single raw byte, matching SPTokenizer behavior
    so streaming callers can accumulate complete UTF-8 sequences.

Invalid or incomplete byte spans (for example a stream ending mid
multi-byte codepoint) are preserved as their original <0xHH> literals,
so byte-fallback debug information is never silently dropped.

Verified locally on a 134-sample Korean Gemma-4 audio eval where this
single change moves the on-device CER from 4.638% to 3.949% on the INT8
build and from 5.545% to 4.856% on the INT4 build, by repairing one
sample whose ground truth contains the Korean character 깰 (kkael).

Signed-off-by: psymon <[email protected]>