
Fix Gemma byte-fallback detokenization for UTF-8 output#635

Open
psymon-ai wants to merge 1 commit into cactus-compute:main from psymon-ai:fix/gemma-byte-fallback-detokenize

Conversation

@psymon-ai

Summary

Fix detokenization of Gemma byte-fallback tokens such as <0xEA> so that contiguous byte pieces are reassembled into their intended UTF-8 codepoints instead of surfacing as literal <0xHH> strings in the final text.

The bug

When Gemma 4 generates Korean (and other scripts whose codepoints are not in the SentencePiece vocabulary as single pieces), it emits byte-fallback tokens. Cactus's current decoders surface these as their raw piece strings. An example observed in Gemma-4 audio transcription:

잠 <0xEA><0xB9><0xB0> 수 있는 영상 틀어 줄래?

The byte span <0xEA><0xB9><0xB0> is valid UTF-8 for 깰 (U+AE70), so the intended text is:

잠 깰 수 있는 영상 틀어 줄래?

This affects any language whose Gemma tokens fall through to byte fallback - Korean, Polish, Japanese kana outside the trained vocabulary, and so on.

Root cause

  • SPTokenizer::decode only reassembled bytes inside its batch branch. The single-token branch returns a raw byte, which works for streaming through the existing call site at cactus/ffi/cactus_transcribe.cpp:248 (final_text += piece). The SP batch branch, however, ran no post-aggregation step, so byte-fallback literals that had already passed through postprocess_text unchanged (possible in unusual call orders) surfaced verbatim.
  • BPETokenizer::decode did not handle byte-fallback in either the REPLACE_METASPACE branch or the byte-level branch. Both branches just concatenated the literal id_to_token_[id] strings, so <0xHH> came through verbatim. For Gemma 4 this is the active path because the BPE vs SP detector picks BPE whenever merges is present.
  • The single-token streaming path through BPETokenizer::decode had no byte-fallback short-circuit, so streaming callers would accumulate literal <0xHH> strings instead of bytes.

The fix

A small shared helper in cactus/engine/engine_tokenizer.cpp:

bool parse_byte_fallback_piece(std::string_view piece, uint8_t* out_byte);
std::string reassemble_byte_fallback(const std::string& text);

reassemble_byte_fallback scans the decoded text for contiguous runs of the <0xHH> pattern, decodes the accumulated bytes as UTF-8 when the run is well-formed, and preserves the original literal form when the run is invalid or truncated. This means partial multi-byte sequences at the end of a generation (cut off by max_tokens for example) are not silently dropped - the literal <0xHH> remains visible for debugging.

Applied at the end of:

  • SPTokenizer::decode batch path
  • BPETokenizer::decode REPLACE_METASPACE branch
  • BPETokenizer::decode byte-level branch

Both decoders only run the helper when runtime_config_.byte_fallback is true, so tokenizers that do not declare byte fallback are unaffected.

Plus a single-token short-circuit at the top of BPETokenizer::decode (mirroring the existing SPTokenizer behavior) that returns a raw byte when the only token is a byte-fallback piece. This is what makes streaming work: per-token calls accumulate into a std::string byte buffer that ends up as valid UTF-8 by stream end.

Tests

tests/test_byte_fallback_detokenize.cpp covers:

  • parse_byte_fallback_piece accepts uppercase and lowercase hex
  • Rejects malformed sizes, non-hex chars, truncated patterns
  • Korean 3-byte reassembly (<0xEA><0xB9><0xB0>)
  • Polish 2-byte reassembly (<0xC4><0x85><0xC4><0x99> → ąę)
  • ASCII via byte fallback (<0x20> → space)
  • Invalid partial spans preserve literal form
  • Trailing incomplete bytes preserve literal form
  • Multiple runs separated by regular text
  • Empty and passthrough strings

Numerical impact

134-sample Korean Gemma-4 audio eval on a Galaxy S10e (real on-device run, not desktop):

| Build | Before     | After      |
|-------|------------|------------|
| INT8  | 4.638% CER | 3.949% CER |
| INT4  | 5.545% CER | 4.856% CER |

The improvement comes from repairing sample 31 of that set, whose reference contains the Korean syllable 깰, which Gemma's tokenizer represents via byte fallback. The same bug also affected an FP16 first-32-sample run, where the absolute CER moved from 5.236% to 2.352% after the same fix.

DCO

Signed-off-by line is included on the commit.

Out of scope

  • The BPETokenizer byte-level branch may still benefit from per-stream pending-byte tracking when called token-by-token without the short-circuit (very unusual call pattern). Not addressed here because the documented streaming pattern uses single-token decode, which the short-circuit covers.
  • Other detokenizer normalization issues (NFC/NFD, special whitespace handling) are unchanged.

Build verification (Android arm64-v8a, Galaxy S10e)

Cross-compiled via NDK r27, deployed to a real Galaxy S10e (Exynos 9820, arm64-v8a).

$ adb shell /data/local/tmp/test_byte_fallback_detokenize

╔══════════════════════════════════════════════════════════════════════════════════════╗
║ Running Byte-fallback detokenize                                                     ║
╚══════════════════════════════════════════════════════════════════════════════════════╝
✓ PASS │ parse_piece_valid        
✓ PASS │ parse_piece_rejects      
✓ PASS │ korean_three_byte_reassembly
✓ PASS │ polish_two_byte_reassembly
✓ PASS │ ascii_byte_fallback_space
✓ PASS │ invalid_partial_preserves_literal
✓ PASS │ trailing_incomplete_preserved
✓ PASS │ empty_and_passthrough    
✓ PASS │ two_runs_separated_by_text
✓ PASS │ lowercase_hex_accepted   
────────────────────────────────────────────────────────────────────────────────────────
✓ All 10 tests passed!

All 10 cases pass. The fix was also validated end-to-end via a Python port
of the C++ helper applied to existing Cactus eval JSON, moving Korean
Gemma-4 audio CER from 4.638% to 3.949% on INT8 and 5.545% to 4.856% on INT4
across 134 samples.

Byte-fallback tokens such as <0xEA><0xB9><0xB0> were surfaced as literal
strings in decoded output instead of being reassembled into UTF-8.

For Gemma family tokenizers configured with byte_fallback=true:
  - SPTokenizer batch decode now post-processes the result so any
    byte-fallback remnants are aggregated and UTF-8 decoded.
  - BPETokenizer decode does the same on both the metaspace-replace
    branch and the byte-level branch.
  - BPETokenizer streaming single-token decode short-circuits a
    <0xHH> piece to a single raw byte, matching SPTokenizer behavior
    so streaming callers can accumulate complete UTF-8 sequences.

Invalid or incomplete byte spans (for example a stream ending mid
multi-byte codepoint) are preserved as their original <0xHH> literals,
so byte-fallback debug information is never silently dropped.

Verified locally on a 134-sample Korean Gemma-4 audio eval where this
single change moves the on-device CER from 4.638% to 3.949% on the INT8
build and from 5.545% to 4.856% on the INT4 build, by repairing one
sample whose ground truth contains the Korean character 깰 (kkael).

Signed-off-by: psymon <[email protected]>