f8afc46 regresses Gemma tokenizer loading: tab tokens in vocab.txt cause std::stof exception during init #577

@yujonglee

Description

Reproduction

  1. Use the google/gemma-3-270m-it model.

  2. Convert or load Cactus weights that include a vocab.txt file in ID<TAB>token format.

  3. Initialize the model with cactus_init(...).

  4. Initialization fails with:

    Exception during init: stof
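The exception itself is easy to reproduce in isolation: std::stof delegates to strtof, which skips leading whitespace and then requires at least one digit, so a string consisting only of a tab throws std::invalid_argument (whose what() is reported as "stof" by common standard libraries). A minimal sketch:

```cpp
#include <stdexcept>
#include <string>

// Returns true when std::stof rejects the input. A lone tab is whitespace
// only, so no conversion happens and std::invalid_argument is thrown --
// the same "stof" exception seen during cactus_init.
bool stof_throws(const std::string& s) {
    try {
        (void)std::stof(s);
        return false;
    } catch (const std::invalid_argument&) {
        return true;
    }
}
```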

Expected Behavior

Models should initialize successfully even when vocab.txt contains tokens with literal tab characters.

Actual Behavior

Initialization fails because the tokenizer loader interprets a tab inside the token text as a score separator and then attempts to parse the remainder with std::stof(...).
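The failing path can be sketched as follows (a hypothetical reconstruction of the loader logic after f8afc46; names are illustrative, not Cactus's actual code):

```cpp
#include <stdexcept>
#include <string>

// Hypothetical sketch of the post-f8afc46 parsing: after the ID has been
// split off, any further tab in the remainder is assumed to separate
// token and score. For a token made of tabs (e.g. "\t\t\t"), the supposed
// score field is itself whitespace and std::stof throws.
std::string broken_token_parse(const std::string& rest) {
    size_t tab = rest.find('\t');
    if (tab != std::string::npos) {
        // Assumes token<TAB>score; wrong when the token contains tabs.
        float score = std::stof(rest.substr(tab + 1));  // throws "stof" here
        (void)score;
        return rest.substr(0, tab);
    }
    return rest;
}
```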

Likely Root Cause

Before f8afc46, the loader treated everything after the first tab as part of the token.

After f8afc46, the loader looks for another tab inside that token text and treats it as token<TAB>score. That breaks valid entries where the token itself contains tabs.

Gemma is a concrete repro case because its vocabulary includes tokens such as "\t\t\t". In that case, the parser ends up trying to parse "\t" as a float, which throws and causes model initialization to fail.
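One defensive fix is to treat the score as optional: split on the first tab to isolate the ID, then only peel off a trailing score when the text after the last tab actually parses cleanly as a float. A sketch under that assumption (the struct and function names are hypothetical, not Cactus's actual API):

```cpp
#include <cstdlib>
#include <optional>
#include <string>

struct VocabEntry {
    int id;
    std::string token;
    std::optional<float> score;
};

// Parse one "ID<TAB>token[<TAB>score]" line. Tabs inside the token are
// preserved unless the text after the LAST tab is a complete, valid float.
VocabEntry parse_vocab_line(const std::string& line) {
    VocabEntry e;
    size_t first_tab = line.find('\t');
    e.id = std::stoi(line.substr(0, first_tab));
    std::string rest = line.substr(first_tab + 1);

    size_t last_tab = rest.rfind('\t');
    if (last_tab != std::string::npos) {
        std::string maybe_score = rest.substr(last_tab + 1);
        char* end = nullptr;
        float v = std::strtof(maybe_score.c_str(), &end);
        // Only treat it as a score if the whole field converted.
        if (!maybe_score.empty() && end != maybe_score.c_str() && *end == '\0') {
            e.score = v;
            e.token = rest.substr(0, last_tab);
            return e;
        }
    }
    e.token = rest;  // No score: everything after the first tab is the token.
    return e;
}
```

With this, a Gemma entry whose token is "\t\t\t" round-trips intact, while entries that do carry a score still pick it up.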
