Reproduction
- Use the google/gemma-3-270m-it model.
- Convert or load Cactus weights that include a vocab.txt file in ID<TAB>token format.
- Initialize the model with cactus_init(...).
- Initialization fails with:
  Exception during init: stof
Expected Behavior
Models should initialize successfully even when vocab.txt contains tokens with literal tab characters.
Actual Behavior
Initialization fails because the tokenizer loader interprets a tab inside the token text as a score separator and then attempts to parse the remainder with std::stof(...).
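The failure mode described above can be checked in isolation. The following is a minimal sketch, not code from the Cactus loader: the helper name stof_rejects is hypothetical and only demonstrates that std::stof throws std::invalid_argument when given text with no digits, such as a lone tab. On some standard library implementations the what() text of that exception is just the function name, which would be consistent with the "stof" message in the report, but that detail is implementation-specific.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper for illustration only (not part of Cactus).
// Returns true when std::stof cannot convert the text at all and
// throws std::invalid_argument.
bool stof_rejects(const std::string& text) {
    try {
        (void)std::stof(text);
    } catch (const std::invalid_argument&) {
        return true;   // no conversion could be performed
    } catch (const std::out_of_range&) {
        return false;  // text was numeric but outside the float range
    }
    return false;      // text parsed as a float
}
```

Calling stof_rejects("\t") reports a rejection, while a well-formed score such as "-1.5" does not.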
Likely Root Cause
Before f8afc46, the loader treated everything after the first tab as part of the token.
After f8afc46, the loader looks for another tab inside that token text and treats it as token<TAB>score. That breaks valid entries where the token itself contains tabs.
Gemma is a concrete repro case because its vocabulary includes tokens such as "\t\t\t". In that case, the parser ends up trying to parse "\t" as a float, which throws and causes model initialization to fail.
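One possible direction for a fix is sketched below. It is an assumption for discussion, not the actual loader code and not necessarily how the maintainers would resolve it: VocabEntry, parse_full_float, and parse_vocab_line are hypothetical names. The idea is to treat the text after the last tab as a score only when it parses completely as a float, and otherwise to keep the entire remainder after the ID as the token.

```cpp
#include <cstddef>
#include <exception>
#include <stdexcept>
#include <string>

// Hypothetical entry type for illustration; not taken from the Cactus sources.
struct VocabEntry {
    int id = 0;
    std::string token;
    bool has_score = false;
    float score = 0.0f;
};

// Succeeds only if the whole string is consumed as a float.
bool parse_full_float(const std::string& text, float& out) {
    if (text.empty()) return false;
    try {
        std::size_t used = 0;
        const float value = std::stof(text, &used);
        if (used != text.size()) return false;  // trailing characters remain
        out = value;
        return true;
    } catch (const std::exception&) {
        return false;  // invalid_argument or out_of_range
    }
}

// Parses "ID<TAB>token" or "ID<TAB>token<TAB>score".
// Tabs inside the token are preserved unless the final field is a valid float.
bool parse_vocab_line(const std::string& line, VocabEntry& entry) {
    const std::size_t first_tab = line.find('\t');
    if (first_tab == std::string::npos) return false;

    VocabEntry parsed;
    try {
        parsed.id = std::stoi(line.substr(0, first_tab));
    } catch (const std::exception&) {
        return false;  // the ID field is not an integer
    }

    const std::string rest = line.substr(first_tab + 1);
    const std::size_t last_tab = rest.rfind('\t');
    float score = 0.0f;
    if (last_tab != std::string::npos && last_tab > 0 &&
        parse_full_float(rest.substr(last_tab + 1), score)) {
        parsed.token = rest.substr(0, last_tab);
        parsed.has_score = true;
        parsed.score = score;
    } else {
        parsed.token = rest;  // no usable score: everything after the ID is the token
    }
    entry = parsed;
    return true;
}
```

With this sketch, a line made of an ID followed by a tab and the token "\t\t\t" yields that token unchanged and no score. The heuristic remains ambiguous for an unscored token that itself ends in a tab followed by a number, so a stricter file format or an explicit per-file flag may be a safer long-term choice.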