
[DIAGNOSTIC - DO NOT MERGE] fix(load_tokenizer): default to use_fast=True #26

Open

hallerite wants to merge 3 commits into main from fix/load-tokenizer-use-fast-default

Conversation

@hallerite
Member

Summary

load_tokenizer currently calls AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=False) without specifying a backend. AutoTokenizer is a factory that picks between two backends:

  • TokenizersBackend (Rust, fast) — the desired path.
  • PythonBackend (pure Python) — has a latent crash in its __init__.

PythonBackend.__init__ calls _add_tokens(...), which calls self.get_vocab(). In transformers/tokenization_utils_base.py:1439:

def get_vocab(self):
    raise NotImplementedError()

PythonBackend does not override this. So whenever HF silently falls back to the Python path, pool construction raises a bare NotImplementedError that surfaces to env workers as ModelError() -> NotImplementedError().
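
The failure pattern in miniature (a hedged sketch; the class and method names below are illustrative stand-ins for the transformers internals described above, not the library's actual code):

class TokenizerBase:
    def get_vocab(self):
        # mirrors transformers/tokenization_utils_base.py:1439
        raise NotImplementedError()

    def _add_tokens(self, tokens):
        # crashes unless the concrete subclass overrides get_vocab
        vocab = self.get_vocab()
        return [t for t in tokens if t not in vocab]

class PythonBackend(TokenizerBase):
    def __init__(self, special_tokens):
        # no get_vocab override on this class, so __init__ can never complete
        self._add_tokens(special_tokens)

PythonBackend(["<pad>"])  # -> NotImplementedError before any real work happens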

Reproduction

Observed in prime-rl rollouts on PrimeIntellect/GLM-4.5-Air with trust_remote_code=False:

  • Intermittent at batch_size=256, rollouts_per_example=8.
  • Reliable at batch_size=512, rollouts_per_example=16.

The identical AutoTokenizer.from_pretrained call returned TokenizersBackend on the head node and PythonBackend on the compute nodes — likely a race in HF's backend selection under concurrent first-time loads.
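
A minimal concurrency harness for exercising the suspected race (hedged sketch; the race is timing-dependent, so divergent backends may not reproduce on every machine):

from concurrent.futures import ThreadPoolExecutor

from transformers import AutoTokenizer

def load_once(_):
    tok = AutoTokenizer.from_pretrained(
        "PrimeIntellect/GLM-4.5-Air", trust_remote_code=False
    )
    return type(tok).__name__

# more than one class name in the output means the fallback race fired
with ThreadPoolExecutor(max_workers=16) as pool:
    print(set(pool.map(load_once, range(16))))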

Stack:

RendererPool.__init__ -> factory() -> load_tokenizer()
  -> AutoTokenizer.from_pretrained(GLM-4.5-Air, trust_remote_code=False)
  -> tokenization_python.py:_add_tokens
  -> self.get_vocab()
  -> tokenization_utils_base.py:1439  raise NotImplementedError()

Fix

Pass use_fast=True explicitly. This forces the Rust path. If fast tokenizer files are genuinely missing, AutoTokenizer raises a clear OSError instead of silently routing to the half-implemented Python class.
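
A sketch of the changed call site (assuming load_tokenizer wraps AutoTokenizer.from_pretrained directly, as the summary describes; the surrounding signature is an assumption):

from transformers import AutoTokenizer

def load_tokenizer(model_name_or_path: str):
    # use_fast=True pins the Rust backend; trust_remote_code stays False
    return AutoTokenizer.from_pretrained(
        model_name_or_path,
        trust_remote_code=False,
        use_fast=True,
    )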

This is also a latent upstream bug in transformers (PythonBackend._add_tokens -> get_vocab can never succeed, since PythonBackend never implements get_vocab), but the renderers-side default is the right call regardless.

🤖 Generated with Claude Code

Forces transformers' Rust TokenizersBackend instead of letting AutoTokenizer
silently fall back to PythonBackend, whose __init__ -> _add_tokens ->
get_vocab() raises NotImplementedError (get_vocab is unimplemented on the
PythonBackend class). Reproduced under concurrent RendererPool init for
GLM-4.5-Air with trust_remote_code=False; the failure rate climbs with
per-step concurrency.

If fast tokenizer files are missing, AutoTokenizer with use_fast=True raises
a clear OSError instead of corrupting the pool silently.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
@hallerite hallerite marked this pull request as ready for review May 13, 2026 11:43
hallerite and others added 2 commits May 13, 2026 11:53
The default-path call site in `load_tokenizer` now passes
`use_fast=True`; update the two call-shape tests that compare the
captured kwargs by equality. The Kimi pinned-revision branch is
intentionally unchanged — its source call site does not pass
`use_fast=True`, and the corresponding assertion already reflects
that.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
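
For reference, a hedged sketch of what one of those call-shape tests could look like after the change (import path, test name, and model string are assumptions, not the repo's actual tests):

from unittest.mock import patch

from renderers import load_tokenizer  # assumed import path

def test_load_tokenizer_default_call_shape():
    with patch("transformers.AutoTokenizer.from_pretrained") as mock_fp:
        load_tokenizer("org/some-model")
    # kwargs are compared by equality, so use_fast=True must now be present
    assert mock_fp.call_args.kwargs == {
        "trust_remote_code": False,
        "use_fast": True,
    }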
… too

The Kimi pinned-revision branch traverses the same HF dispatch logic
that exhibits the silent-slow-fallback race; setting use_fast=True
explicitly forces the failure to be loud on that path as well. The Kimi-K2
family ships tokenizer.json, so the fast path is available in practice.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
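
A sketch of the updated pinned-revision call shape (the helper name and revision constant are placeholders; revision is a real from_pretrained kwarg):

from transformers import AutoTokenizer

KIMI_PINNED_REVISION = "<pinned-sha>"  # placeholder, not the actual pin

def load_kimi_tokenizer(model_name_or_path: str):  # hypothetical helper
    return AutoTokenizer.from_pretrained(
        model_name_or_path,
        revision=KIMI_PINNED_REVISION,
        trust_remote_code=False,
        use_fast=True,  # fail loudly instead of racing into the slow backend
    )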
@hallerite hallerite changed the title fix(load_tokenizer): default to use_fast=True [DIAGNOSTIC - DO NOT MERGE] fix(load_tokenizer): default to use_fast=True May 13, 2026