[DIAGNOSTIC - DO NOT MERGE] fix(pool): default to serial pool init; env opt-in for parallel#29

Draft
hallerite wants to merge 1 commit into main from fix/pool-init-workers-env

Conversation

@hallerite
Member

Summary

  • Default RendererPool construction to serial (workers=1) to eliminate a rare-but-catastrophic race during concurrent AutoTokenizer.from_pretrained in pool build
  • Add RENDERERS_POOL_INIT_WORKERS env var to opt back into parallel construction (clamped to [1, min(size, 8)])
  • Unit tests for the env-var resolution logic

Why

During at-scale RL rollouts against PrimeIntellect/GLM-4.5-Air, we observed intermittent NotImplementedError raised from the transformers Python tokenizer fallback path (tokenization_python.PythonBackend.__init__ → _add_tokens → get_vocab) during RendererPool.__init__. The race is rare but catastrophic: it poisons one slot of the pool, after which any rollout that happens to check out that slot raises.

Forcing use_fast=True (#26) makes the failure mode loud (OSError if fast files are missing) and fixes the common case, but in our cluster we still saw intermittent failures during parallel pool construction. Serializing pool build (workers=1) eliminated them.

The cost of going serial is bounded:

  • Pool size 32, ~300 ms per from_pretrained call → ~10s of additional pool-build time on startup
  • Pool build runs once per env-worker process, off the steady-state rollout path
  • Steady-state cost is unchanged (the pool checks out renderers via a queue.Queue, which is independent of how the pool was populated)
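The last point can be illustrated with a minimal sketch of queue-based checkout. Class and method names here are illustrative, not the PR's actual code; the point is that steady-state behavior depends only on the populated queue, not on how it was filled:

```python
import queue
from contextlib import contextmanager


class RendererPoolSketch:
    """Illustrative pool: steady-state checkout goes through a queue.Queue,
    independent of whether the pool was built serially or in parallel."""

    def __init__(self, renderers):
        self._q = queue.Queue()
        for r in renderers:
            self._q.put(r)

    @contextmanager
    def checkout(self):
        r = self._q.get()   # blocks until a slot is free
        try:
            yield r
        finally:
            self._q.put(r)  # return the slot to the pool
```
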

Users who need fast startup can opt back into parallel construction via RENDERERS_POOL_INIT_WORKERS=N.
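The serial-default, parallel-opt-in build strategy can be sketched as follows (function names are hypothetical, not the PR's actual code):

```python
from concurrent.futures import ThreadPoolExecutor


def build_pool(size, make_renderer, workers=1):
    """Build `size` renderers. workers=1 (the default) constructs them
    serially, avoiding concurrent from_pretrained calls; workers > 1
    opts back into parallel construction."""
    if workers <= 1:
        return [make_renderer() for _ in range(size)]
    with ThreadPoolExecutor(max_workers=workers) as ex:
        return list(ex.map(lambda _: make_renderer(), range(size)))
```
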

Behavior

| `RENDERERS_POOL_INIT_WORKERS` | Resolved workers (size=32) |
| --- | --- |
| unset (default) | 1 (serial) |
| `4` | 4 |
| `16` | 8 (clamped to max 8) |
| `0`, `-2`, not-an-int | 1 (defensive fallback) |
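A minimal sketch of resolution logic that matches this table (the function name and signature are illustrative, not the PR's actual code):

```python
import os


def resolve_pool_init_workers(size, env=None):
    """Resolve the worker count for pool construction from the
    RENDERERS_POOL_INIT_WORKERS env var, clamped to [1, min(size, 8)]."""
    env = os.environ if env is None else env
    raw = env.get("RENDERERS_POOL_INIT_WORKERS")
    if raw is None:
        return 1  # default: serial construction
    try:
        requested = int(raw)
    except ValueError:
        return 1  # defensive fallback: not an integer
    if requested < 1:
        return 1  # defensive fallback: zero or negative
    return min(requested, size, 8)
```
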

Test plan

  • New unit tests in tests/test_pool_init_workers.py cover: default, env opt-in, clamped to size, clamped to 8, invalid/negative/zero fallback
  • Full suite: pytest tests/ → 1049 passed, 73 skipped, 1 xfailed
  • Behavioral confirmation against scale RL rollouts: the NotImplementedError race disappears with workers=1

Notes

  • This is a follow-up to [DIAGNOSTIC - DO NOT MERGE] fix(load_tokenizer): default to use_fast=True #26 (use_fast=True default). Both changes target the same class of init-time failures
  • I've kept min(size, 8) clamping for parallel mode since the GIL-bound Python portion of from_pretrained doesn't scale past ~8 workers
  • Behavior change for callers who relied on the previous parallel default: startup is ~10s slower per env-worker process. If that matters, set RENDERERS_POOL_INIT_WORKERS=8

Default ``RendererPool`` construction to serial (``workers=1``). In some
deployments, concurrent ``AutoTokenizer.from_pretrained`` calls during pool
build have surfaced an intermittent ``NotImplementedError`` from the
transformers Python tokenizer fallback path. The race is rare but
catastrophic: it poisons one slot of the pool, after which random rollouts
hit the bad slot and raise. Observed against PrimeIntellect/GLM-4.5-Air
under scale RL rollouts, even with use_fast=True (#26).

Trade-off: serial pool build of a 32-slot pool adds ~10-15s to startup vs
parallel. Pool build is one-time per env-worker, off the steady-state path.

Users who need fast startup can opt back into parallel construction via
``RENDERERS_POOL_INIT_WORKERS``. The resolved value is clamped to
``[1, min(size, 8)]``; invalid values fall back to 1.
@hallerite hallerite marked this pull request as draft May 13, 2026 15:19
@hallerite hallerite changed the title fix(pool): default to serial pool init; env opt-in for parallel [DIAGNOSTIC - DO NOT MERGE] fix(pool): default to serial pool init; env opt-in for parallel May 13, 2026
