[DIAGNOSTIC - DO NOT MERGE] fix(pool): default to serial pool init; env opt-in for parallel#29

Draft
hallerite wants to merge 1 commit into main from fix/pool-init-workers-env

Conversation

@hallerite
Member

Summary

  • Default RendererPool construction to serial (workers=1) to eliminate a rare-but-catastrophic race during concurrent AutoTokenizer.from_pretrained in pool build
  • Add RENDERERS_POOL_INIT_WORKERS env var to opt back into parallel construction (clamped to [1, min(size, 8)])
  • Unit tests for the env-var resolution logic

Why

During at-scale RL rollouts against PrimeIntellect/GLM-4.5-Air, we observed intermittent NotImplementedError raised from the transformers Python tokenizer fallback path (tokenization_python.PythonBackend.__init__ → _add_tokens → get_vocab) during RendererPool.__init__. The race is rare but catastrophic: it poisons one slot of the pool, after which any rollout that happens to check out that slot raises.

Forcing use_fast=True (#26) makes the failure mode loud (OSError if fast files are missing) and fixes the common case, but in our cluster we still saw intermittent failures during parallel pool construction. Serializing pool build (workers=1) eliminated them.

The cost of going serial is bounded:

  • Pool size 32, ~300 ms per from_pretrained call → ~10s of additional pool-build time on startup
  • Pool build runs once per env-worker process, off the steady-state rollout path
  • Steady-state cost is unchanged (the pool checks out renderers via a queue.Queue, which is independent of how the pool was populated)
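The last point can be illustrated with a minimal sketch of queue-based checkout. Class and method names here are illustrative, not the PR's actual code; the point is that steady-state behavior depends only on the populated queue, not on how it was filled:

```python
import queue
from contextlib import contextmanager


class RendererPoolSketch:
    """Illustrative pool: steady-state checkout goes through a queue.Queue,
    independent of whether the pool was built serially or in parallel."""

    def __init__(self, renderers):
        self._q = queue.Queue()
        for r in renderers:
            self._q.put(r)

    @contextmanager
    def checkout(self):
        r = self._q.get()   # blocks until a slot is free
        try:
            yield r
        finally:
            self._q.put(r)  # return the slot to the pool
```
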

Users who need fast startup can opt back into parallel construction via RENDERERS_POOL_INIT_WORKERS=N.
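The serial-default, parallel-opt-in build strategy can be sketched as follows (function names are hypothetical, not the PR's actual code):

```python
from concurrent.futures import ThreadPoolExecutor


def build_pool(size, make_renderer, workers=1):
    """Build `size` renderers. workers=1 (the default) constructs them
    serially, avoiding concurrent from_pretrained calls; workers > 1
    opts back into parallel construction."""
    if workers <= 1:
        return [make_renderer() for _ in range(size)]
    with ThreadPoolExecutor(max_workers=workers) as ex:
        return list(ex.map(lambda _: make_renderer(), range(size)))
```
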

Behavior

| `RENDERERS_POOL_INIT_WORKERS` | Resolved workers (size=32) |
| --- | --- |
| unset (default) | 1 (serial) |
| `4` | 4 |
| `16` | 8 (clamped to max 8) |
| `0`, `-2`, not-an-int | 1 (defensive fallback) |
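A minimal sketch of resolution logic that matches this table (the function name and signature are illustrative, not the PR's actual code):

```python
import os


def resolve_pool_init_workers(size, env=None):
    """Resolve the worker count for pool construction from the
    RENDERERS_POOL_INIT_WORKERS env var, clamped to [1, min(size, 8)]."""
    env = os.environ if env is None else env
    raw = env.get("RENDERERS_POOL_INIT_WORKERS")
    if raw is None:
        return 1  # default: serial construction
    try:
        requested = int(raw)
    except ValueError:
        return 1  # defensive fallback: not an integer
    if requested < 1:
        return 1  # defensive fallback: zero or negative
    return min(requested, size, 8)
```
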

Test plan

  • New unit tests in tests/test_pool_init_workers.py cover: default, env opt-in, clamped to size, clamped to 8, invalid/negative/zero fallback
  • Full suite: pytest tests/ → 1049 passed, 73 skipped, 1 xfailed
  • Behavioral confirmation against scale RL rollouts: the NotImplementedError race disappears with workers=1

Notes

  • This is a follow-up to [DIAGNOSTIC - DO NOT MERGE] fix(load_tokenizer): default to use_fast=True #26 (use_fast=True default). Both changes target the same class of init-time failures
  • I've kept min(size, 8) clamping for parallel mode since the GIL-bound Python portion of from_pretrained doesn't scale past ~8 workers
  • Behavior change for callers who relied on the previous parallel default: startup is ~10s slower per env-worker process. If that matters, set RENDERERS_POOL_INIT_WORKERS=8

Default ``RendererPool`` construction to serial (``workers=1``). In some
deployments, concurrent ``AutoTokenizer.from_pretrained`` calls during pool
build have surfaced an intermittent ``NotImplementedError`` from the
transformers Python tokenizer fallback path. The race is rare but
catastrophic: it poisons one slot of the pool, after which random rollouts
hit the bad slot and raise. Observed against PrimeIntellect/GLM-4.5-Air
under scale RL rollouts, even with use_fast=True (#26).

Trade-off: serial pool build of a 32-slot pool adds ~10-15s to startup vs
parallel. Pool build is one-time per env-worker, off the steady-state path.

Users who need fast startup can opt back into parallel construction via
``RENDERERS_POOL_INIT_WORKERS``. The resolved value is clamped to
``[1, min(size, 8)]``; invalid values fall back to 1.
@hallerite hallerite marked this pull request as draft May 13, 2026 15:19
@hallerite hallerite changed the title fix(pool): default to serial pool init; env opt-in for parallel [DIAGNOSTIC - DO NOT MERGE] fix(pool): default to serial pool init; env opt-in for parallel May 13, 2026
