[DIAGNOSTIC - DO NOT MERGE] fix(pool): default to serial pool init; env opt-in for parallel #29
Draft
hallerite wants to merge 1 commit into
Conversation
Default `RendererPool` construction to serial (`workers=1`). In some deployments, concurrent `AutoTokenizer.from_pretrained` calls during pool build have surfaced an intermittent `NotImplementedError` from the transformers Python tokenizer fallback path. The race is rare but catastrophic: it poisons one slot of the pool, after which any rollout that checks out the bad slot raises. Observed against PrimeIntellect/GLM-4.5-Air under scale RL rollouts, even with `use_fast=True` (#26). Trade-off: serial build of a 32-slot pool adds ~10-15s to startup versus parallel. Pool build happens once per env-worker, off the steady-state path. Users who need fast startup can opt back into parallel construction via `RENDERERS_POOL_INIT_WORKERS`. The resolved value is clamped to `[1, min(size, 8)]`; invalid values fall back to 1.
Summary
- Default `RendererPool` construction to serial (`workers=1`) to eliminate a rare-but-catastrophic race during concurrent `AutoTokenizer.from_pretrained` in pool build
- Add a `RENDERERS_POOL_INIT_WORKERS` env var to opt back into parallel construction (clamped to `[1, min(size, 8)]`)

Why
In scale RL rollouts against PrimeIntellect/GLM-4.5-Air, we observed intermittent `NotImplementedError` raised from the transformers Python tokenizer fallback path (`tokenization_python.PythonBackend.__init__ → _add_tokens → get_vocab`) during `RendererPool.__init__`. The race is rare but catastrophic: it poisons one slot of the pool, after which random rollouts that happen to check out that slot raise.

Forcing `use_fast=True` (#26) makes the failure mode loud (`OSError` if fast files are missing) and fixes the common case, but in our cluster we still saw intermittent failures during parallel pool construction. Serializing pool build (`workers=1`) eliminated them.

The cost of going serial is bounded:

- One `from_pretrained` call per slot → ~10s of additional pool-build time on startup
- Steady-state checkout/checkin goes through `queue.Queue`, which is independent of how the pool was populated

Users who need fast startup can opt back into parallel construction via `RENDERERS_POOL_INIT_WORKERS=N`.

Behavior
| `RENDERERS_POOL_INIT_WORKERS` | resolved workers |
| --- | --- |
| unset | 1 (serial default) |
| `4` | 4 |
| `16` | clamped to `min(size, 8)` |
| `0`, `-2`, `not-an-int` | 1 (invalid values fall back to serial) |

Test plan
- `tests/test_pool_init_workers.py` covers: default, env opt-in, clamped to `size`, clamped to 8, invalid/negative/zero fallback
- `pytest tests/` → 1049 passed, 73 skipped, 1 xfailed
- The `NotImplementedError` race disappears with `workers=1`

Notes
- Complements #26 (the `use_fast=True` default). Both changes target the same class of init-time failures
- Kept the `min(size, 8)` clamping for parallel mode, since the GIL-bound Python portion of `from_pretrained` doesn't scale past ~8 workers
- Example opt-in: `RENDERERS_POOL_INIT_WORKERS=8`
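For reference, the clamping rules summarized in the Behavior table can be illustrated with a small self-contained helper (the function name is hypothetical and only mirrors the documented behavior, not the PR's actual code):

```python
def resolve_workers(requested: str, size: int = 32) -> int:
    # Clamp to [1, min(size, 8)]; invalid or non-positive values fall back to 1.
    try:
        n = int(requested)
    except ValueError:
        return 1
    return min(n, size, 8) if n >= 1 else 1


print(resolve_workers("4"))           # 4
print(resolve_workers("16"))          # 8  (clamped: min(16, 32, 8))
print(resolve_workers("16", size=4))  # 4  (clamped to pool size)
print(resolve_workers("0"))           # 1
print(resolve_workers("-2"))          # 1
print(resolve_workers("not-an-int"))  # 1
```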