Refresh JWT on-demand to avoid 401 Invalid JWT on long-running requests#39
Open
yusudz wants to merge 1 commit intothinking-machines-lab:mainfrom
Open
Refresh JWT on-demand to avoid 401 Invalid JWT on long-running requests#39yusudz wants to merge 1 commit intothinking-machines-lab:mainfrom
yusudz wants to merge 1 commit intothinking-machines-lab:mainfrom
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Long-running Tinker operations (multi-minute rollouts polling
retrieve_future, long evals, telemetry batches on slow networks) die withAuthenticationError: 401 Invalid JWTeven whenTINKER_API_KEYis a long-livedtml-...credential.Three things in
_jwt_auth.pylikely combine to cause this:get_token()was passive — it returnedself._tokenunconditionally, so any missed/late background refresh leaked an expired JWT straight into outgoing requests._refresh_loop's sleep wasmax(_RETRY_DELAY_SECS=60, exp - now - 300). When the token had under a minute of life (e.g. a shadow-holder seed received near expiry, or after one transient refresh failure), the loop slept past expiry while requests kept using the cached token.retry_handler.is_retryable_status_codecovers408/409/429/5xx, so 401 is not retried — a single stale-JWT request is fatal._fetch()itself runs through a separate auth pool with the durable credential, so refresh always can succeed; the bug is in when we trigger it.Fix
get_token()is now proactive: refreshes when the cached JWT has ≤ 60s of runway, with double-checked locking so concurrent callers share one in-flight/auth/tokenrequest.get_token(), fall back to the cached token and log a warning. If the token really is dead the request surfaces a clean 401; in-flight requests on the same provider can survive transient refresh blips.max(60, ...)floor withmax(0, ...)so it refreshes immediately when already inside the refresh window. Add an explicitawait asyncio.sleep(_RETRY_DELAY_SECS)in the loop'sexcept Exceptionbranch so removing the floor doesn't tight-loop on persistent failures._refresh_lockaround the loop's_fetch()so the background loop and on-demand callers never fire two parallel/auth/tokenrequests.Test plan
pytest src/tinker/lib/_jwt_auth_test.py— 20 passed (13 existing + 7 new)pytest src/tinker/lib/internal_client_holder_test.py— 5 passed (only other consumer ofJwtAuthProvider)ruff checkclean on touched filespyrighton_jwt_auth.py: 19 → 12 errors (all preexistingUnknownpropagation from the untyped auth client; net reduction is from a newself._token: strannotation)New tests cover: cached fresh token is not refetched; near-expiry / already-expired / unparseable cached tokens trigger refetch; 5 concurrent
get_token()callers share one in-flight/auth/token;get_token()returns cached token + logs warning when refresh fails; returnsNonewhen there's no seed and refresh fails.Out of scope
retry_handler(defense-in-depth, but a deeper change to the retry/auth boundary; the proactiveget_token()should be sufficient on its own).