Refresh JWT on-demand to avoid 401 Invalid JWT on long-running requests#39

Open
yusudz wants to merge 1 commit into thinking-machines-lab:main from yusudz:yury/fix/jwt-on-demand-refresh
Conversation


@yusudz yusudz commented Apr 24, 2026

Summary

Long-running Tinker operations (multi-minute rollouts polling retrieve_future, long evals, telemetry batches on slow networks) die with AuthenticationError: 401 Invalid JWT even when TINKER_API_KEY is a long-lived tml-... credential.

Three things in _jwt_auth.py likely combine to cause this:

  1. get_token() was passive — it returned self._token unconditionally, so any missed/late background refresh leaked an expired JWT straight into outgoing requests.
  2. _refresh_loop's sleep was max(_RETRY_DELAY_SECS=60, exp - now - 300). When the token had under a minute of life (e.g. a shadow-holder seed received near expiry, or after one transient refresh failure), the loop slept past expiry while requests kept using the cached token.
  3. retry_handler.is_retryable_status_code covers 408/409/429/5xx, so 401 is never retried — a single stale-JWT request is fatal. _fetch() itself runs through a separate auth pool with the durable credential, so the refresh itself can always succeed; the bug is in when we trigger it.
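Point 2 is easiest to see with concrete numbers. A minimal sketch of the two sleep computations (constant names mirror the description above; the functions themselves are illustrative, not the actual `_jwt_auth.py` code):

```python
# Illustrative sketch of the refresh-loop sleep from point 2 above.
_RETRY_DELAY_SECS = 60      # old floor on the sleep
_REFRESH_MARGIN_SECS = 300  # refresh 5 minutes before expiry

def old_sleep(exp: float, now: float) -> float:
    # Old behavior: a 60s floor even when the token is about to expire.
    return max(_RETRY_DELAY_SECS, exp - now - _REFRESH_MARGIN_SECS)

def new_sleep(exp: float, now: float) -> float:
    # Fixed behavior: wake immediately once inside the refresh window.
    return max(0, exp - now - _REFRESH_MARGIN_SECS)

# Token with 45s of life left: the old loop sleeps 60s (past expiry at
# t+45s); the fixed loop refreshes immediately.
print(old_sleep(exp=1045.0, now=1000.0))  # → 60
print(new_sleep(exp=1045.0, now=1000.0))  # → 0
```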

Fix

  • get_token() is now proactive: refreshes when the cached JWT has ≤ 60s of runway, with double-checked locking so concurrent callers share one in-flight /auth/token request.
  • On refresh failure inside get_token(), fall back to the cached token and log a warning. If the token really is dead the request surfaces a clean 401; in-flight requests on the same provider can survive transient refresh blips.
  • Replace the loop's max(60, ...) floor with max(0, ...) so it refreshes immediately when already inside the refresh window. Add an explicit await asyncio.sleep(_RETRY_DELAY_SECS) in the loop's except Exception branch so removing the floor doesn't tight-loop on persistent failures.
  • Hold the same _refresh_lock around the loop's _fetch() so the background loop and on-demand callers never fire two parallel /auth/token requests.

Test plan

  • pytest src/tinker/lib/_jwt_auth_test.py — 20 passed (13 existing + 7 new)
  • pytest src/tinker/lib/internal_client_holder_test.py — 5 passed (only other consumer of JwtAuthProvider)
  • ruff check clean on touched files
  • pyright on _jwt_auth.py: 19 → 12 errors (all preexisting Unknown propagation from the untyped auth client; net reduction is from a new self._token: str annotation)

New tests cover:

  • cached fresh token is not refetched
  • near-expiry / already-expired / unparseable cached tokens trigger a refetch
  • 5 concurrent get_token() callers share one in-flight /auth/token request
  • get_token() returns the cached token and logs a warning when refresh fails
  • get_token() returns None when there is no seed and refresh fails
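For a sense of the test shape, a hypothetical pytest-style sketch of two of these cases (the `_StubProvider` class, test names, and 60s runway constant are illustrative, not the actual `_jwt_auth_test.py` code):

```python
import time

class _StubProvider:
    """Tiny synchronous stand-in for the provider, for illustration only."""
    RUNWAY = 60

    def __init__(self, fetch):
        self._fetch = fetch
        self._token, self._exp = None, 0.0

    def get_token(self):
        if self._token is None or self._exp - time.time() <= self.RUNWAY:
            self._token, self._exp = self._fetch()
        return self._token

def test_near_expiry_cached_token_is_refetched():
    calls = []
    def fetch():
        calls.append(1)
        return "fresh", time.time() + 3600
    p = _StubProvider(fetch)
    # Seed a cached token with only 30s of runway (inside the 60s window).
    p._token, p._exp = "stale", time.time() + 30
    assert p.get_token() == "fresh"
    assert len(calls) == 1

def test_fresh_cached_token_is_not_refetched():
    calls = []
    def fetch():
        calls.append(1)
        return "fresh", time.time() + 3600
    p = _StubProvider(fetch)
    p._token, p._exp = "cached", time.time() + 3600
    assert p.get_token() == "cached"
    assert calls == []
```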

Out of scope

  • Adding 401 → force-refresh-and-retry to retry_handler (defense-in-depth, but a deeper change to the retry/auth boundary; the proactive get_token() should be sufficient on its own).
  • Server-side JWT TTL / minting changes.
