fix(claude-auth): dedupe OAuth refresh and honor 429 backoff#2568
fix(claude-auth): dedupe OAuth refresh and honor 429 backoff#2568codeg-dev wants to merge 2 commits intorouter-for-me:mainfrom
Conversation
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent) Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent) Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
There was a problem hiding this comment.
Code Review
This pull request enhances the Claude token refresh logic by implementing request deduplication using singleflight and a backoff mechanism for rate-limited (429) responses. It also introduces structured error handling to distinguish between retryable and non-retryable failures. Feedback includes addressing a potential data race in the state reset function, preventing a memory leak in the refresh block map, improving context handling in the single-flight logic, and refining error reporting during retries.
| func resetClaudeRefreshState() { | ||
| claudeRefreshMu.Lock() | ||
| defer claudeRefreshMu.Unlock() | ||
| claudeRefreshBlock = make(map[string]time.Time) | ||
| claudeRefreshGroup = singleflight.Group{} | ||
| } |
There was a problem hiding this comment.
Reassigning the global claudeRefreshGroup in resetClaudeRefreshState while other goroutines might be calling Do on it in RefreshTokens creates a data race. singleflight.Group is a struct containing internal state (maps and mutexes), and reinitializing it without synchronization is unsafe. Since this function appears to be used for testing, consider using a pointer for the group and protecting its access/reassignment with a mutex, or avoiding the reset entirely by using unique keys in tests.
| var ( | ||
| claudeRefreshGroup singleflight.Group | ||
| claudeRefreshMu sync.Mutex | ||
| claudeRefreshBlock = make(map[string]time.Time) |
There was a problem hiding this comment.
The claudeRefreshBlock map grows indefinitely as it stores a block timestamp for every refresh token that encounters a 429 error. There is no mechanism to prune expired entries, and entries are only removed on a successful refresh. In a long-running proxy handling many unique tokens, this will lead to a gradual memory leak. Consider using a TTL-based cache or implementing a periodic cleanup routine to remove entries where the block time has passed.
| result, err, _ := claudeRefreshGroup.Do(refreshToken, func() (interface{}, error) { | ||
| return o.refreshTokensSingleFlight(context.WithoutCancel(ctx), refreshToken) | ||
| }) | ||
| if err != nil { | ||
| return nil, err | ||
| } | ||
| tokenData, ok := result.(*ClaudeTokenData) | ||
| if !ok || tokenData == nil { | ||
| return nil, fmt.Errorf("token refresh failed: invalid single-flight result") | ||
| } | ||
| return tokenData, nil |
There was a problem hiding this comment.
singleflight.Do blocks the calling goroutine until the underlying function returns, even if the caller's context is canceled. In a high-concurrency environment, this can lead to goroutine accumulation if upstream refreshes are slow and clients disconnect. It is better to use DoChan and a select statement to respect ctx.Done(), while allowing the refresh to continue in the background via context.WithoutCancel.
ch := claudeRefreshGroup.DoChan(refreshToken, func() (interface{}, error) {
return o.refreshTokensSingleFlight(context.WithoutCancel(ctx), refreshToken)
})
select {
case <-ctx.Done():
return nil, ctx.Err()
case res := <-ch:
if res.Err != nil {
return nil, res.Err
}
tokenData, ok := res.Val.(*ClaudeTokenData)
if !ok || tokenData == nil {
return nil, fmt.Errorf("token refresh failed: invalid single-flight result")
}
return tokenData, nil
}| if !isClaudeRefreshRetryable(err) { | ||
| break | ||
| } | ||
| } | ||
|
|
||
| return nil, fmt.Errorf("token refresh failed after %d attempts: %w", maxRetries, lastErr) |
There was a problem hiding this comment.
When a non-retryable error occurs (like a 401 or a 429 block), the loop breaks immediately, but the final error message still claims the failure happened 'after %d attempts' (using the maximum retry count). This is misleading to the user. It would be better to return the error immediately if it's not retryable, or accurately report the number of attempts made.
| if !isClaudeRefreshRetryable(err) { | |
| break | |
| } | |
| } | |
| return nil, fmt.Errorf("token refresh failed after %d attempts: %w", maxRetries, lastErr) | |
| if !isClaudeRefreshRetryable(err) { | |
| return nil, err | |
| } | |
| } | |
| return nil, fmt.Errorf("token refresh failed after %d attempts: %w", maxRetries, lastErr) |
Summary
singleflightRetry-Afterfor refresh endpoint429responses and block immediate replaysValidation
go test ./internal/auth/claudego test ./internal/runtime/executor -run TestApplyClaudeHeadersFixes #2567.
AI authorship: prepared with OpenCode / Sisyphus assistance.