Skip to content

fix(claude-auth): dedupe OAuth refresh and honor 429 backoff#2568

Open
codeg-dev wants to merge 2 commits intorouter-for-me:mainfrom
codeg-dev:fix/claude-refresh-429-backoff
Open

fix(claude-auth): dedupe OAuth refresh and honor 429 backoff#2568
codeg-dev wants to merge 2 commits intorouter-for-me:mainfrom
codeg-dev:fix/claude-refresh-429-backoff

Conversation

@codeg-dev
Copy link
Copy Markdown

Summary

  • dedupe concurrent Claude OAuth refresh requests per refresh token with singleflight
  • honor Retry-After for refresh endpoint 429 responses and block immediate replays
  • route the Claude executor through the retry-aware refresh path and add Claude auth regression coverage

Validation

  • go test ./internal/auth/claude
  • go test ./internal/runtime/executor -run TestApplyClaudeHeaders
  • verified the patched Linux binary on a live CLIProxyAPI deployment serving Claude requests successfully

Fixes #2567.

AI authorship: prepared with OpenCode / Sisyphus assistance.

codeg-dev and others added 2 commits April 6, 2026 17:16
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request enhances the Claude token refresh logic by implementing request deduplication using singleflight and a backoff mechanism for rate-limited (429) responses. It also introduces structured error handling to distinguish between retryable and non-retryable failures. Feedback includes addressing a potential data race in the state reset function, preventing a memory leak in the refresh block map, improving context handling in the single-flight logic, and refining error reporting during retries.

Comment on lines +54 to +59
func resetClaudeRefreshState() {
claudeRefreshMu.Lock()
defer claudeRefreshMu.Unlock()
claudeRefreshBlock = make(map[string]time.Time)
claudeRefreshGroup = singleflight.Group{}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Reassigning the global claudeRefreshGroup in resetClaudeRefreshState while other goroutines might be calling Do on it in RefreshTokens creates a data race. singleflight.Group is a struct containing internal state (maps and mutexes), and reinitializing it without synchronization is unsafe. Since this function appears to be used for testing, consider using a pointer for the group and protecting its access/reassignment with a mutex, or avoiding the reset entirely by using unique keys in tests.

var (
claudeRefreshGroup singleflight.Group
claudeRefreshMu sync.Mutex
claudeRefreshBlock = make(map[string]time.Time)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The claudeRefreshBlock map grows indefinitely as it stores a block timestamp for every refresh token that encounters a 429 error. There is no mechanism to prune expired entries, and entries are only removed on a successful refresh. In a long-running proxy handling many unique tokens, this will lead to a gradual memory leak. Consider using a TTL-based cache or implementing a periodic cleanup routine to remove entries where the block time has passed.

Comment on lines +322 to +332
result, err, _ := claudeRefreshGroup.Do(refreshToken, func() (interface{}, error) {
return o.refreshTokensSingleFlight(context.WithoutCancel(ctx), refreshToken)
})
if err != nil {
return nil, err
}
tokenData, ok := result.(*ClaudeTokenData)
if !ok || tokenData == nil {
return nil, fmt.Errorf("token refresh failed: invalid single-flight result")
}
return tokenData, nil
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

singleflight.Do blocks the calling goroutine until the underlying function returns, even if the caller's context is canceled. In a high-concurrency environment, this can lead to goroutine accumulation if upstream refreshes are slow and clients disconnect. It is better to use DoChan and a select statement to respect ctx.Done(), while allowing the refresh to continue in the background via context.WithoutCancel.

	ch := claudeRefreshGroup.DoChan(refreshToken, func() (interface{}, error) {
		return o.refreshTokensSingleFlight(context.WithoutCancel(ctx), refreshToken)
	})

	select {
	case <-ctx.Done():
		return nil, ctx.Err()
	case res := <-ch:
		if res.Err != nil {
			return nil, res.Err
		}
		tokenData, ok := res.Val.(*ClaudeTokenData)
		if !ok || tokenData == nil {
			return nil, fmt.Errorf("token refresh failed: invalid single-flight result")
		}
		return tokenData, nil
	}

Comment on lines +461 to 466
if !isClaudeRefreshRetryable(err) {
break
}
}

return nil, fmt.Errorf("token refresh failed after %d attempts: %w", maxRetries, lastErr)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

When a non-retryable error occurs (like a 401 or a 429 block), the loop breaks immediately, but the final error message still claims the failure happened 'after %d attempts' (using the maximum retry count). This is misleading to the user. It would be better to return the error immediately if it's not retryable, or accurately report the number of attempts made.

Suggested change
if !isClaudeRefreshRetryable(err) {
break
}
}
return nil, fmt.Errorf("token refresh failed after %d attempts: %w", maxRetries, lastErr)
if !isClaudeRefreshRetryable(err) {
return nil, err
}
}
return nil, fmt.Errorf("token refresh failed after %d attempts: %w", maxRetries, lastErr)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Claude OAuth refresh can stampede and replay 429s

1 participant