fix: fail over codex accounts after retryable 2xx outcomes #223
Merged
Summary
This change fixes unpinned Codex account failover so account switching is decided from the final proxy AttemptOutcome, not only the first raw upstream HTTP status.

Previously, the auto-selected Codex account path would switch to the next account when the initial send failed or when the raw upstream response was non-2xx, but it stopped account failover as soon as the upstream returned a raw 2xx response. That meant retryable "fake success" cases discovered later in proxy post-processing, such as an empty chat completion after Codex-to-chat transformation, were returned as a final 502 instead of continuing to the next available Codex account.
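The gap between the two decision points can be sketched as follows. This is a minimal illustration, not the proxy's actual code: `RawResponse`, `finalize`, both `should_fail_over_*` helpers, and the payloads on `AttemptOutcome` are hypothetical stand-ins, and treating every non-2xx status as retryable is a simplifying assumption.

```rust
// Hypothetical stand-ins; the real proxy types are not shown in this PR text.
#[derive(Debug, PartialEq)]
enum AttemptOutcome {
    Success(String),
    Retryable(u16),
}

struct RawResponse {
    status: u16,
    completion_text: String,
}

/// Raw-status check: sees only the upstream HTTP status.
fn should_fail_over_from_raw_status(raw: &RawResponse) -> bool {
    !(200..300).contains(&raw.status)
}

/// Stand-in for post-processing: an empty completion on a 2xx response
/// is reclassified as a retryable 502 (assumed behavior, simplified).
fn finalize(raw: RawResponse) -> AttemptOutcome {
    if !(200..300).contains(&raw.status) {
        return AttemptOutcome::Retryable(raw.status);
    }
    if raw.completion_text.trim().is_empty() {
        return AttemptOutcome::Retryable(502);
    }
    AttemptOutcome::Success(raw.completion_text)
}

/// Final-outcome check: sees what post-processing concluded.
fn should_fail_over_from_outcome(outcome: &AttemptOutcome) -> bool {
    matches!(outcome, AttemptOutcome::Retryable(_))
}

fn main() {
    let raw = RawResponse { status: 200, completion_text: String::new() };
    let old = should_fail_over_from_raw_status(&raw);
    let outcome = finalize(raw);
    let new = should_fail_over_from_outcome(&outcome);
    // prints "raw-status check fails over: false; final-outcome check fails over: true"
    println!("raw-status check fails over: {old}; final-outcome check fails over: {new}");
}
```

On an empty HTTP 200 response the raw-status check sees nothing wrong, while the final-outcome check sees a retryable result and moves on to the next account.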
This PR makes the unpinned Codex account pool behave like the intended account-pool policy:
Success (non-2xx) Retryable Fatal and SkippedAuth

Root Cause
The old Codex failover path in retry_with_next_codex_account(...) decided whether to advance to the next account before finalize_attempt(...) ran. As a result, it only had access to the raw reqwest::Response and could not see retryable outcomes that are produced later by proxy response handling.

Fix
The failover loop now finalizes each attempted Codex account into a real AttemptOutcome first, then decides whether to continue account failover from that final outcome. This keeps upstream ordering unchanged and only changes the semantics inside the unpinned Codex account pool.

A regression test was added for the missing case: the first Codex account returns HTTP 200 but transforms into an empty chat completion, and the proxy must fail over to the next Codex account instead of returning 502 immediately.
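The finalize-then-decide loop described above might look roughly like this. It is a sketch under stated assumptions, not the code in this PR: `fail_over_accounts`, the `attempt` closure, and the account names are hypothetical, and only the Success and Retryable outcomes are modeled because the handling of Fatal and SkippedAuth is not spelled out in this description.

```rust
// Hypothetical stand-in; only two of the outcome kinds are modeled here.
#[derive(Debug, PartialEq)]
enum AttemptOutcome {
    Success(String),
    Retryable(u16),
}

/// Walks the account pool in order. `attempt` is assumed to send the request
/// AND run response post-processing, so it returns the final outcome rather
/// than a raw HTTP response.
fn fail_over_accounts<F>(accounts: &[&str], mut attempt: F) -> Option<AttemptOutcome>
where
    F: FnMut(&str) -> AttemptOutcome,
{
    let mut last = None;
    for &account in accounts {
        let outcome = attempt(account);
        match outcome {
            // A finalized success ends the loop.
            AttemptOutcome::Success(_) => return Some(outcome),
            // A retryable outcome, even one that began as HTTP 200,
            // advances to the next account.
            AttemptOutcome::Retryable(_) => last = Some(outcome),
        }
    }
    // Pool exhausted: surface the last retryable outcome (None if the pool was empty).
    last
}

fn main() {
    let accounts = ["account-a", "account-b"];
    let mut tried: Vec<String> = Vec::new();
    let outcome = fail_over_accounts(&accounts, |account| {
        tried.push(account.to_string());
        if account == "account-a" {
            // Stands in for "HTTP 200, but empty after transformation".
            AttemptOutcome::Retryable(502)
        } else {
            AttemptOutcome::Success("hello".to_string())
        }
    });
    println!("{:?} after trying {:?}", outcome, tried);
}
```

The usage in `main` mirrors the shape of the regression scenario: the first account yields a retryable outcome, so the second account is tried and its success is returned.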
Validation
cargo test -p token_proxy_core chat_request_failovers_to_next_codex_account_after_empty_2xx_response -- --nocapture
cargo test -p token_proxy_core codex_account -- --nocapture
cargo test -p token_proxy_core messages_request_failovers_to_next_kiro_account_before_next_upstream -- --nocapture
cargo fmt --all --check