fix(channel): reselect used channels when all exhausted in retry#10

Closed
oldjs wants to merge 1 commit into mxyhi:main from oldjs:main

Conversation


oldjs commented Apr 26, 2026

Symptom

The following error shows up intermittently in production (especially during dense bursts of upstream 429 rate limiting):

分组 default 下模型 deepseek-ai/DeepSeek-V4-Pro 的可用渠道不存在(retry) ("no available channel for model deepseek-ai/DeepSeek-V4-Pro in group default (retry)")

The group has multiple available channels and RetryTimes=50 is configured, yet requests fail outright while plenty of retry budget remains, with no failover.

Root cause

This fork rewrote upstream's GetRandomSatisfiedChannel (which indexes into a priority tier by retry number and does weighted-random selection within the tier) as GetNextSatisfiedChannel plus an SWRR scheduler, strictly excluding already-used channels via ContextKeyUsedChannels.

Call chain:

  1. controller/relay.go's main loop adds the current channel to ContextKeyUsedChannels on every failure
  2. service.CacheGetNextSatisfiedChannel → model.GetNextSatisfiedChannel passes that set down as excludedChannelIDs
  3. model.selectNextChannelFromBuckets → channelSchedulerState.next(excludedChannelIDs) skips every excluded channel

When every channel in some priority tier gets flagged with a 429 at the same time (exactly what a burst of rate limiting produces):

  • after a few retries, excludedChannelIDs covers every channel in that tier
  • next() returns 0 for every priority
  • selectNextChannelFromBuckets walks all priorities and returns (nil, nil)
  • controller/relay.go logs 分组 X 下模型 Y 的可用渠道不存在(retry) ("no available channel for model Y in group X (retry)") and breaks, discarding the remaining retry budget
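
The exhaustion path above can be sketched as follows (hypothetical simplified code; the fork's real scheduler uses SWRR weighting and richer types, this only models the exclusion mechanics):

```go
package main

import "fmt"

// next returns the first channel in the bucket that is not excluded, or 0
// when the whole tier is excluded. (Stand-in for channelSchedulerState.next;
// the real code does SWRR weighting, a linear scan suffices here.)
func next(bucket []int, excluded map[int]bool) int {
	for _, id := range bucket {
		if !excluded[id] {
			return id
		}
	}
	return 0 // tier exhausted
}

// selectFromBuckets walks tiers in priority order; once excluded covers
// every channel in every tier, it reports no channel at all.
func selectFromBuckets(buckets [][]int, excluded map[int]bool) (int, bool) {
	for _, b := range buckets {
		if id := next(b, excluded); id != 0 {
			return id, true
		}
	}
	return 0, false // the (nil, nil) case: relay loop gives up with retry budget left
}

func main() {
	buckets := [][]int{{1, 2}, {3}}                    // two priority tiers
	allUsed := map[int]bool{1: true, 2: true, 3: true} // a 429 burst marked every channel
	_, ok := selectFromBuckets(buckets, allUsed)
	fmt.Println(ok) // false
}
```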

Compare with upstream: once retry >= len(uniquePriorities), upstream clamps to the lowest priority and keeps doing weighted-random selection there, allowing already-tried channels to be picked again, so upstream can spend the full RetryTimes budget. A 429 is transient and can clear within a few hundred milliseconds, so reselecting an already-used channel often succeeds on the second pass.
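
A minimal sketch of that upstream clamping idea (simplified names, with a deterministic round-robin stand-in for upstream's weighted-random picker):

```go
package main

import "fmt"

// pickForRetry: the retry counter indexes the priority tiers and clamps to
// the lowest tier once it runs past the end, so late retries keep drawing
// from that tier without excluding already-tried channels.
func pickForRetry(tiers [][]int, retry int) int {
	idx := retry
	if idx >= len(tiers) {
		idx = len(tiers) - 1 // clamp: every later retry lands on the lowest tier
	}
	tier := tiers[idx]
	return tier[retry%len(tier)] // round-robin stand-in for weighted-random
}

func main() {
	tiers := [][]int{{1, 2}, {3}}
	for retry := 0; retry < 4; retry++ {
		fmt.Println(pickForRetry(tiers, retry)) // retries 1+ keep reselecting channel 3
	}
}
```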

Fix

In CacheGetNextSatisfiedChannel in service/channel_select.go, on the non-auto path: when the strict-exclusion call returns (nil, nil) and used is non-empty, call GetNextSatisfiedChannel(group, model, nil) once more, without the exclusion set.

This follows the spirit of upstream: allow reselection once every channel has been exhausted. The RetryTimes ceiling in controller/relay.go's main loop still applies, so this cannot loop forever.
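
The shape of the fix can be sketched like this (hypothetical simplified signatures; the real code lives in service/channel_select.go and returns *model.Channel rather than an int):

```go
package main

import "fmt"

// channels stands in for the group/model channel list held by the cache.
var channels = []int{1, 2, 3}

// getNext mimics model.GetNextSatisfiedChannel's strict exclusion.
func getNext(excluded map[int]bool) (int, bool) {
	for _, id := range channels {
		if !excluded[id] {
			return id, true
		}
	}
	return 0, false
}

// cacheGetNext adds the one-shot fallback: retry once without the exclusion
// set, but only when channels were actually used before.
func cacheGetNext(used map[int]bool) (int, bool) {
	if id, ok := getNext(used); ok {
		return id, true // some channel still unused: exclusion behavior unchanged
	}
	if len(used) > 0 {
		return getNext(nil) // all exhausted: allow reselection, as upstream does
	}
	return 0, false // group truly has zero channels: stay nil, no loop
}

func main() {
	allUsed := map[int]bool{1: true, 2: true, 3: true}
	id, ok := cacheGetNext(allUsed)
	fmt.Println(id, ok) // 1 true: a used channel is reselected instead of failing
}
```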

ContextKeyUsedChannels is deliberately not cleared: it is still consumed by controller/relay.go's retry log ("重试: A->B->A") and by the admin info field (use_channel); clearing it would lose the failure-path information.

The auto-group path is untouched. It has its own group-hopping fallback semantics that need separate consideration (the reported issue is not on the auto path).

Safety checks

| Check | Conclusion |
| --- | --- |
| No infinite loop | The fallback is a single synchronous second call (at most once), with no recursion and no goroutines. The main loop is still bounded by RetryTimes. |
| Auto path unaffected | The change sits strictly inside the else branch (param.TokenGroup != "auto"); the auto branch is untouched. |
| Exclusion unchanged while channels remain | The first call returns immediately on a non-nil channel, skipping the fallback, guarded by the double condition if channel == nil && len(usedChannelIDs) > 0. |
| No panic with zero channels | With an empty usedChannelIDs the fallback is never entered; with channels used but no buckets, the second call also returns nil, which is returned safely. |

Tests

Four new tests in service/channel_select_test.go:

  • TestCacheGetNextSatisfiedChannelNonAutoFallbackWhenAllChannelsExhausted: the core fix; when every channel has been used, the fallback reselects a used channel
  • TestCacheGetNextSatisfiedChannelNonAutoFirstCallNoUsedChannels: a first call with no used channels selects normally and does not trigger the fallback
  • TestCacheGetNextSatisfiedChannelNonAutoNoChannelsAtAll: zero channels returns nil without looping
  • TestCacheGetNextSatisfiedChannelNonAutoExcludesUsedWhenOthersAvailable: while some channels remain available, used channels are still excluded

go test ./... passes in full (the root-package web/dist embed failure is pre-existing and unrelated to this PR).

Scope of impact

  • Only affects retry/failover behavior for channels in non-auto groups
  • Zero change to the normal selection path (first selection, channels available)
  • Only relaxes the exclusion in the boundary case where every channel has failed, restoring upstream's "reselect used channels" semantics

…roup

Root cause: this fork rewrote upstream's GetRandomSatisfiedChannel into
GetNextSatisfiedChannel + SWRR scheduler that strictly excludes already-tried
channels via ContextKeyUsedChannels. When all channels in a group fail (e.g.
multiple upstreams hit by 429 in a burst), excludedChannelIDs ends up
covering every channel in every priority bucket, scheduler.next() returns 0
for each priority, selectNextChannelFromBuckets returns (nil, nil), and the
relay loop reports "分组 X 下模型 Y 的可用渠道不存在(retry)" - even when
RetryTimes=50 still has plenty of budget left.

Upstream's behavior is different: retry index drives priority selection and
clamps to the last priority once exceeded, so within that lowest priority
the random weighted picker keeps choosing channels (including already-tried
ones). 429 is transient; reselecting a previously-failed channel a few
hundred ms later often succeeds.

Fix: in CacheGetNextSatisfiedChannel's non-auto path, when the strict-
exclusion call returns (nil, nil) and used channels exist, retry once
without the exclusion. The relay loop's RetryTimes ceiling still bounds
total attempts, so this cannot loop forever. ContextKeyUsedChannels itself
is preserved for logging ("重试: A->B->A") and admin info dumps.

Auto-group path is intentionally left unchanged - it has its own group-
hopping fallback semantics that need separate consideration.

Tests:
- All channels exhausted -> fallback reselects one of the used channels.
- First call with no used channels -> normal selection, no fallback.
- Group has zero matching channels -> returns nil, no infinite loop.
- Some channels still available -> still excludes used as before.

Note: previous commit f7f78c20 ("prioritize 429 in shouldRetry") was based
on a misread of the root cause - shouldRetry was already returning true for
upstream 429 in default config. That change still hardens shouldRetry for
future affinity configurations and tightens skipRetry handling, so it stays.
This commit is the actual fix for the user-reported "no channel available"
issue.

oldjs commented Apr 26, 2026

Re-submitted as a clean PR from a dedicated branch (the fork's main will later gain a self-use CI workflow, which would pollute this PR's diff); the new PR has identical content.

oldjs closed this Apr 26, 2026