fix(channel): reselect used channels when all exhausted in retry#10

Closed
oldjs wants to merge 1 commit into mxyhi:main from oldjs:main

Conversation


oldjs commented Apr 26, 2026

Symptom

The following error shows up intermittently in production (especially during dense bursts of upstream 429 rate limiting):

分组 default 下模型 deepseek-ai/DeepSeek-V4-Pro 的可用渠道不存在(retry) ("no available channel for model deepseek-ai/DeepSeek-V4-Pro in group default (retry)")

The group has multiple available channels and RetryTimes=50 is configured, yet requests fail outright while plenty of retry budget remains, with no failover.

Root cause

This fork rewrote upstream's GetRandomSatisfiedChannel (which indexes into a priority tier by retry number and does weighted-random selection within the tier) as GetNextSatisfiedChannel plus an SWRR scheduler, strictly excluding already-used channels via ContextKeyUsedChannels.

Call chain:

  1. controller/relay.go's main loop adds the current channel to ContextKeyUsedChannels on every failure
  2. service.CacheGetNextSatisfiedChannel → model.GetNextSatisfiedChannel passes that set down as excludedChannelIDs
  3. model.selectNextChannelFromBuckets → channelSchedulerState.next(excludedChannelIDs) skips every excluded channel

When every channel in some priority tier gets flagged with a 429 at the same time (exactly what a burst of rate limiting produces):

  • after a few retries, excludedChannelIDs covers every channel in that tier
  • next() returns 0 for every priority
  • selectNextChannelFromBuckets walks all priorities and returns (nil, nil)
  • controller/relay.go logs 分组 X 下模型 Y 的可用渠道不存在(retry) ("no available channel for model Y in group X (retry)") and breaks, discarding the remaining retry budget
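
The exhaustion path above can be sketched as follows (hypothetical simplified code; the fork's real scheduler uses SWRR weighting and richer types, this only models the exclusion mechanics):

```go
package main

import "fmt"

// next returns the first channel in the bucket that is not excluded, or 0
// when the whole tier is excluded. (Stand-in for channelSchedulerState.next;
// the real code does SWRR weighting, a linear scan suffices here.)
func next(bucket []int, excluded map[int]bool) int {
	for _, id := range bucket {
		if !excluded[id] {
			return id
		}
	}
	return 0 // tier exhausted
}

// selectFromBuckets walks tiers in priority order; once excluded covers
// every channel in every tier, it reports no channel at all.
func selectFromBuckets(buckets [][]int, excluded map[int]bool) (int, bool) {
	for _, b := range buckets {
		if id := next(b, excluded); id != 0 {
			return id, true
		}
	}
	return 0, false // the (nil, nil) case: relay loop gives up with retry budget left
}

func main() {
	buckets := [][]int{{1, 2}, {3}}                    // two priority tiers
	allUsed := map[int]bool{1: true, 2: true, 3: true} // a 429 burst marked every channel
	_, ok := selectFromBuckets(buckets, allUsed)
	fmt.Println(ok) // false
}
```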

Compare with upstream: once retry >= len(uniquePriorities), upstream clamps to the lowest priority and keeps doing weighted-random selection there, allowing already-tried channels to be picked again, so upstream can spend the full RetryTimes budget. A 429 is transient and can clear within a few hundred milliseconds, so reselecting an already-used channel often succeeds on the second pass.
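
A minimal sketch of that upstream clamping idea (simplified names, with a deterministic round-robin stand-in for upstream's weighted-random picker):

```go
package main

import "fmt"

// pickForRetry: the retry counter indexes the priority tiers and clamps to
// the lowest tier once it runs past the end, so late retries keep drawing
// from that tier without excluding already-tried channels.
func pickForRetry(tiers [][]int, retry int) int {
	idx := retry
	if idx >= len(tiers) {
		idx = len(tiers) - 1 // clamp: every later retry lands on the lowest tier
	}
	tier := tiers[idx]
	return tier[retry%len(tier)] // round-robin stand-in for weighted-random
}

func main() {
	tiers := [][]int{{1, 2}, {3}}
	for retry := 0; retry < 4; retry++ {
		fmt.Println(pickForRetry(tiers, retry)) // retries 1+ keep reselecting channel 3
	}
}
```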

Fix

In CacheGetNextSatisfiedChannel in service/channel_select.go, on the non-auto path: when the strict-exclusion call returns (nil, nil) and used is non-empty, call GetNextSatisfiedChannel(group, model, nil) once more, without the exclusion set.

This follows the spirit of upstream: allow reselection once every channel has been exhausted. The RetryTimes ceiling in controller/relay.go's main loop still applies, so this cannot loop forever.
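
The shape of the fix can be sketched like this (hypothetical simplified signatures; the real code lives in service/channel_select.go and returns *model.Channel rather than an int):

```go
package main

import "fmt"

// channels stands in for the group/model channel list held by the cache.
var channels = []int{1, 2, 3}

// getNext mimics model.GetNextSatisfiedChannel's strict exclusion.
func getNext(excluded map[int]bool) (int, bool) {
	for _, id := range channels {
		if !excluded[id] {
			return id, true
		}
	}
	return 0, false
}

// cacheGetNext adds the one-shot fallback: retry once without the exclusion
// set, but only when channels were actually used before.
func cacheGetNext(used map[int]bool) (int, bool) {
	if id, ok := getNext(used); ok {
		return id, true // some channel still unused: exclusion behavior unchanged
	}
	if len(used) > 0 {
		return getNext(nil) // all exhausted: allow reselection, as upstream does
	}
	return 0, false // group truly has zero channels: stay nil, no loop
}

func main() {
	allUsed := map[int]bool{1: true, 2: true, 3: true}
	id, ok := cacheGetNext(allUsed)
	fmt.Println(id, ok) // 1 true: a used channel is reselected instead of failing
}
```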

ContextKeyUsedChannels is deliberately not cleared: it is still consumed by controller/relay.go's retry log ("重试: A->B->A") and by the admin info field (use_channel); clearing it would lose the failure-path information.

The auto-group path is untouched. It has its own group-hopping fallback semantics that need separate consideration (the reported issue is not on the auto path).

Safety checks

| Check | Conclusion |
| --- | --- |
| No infinite loop | The fallback is a single synchronous second call (at most once), with no recursion and no goroutines. The main loop is still bounded by RetryTimes. |
| Auto path unaffected | The change sits strictly inside the else branch (param.TokenGroup != "auto"); the auto branch is untouched. |
| Exclusion unchanged while channels remain | The first call returns immediately on a non-nil channel, skipping the fallback, guarded by the double condition if channel == nil && len(usedChannelIDs) > 0. |
| No panic with zero channels | With an empty usedChannelIDs the fallback is never entered; with channels used but no buckets, the second call also returns nil, which is returned safely. |

Tests

Four new tests in service/channel_select_test.go:

  • TestCacheGetNextSatisfiedChannelNonAutoFallbackWhenAllChannelsExhausted: the core fix; when every channel has been used, the fallback reselects a used channel
  • TestCacheGetNextSatisfiedChannelNonAutoFirstCallNoUsedChannels: a first call with no used channels selects normally and does not trigger the fallback
  • TestCacheGetNextSatisfiedChannelNonAutoNoChannelsAtAll: zero channels returns nil without looping
  • TestCacheGetNextSatisfiedChannelNonAutoExcludesUsedWhenOthersAvailable: while some channels remain available, used channels are still excluded

go test ./... passes in full (the root-package web/dist embed failure is pre-existing and unrelated to this PR).

Scope of impact

  • Only affects retry/failover behavior for channels in non-auto groups
  • Zero change to the normal selection path (first selection, channels available)
  • Only relaxes the exclusion in the boundary case where every channel has failed, restoring upstream's "reselect used channels" semantics

…roup

Root cause: this fork rewrote upstream's GetRandomSatisfiedChannel into
GetNextSatisfiedChannel + SWRR scheduler that strictly excludes already-tried
channels via ContextKeyUsedChannels. When all channels in a group fail (e.g.
multiple upstreams hit by 429 in a burst), excludedChannelIDs ends up
covering every channel in every priority bucket, scheduler.next() returns 0
for each priority, selectNextChannelFromBuckets returns (nil, nil), and the
relay loop reports "分组 X 下模型 Y 的可用渠道不存在(retry)" - even when
RetryTimes=50 still has plenty of budget left.

Upstream's behavior is different: retry index drives priority selection and
clamps to the last priority once exceeded, so within that lowest priority
the random weighted picker keeps choosing channels (including already-tried
ones). 429 is transient; reselecting a previously-failed channel a few
hundred ms later often succeeds.

Fix: in CacheGetNextSatisfiedChannel's non-auto path, when the strict-
exclusion call returns (nil, nil) and used channels exist, retry once
without the exclusion. The relay loop's RetryTimes ceiling still bounds
total attempts, so this cannot loop forever. ContextKeyUsedChannels itself
is preserved for logging ("重试: A->B->A") and admin info dumps.

Auto-group path is intentionally left unchanged - it has its own group-
hopping fallback semantics that need separate consideration.

Tests:
- All channels exhausted -> fallback reselects one of the used channels.
- First call with no used channels -> normal selection, no fallback.
- Group has zero matching channels -> returns nil, no infinite loop.
- Some channels still available -> still excludes used as before.

Note: previous commit f7f78c20 ("prioritize 429 in shouldRetry") was based
on a misread of the root cause - shouldRetry was already returning true for
upstream 429 in default config. That change still hardens shouldRetry for
future affinity configurations and tightens skipRetry handling, so it stays.
This commit is the actual fix for the user-reported "no channel available"
issue.

oldjs commented Apr 26, 2026

Re-submitted as a clean PR from a dedicated branch (the fork's main will later gain a self-use CI workflow, which would pollute this PR's diff); the new PR has identical content.

oldjs closed this Apr 26, 2026