fix(channel): reselect used channels when all exhausted in retry#11
Open
oldjs wants to merge 1 commit into mxyhi:main
Conversation
…roup
Root cause: this fork rewrote upstream's GetRandomSatisfiedChannel into
GetNextSatisfiedChannel + SWRR scheduler that strictly excludes already-tried
channels via ContextKeyUsedChannels. When all channels in a group fail (e.g.
multiple upstreams hit by 429 in a burst), excludedChannelIDs ends up
covering every channel in every priority bucket, scheduler.next() returns 0
for each priority, selectNextChannelFromBuckets returns (nil, nil), and the
relay loop reports "分组 X 下模型 Y 的可用渠道不存在(retry)" ("no
available channel for model Y under group X (retry)"), even when
RetryTimes=50 still has plenty of budget left.
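The exhaustion path can be illustrated with a toy SWRR step. This is a minimal sketch, not the fork's real code: `node` and `next` are hypothetical simplifications of `channelSchedulerState.next`, with channels reduced to IDs and weights.

```go
package main

import "fmt"

type node struct {
	id, weight, current int
}

// next performs one smooth-weighted-round-robin step over the
// non-excluded channels in a bucket. It returns 0 when every channel
// is excluded -- exactly the case that made selectNextChannelFromBuckets
// give up after walking all priorities.
func next(nodes []node, excluded map[int]bool) int {
	total, best := 0, -1
	for i := range nodes {
		if excluded[nodes[i].id] {
			continue
		}
		nodes[i].current += nodes[i].weight
		total += nodes[i].weight
		if best == -1 || nodes[i].current > nodes[best].current {
			best = i
		}
	}
	if best == -1 {
		return 0 // all excluded: caller falls through to the next priority
	}
	nodes[best].current -= total
	return nodes[best].id
}

func main() {
	bucket := []node{{id: 1, weight: 2}, {id: 2, weight: 1}}
	fmt.Println(next(bucket, nil))                            // prints 1
	fmt.Println(next(bucket, map[int]bool{1: true, 2: true})) // prints 0: bucket exhausted
}
```

Once every bucket returns 0 like this, the strict-exclusion design has no channel left to offer, regardless of remaining retry budget.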
Upstream's behavior is different: retry index drives priority selection and
clamps to the last priority once exceeded, so within that lowest priority
the random weighted picker keeps choosing channels (including already-tried
ones). 429 is transient; reselecting a previously-failed channel a few
hundred ms later often succeeds.
Fix: in CacheGetNextSatisfiedChannel's non-auto path, when the strict-
exclusion call returns (nil, nil) and used channels exist, retry once
without the exclusion. The relay loop's RetryTimes ceiling still bounds
total attempts, so this cannot loop forever. ContextKeyUsedChannels itself
is preserved for logging ("重试: A->B->A") and admin info dumps.
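For illustration, the retry trail kept in ContextKeyUsedChannels can be rendered like this; `retryTrail` is a hypothetical helper, not the fork's actual logging code.

```go
package main

import (
	"fmt"
	"strings"
)

// retryTrail formats the ordered list of channels attempted during one
// request into the relay log's failover trail, e.g. "重试: A->B->A".
// Note the same channel can appear twice once reselection is allowed.
func retryTrail(names []string) string {
	return "重试: " + strings.Join(names, "->")
}

func main() {
	fmt.Println(retryTrail([]string{"A", "B", "A"})) // prints 重试: A->B->A
}
```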
Auto-group path is intentionally left unchanged - it has its own group-
hopping fallback semantics that need separate consideration.
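The fix can be sketched as follows. This is a simplified model, not the patched source: `getNext` stands in for model.GetNextSatisfiedChannel and `selectWithFallback` for the non-auto path of CacheGetNextSatisfiedChannel, with channels reduced to plain IDs.

```go
package main

import "fmt"

// getNext returns the first channel not in excluded, or 0 when every
// channel is excluded (the (nil, nil) case in the real code).
func getNext(channels []int, excluded map[int]bool) int {
	for _, id := range channels {
		if !excluded[id] {
			return id
		}
	}
	return 0
}

// selectWithFallback sketches the fix: if strict exclusion exhausts every
// channel and some channels were in fact already used, retry once without
// the exclusion so a transiently failing (e.g. 429) channel can be
// reselected. The relay loop's RetryTimes ceiling still bounds attempts.
func selectWithFallback(channels []int, used map[int]bool) int {
	if id := getNext(channels, used); id != 0 {
		return id
	}
	if len(used) > 0 {
		return getNext(channels, nil) // all exhausted: allow reselection
	}
	return 0 // group truly has no matching channels
}

func main() {
	channels := []int{11, 22}
	fmt.Println(selectWithFallback(channels, map[int]bool{}))                  // prints 11: normal pick
	fmt.Println(selectWithFallback(channels, map[int]bool{11: true}))          // prints 22: still excludes used
	fmt.Println(selectWithFallback(channels, map[int]bool{11: true, 22: true})) // prints 11: fallback reselects
	fmt.Println(selectWithFallback(nil, map[int]bool{}))                       // prints 0: no channels at all
}
```

The double guard (exhausted result plus non-empty used set) is what keeps the empty-group case returning nil instead of looping.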
Tests:
- All channels exhausted -> fallback reselects one of the used channels.
- First call with no used channels -> normal selection, no fallback.
- Group has zero matching channels -> returns nil, no infinite loop.
- Some channels still available -> still excludes used as before.
Note: previous commit f7f78c20 ("prioritize 429 in shouldRetry") was based
on a misread of the root cause: shouldRetry was already returning true for
upstream 429 in the default config. That change still hardens shouldRetry
for future affinity configurations and tightens skipRetry handling, so it
stays.
This commit is the actual fix for the user-reported "no channel available"
issue.
Symptom
The following error shows up intermittently in production (especially
during bursts of upstream 429 rate limiting): the group clearly has
multiple available channels and RetryTimes=50 is configured, yet requests
fail outright with plenty of retry budget left and no failover.
Root cause
This fork rewrote upstream's GetRandomSatisfiedChannel (retry index
selects the priority tier + weighted random pick within the tier) into
GetNextSatisfiedChannel plus an SWRR scheduler, which strictly excludes
already-used channels via ContextKeyUsedChannels. Call chain:
- The controller/relay.go main loop adds the current channel to ContextKeyUsedChannels on every failure.
- service.CacheGetNextSatisfiedChannel → model.GetNextSatisfiedChannel passes that set as excludedChannelIDs.
- model.selectNextChannelFromBuckets → channelSchedulerState.next(excludedChannelIDs) skips every excluded channel.
When all channels under a priority tier are hit by 429 at once (exactly
what a rate-limit burst produces):
- excludedChannelIDs covers every channel in the tier,
- next() returns 0 for every priority,
- selectNextChannelFromBuckets returns (nil, nil) after walking all priorities,
- controller/relay.go reports "分组 X 下模型 Y 的可用渠道不存在(retry)" and breaks, discarding the remaining retry quota.
Compared with upstream: when retry >= len(uniquePriorities), upstream
clamps to the lowest priority and keeps doing weighted random selection
(allowing reselection of already-tried channels), so upstream can use up
the full RetryTimes. A 429 is transient and often clears within a few
hundred milliseconds; reselecting a used channel on the second pass
frequently succeeds.
Fix
In the non-auto path of CacheGetNextSatisfiedChannel in
service/channel_select.go, when the strict-exclusion call returns
(nil, nil) and the used set is non-empty, call
GetNextSatisfiedChannel(group, model, nil) once more, without the
exclusion. This follows upstream's spirit: once all channels are
exhausted, reselection is allowed. The RetryTimes ceiling in the
controller/relay.go main loop still applies, so this cannot loop forever.
ContextKeyUsedChannels is kept rather than cleared: it is still used by
controller/relay.go's retry log ("重试: A->B->A") and the admin info
field (use_channel), and clearing it would lose the failure-path
information. The auto-group path is unchanged; it has its own
group-hopping fallback semantics that need separate consideration (the
reported issue is not on the auto path).
Safety verification
- Total attempts are still bounded by the RetryTimes ceiling.
- The change sits inside the else branch (param.TokenGroup != "auto"); the auto branch is fully preserved.
- The fallback is double-guarded by if channel == nil && len(usedChannelIDs) > 0: with an empty usedChannelIDs the fallback is never entered, and when it is non-empty but no buckets exist, the second call also returns nil and we return safely.
Tests
Four new tests in service/channel_select_test.go:
- TestCacheGetNextSatisfiedChannelNonAutoFallbackWhenAllChannelsExhausted: the core fix; when all channels have been used, the fallback reselects a used channel.
- TestCacheGetNextSatisfiedChannelNonAutoFirstCallNoUsedChannels: a first call with no used channels selects normally and does not trigger the fallback.
- TestCacheGetNextSatisfiedChannelNonAutoNoChannelsAtAll: zero matching channels returns nil with no infinite loop.
- TestCacheGetNextSatisfiedChannelNonAutoExcludesUsedWhenOthersAvailable: when some channels are still available, used channels are excluded as before.
go test ./... passes in full (the root package's web/dist embed failure is
pre-existing and unrelated to this PR).
Scope of impact