Stop Codex harness retry loops on TPM exhaustion and unfinished-goal errors#42420
Conversation
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
This PR tightens retry behavior in the Node.js harnesses used by gh-aw engine runners to avoid wasting tokens on deterministic failures (TPM exhaustion and “unfinished goal” router-state errors), and updates tests + docs to match the new retry policy.
Changes:
- Stop Codex harness fresh-run retries when output indicates token-per-minute (TPM) rate-limit exhaustion.
- Extend the shared non-retryable retry-guard to detect the “unfinished goal” error variant as a goal-already-active condition.
- Update harness unit tests and engine docs to reflect the fixed retry budget and the new non-retryable conditions.
Show a summary per file
| File | Description |
|---|---|
| docs/src/content/docs/reference/engines.md | Documents the current built-in harness retry budget and how to customize retry behavior via a custom Copilot harness. |
| actions/setup/js/harness_retry_guard.cjs | Adds an additional non-retryable pattern for the “unfinished goal” wording. |
| actions/setup/js/harness_retry_guard.test.cjs | Adds coverage to ensure the new “unfinished goal” wording is detected as non-retryable. |
| actions/setup/js/codex_harness.cjs | Introduces TPM-specific detection and short-circuits the retry loop on TPM exhaustion. |
| actions/setup/js/codex_harness.test.cjs | Adds unit tests for TPM detection and updates retry-policy expectations accordingly. |
Review details
Tip
Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Files reviewed: 5/5 changed files
- Comments generated: 0
- Review effort level: Low
|
✅ Design Decision Gate 🏗️ completed the design decision gate check. No ADR enforcement needed: PR #42420 does not have the 'implementation' label and has 0 new lines of code in business logic directories (threshold: 100). |
|
✅ Test Quality Sentinel completed test quality analysis. |
|
✅ PR Code Quality Reviewer completed the code quality review. |
|
🧠 Matt Pocock Skills Reviewer has completed the skills-based review. ✅ |
Skills-Based Review 🧠Applied 📋 Key Themes & HighlightsKey Themes
Positive Highlights
@copilot please address the review comments above.
|
There was a problem hiding this comment.
🧠 Reviewed using Matt Pocock's skills by Matt Pocock Skills Reviewer · 61.9 AIC · ⌖ 7.9 AIC · ⊞ 6.5K
Comment /matt to run again
| const nonRetryableGuard = detectNonRetryableHarnessGuard(result.output); | ||
| if (nonRetryableGuard.aiCreditsExceeded || nonRetryableGuard.awfAPIProxyBlockingRequests || nonRetryableGuard.goalAlreadyActive || nonRetryableGuard.maxRunsExceeded) return false; | ||
| const isRateLimit = isRateLimitError(result.output); | ||
| if (isTokenPerMinuteRateLimitError(result.output)) return false; |
There was a problem hiding this comment.
[/tdd] The existing does not retry when rate-limit reconnects are exhausted (N/N pattern) test (line 518) uses output that also contains on tokens per min — meaning this new TPM guard fires before the reconnect-exhausted check, and that test now exercises the TPM path rather than the reconnect-exhausted path for TPM output.
💡 How to fix the coverage gap
Add a companion reconnect-exhausted test that uses non-TPM reconnect output so the isRateLimit && isReconnectExhaustedError guard stays specifically covered:
it("does not retry when rate-limit reconnects are exhausted (non-TPM rate limit)", () => {
const output =
'{"type":"error","message":"Reconnecting... 1/5 (stream disconnected: RateLimitError)"}\n' +
'{"type":"error","message":"Reconnecting... 5/5 (stream disconnected: RateLimitError)"}';
const result = { exitCode: 1, hasOutput: true, output };
expect(shouldRetry(result, 0)).toBe(false);
});Without this, accidentally deleting the reconnect-exhausted guard for TPM scenarios would go undetected.
@copilot please address this.
| // Fresh-run retries immediately repeat the same prompt workload and can quickly | ||
| // drain available credits without making forward progress. | ||
| if (isTokenPerMinuteRateLimitError(result.output)) { | ||
| log(`attempt ${attempt + 1}: token-per-minute rate limit detected — not retrying (fresh runs can further drain token budget)`); |
There was a problem hiding this comment.
[/diagnosing-bugs] The structured failure log above (line ~460) records isRateLimitError=true but doesn't include the new isTokenPerMinuteRateLimitError flag. When debugging why a run didn't retry, operators will see "rate-limit" in the structured log but have to search for the break-message later to learn the TPM guard fired.
💡 Suggested addition to the structured log
Add isTokenPerMinuteRateLimitError to the existing failure log:
// near line 460
log(
`attempt ${attempt + 1} failed:` +
` ...` +
` isRateLimitError=${isRateLimit}` +
+ ` isTokenPerMinuteRateLimitError=${isTokenPerMinuteRateLimitError(result.output)}` +
` ...`
);This keeps all guard-state visible at the same log line and avoids the need to scroll forward to find the break-message.
@copilot please address this.
| const result = { | ||
| exitCode: 1, | ||
| hasOutput: false, | ||
| output: '{"type":"error","message":"Rate limit reached for gpt-4o-mini in organization org-xxx on tokens per min (TPM): Limit 200000, Used 50000, Requested 35000. Please try again in 615ms."}', |
There was a problem hiding this comment.
[/tdd] The new test only covers hasOutput: false. TPM limits can trigger mid-run after partial output is written — worth adding a hasOutput: true variant to confirm the guard fires unconditionally.
💡 Suggested additional test
it("does not retry on token-per-minute rate limit wording even with partial output", () => {
const result = {
exitCode: 1,
hasOutput: true,
output: 'Rate limit reached for gpt-4o-mini in organization org-xxx on tokens per min (TPM): Limit 200000, Used 199000, Requested 1200.',
};
expect(shouldRetry(result, 0)).toBe(false);
});This guards against a future refactor that might inadvertently make the TPM check conditional on hasOutput.
@copilot please address this.
| expect(result.goalAlreadyActive).toBe(true); | ||
| }); | ||
|
|
||
| it("detects goal already active for unfinished-goal wording", () => { |
There was a problem hiding this comment.
[/tdd] The new test passes a plain string. Real router errors arrive JSON-wrapped, so the pattern's real-world input is '{"type":"error","message":"cannot create a new goal..."}'. A JSON-wrapped variant would give higher confidence that the [\\s\\S]*? part of the pattern handles the surrounding JSON safely.
💡 Suggested additional test
it("detects goal already active for unfinished-goal wording (JSON-wrapped)", () => {
const result = detectNonRetryableHarnessGuard(
'{"type":"error","message":"cannot create a new goal because this thread has an unfinished goal; complete the existing goal first"}'
);
expect(result.goalAlreadyActive).toBe(true);
});@copilot please address this.
🧪 Test Quality Sentinel — Score: 80/100 ✅ ExcellentTests in this PR directly verify two new non-retryable error-detection contracts (TPM exhaustion and unfinished-goal wording), all targeting error paths with correct behavioral assertions. 📊 Metrics & ClassificationFile-level summary
Per-test classification
Score components
Implementation-test ratio: 1/4 = 25% (threshold: 30%) ✅ 🔍 Flagged: Test #3 — implementation_detail concernTest: The Severity: Low — this is a pre-existing test smell, not introduced by this PR. The real imported Suggestion (non-blocking): Consider exporting Verdict: ✅ APPROVE — 80/100. New tests cover the two behavioral contracts introduced by this PR (TPM exhaustion detection and unfinished-goal error variant) with appropriate error-path assertions. Implementation-test ratio is 25%, within the 30% threshold. No guideline violations. References: §28423209654
|
There was a problem hiding this comment.
Review: Stop Codex harness retry loops on TPM exhaustion and unfinished-goal errors
The overall approach is correct and well-tested. The TPM guard prevents token drain on deterministic failures, and the new goalAlreadyActive variant is handled cleanly. All three harnesses confirmed to use MAX_RETRIES = 3, so the docs update is accurate.
3 non-blocking suggestions:
| # | File | Concern |
|---|---|---|
| 1 | codex_harness.cjs:67 |
TOKEN_PER_MIN_RATE_LIMIT_PATTERN is too generic — risks false positive when non-TPM rate-limit output happens to mention "on tokens per min" in a log line |
| 2 | codex_harness.cjs:524 |
isTokenPerMinuteRateLimitError re-scans output for isRateLimitError even though isRateLimit was already computed at line 450 |
| 3 | harness_retry_guard.cjs:8 |
[\s\S]*? in the new goal pattern is unbounded; the separator is a known ; — a bounded class like [^;]{0,120} is more precise |
None of these are blockers — the change is safe to merge as-is. See inline comments for suggested fixes.
🧵 Reviewed using Impeccable skills by Impeccable Skills Reviewer · 54.1 AIC · ⌖ 6.94 AIC · ⊞ 4.9K
| // the human-readable message Codex emits inside "Reconnecting..." / error lines: | ||
| // "Rate limit reached for <model> in organization <org> on tokens per min (TPM): ..." | ||
| const RATE_LIMIT_ERROR_PATTERN = /rate_limit_exceeded|429 Too Many Requests|RateLimitError|Rate limit reached for [^\s]+(?: in organization [^\s]+)? on tokens per min/i; | ||
| const TOKEN_PER_MIN_RATE_LIMIT_PATTERN = /on tokens per min/i; |
There was a problem hiding this comment.
Over-broad pattern risks false positives.
TOKEN_PER_MIN_RATE_LIMIT_PATTERN = /on tokens per min/i is a very generic substring. isTokenPerMinuteRateLimitError guards it with isRateLimitError, but that check also matches rate_limit_exceeded and 429 Too Many Requests — both of which could appear in output that also contains the phrase on tokens per min for an unrelated reason (e.g., a Codex diagnostic note about the token budget). In such a case a non-TPM rate-limit failure would be misclassified as TPM and a viable retry would be suppressed.
A self-contained, specific pattern removes the need for the && guard:
const TOKEN_PER_MIN_RATE_LIMIT_PATTERN =
/Rate limit reached for \S+(?:\s+in organization \S+)? on tokens per min/i;This matches only the well-known OpenAI TPM error structure and lets isTokenPerMinuteRateLimitError drop the isRateLimitError(output) && guard entirely.
@copilot please address this.
| // Token-per-minute limits indicate exhausted budget for the current workload profile. | ||
| // Fresh-run retries immediately repeat the same prompt workload and can quickly | ||
| // drain available credits without making forward progress. | ||
| if (isTokenPerMinuteRateLimitError(result.output)) { |
There was a problem hiding this comment.
isTokenPerMinuteRateLimitError re-scans output for isRateLimitError internally, even though isRateLimit was already computed at line 450. For typical Codex sessions the output buffer can be several megabytes; re-applying two regexes to it is minor but avoidable.
If the pattern is tightened to the specific OpenAI TPM structure (see comment on line 67), the && guard is unnecessary and the function becomes a single-pass check. Alternatively, pass the pre-computed result:
if (isRateLimit && isTokenPerMinuteRateLimitError(result.output)) {@copilot please address this.
|
|
||
| const AWF_API_PROXY_BLOCKING_REQUESTS_PATTERNS = [/\bawf\b.*\bapi[\s_-]*proxy\b.*\bblocking requests\b/i, /\bapi[\s_-]*proxy\b.*\bblocking requests\b/i, /\bapi[\s_-]*proxy\b.*\bblocked requests?\b/i, /\bDIFC_FILTERED\b/]; | ||
| const GOAL_ALREADY_ACTIVE_PATTERNS = [/\bthis thread already has a goal\b[\s\S]*?\buse update_goal\b/i]; | ||
| const GOAL_ALREADY_ACTIVE_PATTERNS = [/\bthis thread already has a goal\b[\s\S]*?\buse update_goal\b/i, /\bcannot create a new goal because this thread has an unfinished goal\b[\s\S]*?\bcomplete the existing goal first\b/i]; |
There was a problem hiding this comment.
[\s\S]*? can match across unbounded distances. The new pattern:
/\bcannot create a new goal ... \b[\s\S]*?\bcomplete the existing goal first\b/i
uses a lazy wildcard that will span as much text as needed to find both anchors. In real error output the two phrases are always separated by ; — a fixed, known delimiter. Tightening to something like [^;]{0,120} (or simply ;\s*) would both reduce any theoretical ReDoS surface and make the intent explicit:
/\bcannot create a new goal because this thread has an unfinished goal\b[^;]{0,120}\bcomplete the existing goal first\b/i(The existing first pattern has the same structure; this is consistent with it but worth hardening together.)
@copilot please address this.
There was a problem hiding this comment.
Non-blocking observations — intent is correct, implementation has a few rough edges
The core logic (skip retries on TPM exhaustion and unfinished-goal errors) is sound and the harness/retry-guard pattern detection works correctly. Three non-blocking issues noted inline:
Issues found
-
** is over-broad** (line 67): The loose pattern can false-positive when combined with any other rate-limit trigger in the same output — use the precise wording already in instead.
-
**Double call in ** (line 524): is already computed at line 450; calling internally makes the expensive regex run twice. Inline as .
-
Test mirrors production logic instead of testing it (test line 439): The local helper is a hand-written copy of the retry conditions in . The actual at is not exercised by any test, so a future regression there would go undetected.
🔎 Code quality review by PR Code Quality Reviewer · 93 AIC · ⌖ 7.26 AIC · ⊞ 1.6K
Comment /review to run again
| // the human-readable message Codex emits inside "Reconnecting..." / error lines: | ||
| // "Rate limit reached for <model> in organization <org> on tokens per min (TPM): ..." | ||
| const RATE_LIMIT_ERROR_PATTERN = /rate_limit_exceeded|429 Too Many Requests|RateLimitError|Rate limit reached for [^\s]+(?: in organization [^\s]+)? on tokens per min/i; | ||
| const TOKEN_PER_MIN_RATE_LIMIT_PATTERN = /on tokens per min/i; |
There was a problem hiding this comment.
Over-broad TPM pattern risks suppressing retries on unrelated errors: TOKEN_PER_MIN_RATE_LIMIT_PATTERN = /on tokens per min/i is a much looser match than the TPM-specific tail already present in RATE_LIMIT_ERROR_PATTERN.
💡 Suggested fix
If result.output ever contains two unrelated lines — one matching rate_limit_exceeded and another containing an unrelated "on tokens per min" reference — isTokenPerMinuteRateLimitError will return true and silently break the retry loop for what may have been a recoverable error.
Since RATE_LIMIT_ERROR_PATTERN already encodes the full OpenAI TPM error phrase, eliminate the separate looser constant and match precisely:
function isTokenPerMinuteRateLimitError(output) {
return /Rate limit reached for [^\s]+(?: in organization [^\s]+)? on tokens per min/i.test(output);
}This matches the exact OpenAI wording documented in the comment at line 65, avoids the false-positive scenario, and removes the need for TOKEN_PER_MIN_RATE_LIMIT_PATTERN as a separate exported-or-module-scoped constant.
| // Token-per-minute limits indicate exhausted budget for the current workload profile. | ||
| // Fresh-run retries immediately repeat the same prompt workload and can quickly | ||
| // drain available credits without making forward progress. | ||
| if (isTokenPerMinuteRateLimitError(result.output)) { |
There was a problem hiding this comment.
isRateLimitError runs twice on the same output string: isRateLimit is already computed at line 450. Calling isTokenPerMinuteRateLimitError(result.output) here triggers a second full regex sweep via the internal isRateLimitError call, with no benefit.
💡 Suggested fix
Since isRateLimit is in scope, eliminate the redundant call by inlining the TPM check:
// Line 524 – replace:
if (isTokenPerMinuteRateLimitError(result.output)) {
// with:
if (isRateLimit && TOKEN_PER_MIN_RATE_LIMIT_PATTERN.test(result.output)) {(Or use the more precise pattern suggested for line 67 directly here.)
This also makes the code's intent clearer: TPM detection is a specialisation of the general rate-limit path, which is exactly what the surrounding code structure implies.
| const nonRetryableGuard = detectNonRetryableHarnessGuard(result.output); | ||
| if (nonRetryableGuard.aiCreditsExceeded || nonRetryableGuard.awfAPIProxyBlockingRequests || nonRetryableGuard.goalAlreadyActive || nonRetryableGuard.maxRunsExceeded) return false; | ||
| const isRateLimit = isRateLimitError(result.output); | ||
| if (isTokenPerMinuteRateLimitError(result.output)) return false; |
There was a problem hiding this comment.
Test verifies a local copy of retry logic, not the production main() code path: the shouldRetry function defined at line 431 is a hand-written mirror of the retry conditions inside main(). Adding the TPM guard here validates pattern matching but does not exercise the actual break at line 524 of codex_harness.cjs.
💡 Why this matters
If the TPM check in main() is accidentally removed, reordered, or placed inside a condition that gates it away, all tests in this describe block will continue to pass — because they call the local shouldRetry mirror, not the real production code.
The test also cements the double-isRateLimitError call from the production code: line 438 computes const isRateLimit = isRateLimitError(result.output) and line 439 calls isTokenPerMinuteRateLimitError which calls isRateLimitError again internally.
Consider either:
- Exporting a thin
_shouldRetry(result, attempt, flags)predicate fromcodex_harness.cjsthat encapsulates the retry decision and can be tested directly, or - Adding an integration test that exercises
main()end-to-end with a stubbed subprocess returning a TPM error and asserts it does not invoke the subprocess a second time.
🤖 PR Triage — Run §28445959231
Score breakdown: Impact 34/50 · Urgency 20/30 · Quality 12/20 Stops infinite Codex retry loops on TPM exhaustion and unfinished-goal errors, preventing unnecessary token burn. CI passing, low-risk fix. Labels already applied.
|
|
@copilot run pr-finisher skill https://github.com/github/gh-aw/actions/runs/28423209643/job/84221720858 |
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Implemented a pr-finisher pass and pushed Addressed the failing run feedback by tightening TPM detection, logging TPM guard state in attempt diagnostics, avoiding redundant rate-limit rescans, and adding the requested test coverage updates (non-TPM reconnect exhaustion, TPM with partial output, and JSON-wrapped unfinished-goal cases). Local validation passed: CI on this new head won’t auto-rerun from my push; please re-trigger workflows to verify remote checks on the updated commit. |
Codex runs were repeatedly re-invoked after deterministic failures, burning tokens without forward progress. In particular, token-per-minute exhaustion and a newer “unfinished goal” router error variant were not being treated as non-retryable.
Retry guard: token budget exhaustion
on tokens per min) incodex_harness.cjs.Retry guard: goal-state regression
harness_retry_guard.cjsto recognize:cannot create a new goal because this thread has an unfinished goal; complete the existing goal firstTargeted harness test updates
actions/setup/js/codex_harness.test.cjsactions/setup/js/harness_retry_guard.test.cjsDocs: retry-count configuration clarity
docs/src/content/docs/reference/engines.mdto document current built-in harness retry budget and where custom retry behavior can be introduced.