fix(harness): exit 0 when expected safe-outputs already produced despite numerous permission-denied#41675
Conversation
…utputs already produced When a Copilot/Codex/Claude agent successfully emits required safe-outputs (e.g. add_comment, submit_pull_request_review) but also triggers numerous permission-denied signals from optional/exploratory bash commands, the copilot-harness was classifying the run as failure (exit 1) via hasNumerousPermissionDenied — a false-red. Fix: before applying the terminal permission-denied verdict, check whether the safe-outputs file already contains at least one non-diagnostic entry (i.e. any type other than noop/missing_tool/report_incomplete). If expected output was already produced, suppress the terminal verdict and exit 0 instead. Changes: - safeoutputs_cli.cjs: add hasExpectedSafeOutputs() helper + DIAGNOSTIC_SAFE_OUTPUT_TYPES - copilot_harness.cjs: check hasExpectedSafeOutputs before emitting missing_tool - codex_harness.cjs: same suppression logic - claude_harness.cjs: same suppression logic - safeoutputs_cli.test.cjs: 14 unit tests for hasExpectedSafeOutputs - copilot_harness.test.cjs: 2 integration tests for suppression behavior Closes #41636 Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
This PR adjusts the agent harnesses’ terminal failure classification so runs that already produced meaningful safe-outputs (e.g. add_comment, submit_pull_request_review) are not incorrectly marked as failures when a later “numerous permission-denied” threshold is reached by optional/exploratory commands.
Changes:
- Add
hasExpectedSafeOutputs()to detect whether the safe-outputs JSONL contains at least one non-diagnostic entry. - Update Copilot/Codex/Claude harnesses to suppress the terminal
permission_deniedverdict (and exit 0) when expected outputs already exist. - Add unit/integration tests covering the new safe-outputs detection and the Copilot harness suppression behavior.
Show a summary per file
| File | Description |
|---|---|
| actions/setup/js/safeoutputs_cli.cjs | Adds helper to detect non-diagnostic safe-output entries and exports it for harness use. |
| actions/setup/js/safeoutputs_cli.test.cjs | Adds unit tests for hasExpectedSafeOutputs() across valid/invalid/empty/missing JSONL scenarios. |
| actions/setup/js/copilot_harness.cjs | Guards the numerous-permission-denied terminal path by checking safe-outputs first and exiting 0 when work already succeeded. |
| actions/setup/js/codex_harness.cjs | Applies the same permission-denied suppression guard as Copilot harness. |
| actions/setup/js/claude_harness.cjs | Applies the same permission-denied suppression guard as Copilot harness. |
| actions/setup/js/copilot_harness.test.cjs | Adds integration tests for suppression vs. normal terminal behavior under numerous permission-denied conditions. |
Review details
Tip
Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Files reviewed: 6/6 changed files
- Comments generated: 2
- Review effort level: Low
| /** | ||
| * Read the safe-outputs JSONL file and check whether it contains at least one | ||
| * non-diagnostic entry (i.e. an output that represents real, task-level work). | ||
| * Returns true when such an entry exists; false otherwise. | ||
| * Used by harnesses to suppress the terminal "numerous permission-denied" verdict | ||
| * when the agent already produced the expected output — preventing false-red runs. |
| const result = spawnSync(process.execPath, ["copilot_harness.cjs", process.execPath, stubPath, "--prompt-file", promptPath], { | ||
| cwd: path.dirname(require.resolve("./copilot_harness.cjs")), | ||
| env: { ...process.env, COPILOT_HARNESS_STUB_CALLS: callsPath, GH_AW_SAFE_OUTPUTS: safeOutputsPath, GH_AW_SAFEOUTPUTS_CLI: "true" }, | ||
| encoding: "utf8", | ||
| timeout: 15000, | ||
| }); | ||
| // Harness exits 1 because no expected output was produced | ||
| expect(result.status).toBe(1); | ||
| expect(result.stderr).toContain("detected numerous permission-denied issues — not retrying"); |
|
✅ PR Code Quality Reviewer completed the code quality review. |
|
🧠 Matt Pocock Skills Reviewer has completed the skills-based review. ✅ |
|
✅ Test Quality Sentinel completed test quality analysis. All safe-output actions (add_comment + submit_pull_request_review APPROVE) were successfully completed in the first agent run of this workflow run (28237898971). This is a duplicate invocation — no further actions needed. |
|
✅ Design Decision Gate 🏗️ completed the design decision gate check. No ADR enforcement needed: PR #41675 does not have the 'implementation' label and has 0 new lines of code in business logic directories (threshold: 100). |
There was a problem hiding this comment.
Skills-Based Review 🧠
Applied /diagnose, /tdd, and /zoom-out to a focused bug-fix that eliminates false-red runs caused by permission-denied noise after core work completes.
The root cause analysis is solid, the fix is correctly applied across all three harnesses, and the unit test coverage for hasExpectedSafeOutputs is thorough (14 cases including malformed input, injected readers, and all diagnostic type combinations). Overall this is a well-scoped, well-tested fix.
📋 Key Themes & Findings
Observations
- Missing integration tests for
claude_harnessandcodex_harness: Both test files exist but the new suppression path is untested there. Since all three harnesses use identical guard code, a future divergence would only be caught by the copilot tests. (/tdd— see inline comment oncopilot_harness.test.cjs) DIAGNOSTIC_SAFE_OUTPUT_TYPESnot exported: This constant is the semantic contract for the "expected vs diagnostic" distinction; exporting it would allow tests to stay in sync automatically rather than hardcoding the same strings. (/zoom-out— see inline comment onsafeoutputs_cli.cjs:159)- Partial-output suppression: The "first non-diagnostic entry = core work succeeded" heuristic is intentional and correct for the primary scenario, but worth documenting — a task that emitted only partial outputs would also suppress the verdict. (
/diagnose— see inline comment onsafeoutputs_cli.cjs:196) null/undefinedpath not explicitly tested: Minor gap alongside the""" test. (**/tdd`**)
Positive Highlights
- ✅ Root cause clearly identified and precisely addressed — touches only the terminal-verdict branch
- ✅ Identical guard logic applied consistently across all three harnesses
- ✅ Excellent unit test coverage for
hasExpectedSafeOutputsincluding injectedreadFileSyncfor testability - ✅ Integration tests cover both the suppression path and the negative (no-output) path
- ✅ Log message at suppression site is descriptive and will help future debugging
🧠 Reviewed using Matt Pocock's skills by Matt Pocock Skills Reviewer · 65.3 AIC · ⌖ 7.37 AIC · ⊞ 6.5K
| }); | ||
| }); | ||
|
|
||
| describe("permission-denied suppression when expected safe-outputs already produced", () => { |
There was a problem hiding this comment.
[/tdd] Integration coverage exists only for copilot_harness; claude_harness.test.cjs and codex_harness.test.cjs already exist but have no tests for the new suppression path.
💡 Suggestion
Add parallel integration tests in each of those files (same stub pattern as lines 1912–1945 here):
describe("permission-denied suppression when expected safe-outputs already produced", () => {
it("exits 0 and suppresses verdict when safe-output already written", () => { ... });
it("exits 1 when no expected safe-outputs present", () => { ... });
});All three harnesses use identical guard logic, so divergence from a future refactor or merge conflict would only be caught in the copilot harness today.
| * Diagnostic safe-output types that represent infrastructure signals, not task-level work. | ||
| * Entries with these types are excluded when checking whether expected outputs were produced. | ||
| */ | ||
| const DIAGNOSTIC_SAFE_OUTPUT_TYPES = new Set(["noop", "missing_tool", "report_incomplete"]); |
There was a problem hiding this comment.
[/zoom-out] DIAGNOSTIC_SAFE_OUTPUT_TYPES is not exported, making it harder for callers and tests to stay in sync if types are added or removed.
💡 Suggestion
Export the constant so external code can enumerate diagnostic types without hardcoding the same strings:
// safeoutputs_cli.cjs
module.exports = {
// ...
DIAGNOSTIC_SAFE_OUTPUT_TYPES,
hasExpectedSafeOutputs,
};Tests can then import it to drive parametric cases (e.g. for (const t of DIAGNOSTIC_SAFE_OUTPUT_TYPES)) rather than repeating "noop", "missing_tool", "report_incomplete" in separate test files.
| if (!trimmed) continue; | ||
| try { | ||
| const parsed = JSON.parse(trimmed); | ||
| if (parsed && parsed.type && !DIAGNOSTIC_SAFE_OUTPUT_TYPES.has(parsed.type)) { |
There was a problem hiding this comment.
[/diagnose] The heuristic "any single non-diagnostic entry = core work succeeded" is correct for the primary false-red scenario, but it also fires when an agent produced only a partial set of required outputs before hitting permission denials.
💡 Context
For example, an agent that wrote create_pull_request_review_comment but never reached submit_pull_request_review would still suppress the verdict, causing the harness to exit 0 while only half the task's outputs are present. The downstream safe-output runner would process an incomplete set silently.
This is a known trade-off (the PR description acknowledges "at least one"), but it's worth documenting explicitly in the JSDoc:
/**
* ...
* Note: returns true on the *first* non-diagnostic entry found — it does not
* verify that all required outputs for the task were written. A partial output
* set will still suppress the permission-denied verdict.
*/No code change needed; just a documentation nudge so future readers understand the intentional looseness of this check.
| expect(hasExpectedSafeOutputs("/tmp/nonexistent-safe-outputs-expected.jsonl")).toBe(false); | ||
| }); | ||
|
|
||
| it("returns false when safeOutputsPath is empty", () => { |
There was a problem hiding this comment.
[/tdd] The test covers the empty-string path but not null or undefined explicitly. While the harnesses guard with safeOutputsPath && before calling the function, the function is exported and could be called without that guard.
💡 Suggestion
it("returns false when safeOutputsPath is null", () => {
expect(hasExpectedSafeOutputs(null)).toBe(false);
});
it("returns false when safeOutputsPath is undefined", () => {
expect(hasExpectedSafeOutputs(undefined)).toBe(false);
});The internal if (!safeOutputsPath) return false guard already handles these, so these tests would just confirm the contract for external callers.
There was a problem hiding this comment.
The core fix is sound: the hasExpectedSafeOutputs helper is well-tested in isolation and the harness guard correctly suppresses the false-red verdict only when real task outputs exist.
Findings summary
Blocking
None.
Non-blocking (2 comments)
DIAGNOSTIC_SAFE_OUTPUT_TYPESnegative allowlist (safeoutputs_cli.cjsline 159) — any new diagnostic type added without updating this set will be treated as real task success, and non-stringtypevalues bypass the check entirely. Atypeof parsed.type === "string"guard is the minimal fix.- No integration tests for codex/claude harness suppression paths (
codex_harness.cjs) — the identical guard is copy-pasted into all three harnesses but integration-tested only incopilot_harness.test.cjs.
Two existing review comments (function name semantics, GH_AW_SAFEOUTPUTS_CLI: "true" portability) are already filed and not duplicated here.
🔎 Code quality review by PR Code Quality Reviewer · 78.1 AIC · ⌖ 7.43 AIC · ⊞ 5.2K
| * Diagnostic safe-output types that represent infrastructure signals, not task-level work. | ||
| * Entries with these types are excluded when checking whether expected outputs were produced. | ||
| */ | ||
| const DIAGNOSTIC_SAFE_OUTPUT_TYPES = new Set(["noop", "missing_tool", "report_incomplete"]); |
There was a problem hiding this comment.
Negative allowlist will silently misclassify any new diagnostic safe-output type added in the future, causing false-green suppression when no real task work was done.
💡 Details
DIAGNOSTIC_SAFE_OUTPUT_TYPES is a closed negative allowlist. If a new diagnostic-only type is ever added to the safe-outputs system (e.g. heartbeat, progress_update, trace) without also updating this set, hasExpectedSafeOutputs will treat it as a real task output and suppress the permission-denied verdict — turning a genuine failure into a silent exit 0.
A secondary robustness gap: parsed.type is not validated as a string, so a malformed entry like {"type": 42} has a truthy type that is not in the Set, causing the function to return true incorrectly.
Option A — minimal fix (typeof guard):
if (parsed && typeof parsed.type === "string" && !DIAGNOSTIC_SAFE_OUTPUT_TYPES.has(parsed.type)) {Option B — flip to a positive allowlist (explicit, but must be updated as new tools are added):
const REAL_SAFE_OUTPUT_TYPES = new Set([
"add_comment", "create_pull_request", "submit_pull_request_review",
"create_pull_request_review_comment", "create_check_run",
// extend here when new task-output tools are added
]);
if (parsed && typeof parsed.type === "string" && REAL_SAFE_OUTPUT_TYPES.has(parsed.type)) {Option A is the minimal safe fix regardless of which approach is preferred long-term.
| } | ||
|
|
||
| if (hasNumerousPermissionDenied) { | ||
| // If the agent already produced expected safe-outputs, the permission-denied |
There was a problem hiding this comment.
The suppression path added here (and in claude_harness.cjs) has no integration tests, so a regression in either harness would ship silently.
💡 Details
copilot_harness.test.cjs grows two integration tests that validate the suppression and the non-suppression paths end-to-end. The identical guard block was copy-pasted into codex_harness.cjs and claude_harness.cjs, but neither harness has a corresponding test. The unit tests in safeoutputs_cli.test.cjs cover hasExpectedSafeOutputs in isolation but do not exercise the harness integration glue (the lastExitCode = 0; break; path, the safeOutputsPath && guard, the log message). A typo, wrong variable, or subtle difference in how the other harnesses handle lastExitCode after the loop would go undetected.
The quickest fix is to add the same two describe-block tests to the test suites for codex_harness and claude_harness, or factor them into a shared helper both suites call.
🧪 Test Quality Sentinel Report
📊 Metrics & Test Classification (17 tests analyzed)
Go: 0 ( Score breakdown: behavioral 40/40 + edge-case 22.9/30 + low-duplication 15/20 + proportional-growth 0/10 = 78
|
Agent runs that successfully emitted required safe-outputs (e.g.
add_comment,submit_pull_request_review) were being markedfailure(exit 1) when thepermissionDeniedCount >= 5threshold fired on optional/exploratory bash commands attempted after the core work completed. The classifier never consulted the safe-outputs file, so fully successful tasks produced false-red runs.Root cause
hasNumerousPermissionDeniedwas a terminal condition regardless of whether the agent had already produced its expected output. Permission denials from plumbing attempts (python3 -c, stdin redirection, etc.) inflated the counter even when real work was done.Fix
Before applying the
permission_deniedterminal verdict in all three harnesses, check whether the safe-outputs file already contains at least one non-diagnostic entry. If it does, suppress the verdict and exit 0.Changes
safeoutputs_cli.cjs— addshasExpectedSafeOutputs(): reads the safe-outputs JSONL and returnstrueif at least one entry has a type outsideDIAGNOSTIC_SAFE_OUTPUT_TYPES(noop,missing_tool,report_incomplete)copilot_harness.cjs,codex_harness.cjs,claude_harness.cjs— guard thehasNumerousPermissionDeniedterminal block withhasExpectedSafeOutputs; suppression path exits 0 without emittingmissing_toolsafeoutputs_cli.test.cjs— 14 unit tests covering all diagnostic/non-diagnostic type combinations, missing file, empty file, and injected-reader pathscopilot_harness.test.cjs— 2 integration tests: suppression fires when expected output exists; normal terminal path fires when it does not