[aw-failures] Copilot CLI false-red — runs marked failure (exit 1) after safe-outputs succeed, via "numerous permission denied" [Content truncated due to length] #41636

@github-actions

Copilot CLI false-red — runs marked failure (exit 1) after safe-outputs already succeeded, via "numerous permission denied" terminal classification

Investigated window: 6h ending 2026-06-26 08:04 UTC. Dominant recurring, untracked failure signature.

Problem statement

  1. Copilot-CLI agent jobs that successfully emit their required safe-outputs are still concluded failure (process exit code 1).
  2. The copilot-harness classifies the attempt as failureClass=permission_denied with hasNumerousPermissionDenied=true (permissionDeniedCount>=5), treats it as a terminal "missing tool/permission" issue, does not retry, and exits 1.
  3. The permission-denial counter increments on every disallowed Bash command form the agent attempts, including optional/exploratory variants — even after the real work is done. The classifier never consults whether expected safe-outputs were produced, so a fully successful task is reported as a red run (false-red), polluting CI signal and masking true success.
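The behavior in points 1 to 3 can be modeled in a few lines. This is a minimal sketch, not the copilot-harness source: `classify_attempt`, `Verdict`, and the constant name are hypothetical. Only `permissionDeniedCount`, `hasNumerousPermissionDenied`, `failureClass=permission_denied`, the >=5 threshold, the no-retry behavior, and exit code 1 come from the logs quoted in this issue. Other failure classes are omitted.

```python
from dataclasses import dataclass
from typing import Optional

# Threshold reported in the run logs (permissionDeniedCount>=5).
NUMEROUS_PERMISSION_DENIED_THRESHOLD = 5


@dataclass
class Verdict:
    failure_class: Optional[str]
    has_numerous_permission_denied: bool
    retry: bool
    exit_code: int


def classify_attempt(permission_denied_count: int, safe_output_count: int) -> Verdict:
    """Hypothetical model of the count-only classification described above.

    safe_output_count is accepted but deliberately never read: that is the
    suspected defect, not an oversight in this sketch.
    """
    numerous = permission_denied_count >= NUMEROUS_PERMISSION_DENIED_THRESHOLD
    if numerous:
        # Terminal verdict: no retry, process exits 1.
        return Verdict("permission_denied", True, retry=False, exit_code=1)
    return Verdict(None, False, retry=False, exit_code=0)


# Two safe-outputs were produced, yet the attempt is still reported as failed.
verdict = classify_attempt(permission_denied_count=5, safe_output_count=2)
print(verdict.failure_class, verdict.exit_code)
```

Under this model, the amount of completed work has no path into the verdict, which matches the false-red signature.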

Affected workflows and run IDs

Confirmed (exact signature verified from run artifacts):

  • PR Description Updater §28222667341 (07:00 UTC). Log: permissionDeniedCount=5, hasNumerousPermissionDenied=true; agent output "PR #41608 description updated successfully"; missing_tool emitted; "not retrying (classified as missing tool/permission issue)"; "Process completed with exit code 1".
  • Test Quality Sentinel §28221623302 (06:35 UTC). safeoutputs.jsonl contains a completed add_comment (the full Test Quality Report, posted to PR #41617, "Update gh aw update --org to support workflow-targeted updates and repo prefiltering") and a submit_pull_request_review (APPROVE). Both succeeded; the run then emitted missing_tool with reason: "missing tool/permission issue: numerous permission denied errors detected" and exited 1. The run did real work: 794k tokens, 21 turns, 12m0s.

Recurrence (same workflow + engine, same window, conclusion=failure; signature strongly suspected):

  • Test Quality Sentinel §28218968236 (05:22 UTC), §28215990234 (03:52 UTC). Test Quality Sentinel (TQS) failed 3× in the window, while interleaved successful runs (06:03, 04:54, 04:38) show that the workflow itself is healthy.

Same engine, root cause unconfirmed (may differ): Changeset Generator §28215703557, Go Logger Enhancement §28217478494, Daily AstroStyleLite Markdown Spellcheck §28217439569.

Evidence

From 28221623302 artifacts (safeoutputs.jsonl), the missing_tool record's alternatives field shows the agent looping on denied Bash command forms before the terminal classification fired:

  • python3 -c "print(json.dumps(...))" pipelines
  • safeoutputs add_comment . < /tmp/gh-aw/agent/payload.json stdin redirection
  • with open('/tmp/gh-aw/agent/payload.json','w') as f: ...

These are exploratory plumbing attempts; the agent had already emitted valid add_comment + submit_pull_request_review items. The denials were on extra/optional invocations, yet they tripped the >=5 terminal threshold.
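To check other runs for the same pattern, the safe-output records can be tallied by type. A minimal sketch, assuming each line of safeoutputs.jsonl is a JSON object carrying a `type` field. The sample records below are illustrative and shaped after the description of run 28221623302; they are not copied from the artifact, and the `body` and `event` field names are assumptions.

```python
import json
from collections import Counter


def count_safe_output_types(jsonl_text: str) -> Counter:
    """Count JSONL records by their `type` field, skipping blank or malformed lines."""
    counts = Counter()
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate partially written lines
        if isinstance(record, dict) and "type" in record:
            counts[record["type"]] += 1
    return counts


# Illustrative records only (not real artifact content).
sample = "\n".join([
    json.dumps({"type": "add_comment", "body": "..."}),
    json.dumps({"type": "submit_pull_request_review", "event": "APPROVE"}),
    json.dumps({"type": "missing_tool", "reason": "..."}),
])

counts = count_safe_output_types(sample)
work_items = sum(n for t, n in counts.items() if t != "missing_tool")
print(work_items)  # 2
```

A failed run with a missing_tool record alongside one or more work items is a candidate for this signature.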

Probable root cause

  1. Classification ignores the success signal. copilot-harness derives a terminal permission_denied verdict purely from permissionDeniedCount, without checking whether outputs.jsonl already contains valid safe-output items. A run that produced its required output is reported as failed.
  2. safeoutputs CLI ergonomics. The agent reaches for python3 -c JSON-encode pipelines and stdin-redirection forms that the Bash allowlist blocks, inflating permissionDeniedCount even though a simpler allowed safeoutputs <tool> --param value invocation exists.

Proposed remediation

  1. In copilot-harness failure classification, suppress the hasNumerousPermissionDenied terminal verdict and exit 0 when the run produced ≥1 expected safe-output item in outputs.jsonl. Permission-denials should then yield at most a warning + missing_tool record, not a red run.
  2. Optionally: only count denials toward the threshold for commands the agent was required to run (not optional/exploratory variants), or make the threshold configurable.
  3. Reduce denial generation: prominently document the exact allowed safeoutputs <tool> --param value form, and/or widen the Bash allowlist to cover the common safeoutputs <tool> . < file.json and small python3 -c JSON-encode forms the agent reaches for.
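Remediation items 1 and 2 could be sketched as follows. This is an illustration under stated assumptions, not a patch against copilot-harness: `classify_with_success_signal`, `Verdict`, and `NON_WORK_TYPES` are hypothetical names, and the configurable `threshold` parameter defaults to the 5 seen in the logs. The Evidence section shows the missing_tool record in the same safeoutputs.jsonl as the work items, so the sketch excludes missing_tool when deciding whether work was produced.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# Record types that report a problem rather than completed work (assumption).
NON_WORK_TYPES = frozenset({"missing_tool"})


@dataclass
class Verdict:
    failure_class: Optional[str]
    exit_code: int
    warning: Optional[str] = None


def classify_with_success_signal(
    permission_denied_count: int,
    safe_output_types: Iterable[str],
    threshold: int = 5,
) -> Verdict:
    """Hypothetical classification that consults the safe-output success signal."""
    numerous = permission_denied_count >= threshold
    produced_work = any(t not in NON_WORK_TYPES for t in safe_output_types)

    if not numerous:
        return Verdict(None, 0)
    if produced_work:
        # Downgrade to a warning: the run did its job despite the denials.
        return Verdict(
            None,
            0,
            warning=f"{permission_denied_count} permission-denied errors; "
            "safe-outputs were produced, not failing the run",
        )
    # No work produced: keep the terminal verdict.
    return Verdict("permission_denied", 1)


print(classify_with_success_signal(5, ["add_comment", "missing_tool"]).exit_code)  # 0
print(classify_with_success_signal(5, ["missing_tool"]).exit_code)  # 1
```

Design note: a plain "file is non-empty" check would also pass when the only record is missing_tool, so filtering by record type is likely the safer test.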

Success criteria / verification

  • A re-run of PR Description Updater / Test Quality Sentinel that emits its safe-outputs concludes success (green), even when permission-denials occurred.
  • permissionDeniedCount no longer forces exit 1 when outputs.jsonl is non-empty.
  • Scheduled Test Quality Sentinel false-red rate for this signature drops to ~0.
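The third criterion needs a definition of "false-red rate". One possible definition, as a sketch: the share of runs in a window that concluded failure, were classified with numerous permission denials, and still produced at least one work item. `RunRecord` and its fields are hypothetical, and the sample data is synthetic, not taken from the investigated window.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class RunRecord:
    conclusion: str                   # "success" or "failure"
    work_output_count: int            # safe-output items, excluding missing_tool
    numerous_permission_denied: bool  # hasNumerousPermissionDenied from the run log


def false_red_rate(runs: Sequence[RunRecord]) -> float:
    """Fraction of runs that failed on the permission-denied signature despite producing work."""
    if not runs:
        return 0.0
    false_reds = sum(
        1
        for r in runs
        if r.conclusion == "failure"
        and r.numerous_permission_denied
        and r.work_output_count > 0
    )
    return false_reds / len(runs)


# Synthetic window: one false-red, one genuine failure, two successes.
window = [
    RunRecord("failure", 2, True),
    RunRecord("failure", 0, True),
    RunRecord("success", 1, False),
    RunRecord("success", 3, False),
]
print(false_red_rate(window))  # 0.25
```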

Existing-issue correlation

Distinct from all open agentic-workflows issues: #41195 (BYOK 403 — genuine provider-auth failure, no work produced), #41455 (firewall/DNS startup), #41456 (patch-parser under-detection), #41355 (workflow_call permissions), #41293 (Copilot Python SDK ModuleNotFound). None covers the "succeeds-then-marked-failed via permission-denied count" signature.

Other observations this window (not filed — insufficient/ambiguous evidence)

Generated by 🔍 [aw] Failure Investigator (6h) · 238.7 AIC · ⌖ 36.5 AIC · ⊞ 5.3K ·

  • expires on Jul 3, 2026, 12:16 AM UTC-08:00
