[aw-failures] Copilot CLI false-red — runs marked failure (exit 1) after safe-outputs succeed, via "numerous permission denied" [Content truncated due to length] #41636

### Copilot CLI false-red — runs marked `failure` (exit 1) after safe-outputs already succeeded, via "numerous permission denied" terminal classification

Investigated window: 6h ending 2026-06-26 08:04 UTC. Dominant recurring, **untracked** failure signature.

### Problem statement

1. Copilot-CLI agent jobs that **successfully emit their required safe-outputs** are still concluded `failure` (process exit code 1).
2. The `copilot-harness` classifies the attempt as `failureClass=permission_denied` with `hasNumerousPermissionDenied=true` (`permissionDeniedCount>=5`), treats it as a terminal "missing tool/permission" issue, **does not retry**, and exits 1.
3. The permission-denial counter increments on every disallowed Bash command form the agent *attempts*, including optional/exploratory variants — even after the real work is done. The classifier never consults whether expected safe-outputs were produced, so a fully successful task is reported as a red run (false-red), polluting CI signal and masking true success.

### Affected workflows and run IDs

**Confirmed (exact signature verified from run artifacts):**
- **PR Description Updater** — [§28222667341](https://github.com/github/gh-aw/actions/runs/28222667341) (07:00 UTC). Log: `permissionDeniedCount=5`, `hasNumerousPermissionDenied=true`, agent output `PR #41608 description updated successfully`, `missing_tool emitted`, `not retrying (classified as missing tool/permission issue)`, `Process completed with exit code 1`.
- **Test Quality Sentinel** — [§28221623302](https://github.com/github/gh-aw/actions/runs/28221623302) (06:35 UTC). `safeoutputs.jsonl` contains a completed `add_comment` (full Test Quality Report → PR #41617) **and** a `submit_pull_request_review` (`APPROVE`). Both succeeded; the run then emitted `missing_tool` with `reason: "missing tool/permission issue: numerous permission denied errors detected"` and exited 1. Run did real work: 794k tokens, 21 turns, 12m0s.

**Recurrence (same workflow + engine, same window, `conclusion=failure`; signature strongly suspected):**
- **Test Quality Sentinel** — [§28218968236](https://github.com/github/gh-aw/actions/runs/28218968236) (05:22 UTC), [§28215990234](https://github.com/github/gh-aw/actions/runs/28215990234) (03:52 UTC). TQS failed 3× in the window while interleaved successful runs (06:03, 04:54, 04:38) prove the workflow itself is healthy.

**Same engine, root cause unconfirmed (may differ):** Changeset Generator `28215703557`, Go Logger Enhancement `28217478494`, Daily AstroStyleLite Markdown Spellcheck `28217439569`.

### Evidence

From `28221623302` artifacts (`safeoutputs.jsonl`), the `missing_tool` record's `alternatives` field shows the agent looping on **denied** Bash command forms before the terminal classification fired:
- `python3 -c "print(json.dumps(...))"` pipelines
- `safeoutputs add_comment . < /tmp/gh-aw/agent/payload.json` stdin redirection
- `with open('/tmp/gh-aw/agent/payload.json','w') as f: ...`

These are exploratory plumbing attempts; the agent *had already* emitted valid `add_comment` + `submit_pull_request_review` items. The denials were on extra/optional invocations, yet they tripped the `>=5` terminal threshold.
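The signature can be checked mechanically by tallying the record types in a run's `safeoutputs.jsonl`. The following is a minimal sketch, assuming each line is a JSON object with a `type` field such as `add_comment` or `missing_tool`; that record layout and the helper names `summarize_safe_outputs` and `looks_like_false_red` are illustrative assumptions, not taken from the harness source.

```python
import json
from collections import Counter

# Assumption: these record types are diagnostics, not delivered work.
DIAGNOSTIC_TYPES = {"missing_tool"}


def summarize_safe_outputs(jsonl_text):
    """Count records by their (assumed) `type` field, skipping blank or malformed lines."""
    counts = Counter()
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate partial or corrupted lines
        if isinstance(record, dict):
            counts[str(record.get("type", "unknown"))] += 1
    return counts


def looks_like_false_red(counts):
    """True when real safe-output items coexist with a missing_tool record."""
    real = sum(n for t, n in counts.items() if t not in DIAGNOSTIC_TYPES)
    return real > 0 and counts.get("missing_tool", 0) > 0
```

Applied to a failed run's artifact, a `True` result from `looks_like_false_red` suggests the run delivered work before the terminal classification fired. It does not by itself prove the exit code came from the permission-denied path.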

### Probable root cause

1. **Classification ignores the success signal.** `copilot-harness` derives a terminal `permission_denied` verdict purely from `permissionDeniedCount`, without checking whether `outputs.jsonl` already contains valid safe-output items. A run that produced its required output is reported as failed.
2. **safeoutputs CLI ergonomics.** The agent reaches for `python3 -c` JSON-encode pipelines and stdin-redirection forms that the Bash allowlist blocks, inflating `permissionDeniedCount` even though a simpler allowed `safeoutputs <tool> --param value` invocation exists.

### Proposed remediation

1. In `copilot-harness` failure classification, **suppress the `hasNumerousPermissionDenied` terminal verdict and exit 0** when the run produced ≥1 expected safe-output item in `outputs.jsonl`. Permission-denials should then yield at most a warning + `missing_tool` record, not a red run.
2. Optionally: only count denials toward the threshold for commands the agent was *required* to run (not optional/exploratory variants), or make the threshold configurable.
3. Reduce denial generation: prominently document the exact allowed `safeoutputs <tool> --param value` form, and/or widen the Bash allowlist to cover the common `safeoutputs <tool> . < file.json` and small `python3 -c` JSON-encode forms the agent reaches for.
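Remediation items 1 and 2 could be combined along these lines. This is a sketch of the decision logic only, under stated assumptions: the `Attempt` type, the function `classify_permission_denials`, and the verdict strings are hypothetical and do not mirror the actual `copilot-harness` code, and the branch covers only the permission-denied path, not other failure classes.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    """Hypothetical summary of one agent attempt."""
    permission_denied_count: int
    expected_output_count: int  # safe-output items, excluding missing_tool records


def classify_permission_denials(attempt, threshold=5):
    """Return (exit_code, verdict) for the permission-denied branch only.

    The default threshold mirrors the `permissionDeniedCount>=5` condition
    seen in the logs; making it a parameter keeps it configurable.
    """
    numerous = attempt.permission_denied_count >= threshold
    if not numerous:
        return 0, "ok"
    if attempt.expected_output_count >= 1:
        # Work was delivered: downgrade to a warning and keep the run green.
        return 0, "warning_permission_denied"
    # No expected output: the denials plausibly blocked the task.
    return 1, "permission_denied"
```

With this shape, a run like §28221623302 (at least 5 denials, two completed safe-output items) would map to `(0, "warning_permission_denied")`, while a run with the same denials and no expected output would still exit 1.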

### Success criteria / verification

- A re-run of PR Description Updater / Test Quality Sentinel that emits its safe-outputs concludes `success` (green), even when permission-denials occurred.
- `permissionDeniedCount` no longer forces exit 1 when `outputs.jsonl` contains at least one expected safe-output item (a lone `missing_tool` record should not count as output).
- Scheduled Test Quality Sentinel false-red rate for this signature drops to ~0.
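The third criterion needs a measurable definition of "false-red rate". One possible proxy, sketched below, is the share of runs that concluded `failure` despite producing at least one expected safe-output item. The function name `false_red_rate` and the `(conclusion, expected_output_count)` input shape are assumptions for illustration, and the proxy will over-count if a run fails for an unrelated reason after emitting output.

```python
def false_red_rate(runs):
    """Fraction of runs that failed despite producing expected safe-outputs.

    `runs` is an iterable of (conclusion, expected_output_count) pairs.
    """
    runs = list(runs)
    if not runs:
        return 0.0
    false_reds = sum(
        1
        for conclusion, expected_outputs in runs
        if conclusion == "failure" and expected_outputs >= 1
    )
    return false_reds / len(runs)
```

Tracking this value per workflow before and after the fix gives a direct check on whether the signature has stopped recurring.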

### Existing-issue correlation

Distinct from all open `agentic-workflows` issues: #41195 (BYOK 403 — genuine provider-auth failure, *no* work produced), #41455 (firewall/DNS startup), #41456 (patch-parser under-detection), #41355 (`workflow_call` permissions), #41293 (Copilot Python SDK `ModuleNotFound`). None covers the "succeeds-then-marked-failed via permission-denied count" signature.

### Other observations this window (not filed — insufficient/ambiguous evidence)

- **GitHub Remote MCP Authentication Test** `28221753707` failed at step *Start MCP Gateway* with 0 token usage (agent never invoked); raw gateway error not captured. Possibly related to #41455 but a different step — insufficient evidence to file.
- **Daily Sub-Agent Model Resolution Audit** (Codex) `28221799739` — 0 token usage, no clear signature in available logs.
- **Code Simplifier** `28217567247` — Copilot engine; consistent with existing #41195.
- 6 Smoke CI / Q runs were `cancelled` (not failures) and were excluded.

**References:**
- https://github.com/github/gh-aw/actions/runs/28222667341
- https://github.com/github/gh-aw/actions/runs/28221623302
- https://github.com/github/gh-aw/actions/runs/28218968236







> Generated by [🔍 [aw] Failure Investigator (6h)](https://github.com/github/gh-aw/actions/runs/28225339885) · 238.7 AIC · ⌖ 36.5 AIC · ⊞ 5.3K · [◷](https://github.com/search?q=repo%3Agithub%2Fgh-aw+is%3Aissue+%22gh-aw-workflow-call-id%3A+github%2Fgh-aw%2Faw-failure-investigator%22&type=issues)
> - [x] expires on Jul 3, 2026, 12:16 AM UTC-08:00
