test(orchestration): probe #511 root causes — premature-END (refuted) + frustration handling (reproduced) by wyuc · Pull Request #599 · THU-MAIC/OpenMAIC

wyuc · 2026-05-26T05:06:25Z

Probes #511 root causes via two complementary evals under `eval/orchestration/`. Originally framed as a premature-END regression guard (item 2 of #598); broadened after prod-data investigation showed premature-END isn't the actual symptom in prod.

Findings (TL;DR)

Premature-END is not a real bug in prod. The past 3 days of chat-adapter director calls (~47k on google:gemini-3-flash-preview, the only model in prod chat) show 1054 END decisions, 0 of which match #511's "user asks substantive Q → END" shape. The 3 END-after-[User] cases are ack-style closures ("Got it" / "OK" / "Thank you, teacher") and are legitimate.
The real symptom is "director ignores frustration signal". One prod user complained "you're not answering my question" (original: "你答非所问") 7 times in 4 minutes; the director routed to another peer agent every single time instead of cueing USER or pivoting to the teacher.
The rule fix lifts director behavior from a 16% to an 88% mean correct rate on synthesized frustration scenarios.

What's in this PR (2 evals)

`eval/orchestration/runner.ts` — premature-END regression guard

5 scenarios modelled on #511: the Tiananmen-3D trace, soft pushback, topic pivot, brief acknowledgement, and teacher-signals-end-then-user-objects
A/B: post-fix (current main = post-#554) vs pre-fix (rules 10/11/12 stripped + pre-#554 [User]/[Assistant] summary labels)
9 models × 5 scenarios × 5 samples: 0% END in both variants across the board. This confirms that #554 doesn't measurably change director END behavior on these prompts.
Kept as a regression guard: if anyone later changes the prompt and END rate jumps, this eval fails.

`eval/orchestration/frustration-runner.ts` — director frustration handling (NEW)

5 scenarios where user expresses frustration and an agent has already replied without acknowledging it (matching the prod shape where complaint isn't the last line — agent-loop has dispatched 1+ agents after)
Topics: math (二次函数对称轴), biology (光合暗反应), English (主谓一致), physics (惯性质量), calculus (product rule)
Decision rule-judged: USER | TEACHER | OTHER_AGENT | END
A/B: baseline (current main) vs with_rule (baseline + appended `# Handling User Feedback` rule)
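
The four-way judgment described above could be sketched as a small pure function. This is only an illustration: the `DirectorDecision` shape, the `Roster` mapping, and the function names are assumptions, not the code in `frustration-runner.ts`. Treating USER and TEACHER as the correct picks follows the PR's description of the desired behavior (cue USER or pivot to the teacher).

```typescript
// Hypothetical shape of a parsed director decision; the real runner may differ.
type DirectorDecision =
  | { kind: "end" }
  | { kind: "cue_user" }
  | { kind: "agent"; agentId: string };

// The four buckets named in the PR description.
type Bucket = "USER" | "TEACHER" | "OTHER_AGENT" | "END";

// Assumed fixture data: maps each agent id in a scenario to its role.
type Roster = Record<string, "teacher" | "assistant" | "student">;

// Rule-based judge: no LLM involved, just a lookup on the decision.
function classifyDecision(decision: DirectorDecision, roster: Roster): Bucket {
  switch (decision.kind) {
    case "end":
      return "END";
    case "cue_user":
      return "USER";
    case "agent":
      // Any dispatched agent that is not the teacher counts as OTHER_AGENT.
      return roster[decision.agentId] === "teacher" ? "TEACHER" : "OTHER_AGENT";
  }
}

// A pick counts as correct when the director yields to the user or the teacher.
function isCorrectPick(bucket: Bucket): boolean {
  return bucket === "USER" || bucket === "TEACHER";
}
```

Keeping the judge rule-based makes the correct rate deterministic for a given set of director outputs, so variance comes only from sampling the director itself.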

Result on google:gemini-3-flash-preview (the prod model):

| Scenario | baseline | with_rule | Δ |
|----------|----------|-----------|---|
| math_quadratic_axis_generic_complaint | 0% | 100% | +100 pp |
| bio_dark_reaction_explicit_correction | 0% | 40% | +40 pp |
| english_team_isare_request_redo | 80% | 100% | +20 pp |
| physics_inertial_mass_second_complaint | 0% | 100% | +100 pp |
| calculus_product_rule_english | 0% | 100% | +100 pp |
| **mean** | **16%** | **88%** | **+72 pp** |

4/5 scenarios reproduce the prod bug (baseline 0%, all picks land on OTHER_AGENT, matching the "you're not answering my question" prod trace).

Per-scenario verdict

3 scenarios PASS the strict bar (with_rule ≥ 70% AND Δ ≥ 30pp)
bio_dark_reaction: with_rule still 60% OTHER_AGENT — rule wording needs to force role=teacher more explicitly (follow-up)
english_team_isare: baseline already 80% TEACHER — scenario doesn't reproduce the bug strongly, follow-up to tighten
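
As a worked check of the verdicts above, the strict bar can be applied to the reported numbers. The rates are copied from the results table; the type and function names are illustrative assumptions, not taken from the runner.

```typescript
// Correct-rate percentages copied from the results table (5 samples per variant).
interface ScenarioResult {
  name: string;
  baseline: number;
  withRule: number;
}

const results: ScenarioResult[] = [
  { name: "math_quadratic_axis_generic_complaint", baseline: 0, withRule: 100 },
  { name: "bio_dark_reaction_explicit_correction", baseline: 0, withRule: 40 },
  { name: "english_team_isare_request_redo", baseline: 80, withRule: 100 },
  { name: "physics_inertial_mass_second_complaint", baseline: 0, withRule: 100 },
  { name: "calculus_product_rule_english", baseline: 0, withRule: 100 },
];

// Strict bar from the PR text: with_rule >= 70% AND delta >= 30 pp.
function passesStrictBar(r: ScenarioResult, minRate = 70, minDelta = 30): boolean {
  return r.withRule >= minRate && r.withRule - r.baseline >= minDelta;
}

function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

const passed = results.filter((r) => passesStrictBar(r)).map((r) => r.name);
const meanBaseline = mean(results.map((r) => r.baseline));
const meanWithRule = mean(results.map((r) => r.withRule));

// Prints the three passing scenarios and the means 16 and 88 (delta 72).
console.log({ passed, meanBaseline, meanWithRule, meanDelta: meanWithRule - meanBaseline });
```

The bio scenario fails on the rate condition (40% < 70%) and the English scenario fails on the delta condition (20 pp < 30 pp), which matches the two follow-ups listed above.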

What's NOT in this PR

The actual prompt fix (modifying `lib/prompts/templates/director/system.md` to inject the rule) is a separate follow-up once the rule wording is tightened to also pass the bio scenario. Plus the agent-side prompt likely needs a parallel "Responding to user feedback" rule — agent SP currently has zero handling for user dissatisfaction (validated by reading `buildStructuredPrompt` + `agent-system/system.md`).

Out-of-scope discoveries worth flagging

Prod is still running pre-#554 director prompt (#554 not yet deployed). And `agent-loop.ts:614` maps every non-`cue_user` outcome to `status: 'completed'`, conflating LLM-END with empty_turns / no_done / max_turns. Both are independent of #554 — happy to follow up in separate PRs if desired.
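
One possible way to stop conflating these outcomes is to carry the reason alongside the status. The sketch below is hypothetical: the outcome labels come from the paragraph above, but `LoopOutcome`, `CompletionReason`, and `describeCompletion` are invented names, and the actual code around `agent-loop.ts:614` may be structured differently.

```typescript
// Outcome labels taken from the PR text; type and function names are hypothetical.
type LoopOutcome = "cue_user" | "llm_end" | "empty_turns" | "no_done" | "max_turns";

// Every outcome that currently collapses into status "completed".
type CompletionReason = Exclude<LoopOutcome, "cue_user">;

interface Completion {
  status: "completed";
  // Preserves which path ended the loop so analytics can tell them apart.
  reason: CompletionReason;
}

// Returns null for cue_user, whose status is not described in the PR
// and would be handled by existing code.
function describeCompletion(outcome: LoopOutcome): Completion | null {
  if (outcome === "cue_user") return null;
  // TypeScript narrows `outcome` to CompletionReason after the check above.
  return { status: "completed", reason: outcome };
}
```

With a `reason` field, a prod query for genuine director END decisions could filter on `llm_end` alone instead of counting every `completed` loop.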

Test plan

`pnpm check` clean
`pnpm test` — no new failures (9 pre-existing ssrf-guard DNS-bound failures unrelated)
`EVAL_DIRECTOR_MODEL=google:gemini-3-flash-preview pnpm eval:orchestration` — premature-END eval passes (post-fix regression guard PASS, no scenario discriminates which is the expected null finding)
`EVAL_DIRECTOR_MODEL=google:gemini-3-flash-preview npx tsx --env-file=.env.local eval/orchestration/frustration-runner.ts` — 3/5 scenarios PASS, mean Δ=72pp, bug reproduces

Want to wire `eval:orchestration-frustration` into a package.json script before flipping to Ready — will do in a follow-up commit.

Adds eval/orchestration/ following the outline-language pattern: a runner that A/B-tests the director against the same scenario with two prompt variants, current main (post-#554) and a synthesised pre-#554 baseline that strips the role-aware summary labels AND the new system.md rules 10/11/12 together.

Pass criterion is now framed as a regression guard rather than a fixture discrimination test: every scenario's post-fix END rate must stay below EVAL_END_THRESHOLD (default 20%). The pre-vs-post Δ is reported as informational data, since #554's reviewer feedback was that earlier fixtures didn't discriminate.

Empirical finding across 9 shipped model configs (gpt-4.1-mini, gpt-4o-mini, gpt-5.4-nano, qwen-plus, qwen3.5-flash, deepseek-chat, deepseek-v4-flash, gemini-2.5-flash, claude-haiku-4-5) with 5 scenarios modelled on #511 (incl. the exact Tiananmen-3D objection trace, soft pushback after a long resolved-looking discussion, topic pivot, brief acknowledgement, and explicit teacher-signals-end-then-user-objects): no scenario produced a non-zero END rate in either variant.

Useful data for the #554/#598 discussion: the prompt-layer rules don't measurably change behavior on these shipped models with these prompts, but the eval is now wired so future regressions show up.
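
The pass criterion described in this commit message can be sketched as follows. This is a minimal illustration under stated assumptions: the `Decision` type is a simplification, the function names are hypothetical, and the threshold is passed in as a fraction because the message does not say how `EVAL_END_THRESHOLD` is encoded.

```typescript
// Simplified decision label; the real runner likely records richer output.
type Decision = "END" | "CONTINUE";

// Fraction of samples in which the director chose END.
function endRate(samples: Decision[]): number {
  if (samples.length === 0) return 0;
  return samples.filter((d) => d === "END").length / samples.length;
}

// Regression guard: every scenario's post-fix END rate must stay strictly
// below the threshold (0.2 mirrors the stated 20% default).
function guardPasses(
  postFixSamples: Record<string, Decision[]>,
  threshold = 0.2,
): boolean {
  return Object.keys(postFixSamples).every(
    (scenario) => endRate(postFixSamples[scenario]) < threshold,
  );
}
```

Under this reading, with 5 samples per scenario a single END gives a 20% rate, which is not below the default threshold, so one premature END in any scenario would already fail the guard.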

follow-up) Mined prod data (3 days of chat-adapter director calls, ~47k samples on gemini-3-flash-preview, which is the only model in prod chat) and found:

- 0 / 1054 END decisions match #511's "user asks substantive Q → premature END"
- BUT one user complained "you're not answering my question" (original: "你答非所问") seven times in 4 minutes; the director routed to another peer agent every time instead of cueing USER or re-routing to the teacher

So the real #511 symptom in prod is "director ignores frustration signal, keeps fueling agent variety", not premature END.

This commit adds an A/B eval scoped to that:

- 5 synthesised scenarios across math / biology / English grammar / physics / calculus, each capturing the prod shape: user asks specific Q → agents drift onto adjacent topics → user expresses frustration → another agent continues to ignore. Director is asked to pick the next agent.
- Decision rule-judged into USER | TEACHER | OTHER_AGENT | END.
- baseline = current main director system.md; with_rule = same + appended "Handling User Feedback" rule injecting USER-or-teacher route on frustration signals.

Run on prod model (google:gemini-3-flash-preview), 5 samples per variant:

| Scenario | baseline | with_rule | Δ |
|-----------------------------------------|----------|-----------|------|
| math_quadratic_axis_generic_complaint | 0% | 100% | 100 pp |
| bio_dark_reaction_explicit_correction | 0% | 40% | 40 pp |
| english_team_isare_request_redo | 80% | 100% | 20 pp |
| physics_inertial_mass_second_complaint | 0% | 100% | 100 pp |
| calculus_product_rule_english | 0% | 100% | 100 pp |
| **mean** | **16%** | **88%** | **72 pp** |

Bug reproduced (4/5 scenarios show baseline=0% picking OTHER_AGENT, matching prod behavior). Rule lifts the mean by 72pp.

Two scenarios don't meet the per-scenario PASS bar:

- bio_dark_reaction: with_rule still picks the assistant 60% of the time (rule wording isn't strong enough to force role=teacher specifically)
- english_team_isare: baseline is already 80% TEACHER, so the scenario doesn't reproduce the bug strongly; follow-up work to tighten

Both are scenario / rule-wording iterations, not a directional issue.

Scope: this commit is the eval only. The actual prompt fix (modifying lib/prompts/templates/director/system.md to add the rule) is a separate follow-up once the rule wording is tightened to PASS bio_dark_reaction too.

wyuc mentioned this pull request May 26, 2026

Follow-up to #554: remaining items for #511 #598

Open

ashutoshrana mentioned this pull request May 26, 2026

fix(orchestration): add agent encoding note, tighten Rule 11, add edge-case tests [AI-assisted] #600

Closed

8 tasks

wyuc changed the title ~~test(orchestration): premature-END regression eval (follow-up to #554)~~ test(orchestration): probe #511 root causes — premature-END (refuted) + frustration handling (reproduced) May 26, 2026

wyuc wants to merge 2 commits into main from feat/598-eval-orchestration
