
test(orchestration): probe #511 root causes — premature-END (refuted) + frustration handling (reproduced)#599

Draft: wyuc wants to merge 2 commits into main from feat/598-eval-orchestration


wyuc (Contributor) commented May 26, 2026

Probes #511 root causes via two complementary evals under `eval/orchestration/`. Originally framed as a premature-END regression guard (item 2 of #598); broadened after prod-data investigation showed premature-END isn't the actual symptom in prod.

Findings (TL;DR)

  1. Premature END is not a real bug in prod. The past 3 days of chat-adapter director calls (~47k on google:gemini-3-flash-preview, the only model in prod chat) contain 1054 END decisions, 0 of which match the "user asks substantive Q → END" shape from #511 (Multi-agent dialogue: off-topic replies, role confusion, premature discussion end). The 3 END-after-[User] cases are acknowledgement-style closures ("明白" / "understood", "OK", "谢谢老师" / "thank you, teacher") and are legitimate.
  2. The real symptom is "director ignores frustration signal". One prod user complained "你答非所问" ("you're not answering my question") 7 times in 4 minutes; the director routed to another peer agent every single time instead of cueing USER or pivoting to the teacher.
  3. A rule fix lifts the director's mean correct rate from 16% to 88% (+72pp) on synthesized frustration scenarios.

What's in this PR (2 evals)

`eval/orchestration/runner.ts` — premature-END regression guard

`eval/orchestration/frustration-runner.ts` — director frustration handling (NEW)

  • 5 scenarios in which the user expresses frustration and an agent has already replied without acknowledging it. This matches the prod shape, where the complaint isn't the last line because agent-loop has dispatched 1+ agents after it.
  • Topics: math (axis of symmetry of a quadratic function), biology (dark reactions of photosynthesis), English (subject-verb agreement), physics (inertial mass), calculus (product rule)
  • Decision rule-judged: USER | TEACHER | OTHER_AGENT | END
  • A/B: baseline (current main) vs with_rule (baseline + appended `# Handling User Feedback` rule)
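The rule-based judging described above could look roughly like the sketch below. It is not the code in `frustration-runner.ts`: the `Agent` shape, the `judgeDecision` and `correctRate` names, and the assumption that the director returns either `USER`, `END`, or an agent id are all hypothetical. Treating USER and TEACHER as the "correct" picks is inferred from the findings, not quoted from the runner.

```typescript
// Hypothetical types: the real runner's shapes are not shown in this PR.
type Role = "teacher" | "assistant" | "student";
type Decision = "USER" | "TEACHER" | "OTHER_AGENT" | "END";

interface Agent {
  id: string;
  role: Role;
}

// Assumes the director answers with "USER", "END", or the id of the next agent.
function judgeDecision(nextAgent: string, agents: Agent[]): Decision {
  const pick = nextAgent.trim();
  if (pick.toUpperCase() === "USER") return "USER";
  if (pick.toUpperCase() === "END") return "END";
  const agent = agents.find((a) => a.id === pick);
  // Unknown ids fall through to OTHER_AGENT so they never count as correct.
  return agent?.role === "teacher" ? "TEACHER" : "OTHER_AGENT";
}

// Assumption: a sample counts as correct when it lands on USER or TEACHER.
function correctRate(decisions: Decision[]): number {
  if (decisions.length === 0) return 0;
  const ok = decisions.filter((d) => d === "USER" || d === "TEACHER").length;
  return ok / decisions.length;
}
```

Keeping the judge purely rule-based means the A/B numbers do not depend on a second LLM call, at the cost of needing an exact mapping from the director's output to agent roles.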

Result on google:gemini-3-flash-preview (the prod model):

| Scenario | baseline | with_rule | Δ |
|----------|----------|-----------|---|
| math_quadratic_axis_generic_complaint | 0% | 100% | +100% |
| bio_dark_reaction_explicit_correction | 0% | 40% | +40% |
| english_team_isare_request_redo | 80% | 100% | +20% |
| physics_inertial_mass_second_complaint | 0% | 100% | +100% |
| calculus_product_rule_english | 0% | 100% | +100% |
| mean | 16% | 88% | +72% |

4/5 scenarios reproduce the prod bug (baseline 0%, with every pick landing on OTHER_AGENT, matching the "你答非所问" / "you're not answering my question" prod trace).

Per-scenario verdict

  • 3 scenarios PASS the strict bar (with_rule ≥ 70% AND Δ ≥ 30pp)
  • bio_dark_reaction: with_rule still 60% OTHER_AGENT — rule wording needs to force role=teacher more explicitly (follow-up)
  • english_team_isare: baseline already 80% TEACHER — scenario doesn't reproduce the bug strongly, follow-up to tighten
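As a cross-check of the verdicts above, the strict bar (with_rule ≥ 70% AND Δ ≥ 30pp) can be applied to the reported percentages in a few lines. The `ScenarioResult` shape and the helper names are illustrative only; the numbers are copied from the results table.

```typescript
// Percentages copied from the results table in this PR description.
interface ScenarioResult {
  name: string;
  baseline: number; // correct rate in percent
  withRule: number; // correct rate in percent
}

const results: ScenarioResult[] = [
  { name: "math_quadratic_axis_generic_complaint", baseline: 0, withRule: 100 },
  { name: "bio_dark_reaction_explicit_correction", baseline: 0, withRule: 40 },
  { name: "english_team_isare_request_redo", baseline: 80, withRule: 100 },
  { name: "physics_inertial_mass_second_complaint", baseline: 0, withRule: 100 },
  { name: "calculus_product_rule_english", baseline: 0, withRule: 100 },
];

const MIN_WITH_RULE = 70; // percent
const MIN_DELTA_PP = 30; // percentage points

// A scenario passes only if both conditions of the strict bar hold.
function passesStrictBar(r: ScenarioResult): boolean {
  return r.withRule >= MIN_WITH_RULE && r.withRule - r.baseline >= MIN_DELTA_PP;
}

function mean(xs: number[]): number {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

const passed = results.filter(passesStrictBar).map((r) => r.name);
const meanBaseline = mean(results.map((r) => r.baseline));
const meanWithRule = mean(results.map((r) => r.withRule));

console.log(passed.length, meanBaseline, meanWithRule); // 3 16 88
```

The bio scenario fails on the first condition (40% < 70%) and the English scenario on the second (Δ = 20pp < 30pp), which is why only 3 of 5 pass even though the mean Δ is 72pp.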

What's NOT in this PR

The actual prompt fix (modifying `lib/prompts/templates/director/system.md` to inject the rule) is a separate follow-up once the rule wording is tightened to also pass the bio scenario. The agent-side prompt likely needs a parallel "Responding to user feedback" rule as well: the agent system prompt currently has zero handling for user dissatisfaction (validated by reading `buildStructuredPrompt` + `agent-system/system.md`).

Out-of-scope discoveries worth flagging

Prod is still running pre-#554 director prompt (#554 not yet deployed). And `agent-loop.ts:614` maps every non-`cue_user` outcome to `status: 'completed'`, conflating LLM-END with empty_turns / no_done / max_turns. Both are independent of #554 — happy to follow up in separate PRs if desired.
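One possible shape for untangling the status conflation is sketched below. This is not the code at `agent-loop.ts:614`: the `LoopOutcome` and `LoopResult` types, the `llm_end` label, and the `awaiting_user` status are hypothetical names chosen for illustration. Only `cue_user`, `empty_turns`, `no_done`, `max_turns`, and `status: 'completed'` appear in the description above.

```typescript
// "llm_end" and "awaiting_user" are hypothetical labels for this sketch;
// the other outcome names are taken from the PR description.
type LoopOutcome = "cue_user" | "llm_end" | "empty_turns" | "no_done" | "max_turns";

interface LoopResult {
  status: "awaiting_user" | "completed";
  // Preserved so downstream analysis can tell a director END from a loop abort.
  endReason?: Exclude<LoopOutcome, "cue_user">;
}

function toLoopResult(outcome: LoopOutcome): LoopResult {
  if (outcome === "cue_user") return { status: "awaiting_user" };
  return { status: "completed", endReason: outcome };
}

// True only when the director itself chose END, not when the loop gave up.
function isDirectorEnd(result: LoopResult): boolean {
  return result.endReason === "llm_end";
}
```

Keeping `status` unchanged and adding a separate reason field would leave existing consumers of `status: 'completed'` untouched while making prod END counts easier to audit.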

Test plan

  • `pnpm check` clean
  • `pnpm test` — no new failures (9 pre-existing ssrf-guard DNS-bound failures unrelated)
  • `EVAL_DIRECTOR_MODEL=google:gemini-3-flash-preview pnpm eval:orchestration` — premature-END eval passes (post-fix regression guard PASS, no scenario discriminates which is the expected null finding)
  • `EVAL_DIRECTOR_MODEL=google:gemini-3-flash-preview npx tsx --env-file=.env.local eval/orchestration/frustration-runner.ts` — 3/5 scenarios PASS, mean Δ=72pp, bug reproduces

Want to wire `eval:orchestration-frustration` into a package.json script before flipping to Ready — will do in a follow-up commit.

Adds eval/orchestration/ following the outline-language pattern: a runner
that A/B-tests the director against the same scenario with two prompt
variants — current main (post-#554) and a synthesised pre-#554 baseline
that strips the role-aware summary labels AND the new system.md rules
10/11/12 together.

Pass criterion is now framed as a regression guard rather than a fixture
discrimination test: every scenario's post-fix END rate must stay below
EVAL_END_THRESHOLD (default 20%). The pre-vs-post Δ is reported as
informational data, since #554's reviewer feedback was that earlier
fixtures didn't discriminate.
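
The pass criterion above can be sketched as follows. This is an illustration, not the code in `runner.ts`: the `ScenarioEndRate` shape and function names are hypothetical, and it assumes EVAL_END_THRESHOLD is given as a fraction (0.2 for 20%), which the commit message does not specify.

```typescript
// Hypothetical shape; rates are fractions in [0, 1].
interface ScenarioEndRate {
  name: string;
  preFixEndRate: number;
  postFixEndRate: number;
}

// Assumption: the env var holds a fraction such as "0.2"; unset or
// unparsable values fall back to the documented default of 20%.
function endThreshold(env: Record<string, string | undefined>): number {
  const raw = env.EVAL_END_THRESHOLD;
  if (raw === undefined || raw.trim() === "") return 0.2;
  const parsed = Number(raw);
  return Number.isFinite(parsed) ? parsed : 0.2;
}

function regressionGuard(scenarios: ScenarioEndRate[], threshold: number) {
  // "Must stay below" is read strictly: a rate equal to the threshold fails.
  const failures = scenarios
    .filter((s) => s.postFixEndRate >= threshold)
    .map((s) => s.name);
  // The pre-vs-post delta is informational and never affects pass/fail.
  const deltas = scenarios.map((s) => ({
    name: s.name,
    delta: s.preFixEndRate - s.postFixEndRate,
  }));
  return { pass: failures.length === 0, failures, deltas };
}
```

In a Node runner the threshold would typically be read with `endThreshold(process.env)`.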

Empirical finding across 9 shipped model configs (gpt-4.1-mini,
gpt-4o-mini, gpt-5.4-nano, qwen-plus, qwen3.5-flash, deepseek-chat,
deepseek-v4-flash, gemini-2.5-flash, claude-haiku-4-5) with 5 scenarios
modelled on #511 (incl. the exact Tiananmen-3D objection trace, soft
pushback after a long resolved-looking discussion, topic pivot, brief
acknowledgement, and explicit teacher-signals-end-then-user-objects):
no scenario produced a non-zero END rate in either variant. Useful
data for the #554/#598 discussion — the prompt-layer rules don't
measurably change behavior on these shipped models with these prompts,
but the eval is now wired so future regressions show up.

Mined prod data (3 days of chat-adapter director calls, ~47k samples on
gemini-3-flash-preview which is the only model in prod chat) and found:
- 0 / 1054 END decisions match #511's "user asks substantive Q → premature END"
- BUT one user complained "你答非所问" ("you're not answering my question")
  seven times in 4 minutes; director routed to another peer agent every
  time instead of cueing USER or re-routing to the teacher

So the real #511 symptom in prod is "director ignores frustration signal,
keeps fueling agent variety", not premature END.

This commit adds an A/B eval scoped to that:
- 5 synthesised scenarios across math / biology / English grammar / physics
  / calculus, each capturing the prod shape: user asks specific Q → agents
  drift onto adjacent topics → user expresses frustration → another agent
  continues to ignore. Director is asked to pick next agent.
- Decision rule-judged into USER | TEACHER | OTHER_AGENT | END.
- baseline = current main director system.md; with_rule = same +
  appended "Handling User Feedback" rule injecting USER-or-teacher route
  on frustration signals.
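
A minimal sketch of the A/B setup described above: build the two prompt variants, then sample the director N times per variant and tally the judged decisions. The `Director` signature, `buildVariants`, and `sampleVariant` are hypothetical names, and the rule text is passed in as a parameter because its wording is not reproduced in this commit message.

```typescript
type Decision = "USER" | "TEACHER" | "OTHER_AGENT" | "END";

// Hypothetical: wraps the LLM call plus rule-based judging of its output.
type Director = (systemPrompt: string, transcript: string) => Promise<Decision>;

// with_rule is the baseline prompt with the rule section appended verbatim.
function buildVariants(baselinePrompt: string, ruleText: string) {
  return {
    baseline: baselinePrompt,
    with_rule: `${baselinePrompt.trimEnd()}\n\n${ruleText.trim()}\n`,
  };
}

async function sampleVariant(
  director: Director,
  systemPrompt: string,
  transcript: string,
  samples = 5,
): Promise<Record<Decision, number>> {
  const tally: Record<Decision, number> = { USER: 0, TEACHER: 0, OTHER_AGENT: 0, END: 0 };
  // Sequential calls keep the sketch simple; a real runner might parallelize.
  for (let i = 0; i < samples; i++) {
    tally[await director(systemPrompt, transcript)] += 1;
  }
  return tally;
}
```

With 5 samples per variant each rate moves in 20pp steps, so the per-scenario figures are coarse.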

Run on prod model (google:gemini-3-flash-preview), 5 samples per variant:

| Scenario                                | baseline | with_rule | Δ    |
|-----------------------------------------|----------|-----------|------|
| math_quadratic_axis_generic_complaint   |   0%     |   100%    | 100% |
| bio_dark_reaction_explicit_correction   |   0%     |    40%    |  40% |
| english_team_isare_request_redo         |  80%     |   100%    |  20% |
| physics_inertial_mass_second_complaint  |   0%     |   100%    | 100% |
| calculus_product_rule_english           |   0%     |   100%    | 100% |
| **mean**                                | **16%**  |  **88%**  | **72%** |

Bug reproduced (4/5 scenarios show baseline=0% picking OTHER_AGENT,
matching prod behavior). Rule lifts the mean by 72pp.

Two scenarios don't meet the per-scenario PASS bar:
- bio_dark_reaction: with_rule still picks the assistant 60% of the time
  (rule wording isn't strong enough to force role=teacher specifically)
- english_team_isare: baseline is already 80% TEACHER — scenario doesn't
  reproduce the bug strongly, follow-up work to tighten

Both are scenario / rule-wording iterations, not a directional issue.

Scope: this commit is the eval only. The actual prompt fix (modifying
lib/prompts/templates/director/system.md to add the rule) is a separate
follow-up once the rule wording is tightened to PASS bio_dark_reaction
too.
wyuc changed the title from "test(orchestration): premature-END regression eval (follow-up to #554)" to the current title on May 26, 2026