test(orchestration): probe #511 root causes — premature-END (refuted) + frustration handling (reproduced)#599
Draft
wyuc wants to merge 2 commits into
Draft
test(orchestration): probe #511 root causes — premature-END (refuted) + frustration handling (reproduced)#599wyuc wants to merge 2 commits into
wyuc wants to merge 2 commits into
Conversation
Adds eval/orchestration/ following the outline-language pattern: a runner that A/B-tests the director against the same scenario with two prompt variants — current main (post-#554) and a synthesised pre-#554 baseline that strips the role-aware summary labels AND the new system.md rules 10/11/12 together. Pass criterion is now framed as a regression guard rather than a fixture discrimination test: every scenario's post-fix END rate must stay below EVAL_END_THRESHOLD (default 20%). The pre-vs-post Δ is reported as informational data, since #554's reviewer feedback was that earlier fixtures didn't discriminate. Empirical finding across 7 shipped model configs (gpt-4.1-mini, gpt-4o-mini, gpt-5.4-nano, qwen-plus, qwen3.5-flash, deepseek-chat, deepseek-v4-flash, gemini-2.5-flash, claude-haiku-4-5) with 5 scenarios modelled on #511 (incl. the exact Tiananmen-3D objection trace, soft pushback after a long resolved-looking discussion, topic pivot, brief acknowledgement, and explicit teacher-signals-end-then-user-objects): no scenario produced a non-zero END rate in either variant. Useful data for the #554/#598 discussion — the prompt-layer rules don't measurably change behavior on these shipped models with these prompts, but the eval is now wired so future regressions show up.
follow-up) Mined prod data (3 days of chat-adapter director calls, ~47k samples on gemini-3-flash-preview which is the only model in prod chat) and found: - 0 / 1054 END decisions match #511's "user asks substantive Q → premature END" - BUT one user complained "你答非所问" seven times in 4 minutes; director routed to another peer agent every time instead of cueing USER or re-routing to the teacher So the real #511 symptom in prod is "director ignores frustration signal, keeps fueling agent variety", not premature END. This commit adds an A/B eval scoped to that: - 5 synthesised scenarios across math / biology / English grammar / physics / calculus, each capturing the prod shape: user asks specific Q → agents drift onto adjacent topics → user expresses frustration → another agent continues to ignore. Director is asked to pick next agent. - Decision rule-judged into USER | TEACHER | OTHER_AGENT | END. - baseline = current main director system.md; with_rule = same + appended "Handling User Feedback" rule injecting USER-or-teacher route on frustration signals. Run on prod model (google:gemini-3-flash-preview), 5 samples per variant: | Scenario | baseline | with_rule | Δ | |-----------------------------------------|----------|-----------|------| | math_quadratic_axis_generic_complaint | 0% | 100% | 100% | | bio_dark_reaction_explicit_correction | 0% | 40% | 40% | | english_team_isare_request_redo | 80% | 100% | 20% | | physics_inertial_mass_second_complaint | 0% | 100% | 100% | | calculus_product_rule_english | 0% | 100% | 100% | | **mean** | **16%** | **88%** | **72%** | Bug reproduced (4/5 scenarios show baseline=0% picking OTHER_AGENT, matching prod behavior). Rule lifts the mean by 72pp. Two scenarios don't meet the per-scenario PASS bar: - bio_dark_reaction: with_rule still picks the assistant 60% of the time (rule wording isn't strong enough to force role=teacher specifically) - english_team_isare: baseline is already 80% TEACHER — scenario doesn't reproduce the bug strongly, follow-up work to tighten Both are scenario / rule-wording iterations, not a directional issue. Scope: this commit is the eval only. The actual prompt fix (modifying lib/prompts/templates/director/system.md to add the rule) is a separate follow-up once the rule wording is tightened to PASS bio_dark_reaction too.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Probes #511 root causes via two complementary evals under `eval/orchestration/`. Originally framed as a premature-END regression guard (item 2 of #598); broadened after prod-data investigation showed premature-END isn't the actual symptom in prod.
Findings (TL;DR)
What's in this PR (2 evals)
`eval/orchestration/runner.ts` — premature-END regression guard
`eval/orchestration/frustration-runner.ts` — director frustration handling (NEW)
Result on google:gemini-3-flash-preview (the prod model):
4/5 scenarios reproduce the prod bug (baseline 0%, all picks land on OTHER_AGENT — matching the 答非所问 prod trace).
Per-scenario verdict
What's NOT in this PR
The actual prompt fix (modifying `lib/prompts/templates/director/system.md` to inject the rule) is a separate follow-up once the rule wording is tightened to also pass the bio scenario. Plus the agent-side prompt likely needs a parallel "Responding to user feedback" rule — agent SP currently has zero handling for user dissatisfaction (validated by reading `buildStructuredPrompt` + `agent-system/system.md`).
Out-of-scope discoveries worth flagging
Prod is still running pre-#554 director prompt (#554 not yet deployed). And `agent-loop.ts:614` maps every non-`cue_user` outcome to `status: 'completed'`, conflating LLM-END with empty_turns / no_done / max_turns. Both are independent of #554 — happy to follow up in separate PRs if desired.
Test plan
Want to wire `eval:orchestration-frustration` into a package.json script before flipping to Ready — will do in a follow-up commit.