## Context

We found a cross-layer bug where a successful `pipeline_run` for an action-only pipeline (`react`/`forward`/`delete_message`) ended up showing `0` in the scheduler/UI, because the system treated the result like a generation run and derived the task result from `citations` instead of from the actual processed-message semantics.
The deeper problem was not just the runtime bug itself, but that the test suite allowed it to pass:
- tests covered status/form more than business meaning
- layers interpreted pipeline result semantics independently
- there was no shared invariant asserting that executor, service, dispatcher, CLI, web, and agent tools all agree on what a successful pipeline result means
This issue tracks hardening the tests so similar semantic drift bugs get caught earlier.
## Summary

Strengthen tests around `pipeline_run` so they validate business invariants, not just “no exception” or `200 OK`. The main goal is to lock `result_kind`, `result_count`, and `messages_collected` into a single cross-layer contract covering generation pipelines, action-only pipelines, and mixed pipelines.
## Key Changes

- Lock a single pipeline result contract as the test target: `result_kind`, `result_count`, `messages_collected`, `generated_text`, `metadata`
- Expand `UnifiedDispatcher` success-path tests:
  - assert exact `messages_collected`
  - cover generation run, action-only run, mixed run, and empty successful run
  - verify the dispatcher uses semantic result fields rather than recomputing from `citations`
- Expand `PipelineExecutor` and action-handler tests:
  - `react` / `forward` / `delete_message` success counting
  - partial-success and total-failure cases
  - mixed-graph precedence: generation semantics win over action counters
- Expand CLI tests:
  - `pipeline runs` and `run-show` assert `result_kind` / `result_count`
  - add a regression case for a `completed` run with empty `generated_text` but a nonzero action result
- Expand web scheduler tests:
  - assert concrete result cells/labels for `pipeline_run`
  - cover action-only label, generation label, and mixed task pages
  - verify the web layer uses prepared run semantics instead of guessing from indirect fields
- Expand agent-tools tests:
  - `list_pipeline_runs` and `get_pipeline_run` should surface result semantics for action-only runs
- Add shared scenario factories/fixtures:
  - generation run fixture
  - action-only run fixture
  - mixed run fixture
  - reuse them across dispatcher/CLI/web/agent tests to prevent scenario drift
- Add at least one cross-layer regression suite: executor/service -> generation_run -> collection_task -> scheduler render
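The contract and fixture factories above could be sketched roughly as follows. All names here (`PipelineResult`, the factory helpers, the specific `result_kind` string values) are illustrative assumptions, not the project's actual API:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class PipelineResult:
    """Hypothetical shape of the cross-layer result contract."""
    result_kind: str            # assumed values: "generated_items" | "processed_messages"
    result_count: int           # semantic count; never recomputed from citations
    messages_collected: int     # what dispatcher/CLI/web/agent layers display
    generated_text: str = ""    # may legitimately be empty for action-only runs
    metadata: dict = field(default_factory=dict)

def action_only_run(count: int) -> PipelineResult:
    """Shared fixture factory: a successful react/forward/delete_message run."""
    return PipelineResult("processed_messages", count, count)

def generation_run(text: str, items: int) -> PipelineResult:
    """Shared fixture factory: a generation run that produced items."""
    return PipelineResult("generated_items", items, items, generated_text=text)

def mixed_run(text: str, items: int, actions_done: int) -> PipelineResult:
    """Shared fixture factory: generation semantics win; action counters kept for diagnostics."""
    return PipelineResult("generated_items", items, items, generated_text=text,
                          metadata={"actions_done": actions_done})
```

Reusing these factories across the dispatcher/CLI/web/agent suites is what keeps the scenarios from drifting apart again.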
## Test Scenarios

### PipelineExecutor

- action-only `react` pipeline -> `processed_messages`, count > 0
- all actions fail -> `processed_messages`, count = 0
- generation pipeline with text/citations -> `generated_items`
- mixed generation + actions -> `generated_items`, with action counters preserved for diagnostics
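The counting and precedence rules these scenarios pin down can be sketched minimally; `count_processed` and `classify_kind` are illustrative helpers, not the real `PipelineExecutor` interface:

```python
def count_processed(action_outcomes):
    """Count only actions that actually succeeded, so partial failure
    yields a partial count and total failure yields zero."""
    return sum(1 for ok in action_outcomes if ok)

def classify_kind(generated_items, processed_count):
    """Mixed-graph precedence: any generation output wins over action
    counters; a purely action-only run reports processed_messages."""
    return "generated_items" if generated_items else "processed_messages"
```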
### UnifiedDispatcher

- run returns `result_count=3` -> task stores `messages_collected=3`
- run returns empty text but nonzero `result_count` -> task still stores the nonzero semantic result
- generation run still maps correctly to generation-oriented result semantics
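The dispatcher invariant is simply a straight copy of the semantic field. A sketch of what the tests would assert (the mapping function is hypothetical; the real dispatcher interface may differ):

```python
from types import SimpleNamespace

def store_messages_collected(run):
    """Copy the semantic count straight from the run. Deliberately ignore
    citations and generated_text -- deriving the count from those is
    exactly where the original bug hid."""
    return run.result_count

# Regression scenario: action-only run with empty text and no citations.
run = SimpleNamespace(result_count=3, generated_text="", citations=[])
```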
### CLI

- `pipeline runs` prints a correct semantic result summary for generation and action-only runs
- `run-show` prints semantic fields even when the generated text is empty
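A sketch of the summary formatting the CLI tests would pin down; `format_run_summary` is an assumed helper name, not the actual CLI code:

```python
def format_run_summary(result_kind, result_count, generated_text):
    """One-line semantic summary that must stay meaningful when
    generated_text is empty but the action result is nonzero."""
    if result_kind == "processed_messages":
        return f"processed {result_count} message(s)"
    return f"generated {result_count} item(s)"
```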
### Web

- `/scheduler` shows an action-oriented label for an action-only pipeline run
- `/scheduler` shows a generation-oriented label for a generation run
- mixed task page keeps non-pipeline tasks unaffected
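The route tests should assert the rendered label, not just the status code. A minimal sketch, assuming a hypothetical `result_cell` render helper rather than the actual template code:

```python
def result_cell(result_kind, messages_collected):
    """What the scheduler page would render for a pipeline_run row."""
    label = ("messages processed" if result_kind == "processed_messages"
             else "items generated")
    return f"<td>{messages_collected} {label}</td>"

def test_scheduler_shows_action_label():
    # Assert concrete semantic content, not merely a 200 response.
    html = result_cell("processed_messages", 3)
    assert "3 messages processed" in html
```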
### Agent Tools

- `list_pipeline_runs` and `get_pipeline_run` surface action-only results meaningfully
### Anti-regression

- add at least one end-to-end-style semantic test chain across the executor/service/storage/task/UI boundaries
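The chain can be exercised without a full stack by composing one step per boundary. A self-contained sketch, with every function name assumed for illustration:

```python
from types import SimpleNamespace

def to_collection_task(run):
    """Service/dispatcher boundary: derive the task row from semantic fields."""
    return {"messages_collected": run.result_count}

def render_scheduler_cell(task):
    """UI boundary: render what the task row says, without re-deriving it."""
    return str(task["messages_collected"])

# The original regression: action-only run, empty text, no citations --
# the value 3 must survive every hop to the rendered cell.
run = SimpleNamespace(result_kind="processed_messages", result_count=3,
                      generated_text="", citations=[])
```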
## Important Interface Assumptions

- `GenerationRun.result_kind` and `GenerationRun.result_count` should be treated by tests as the semantic API, even if currently backed by `metadata`
- `collection_tasks.messages_collected` for `pipeline_run` is derived domain output, not a generic UI counter
- tests that care about semantics should prefer realistic `GenerationRun` fixtures over loose `SimpleNamespace` stubs
## Acceptance Criteria

- A successful action-only pipeline run cannot silently regress to `0` in CLI/web while remaining green in tests
- Dispatcher, CLI, web, and agent tools all fail tests if they reinterpret pipeline result semantics inconsistently
- Route tests for scheduler pages assert semantic content, not just response status
- Mock-heavy success tests for `pipeline_run` validate result meaning, not only completion status