
test hardening: protect pipeline result semantics across executor, dispatcher, CLI, web, and agent tools #463

@axisrow

Description


Context

We found a cross-layer bug where a successful pipeline_run for an action-only pipeline (react/forward/delete_message) showed 0 in the scheduler UI, because the system treated the result like a generation run and derived the task result from citations instead of from the actual processed-message semantics.

The deeper problem was not just the runtime bug itself, but that the test suite allowed it to pass:

  • tests covered status/form more than business meaning
  • layers interpreted pipeline result semantics independently
  • there was no shared invariant asserting that executor, service, dispatcher, CLI, web, and agent tools all agree on what a successful pipeline result means

This issue tracks hardening the tests so similar semantic drift bugs get caught earlier.

Summary

Strengthen tests around pipeline_run so they validate business invariants, not just “no exception” or 200 OK. The main goal is to lock result_kind, result_count, and messages_collected into a single cross-layer contract for generation pipelines, action-only pipelines, and mixed pipelines.
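As a minimal sketch of what "a single cross-layer contract" could look like, assuming the field names from this issue (the `PipelineResult` class itself is illustrative, not the real code):

```python
from dataclasses import dataclass, field
from typing import Any, Dict

# Hypothetical sketch of the shared result contract; field names
# follow the issue text, the dataclass is illustrative only.
@dataclass(frozen=True)
class PipelineResult:
    result_kind: str          # e.g. "generated_items" or "processed_messages"
    result_count: int         # the semantic count every layer must agree on
    messages_collected: int   # derived domain output, not a generic UI counter
    generated_text: str = ""
    metadata: Dict[str, Any] = field(default_factory=dict)

# Example: an action-only run that succeeded on 3 messages.
action_only = PipelineResult(
    result_kind="processed_messages",
    result_count=3,
    messages_collected=3,
)
assert action_only.result_count > 0  # must not render as 0 downstream
```

Freezing the dataclass keeps layers from mutating the contract after the executor produces it, which is one way to make "layers interpret the result independently" fail loudly in tests.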

Key Changes

  • Lock a single pipeline result contract as the test target:
    • result_kind
    • result_count
    • messages_collected
    • generated_text
    • metadata
  • Expand UnifiedDispatcher success-path tests:
    • assert exact messages_collected
    • cover generation run, action-only run, mixed run, and empty successful run
    • verify dispatcher uses semantic result fields rather than recomputing from citations
  • Expand PipelineExecutor and action-handler tests:
    • react / forward / delete_message success counting
    • partial-success and total-failure cases
    • mixed graph precedence: generation semantics win over action counters
  • Expand CLI tests:
    • pipeline runs and run-show assert result_kind / result_count
    • add regression case for completed run with empty generated_text but nonzero action result
  • Expand web scheduler tests:
    • assert concrete result cells/labels for pipeline_run
    • cover action-only label, generation label, and mixed task pages
    • verify web uses prepared run semantics instead of guessing from indirect fields
  • Expand agent-tools tests:
    • list_pipeline_runs and get_pipeline_run should surface result semantics for action-only runs
  • Add shared scenario factories/fixtures:
    • generation run fixture
    • action-only run fixture
    • mixed run fixture
    • reuse them across dispatcher/CLI/web/agent tests to prevent scenario drift
  • Add at least one cross-layer regression suite:
    • executor/service -> generation_run -> collection_task -> scheduler render
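The shared scenario factories above could be sketched as plain helpers (in the real suite they would likely be pytest fixtures; `make_run` and the dict shape are hypothetical):

```python
# Hypothetical scenario factories shared across dispatcher/CLI/web/agent
# tests to prevent scenario drift; names and fields are illustrative.
def make_run(kind, count, text="", metadata=None):
    return {
        "result_kind": kind,
        "result_count": count,
        "messages_collected": count,
        "generated_text": text,
        "metadata": metadata or {},
    }

def generation_run():
    return make_run("generated_items", 2, text="two generated items")

def action_only_run():
    return make_run("processed_messages", 3)

def mixed_run():
    # generation semantics win; action counters stay in metadata
    # for diagnostics only
    return make_run("generated_items", 2, text="summary",
                    metadata={"actions_succeeded": 3})
```

Because every layer's tests consume the same three factories, a change to the run shape breaks all of them at once instead of drifting silently.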

Test Scenarios

PipelineExecutor

  • action-only react pipeline -> processed_messages, count > 0
  • all actions fail -> processed_messages, count = 0
  • generation pipeline with text/citations -> generated_items
  • mixed generation + actions -> generated_items, with action counters preserved for diagnostics
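The four executor scenarios reduce to one precedence rule, sketched here with a hypothetical `classify_result` helper (not the real executor API):

```python
# Precedence rule the executor tests should pin down: generation
# semantics win over action counters. classify_result is hypothetical.
def classify_result(generated_items, actions_succeeded):
    if generated_items:  # generation semantics win
        return ("generated_items", len(generated_items))
    return ("processed_messages", actions_succeeded)

# action-only react pipeline, 3 messages processed
assert classify_result([], 3) == ("processed_messages", 3)
# all actions fail
assert classify_result([], 0) == ("processed_messages", 0)
# generation pipeline with text/citations
assert classify_result(["draft"], 0) == ("generated_items", 1)
# mixed: generation wins; action counters kept elsewhere for diagnostics
assert classify_result(["draft"], 3) == ("generated_items", 1)
```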

UnifiedDispatcher

  • run returns result_count=3 -> task stores messages_collected=3
  • run returns empty text but nonzero result_count -> task still stores nonzero semantic result
  • generation run still maps correctly to generation-oriented result semantics
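The dispatcher invariant these scenarios describe can be stated as a one-line rule, sketched with a hypothetical `store_task_result`:

```python
# Invariant: messages_collected is copied from the run's semantic
# result_count, never recomputed from citations or generated text.
# store_task_result is hypothetical glue, not the real dispatcher.
def store_task_result(run):
    return {"messages_collected": run["result_count"]}

run = {"result_count": 3, "generated_text": "", "citations": []}
task = store_task_result(run)
# empty text and zero citations must not zero the semantic result
assert task["messages_collected"] == 3
```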

CLI

  • pipeline runs prints correct semantic result summary for generation and action-only runs
  • run-show prints semantic fields even when generated text is empty
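A sketch of the regression case, assuming a hypothetical `format_run_summary` rendering helper (the real CLI tests would invoke the actual commands):

```python
# Hypothetical summary formatter for `pipeline runs` / `run-show`;
# the point is that the summary reads semantic fields, not text length.
def format_run_summary(run):
    return f"{run['result_kind']}: {run['result_count']}"

# regression case: completed run, empty generated_text, nonzero result
run = {"result_kind": "processed_messages", "result_count": 3,
       "generated_text": ""}
assert format_run_summary(run) == "processed_messages: 3"  # not "0"
```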

Web

  • /scheduler shows action-oriented label for action-only pipeline run
  • /scheduler shows generation-oriented label for generation run
  • mixed task page keeps non-pipeline tasks unaffected
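The label rule the scheduler route tests should assert might look like this (`result_label` and the label wording are assumptions, not the real templates):

```python
# Hypothetical label rule for /scheduler result cells: the label is
# chosen from result_kind, never guessed from indirect fields.
def result_label(run):
    if run["result_kind"] == "processed_messages":
        return f"{run['result_count']} messages processed"
    return f"{run['result_count']} items generated"

assert result_label({"result_kind": "processed_messages",
                     "result_count": 3}) == "3 messages processed"
assert result_label({"result_kind": "generated_items",
                     "result_count": 2}) == "2 items generated"
```

Route tests would then assert these concrete strings appear in the response body, not just a 200 status.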

Agent Tools

  • list_pipeline_runs and get_pipeline_run surface action-only result meaningfully

Anti-regression

  • add at least one end-to-end-style semantic test chain across executor/service/storage/task/UI boundaries
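The chain above can be sketched end to end; every function here is hypothetical glue, and the point is that one semantic value survives every boundary unchanged:

```python
# Cross-layer semantic chain: executor -> task storage -> web render.
# All three helpers are illustrative stand-ins for the real layers.
def executor_result(actions_succeeded):
    return {"result_kind": "processed_messages",
            "result_count": actions_succeeded,
            "generated_text": ""}

def task_row(run):            # dispatcher -> collection_task
    return {"messages_collected": run["result_count"]}

def scheduler_cell(task):     # web render
    return str(task["messages_collected"])

run = executor_result(3)
assert scheduler_cell(task_row(run)) == "3"  # never "0"
```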

Important Interface Assumptions

  • GenerationRun.result_kind and GenerationRun.result_count should be treated by tests as the semantic API, even if currently backed by metadata
  • collection_tasks.messages_collected for pipeline_run is derived domain output, not a generic UI counter
  • tests that care about semantics should prefer realistic GenerationRun fixtures over loose SimpleNamespace stubs
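One concrete reason to prefer realistic fixtures over loose SimpleNamespace stubs, shown with a hypothetical `RunFixture` (not the real `GenerationRun`): a typed fixture rejects a misspelled field at construction time, while a SimpleNamespace silently accepts it.

```python
from dataclasses import dataclass
from types import SimpleNamespace

# Hypothetical typed fixture standing in for GenerationRun.
@dataclass(frozen=True)
class RunFixture:
    result_kind: str
    result_count: int

# A loose stub accepts a typo without complaint...
loose = SimpleNamespace(result_knd="processed_messages")
assert loose.result_knd == "processed_messages"

# ...while the typed fixture fails fast on the same typo.
caught = False
try:
    RunFixture(result_knd="processed_messages", result_count=3)
except TypeError:
    caught = True
assert caught
assert RunFixture("processed_messages", 3).result_count == 3
```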

Acceptance Criteria

  • A successful action-only pipeline run cannot silently regress to 0 in CLI/web while remaining green in tests
  • Dispatcher, CLI, web, and agent tools all fail tests if they reinterpret pipeline result semantics inconsistently
  • Route tests for scheduler pages assert semantic content, not just response status
  • Mock-heavy success tests for pipeline_run validate result meaning, not only completion status
