
test hardening: protect pipeline result semantics across executor, dispatcher, CLI, web, and agent tools #463

@axisrow

Description


Context

We found a cross-layer bug where a successful pipeline_run for an action-only pipeline (react/forward/delete_message) showed 0 in the scheduler UI, because the system treated the result like a generation run and derived the task result from citations instead of from the actual processed-message semantics.

The deeper problem was not just the runtime bug itself, but that the test suite allowed it to pass:

  • tests covered status/form more than business meaning
  • layers interpreted pipeline result semantics independently
  • there was no shared invariant asserting that executor, service, dispatcher, CLI, web, and agent tools all agree on what a successful pipeline result means

This issue tracks hardening the tests so similar semantic drift bugs get caught earlier.

Summary

Strengthen tests around pipeline_run so they validate business invariants, not just “no exception” or 200 OK. The main goal is to lock result_kind, result_count, and messages_collected into a single cross-layer contract for generation pipelines, action-only pipelines, and mixed pipelines.
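As a minimal sketch of what "a single cross-layer contract" could look like, assuming the field names from this issue (the `PipelineResult` class itself is illustrative, not the real code):

```python
from dataclasses import dataclass, field
from typing import Any, Dict

# Hypothetical sketch of the shared result contract; field names
# follow the issue text, the dataclass is illustrative only.
@dataclass(frozen=True)
class PipelineResult:
    result_kind: str          # e.g. "generated_items" or "processed_messages"
    result_count: int         # the semantic count every layer must agree on
    messages_collected: int   # derived domain output, not a generic UI counter
    generated_text: str = ""
    metadata: Dict[str, Any] = field(default_factory=dict)

# Example: an action-only run that succeeded on 3 messages.
action_only = PipelineResult(
    result_kind="processed_messages",
    result_count=3,
    messages_collected=3,
)
assert action_only.result_count > 0  # must not render as 0 downstream
```

Freezing the dataclass keeps layers from mutating the contract after the executor produces it, which is one way to make "layers interpret the result independently" fail loudly in tests.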

Key Changes

  • Lock a single pipeline result contract as the test target:
    • result_kind
    • result_count
    • messages_collected
    • generated_text
    • metadata
  • Expand UnifiedDispatcher success-path tests:
    • assert exact messages_collected
    • cover generation run, action-only run, mixed run, and empty successful run
    • verify dispatcher uses semantic result fields rather than recomputing from citations
  • Expand PipelineExecutor and action-handler tests:
    • react / forward / delete_message success counting
    • partial-success and total-failure cases
    • mixed graph precedence: generation semantics win over action counters
  • Expand CLI tests:
    • pipeline runs and run-show assert result_kind / result_count
    • add regression case for completed run with empty generated_text but nonzero action result
  • Expand web scheduler tests:
    • assert concrete result cells/labels for pipeline_run
    • cover action-only label, generation label, and mixed task pages
    • verify web uses prepared run semantics instead of guessing from indirect fields
  • Expand agent-tools tests:
    • list_pipeline_runs and get_pipeline_run should surface result semantics for action-only runs
  • Add shared scenario factories/fixtures:
    • generation run fixture
    • action-only run fixture
    • mixed run fixture
    • reuse them across dispatcher/CLI/web/agent tests to prevent scenario drift
  • Add at least one cross-layer regression suite:
    • executor/service -> generation_run -> collection_task -> scheduler render
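The shared scenario factories above could be sketched as plain helpers (in the real suite they would likely be pytest fixtures; `make_run` and the dict shape are hypothetical):

```python
# Hypothetical scenario factories shared across dispatcher/CLI/web/agent
# tests to prevent scenario drift; names and fields are illustrative.
def make_run(kind, count, text="", metadata=None):
    return {
        "result_kind": kind,
        "result_count": count,
        "messages_collected": count,
        "generated_text": text,
        "metadata": metadata or {},
    }

def generation_run():
    return make_run("generated_items", 2, text="two generated items")

def action_only_run():
    return make_run("processed_messages", 3)

def mixed_run():
    # generation semantics win; action counters stay in metadata
    # for diagnostics only
    return make_run("generated_items", 2, text="summary",
                    metadata={"actions_succeeded": 3})
```

Because every layer's tests consume the same three factories, a change to the run shape breaks all of them at once instead of drifting silently.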

Test Scenarios

PipelineExecutor

  • action-only react pipeline -> processed_messages, count > 0
  • all actions fail -> processed_messages, count = 0
  • generation pipeline with text/citations -> generated_items
  • mixed generation + actions -> generated_items, with action counters preserved for diagnostics
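The four executor scenarios reduce to one precedence rule, sketched here with a hypothetical `classify_result` helper (not the real executor API):

```python
# Precedence rule the executor tests should pin down: generation
# semantics win over action counters. classify_result is hypothetical.
def classify_result(generated_items, actions_succeeded):
    if generated_items:  # generation semantics win
        return ("generated_items", len(generated_items))
    return ("processed_messages", actions_succeeded)

# action-only react pipeline, 3 messages processed
assert classify_result([], 3) == ("processed_messages", 3)
# all actions fail
assert classify_result([], 0) == ("processed_messages", 0)
# generation pipeline with text/citations
assert classify_result(["draft"], 0) == ("generated_items", 1)
# mixed: generation wins; action counters kept elsewhere for diagnostics
assert classify_result(["draft"], 3) == ("generated_items", 1)
```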

UnifiedDispatcher

  • run returns result_count=3 -> task stores messages_collected=3
  • run returns empty text but nonzero result_count -> task still stores nonzero semantic result
  • generation run still maps correctly to generation-oriented result semantics
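The dispatcher invariant these scenarios describe can be stated as a one-line rule, sketched with a hypothetical `store_task_result`:

```python
# Invariant: messages_collected is copied from the run's semantic
# result_count, never recomputed from citations or generated text.
# store_task_result is hypothetical glue, not the real dispatcher.
def store_task_result(run):
    return {"messages_collected": run["result_count"]}

run = {"result_count": 3, "generated_text": "", "citations": []}
task = store_task_result(run)
# empty text and zero citations must not zero the semantic result
assert task["messages_collected"] == 3
```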

CLI

  • pipeline runs prints correct semantic result summary for generation and action-only runs
  • run-show prints semantic fields even when generated text is empty
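A sketch of the regression case, assuming a hypothetical `format_run_summary` rendering helper (the real CLI tests would invoke the actual commands):

```python
# Hypothetical summary formatter for `pipeline runs` / `run-show`;
# the point is that the summary reads semantic fields, not text length.
def format_run_summary(run):
    return f"{run['result_kind']}: {run['result_count']}"

# regression case: completed run, empty generated_text, nonzero result
run = {"result_kind": "processed_messages", "result_count": 3,
       "generated_text": ""}
assert format_run_summary(run) == "processed_messages: 3"  # not "0"
```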

Web

  • /scheduler shows action-oriented label for action-only pipeline run
  • /scheduler shows generation-oriented label for generation run
  • mixed task page keeps non-pipeline tasks unaffected
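The label rule the scheduler route tests should assert might look like this (`result_label` and the label wording are assumptions, not the real templates):

```python
# Hypothetical label rule for /scheduler result cells: the label is
# chosen from result_kind, never guessed from indirect fields.
def result_label(run):
    if run["result_kind"] == "processed_messages":
        return f"{run['result_count']} messages processed"
    return f"{run['result_count']} items generated"

assert result_label({"result_kind": "processed_messages",
                     "result_count": 3}) == "3 messages processed"
assert result_label({"result_kind": "generated_items",
                     "result_count": 2}) == "2 items generated"
```

Route tests would then assert these concrete strings appear in the response body, not just a 200 status.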

Agent Tools

  • list_pipeline_runs and get_pipeline_run surface action-only result meaningfully

Anti-regression

  • add at least one end-to-end-style semantic test chain across executor/service/storage/task/UI boundaries
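The chain above can be sketched end to end; every function here is hypothetical glue, and the point is that one semantic value survives every boundary unchanged:

```python
# Cross-layer semantic chain: executor -> task storage -> web render.
# All three helpers are illustrative stand-ins for the real layers.
def executor_result(actions_succeeded):
    return {"result_kind": "processed_messages",
            "result_count": actions_succeeded,
            "generated_text": ""}

def task_row(run):            # dispatcher -> collection_task
    return {"messages_collected": run["result_count"]}

def scheduler_cell(task):     # web render
    return str(task["messages_collected"])

run = executor_result(3)
assert scheduler_cell(task_row(run)) == "3"  # never "0"
```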

Important Interface Assumptions

  • GenerationRun.result_kind and GenerationRun.result_count should be treated by tests as the semantic API, even if currently backed by metadata
  • collection_tasks.messages_collected for pipeline_run is derived domain output, not a generic UI counter
  • tests that care about semantics should prefer realistic GenerationRun fixtures over loose SimpleNamespace stubs
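One concrete reason to prefer realistic fixtures over loose SimpleNamespace stubs, shown with a hypothetical `RunFixture` (not the real `GenerationRun`): a typed fixture rejects a misspelled field at construction time, while a SimpleNamespace silently accepts it.

```python
from dataclasses import dataclass
from types import SimpleNamespace

# Hypothetical typed fixture standing in for GenerationRun.
@dataclass(frozen=True)
class RunFixture:
    result_kind: str
    result_count: int

# A loose stub accepts a typo without complaint...
loose = SimpleNamespace(result_knd="processed_messages")
assert loose.result_knd == "processed_messages"

# ...while the typed fixture fails fast on the same typo.
caught = False
try:
    RunFixture(result_knd="processed_messages", result_count=3)
except TypeError:
    caught = True
assert caught
assert RunFixture("processed_messages", 3).result_count == 3
```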

Acceptance Criteria

  • A successful action-only pipeline run cannot silently regress to 0 in CLI/web while remaining green in tests
  • Dispatcher, CLI, web, and agent tools all fail tests if they reinterpret pipeline result semantics inconsistently
  • Route tests for scheduler pages assert semantic content, not just response status
  • Mock-heavy success tests for pipeline_run validate result meaning, not only completion status
