ci: examples-windows + examples-linux failing since 2026-05-26 — rust-dataflow-git pin stale (74 commits behind main) #1945

@heyong4725

Symptom

ci/circleci: examples-windows (and examples-linux on most runs) has been failing on every main commit since 027d38ef (PR #1887, merged 2026-05-26 05:16 UTC) — 9 consecutive reds at the time of filing, no greens since.

Both jobs fail at the same step, "Rust Git Dataflow example" (cargo run --example rust-dataflow-git), with a CircleCI no_output_timeout: 30m cap:

[success ]  19.7m  Build examples + CLI binary
[success ]  10.4m  Rust Dataflow example
[timedout]  28.9m  Rust Git Dataflow example   ← here
[timedout]   0.0m  Build timed out

Sample failing runs:

Bisect

Commit    PR                                      examples-windows
536c5116  #1918                                   ✅ SUCCESS (last green)
027d38ef  #1887 "Stop properly on stop message"   first red
7aad98c7  #1888
4c376497  #1929
8e2f0233  #1931
630eec2b  #1935
eab18ab0  #1787 (zenoh-routing rewrite)
2a69ad72  #1941
f8aebe18  #1883
9f4242b6  #1884

Root cause hypothesis

examples/rust-dataflow-git/dataflow.yml pins each node to a specific dora commit via the rev: key:

nodes:
  - id: rust-node
    git: https://github.com/dora-rs/dora.git
    rev: 10cf7fe9c082caaa90679bcca48c873cdc16311b
    ...

10cf7fe9 is from 2026-04-28 — 74 commits behind current main. The file's own comment names exactly this scenario:

Smoke-tests git-sourced nodes. Pins to a dora commit (not a released tag) so the example runs against matching message-format versions without needing a release. Update rev: when a message-format-breaking change lands on main — otherwise the CI job catches the mismatch and signals a compatibility break, which is the whole point of this test.

Between 10cf7fe9 and current main, message-format-adjacent files have churned heavily (1598 ins, 349 del across libraries/message/ and node/daemon source). The most likely culprits:

That 027d38ef is the first red (it is the merge immediately after the last green in the bisect, and the first to land after 5/24) is most likely coincidental: the drift was already accumulating, and the pin crossed the breaking threshold around then. PR #1887 itself only touched the README, the Python sender, and smoke tests, so it almost certainly isn't the direct cause despite its bisect position.

Why it's slipped through the merge queue

Branch protection on main requires only these checks: Format, Clippy, Check, Typos, Audit (cargo-audit + cargo-deny), Unwrap budget, License check. examples-windows and examples-linux aren't in the required list, so Trunk Merge Queue doesn't gate on them. PRs land regardless.

Reproduction

# On current main (or any commit since 027d38ef):
cargo run --example rust-dataflow-git
# Hangs indefinitely; nodes built from pinned rev 10cf7fe9 don't exit.

On Linux the job hangs for ~48m before hitting the CircleCI wall; on Windows, ~29m. The hang itself probably starts much earlier; the no_output_timeout is what eventually kills it.
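For local reproduction it can help to bound the run instead of waiting for it to hang. The following is a minimal sketch, not part of the repository: run_with_deadline is a hypothetical helper, and the 15-minute deadline in the usage comment is an arbitrary assumption.

```python
import subprocess


def run_with_deadline(cmd, seconds):
    """Run cmd; return its exit code, or None if it exceeded the deadline."""
    try:
        return subprocess.run(cmd, timeout=seconds).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills the direct child on timeout; processes that
        # child spawned (e.g. dataflow nodes) may need separate cleanup.
        return None


# Hypothetical usage, with a 15-minute deadline chosen only for illustration:
# run_with_deadline(["cargo", "run", "--example", "rust-dataflow-git"], 15 * 60)
```

A None result here would correspond to the hang described above, while a non-zero exit code would point at a different failure mode.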

Proposed fixes

Short-term (unblock CI immediately)

Bump rev: in examples/rust-dataflow-git/dataflow.yml to a current main commit (e.g., 9f4242b6 or whatever's latest at fix time). One-line change in YAML. CI should go green on the next push.

If post-bump CI is STILL red at the same step, that confirms the regression isn't purely message-format drift and there's a real bug to chase — but I expect the bump alone is enough.
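The bump itself could be scripted roughly as follows. This is a sketch under assumptions: bump_rev is a hypothetical helper, it treats the file as text rather than parsing YAML, and it assumes every rev: entry holds a full 40-character SHA, as in the excerpt above.

```python
import re

# Matches a `rev:` line holding a 40-character hex commit SHA. Only spaces
# and tabs are allowed around it so neighbouring blank lines stay untouched.
REV_LINE = re.compile(r"^([ \t]*rev:[ \t]*)[0-9a-f]{40}[ \t]*$", re.MULTILINE)


def bump_rev(yaml_text, new_rev):
    """Return (updated_text, number_of_rev_lines_replaced)."""
    if not re.fullmatch(r"[0-9a-f]{40}", new_rev):
        raise ValueError("expected a full 40-character commit SHA")
    # Keep the original indentation and key; swap only the SHA.
    return REV_LINE.subn(lambda m: m.group(1) + new_rev, yaml_text)
```

Returning the replacement count lets a caller fail loudly if no rev: line matched, for example if the pins were changed to tags or short SHAs.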

Medium-term (prevent recurrence)

Either:

  1. Add examples-windows (or at least the examples-linux subset) to the required-checks list for main. This ensures Trunk Merge Queue gates on it. Risk: any flake blocks all PRs.

  2. Set up a scheduled job that runs cargo run --example rust-dataflow-git against HEAD of main once per day and auto-bumps the pin on success. Removes the manual maintenance step the file's comment relies on.

  3. Document in CONTRIBUTING.md (or the dataflow.yml comment more prominently) that any PR touching libraries/message/, apis/rust/node/src/, or daemon protocol code MUST bump the pin in the same PR. Trusts the contributor, costs nothing structurally.

(1) is the cheapest to implement but adds load to every PR. (3) is the cheapest in CI cost but trusts process. (2) is the most automated.
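If option (3) is chosen, the rule could also be checked mechanically rather than left to memory. The sketch below is an assumption, not existing tooling: pin_bump_missing is a hypothetical helper, it covers only the two concrete paths named above (daemon protocol code has no single path given here), and it treats any change to the pin file as a bump without verifying that rev: actually moved.

```python
# Paths taken from option (3); the helper and its name are hypothetical.
WATCHED_PREFIXES = ("libraries/message/", "apis/rust/node/src/")
PIN_FILE = "examples/rust-dataflow-git/dataflow.yml"


def pin_bump_missing(changed_paths):
    """True if a change set touches watched code but not the pin file."""
    paths = list(changed_paths)
    touches_watched = any(p.startswith(WATCHED_PREFIXES) for p in paths)
    return touches_watched and PIN_FILE not in paths
```

Fed with the list of files changed in a PR, a True result would mean the PR should either bump the pin or explain why no bump is needed.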

Severity

Medium. The bug is in CI-only test infrastructure and doesn't affect users. But:

  • Every PR's CI looks "red" on examples-windows/linux, which trains maintainers to ignore those statuses (bad habit).
  • The whole point of this example is to catch message-format breaks. A persistently-red detector that everyone tunes out is worse than no detector.
  • After 9 consecutive reds without anyone addressing it, the "is the test wrong, or is the code wrong?" question gets harder to answer for the next real break.

Acceptance criteria

  • examples/rust-dataflow-git/dataflow.yml rev: updated to a current main commit
  • First main CI run after the bump shows examples-windows and examples-linux green
  • Process decision recorded (one of the three medium-term options) in an ADR or CONTRIBUTING.md snippet

Related

cc @phil-opp (author of #1887, also touched the python-dataflow sender — though the bisect almost certainly fingers the wrong PR)
