Skip to content

fix(ci): split coverage step to avoid Linux hang#94

Merged
JerrettDavis merged 3 commits into
mainfrom
fix/ci-coverage-hang
Apr 29, 2026
Merged

fix(ci): split coverage step to avoid Linux hang#94
JerrettDavis merged 3 commits into
mainfrom
fix/ci-coverage-hang

Conversation

@JerrettDavis

Copy link
Copy Markdown
Owner

Problem

The CI workflow's combined "Test with coverage" step was hanging on Ubuntu Linux runners for up to 55–60 minutes before hitting the job timeout. Tests themselves were always passing — the hang occurred during coverage data collection, not test execution.

Root cause: When dotnet test runs the full solution with --collect:'XPlat Code Coverage', the test host for ExperimentFramework.Tests (2,094 tests, many project refs) crashes during AfterTestRunEnd coverage collection. On Linux this manifests as a silent hang — the socket doesn't reset on crash, so the runner waits indefinitely until the 60-minute job timeout fires.

A secondary cause: three Distributed.Redis test classes in ExperimentFramework.Tests start Docker containers via Testcontainers but had no [Trait("Category","integration")] marker, so they ran during the standard CI filter and blocked the test host waiting for Docker.

Fix

  1. Split the combined step into two separate steps:

    • Test (verify correctness) — runs the full solution without coverage collection; no hang, proves tests pass
    • Collect coverage (per-project) — runs each project individually with --collect; isolated test hosts avoid the multi-session crash. The resulting coverage.cobertura.xml files are picked up by the existing reportgenerator step unchanged.
  2. Add --no-build to the test step to prevent a redundant rebuild race condition after the Build (Release) step.

  3. Add missing packages.lock.json files for Governance.Persistence.Tests and Governance.Persistence.Redis.Tests so the NuGet cache key covers all test projects.

  4. Quarantine Testcontainers Redis tests — add [Trait("Category","integration")] to the three Distributed.Redis test classes so the standard CI filter correctly excludes them.

What was not changed

  • Test logic is unchanged — no tests were deleted or skipped
  • Coverage thresholds and reporting are unchanged
  • The same fix was applied to the release job's test step

🤖 Generated with Claude Code

GitHub Copilot and others added 3 commits April 28, 2026 19:42
The dotnet test step was re-building multi-target source projects (e.g.
ExperimentFramework.Audit targets net8/9/10) immediately before launching
the net10.0 test host, which caused the Audit.Tests host to hang
indefinitely during coverlet data-collector initialisation.

Adding --no-build skips the redundant rebuild (the Build (Release) step
already produced all required binaries) and removes the timing window
that triggered the hang.

Also adds missing packages.lock.json for Governance.Persistence.Tests
and Governance.Persistence.Redis.Tests so the NuGet cache key covers
all test projects and the --use-lock-file restore does not regenerate
lock files during the test step.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… hang

When dotnet test runs the full solution with --collect:'XPlat Code Coverage',
the test host process for ExperimentFramework.Tests (2094 tests, many project
refs) crashes during AfterTestRunEnd coverage data collection. On Linux CI
this manifests as a 55-minute silent hang (socket doesn't RST on crash),
consuming the entire 60-minute job timeout.

Fix: split the combined 'Test with coverage' step into two steps:
1. 'Test (verify correctness)' - full solution run, no coverage, no hang
2. 'Collect coverage (per-project)' - each project tested individually;
   isolated test hosts avoid the multi-session crash. The resulting
   coverage.cobertura.xml files are picked up by the existing reportgenerator
   step unchanged.

Applied the same split to the release job's test step.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The three Distributed.Redis test classes start real Docker containers via
Testcontainers but had no [Trait("Category","integration")] marker, so
they ran in the standard CI filter and blocked the test host waiting for
Docker — causing the hang that appeared after Dashboard.UI.Tests.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions

Copy link
Copy Markdown
Contributor

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Snapshot Warnings

⚠️: No snapshots were found for the head SHA 85b1650.
Ensure that dependencies are being submitted on PR branches and consider enabling retry-on-snapshot-warnings. See the documentation for more information and troubleshooting advice.

Scanned Files

None

@JerrettDavis JerrettDavis merged commit 9bb2da6 into main Apr 29, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant