fix(ci): split coverage step to avoid Linux hang#94
Merged
Conversation
The dotnet test step was re-building multi-target source projects (e.g. ExperimentFramework.Audit targets net8/9/10) immediately before launching the net10.0 test host, which caused the Audit.Tests host to hang indefinitely during coverlet data-collector initialisation. Adding --no-build skips the redundant rebuild (the Build (Release) step already produced all required binaries) and removes the timing window that triggered the hang. Also adds missing packages.lock.json for Governance.Persistence.Tests and Governance.Persistence.Redis.Tests so the NuGet cache key covers all test projects and the --use-lock-file restore does not regenerate lock files during the test step. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… hang When dotnet test runs the full solution with --collect:'XPlat Code Coverage', the test host process for ExperimentFramework.Tests (2094 tests, many project refs) crashes during AfterTestRunEnd coverage data collection. On Linux CI this manifests as a 55-minute silent hang (socket doesn't RST on crash), consuming the entire 60-minute job timeout. Fix: split the combined 'Test with coverage' step into two steps: 1. 'Test (verify correctness)' - full solution run, no coverage, no hang 2. 'Collect coverage (per-project)' - each project tested individually; isolated test hosts avoid the multi-session crash. The resulting coverage.cobertura.xml files are picked up by the existing reportgenerator step unchanged. Applied the same split to the release job's test step. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The three Distributed.Redis test classes start real Docker containers via
Testcontainers but had no [Trait("Category","integration")] marker, so
they ran in the standard CI filter and blocked the test host waiting for
Docker — causing the hang that appeared after Dashboard.UI.Tests.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Contributor
Dependency Review✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.Snapshot WarningsEnsure that dependencies are being submitted on PR branches and consider enabling retry-on-snapshot-warnings. See the documentation for more information and troubleshooting advice. Scanned FilesNone |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
The CI workflow's combined "Test with coverage" step was hanging on Ubuntu Linux runners for up to 55–60 minutes before hitting the job timeout. Tests themselves were always passing — the hang occurred during coverage data collection, not test execution.
Root cause: When
dotnet testruns the full solution with--collect:'XPlat Code Coverage', the test host forExperimentFramework.Tests(2,094 tests, many project refs) crashes duringAfterTestRunEndcoverage collection. On Linux this manifests as a silent hang — the socket doesn't reset on crash, so the runner waits indefinitely until the 60-minute job timeout fires.A secondary cause: three
Distributed.Redistest classes inExperimentFramework.Testsstart Docker containers via Testcontainers but had no[Trait("Category","integration")]marker, so they ran during the standard CI filter and blocked the test host waiting for Docker.Fix
Split the combined step into two separate steps:
Test (verify correctness)— runs the full solution without coverage collection; no hang, proves tests passCollect coverage (per-project)— runs each project individually with--collect; isolated test hosts avoid the multi-session crash. The resultingcoverage.cobertura.xmlfiles are picked up by the existingreportgeneratorstep unchanged.Add
--no-buildto the test step to prevent a redundant rebuild race condition after theBuild (Release)step.Add missing
packages.lock.jsonfiles forGovernance.Persistence.TestsandGovernance.Persistence.Redis.Testsso the NuGet cache key covers all test projects.Quarantine Testcontainers Redis tests — add
[Trait("Category","integration")]to the threeDistributed.Redistest classes so the standard CI filter correctly excludes them.What was not changed
🤖 Generated with Claude Code