Commit f8cd32e

doc: update README and include protocol for handling reliability issues (#678)
1 parent 0eea23d commit f8cd32e

README.md (+105 -105)
@@ -5,19 +5,17 @@ This repo is used for tracking flaky tests on the Node.js CI and fixing them.
 **Current status**: work in progress. Please go to the issue tracker to discuss!
 
 <!-- TOC -->
-
-- [Updating this repo](#updating-this-repo)
-- [The Goal](#the-goal)
+- [Node.js Core CI Reliability](#nodejs-core-ci-reliability)
+  - [Updating this repo](#updating-this-repo)
+  - [The Goal](#the-goal)
   - [The Definition of Green](#the-definition-of-green)
-- [CI Health History](#ci-health-history)
-- [Handling Failed CI runs](#handling-failed-ci-runs)
-  - [Flaky Tests](#flaky-tests)
-    - [Identifying Flaky Tests](#identifying-flaky-tests)
-    - [When Discovering a Potential New Flake on the CI](#when-discovering-a-potential-new-flake-on-the-ci)
-  - [Infrastructure failures](#infrastructure-failures)
-  - [Build File Failures](#build-file-failures)
-- [TODO](#todo)
-
+  - [CI Health History](#ci-health-history)
+  - [Protocols in improving CI reliability](#protocols-in-improving-ci-reliability)
+    - [Identifying flaky JS tests](#identifying-flaky-js-tests)
+    - [Handling flaky JS tests](#handling-flaky-js-tests)
+    - [Identifying infrastructure issues](#identifying-infrastructure-issues)
+    - [Handling infrastructure issues](#handling-infrastructure-issues)
+  - [TODO](#todo)
 <!-- /TOC -->
 
 ## Updating this repo
@@ -50,104 +48,106 @@ Make the CI green again.
 
 ## CI Health History
 
-See https://nodejs-ci-health.mmarchini.me/#/job-summary
-
-| UTC Time         | RUNNING | SUCCESS | UNSTABLE | ABORTED | FAILURE | Green Rate |
-| ---------------- | ------- | ------- | -------- | ------- | ------- | ---------- |
-| 2018-06-01 20:00 | 1       | 1       | 15       | 11      | 72      | 1.13%      |
-| 2018-06-03 11:36 | 3       | 6       | 21       | 10      | 60      | 6.89%      |
-| 2018-06-04 15:00 | 0       | 9       | 26       | 10      | 55      | 10.00%     |
-| 2018-06-15 17:42 | 1       | 27      | 4        | 17      | 51      | 32.93%     |
-| 2018-06-24 18:11 | 0       | 27      | 2        | 8       | 63      | 29.35%     |
-| 2018-07-08 19:40 | 1       | 35      | 2        | 4       | 58      | 36.84%     |
-| 2018-07-18 20:46 | 2       | 38      | 4        | 5       | 51      | 40.86%     |
-| 2018-07-24 22:30 | 2       | 46      | 3        | 4       | 45      | 48.94%     |
-| 2018-08-01 19:11 | 4       | 17      | 2        | 2       | 75      | 18.09%     |
-| 2018-08-14 15:42 | 5       | 22      | 0        | 14      | 59      | 27.16%     |
-| 2018-08-22 13:22 | 2       | 29      | 4        | 9       | 56      | 32.58%     |
-| 2018-10-31 13:28 | 0       | 40      | 13       | 4       | 43      | 41.67%     |
-| 2018-11-19 10:32 | 0       | 48      | 8        | 5       | 39      | 50.53%     |
-| 2018-12-08 20:37 | 2       | 18      | 4        | 3       | 73      | 18.95%     |
-
-## Handling Failed CI runs
-
-### Flaky Tests
-
-TODO: automate all of this in ncu-ci
-
-#### Identifying Flaky Tests
-
-When checking the CI results of a PR, if there is one or more failed tests (with
-`not ok` as the TAP result):
-
-1. If the failed test is not related to the PR (does not touch the modified
-   code path), search the test name in the issue tracker of this repo. If there
-   is an existing issue, add a reply there using the [reproduction template](./templates/repro.txt),
-   and open a pull request updating `flakes.json`.
-2. If there are no new existing issues about the test, run the CI again. If the
-   failure disappears in the next run, then it is potential flake. See
-   [When discovering a potential flake on the CI](#when-discovering-a-potential-new-flake-on-the-ci)
-   on what to do for a new flake.
-3. If the failure reproduces in the next run, it is likely that the failure is
-   related to the PR. Do not re-run CI without code changes in the next 24
-   hours, try to debug the failure.
-4. If the cause of the failure still cannot be identified 24 hours later, and
-   the code has not been changed, start a CI run and see if the failure
-   disappears. Go back to step 3 if the failure still reproduces, and go to
-   step 2 if the failure disappears.
-
-#### When Discovering a Potential New Flake on the CI
-
-1. Open an issue in this repo using [the flake issue template](./templates/flake.txt):
+[A GitHub workflow](.github/workflows/reliability_report.yml) runs every day
+to produce reliability reports of the `node-test-pull-request` CI and post
+them to [the issue tracker](https://github.com/nodejs/reliability/issues).
+
+## Protocols in improving CI reliability
+
+Most work starts with opening the issue tracker of this repository and
+reading the latest report. If the report is missing, see
+[the actions page](https://github.com/nodejs/reliability/actions) for
+details: GitHub's API restricts the length of issue messages, so
+whenever a report is too long the workflow can fail to post it as an
+issue, but it should still leave a summary on the actions page.
+
+### Identifying flaky JS tests
+
+1. Check out the `JSTest Failure` section of the latest reliability report.
+   It contains information about the JS tests that failed in more than one
+   pull request in the last 100 `node-test-pull-request` CI runs. The more
+   pull requests a test fails, the higher it is ranked, and the more
+   likely it is that the test is a flake.
+2. Search for the name of the test in [the Node.js issue tracker](https://github.com/nodejs/node/issues)
+   and see if there is already an issue about it. If there is,
+   check whether the failures are similar, and comment with updates
+   if necessary.
+3. If the flake isn't already tracked by an issue, continue to look into
+   it. In the report of a JS test, check out the pull requests in which it
+   fails and see if there is a connection. If the pull requests appear to
+   be unrelated, it is more likely that the test is a flake.
+4. Search the historical reliability reports in the reliability issue
+   tracker for the name of the test, and see how long the flake has been
+   showing up. Gather information from the historical reports, and
+   [open an issue](https://github.com/nodejs/node/issues/new?assignees=&labels=flaky-test&projects=&template=4-report-a-flaky-test.yml)
+   in the Node.js issue tracker to track the flake.
+
+### Handling flaky JS tests
+
+1. If the flake has only started showing up in the past month, check the
+   historical reports to see precisely when it first appeared. Look at
+   commits landing on the target branch around the same time using
+   `https://github.com/nodejs/node/commits?since=YYYY-MM-DD`
+   and see if any pull request looks related. If one or
+   more related pull requests can be found, ping the author or the
+   reviewer of the pull request, or the team in charge of the
+   related subsystem, in the tracking issue or in private, to see if
+   they can come up with a fix to deflake the test.
+2. If the test has been flaky for more than a month and no one is actively
+   working on it, it is unlikely to go away on its own, and it's time
+   to mark it as flaky. For example, if `parallel/some-flaky-test.js`
+   has been flaky on Windows in the CI, after making sure that there is an
+   issue tracking it, open a pull request to add the following entry to
+   [`test/parallel/parallel.status`](https://github.com/nodejs/node/tree/main/test/parallel/parallel.status):
+
+   ```
+   [$system==win32]
+   # https://github.com/nodejs/node/issues/<TRACKING_ISSUE_ID>
+   some-flaky-test: PASS,FLAKY
+   ```
+
+### Identifying infrastructure issues
+
+In the reliability reports, `Jenkins Failure`, `Git Failure` and
+`Build Failure` are generally infrastructure issues and can be
+handled by the `nodejs/build` team. Typical infrastructure
+issues include:
 
-   - Title should be `Investigate path/under/the/test/directory/without/extension`,
-     for example `Investigate async-hooks/test-zlib.zlib-binding.deflate`.
-
-2. Add the `Flaky Test` label and relevant subsystem labels (TODO: create
-   useful labels).
-
-3. Open a pull request updating `flakes.json`.
-
-4. Notify the subsystem team related to the flake.
-
-### Infrastructure failures
-
-When the CI run fails because:
-
-- There are network connection issues
-- There are tests fail with `ENOSPAC` (No space left on device)
 - The CI machine has trouble pulling source code from the repository
-
-Do the following:
-
-1. Search in this repo with the error message and see if there is any open
-   issue about this.
-2. If there is an existing issue, wait until the problem gets fixed.
-3. If there are no similar issues, open a new one with
-   [the build infra issue template](./templates/infra.txt).
-4. Add label `Build Infra`.
-5. Notify the `@nodejs/build-infra` team in the issue.
-
-### Build File Failures
-
-When the CI run of a PR that does not touch the build files ends with build
-failures (e.g. the run ends before the test runner has a chance to run):
-
-1. Search in this repo with the error message that contains keywords like
-   `fatal`, `error`, etc.
-2. If there is a similar issue, add a reply there using the
-   [reproduction template](./templates/build-file-repro.txt).
-3. If there are no similar issues, open a new one with
-   [the build file issue template](./templates/build-file.txt).
-4. Add label `Build Files`.
-5. Notify the `@nodejs/build-files` team in the issue.
+- The CI machine has trouble communicating with the Jenkins server
+- Builds timing out
+- Parent jobs failing to trigger sub-builds
+
+Sometimes infrastructure issues can show up in the tests too: for
+example, tests can fail with `ENOSPC` (No space left on device), and
+the machine needs to be cleaned up to release disk space.
+
+Some infrastructure issues can go away on their own, but if the same
+kind of infrastructure issue has been failing multiple pull requests
+and persists for more than a day, it's time to take action.
+
+### Handling infrastructure issues
+
+Check out the [Node.js build issue tracker](https://github.com/nodejs/build/issues)
+to see if there is an open issue about the problem. If there isn't,
+open a new issue about it or ask around in the `#nodejs-build` channel
+in the OpenJS Slack.
+
+When reporting infrastructure issues, it's important to include
+information about the particular machines where the issues happen.
+On the Jenkins job page of the failed CI build where the infrastructure
+issue is reported in the logs (not to be confused with the parent build
+that triggers the sub-builds that have the issues), in the top-right
+corner there is normally a line similar to
+`Took 16 sec on test-equinix-ubuntu2004_container-armv7l-1`.
+In this case, `test-equinix-ubuntu2004_container-armv7l-1`
+is the machine having infrastructure issues, and it's important
+to include this information in the report.
 
 ## TODO
 
-- [ ] Settle down on the flake database schema
 - [ ] Read the flake database in ncu-ci so people can quickly tell if
-  a failure is a flake
+      a failure is a flake
 - [ ] Automate the report process in ncu-ci
 - [ ] Migrate existing issues in nodejs/node and nodejs/build, close outdated
-  ones.
-- [ ] Automate CI health history tracking
+      ones.
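
A minimal sketch of the first step of "Identifying flaky JS tests": pull the
latest reliability report through the GitHub REST API and print its `JSTest
Failure` section. It assumes the reports are the most recent issues in this
repository and that the section heading appears verbatim in the issue body;
the report format is not guaranteed to stay stable.

```
# Sketch: fetch the newest reliability report and show its JSTest failures.
import json
import urllib.request

API = "https://api.github.com/repos/nodejs/reliability/issues?per_page=1&state=all"

def latest_report():
    with urllib.request.urlopen(API) as res:
        issues = json.load(res)
    return issues[0] if issues else None

def jstest_section(body):
    # Take the text from "JSTest Failure" up to the next "## " heading.
    start = body.find("JSTest Failure")
    if start == -1:
        return ""
    end = body.find("\n## ", start)
    return body[start:end] if end != -1 else body[start:]

report = latest_report()
if report:
    print(report["title"])
    print(jstest_section(report["body"] or ""))
```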

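Step 4 of "Identifying flaky JS tests" and step 1 of "Handling flaky JS
tests" both boil down to finding when a test name first showed up in the
historical reports. A sketch using the GitHub search API; `test-zlib-unzip`
is a hypothetical test name used only for illustration:

```
# Sketch: find the oldest reliability report that mentions a test name.
import json
import urllib.parse
import urllib.request

def first_report_mentioning(test_name):
    query = urllib.parse.quote(f"repo:nodejs/reliability {test_name}")
    url = ("https://api.github.com/search/issues"
           f"?q={query}&sort=created&order=asc&per_page=1")
    with urllib.request.urlopen(url) as res:
        result = json.load(res)
    items = result.get("items", [])
    return items[0] if items else None

hit = first_report_mentioning("test-zlib-unzip")  # hypothetical test name
if hit:
    print(hit["created_at"], hit["html_url"])
```

The date of the first hit can then be plugged into
`https://github.com/nodejs/node/commits?since=YYYY-MM-DD` to look for pull
requests that may have introduced the flake.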
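
For "Handling infrastructure issues", the machine name shown in the
`Took ... on ...` line is also available from the Jenkins JSON API. A
sketch, assuming the build page is public and that the build exposes the
standard `builtOn` field (some build types may not); the build URL is a
placeholder:

```
# Sketch: recover which machine ran a Jenkins build, for infra reports.
import json
import urllib.request

def built_on(build_url):
    with urllib.request.urlopen(build_url.rstrip("/") + "/api/json") as res:
        data = json.load(res)
    # e.g. "test-equinix-ubuntu2004_container-armv7l-1"
    return data.get("builtOn")

# Placeholder URL; query the sub-build that actually shows the failure.
print(built_on("https://ci.nodejs.org/job/node-test-commit-arm/12345/"))
```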