@@ -5,19 +5,17 @@ This repo is used for tracking flaky tests on the Node.js CI and fixing them.
**Current status**: work in progress. Please go to the issue tracker to discuss!

<!-- TOC -->
-
- - [Updating this repo](#updating-this-repo)
- - [The Goal](#the-goal)
+ - [Node.js Core CI Reliability](#nodejs-core-ci-reliability)
+   - [Updating this repo](#updating-this-repo)
+   - [The Goal](#the-goal)
    - [The Definition of Green](#the-definition-of-green)
- - [CI Health History](#ci-health-history)
- - [Handling Failed CI runs](#handling-failed-ci-runs)
-   - [Flaky Tests](#flaky-tests)
-     - [Identifying Flaky Tests](#identifying-flaky-tests)
-     - [When Discovering a Potential New Flake on the CI](#when-discovering-a-potential-new-flake-on-the-ci)
-   - [Infrastructure failures](#infrastructure-failures)
-   - [Build File Failures](#build-file-failures)
- - [TODO](#todo)
-
+   - [CI Health History](#ci-health-history)
+   - [Protocols in improving CI reliability](#protocols-in-improving-ci-reliability)
+     - [Identifying flaky JS tests](#identifying-flaky-js-tests)
+     - [Handling flaky JS tests](#handling-flaky-js-tests)
+     - [Identifying infrastructure issues](#identifying-infrastructure-issues)
+     - [Handling infrastructure issues](#handling-infrastructure-issues)
+   - [TODO](#todo)
<!-- /TOC -->

## Updating this repo
@@ -50,104 +48,106 @@ Make the CI green again.

## CI Health History

- See https://nodejs-ci-health.mmarchini.me/#/job-summary
-
- | UTC Time         | RUNNING | SUCCESS | UNSTABLE | ABORTED | FAILURE | Green Rate |
- | ---------------- | ------- | ------- | -------- | ------- | ------- | ---------- |
- | 2018-06-01 20:00 | 1       | 1       | 15       | 11      | 72      | 1.13%      |
- | 2018-06-03 11:36 | 3       | 6       | 21       | 10      | 60      | 6.89%      |
- | 2018-06-04 15:00 | 0       | 9       | 26       | 10      | 55      | 10.00%     |
- | 2018-06-15 17:42 | 1       | 27      | 4        | 17      | 51      | 32.93%     |
- | 2018-06-24 18:11 | 0       | 27      | 2        | 8       | 63      | 29.35%     |
- | 2018-07-08 19:40 | 1       | 35      | 2        | 4       | 58      | 36.84%     |
- | 2018-07-18 20:46 | 2       | 38      | 4        | 5       | 51      | 40.86%     |
- | 2018-07-24 22:30 | 2       | 46      | 3        | 4       | 45      | 48.94%     |
- | 2018-08-01 19:11 | 4       | 17      | 2        | 2       | 75      | 18.09%     |
- | 2018-08-14 15:42 | 5       | 22      | 0        | 14      | 59      | 27.16%     |
- | 2018-08-22 13:22 | 2       | 29      | 4        | 9       | 56      | 32.58%     |
- | 2018-10-31 13:28 | 0       | 40      | 13       | 4       | 43      | 41.67%     |
- | 2018-11-19 10:32 | 0       | 48      | 8        | 5       | 39      | 50.53%     |
- | 2018-12-08 20:37 | 2       | 18      | 4        | 3       | 73      | 18.95%     |
-
- ## Handling Failed CI runs
-
- ### Flaky Tests
-
- TODO: automate all of this in ncu-ci
-
- #### Identifying Flaky Tests
-
- When checking the CI results of a PR, if there is one or more failed tests (with
- `not ok` as the TAP result):
-
- 1. If the failed test is not related to the PR (does not touch the modified
-    code path), search the test name in the issue tracker of this repo. If there
-    is an existing issue, add a reply there using the [reproduction template](./templates/repro.txt),
-    and open a pull request updating `flakes.json`.
- 2. If there are no new existing issues about the test, run the CI again. If the
-    failure disappears in the next run, then it is potential flake. See
-    [When discovering a potential flake on the CI](#when-discovering-a-potential-new-flake-on-the-ci)
-    on what to do for a new flake.
- 3. If the failure reproduces in the next run, it is likely that the failure is
-    related to the PR. Do not re-run CI without code changes in the next 24
-    hours, try to debug the failure.
- 4. If the cause of the failure still cannot be identified 24 hours later, and
-    the code has not been changed, start a CI run and see if the failure
-    disappears. Go back to step 3 if the failure still reproduces, and go to
-    step 2 if the failure disappears.
-
- #### When Discovering a Potential New Flake on the CI
-
- 1. Open an issue in this repo using [the flake issue template](./templates/flake.txt):
+ [A GitHub workflow](.github/workflows/reliability_report.yml) runs every day
+ to produce a reliability report of the `node-test-pull-request` CI and post
+ it to [the issue tracker](https://github.com/nodejs/reliability/issues).
+
+ ## Protocols in improving CI reliability
+
+ Most work starts with opening the issue tracker of this repository and
+ reading the latest report. If the report is missing, check
+ [the actions page](https://github.com/nodejs/reliability/actions) for
+ details: GitHub's API restricts the length of issue messages, so when a
+ report is too long the workflow can fail to post it as an issue, but it
+ should still leave a summary on the actions page.
+
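+ To skim the latest report without opening the browser, it can also be
+ fetched from the GitHub API. The following is a minimal sketch (not part of
+ this repo), assuming the most recently created open issue in this repository
+ is the latest report:
+
+ ```
+ # latest_report.py - requires only the Python standard library
+ import json
+ import urllib.request
+
+ # Assumption: the newest open issue in nodejs/reliability is the latest
+ # daily reliability report posted by the workflow.
+ url = "https://api.github.com/repos/nodejs/reliability/issues?per_page=1"
+ with urllib.request.urlopen(url) as res:
+     report = json.load(res)[0]
+
+ print(report["title"])
+ print(report["html_url"])
+ print((report.get("body") or "")[:2000])  # beginning of the report body
+ ```
+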
+ ### Identifying flaky JS tests
+
+ 1. Check out the `JSTest Failure` section of the latest reliability report.
+    It contains information about the JS tests that failed in more than one
+    pull request in the last 100 `node-test-pull-request` CI runs. The more
+    pull requests a test fails, the higher it is ranked, and the more
+    likely it is to be a flake.
+ 2. Search for the name of the test in [the Node.js issue tracker](https://github.com/nodejs/node/issues)
+    and see if there is already an issue about it. If there is, check
+    whether the failures are similar, and comment with updates if necessary.
+ 3. If the flake isn't already tracked by an issue, continue to look into
+    it. In the report of a JS test, check out the pull requests that it
+    fails and see if there is a connection. If the pull requests appear to
+    be unrelated, it is more likely that the test is a flake.
+ 4. Search the reliability issue tracker for historical reports that mention
+    the test (see the sketch after this list) to find out how long the flake
+    has been showing up. Gather information from the historical reports, and
+    [open an issue](https://github.com/nodejs/node/issues/new?assignees=&labels=flaky-test&projects=&template=4-report-a-flaky-test.yml)
+    in the Node.js issue tracker to track the flake.
+
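+ For step 4, GitHub's issue search can do the digging. This is an
+ illustrative sketch (not part of this repo) that uses the public search
+ API; `parallel/test-example-flaky` is a placeholder for the real test name:
+
+ ```
+ # search_reports.py - requires only the Python standard library
+ import json
+ import urllib.parse
+ import urllib.request
+
+ test_name = "parallel/test-example-flaky"  # placeholder test name
+ query = urllib.parse.quote(f'repo:nodejs/reliability in:body "{test_name}"')
+ url = f"https://api.github.com/search/issues?q={query}&sort=created&order=asc"
+
+ with urllib.request.urlopen(url) as res:
+     results = json.load(res)
+
+ # The oldest matching reports give a rough idea of when the flake started.
+ for item in results["items"][:5]:
+     print(item["created_at"], item["title"], item["html_url"])
+ ```
+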
+ ### Handling flaky JS tests
+
+ 1. If the flake has only started to show up in the last month, check the
+    historical reports to see precisely when it started showing up. Look at
+    commits landing on the target branch around the same time using
+    `https://github.com/nodejs/node/commits?since=YYYY-MM-DD`
+    (see the sketch after this list) and check whether any pull request looks
+    related. If one or more related pull requests can be found, ping the
+    author or the reviewer of the pull request, or the team in charge of the
+    related subsystem, in the tracking issue or in private, to see if they
+    can come up with a fix to deflake the test.
+ 2. If the test has been flaky for more than a month and no one is actively
+    working on it, it is unlikely to go away on its own, and it's time to
+    mark it as flaky. For example, if `parallel/some-flaky-test.js` has been
+    flaky on Windows in the CI, after making sure that there is an issue
+    tracking it, open a pull request to add the following entry to
+    [`test/parallel/parallel.status`](https://github.com/nodejs/node/tree/main/test/parallel/parallel.status):
+
+    ```
+    [$system==win32]
+    # https://github.com/nodejs/node/issues/<TRACKING_ISSUE_ID>
+    some-flaky-test: PASS,FLAKY
+    ```
+
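+ For step 1, the same commit listing is also available through the GitHub
+ API, which can be easier to filter. This is an illustrative sketch (not
+ part of this repo); the dates are placeholders for the window in which the
+ flake first appeared:
+
+ ```
+ # commits_around.py - requires only the Python standard library
+ import json
+ import urllib.request
+
+ since, until = "2024-01-01T00:00:00Z", "2024-01-08T00:00:00Z"  # placeholders
+ url = ("https://api.github.com/repos/nodejs/node/commits"
+        f"?since={since}&until={until}&per_page=100")
+
+ with urllib.request.urlopen(url) as res:
+     commits = json.load(res)
+
+ for c in commits:
+     print(c["sha"][:10], c["commit"]["message"].splitlines()[0])
+ ```
+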
+ ### Identifying infrastructure issues
+
+ In the reliability reports, `Jenkins Failure`, `Git Failure` and
+ `Build Failure` are generally infrastructure issues and can be
+ handled by the `nodejs/build` team. Typical infrastructure
+ issues include:

-    - Title should be `Investigate path/under/the/test/directory/without/extension`,
-      for example `Investigate async-hooks/test-zlib.zlib-binding.deflate`.
-
- 2. Add the `Flaky Test` label and relevant subsystem labels (TODO: create
-    useful labels).
-
- 3. Open a pull request updating `flakes.json`.
-
- 4. Notify the subsystem team related to the flake.
-
- ### Infrastructure failures
-
- When the CI run fails because:
-
- - There are network connection issues
- - There are tests fail with `ENOSPAC` (No space left on device)
- The CI machine has trouble pulling source code from the repository
-
- Do the following:
-
- 1. Search in this repo with the error message and see if there is any open
-    issue about this.
- 2. If there is an existing issue, wait until the problem gets fixed.
- 3. If there are no similar issues, open a new one with
-    [the build infra issue template](./templates/infra.txt).
- 4. Add label `Build Infra`.
- 5. Notify the `@nodejs/build-infra` team in the issue.
-
- ### Build File Failures
-
- When the CI run of a PR that does not touch the build files ends with build
- failures (e.g. the run ends before the test runner has a chance to run):
-
- 1. Search in this repo with the error message that contains keywords like
-    `fatal`, `error`, etc.
- 2. If there is a similar issue, add a reply there using the
-    [reproduction template](./templates/build-file-repro.txt).
- 3. If there are no similar issues, open a new one with
-    [the build file issue template](./templates/build-file.txt).
- 4. Add label `Build Files`.
- 5. Notify the `@nodejs/build-files` team in the issue.
+ - The CI machine has trouble communicating with the Jenkins server
+ - Builds timing out
+ - A parent job fails to trigger its sub-builds
+
+ Sometimes infrastructure issues can show up in the tests too; for
+ example, tests can fail with `ENOSPC` (no space left on device), and
+ the machine needs to be cleaned up to release disk space.
+
+ Some infrastructure issues can go away on their own, but if the same kind
+ of infrastructure issue has been failing multiple pull requests and
+ persists for more than a day, it's time to take action.
+
+ ### Handling infrastructure issues
+
+ Check out the [Node.js build issue tracker](https://github.com/nodejs/build/issues)
+ to see if there is any open issue about the problem. If there isn't,
+ open a new issue or ask around in the `#nodejs-build` channel
+ in the OpenJS Slack.
+
+ When reporting infrastructure issues, it's important to include
+ information about the particular machines where the issues happen.
+ On the Jenkins job page of the failed CI build where the infrastructure
+ issue is reported in the logs (not to be confused with the parent build
+ that triggers the sub-build that has the issue), the top-right
+ corner normally shows a line similar to
+ `Took 16 sec on test-equinix-ubuntu2004_container-armv7l-1`.
+ In this case, `test-equinix-ubuntu2004_container-armv7l-1`
+ is the machine having infrastructure issues, and it's important
+ to include this name in the report.
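+
+ The machine name can also be read from Jenkins' JSON API, which helps when
+ collecting several examples at once. This is an illustrative sketch (not
+ part of this repo); the build URL is a placeholder, and it assumes
+ anonymous read access to ci.nodejs.org:
+
+ ```
+ # built_on.py - requires only the Python standard library
+ import json
+ import urllib.request
+
+ # Placeholder sub-build URL; use the failed build reported in the logs.
+ build = "https://ci.nodejs.org/job/node-test-commit-arm/12345"
+ with urllib.request.urlopen(f"{build}/api/json?tree=builtOn,result") as res:
+     info = json.load(res)
+
+ # `builtOn` is the Jenkins node (CI machine) the build ran on.
+ print(info.get("builtOn"), info.get("result"))
+ ```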
## TODO

- - [ ] Settle down on the flake database schema
- [ ] Read the flake database in ncu-ci so people can quickly tell if
-   a failure is a flake
+       a failure is a flake
- [ ] Automate the report process in ncu-ci
- [ ] Migrate existing issues in nodejs/node and nodejs/build, close outdated
-   ones.
- - [ ] Automate CI health history tracking
+       ones.