Commit f8cd32e

doc: update README and include protocol for handling reliability issues (#678)
1 parent 0eea23d commit f8cd32e

README.md (+105 -105)
@@ -5,19 +5,17 @@ This repo is used for tracking flaky tests on the Node.js CI and fixing them.
 **Current status**: work in progress. Please go to the issue tracker to discuss!
 
 <!-- TOC -->
-
-- [Updating this repo](#updating-this-repo)
-- [The Goal](#the-goal)
+- [Node.js Core CI Reliability](#nodejs-core-ci-reliability)
+  - [Updating this repo](#updating-this-repo)
+  - [The Goal](#the-goal)
   - [The Definition of Green](#the-definition-of-green)
-- [CI Health History](#ci-health-history)
-- [Handling Failed CI runs](#handling-failed-ci-runs)
-  - [Flaky Tests](#flaky-tests)
-    - [Identifying Flaky Tests](#identifying-flaky-tests)
-    - [When Discovering a Potential New Flake on the CI](#when-discovering-a-potential-new-flake-on-the-ci)
-  - [Infrastructure failures](#infrastructure-failures)
-  - [Build File Failures](#build-file-failures)
-- [TODO](#todo)
-
+  - [CI Health History](#ci-health-history)
+  - [Protocols in improving CI reliability](#protocols-in-improving-ci-reliability)
+    - [Identifying flaky JS tests](#identifying-flaky-js-tests)
+    - [Handling flaky JS tests](#handling-flaky-js-tests)
+    - [Identifying infrastructure issues](#identifying-infrastructure-issues)
+    - [Handling infrastructure issues](#handling-infrastructure-issues)
+  - [TODO](#todo)
 <!-- /TOC -->
 
 ## Updating this repo
@@ -50,104 +48,106 @@ Make the CI green again.
 
 ## CI Health History
 
-See https://nodejs-ci-health.mmarchini.me/#/job-summary
-
-| UTC Time         | RUNNING | SUCCESS | UNSTABLE | ABORTED | FAILURE | Green Rate |
-| ---------------- | ------- | ------- | -------- | ------- | ------- | ---------- |
-| 2018-06-01 20:00 | 1       | 1       | 15       | 11      | 72      | 1.13%      |
-| 2018-06-03 11:36 | 3       | 6       | 21       | 10      | 60      | 6.89%      |
-| 2018-06-04 15:00 | 0       | 9       | 26       | 10      | 55      | 10.00%     |
-| 2018-06-15 17:42 | 1       | 27      | 4        | 17      | 51      | 32.93%     |
-| 2018-06-24 18:11 | 0       | 27      | 2        | 8       | 63      | 29.35%     |
-| 2018-07-08 19:40 | 1       | 35      | 2        | 4       | 58      | 36.84%     |
-| 2018-07-18 20:46 | 2       | 38      | 4        | 5       | 51      | 40.86%     |
-| 2018-07-24 22:30 | 2       | 46      | 3        | 4       | 45      | 48.94%     |
-| 2018-08-01 19:11 | 4       | 17      | 2        | 2       | 75      | 18.09%     |
-| 2018-08-14 15:42 | 5       | 22      | 0        | 14      | 59      | 27.16%     |
-| 2018-08-22 13:22 | 2       | 29      | 4        | 9       | 56      | 32.58%     |
-| 2018-10-31 13:28 | 0       | 40      | 13       | 4       | 43      | 41.67%     |
-| 2018-11-19 10:32 | 0       | 48      | 8        | 5       | 39      | 50.53%     |
-| 2018-12-08 20:37 | 2       | 18      | 4        | 3       | 73      | 18.95%     |
-
-## Handling Failed CI runs
-
-### Flaky Tests
-
-TODO: automate all of this in ncu-ci
-
-#### Identifying Flaky Tests
-
-When checking the CI results of a PR, if there is one or more failed tests (with
-`not ok` as the TAP result):
-
-1. If the failed test is not related to the PR (does not touch the modified
-   code path), search the test name in the issue tracker of this repo. If there
-   is an existing issue, add a reply there using the [reproduction template](./templates/repro.txt),
-   and open a pull request updating `flakes.json`.
-2. If there are no new existing issues about the test, run the CI again. If the
-   failure disappears in the next run, then it is potential flake. See
-   [When discovering a potential flake on the CI](#when-discovering-a-potential-new-flake-on-the-ci)
-   on what to do for a new flake.
-3. If the failure reproduces in the next run, it is likely that the failure is
-   related to the PR. Do not re-run CI without code changes in the next 24
-   hours, try to debug the failure.
-4. If the cause of the failure still cannot be identified 24 hours later, and
-   the code has not been changed, start a CI run and see if the failure
-   disappears. Go back to step 3 if the failure still reproduces, and go to
-   step 2 if the failure disappears.
-
-#### When Discovering a Potential New Flake on the CI
-
-1. Open an issue in this repo using [the flake issue template](./templates/flake.txt):
+[A GitHub workflow](.github/workflows/reliability_report.yml) runs every day
+to produce reliability reports of the `node-test-pull-request` CI and post
+them to [the issue tracker](https://github.com/nodejs/reliability/issues).
+
+## Protocols in improving CI reliability
+
+Most work starts with opening the issue tracker of this repository and
+reading the latest report. If the report is missing, see
+[the actions page](https://github.com/nodejs/reliability/actions) for
+details: GitHub's API restricts the length of issue messages, so
+whenever a report is too long the workflow can fail to post it as an
+issue, but it should still leave a summary on the actions page.
+
+### Identifying flaky JS tests
+
+1. Check out the `JSTest Failure` section of the latest reliability report.
+   It contains information about the JS tests that failed in more than one
+   pull request in the last 100 `node-test-pull-request` CI runs. The more
+   pull requests a test fails, the higher it is ranked, and the more
+   likely it is that the test is a flake.
+2. Search for the name of the test in [the Node.js issue tracker](https://github.com/nodejs/node/issues)
+   and see if there is already an issue about it. If there is,
+   check whether the failures are similar, and comment with updates
+   if necessary.
+3. If the flake isn't already tracked by an issue, continue to look into
+   it. In the report of a JS test, check out the pull requests in which it
+   fails and see if there is a connection. If the pull requests appear to
+   be unrelated, it is more likely that the test is a flake.
+4. Search the historical reliability reports in the reliability issue
+   tracker for the name of the test, and see how long the flake has been
+   showing up. Gather information from the historical reports, and
+   [open an issue](https://github.com/nodejs/node/issues/new?assignees=&labels=flaky-test&projects=&template=4-report-a-flaky-test.yml)
+   in the Node.js issue tracker to track the flake.
+
+### Handling flaky JS tests
+
+1. If the flake has only started showing up in the past month, check the
+   historical reports to see precisely when it first appeared. Look at
+   commits landing on the target branch around the same time using
+   `https://github.com/nodejs/node/commits?since=YYYY-MM-DD`
+   and see if any pull request looks related. If one or
+   more related pull requests can be found, ping the author or the
+   reviewer of the pull request, or the team in charge of the
+   related subsystem, in the tracking issue or in private, to see if
+   they can come up with a fix to deflake the test.
+2. If the test has been flaky for more than a month and no one is actively
+   working on it, it is unlikely to go away on its own, and it's time
+   to mark it as flaky. For example, if `parallel/some-flaky-test.js`
+   has been flaky on Windows in the CI, after making sure that there is an
+   issue tracking it, open a pull request to add the following entry to
+   [`test/parallel/parallel.status`](https://github.com/nodejs/node/tree/main/test/parallel/parallel.status):
+
+   ```
+   [$system==win32]
+   # https://github.com/nodejs/node/issues/<TRACKING_ISSUE_ID>
+   some-flaky-test: PASS,FLAKY
+   ```
+
+### Identifying infrastructure issues
+
+In the reliability reports, `Jenkins Failure`, `Git Failure` and
+`Build Failure` are generally infrastructure issues and can be
+handled by the `nodejs/build` team. Typical infrastructure
+issues include:
 
-   - Title should be `Investigate path/under/the/test/directory/without/extension`,
-     for example `Investigate async-hooks/test-zlib.zlib-binding.deflate`.
-
-2. Add the `Flaky Test` label and relevant subsystem labels (TODO: create
-   useful labels).
-
-3. Open a pull request updating `flakes.json`.
-
-4. Notify the subsystem team related to the flake.
-
-### Infrastructure failures
-
-When the CI run fails because:
-
-- There are network connection issues
-- There are tests fail with `ENOSPAC` (No space left on device)
 - The CI machine has trouble pulling source code from the repository
-
-Do the following:
-
-1. Search in this repo with the error message and see if there is any open
-   issue about this.
-2. If there is an existing issue, wait until the problem gets fixed.
-3. If there are no similar issues, open a new one with
-   [the build infra issue template](./templates/infra.txt).
-4. Add label `Build Infra`.
-5. Notify the `@nodejs/build-infra` team in the issue.
-
-### Build File Failures
-
-When the CI run of a PR that does not touch the build files ends with build
-failures (e.g. the run ends before the test runner has a chance to run):
-
-1. Search in this repo with the error message that contains keywords like
-   `fatal`, `error`, etc.
-2. If there is a similar issue, add a reply there using the
-   [reproduction template](./templates/build-file-repro.txt).
-3. If there are no similar issues, open a new one with
-   [the build file issue template](./templates/build-file.txt).
-4. Add label `Build Files`.
-5. Notify the `@nodejs/build-files` team in the issue.
+- The CI machine has trouble communicating with the Jenkins server
+- Builds timing out
+- Parent jobs failing to trigger sub-builds
+
+Sometimes infrastructure issues can show up in the tests too: for
+example, tests can fail with `ENOSPC` (No space left on device), and
+the machine needs to be cleaned up to release disk space.
+
+Some infrastructure issues can go away on their own, but if the same
+kind of infrastructure issue has been failing multiple pull requests
+and persists for more than a day, it's time to take action.
+
+### Handling infrastructure issues
+
+Check out the [Node.js build issue tracker](https://github.com/nodejs/build/issues)
+to see if there is an open issue about the problem. If there isn't,
+open a new issue about it or ask around in the `#nodejs-build` channel
+in the OpenJS Slack.
+
+When reporting infrastructure issues, it's important to include
+information about the particular machines where the issues happen.
+On the Jenkins job page of the failed CI build where the infrastructure
+issue is reported in the logs (not to be confused with the parent build
+that triggers the sub-builds that have the issues), in the top-right
+corner there is normally a line similar to
+`Took 16 sec on test-equinix-ubuntu2004_container-armv7l-1`.
+In this case, `test-equinix-ubuntu2004_container-armv7l-1`
+is the machine having infrastructure issues, and it's important
+to include this information in the report.
 
 ## TODO
 
-- [ ] Settle down on the flake database schema
 - [ ] Read the flake database in ncu-ci so people can quickly tell if
-  a failure is a flake
+      a failure is a flake
 - [ ] Automate the report process in ncu-ci
 - [ ] Migrate existing issues in nodejs/node and nodejs/build, close outdated
-  ones.
-- [ ] Automate CI health history tracking
+      ones.
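
A minimal sketch of the first step of "Identifying flaky JS tests": pull the
latest reliability report through the GitHub REST API and print its `JSTest
Failure` section. It assumes the reports are the most recent issues in this
repository and that the section heading appears verbatim in the issue body;
the report format is not guaranteed to stay stable.

```
# Sketch: fetch the newest reliability report and show its JSTest failures.
import json
import urllib.request

API = "https://api.github.com/repos/nodejs/reliability/issues?per_page=1&state=all"

def latest_report():
    with urllib.request.urlopen(API) as res:
        issues = json.load(res)
    return issues[0] if issues else None

def jstest_section(body):
    # Take the text from "JSTest Failure" up to the next "## " heading.
    start = body.find("JSTest Failure")
    if start == -1:
        return ""
    end = body.find("\n## ", start)
    return body[start:end] if end != -1 else body[start:]

report = latest_report()
if report:
    print(report["title"])
    print(jstest_section(report["body"] or ""))
```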

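Step 4 of "Identifying flaky JS tests" and step 1 of "Handling flaky JS
tests" both boil down to finding when a test name first showed up in the
historical reports. A sketch using the GitHub search API; `test-zlib-unzip`
is a hypothetical test name used only for illustration:

```
# Sketch: find the oldest reliability report that mentions a test name.
import json
import urllib.parse
import urllib.request

def first_report_mentioning(test_name):
    query = urllib.parse.quote(f"repo:nodejs/reliability {test_name}")
    url = ("https://api.github.com/search/issues"
           f"?q={query}&sort=created&order=asc&per_page=1")
    with urllib.request.urlopen(url) as res:
        result = json.load(res)
    items = result.get("items", [])
    return items[0] if items else None

hit = first_report_mentioning("test-zlib-unzip")  # hypothetical test name
if hit:
    print(hit["created_at"], hit["html_url"])
```

The date of the first hit can then be plugged into
`https://github.com/nodejs/node/commits?since=YYYY-MM-DD` to look for pull
requests that may have introduced the flake.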
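
For "Handling infrastructure issues", the machine name shown in the
`Took ... on ...` line is also available from the Jenkins JSON API. A
sketch, assuming the build page is public and that the build exposes the
standard `builtOn` field (some build types may not); the build URL is a
placeholder:

```
# Sketch: recover which machine ran a Jenkins build, for infra reports.
import json
import urllib.request

def built_on(build_url):
    with urllib.request.urlopen(build_url.rstrip("/") + "/api/json") as res:
        data = json.load(res)
    # e.g. "test-equinix-ubuntu2004_container-armv7l-1"
    return data.get("builtOn")

# Placeholder URL; query the sub-build that actually shows the failure.
print(built_on("https://ci.nodejs.org/job/node-test-commit-arm/12345/"))
```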