# Premerge Advisor

## Introduction

While the premerge infrastructure is generally reliable, tests that are broken
at HEAD or tests that are flaky can significantly degrade the user experience.
Failures unrelated to the PR being worked on lead to warning fatigue, which
compounds the problem when people land failing tests into main believing the
failures are unrelated. The premerge advisor is designed to run after premerge
testing and signal to the user whether the failures can be safely ignored
because they are flaky or broken at head, or whether they should be
investigated further.

## Background/Motivation

The usefulness of the premerge system is significantly impacted by how much
people trust it. People trust the system less when the issues reported to them
are unrelated to the changes they are currently working on. False positives
occur regularly, whether due to flaky tests or due to main being broken.
Efforts to fix these issues at the source by keeping main always green and
deflaking tests are laudable, but they are not a scalable solution to the
problem of false positive failures. It is also not reasonable to expect PR
authors to spend time familiarizing themselves with all known flaky tests and
to dig through postcommit testing logs to see if the failures in their premerge
run also occur on main. These alternatives are further explored in the section
on [alternatives considered](#alternatives-considered). Tooling that
automatically runs through the steps one would otherwise need to perform
manually ensures the analysis of every failed premerge run is thorough and
likely to be correct.
 | 29 | + | 

## Design

The premerge advisor will consist of three main parts: jobs uploading failure
information, a web server and database to store and query failure information,
and tooling to write a verdict about the failures into comments on GitHub.
When a premerge or postcommit job has build/test failures, it will upload
information about each failure, such as the test/build action that failed and
the exact log, to the web server. The web server will then store this
information in a format that makes it easy to query later. Every premerge run
that encounters build/test failures will then query the web server to see if
there are any matching build/test failures for the version of `main` that the
PR has been merged into. If there are, the web server can respond with the
failure information and signal that the failures can be safely ignored for
that premerge run. If the failures are novel, the web server can signal that
they should be investigated more thoroughly. If there are failures, the
premerge advisor can then write a comment on the PR explaining its findings.
 | 47 | + | 

### Processing and Storing Failures

A key part of the premerge advisor is infrastructure to store and process build
failure information so that it can be queried later. We plan on having jobs
extract failure logs and upload them to a web server. In addition to an
endpoint that accepts uploads, this web server will have an endpoint that
accepts test failure information (logs and filenames) and returns whether the
failures are broken at `main`, flaky, or novel failures caused by the PR.
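
As a concrete sketch, the two endpoints might exchange JSON shaped roughly as
follows. All field names and values here are illustrative assumptions, not a
settled schema:

```python
# Hypothetical upload payload (field names are illustrative): one record per
# failed build/test action, tagged with where the failure was observed.
upload_payload = {
    "source": "premerge",            # or "postcommit"
    "base_commit_sha": "0123abcd",   # version of main the PR was merged into
    "workflow_run": 41517,
    "failures": [
        {"name": "LLVM :: tools/llc/example.ll",
         "log": "FAIL: ... full failure log ..."},
    ],
}

# Hypothetical query response: one verdict per submitted failure, so the job
# can decide whether every failure is explainable as flaky or broken at main.
query_response = {
    "failures": [
        {"name": "LLVM :: tools/llc/example.ll",
         "verdict": "broken-at-main"},  # or "flaky" / "novel"
    ],
}

# Failures can be safely ignored only if every one of them is explained.
all_explained = all(
    f["verdict"] in ("flaky", "broken-at-main")
    for f in query_response["failures"]
)
```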

For the premerge jobs running through GitHub Actions, we plan on using the
existing `generate_test_report` scripts that are currently used for generating
summaries on job failures. When a job ends with a failure, a script will run
that uses the `generate_test_report` library to extract failure information
and then uploads that information to the web server. Information on how the
premerge jobs will query the server and display results about flaky/already
broken tests is in [the section below](#informing-the-user). We plan on having
both the premerge and postcommit jobs upload failure information to the web
server. This lets the web server know about failures that are not the result
of mid-air collisions before postcommit testing has completed. Postcommit
testing can take upwards of 30 minutes, during which the premerge advisor
would otherwise not be able to advise that failures are due to a broken
`main`.
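
The upload step could look roughly like the sketch below, assuming a
hypothetical advisor endpoint URL and that the extraction step hands back
(name, log) pairs for each failure; the exact `generate_test_report` interface
is not fixed here:

```python
import json
import urllib.request

# Hypothetical advisor endpoint; the real URL is a deployment detail.
ADVISOR_UPLOAD_URL = "http://premerge-advisor.example/upload"

def build_upload_payload(failures, base_commit_sha, workflow_run,
                         source="premerge"):
    """Package (name, log) failure pairs, e.g. as extracted with the
    generate_test_report library, into an upload request body."""
    return {
        "source": source,
        "base_commit_sha": base_commit_sha,
        "workflow_run": workflow_run,
        "failures": [{"name": name, "log": log} for name, log in failures],
    }

def upload_failures(payload):
    """POST the failure information to the advisor web server."""
    request = urllib.request.Request(
        ADVISOR_UPLOAD_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status == 200
```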

We plan on implementing the web server in Python with the `flask` library. All
contributors to the premerge infrastructure are already familiar with Python,
building web servers in Python with `flask` is relatively easy, and we do not
need high performance or advanced features. For storage, we plan on using
SQLite, as it is built into Python, does not require any additional
infrastructure setup, and is reliable.

Given that we have two identical clusters that need to be able to function
independently of each other, we also need to duplicate the web server for the
premerge advisor. Queries and failure-information uploads will go to the
cluster-local premerge advisor web server. Periodically (e.g., once every
thirty seconds), each web server will query the web server on the other
cluster to see if there is any new data that has not yet been propagated back
to it. It is easy to figure out what one side is missing, as GitHub workflow
runs are numbered sequentially and git commits are also explicitly ordered.
One side just needs to send the latest commit SHA and workflow run it has all
previous data for, and the other side can reply with the data that it has past
that point. Explicitly synchronizing every time, without assumptions about the
state of the other side, has benefits over just writing through, like ensuring
that a cluster that has been down for a significant amount of time can
seamlessly recover. Synchronization does not need to be perfect: test failures
that are flaky or broken at head will almost certainly show up in both
clusters relatively quickly, and minor discrepancies in query results between
the clusters are not a big deal.
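
The catch-up logic can be sketched as follows, assuming premerge records are
keyed by their sequential workflow run number (postcommit records would work
the same way using commit order); the function names are illustrative:

```python
def high_water_mark(runs_present):
    """Largest run number N such that every run from the earliest one held
    up through N is present; sequential numbering makes gaps detectable."""
    if not runs_present:
        return 0
    mark = min(runs_present)
    while mark + 1 in runs_present:
        mark += 1
    return mark

def sync_once(local, peer):
    """One periodic sync step (e.g. once every thirty seconds): send our
    high-water mark, then merge in everything the peer has past it. Because
    nothing is assumed about our own prior state, a cluster that was down
    for a while simply catches up on its next sync."""
    mark = high_water_mark(set(local))
    local.update({run: rec for run, rec in peer.items() if run > mark})
    return local
```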

### Informing the User

Once a workflow has completed, no premerge-advisor actions will be performed
if there are no failures. If there are failures, they will be uploaded to the
web server. Afterwards, the premerge workflow makes a request to the server
asking whether it can explain the failures as either existing in `main` or as
flaky.

After the response from the server has been received, the workflow will
construct a comment. It will contain the failure information and, if relevant,
information/links showing that the tests are flaky (and the flakiness rate) or
are broken in `main`. If all of the test failures are due to flakes or
failures in `main`, the comment will indicate to the user that they can safely
merge their PR despite the test failures. We plan to construct the comment in
a manner similar to the code formatting action: we will create one comment on
the first workflow failure and then update that comment every time we get new
data. This avoids creating excessive noise. It does mean that the comment
might get buried in long-running reviews, but those are not the common case,
and people can learn to look for the comment earlier in the thread in such
cases.
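
The create-then-update pattern can be sketched against the GitHub REST
issue-comments API, using a hidden HTML marker to find the advisor's earlier
comment. The marker string and helper names are assumptions for illustration:

```python
import json
import urllib.request

GITHUB_API = "https://api.github.com"
MARKER = "<!-- premerge-advisor -->"  # hypothetical hidden marker

def find_advisor_comment(comments):
    """Return the previously posted advisor comment, if any, by looking
    for the hidden marker in each comment body."""
    return next((c for c in comments if MARKER in c.get("body", "")), None)

def _github(method, url, token, body=None):
    """Minimal authenticated call to the GitHub REST API."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    request = urllib.request.Request(url, data=data, method=method, headers={
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
    })
    with urllib.request.urlopen(request) as response:
        return json.load(response)

def upsert_advisor_comment(repo, pr_number, verdict_text, token):
    """Create the advisor comment on the first failing run, then edit that
    same comment in place whenever new data arrives."""
    body = f"{MARKER}\n{verdict_text}"
    comments = _github(
        "GET", f"{GITHUB_API}/repos/{repo}/issues/{pr_number}/comments", token)
    existing = find_advisor_comment(comments)
    if existing is not None:
        return _github(
            "PATCH",
            f"{GITHUB_API}/repos/{repo}/issues/comments/{existing['id']}",
            token, {"body": body})
    return _github(
        "POST", f"{GITHUB_API}/repos/{repo}/issues/{pr_number}/comments",
        token, {"body": body})
```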

## Alternatives Considered

There are two main alternatives to this work. One would be to do nothing and
let users figure out these failures on their own, potentially with
documentation to better inform them of the process. The other would be to work
on keeping `main` green all the time and deflaking tests rather than building
a premerge advisor. Both of these alternatives are considered below.

### Deflaking Tests and Keeping Main Green

Instead of putting effort into building the premerge advisor, we could put
effort into deflaking tests and making process changes to ensure `main` is not
broken. These fixes have the bonus of being useful for more than just
premerge, also improving reliability for buildbots and any downstream testing.
While we probably could achieve this at least temporarily with process changes
and a concentrated deflaking effort, we do not believe it is feasible or
scalable.

In order to ensure that main is not broken by new patches, we would need to
ensure that every commit is tested directly on top of `main` before landing.
This is not feasible given LLVM's current processes, where pushing directly to
main is a critical component of several developers' workflows. We would also
need to reduce the risk of "mid-air collisions": patches that pass premerge
testing but fail on `main` when landed because the patch in its original state
is not compatible with the new state of main. This would most likely involve
merge queues, which would introduce new CI load and are also incompatible with
LLVM's existing practices, for the same reason that requiring premerge checks
to be run before landing is.

An initial effort to deflake tests is also not scalable from an engineering
perspective. While we might be able to deflake existing tests, additional
flaky tests will get added in the future, and it is likely not feasible to
dedicate enough resources to deflaking them. Policy improvements around
reverting patches that introduce flaky tests might make this more scalable,
but that relies on quick detection of flaky tests, which can be difficult for
tests that fail very rarely.

### Not Doing Anything

Alternatively, we could not implement this at all. This system adds quite a
bit of complexity and introduces new failure modes, and false positive
failures are not that frequent. However, even the relatively small percentage
of false failures we are observing significantly impacts trust in the premerge
system, which compounds the problem. People learn not to trust the results,
ignore true failures caused by their patch, and land it anyway, causing many
downstream failures. The frequency of these incidents (typically multiple
times per week) means that most LLVM developers will likely run into this
class of issue sooner or later.

The complexity is also well confined to the components specific to this new
infrastructure, and the new failure modes can be mitigated through proper
error handling at the interface between the existing premerge system and the
new premerge advisor infrastructure.