
Retry All Tests with a Reduced Retry Limit of 1 #17939


Open · prudhvigodithi wants to merge 2 commits into main from agent

Conversation

@prudhvigodithi (Member) commented Apr 14, 2025

Description

Related Issues

Part of #17974

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@prudhvigodithi changed the title from "[Draft] Add more flaky tests to Gradle retry logic" to "[Draft] Onboard the known flaky tests to the Gradle retry logic" on Apr 14, 2025
@prudhvigodithi (Member Author)

Adding @mch2 @getsaurabh02 @andrross @reta @cwperks.


❌ Gradle check result for 7d01417: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

❌ Gradle check result for 0a0f107: FAILURE

❌ Gradle check result for bfdd7fb: FAILURE

❌ Gradle check result for 3da6877: FAILURE

@prudhvigodithi force-pushed the agent branch 3 times, most recently from b49fd88 to 40c47d6 on April 15, 2025 18:10
❌ Gradle check result for 40c47d6: FAILURE

@prudhvigodithi force-pushed the agent branch 2 times, most recently from 97347a1 to 87b8b8d on April 15, 2025 20:44
build.gradle (Outdated)
@@ -557,6 +557,57 @@ subprojects {
includeClasses.add("org.opensearch.test.rest.ClientYamlTestSuiteIT")
includeClasses.add("org.opensearch.upgrade.DetectEsInstallationTaskTests")
includeClasses.add("org.opensearch.cluster.MinimumClusterManagerNodesIT")
includeClasses.add("org.opensearch.indices.IndicesRequestCacheIT")
Member

Obviously the plan to not automatically retry new flaky tests has not prevented the introduction of more flakiness. Should we just give up on this idea and automatically retry all tests?

@prudhvigodithi (Member Author) Apr 16, 2025

I'm open to retrying all tests, since we then don't have to maintain a specific list and remove entries again once they are fixed. I have created META issue #17974 to surface the tests that pass on retry; with this we will have information and metrics on those tests and can take action accordingly.

Member Author

I just removed the filter with includeClasses.add in my latest commit; let me run the Gradle check a few times on the PR.

Member Author

I have updated the list to remove the tests that I have not seen fail in the past year and to add the new flaky tests.

Member

I honestly think it might be better to remove the filter, particularly if we're going to add to the list without a good reason why we can't fix the flakiness. I don't like increasing the max retries though. How about removing the filter and setting max retries to 1? Rare flakiness won't block the build because a retry will almost always succeed. More severe flakiness will continue to block the build though. @prudhvigodithi What do you think?

Member Author

Agree with you on removing the list. To keep the same benefits, should we continue with 3 as maxRetries?

Member

@prudhvigodithi Can you remove the filter and set max retries to 1 in this PR and then run gradle check a few times to see what the success rate is?

The reason I'd like to keep retries to a minimum is so that retries only fix the really rare flakiness, but failures still happen for tests with a high likelihood of failure.
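For reference, a minimal sketch of what that proposal could look like in build.gradle, assuming the org.gradle.test-retry plugin (which provides the retry { } block) stays applied as it is today; the maxFailures value and the surrounding structure are illustrative, not the exact change in this PR:

subprojects {
    tasks.withType(Test).configureEach {
        retry {
            // Retry each failing test at most once: rare flakiness is absorbed,
            // while tests that fail frequently still fail the build on the retry.
            maxRetries = 1
            // Illustrative safety valve: stop retrying once too many tests fail in a single run.
            maxFailures = 10
            // No filter { includeClasses.add(...) } block, so every test is eligible for retry.
        }
    }
}

Assuming failOnPassedAfterRetry is left at its default (false), a test that passes on its retry does not fail the build; those runs surface as the UNSTABLE Gradle check results seen elsewhere in this PR.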

Member Author

I have retried a few times and got successful green CI runs in a row. I saw one UNSTABLE result, which indicates a retry.

❌ Gradle check result for 87b8b8d: FAILURE

@prudhvigodithi force-pushed the agent branch 2 times, most recently from 1ce3fe3 to 6fd5c74 on April 15, 2025 21:32
❌ Gradle check result for 6fd5c74: FAILURE


❕ Gradle check result for 572728a: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.


codecov bot commented Apr 16, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 73.31%. Comparing base (cbaddd3) to head (3a74dc5).
Report is 11 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #17939      +/-   ##
============================================
+ Coverage     72.51%   73.31%   +0.80%     
- Complexity    67108    67743     +635     
============================================
  Files          5475     5478       +3     
  Lines        309916   310034     +118     
  Branches      45060    45066       +6     
============================================
+ Hits         224725   227316    +2591     
+ Misses        66895    64776    -2119     
+ Partials      18296    17942     -354     


❕ Gradle check result for 572728a: UNSTABLE

❕ Gradle check result for 572728a: UNSTABLE

❌ Gradle check result for 572728a: FAILURE

@prudhvigodithi (Member Author)

I would like to see this merged, as it would give us some bandwidth and unblock contribution PRs. I have created META issue #17974, which discusses more mechanisms to control the flaky tests.
@andrross @reta @getsaurabh02

@prudhvigodithi force-pushed the agent branch 3 times, most recently from 1b38b33 to 5563172 on April 18, 2025 18:22
Signed-off-by: Prudhvi Godithi <[email protected]>
❌ Gradle check result for d9760c8: FAILURE


✅ Gradle check result for 3a74dc5: SUCCESS

❕ Gradle check result for 3a74dc5: UNSTABLE


✅ Gradle check result for 3a74dc5: SUCCESS


✅ Gradle check result for 3a74dc5: SUCCESS

@andrross (Member)

@prudhvigodithi You should probably update the PR description to reflect the latest approach here ("Retry all tests, but reduce the number of retries to 1").

@mch2 @msfroh @jainankitk @bugmakerrrrrr @ashking94 @cwperks @reta Tagging some other maintainers here. This is a pretty consequential change in terms of how we're enforcing code quality. I don't love the idea of new tests being added that fail quite often but the retry logic kicks in so it won't cause any friction and we'll never need to look at fixing it (*). At the same time, in practice we are introducing lots of flakiness and are just retrying them manually anyway. Overall, I'm sort of neutral about this change given the realities of the situation, but if someone else thinks this is the right way to go I wouldn't block it.

(*) If #17974 can mitigate this problem, I'd have no concerns with this change. But we haven't actually implemented anything yet.

@prudhvigodithi changed the title from "Update the gradle retry" to "Retry All Tests with a Reduced Retry Limit of 1" on Apr 21, 2025
@reta (Contributor) commented Apr 22, 2025

@mch2 @msfroh @jainankitk @bugmakerrrrrr @ashking94 @cwperks @reta Tagging some other maintainers here. This is a pretty consequential change in terms of how we're enforcing code quality. I don't love the idea of new tests being added that fail quite often but the retry logic kicks in so it won't cause any friction and we'll never need to look at fixing it (*).

Personally, I was always against this approach and still am. Realistically, every large codebase faces flaky tests every now and then, but promptly addressing those is the key to keeping codebases healthy. I do have one (somewhat radical idea) though: what if we remove all flaky tests from the test suites (isolate them in flakyXxxTest subfolders), establish clean baseline, and start working on them one by one?

@ashking94 (Member) commented Apr 22, 2025

@mch2 @msfroh @jainankitk @bugmakerrrrrr @ashking94 @cwperks @reta Tagging some other maintainers here. This is a pretty consequential change in terms of how we're enforcing code quality. I don't love the idea of new tests being added that fail quite often but the retry logic kicks in so it won't cause any friction and we'll never need to look at fixing it (*). At the same time, in practice we are introducing lots of flakiness and are just retrying them manually anyway. Overall, I'm sort of neutral about this change given the realities of the situation, but if someone else thinks this is the right way to go I wouldn't block it.

While this may reduce PR build failures in the near term, the retry would hide the lurking issues/flaky tests underneath. Also, with time, we may end up with the same ratio of failed builds if flaky tests are added unchecked. I also like the idea of running new tests for a higher number of iterations, as mentioned in #17974. At the same time, I think it may be okay to go ahead with this, but with a plan to undo the change in the near future after implementing #17974.

@prudhvigodithi (Member Author)

Thanks for the feedback. With the known list of flaky tests we can always update the includeClasses.add section to retry only the flaky ones. While the effort to fix and reduce the flaky tests continues in parallel, I'm fine with updating the PR to retry only the known flaky tests via the includeClasses.add section rather than a global retry.

Also, today every Gradle check failure, irrespective of whether the test passed on retry, is ingested into the OpenSearch Metrics cluster, so flaky-test information is not missed.

For example, for the build https://build.ci.opensearch.org/job/gradle-check/56861/, even though the CI would be green because one of the tests passed on retry, the dashboard will still have the failure information.

Thanks

@mch2 (Member) commented Apr 22, 2025

@mch2 @msfroh @jainankitk @bugmakerrrrrr @ashking94 @cwperks @reta Tagging some other maintainers here. This is a pretty consequential change in terms of how we're enforcing code quality. I don't love the idea of new tests being added that fail quite often but the retry logic kicks in so it won't cause any friction and we'll never need to look at fixing it (*).

Agree with what's been said so far; let's not sweep more under the rug. I'd also rather we not make a habit of continuously adding to this list, and instead make a focused effort to drive down the count.

I do have one (somewhat radical idea) though: what if we remove all flaky tests from the test suites (isolate them in flakyXxxTest subfolders), establish clean baseline, and start working on them one by one?

I like the intent here in creating a clear list to address, but I think we could use the existing list here in build.gradle for this rather than physically moving them? With that said the tests on this list have been left unaddressed for a while because the retries hide the pain.

@andrross (Member)

I do have one (somewhat radical idea) though: what if we remove all flaky tests from the test suites (isolate them in flakyXxxTest subfolders), establish clean baseline, and start working on them one by one?

I like the intent here in creating a clear list to address, but I think we could use the existing list here in build.gradle for this rather than physically moving them? With that said the tests on this list have been left unaddressed for a while because the retries hide the pain.

Yeah, the original list was intended to be pretty much this idea. Unfortunately the "clean baseline" didn't stay so clean...

@owaiskazi19 (Member)

remove all flaky tests from the test suites (isolate them in flakyXxxTest subfolders)

Instead of global retries, we could have retries only for the tests present in the above subfolder. This can give us the best of both worlds: stable CI for known issues while maintaining visibility of new flakiness. Retries can be treated as a temporary solution until we start clearing the list from flakyXxxTest.
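A rough sketch of that idea, assuming the isolated tests end up under a dedicated package (the org.opensearch.flaky name below is purely hypothetical) and that the existing test-retry filter in build.gradle is reused to target only that location:

subprojects {
    tasks.withType(Test).configureEach {
        retry {
            maxRetries = 3
            filter {
                // Only tests isolated as known-flaky are retried; everything else fails fast
                // and stays visible. The package/pattern below is hypothetical.
                includeClasses.add("org.opensearch.flaky.*")
            }
        }
    }
}

This keeps the retry budget scoped to the known-flaky set, while new flakiness elsewhere still breaks the build.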

@reta (Contributor) commented Apr 22, 2025

I like the intent here in creating a clear list to address, but I think we could use the existing list here in build.gradle for this rather than physically moving them? With that said the tests on this list have been left unaddressed for a while because the retries hide the pain.

Thanks, that's exactly the issue @mch2: the list is growing and the flakiness is unaddressed (we have tried many times to get back on track, as @andrross said, but no luck). With such a radical solution, we would revert the change(s) as fast as possible once flakiness arises.

@prudhvigodithi (Member Author)

what if we remove all flaky tests from the test suites (isolate them in flakyXxxTest subfolders),

@andrross Should we take an approach similar to https://github.com/opensearch-project/OpenSearch/pull/18057/files by adding the AwaitsFix annotation, and start clean while the flaky tests are fixed in parallel?

I see some valid points in not adding flaky tests to the includeClasses.add section.

@andrross (Member) commented Apr 23, 2025

@prudhvigodithi We should definitely make use of @AwaitsFix where appropriate. You can see the code base is currently littered with it:

% git grep '@AwaitsFix' | wc -l
55

However, we have no real mechanism to currently burn down the list of ignored tests. Also I believe the automation will automatically close the flaky test issues eventually once they stop failing due to being skipped.

In the case of #18057 I don't really have a concern with muting it because it is a new test. However, we have cases like SimpleSearchIT where a recent change introduces flakiness, and I would be a little more concerned about ignoring a general test like that because it might lead to introducing even more bugs.

@prudhvigodithi (Member Author) commented Apr 23, 2025

Also I believe the automation will automatically close the flaky test issues eventually once they stop failing due to being skipped.

Got it. The automation will not re-open an issue once it is closed and no new flaky failures are seen. Today the close action is on the maintainer or the assignee once the test is fixed.

So based on the discussion, the following are some options:

  • Go with @AwaitsFix for flaky tests so that contribution PRs are not blocked, with a plan to fix the flaky tests.
  • Mix @AwaitsFix and retries for some tests like SimpleSearchIT via the includeClasses.add section, still with a plan to fix the flaky tests.

Also, it looks to me like it's possible to add a workflow that retries the new or modified tests a few times before we merge the PR. This is for #17974, to have some mechanism to detect the failures early. At a high level, here is what I was trying to do:

# Run one test method 500 times, each iteration with a fresh random seed
for i in {1..500}; do
  # 8 random bytes -> 16 hex characters, uppercased to match Gradle's seed format
  SEED=$(od -vAn -N8 -tx8 < /dev/urandom | tr -d ' ' | tr 'a-f' 'A-F')
  SEED=$(printf "%016s" "$SEED")
  ./gradlew ':server:test' --tests "org.opensearch.index.query.MultiMatchQueryBuilderTests.testToQueryBoost" -Dtests.seed=$SEED
done
  • From the open PR find the new or modified tests.
  • Find the Gradle module of the test.
  • Run the entire test class or just the modified or new test methods.

We can make this more robust by extending it as a plugin and running each test in parallel on multiple containers (or EC2 instances).

@andrross (Member)

So based on the discussion, the following are some options:

  • Go with @AwaitsFix for flaky tests so that contribution PRs are not blocked, with a plan to fix the flaky tests.

  • Mix @AwaitsFix and retries for some tests like SimpleSearchIT via the includeClasses.add section, still with a plan to fix the flaky tests.

@prudhvigodithi Being pretty aggressive with muting failing tests with @AwaitsFix is good for newly introduced tests or tests of a specific feature that has recently been changed. Assuming the contributor is still engaged on the feature in question we can follow up with them to investigate the fix.

For issues like SimpleSearchIT it's a little more complex. That test is very broad and muting it might meaningfully reduce test coverage across big parts of the code base. We want to be a little more careful in those cases. For this case, we had a deterministic repro and I was able to use git bisect to pin it down to a specific commit, which meant I could ping the contributor to fix it. We could explore automation or tooling for making that process faster/easier. However, I don't think we should normalize expanding the list of tests to retry, for all the reasons mentioned above.
