Skip to content

[AUTOCUT] Gradle Check Flaky Test Report for SimpleSearchIT #16851

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
opensearch-ci-bot opened this issue Dec 16, 2024 · 17 comments · Fixed by #18018
Open

[AUTOCUT] Gradle Check Flaky Test Report for SimpleSearchIT #16851

opensearch-ci-bot opened this issue Dec 16, 2024 · 17 comments · Fixed by #18018
Labels
autocut flaky-test Random test failure that succeeds on second run >test-failure Test failure from CI, local build, etc. untriaged

Comments

@opensearch-ci-bot
Copy link
Collaborator

opensearch-ci-bot commented Dec 16, 2024

Flaky Test Report for SimpleSearchIT

Noticed the SimpleSearchIT has some flaky, failing tests that failed during post-merge actions.

Details

Git Reference Merged Pull Request Build Details Test Name
069c2d8 18097 57263 org.opensearch.search.simple.SimpleSearchIT.testSimpleTerminateAfterTrackTotalHitsUpToSize {p0={"search.concurrent_segment_search.enabled":"false"}}

org.opensearch.search.simple.SimpleSearchIT.testSimpleTerminateAfterTrackTotalHitsUpToSize {p0={"search.concurrent_segment_search.enabled":"true"}}
0d1bb9b 18035 56909 org.opensearch.search.simple.SimpleSearchIT.testSimpleTerminateAfterTrackTotalHitsUpToSize {p0={"search.concurrent_segment_search.enabled":"false"}}

org.opensearch.search.simple.SimpleSearchIT.testSimpleTerminateAfterTrackTotalHitsUpToSize {p0={"search.concurrent_segment_search.enabled":"true"}}
15b702d 18042 56934 org.opensearch.search.simple.SimpleSearchIT.testSimpleTerminateAfterTrackTotalHitsUpToSize {p0={"search.concurrent_segment_search.enabled":"false"}}

org.opensearch.search.simple.SimpleSearchIT.testSimpleTerminateAfterTrackTotalHitsUpToSize {p0={"search.concurrent_segment_search.enabled":"true"}}
18a3b75 17796 56046 org.opensearch.search.simple.SimpleSearchIT.testSimpleTerminateAfterTrackTotalHitsUpToSize {p0={"search.concurrent_segment_search.enabled":"false"}}

org.opensearch.search.simple.SimpleSearchIT.testSimpleTerminateAfterTrackTotalHitsUpToSize {p0={"search.concurrent_segment_search.enabled":"true"}}
3a7d8fa 17976 56634 org.opensearch.search.simple.SimpleSearchIT.testSimpleTerminateAfterTrackTotalHitsUpToSize {p0={"search.concurrent_segment_search.enabled":"false"}}

org.opensearch.search.simple.SimpleSearchIT.testSimpleTerminateAfterTrackTotalHitsUpToSize {p0={"search.concurrent_segment_search.enabled":"true"}}
436038d 18041 56989 org.opensearch.search.simple.SimpleSearchIT.testSimpleTerminateAfterTrackTotalHitsUpToSize {p0={"search.concurrent_segment_search.enabled":"false"}}

org.opensearch.search.simple.SimpleSearchIT.testSimpleTerminateAfterTrackTotalHitsUpToSize {p0={"search.concurrent_segment_search.enabled":"true"}}
59a6efd 18036 57036 org.opensearch.search.simple.SimpleSearchIT.testSimpleTerminateAfterTrackTotalHitsUpToSize {p0={"search.concurrent_segment_search.enabled":"false"}}

org.opensearch.search.simple.SimpleSearchIT.testSimpleTerminateAfterTrackTotalHitsUpToSize {p0={"search.concurrent_segment_search.enabled":"true"}}
8cc58b5 18058 57023 org.opensearch.search.simple.SimpleSearchIT.testSimpleTerminateAfterTrackTotalHitsUpToSize {p0={"search.concurrent_segment_search.enabled":"false"}}

org.opensearch.search.simple.SimpleSearchIT.testSimpleTerminateAfterTrackTotalHitsUpToSize {p0={"search.concurrent_segment_search.enabled":"true"}}
967eee1 17216 56041 org.opensearch.search.simple.SimpleSearchIT.testSimpleTerminateAfterTrackTotalHitsUpToSize {p0={"search.concurrent_segment_search.enabled":"false"}}

org.opensearch.search.simple.SimpleSearchIT.testSimpleTerminateAfterTrackTotalHitsUpToSize {p0={"search.concurrent_segment_search.enabled":"true"}}
b842e48 17902 56242 org.opensearch.search.simple.SimpleSearchIT.testSimpleTerminateAfterTrackTotalHitsUpToSize {p0={"search.concurrent_segment_search.enabled":"false"}}

org.opensearch.search.simple.SimpleSearchIT.testSimpleTerminateAfterTrackTotalHitsUpToSize {p0={"search.concurrent_segment_search.enabled":"true"}}
c44d230 17749 56412 org.opensearch.search.simple.SimpleSearchIT.testSimpleTerminateAfterTrackTotalHitsUpToSize {p0={"search.concurrent_segment_search.enabled":"false"}}

org.opensearch.search.simple.SimpleSearchIT.testSimpleTerminateAfterTrackTotalHitsUpToSize {p0={"search.concurrent_segment_search.enabled":"true"}}
e61b6ab 18005 56703 org.opensearch.search.simple.SimpleSearchIT.testSimpleTerminateAfterTrackTotalHitsUpToSize {p0={"search.concurrent_segment_search.enabled":"false"}}

org.opensearch.search.simple.SimpleSearchIT.testSimpleTerminateAfterTrackTotalHitsUpToSize {p0={"search.concurrent_segment_search.enabled":"true"}}
e7ed33f 17978 57115 org.opensearch.search.simple.SimpleSearchIT.testSimpleTerminateAfterTrackTotalHitsUpToSize {p0={"search.concurrent_segment_search.enabled":"false"}}

org.opensearch.search.simple.SimpleSearchIT.testSimpleTerminateAfterTrackTotalHitsUpToSize {p0={"search.concurrent_segment_search.enabled":"true"}}
693c788 17921 56337 org.opensearch.search.simple.SimpleSearchIT.testSimpleTerminateAfterTrackTotalHitsUpToSize {p0={"search.concurrent_segment_search.enabled":"true"}}
caf5d71 18039 56924 org.opensearch.search.simple.SimpleSearchIT.testSimpleTerminateAfterTrackTotalHitsUpToSize {p0={"search.concurrent_segment_search.enabled":"true"}}
d18982c 17912 56693 org.opensearch.search.simple.SimpleSearchIT.testSimpleTerminateAfterTrackTotalHitsUpToSize {p0={"search.concurrent_segment_search.enabled":"true"}}
d29e95c 17882 56063 org.opensearch.search.simple.SimpleSearchIT.testSimpleTerminateAfterTrackTotalHitsUpToSize {p0={"search.concurrent_segment_search.enabled":"false"}}
d37cc9b 16831 51449 org.opensearch.search.simple.SimpleSearchIT.classMethod
eba12fa 17370 56995 org.opensearch.search.simple.SimpleSearchIT.testSimpleTerminateAfterTrackTotalHitsUpToSize {p0={"search.concurrent_segment_search.enabled":"false"}}

The other pull requests, besides those involved in post-merge actions, that contain failing tests with the SimpleSearchIT class are:

For more details on the failed tests refer to OpenSearch Gradle Check Metrics dashboard.

@opensearch-ci-bot opensearch-ci-bot added >test-failure Test failure from CI, local build, etc. autocut flaky-test Random test failure that succeeds on second run untriaged labels Dec 16, 2024
@sandeshkr419
Copy link
Contributor

Looking into build logs:

[2024-12-16T08:20:25,037][ERROR][o.o.i.s.IndexShard       ] [node_t0] [test][0] Error Fetching SegmentInfos and latest checkpoint
org.apache.lucene.store.AlreadyClosedException: engine is closed
	at org.opensearch.index.shard.IndexShard.getEngine(IndexShard.java:3728) ~[main/:?]
	at org.opensearch.index.shard.IndexShard.getSegmentInfosSnapshot(IndexShard.java:5367) ~[main/:?]
	at org.opensearch.index.shard.IndexShard.getLatestSegmentInfosAndCheckpoint(IndexShard.java:1728) [main/:?]
	at org.opensearch.index.shard.IndexShard.updateReplicationCheckpoint(IndexShard.java:4868) [main/:?]
	at org.opensearch.index.shard.IndexShard$ReplicationCheckpointUpdater.afterRefresh(IndexShard.java:4862) [main/:?]
	at org.apache.lucene.search.ReferenceManager.notifyRefreshListenersRefreshed(ReferenceManager.java:275) [lucene-core-9.12.0.jar:9.12.0 e913796758de3d9b9440669384b29bec07e6a5cd - 2024-09-25 16:37:02]
	at org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:182) [lucene-core-9.12.0.jar:9.12.0 e913796758de3d9b9440669384b29bec07e6a5cd - 2024-09-25 16:37:02]
	at org.apache.lucene.search.ReferenceManager.maybeRefreshBlocking(ReferenceManager.java:240) [lucene-core-9.12.0.jar:9.12.0 e913796758de3d9b9440669384b29bec07e6a5cd - 2024-09-25 16:37:02]
	at org.opensearch.index.engine.InternalEngine.refresh(InternalEngine.java:1771) [main/:?]
	at org.opensearch.index.engine.InternalEngine.flush(InternalEngine.java:1886) [main/:?]
	at org.opensearch.index.engine.Engine.flush(Engine.java:1214) [main/:?]
	at org.opensearch.index.engine.Engine.flushAndClose(Engine.java:2002) [main/:?]
	at org.opensearch.index.shard.IndexShard.close(IndexShard.java:2056) [main/:?]
	at org.opensearch.index.IndexService.closeShard(IndexService.java:805) [main/:?]
	at org.opensearch.index.IndexService.removeShard(IndexService.java:781) [main/:?]
	at org.opensearch.index.IndexService.close(IndexService.java:495) [main/:?]
	at org.opensearch.indices.IndicesService.removeIndex(IndicesService.java:1160) [main/:?]
	at org.opensearch.indices.cluster.IndicesClusterStateService.removeIndices(IndicesClusterStateService.java:443) [main/:?]
	at org.opensearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:284) [main/:?]
	at org.opensearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:637) [main/:?]
	at org.opensearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:623) [main/:?]
	at org.opensearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:589) [main/:?]
	at org.opensearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:510) [main/:?]
	at org.opensearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:205) [main/:?]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:932) [main/:?]
	at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedOpenSearchThreadPoolExecutor.java:283) [main/:?]
	at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedOpenSearchThreadPoolExecutor.java:246) [main/:?]
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
	at java.base/java.lang.Thread.run(Thread.java:1575) [?:?]
[2024-12-16T08:20:25,057][WARN ][o.o.c.r.a.AllocationService] [node_t0] Falling back to single shard assignment since batch mode 

From initial look, it looks more like OpenSearch process failure or corrupted index failure.

@bowenlan-amzn
Copy link
Member

SimpleSearchIT.testSimpleTerminateAfterTrackTotalHitsUpToSize is flaky

@prudhvigodithi
Copy link
Member

I was able to re-produce this with a seed

./gradlew ':server:internalClusterTest' --tests "org.opensearch.search.simple.SimpleSearchIT"  -Dtests.seed=7560B20A7AC502D

@getsaurabh02 @andrross

@andrross
Copy link
Member

This is another case that looks like it was introduced by #17769. FYI @harshavamsi

@prudhvigodithi
Copy link
Member

prudhvigodithi commented Apr 25, 2025

Coming from the Dashboard seen a new failure in #18069 (reference), not sure if the impacted PR has rebased the fix.

@prudhvigodithi
Copy link
Member

prudhvigodithi commented Apr 28, 2025

Seen a failure in post merge action, coming from 069c2d8. Build failure log https://build.ci.opensearch.org/job/gradle-check/57263/.
Able to re-produce with ./gradlew ':server:internalClusterTest' --tests "org.opensearch.search.simple.SimpleSearchIT" -Dtests.seed=9D84159A50CA98F1

@harshavamsi
Copy link
Contributor

#18075 -- the fix has not been backported to 3.0 yet, that's why we are still seeing flakiness

@andrross
Copy link
Member

andrross commented May 7, 2025

I just saw what looks like a new kind of failure here:

REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.search.simple.SimpleSearchIT" -Dtests.method="testSimpleTerminateAfterTrackTotalHitsUpToSize {p0={"search.concurrent_segment_search.enabled":"true"}}" -Dtests.seed=F888A6BE313526C9 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=se-SE -Dtests.timezone=Etc/GMT+0 -Druntime.java=21

SimpleSearchIT > testSimpleTerminateAfterTrackTotalHitsUpToSize {p0={"search.concurrent_segment_search.enabled":"true"}} FAILED
    java.lang.AssertionError: expected:<GREATER_THAN_OR_EQUAL_TO> but was:<EQUAL_TO>
        at __randomizedtesting.SeedInfo.seed([F888A6BE313526C9:161A85B12E222BF5]:0)
        at org.junit.Assert.fail(Assert.java:89)
        at org.junit.Assert.failNotEquals(Assert.java:835)
        at org.junit.Assert.assertEquals(Assert.java:120)
        at org.junit.Assert.assertEquals(Assert.java:146)
        at org.opensearch.search.simple.SimpleSearchIT.doTestSimpleTerminateAfterTrackTotalHitsUpTo(SimpleSearchIT.java:435)
        at org.opensearch.search.simple.SimpleSearchIT.testSimpleTerminateAfterTrackTotalHitsUpToSize(SimpleSearchIT.java:443)

@prudhvigodithi Any chance this is related to your recent change?

@prudhvigodithi
Copy link
Member

@prudhvigodithi Any chance this is related to your recent change?

I see what is happening with a range query (when approximation was not introduced) its always "relation": "gte". Through approximation the query execution is optimized to count more efficiently and since the count is exactly 5 (matching the trackTotalHitsUpTo value), the relation is set to "eq"

curl -XPOST -H'Content-type: application/json' -kv localhost:9200/big5/_search -d '{
  "query": {
    "range": {
      "@timestamp": {
        "gte": "2023-01-01T00:00:00",
        "lt": "2023-01-03T00:00:00"
      }
    }
  },
  "track_total_hits": 5,
  "size": 4
}'

This is not a bug but the behavior of approximation as it only counts and returns the documents as required, we see this issue after this PR #18189 because the hack of adding +1 is removed. With +1 in the code and with approximation now the total return docs is 6, hence the tests are green with "relation": "gte".

@prudhvigodithi
Copy link
Member

From my previous comment #16851 (comment)

 curl -XPOST -H'Content-type: application/json' -kv localhost:9200/big5/_search -d '{
  "query": {
    "range": {
      "@timestamp": {
        "gte": "2023-01-01T00:00:00",
        "lt": "2023-01-03T00:00:00"
      }
    }
  },
  "track_total_hits": 5,
  "size": 6
}' | jq '.'

The searchResponse.getHits().getTotalHits().relation() depends on track_total_hits value, even with approximation when size is greater than track_total_hits, the relation": "gte" which is expected because with approximation the logic collects number of docs max of track_total_hits or size.

Let me see if I can update the tests based on this behavior or just add +1 to the code.
Thanks
@getsaurabh02 @harshavamsi

@harshavamsi
Copy link
Contributor

@prudhvigodithi Let's update the test instead of updating the approximation. Approximation is default now and any test using MatchAll or RangeQuery will need to be updated to reflect that.

@prudhvigodithi
Copy link
Member

prudhvigodithi commented May 7, 2025

Yes I'm more inclined to update the test, but going over the PR #18018 and comments to see why just +1 was added.
Thanks

@bowenlan-amzn
Copy link
Member

Note the assertion is the expected behavior, the relation should always be GREATER_THAN_OR_EQUAL_TO. Because there are more hits than 5.

The reason why +1 works is because the ensures the scorer is larger than size by 1, so the TotalHitCountCollector can flip the relation from equal to GREATER_THAN_OR_EQUAL_TO.

@prudhvigodithi your last change make sure the hits.total is short cut for match all query. But not range query, which called out #18189 (comment)
I was suggesting to also short cut for approximate range query. #18189 (comment)

But I know I feel +1 is a better approach in general, also doesn't add any cost anyway.

@bowenlan-amzn
Copy link
Member

On the other hand, I thought range query theoretically should hit this https://github.com/apache/lucene/blob/fafd6af004e0c39582043b797555d6eeb9aa7638/lucene/core/src/java/org/apache/lucene/search/TotalHitCountCollector.java#L47-L52
And use that shortcut to give the hits.total, weight.count(context)
But somehow the test is not hitting that, this worth to check more @prudhvigodithi

@bowenlan-amzn
Copy link
Member

We should refactor this test by separating test size < track_total_size and >, then it won't be a flaky one, but consistent failing.

@prudhvigodithi
Copy link
Member

prudhvigodithi commented May 8, 2025

From my previous comments #16851 (comment) #16851 (comment) I was able to get the failure with E6CD5109E5DA1B20 seed as well, this time even with size is 4, which is less that the trackTotalHitsUpTo (which is 5). What is expected is the relation to be "relation": "eq", but got "relation": "gte".

So the test is not flaky here but the visit methods part of ApproximatePointRangeQuery does not exactly cap the size defined on the query (or when the default is 10k), there are sometimes a little extra docs collected part of the BKD walk hence the above test with size set to 4 and trackTotalHitsUpTo is 5, I can see "relation": "gte".

Having this I would go for adding +1 which consistently gives "relation": "gte" and honors the existing behavior (similar to before approximation). I have updated the PR to add the +1 #18235.

Thanks

@prudhvigodithi
Copy link
Member

prudhvigodithi commented May 8, 2025

Thanks @bowenlan-amzn yes today even before approximation the range queries does not go short cut, only the match_all queries does, hence in my PR #18189 I have fixed the match_all behavior to be the same with or without approximation. So it should not only add some more performance improvement (early termination + the early competitive iterator benefit) but also will get back the same behavior.

Next I will take a look at short cut of approximate range query.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
autocut flaky-test Random test failure that succeeds on second run >test-failure Test failure from CI, local build, etc. untriaged
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants