Description
I conducted performance tests in 3 scenarios on the sentinel-2-l2a collection, which has around 42 million products and 480GB of data.
The performance test consisted of maintaining 200 requests per second, querying the sentinel-2-l2a collection with random date ranges and random geometry intersections, limit 100.
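For reference, here is a minimal sketch of the kind of request body used per test iteration, assuming a standard STAC API `/search` POST endpoint. The date window, coordinate bounds and polygon size are illustrative, not the exact values used in the test:

```python
import random
from datetime import datetime, timedelta

def random_search_body() -> dict:
    """Build one STAC /search payload with a random date range and a random polygon."""
    start = datetime(2015, 7, 1) + timedelta(days=random.randint(0, 3000))
    end = start + timedelta(days=random.randint(1, 30))
    lon = random.uniform(-180.0, 170.0)
    lat = random.uniform(-80.0, 70.0)
    return {
        "collections": ["sentinel-2-l2a"],
        "datetime": f"{start:%Y-%m-%dT%H:%M:%SZ}/{end:%Y-%m-%dT%H:%M:%SZ}",
        "intersects": {
            "type": "Polygon",
            "coordinates": [[
                [lon, lat], [lon + 5, lat], [lon + 5, lat + 5],
                [lon, lat + 5], [lon, lat],
            ]],
        },
        "limit": 100,
    }
```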
Scenario I - 100GB indexes, 8 shards:
- Average response time: 5.57 s
- Average transfer: 400 MB/s
- HTTP 200 responses: 18967
- HTTP 500 responses: 396
- CPU usage on the OpenSearch server: 98-100%
Scenario II - 1 index, 1 shard:
- Average response time: 1.14 s
- Average transfer: 767 MB/s
- HTTP 200 responses: 43599
- HTTP 500 responses: 10040
- CPU usage on the OpenSearch server: 98-100%
Scenario III - quarterly indexes based on product datetime and collection (e.g. items_sentinel-2-l2a-Q1-2015); in the API we specify which OpenSearch indexes to search (see the sketch after the note below):
- Average response time: 732.72 ms
- Average transfer: 1.2 GB/s
- HTTP 200 responses: 53614
- HTTP 500 responses: 0
- CPU usage on the OpenSearch server: around 60%
Note: data distribution was uneven, e.g. Q1 2015 had X times less data than Q1 2022.
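For clarity, here is a minimal sketch of how a query's datetime range could be mapped to the quarterly index names used in Scenario III. The items_<collection>-Q<n>-<year> pattern follows the example above; everything else is an illustrative assumption:

```python
from datetime import datetime

def quarterly_indexes(collection: str, start: datetime, end: datetime) -> list[str]:
    """Return the quarterly index names covering [start, end] for one collection."""
    indexes = []
    year, quarter = start.year, (start.month - 1) // 3 + 1
    end_key = (end.year, (end.month - 1) // 3 + 1)
    while (year, quarter) <= end_key:
        indexes.append(f"items_{collection}-Q{quarter}-{year}")
        quarter += 1
        if quarter > 4:
            quarter, year = 1, year + 1
    return indexes

# quarterly_indexes("sentinel-2-l2a", datetime(2015, 1, 10), datetime(2015, 8, 1))
# -> ["items_sentinel-2-l2a-Q1-2015", "items_sentinel-2-l2a-Q2-2015", "items_sentinel-2-l2a-Q3-2015"]
```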
Conclusions:
The best-performing solution is specifying in the API which OpenSearch/Elasticsearch indexes should be searched - users usually search by collection and datetime.
Solution:
Sentinel-2 L2A has much larger products than other collections, so to avoid a situation where some indexes end up at 2GB while others at 30GB:
- We index around 30GB per index - approximately, because we fill up to a full day, so each new index starts at a full day boundary; an index then looks like X-2025-04-01:2025-07-03.
- In the API we resolve the indexes behind the alias and cache them, so we don't execute this lookup on every request.
- In the API we specify which indexes to search in (see the sketch after this list).
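A minimal sketch of the proposed API-side flow, under these assumptions: index names carry their date range as <prefix>-YYYY-MM-DD:YYYY-MM-DD (as in the example above), the alias-to-indexes mapping is cached with a TTL, and opensearch-py's `indices.get_alias` is used for the lookup. Function names and the cache policy are illustrative, not the final implementation:

```python
import re
import time
from datetime import date
from opensearchpy import OpenSearch

# Matches the assumed date-range suffix of an index name, e.g. X-2025-04-01:2025-07-03
INDEX_RANGE_RE = re.compile(r"(\d{4}-\d{2}-\d{2}):(\d{4}-\d{2}-\d{2})$")
CACHE_TTL_SECONDS = 300  # assumption: refresh the alias -> indexes mapping every 5 minutes
_alias_cache: dict[str, tuple[float, list[str]]] = {}

def indexes_for_alias(client: OpenSearch, alias: str) -> list[str]:
    """Return the concrete index names behind an alias, cached to avoid a lookup per request."""
    now = time.monotonic()
    cached = _alias_cache.get(alias)
    if cached and now - cached[0] < CACHE_TTL_SECONDS:
        return cached[1]
    names = list(client.indices.get_alias(name=alias).keys())
    _alias_cache[alias] = (now, names)
    return names

def indexes_for_range(index_names: list[str], start: date, end: date) -> list[str]:
    """Keep only the indexes whose encoded date range overlaps [start, end]."""
    selected = []
    for name in index_names:
        match = INDEX_RANGE_RE.search(name)
        if not match:
            selected.append(name)  # no range in the name: search it anyway to stay correct
            continue
        idx_start, idx_end = (date.fromisoformat(d) for d in match.groups())
        if idx_start <= end and start <= idx_end:
            selected.append(name)
    return selected
```

The search request would then be sent only to the indexes returned by `indexes_for_range`, instead of the whole alias.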
Are you open to this change? If so, I'd be happy to prepare an MR with these changes - it might take some time since it's a big change.