Description
I conducted performance tests in 3 scenarios on the sentinel-2-l2a collection, which has around 42 million products and 480GB of data.
The performance test consisted of maintaining 200 requests per second, querying the sentinel-2-l2a collection with random date ranges and random geometry intersections, limit 100.
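For reference, here is a minimal sketch of the kind of request body used per test iteration, assuming a standard STAC API `/search` POST endpoint. The date window, coordinate bounds and polygon size are illustrative, not the exact values used in the test:

```python
import random
from datetime import datetime, timedelta

def random_search_body() -> dict:
    """Build one STAC /search payload with a random date range and a random polygon."""
    start = datetime(2015, 7, 1) + timedelta(days=random.randint(0, 3000))
    end = start + timedelta(days=random.randint(1, 30))
    lon = random.uniform(-180.0, 170.0)
    lat = random.uniform(-80.0, 70.0)
    return {
        "collections": ["sentinel-2-l2a"],
        "datetime": f"{start:%Y-%m-%dT%H:%M:%SZ}/{end:%Y-%m-%dT%H:%M:%SZ}",
        "intersects": {
            "type": "Polygon",
            "coordinates": [[
                [lon, lat], [lon + 5, lat], [lon + 5, lat + 5],
                [lon, lat + 5], [lon, lat],
            ]],
        },
        "limit": 100,
    }
```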
Scenario I - 100GB indexes, 8 shards:
- Average response time: 5.57 s
- Average transfer: 400 MB/s
- HTTP 200 responses: 18967
- HTTP 500 responses: 396
- CPU usage on the OpenSearch server: 98-100%
Scenario II - 1 index, 1 shard:
- Average response time: 1.14 s
- Average transfer: 767 MB/s
- HTTP 200 responses: 43599
- HTTP 500 responses: 10040
- CPU usage on the OpenSearch server: 98-100%
Scenario III - quarterly indexes based on product datetime and collection (e.g. items_sentinel-2-l2a-Q1-2015); in the API we specify which OpenSearch indexes to search (see the sketch after the note below):
- Average response time: 732.72 ms
- Average transfer: 1.2 GB/s
- HTTP 200 responses: 53614
- HTTP 500 responses: 0
- CPU usage on the OpenSearch server: around 60%
Note: data distribution was uneven, e.g. Q1 2015 had X times less data than Q1 2022.
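For clarity, here is a minimal sketch of how a query's datetime range could be mapped to the quarterly index names used in Scenario III. The items_<collection>-Q<n>-<year> pattern follows the example above; everything else is an illustrative assumption:

```python
from datetime import datetime

def quarterly_indexes(collection: str, start: datetime, end: datetime) -> list[str]:
    """Return the quarterly index names covering [start, end] for one collection."""
    indexes = []
    year, quarter = start.year, (start.month - 1) // 3 + 1
    end_key = (end.year, (end.month - 1) // 3 + 1)
    while (year, quarter) <= end_key:
        indexes.append(f"items_{collection}-Q{quarter}-{year}")
        quarter += 1
        if quarter > 4:
            quarter, year = 1, year + 1
    return indexes

# quarterly_indexes("sentinel-2-l2a", datetime(2015, 1, 10), datetime(2015, 8, 1))
# -> ["items_sentinel-2-l2a-Q1-2015", "items_sentinel-2-l2a-Q2-2015", "items_sentinel-2-l2a-Q3-2015"]
```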
Conclusions:
The best-performing solution is specifying in the API which OpenSearch/Elasticsearch indexes should be searched - users usually search by collection and datetime.
Solution:
Sentinel-2 L2A has much larger products than other collections, so to avoid a situation where some indexes end up at 2GB while others at 30GB:
- We index around 30GB per index - approximately, because we fill up to a full day, so each new index starts at a full day boundary; an index then looks like X-2025-04-01:2025-07-03.
- In the API we resolve the indexes behind the alias and cache them, so we don't execute this lookup on every request.
- In the API we specify which indexes to search in (see the sketch after this list).
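A minimal sketch of the proposed API-side flow, under these assumptions: index names carry their date range as <prefix>-YYYY-MM-DD:YYYY-MM-DD (as in the example above), the alias-to-indexes mapping is cached with a TTL, and opensearch-py's `indices.get_alias` is used for the lookup. Function names and the cache policy are illustrative, not the final implementation:

```python
import re
import time
from datetime import date
from opensearchpy import OpenSearch

# Matches the assumed date-range suffix of an index name, e.g. X-2025-04-01:2025-07-03
INDEX_RANGE_RE = re.compile(r"(\d{4}-\d{2}-\d{2}):(\d{4}-\d{2}-\d{2})$")
CACHE_TTL_SECONDS = 300  # assumption: refresh the alias -> indexes mapping every 5 minutes
_alias_cache: dict[str, tuple[float, list[str]]] = {}

def indexes_for_alias(client: OpenSearch, alias: str) -> list[str]:
    """Return the concrete index names behind an alias, cached to avoid a lookup per request."""
    now = time.monotonic()
    cached = _alias_cache.get(alias)
    if cached and now - cached[0] < CACHE_TTL_SECONDS:
        return cached[1]
    names = list(client.indices.get_alias(name=alias).keys())
    _alias_cache[alias] = (now, names)
    return names

def indexes_for_range(index_names: list[str], start: date, end: date) -> list[str]:
    """Keep only the indexes whose encoded date range overlaps [start, end]."""
    selected = []
    for name in index_names:
        match = INDEX_RANGE_RE.search(name)
        if not match:
            selected.append(name)  # no range in the name: search it anyway to stay correct
            continue
        idx_start, idx_end = (date.fromisoformat(d) for d in match.groups())
        if idx_start <= end and start <= idx_end:
            selected.append(name)
    return selected
```

The search request would then be sent only to the indexes returned by `indexes_for_range`, instead of the whole alias.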
Are you open to this change? If so, I'd be happy to prepare an MR with these changes - it might take some time since it's a big change.