Search optimization and indexing based on datetime #405

GrzegorzPustulka · 2025-06-18T13:05:33Z

Related Issue(s):

search optimization #401

Index Management System with Time-based Partitioning

Description

This PR introduces a new index management system that enables automatic index partitioning based on dates and index size control with automatic splitting.

How it works

System Architecture

The system consists of several main components:

1. Search Engine Adapters

SearchEngineAdapter - base class
ElasticsearchAdapter and OpenSearchAdapter - implementations for specific engines

2. Index Selection Strategies

AsyncDatetimeBasedIndexSelector / SyncDatetimeBasedIndexSelector - date-based index filtering
UnfilteredIndexSelector - returns all indexes (fallback)
Cache with TTL (default 1 hour) for performance

3. Data Insertion Strategies

Simple strategy: one index per collection (behavior as before)
Datetime strategy: indexes partitioned by dates with automatic partitioning

Datetime Strategy - Operation Details

Index Format:

items_collection-name_2025-01-01-2025-03-31

Item Insertion Process:

System checks item date (properties.datetime)
Looks for existing index that covers this date
If not found - creates new index from this date
Checks target index size
If exceeds limit (DATETIME_INDEX_MAX_SIZE_GB) - splits index

Early Date Handling:
If item has date earlier than oldest index:

Creates new index from this earlier date
Updates oldest index alias to end one day before new date

Index Splitting:
When index exceeds size limit:

Updates current index alias to end on last item's date
Creates new index from next day
New items go to new index

Cache and Performance

IndexCacheManager:

Stores mapping of collection aliases to index lists
TTL default 1 hour
Automatic refresh on expiration
Manual refresh after index modifications

AsyncIndexAliasLoader / SyncIndexAliasLoader:

Load alias information from search engine
Use cache manager to store results
Async and sync versions for different usage contexts

Configuration

New Environment Variables:

# Enable datetime strategy (default false)
ENABLE_DATETIME_INDEX_FILTERING=true

# Maximum index size in GB before splitting (default 25)
DATETIME_INDEX_MAX_SIZE_GB=50

Usage Examples

Scenario 1: Adding items to new collection

First item with date 2025-01-15 → creates index items_collection_2025-01-15
Subsequent items with similar dates → go to same index

Scenario 2: Size limit exceeded

Index items_collection_2025-01-01 reaches 25GB
New item with date 2025-03-15 → system splits index:
- Old: items_collection_2025-01-01-2025-03-15
- New: items_collection_2025-03-16

Scenario 3: Item with early date

Existing index: items_collection_2025-02-01
New item with date 2024-12-15 → creates:
- New: items_collection_2024-12-15-2025-01-31

Search

System automatically filters indexes during search:

Query with date range:

{
  "datetime": {
    "gte": "2025-02-01",
    "lte": "2025-02-28"
  }
}

Searches only indexes containing items from this period, instead of all collection indexes.

Factories

IndexSelectorFactory:

Creates appropriate selector based on configuration
create_async_selector() / create_sync_selector()

IndexInsertionFactory:

Creates insertion strategy based on configuration
Automatically detects engine type and creates appropriate adapter

SearchEngineAdapterFactory:

Detects whether you're using Elasticsearch or OpenSearch
Creates appropriate adapter with engine-specific methods

Backward Compatibility

When ENABLE_DATETIME_INDEX_FILTERING=false → works as before
Existing indexes remain unchanged

All operations have sync and async versions for different usage contexts in the application.

PR Checklist:

Code is formatted and linted (run pre-commit run --all-files)
Tests pass (run make test)
Documentation has been updated to reflect changes, if applicable
Changes are added to the changelog

GrzegorzPustulka · 2025-07-08T10:41:01Z

@jonhealy1
@StijnCaerts
@jamesfisher-geo

The MR is already finished and ready for code review.

jonhealy1 · 2025-07-20T11:33:06Z

@GrzegorzPustulka There's a couple of conflicts now. They don't look too bad. I have been travelling but am going to try to review this in the next few days,

jonhealy1 · 2025-07-20T11:35:53Z

@jamesfisher-geo @StijnCaerts @rhysrevans3 Hi. Added you guys as reviewers if you have time to have a look :)

rhysrevans3

Looks okay to me but I have a couple of questions.

rhysrevans3 · 2025-07-21T07:16:45Z

stac_fastapi/core/stac_fastapi/core/core.py

+            logger.error(f"Invalid interval format: {datetime}, error: {e}")
+            datetime_search = None


Should this error be returned to the user rather than continuing the search without a datetime filter?

rhysrevans3 · 2025-07-21T07:24:59Z

stac_fastapi/core/stac_fastapi/core/core.py

+        except (ValueError, TypeError) as e:
+            # Handle invalid interval formats if return_date fails
+            logger.error(
+                f"Invalid interval format: {search_request.datetime}, error: {e}"
            )
+            datetime_search = None


rhysrevans3 · 2025-07-21T13:10:51Z

stac_fastapi/sfeos_helpers/stac_fastapi/sfeos_helpers/search_engine/adapters.py

+    def create_index_name(collection_id: str, start_date: str) -> str:
+        """Create index name from collection ID and start date.
+
+        Args:
+            collection_id (str): Collection identifier.
+            start_date (str): Start date for the index.
+
+        Returns:
+            str: Formatted index name.
+        """
+        cleaned = collection_id.translate(_ES_INDEX_NAME_UNSUPPORTED_CHARS_TABLE)
+        return f"{ITEMS_INDEX_PREFIX}{cleaned.lower()}_{start_date}"


Is this the equivalent of index_by_collection_id for the simple method? If it is should it not also include the hex of the collection_id and -000001?

What's the benefit of having the start datetime in the index name could you just have it in the alias with the end datetime? You could just use a count to prevent index name clashes.

You would then only need to create a new index when you exceed the max size and not for earlier items. If the item's start datetime is earlier or the end datetime is later than the current alias then update the alias.

rhysrevans3 · 2025-07-21T13:21:09Z

stac_fastapi/sfeos_helpers/stac_fastapi/sfeos_helpers/search_engine/selection/cache_manager.py

+    def __init__(self, cache_ttl_seconds: int = 3600):
+        """Initialize the cache manager.
+
+        Args:
+            cache_ttl_seconds (int): Time-to-live for cache entries in seconds.
+        """
+        self._cache: Optional[Dict[str, List[str]]] = None
+        self._timestamp: float = 0
+        self._ttl = cache_ttl_seconds


Would it be better to just update the cache as aliases are set/updated rather than polling ES every hour?

jamesfisher-geo

Overall looks great. I've got some comments around error handling and some cache handling as well.

This PR will add a lot of future maintenance burden in it's current form. How about we implement only in async and not include the sync code. That would cut down on repetitive code in this PR.

@jonhealy1 @GrzegorzPustulka what are your thoughts on this?

jamesfisher-geo · 2025-07-21T13:48:31Z

stac_fastapi/core/stac_fastapi/core/core.py

@@ -342,6 +348,7 @@ async def item_collection(
            sort=None,
            token=token,
            collection_ids=[collection_id],
+            datetime_search=datetime_search,


Is this needed? We apply the datetime_search to the search variable on line 331. If this is optional, could we omit it?

jamesfisher-geo · 2025-07-21T13:49:17Z

stac_fastapi/core/stac_fastapi/core/core.py

@@ -560,6 +574,7 @@ async def post_search(
            token=search_request.token,
            sort=sort,
            collection_ids=search_request.collections,
+            datetime_search=datetime_search,


Same here -- Is this needed? We apply the datetime_search to the search variable on line 513. If this is optional, could we omit it?

jamesfisher-geo · 2025-07-21T14:33:51Z

stac_fastapi/sfeos_helpers/stac_fastapi/sfeos_helpers/search_engine/adapters.py

+class ElasticsearchAdapter(SearchEngineAdapter):
+    """Elasticsearch-specific adapter implementation."""
+
+    async def create_simple_index(self, client: Any, collection_id: str) -> str:


The index mappings and setting are missing from ElasticsearchAdapter().create_simple_index(). Could you include the mappings here like is done in OpenSearchAdapter()._create_index_body()

The patterns for creating an index should be the same between ElasticsearchAdapter() and OpenSearchAdapter() IMO. How about creating a _create_index_body() method in ElasticsearchAdapter()?

jamesfisher-geo · 2025-07-21T15:00:15Z

stac_fastapi/sfeos_helpers/stac_fastapi/sfeos_helpers/search_engine/adapters.py

+        Returns:
+            SearchEngineType: Detected engine type.
+        """
+        return (


How about using isInstance() here rather than matching the string?

return ( OpenSearchAdapter() if isInstance(client, (OpenSearch, AsyncOpenSearch)) else ElasticsearchAdapter() )

jamesfisher-geo · 2025-07-21T15:08:37Z

stac_fastapi/sfeos_helpers/stac_fastapi/sfeos_helpers/search_engine/adapters.py

+    """Factory for creating search engine adapters."""
+
+    @staticmethod
+    def create_adapter(engine_type: SearchEngineType) -> SearchEngineAdapter:


Is this function necessary? See comment below

jamesfisher-geo · 2025-07-21T16:27:16Z

stac_fastapi/sfeos_helpers/stac_fastapi/sfeos_helpers/search_engine/managers.py

+            )
+        return product_datetime
+
+    async def handle_new_collection(


logging statements in handle_new_collection() and handle_new_collection_sync() would be useful

I definitely think we need to do a better job at logging on this project.

jamesfisher-geo · 2025-07-21T16:38:42Z

..._fastapi/sfeos_helpers/stac_fastapi/sfeos_helpers/search_engine/selection/async_selectors.py

+
+    _instance = None
+
+    def __new__(cls, client):


I'm a bit confused with this implementation. Maybe I am missing something. Could this be replaced with the normal method of instance creation using __init__()

def __init__(self, client: Any): self.cache_manager = IndexCacheManager() self.alias_loader = AsyncIndexAliasLoader(client, self.cache_manager)

jamesfisher-geo · 2025-07-21T17:45:18Z

stac_fastapi/sfeos_helpers/stac_fastapi/sfeos_helpers/search_engine/selection/cache_manager.py

+class IndexCacheManager:
+    """Manages caching of index aliases with expiration."""
+
+    def __init__(self, cache_ttl_seconds: int = 3600):


I believe some concurrency management is needed here because multiple threads may be attempting to access the cache resource at the same time. From what I have found threading.Lock() should work.

https://docs.python.org/3/library/threading.html#lock-objects

The following (untested) should place a lock on the cache when accessing it and release it when finished

import threading class IndexCacheManager: def __init__(self, cache_ttl_seconds: int = 3600): self._cache: Optional[Dict[str, List[str]]] = None self._timestamp: float = 0 self._ttl = cache_ttl_seconds self._lock = threading.Lock() def get_cache(self) -> Optional[Dict[str, List[str]]]: """Get the current cache if not expired. Returns: Optional[Dict[str, List[str]]]: Cache data if valid, None if expired. """ with self._lock: if self.is_expired: return None return self._cache

jamesfisher-geo · 2025-07-21T18:09:32Z

stac_fastapi/sfeos_helpers/stac_fastapi/sfeos_helpers/search_engine/selection/cache_manager.py

+        """
+        if self.is_expired:
+            return None
+        return self._cache


Returning the _cache object here could be problematic because it is a pointer to the actual cache. How about returning a copy?
return {k: v.copy() for k, v in self._cache.items()}

jamesfisher-geo · 2025-07-21T18:24:21Z

stac_fastapi/sfeos_helpers/stac_fastapi/sfeos_helpers/search_engine/selection/factory.py

+        return (
+            SyncDatetimeBasedIndexSelector(sync_client)
+            if use_datetime_filtering
+            else UnfilteredIndexSelector()


But the UnfilteredIndexSelector() is async

GrzegorzPustulka · 2025-07-22T08:17:37Z

I'll improve all the comments in the coming days, remove the sync versions, and fix the bugs my friend found testing this MR

jonhealy1 · 2025-07-22T22:20:19Z

stac_fastapi/tests/resources/test_item.py

@@ -998,6 +1005,9 @@ async def _search_and_get_ids(
 async def test_search_datetime_with_null_datetime(
    app_client, txn_client, load_test_data
 ):
+    if not os.getenv("ENABLE_DATETIME_INDEX_FILTERING"):
+        pytest.skip()
+


Is this right? This test should definitely run in default mode.

jonhealy1 · 2025-07-22T22:35:39Z

@GrzegorzPustulka Can we set ENABLE_DATETIME_INDEX_FILTERING for the associated tests and then turn it off for the default tests?

jonhealy1 · 2025-07-22T22:38:58Z

README.md

+| `DATABASE_REFRESH`              | Controls whether database operations refresh the index immediately after changes. If set to `true`, changes will be immediately searchable. If set to `false`, changes may not be immediately visible but can improve performance for bulk operations. If set to `wait_for`, changes will wait for the next refresh cycle to become visible. | `false`                                              | Optional |
+| `ENABLE_TRANSACTIONS_EXTENSIONS` | Enables or disables the Transactions and Bulk Transactions API extensions. If set to `false`, the POST `/collections` route and related transaction endpoints (including bulk transaction operations) will be unavailable in the API. This is useful for deployments where mutating the catalog via the API should be prevented.             | `true`                                               | Optional |
+| `ENABLE_DATETIME_INDEX_FILTERING` | Enable datetime-based index selection using collection IDs. Requires indexes in format: STAC_ITEMS_INDEX_PREFIX_collection-id_start_year-start_month-start_day-end_year-end_month-end_day, e.g. items_sentinel-2-l2a_2025-06-06-2025-09-22.                                                                                                  | `false`                                              | Optional |
+| `DATETIME_INDEX_MAX_SIZE_GB` | Maximum size limit in GB for datetime-based indexes. When an index exceeds this size, a new time-partitioned index will be created. Note: This value should account for ~25% overhead due to OS/ES caching of data structures and metadata. Only applies when`ENABLE_DATETIME_INDEX_FILTERING` is enabled.                                                                               | `25`                                                 | Optional |


These are important additions and maybe should have their own section in the readme for a better explanation.

GrzegorzPustulka marked this pull request as ready for review July 7, 2025 20:01

GrzegorzPustulka force-pushed the search_optimization branch from 38c85e4 to 11e40c4 Compare July 7, 2025 20:19

Grzegorz Pustulka added 3 commits July 7, 2025 23:28

Add index management system with datetime partitioning

14ec996

fixed compose.yml

9f08d79

resolved conflicts

243dd1c

GrzegorzPustulka force-pushed the search_optimization branch from 295d3d6 to 243dd1c Compare July 7, 2025 23:12

Grzegorz Pustulka added 3 commits July 8, 2025 01:18

pre-commit

32f1f12

fixed tests

84aa25a

updated cicd.yml

c25b7ac

jonhealy1 requested review from jonhealy1, jamesfisher-geo, rhysrevans3 and StijnCaerts and removed request for jamesfisher-geo July 20, 2025 11:33

rhysrevans3 reviewed Jul 21, 2025

View reviewed changes

jamesfisher-geo requested changes Jul 21, 2025

View reviewed changes

jonhealy1 reviewed Jul 22, 2025

View reviewed changes

		logger.error(f"Invalid interval format: {datetime}, error: {e}")
		datetime_search = None

Search optimization and indexing based on datetime #405

Are you sure you want to change the base?

Search optimization and indexing based on datetime #405

Conversation

GrzegorzPustulka commented Jun 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Index Management System with Time-based Partitioning

Description

How it works

System Architecture

Datetime Strategy - Operation Details

Cache and Performance

Configuration

Usage Examples

Scenario 1: Adding items to new collection

Scenario 2: Size limit exceeded

Scenario 3: Item with early date

Search

Factories

Backward Compatibility

Uh oh!

GrzegorzPustulka commented Jul 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jonhealy1 commented Jul 20, 2025

Uh oh!

jonhealy1 commented Jul 20, 2025

Uh oh!

rhysrevans3 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jamesfisher-geo left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jamesfisher-geo Jul 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

GrzegorzPustulka commented Jul 22, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jonhealy1 commented Jul 22, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

GrzegorzPustulka commented Jun 18, 2025 •

edited

Loading

GrzegorzPustulka commented Jul 8, 2025 •

edited

Loading

jamesfisher-geo left a comment •

edited

Loading

jamesfisher-geo Jul 21, 2025 •

edited

Loading