[Feat] Incremental Sync for Postgres #160

rkhameshra · 2025-03-17T17:46:08Z

Background
OLake currently supports Change Data Capture (CDC) for PostgreSQL, enabling efficient incremental data synchronization via logical replication (WAL-based). However, many PostgreSQL users—especially those on managed cloud databases like Amazon RDS, Google Cloud SQL, and Azure Database for PostgreSQL—may not have the necessary privileges to enable logical replication or WAL-based CDC.

For these users, OLake should support an alternative incremental sync method that does not require CDC but still efficiently captures newly inserted and updated records.

Feature Scope
This feature will introduce a query-based incremental sync mechanism that allows OLake to track and sync changes without requiring database-level replication privileges. Possible approaches include:

1. Timestamp-based Sync

Users specify a column (e.g., updated_at, last_modified, or similar) that OLake will use to track changes.
On each sync run, OLake will query for records where the timestamp is greater than the last synced timestamp.
Requires users to ensure that an appropriate timestamp column is maintained with updates.

2. Primary Key Auto-Increment Tracking (For Append-Only Tables)

For tables where only new rows are inserted and existing records are never updated, OLake can track changes using an auto-incremented primary key (e.g., id).
On each sync, OLake will fetch rows where the primary key is greater than the last synced value.
This method is not suitable for tables with updates or deletions.

3. Soft Delete Handling (Optional)

Some tables use a deleted_at column instead of hard deletes. OLake can track deleted rows by detecting non-null values in this column and propagating deletions downstream.
Alternatively, users can specify a flag column (e.g., is_deleted = true) to indicate deletions.

Implementation Details

1. User Configuration:

Allow users to specify tracking columns (e.g., updated_at, id, or deleted_at).
Provide default behavior where possible but allow customization.

2. Efficient Query Execution:

Ensure that incremental queries leverage indexes (e.g., B-tree indexes on timestamps, primary keys).
Support pagination (e.g., using LIMIT/OFFSET or cursor-based iteration) to handle large datasets.
Implement batch processing to optimize performance and reduce database load.

3. Checkpointing & State Management:

OLake should maintain a sync state per table, tracking the last processed timestamp/ID.
Ensure robustness by handling failures and retries without duplicate processing.

Fallback Mechanisms:

If the specified tracking column is missing or unreliable, OLake should either:
Recommend a fallback full sync (configurable).
Provide a warning and log errors to help users diagnose issues.

Schema Evolution Handling:

If a new column is introduced after initial sync setup, OLake should detect it and allow users to update their sync strategy.
Handle cases where tracking columns (e.g., updated_at) are removed or modified.

Deliverables

Implementation of query-based incremental sync.
Automated tests for different use cases (timestamp tracking, primary key tracking, soft deletes).
Performance benchmarking to optimize sync efficiency.
User documentation with setup instructions and best practices.

Impact

Broader PostgreSQL Compatibility: Users on managed cloud databases (RDS, Cloud SQL, etc.) can use incremental sync without requiring privileged access to enable CDC.
Performance Improvements: Avoids full table scans and reduces resource consumption.
More Flexible Sync Options: Allows users to choose the best sync strategy based on their database structure.

rkhameshra changed the title ~~Incremental Synch for Postgres~~ [Feat] Incremental Synch for Postgres Mar 17, 2025

rkhameshra changed the title ~~[Feat] Incremental Synch for Postgres~~ [Feat] Incremental Sync for Postgres Mar 17, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Feat] Incremental Sync for Postgres #160

[Feat] Incremental Sync for Postgres #160

rkhameshra commented Mar 17, 2025

[Feat] Incremental Sync for Postgres #160

[Feat] Incremental Sync for Postgres #160

Comments

rkhameshra commented Mar 17, 2025