You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Background
OLake currently supports Change Data Capture (CDC) for PostgreSQL, enabling efficient incremental data synchronization via logical replication (WAL-based). However, many PostgreSQL users—especially those on managed cloud databases like Amazon RDS, Google Cloud SQL, and Azure Database for PostgreSQL—may not have the necessary privileges to enable logical replication or WAL-based CDC.
For these users, OLake should support an alternative incremental sync method that does not require CDC but still efficiently captures newly inserted and updated records.
Feature Scope
This feature will introduce a query-based incremental sync mechanism that allows OLake to track and sync changes without requiring database-level replication privileges. Possible approaches include:
1. Timestamp-based Sync
Users specify a column (e.g., updated_at, last_modified, or similar) that OLake will use to track changes.
On each sync run, OLake will query for records where the timestamp is greater than the last synced timestamp.
Requires users to ensure that an appropriate timestamp column is maintained with updates.
For tables where only new rows are inserted and existing records are never updated, OLake can track changes using an auto-incremented primary key (e.g., id).
On each sync, OLake will fetch rows where the primary key is greater than the last synced value.
This method is not suitable for tables with updates or deletions.
3. Soft Delete Handling (Optional)
Some tables use a deleted_at column instead of hard deletes. OLake can track deleted rows by detecting non-null values in this column and propagating deletions downstream.
Alternatively, users can specify a flag column (e.g., is_deleted = true) to indicate deletions.
Implementation Details
1. User Configuration:
Allow users to specify tracking columns (e.g., updated_at, id, or deleted_at).
Provide default behavior where possible but allow customization.
2. Efficient Query Execution:
Ensure that incremental queries leverage indexes (e.g., B-tree indexes on timestamps, primary keys).
Support pagination (e.g., using LIMIT/OFFSET or cursor-based iteration) to handle large datasets.
Implement batch processing to optimize performance and reduce database load.
3. Checkpointing & State Management:
OLake should maintain a sync state per table, tracking the last processed timestamp/ID.
Ensure robustness by handling failures and retries without duplicate processing.
Fallback Mechanisms:
If the specified tracking column is missing or unreliable, OLake should either:
Recommend a fallback full sync (configurable).
Provide a warning and log errors to help users diagnose issues.
Schema Evolution Handling:
If a new column is introduced after initial sync setup, OLake should detect it and allow users to update their sync strategy.
Handle cases where tracking columns (e.g., updated_at) are removed or modified.
Deliverables
Implementation of query-based incremental sync.
Automated tests for different use cases (timestamp tracking, primary key tracking, soft deletes).
Performance benchmarking to optimize sync efficiency.
User documentation with setup instructions and best practices.
Impact
Broader PostgreSQL Compatibility: Users on managed cloud databases (RDS, Cloud SQL, etc.) can use incremental sync without requiring privileged access to enable CDC.
Performance Improvements: Avoids full table scans and reduces resource consumption.
More Flexible Sync Options: Allows users to choose the best sync strategy based on their database structure.
The text was updated successfully, but these errors were encountered:
rkhameshra
changed the title
Incremental Synch for Postgres
[Feat] Incremental Synch for Postgres
Mar 17, 2025
rkhameshra
changed the title
[Feat] Incremental Synch for Postgres
[Feat] Incremental Sync for Postgres
Mar 17, 2025
Background
OLake currently supports Change Data Capture (CDC) for PostgreSQL, enabling efficient incremental data synchronization via logical replication (WAL-based). However, many PostgreSQL users—especially those on managed cloud databases like Amazon RDS, Google Cloud SQL, and Azure Database for PostgreSQL—may not have the necessary privileges to enable logical replication or WAL-based CDC.
For these users, OLake should support an alternative incremental sync method that does not require CDC but still efficiently captures newly inserted and updated records.
Feature Scope
This feature will introduce a query-based incremental sync mechanism that allows OLake to track and sync changes without requiring database-level replication privileges. Possible approaches include:
1. Timestamp-based Sync
2. Primary Key Auto-Increment Tracking (For Append-Only Tables)
3. Soft Delete Handling (Optional)
Implementation Details
1. User Configuration:
2. Efficient Query Execution:
3. Checkpointing & State Management:
Fallback Mechanisms:
Schema Evolution Handling:
Deliverables
Impact
The text was updated successfully, but these errors were encountered: