Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feat] Incremental Sync for Postgres #160

Open
rkhameshra opened this issue Mar 17, 2025 · 0 comments
Open

[Feat] Incremental Sync for Postgres #160

rkhameshra opened this issue Mar 17, 2025 · 0 comments

Comments

@rkhameshra
Copy link
Contributor

Background
OLake currently supports Change Data Capture (CDC) for PostgreSQL, enabling efficient incremental data synchronization via logical replication (WAL-based). However, many PostgreSQL users—especially those on managed cloud databases like Amazon RDS, Google Cloud SQL, and Azure Database for PostgreSQL—may not have the necessary privileges to enable logical replication or WAL-based CDC.

For these users, OLake should support an alternative incremental sync method that does not require CDC but still efficiently captures newly inserted and updated records.

Feature Scope
This feature will introduce a query-based incremental sync mechanism that allows OLake to track and sync changes without requiring database-level replication privileges. Possible approaches include:

1. Timestamp-based Sync

  • Users specify a column (e.g., updated_at, last_modified, or similar) that OLake will use to track changes.
  • On each sync run, OLake will query for records where the timestamp is greater than the last synced timestamp.
  • Requires users to ensure that an appropriate timestamp column is maintained with updates.

2. Primary Key Auto-Increment Tracking (For Append-Only Tables)

  • For tables where only new rows are inserted and existing records are never updated, OLake can track changes using an auto-incremented primary key (e.g., id).
  • On each sync, OLake will fetch rows where the primary key is greater than the last synced value.
  • This method is not suitable for tables with updates or deletions.

3. Soft Delete Handling (Optional)

  • Some tables use a deleted_at column instead of hard deletes. OLake can track deleted rows by detecting non-null values in this column and propagating deletions downstream.
  • Alternatively, users can specify a flag column (e.g., is_deleted = true) to indicate deletions.

Implementation Details

1. User Configuration:

  • Allow users to specify tracking columns (e.g., updated_at, id, or deleted_at).
  • Provide default behavior where possible but allow customization.

2. Efficient Query Execution:

  • Ensure that incremental queries leverage indexes (e.g., B-tree indexes on timestamps, primary keys).
  • Support pagination (e.g., using LIMIT/OFFSET or cursor-based iteration) to handle large datasets.
  • Implement batch processing to optimize performance and reduce database load.

3. Checkpointing & State Management:

  • OLake should maintain a sync state per table, tracking the last processed timestamp/ID.
  • Ensure robustness by handling failures and retries without duplicate processing.

Fallback Mechanisms:

  • If the specified tracking column is missing or unreliable, OLake should either:
  • Recommend a fallback full sync (configurable).
  • Provide a warning and log errors to help users diagnose issues.

Schema Evolution Handling:

  • If a new column is introduced after initial sync setup, OLake should detect it and allow users to update their sync strategy.
  • Handle cases where tracking columns (e.g., updated_at) are removed or modified.

Deliverables

  • Implementation of query-based incremental sync.
  • Automated tests for different use cases (timestamp tracking, primary key tracking, soft deletes).
  • Performance benchmarking to optimize sync efficiency.
  • User documentation with setup instructions and best practices.

Impact

  • Broader PostgreSQL Compatibility: Users on managed cloud databases (RDS, Cloud SQL, etc.) can use incremental sync without requiring privileged access to enable CDC.
  • Performance Improvements: Avoids full table scans and reduces resource consumption.
  • More Flexible Sync Options: Allows users to choose the best sync strategy based on their database structure.
@rkhameshra rkhameshra changed the title Incremental Synch for Postgres [Feat] Incremental Synch for Postgres Mar 17, 2025
@rkhameshra rkhameshra changed the title [Feat] Incremental Synch for Postgres [Feat] Incremental Sync for Postgres Mar 17, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant