[Feat] Incremental Sync for MongoDB #161

Open
rkhameshra opened this issue Mar 17, 2025 · 1 comment

Comments

@rkhameshra
Contributor

Background
OLake currently supports Change Data Capture (CDC) for MongoDB using Change Streams, which efficiently track inserts, updates, and deletes in real time. However, many users, especially those on MongoDB shared clusters (e.g., Atlas M0, M2, M5 tiers) or self-hosted standalone deployments without replica sets, may not have access to Change Streams, which require a replica set or sharded cluster configuration.

For these users, OLake should support an alternative incremental sync method that does not rely on Change Streams but still efficiently captures newly inserted and updated documents.

Feature Scope
This feature will introduce a query-based incremental sync mechanism for MongoDB that does not require Change Streams. Possible approaches include:

1. Timestamp-based Sync (updatedAt field tracking)

  • Users specify a timestamp field (e.g., updatedAt, lastModified, or a custom field) to track changes.
  • OLake will store the last synced timestamp and query documents where updatedAt > last_synced_timestamp.
  • This method requires users to ensure that the updatedAt field is updated on every modification (e.g., using MongoDB triggers or application logic).
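
As a rough illustration of the timestamp approach, the sketch below builds the query document for one incremental pass and advances the checkpoint afterwards. The function names are hypothetical and not part of OLake; the returned dictionary has the shape a MongoDB `find` filter takes. Note that a strict `$gt` can skip a document written later with exactly the same timestamp as the checkpoint, so a real implementation may need a tie-breaker such as `_id`.

```python
from datetime import datetime, timezone
from typing import Any, Optional


def build_timestamp_filter(field: str, last_synced: Optional[datetime]) -> dict[str, Any]:
    """Return a MongoDB-style query document for one incremental pass."""
    if last_synced is None:
        # No checkpoint yet: the first run reads the whole collection.
        return {}
    return {field: {"$gt": last_synced}}


def next_checkpoint(docs: list[dict[str, Any]], field: str,
                    last_synced: Optional[datetime]) -> Optional[datetime]:
    """Advance the checkpoint to the largest timestamp seen in this pass."""
    seen = [d[field] for d in docs if d.get(field) is not None]
    if not seen:
        # Nothing new was read, so the checkpoint stays where it was.
        return last_synced
    newest = max(seen)
    return newest if last_synced is None else max(newest, last_synced)
```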

2. ObjectId-based Sync (For Insert-Only Collections)

  • MongoDB’s ObjectId embeds a 4-byte creation timestamp in its leading bytes, so auto-generated _id values sort roughly by insertion time. This allows OLake to track inserts by filtering on the _id field.
  • OLake will query documents where _id > last_synced_id.
  • This method only works for append-only collections, as it does not capture updates or deletes.
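
To make the embedded timestamp concrete, here is a small standard-library sketch that decodes the creation time from an ObjectId's hex form and builds the smallest ObjectId for a given second, which could serve as a lower bound when no checkpoint exists. The helper names are illustrative only, and the sketch assumes `_id` values are auto-generated ObjectIds and that datetimes are timezone-aware.

```python
from datetime import datetime, timezone


def objectid_created_at(oid_hex: str) -> datetime:
    """Decode the creation time embedded in a 24-character ObjectId hex string."""
    if len(oid_hex) != 24:
        raise ValueError("an ObjectId is 12 bytes, i.e. 24 hex characters")
    # The first 4 bytes are big-endian seconds since the Unix epoch.
    seconds = int(oid_hex[:8], 16)
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


def min_objectid_for(ts: datetime) -> str:
    """Smallest ObjectId hex whose embedded timestamp equals `ts` (to the second).

    Assumes `ts` is timezone-aware; the remaining 8 bytes are zero-filled.
    """
    return f"{int(ts.timestamp()):08x}" + "0" * 16
```

Because ObjectIds compare byte by byte, a boundary built this way sorts before every ObjectId generated in or after that second.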

3. Soft Delete Handling (Optional)

  • Some collections use a deleted flag or deletedAt timestamp instead of hard deletes. OLake can track deletions using:
      • A boolean field (deleted: true)
      • A timestamp (deletedAt with non-null values)
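
A minimal sketch of how a fetched document might be classified under either soft-delete convention. The field names default to the ones mentioned above, and the event shape (`op`, `_id`, `doc`) is an assumption made for illustration. A soft delete is only picked up by an incremental query if the deletion also changes the tracked field (or if deletedAt itself is the tracked field).

```python
from typing import Any


def is_soft_deleted(doc: dict[str, Any], flag_field: str = "deleted",
                    timestamp_field: str = "deletedAt") -> bool:
    """True if the document is marked deleted by a flag or a non-null timestamp."""
    if doc.get(flag_field) is True:
        return True
    return doc.get(timestamp_field) is not None


def to_change_event(doc: dict[str, Any]) -> dict[str, Any]:
    """Map a fetched document to a hypothetical downstream change event."""
    op = "delete" if is_soft_deleted(doc) else "upsert"
    return {"op": op, "_id": doc.get("_id"), "doc": doc}
```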

Implementation Details
1. User Configuration:

  • Allow users to specify a tracking field (updatedAt, _id, or deletedAt).
  • Provide default behavior (e.g., use updatedAt if available).
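
One possible shape for this configuration, sketched with a dataclass. `IncrementalConfig`, `resolve_tracking_field`, and the default candidate order are all hypothetical; OLake's actual configuration format may differ.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# Assumed default order: prefer updatedAt, then fall back to _id.
DEFAULT_CANDIDATES = ("updatedAt", "_id")


@dataclass(frozen=True)
class IncrementalConfig:
    collection: str
    tracking_field: Optional[str] = None  # None means "pick a default"


def resolve_tracking_field(cfg: IncrementalConfig,
                           available_fields: Iterable[str]) -> str:
    """Return the field to track, honoring an explicit choice over defaults."""
    available = set(available_fields)
    if cfg.tracking_field is not None:
        if cfg.tracking_field not in available:
            raise ValueError(
                f"tracking field {cfg.tracking_field!r} not found in {cfg.collection}")
        return cfg.tracking_field
    for candidate in DEFAULT_CANDIDATES:
        if candidate in available:
            return candidate
    raise ValueError(f"no usable tracking field found in {cfg.collection}")
```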

2. Efficient Query Execution:

  • Ensure that incremental queries leverage indexes (e.g., index on updatedAt or _id).
  • Implement batch processing and pagination to optimize large dataset syncs.
  • Use projection to reduce unnecessary data transfer.
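
The batching idea can be sketched as keyset pagination on the pair (tracking field, _id), which avoids `skip` and handles several documents sharing one timestamp. Everything here is illustrative: `fetch` stands in for a real driver call, and `memory_fetcher` is an in-memory substitute so the sketch runs without a database. A compound index on the tracking field and `_id` would be the natural companion to this query shape.

```python
from typing import Any, Callable, Iterator, Optional

Doc = dict[str, Any]
Cursor = tuple[Any, Any]  # (tracking value, _id) of the last document read


def keyset_filter(field: str, cursor: Optional[Cursor]) -> dict[str, Any]:
    """MongoDB-style filter for 'everything after the cursor', ties broken by _id."""
    if cursor is None:
        return {}
    value, last_id = cursor
    return {"$or": [{field: {"$gt": value}},
                    {field: value, "_id": {"$gt": last_id}}]}


def iter_batches(fetch: Callable[[Optional[Cursor], int], list[Doc]],
                 field: str, batch_size: int,
                 cursor: Optional[Cursor] = None) -> Iterator[list[Doc]]:
    """Yield batches in (field, _id) order until the source is exhausted."""
    while True:
        batch = fetch(cursor, batch_size)
        if not batch:
            return
        yield batch
        last = batch[-1]
        cursor = (last[field], last["_id"])
        if len(batch) < batch_size:
            return  # a short batch means there is nothing left to read


def memory_fetcher(docs: list[Doc], field: str,
                   projection: list[str]) -> Callable[[Optional[Cursor], int], list[Doc]]:
    """In-memory stand-in for a sorted, projected, limited query."""
    ordered = sorted(docs, key=lambda d: (d[field], d["_id"]))

    def fetch(cursor: Optional[Cursor], limit: int) -> list[Doc]:
        rows = [d for d in ordered
                if cursor is None or (d[field], d["_id"]) > cursor]
        # Projection: keep only the requested keys to cut data transfer.
        return [{k: d[k] for k in projection if k in d} for d in rows[:limit]]

    return fetch
```

The projection must always include the tracking field and `_id`, since both are needed to advance the cursor.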

3. Checkpointing & State Management:

  • OLake should maintain a sync state per collection, tracking the last processed timestamp or _id.
  • Ensure robustness by handling failures and retries without duplicate processing.
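
One way to get retry safety is to write each batch to the destination keyed by `_id` and only then persist the checkpoint, so a crash between the two steps causes a harmless replay rather than a duplicate. The sketch below assumes a JSON state file, a dictionary standing in for the destination, and JSON-serializable checkpoint values (for example integers or ISO strings); none of these reflect OLake's actual state format.

```python
import json
import os
import tempfile
from typing import Any


def load_state(path: str) -> dict[str, Any]:
    """Read the per-collection sync state, or return an empty state."""
    try:
        with open(path, "r", encoding="utf-8") as fh:
            return json.load(fh)
    except FileNotFoundError:
        return {}


def save_state(path: str, state: dict[str, Any]) -> None:
    """Write the state atomically: temp file in the same directory, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(state, fh)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp, path)  # readers see either the old or the new file
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


def commit_batch(path: str, collection: str, batch: list[dict[str, Any]],
                 field: str, sink: dict[Any, dict[str, Any]]) -> None:
    """Apply a batch idempotently, then advance the checkpoint."""
    for doc in batch:
        sink[doc["_id"]] = doc  # keyed write: replaying the batch is harmless
    if batch:
        state = load_state(path)
        state[collection] = {"field": field, "last_value": batch[-1][field]}
        save_state(path, state)
```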

Fallback Mechanisms:
If the specified tracking field is missing or unreliable, OLake should:

  • Recommend a fallback full sync (configurable).
  • Provide a warning and log errors to help users diagnose issues.
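
A sketch of how the fallback decision could be made from a sample of documents. The coverage threshold, the `allow_full_sync_fallback` switch, and the returned mode strings are assumptions introduced for illustration.

```python
import logging
from typing import Any

logger = logging.getLogger("incremental_sync")


def choose_sync_mode(sample: list[dict[str, Any]], field: str,
                     allow_full_sync_fallback: bool = True,
                     min_coverage: float = 1.0) -> str:
    """Return "incremental" or "full" based on how reliably `field` is present."""
    if not sample:
        # Nothing to judge (e.g. an empty collection): keep the configured mode.
        return "incremental"
    present = sum(1 for doc in sample if doc.get(field) is not None)
    coverage = present / len(sample)
    if coverage >= min_coverage:
        return "incremental"
    logger.warning("tracking field %r present in only %.0f%% of sampled documents",
                   field, coverage * 100)
    if allow_full_sync_fallback:
        return "full"
    raise ValueError(f"tracking field {field!r} is unreliable and fallback is disabled")
```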

Schema Evolution Handling:

  • If a new field is introduced after initial sync setup, OLake should detect it and allow users to update their sync strategy.
  • Handle cases where tracking fields (updatedAt, _id) are removed or modified.
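
Detection of both cases could start from a simple comparison between the fields recorded at setup time and the fields seen in each batch. This is only a sketch under that assumption; the function name and the report structure are hypothetical, and it inspects top-level fields only.

```python
from typing import Any


def detect_schema_changes(known_fields: set[str], batch: list[dict[str, Any]],
                          tracking_field: str) -> dict[str, list[Any]]:
    """Report fields not seen before and documents lacking the tracking field."""
    seen: set[str] = set()
    for doc in batch:
        seen.update(doc.keys())
    missing = [doc.get("_id") for doc in batch if tracking_field not in doc]
    return {
        "new_fields": sorted(seen - known_fields),
        "missing_tracking_ids": missing,
    }
```

A non-empty `new_fields` list could prompt the user to revisit the sync strategy, while a non-empty `missing_tracking_ids` list signals that the tracking field has been removed from some documents.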

Deliverables

  • Implementation of query-based incremental sync for MongoDB.
  • Automated tests for different sync strategies (timestamp tracking, _id tracking, soft deletes).
  • Performance benchmarking to optimize query execution.
  • User documentation with setup instructions and best practices.

Impact

  • Expands OLake’s MongoDB Support: Users on shared MongoDB instances (e.g., Atlas M0, M2, M5) can use incremental sync without needing replica sets.
  • Performance Improvements: Avoids full collection scans and reduces database load.
  • More Flexible Sync Options: Allows users to configure sync methods based on their collection structure.
@rkhameshra rkhameshra changed the title [Feat] Incremental Sync for MongoDB [Feat] Incremental Synch for MongoDB Mar 17, 2025
@rkhameshra rkhameshra changed the title [Feat] Incremental Synch for MongoDB [Feat] Incremental Sync for MongoDB Mar 17, 2025
@mrmagicpotato007
Contributor

@rkhameshra I would like to take this; I'll give it a try.
