Commit b80da3c

Chore: Update master readme, mongo and postgres readme (#162)
Co-authored-by: hash-data <[email protected]>
1 parent 6f19f98 commit b80da3c

3 files changed, +165 -281 lines changed

README.md (+41 -261)
Original file line number | Diff line number | Diff line change
@@ -8,23 +8,12 @@
88
<p align="center">Fastest open-source tool for replicating Databases to Apache Iceberg or Data Lakehouse. ⚡ Efficient, quick and scalable data ingestion for real-time analytics. Starting with MongoDB. Visit <a href="https://olake.io/" target="_blank">olake.io/docs</a> for the full documentation, and benchmarks</p>
99

1010
<p align="center">
11-
<img alt="GitHub issues" src="https://img.shields.io/github/issues/datazip-inc/olake"> </a>
12-
<a href="https://twitter.com/intent/tweet?text=Use%20the%20fastest%20open-source%20tool,%20OLake,%20for%20replicating%20Databases%20to%20S3%20and%20Apache%20Iceberg%20or%20Data%20Lakehouse.%20It%E2%80%99s%20Efficient,%20quick%20and%20scalable%20data%20ingestion%20for%20real-time%20analytics.%20Check%20at%20https://olake.io/%20%23opensource%20%23olake%20via%20%40_olake">
13-
<img alt="tweet" src="https://img.shields.io/twitter/url/http/shields.io.svg?style=social"></a>
11+
<img alt="GitHub issues" src="https://img.shields.io/github/issues/datazip-inc/olake">
12+
<a alt="Documentation" src="https://olake.io/docs "> <img height="23" src="https://img.shields.io/badge/view-Documentation-blue?style=for-the-badge"></a>
1413
<a href="https://join.slack.com/t/getolake/shared_invite/zt-2utw44do6-g4XuKKeqBghBMy2~LcJ4ag">
1514
<img alt="slack" src="https://img.shields.io/badge/Join%20Our%20Community-Slack-blue">
1615
</a>
1716
</p>
18-
19-
20-
<h3 align="center">
21-
<a href="https://olake.io/docs"><b>Documentation</b></a> &bull;
22-
<a href="https://twitter.com/_olake"><b>Twitter</b></a> &bull;
23-
<a href="https://www.youtube.com/@olakeio"><b>YouTube</b></a> &bull;
24-
<a href="https://meetwaves.com/library/olake"><b>Slack Knowledgebase</b></a> &bull;
25-
<a href="https://olake.io/blog"><b>Blogs</b></a>
26-
</h3>
27-
2817

2918
![undefined](https://github.com/user-attachments/assets/fe37e142-556a-48f0-a649-febc3dbd083c)
3019

@@ -33,221 +22,45 @@ Connector ecosystem for Olake, the key points Olake Connectors focuses on are th
3322
- **Connector Autonomy**
3423
- **Avoid operations that don't contribute to increasing record throughput**
3524

36-
# Getting Started with OLake
37-
38-
Follow the steps below to get started with OLake:
39-
40-
1. ### Prepare Your Folder
41-
42-
1. Create a folder on your computer. Let’s call it `olake_folder_path`.
43-
<div style="background-color: #f9f9f9; border-left: 6px solid #007bff; padding: 10px; color: black;">
44-
45-
💡 **Note:** In below configurations replace `olake_folder_path` with the newly created folder path.
46-
47-
</div>
48-
2. Inside this folder, create two files:
49-
- config.json: This file contains your connection details. You can find examples and instructions [here](https://github.com/datazip-inc/olake/tree/master/drivers/mongodb#config-file).
50-
- writer.json: This file specifies where to save your data (local machine or S3).
51-
52-
#### Example Structure of `writer.json` :
53-
Example (For Local):
54-
```json
55-
{
56-
"type": "PARQUET",
57-
"writer": {
58-
"normalization":false, // to enable/disable level one flattening
59-
"local_path": "/mnt/config/{olake_reader}" // replace olake_reader with desired folder name
60-
}
61-
}
62-
```
63-
Example (For S3):
64-
```json
65-
{
66-
"type": "PARQUET",
67-
"writer": {
68-
"normalization":false, // to enable/disable level one flattening
69-
"s3_bucket": "olake",
70-
"s3_region": "",
71-
"s3_access_key": "",
72-
"s3_secret_key": "",
73-
"s3_path": ""
74-
}
75-
}
76-
```
77-
2. ### Generate a Catalog File
78-
79-
Run the discovery process to identify your MongoDB data:
80-
```bash
81-
docker run -v olake_folder_path:/mnt/config olakego/source-mongodb:latest discover --config /mnt/config/config.json
82-
```
83-
This will create a catalog.json file in your folder. The file lists the data streams from your MongoDB
84-
```json
85-
{
86-
"selected_streams": {
87-
"namespace": [
88-
{
89-
"partition_regex": "/{col_1, default_value, granularity}",
90-
"stream_name": "table1"
91-
},
92-
{
93-
"partition_regex": "",
94-
"stream_name": "table2"
95-
}
96-
]
97-
},
98-
"streams": [
99-
{
100-
"stream": {
101-
"name": "table1",
102-
"namespace": "namespace",
103-
// ...
104-
"sync_mode": "cdc"
105-
}
106-
},
107-
{
108-
"stream": {
109-
"name": "table2",
110-
"namespace": "namespace",
111-
// ...
112-
"sync_mode": "cdc"
113-
}
114-
}
115-
]
116-
}
117-
```
118-
#### (Optional) Partition Destination Folder based on Columns
119-
Partition data based on column value. Read more in the documentation about [S3 partitioning](https://olake.io/docs/writers/s3#s3-data-partitioning).
120-
```json
121-
"partition_regex": "/{col_1, default_value, granularity}",
122-
```
123-
`col_1`: Partitioning Column. Supports `now()` as a value for the current date.<br>
124-
`default_value`: if the column value is null or not parsable then the default will be used.<br>
125-
`granularity` (Optional): Support for time-based columns. Supported Values: `HH`,`DD`,`WW`,`MM`,`YY`.
126-
#### (Optional) Exclude Unwanted Streams
127-
To exclude streams, edit catalog.json and remove them from selected_streams. <br>
128-
#### Example (For Exclusion of table2)
129-
**Before**
130-
```json
131-
"selected_streams": {
132-
"namespace": [
133-
{
134-
"partition_regex": "/{col_1, default_value, granularity}",
135-
"stream_name": "table1"
136-
},
137-
{
138-
"partition_regex": "",
139-
"stream_name": "table2"
140-
}
141-
]
142-
}
143-
```
144-
**After Exclusion of table2**
145-
```json
146-
"selected_streams": {
147-
"namespace": [
148-
{
149-
"partition_regex": "/{col_1, default_value, granularity}",
150-
"stream_name": "table1"
151-
}
152-
]
153-
}
154-
```
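The manual exclusion shown in the Before/After snippets can also be scripted. The following is a minimal sketch, assuming `catalog.json` has the `selected_streams` layout shown above; the helper name `exclude_stream` is hypothetical and is not part of OLake.

```python
import copy
import json


def exclude_stream(catalog, namespace, stream_name):
    """Return a copy of the catalog with one stream dropped from selected_streams.

    Hypothetical helper: it only mirrors the manual edit described above.
    """
    updated = copy.deepcopy(catalog)  # leave the caller's catalog untouched
    selected = updated.get("selected_streams", {})
    if namespace in selected:
        selected[namespace] = [
            entry for entry in selected[namespace]
            if entry.get("stream_name") != stream_name
        ]
    return updated


# Same shape as the "Before" example above.
catalog = {
    "selected_streams": {
        "namespace": [
            {
                "partition_regex": "/{col_1, default_value, granularity}",
                "stream_name": "table1",
            },
            {"partition_regex": "", "stream_name": "table2"},
        ]
    }
}

trimmed = exclude_stream(catalog, "namespace", "table2")
print(json.dumps(trimmed["selected_streams"], indent=2))
```

In practice the dictionary would be read from and written back to `catalog.json` with `json.load` and `json.dump`; the inline dictionary only keeps the sketch self-contained.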
155-
3. ### Sync Data
156-
Run the following command to sync data from MongoDB to your destination:
157-
158-
```bash
159-
docker run -v olake_folder_path:/mnt/config olakego/source-mongodb:latest sync --config /mnt/config/config.json --catalog /mnt/config/catalog.json --destination /mnt/config/writer.json
160-
161-
```
162-
163-
4. ### Sync with State:
164-
If you’ve previously synced data and want to continue from where you left off, use the state file:
165-
```bash
166-
docker run -v olake_folder_path:/mnt/config olakego/source-mongodb:latest sync --config /mnt/config/config.json --catalog /mnt/config/catalog.json --destination /mnt/config/writer.json --state /mnt/config/state.json
167-
168-
```
169-
170-
For more details, refer to the [documentation](https://olake.io/docs).
171-
172-
173-
174-
## Benchmark Results: Refer to this doc for complete information
175-
176-
### Speed Comparison: Full Load Performance
177-
178-
For a collection of 230 million rows (664.81GB) from [Twitter data](https://archive.org/details/archiveteam-twitter-stream-2017-11), here's how Olake compares to other tools:
179-
180-
| Tool | Full Load Time | Performance |
181-
| ----------------------- | -------------------------- | -------------- |
182-
| **Olake** | 46 mins | X times faster |
183-
| **Fivetran** | 4 hours 39 mins (279 mins) | 6x slower |
184-
| **Airbyte** | 16 hours (960 mins) | 20x slower |
185-
| **Debezium (Embedded)** | 11.65 hours (699 mins) | 15x slower |
186-
187-
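The "Nx slower" column can be checked by dividing each tool's full-load time by Olake's. The sketch below uses only the minute values from the table above; the function name `slowdown_vs` is illustrative. The table's figures (6x, 20x, 15x) appear to be these ratios rounded down.

```python
# Full-load times in minutes, copied from the table above.
full_load_minutes = {
    "Olake": 46,
    "Fivetran": 279,
    "Airbyte": 960,
    "Debezium (Embedded)": 699,
}


def slowdown_vs(baseline, timings):
    """Return each tool's time as a multiple of the baseline tool's time."""
    base = timings[baseline]
    return {
        tool: round(minutes / base, 1)
        for tool, minutes in timings.items()
        if tool != baseline
    }


ratios = slowdown_vs("Olake", full_load_minutes)
for tool, ratio in ratios.items():
    print(f"{tool}: {ratio}x the Olake full-load time")
```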
188-
### Incremental Sync Performance
189-
190-
| Tool | Incremental Sync Time | Records per Second (r/s) | Performance |
191-
| ----------------------- | --------------------- | ------------------------ | -------------- |
192-
| **Olake** | 28.3 sec | 35,694 r/s | X times faster |
193-
| **Fivetran** | 3 min 10 sec | 5,260 r/s | 6.7x slower |
194-
| **Airbyte** | 12 min 44 sec | 1,308 r/s | 27.3x slower |
195-
| **Debezium (Embedded)** | 12 min 44 sec | 1,308 r/s | 27.3x slower |
196-
197-
Cost Comparison: (Considering 230 million first full load & 50 million rows incremental rows per month) as dated 30th September 2025: Find more [here](https://olake.io/docs/connectors/mongodb/benchmarks).
198-
199-
200-
201-
### Testing Infrastructure
202-
203-
Virtual Machine: `Standard_D64as_v5`
204-
205-
- CPU: `64` vCPUs
206-
- Memory: `256` GiB RAM
207-
- Storage: `250` GB of shared storage
208-
209-
### MongoDB Setup:
210-
211-
- 3 Nodes running in a replica set configuration:
212-
- 1 Primary Node (Master) that handles all write operations.
213-
- 2 Secondary Nodes (Replicas) that replicate data from the primary node.
214-
215-
Find more [here](https://olake.io/docs/connectors/mongodb/benchmarks).
216-
25+
## Getting Started with OLake
21726

27+
### Source / Connectors
28+
1. [Getting started Postgres -> Writers](https://github.com/datazip-inc/olake/tree/master/drivers/postgres) | [Postgres Docs](https://olake.io/docs/category/postgres)
29+
2. [Getting started MongoDB -> Writers](https://github.com/datazip-inc/olake/tree/master/drivers/mongodb) | [MongoDB Docs](https://olake.io/docs/category/mongodb)
30+
3. [Getting started MySQL -> Writers](https://github.com/datazip-inc/olake/tree/master/drivers/mysql) | [MySQL Docs](https://olake.io/docs/category/mysql)
21831

219-
Detailed roadmap can be found on [GitHub OLake Roadmap 2024-25](https://github.com/orgs/datazip-inc/projects/5)
32+
### Writers / Destination
33+
1. [Apache Iceberg Docs](https://olake.io/docs/category/apache-iceberg)
34+
2. [AWS S3 Docs](https://olake.io/docs/category/aws-s3)
35+
3. [Local FileSystem Docs](https://olake.io/docs/writers/local)
22036

221-
## Source Connector Level Functionalities Supported
22237

223-
| Connector Functionalities | MongoDB [(docs)](https://olake.io/docs/connectors/mongodb/overview) | Postgres [(docs)](https://olake.io/docs/connectors/postgres/overview) | MySQL [(docs)](https://olake.io/docs/connectors/mysql/overview) |
224-
| ------------------------- | ------- | -------- | ------------------------------------------------------------ |
225-
| Full Refresh Sync Mode | ✅ | ✅ | ✅ |
226-
| Incremental Sync Mode | ❌ | ❌ | ❌ |
227-
| CDC Sync Mode | ✅ | ✅ | ✅ |
228-
| Full Parallel Processing | ✅ | ✅ | ✅ |
229-
| CDC Parallel Processing | ✅ | ❌ | ❌ |
230-
| Resumable Full Load | ✅ | ✅ | ✅ |
231-
| CDC Heart Beat | ❌ | ❌ | ❌ |
38+
## Source/Connector Functionalities
39+
| Functionality | MongoDB | Postgres | MySQL |
40+
| ------------------------- | ------- | -------- | ----- |
41+
| Full Refresh Sync Mode ||||
42+
| Incremental Sync Mode ||||
43+
| CDC Sync Mode ||||
44+
| Full Parallel Processing ||||
45+
| CDC Parallel Processing ||||
46+
| Resumable Full Load ||||
47+
| CDC Heart Beat ||||
23248

23349
We have additionally planned the following sources - [AWS S3](https://github.com/datazip-inc/olake/issues/86) | [Kafka](https://github.com/datazip-inc/olake/issues/87)
23450

23551

236-
## Writer Level Functionalities Supported
52+
## Writer Functionalities
53+
| Functionality | Local Filesystem | AWS S3 | Apache Iceberg |
54+
| ------------------------------- | ---------------- | ------ | -------------- |
55+
| Flattening & Normalization (L1) ||| |
56+
| Partitioning ||| |
57+
| Schema Changes ||| |
58+
| Schema Evolution ||| |
23759

238-
| Features/Functionality | Local Filesystem [(docs)](https://olake.io/docs/writers/local) | AWS S3 [(docs)](https://olake.io/docs/writers/s3/overview) | Iceberg (WIP) |
239-
| ------------------------------- | ---------------------- | --- | ------------- |
240-
| Flattening & Normalization (L1) | ✅ | ✅ | |
241-
| Partitioning | ✅ | ✅ | |
242-
| Schema Changes | ✅ | ✅ | |
243-
| Schema Evolution | ✅ | ✅ | |
244-
245-
246-
## Catalogue Support
247-
248-
| Catalogues | Support |
60+
## Supported Catalogs For Iceberg Writer
61+
| Catalog | Status |
24962
| -------------------------- | -------------------------------------------------------------------------------------------------------- |
250-
| Glue Catalog | [WIP](https://github.com/datazip-inc/olake/pull/113) |
63+
| Glue Catalog | WIP |
25164
| Hive Meta Store | Upcoming |
25265
| JDBC Catalogue | Upcoming |
25366
| REST Catalogue - Nessie | Upcoming |
@@ -257,13 +70,7 @@ We have additionally planned the following sources - [AWS S3](https://github.co
25770
| Azure Purview | Not Planned, [submit a request](https://github.com/datazip-inc/olake/issues/new?template=new-feature.md) |
25871
| BigLake Metastore | Not Planned, [submit a request](https://github.com/datazip-inc/olake/issues/new?template=new-feature.md) |
25972

260-
261-
262-
See [Roadmap](https://github.com/orgs/datazip-inc/projects/5) for more details.
263-
264-
265-
### Core
266-
73+
## Core
26774
Core or framework is the component/logic that has been abstracted out from Connectors to follow DRY. This includes base CLI commands, State logic, Validation logic, Type detection for unstructured data, handling Config, State, Catalog, and Writer config file, logging etc.
26875

26976
Core includes http server that directly exposes live stats about running sync such as:
@@ -279,42 +86,15 @@ Core handles the commands to interact with a driver via these:
27986

28087
Find more about how OLake works [here.](https://olake.io/docs/category/understanding-olake)
28188

282-
### SDKs
283-
284-
SDKs are libraries/packages that can orchestrate the connector in two environments i.e. Docker and Kubernetes. These SDKs can be directly consumed by users similar to PyAirbyte, DLT-hub.
285-
286-
(Unconfirmed) SDKs can interact with Connectors via potential GRPC server to override certain default behavior of the system by adding custom functions to enable features like Transformation, Custom Table Name via writer, or adding hooks.
287-
288-
### Olake
289-
290-
Olake will be built on top of SDK providing persistent storage and a user interface that enables orchestration directly from your machine with default writer mode as `S3 Iceberg Parquet`
291-
292-
89+
## Roadmap
90+
Checkout [GitHub Project Roadmap](https://github.com/orgs/datazip-inc/projects/5) and [Upcoming OLake Roadmap](https://olake.io/docs/roadmap) to track and influence the way we build it.
91+
If you have any ideas, questions, or any feedback, please share on our [Github Discussions](https://github.com/datazip-inc/olake/discussions) or raise an issue.
29392

29493
## Contributing
94+
We ❤️ contributions big or small check our [Bounty Program](https://olake.io/docs/community/issues-and-prs#goodies). As always, thanks to our amazing [contributors!](https://github.com/datazip-inc/olake/graphs/contributors).
95+
- To contribute to Olake Check [CONTRIBUTING.md](CONTRIBUTING.md)
96+
- To contribute to UI, visit [OLake UI Repository](https://github.com/datazip-inc/olake-frontend/).
97+
- To contribute to OLake website and documentation (olake.io), visit [Olake Docs Repository][GITHUB_DOCS].
29598

296-
We ❤️ contributions big or small. Please read [CONTRIBUTING.md](CONTRIBUTING.md) to get started with making contributions to OLake.
297-
298-
- To contribute to Frontend, go to [OLake Frontend GitHub repo](https://github.com/datazip-inc/olake-frontend/).
299-
300-
- To contribute to OLake website and documentation (olake.io), go to [OLake Frontend GitHub repo](https://github.com/datazip-inc/olake-docs).
301-
302-
Not sure how to get started? Just ping us on `#contributing-to-olake` in our [slack community](https://olake.io/slack)
303-
304-
## [Documentation](olake.io/docs)
305-
306-
307-
If you need any clarification or find something missing, feel free to raise a GitHub issue with the label `documentation` at [olake-docs](https://github.com/datazip-inc/olake-docs/) repo or reach out to us at the community slack channel.
308-
309-
310-
311-
312-
## Community
313-
314-
Join the [slack community](https://olake.io/slack) to know more about OLake, future roadmaps and community meetups, about Data Lakes and Lakehouses, the Data Engineering Ecosystem and to connect with other users and contributors.
315-
316-
Checkout [OLake Roadmap](https://olake.io/docs/roadmap) to track and influence the way we build it, your expert opinion is always welcomed for us to build a best class open source offering in Data space.
317-
318-
If you have any ideas, questions, or any feedback, please share on our [Github Discussions](https://github.com/datazip-inc/olake/discussions) or raise an issue.
319-
320-
As always, thanks to our amazing [contributors!](https://github.com/datazip-inc/olake/graphs/contributors)
99+
<!----variables---->
100+
[GITHUB_DOCS]: https://github.com/datazip-inc/olake-docs/
