
Explain batching, document best practices #255

Open
hariso opened this issue Mar 18, 2025 · 1 comment

hariso (Contributor) commented Mar 18, 2025

@nickchomey asked us to write more docs explaining batching, and document best practices, especially with the new architecture.

lovromazgon moved this from Triage to Todo in Conduit Main on Mar 18, 2025
lovromazgon self-assigned this on Mar 18, 2025
nickchomey (Contributor) commented Mar 19, 2025

Lovro wrote this in Discord:

It's pretty simple, actually: the source connector is in charge of collecting the batch and sending it to Conduit. The batch is then treated as one "unit"; all messages move through the pipeline together (as an array of messages). In the end, the whole batch is sent to the destination connector, so the destination doesn't need to do any batching on its end, since it already receives the batch.

In contrast, the old engine (well, still the current one) pushes messages through the pipeline one by one, which is why batching on the source side doesn't make sense in that engine: the batch would just be broken up into individual messages anyway. Batching on the destination was added so that the destination could collect a batch and write all of its records in one go.

So, to sum up: in the old architecture we collect batches on the destination, while in the new architecture we collect them on the source.
Keep in mind that this only improves the Conduit internals; the connectors are likely still going to be the bottleneck, depending on how efficiently they read and write the data.
What I see from the graph is that we spend quite some time encoding and decoding the data using the schema. If you are using built-in connectors for both the source and the destination, you can take a shortcut and simply disable schemas altogether (set sdk.schema.extract.key.enabled: false and sdk.schema.extract.payload.enabled: false on both connectors). This shortcut works because built-in connectors don't use gRPC to communicate with Conduit, so they can return the data as is.
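For illustration, here is a minimal sketch of where batch collection could be configured in a pipeline configuration file. The sdk.batch.size and sdk.batch.delay keys are the connector SDK's batch middleware parameters as I understand them (they are not named in this thread), and the version, plugin names, and values are placeholders; with the new architecture the settings belong on the source, while with the old/current engine they belong on the destination.

```yaml
# Illustrative pipeline configuration showing where batches are collected.
# The sdk.batch.* keys are assumed from the connector SDK's batch middleware;
# plugin names and values are placeholders, not a recommendation.
version: "2.2"
pipelines:
  - id: batching-example
    status: running
    connectors:
      - id: source
        type: source
        plugin: builtin:generator
        settings:
          # New architecture: the source collects the batch and the whole
          # batch travels through the pipeline as one unit.
          sdk.batch.size: "1000"
          sdk.batch.delay: "100ms"
      - id: destination
        type: destination
        plugin: builtin:file
        settings:
          # Old/current engine: records arrive one by one, so the batch is
          # collected here, right before writing.
          sdk.batch.size: "1000"
          sdk.batch.delay: "100ms"
```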

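And a minimal sketch of the schema shortcut described above, assuming built-in connectors on both ends. Only the two sdk.schema.extract.* keys come from the quote; the rest of the pipeline file (version, plugin names, other settings) is illustrative.

```yaml
version: "2.2"
pipelines:
  - id: no-schema-example
    status: running
    connectors:
      - id: source
        type: source
        plugin: builtin:generator   # placeholder built-in source
        settings:
          # Skip schema extraction for the record key and payload.
          sdk.schema.extract.key.enabled: "false"
          sdk.schema.extract.payload.enabled: "false"
      - id: destination
        type: destination
        plugin: builtin:file        # placeholder built-in destination
        settings:
          # Same shortcut on the destination side.
          sdk.schema.extract.key.enabled: "false"
          sdk.schema.extract.payload.enabled: "false"
```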
Labels: None yet
Projects: Conduit Main (Status: Todo)
3 participants