It's pretty simple, actually: the source connector is in charge of collecting the batch and sending it to Conduit. The batch is then treated as a single "unit"; all messages move through the pipeline together as one (an array of messages). At the end, the whole batch is handed to the destination connector, so the destination doesn't need to do any batching on its end, since it already receives the batch.
In contrast, the old engine (well, still the current one) pushes messages through the pipeline one by one, which is why batching on the source side doesn't make sense there. The batch would just be broken up into individual messages anyway. Batching on the destination was added so that the destination could collect a batch and write it all in one go.
So, to sum up: in the old architecture we collect batches on the destination; in the new architecture we collect them on the source.
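For illustration, here's a rough sketch of what source-side batching could look like in a pipeline configuration file. This assumes the SDK exposes the same `sdk.batch.size` / `sdk.batch.delay` settings on sources that it has on destinations; the plugin names and values are placeholders, not recommendations.

```yaml
version: "2.2"
pipelines:
  - id: example-pipeline
    status: running
    connectors:
      - id: example-source
        type: source
        plugin: builtin:postgres   # placeholder plugin
        settings:
          # Collect up to 1000 records (or wait at most 100ms)
          # before sending the batch into the pipeline.
          sdk.batch.size: "1000"
          sdk.batch.delay: "100ms"
      - id: example-destination
        type: destination
        plugin: builtin:file       # placeholder plugin
        settings:
          path: ./out.txt
```

In the old engine, the same two settings would go on the destination connector instead, since that's where the batch is collected.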
Keep in mind that this only improves the Conduit internals; the connectors are likely still going to be the bottleneck, depending on how efficiently they read and write the data.
What I see from the graph is that we spend quite some time encoding and decoding the data using the schema. If you are using built-in connectors for both the source and destination, you could take a shortcut and simply disable schemas altogether (`sdk.schema.extract.key.enabled: false` and `sdk.schema.extract.payload.enabled: false` on both connectors). This shortcut works because built-in connectors don't use gRPC to communicate with Conduit, so they should be able to just return the data as is.
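Concretely, a pipeline config with schema extraction disabled on both connectors might look like the sketch below. The two `sdk.schema.extract.*` settings are the ones mentioned above; the plugin names are placeholders standing in for whatever built-in connectors you use.

```yaml
version: "2.2"
pipelines:
  - id: no-schema-pipeline
    status: running
    connectors:
      - id: my-source
        type: source
        plugin: builtin:postgres   # placeholder; any built-in connector
        settings:
          # Skip schema extraction and encoding for keys and payloads.
          sdk.schema.extract.key.enabled: "false"
          sdk.schema.extract.payload.enabled: "false"
      - id: my-destination
        type: destination
        plugin: builtin:file       # placeholder; any built-in connector
        settings:
          sdk.schema.extract.key.enabled: "false"
          sdk.schema.extract.payload.enabled: "false"
```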
@nickchomey asked us to write more docs explaining batching and to document best practices, especially with the new architecture.