Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable writes to Azure #114

Closed
wants to merge 20 commits into from
Closed

Enable writes to Azure #114

wants to merge 20 commits into from

Conversation

thovoll
Copy link
Contributor

@thovoll thovoll commented May 12, 2022

@xianwill and I paired on parts of this, thanks for your help Christian!

This PR upgrades kafka-delta-ingest to a newer version of delta-rs that includes Azure support.
Various libraries were upgraded as needed.
Dockerfile added and README updated to allow execution on Windows.
Deprecated temporarily allowed to avoid having to rewrite the clap code in the main module.

@thovoll thovoll changed the title Enable writes to Azure Storage Enable writes to Azure May 12, 2022
@thovoll thovoll requested a review from xianwill May 12, 2022 04:54
@xianwill
Copy link
Collaborator

xianwill commented May 12, 2022

@thovoll - oh snap. Test helper compile is failing because parquet::util has been made private between versions. Fortunately, looks like SliceableCursor is exported publicly so you should be able to fix this by changing the use.

@xianwill xianwill requested a review from mosyp May 12, 2022 15:36
@thovoll
Copy link
Contributor Author

thovoll commented May 12, 2022

Thanks @xianwill, fixing this ASAP.

@thovoll thovoll force-pushed the rev-delta-rs branch 2 times, most recently from 3c5dc99 to 6dd8768 Compare May 12, 2022 23:41
@thovoll
Copy link
Contributor Author

thovoll commented May 13, 2022

@mosyp would you be able to take a look at the failing integration tests? At least the coercion and dlq tests are confirmed failing. OTOH, we have the kdi process running fine in the Debian docker image. Possibly something going on with the integration test setup?

@mightyshazam
Copy link
Collaborator

mightyshazam commented Feb 28, 2023

@mosyp would you be able to take a look at the failing integration tests? At least the coercion and dlq tests are confirmed failing. OTOH, we have the kdi process running fine in the Debian docker image. Possibly something going on with the integration test setup?

The reason the tests are failing is because of the upgrades to parquet, arrow and delta-rs. The ArrowWriter implementation has changed. In the old version, it would flush all of the records when calling the write method. In the version in this PR, the writer buffers the rows in memory until it reaches max_row_group_size buffered rows. Once that happens, or someone closes the file, it flushes.
The code in lib.rs that checks whether it should complete the batch based on file size looks at the size of the cursor passed to the arrow writer. This cursor only contains a few bytes until the flush happens.
One way around this is to set max_row_group_size to max_messages_per_batch from IngestionOptions. Another is to modify the writer in the arrow-rs to expose the size of the buffered data.

@rtyler
Copy link
Member

rtyler commented Mar 26, 2023

Closing this in favor of #136 which @mightyshazam and I are collaborating on.

@rtyler rtyler closed this Mar 26, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants