Automatic change of batch size when memory error occurs #259

Closed
leo-schick opened this issue Sep 19, 2022 · 7 comments

@leo-schick

I use odbc2parquet to export a whole database of over 100 tables at once, running up to 8 or more odbc2parquet instances in parallel. Sometimes the jobs steal memory from each other and I run into a memory error (e.g. memory allocation of 262205535 bytes failed).

I cannot predict how much memory is available when I start a job, and the tables vary both in row size and in number of rows. Using a fixed batch size (option --batch-size) is, I think, a bad solution because I have no clue about the design of the table when I run the command.

I think it would be great if odbc2parquet simply retried with a lower batch size when the required memory is not available. Something like "when it fails, brute-force until you get through".

Similar issue:

@pacman82
Owner

pacman82 commented Sep 19, 2022

odbc2parquet already calculates the batch size based on memory consumption by default. On a 64-bit machine it aims to use 2 GiB of memory. This may be too much if you run 8 instances at once. You can adapt that value using the --batch-size-memory option. E.g. specifying something like --batch-size-memory 500MiB should probably work out just fine.
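
A quick sketch of such an invocation (query on stdin, Parquet on stdout redirected to a file; the connection string placeholder is yours to fill in):

# Cap this instance's buffer memory at roughly 500 MiB
echo 'SELECT * FROM my_table' \
  | odbc2parquet -q query -c "<odbc_connection_string>" --batch-size-memory 500MiB - - \
  > my_table.parquet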

@pacman82
Owner

Did --batch-size-memory solve your issue?

@pacman82
Owner

@leo-schick Still not sure whether --batch-size-memory solves the issue for you. I am closing this issue for now, until I learn otherwise.

@leo-schick
Author

@pacman82 Yes, it solved my problem.

But it would be nice to have a --batch-size-memory auto option, where odbc2parquet uses the memory that is actually available and does not crash when it falls short.
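
As a stopgap, something like this wrapper approximates what I mean (a hypothetical sketch; the halving strategy, the 64 MiB floor, and the file names are made up):

# Halve the memory budget until the export gets through
budget=2048  # MiB, close to the 2 GiB default mentioned above
until echo 'SELECT * FROM my_table' \
    | odbc2parquet -q query -c "<odbc_connection_string>" --batch-size-memory "${budget}MiB" - - \
    > my_table.parquet
do
  budget=$((budget / 2))
  [ "$budget" -ge 64 ] || { echo 'giving up' >&2; exit 1; }
done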

@pacman82
Owner

@leo-schick "Available memory" is a pretty tricky term. Also, fallible allocations do come with a significant performance overhead on some systems. If I were to write more code for the use case of downloading many tables, it is more likely I would end up supporting that in a single process.

IMHO any option which takes an auto value should most likely be the default, and make a good decision in the vast majority of cases. However, I do feel that just crashing in an out-of-memory situation actually is the right decision. As my understanding stands right now, I would be against the introduction of such a flag.

I am somewhat interested in the larger context of your use case though. Are you downloading every table in a database? Or just some of them? Are you downloading tables from different data sources? Are you doing a backup? Or, in one simpler question: Why are you running 8 instances of odbc2parquet at once?

@leo-schick
Author

@pacman82 I am in the middle of migrating an on-premises data warehouse to a cloud solution. This will take several months. To have a smooth switch, I want to make the data warehouse tables and source database tables available on Azure cloud storage.

I have a predefined list of tables from a database I want to transfer. To do so, I use a Mara ETL data pipeline where I add one task per table running odbc2parquet. The Mara ETL pipeline runs in parallel with 8 threads.

For each task, I run the following shell pipeline:

echo 'SELECT * FROM my_table' \
  | odbc2parquet -q query -c "<odbc_connection_string>" --column-compression-default snappy - - \
  | azcopy cp 'https://<storage_account>.dfs.core.windows.net/<storage_container>/<path_to_model>/part.0.parquet?<sas-token>' --from-to PipeBlob

The benefit of running it with a Mara pipeline is that it intelligently generates the shortest DAG path (DAG = Directed Acyclic Graph): it records how long each task (= table) took and uses this information on the next run to make sure that the longest-running table/task starts first:

Sample: [image]

This saves total execution time. And execution time is critical.
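
Roughly the same idea in plain shell, for illustration only (hypothetical: a durations.txt holding "<seconds> <table>" pairs from the previous run is made up here, and Mara does far more than this):

# Longest-recorded table first, at most 8 exports in parallel
sort -rn durations.txt | awk '{print $2}' \
  | xargs -P 8 -I '{}' sh -c \
      'echo "SELECT * FROM {}" | odbc2parquet -q query -c "<odbc_connection_string>" - - > "/tmp/{}.parquet"'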

@pacman82
Owner

Thanks for the additional context. Very helpful. The features I had in mind wouldn't be helpful here. For now I'll leave the artefact as is.

Cheers, Markus
