Automatic change of batch size when memory error occurs #259
Comments
Did `--batch-size-memory` solve your issue?
@leo-schick Still not sure whether …
@pacman82 Yes, it solved my problem. But it would be nice to have a …
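For reference, a minimal invocation using the flag discussed above might look like the sketch below. The `512MiB` value syntax, the output file, and the connection string placeholder are assumptions, not taken from this thread.

```sh
# Sketch: cap the memory a single batch may use instead of fixing a
# row count (--batch-size); the value and placeholders are assumptions.
echo 'SELECT * FROM my_table' \
  | odbc2parquet -q query -c "<odbc_connection_string>" \
      --batch-size-memory 512MiB \
      --column-compression-default snappy - my_table.parquet
```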
@leo-schick "Available memory" is a pretty tricky term. Fallible allocations also come with a significant performance overhead on some systems. If I were to write more code for the use case of downloading many tables, it is more likely that I would end up supporting that in a single process. IMHO any option which uses an …

I am somewhat interested in the larger context of your use case, though. Are you downloading every table in a database, or just some of them? Are you downloading tables from different data sources? Are you doing a backup? Or, in one simpler question: why are you running 8 instances of odbc2parquet?
@pacman82 I am in the middle of migrating an on-premises data warehouse to a cloud solution. This will take several months. To make the switch smooth, I want to make the data warehouse tables and the source database tables available on Azure cloud storage. I have a predefined list of tables from a database that I want to transfer. To do so, I use a Mara ETL data pipeline where I add one task per table running odbc2parquet. The Mara ETL pipeline runs in parallel with 8 threads. For each task, I run the following shell command set:

```sh
echo 'SELECT * FROM my_table' \
  | odbc2parquet -q query -c "<odbc_connection_string>" --column-compression-default snappy - - \
  | azcopy cp 'https://<storage_account>.dfs.core.windows.net/<storage_container>/<path_to_model>/part.0.parquet?<sas-token>' --from-to PipeBlob
```

The benefit of running it with a Mara pipeline is that it intelligently generates the shortest DAG path (DAG = directed acyclic graph): it records how long each task (= table) takes and uses this information on the next run to make sure that the longest table/task runs first. This reduces total execution time, and execution time is critical.
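Outside of Mara, the same 8-way parallelism can be approximated with plain shell tooling. The sketch below is an illustration only: `tables.txt`, the placeholders, and the flat output layout are assumptions, not part of the original setup.

```sh
# Sketch: run up to 8 exports concurrently, one odbc2parquet process per
# table, mirroring the 8-thread Mara pipeline described above.
# tables.txt holds one table name per line (an assumed input file).
xargs -P 8 -I '{}' sh -c \
  "echo 'SELECT * FROM {}' \
   | odbc2parquet -q query -c '<odbc_connection_string>' --column-compression-default snappy - - \
   | azcopy cp 'https://<storage_account>.dfs.core.windows.net/<storage_container>/{}/part.0.parquet?<sas-token>' --from-to PipeBlob" \
  < tables.txt
```

Since each process sizes its buffers independently, this is exactly the situation in which concurrent jobs can exhaust memory together.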
Thanks for the additional context. Very helpful. In that scenario the features I had in mind wouldn't be helpful. For now I'll leave the artefact as is. Cheers, Markus
I use odbc2parquet to export a whole database of over 100 tables at once, running odbc2parquet up to 8 times or more in parallel. Sometimes the jobs steal memory from each other and I run into a memory error, e.g.

```
memory allocation of 262205535 bytes failed
```

I cannot predict how much memory is available when I start a job, and the tables vary in both row width and number of rows. Using a fixed batch size (option `--batch-size`) is, I think, a bad solution, because I have no clue about the design of a table when I run the command.

I think it would be great if odbc2parquet simply tried a lower batch size when the required memory is not available. Something like: "when failed, brute force until you get through".
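In the absence of built-in fallback behaviour, a wrapper script can emulate this request: retry the export with a halved memory cap until the process succeeds. This is a sketch under assumptions: the starting limit, the `MiB` value syntax, the minimum cutoff, and the local output file are all invented for illustration.

```sh
# Sketch: emulate "when failed, brute force until you get through" by
# retrying with a halved --batch-size-memory cap after each failure.
limit_mib=2048
until echo 'SELECT * FROM my_table' \
      | odbc2parquet -q query -c "<odbc_connection_string>" \
          --batch-size-memory "${limit_mib}MiB" \
          --column-compression-default snappy - my_table.parquet
do
  limit_mib=$(( limit_mib / 2 ))   # shrink the cap and try again
  if [ "$limit_mib" -lt 32 ]; then
    echo "giving up: batch memory cap below 32 MiB" >&2
    exit 1
  fi
  echo "retrying with --batch-size-memory ${limit_mib}MiB" >&2
done
```

An aborted allocation (as in the error above) exits with a non-zero status, so the `until` loop treats it like any other failure and retries with the smaller cap.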
Similar issue: