Automatic change of batch size when memory error occurs #259

Closed
leo-schick opened this issue Sep 19, 2022 · 7 comments

@leo-schick

I use odbc2parquet to export a whole database of over 100 tables at once, running up to 8 or more odbc2parquet instances in parallel. Sometimes the jobs steal memory from each other and I run into a memory error (e.g. memory allocation of 262205535 bytes failed).

I cannot predict how much memory is available when I start a job, and the tables vary both in row size and in number of rows. Using a fixed batch size (option --batch-size) is, I think, a bad solution because I have no clue about the design of the table when I run the command.

I think it would be great if odbc2parquet simply retried with a lower batch size when the required memory is not available. Something like "when it fails, brute-force until you get through".

Similar issue:

@pacman82
Owner

pacman82 commented Sep 19, 2022

odbc2parquet already calculates the batch size based on memory consumption by default. On a 64-bit machine it aims to use 2 GiB of memory. This may be too much if you run 8 instances at once. You can adapt that value using the --batch-size-memory option. E.g. specifying something like --batch-size-memory 500MiB should probably work out just fine.
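
A quick sketch of such an invocation (query on stdin, Parquet on stdout redirected to a file; the connection string placeholder is yours to fill in):

# Cap this instance's buffer memory at roughly 500 MiB
echo 'SELECT * FROM my_table' \
  | odbc2parquet -q query -c "<odbc_connection_string>" --batch-size-memory 500MiB - - \
  > my_table.parquet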

@pacman82
Owner

Did --batch-size-memory solve your issue?

@pacman82
Owner

@leo-schick Still not sure whether --batch-size-memory solves the issue for you. I am closing this issue for now, until I learn otherwise.

@leo-schick
Author

@pacman82 Yes, it solved my problem.

But it would be nice to have a --batch-size-memory auto option, where odbc2parquet uses the memory that is actually available and does not crash when it falls short.
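
As a stopgap, something like this wrapper approximates what I mean (a hypothetical sketch; the halving strategy, the 64 MiB floor, and the file names are made up):

# Halve the memory budget until the export gets through
budget=2048  # MiB, close to the 2 GiB default mentioned above
until echo 'SELECT * FROM my_table' \
    | odbc2parquet -q query -c "<odbc_connection_string>" --batch-size-memory "${budget}MiB" - - \
    > my_table.parquet
do
  budget=$((budget / 2))
  [ "$budget" -ge 64 ] || { echo 'giving up' >&2; exit 1; }
done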

@pacman82
Owner

@leo-schick "Available memory" is a pretty tricky term. Also, fallible allocations do come with a significant performance overhead on some systems. If I were to write more code for the use case of downloading many tables, it is more likely I would end up supporting that in a single process.

IMHO any option which takes an auto value should most likely be the default, and make a good decision in the vast majority of cases. However, I do feel that just crashing in an out-of-memory situation actually is the right decision. As my understanding stands right now, I would be against the introduction of such a flag.

I am somewhat interested in the larger context of your use case though. Are you downloading every table in a database? Or just some of them? Are you downloading tables from different data sources? Are you doing a backup? Or, in one simpler question: Why are you running 8 instances of odbc2parquet at once?

@leo-schick
Author

@pacman82 I am in the middle of migrating an on-premises data warehouse to a cloud solution. This will take several months. To have a smooth switch, I want to make the data warehouse tables and source database tables available on Azure cloud storage.

I have a predefined list of tables from a database I want to transfer. To do so, I use a Mara ETL data pipeline where I add one task per table running odbc2parquet. The Mara ETL pipeline runs in parallel with 8 threads.

For each task, I run the following shell pipeline:

echo 'SELECT * FROM my_table' \
  | odbc2parquet -q query -c "<odbc_connection_string>" --column-compression-default snappy - - \
  | azcopy cp 'https://<storage_account>.dfs.core.windows.net/<storage_container>/<path_to_model>/part.0.parquet?<sas-token>' --from-to PipeBlob

The benefit of running it with a Mara pipeline is that it intelligently generates the shortest DAG path (DAG = Directed Acyclic Graph): it records how long each task (= table) took and uses this information on the next run to make sure that the longest-running table/task starts first:

Sample: [image]

This saves total execution time. And execution time is critical.
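
Roughly the same idea in plain shell, for illustration only (hypothetical: a durations.txt holding "<seconds> <table>" pairs from the previous run is made up here, and Mara does far more than this):

# Longest-recorded table first, at most 8 exports in parallel
sort -rn durations.txt | awk '{print $2}' \
  | xargs -P 8 -I '{}' sh -c \
      'echo "SELECT * FROM {}" | odbc2parquet -q query -c "<odbc_connection_string>" - - > "/tmp/{}.parquet"'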

@pacman82
Owner

Thanks for the additional context. Very helpful. The features I had in mind wouldn't be helpful here. For now I'll leave the artefact as is.

Cheers, Markus
