[FEATURE] Speed up the program by speeding up data insertion #145

@Eric-Liu-SANDAG

Description

Is your feature request related to a problem? Please describe

Currently, the major bottleneck in running Estimates is slow bulk insertion, especially for the large ASE data. I don't have exact numbers on this, but I suspect that if we could significantly speed up data insertion, the whole program could run in an hour or two instead of overnight.

Describe the solution you'd like

Data insertion could be sped up in a few different ways, each of which needs to be properly evaluated. Data is currently loaded in two ways: with pd.DataFrame.to_sql() and with a manual bulk insert process.

Regarding the bulk insert process:

  • Much of our ASE data is sparse, in the sense that most values are zero. We could insert only the non-zero rows, then join against a table shell to back-fill the zeros; see the sketch after this list. This would probably reduce the amount of data we load by at least 75%, since GQ ASE data in particular is almost entirely zero
  • We could write the data to a different file location, such as blob storage, which is more accessible to SQL Server
  • We could use new features of SQL Server 2022 that enable faster bulk insert, such as more efficient file formats than CSV
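
A minimal sketch of the non-zero-only idea follows. The connection string, the staging.ase_nonzero and staging.ase_shell table names, and the key columns (mgra, age_group, sex, ethnicity) are all hypothetical placeholders, not our actual schema:

```python
import pandas as pd
import sqlalchemy as sql

# Hypothetical connection string; adjust server/database/driver as needed.
engine = sql.create_engine(
    "mssql+pyodbc://SERVER/DB?driver=ODBC+Driver+17+for+SQL+Server"
)

def insert_sparse(df: pd.DataFrame, value_col: str = "value") -> None:
    """Insert only the non-zero rows, then back-fill zeros from a table shell."""
    # Load only rows with non-zero values; for mostly-zero GQ ASE data this
    # should cut the inserted volume dramatically.
    nonzero = df[df[value_col] != 0]
    nonzero.to_sql("ase_nonzero", engine, schema="staging",
                   if_exists="append", index=False)

    # Back-fill: every key combination present in the shell but missing from
    # the staging table gets an explicit zero row.
    with engine.begin() as conn:
        conn.execute(sql.text("""
            INSERT INTO staging.ase_nonzero (mgra, age_group, sex, ethnicity, value)
            SELECT s.mgra, s.age_group, s.sex, s.ethnicity, 0
            FROM staging.ase_shell AS s
            LEFT JOIN staging.ase_nonzero AS n
              ON n.mgra = s.mgra AND n.age_group = s.age_group
             AND n.sex = s.sex AND n.ethnicity = s.ethnicity
            WHERE n.mgra IS NULL;
        """))
```

The shell table only has to be built once per vintage, so the per-run cost is just the non-zero insert plus one set-based join on the server side.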

Regarding pd.DataFrame.to_sql():

  • Evaluate how long these calls take, as I believe they just end up issuing a whole series of individual INSERT statements; a rough timing harness is sketched below
  • If necessary, apply some of the suggestions above
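
As a starting point for that evaluation, here is a rough timing harness. By default to_sql() goes through executemany(), which effectively round-trips one parameterized INSERT per row; with the mssql+pyodbc dialect, SQLAlchemy's fast_executemany=True flag is a known mitigation that binds parameters in bulk. Connection string and table names are placeholders:

```python
import time
import pandas as pd
import sqlalchemy as sql

# fast_executemany (mssql+pyodbc dialect) sends parameter arrays in bulk
# instead of one INSERT round-trip per row. Connection string is a placeholder.
engine = sql.create_engine(
    "mssql+pyodbc://SERVER/DB?driver=ODBC+Driver+17+for+SQL+Server",
    fast_executemany=True,
)

def timed_to_sql(df: pd.DataFrame, table: str) -> float:
    """Run to_sql() and return elapsed seconds, for before/after comparison."""
    start = time.perf_counter()
    df.to_sql(table, engine, schema="staging", if_exists="append",
              index=False, chunksize=10_000)
    return time.perf_counter() - start
```

Timing the same frame with the flag on and off should tell us quickly whether to_sql() is worth further work or whether the effort belongs in the bulk insert path.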

Describe alternatives you've considered

N/A

Additional context

This is a very low priority task, as running overnight once a year is not really a big deal. However, we could pick up this dev work for the intercensal (2010-2019) or 2025 Estimates (2020-2025) runs if we have dev time.
