Bulk Loaders and Exports #571

Closed
n0mn0m opened this issue Sep 1, 2017 · 10 comments

n0mn0m commented Sep 1, 2017

Hi,

I've used odo in some of my smaller projects and enjoy it, but for larger projects my current data migrations use pyodbc and calls to utilities like bcp and nzload because of the size of the data I'm importing/exporting.

I don't see much documentation around utilities like these in Odo, but I did see that pg_bulkload was not implemented yet.

Is there any movement in this space, and is there onboarding documentation for somebody who would like to assist or contribute work relating to these bulk import/export utilities? If that doesn't fit the project, do you have any ideas or suggestions for somebody interested in creating an extension library for odo with these utilities?

Thanks.

@dhirschfeld

I've previously used bcp by writing CSVs to spinning rust, and whilst it does get you very good speedups, the whole process felt a bit icky. In contrast, the postgres bulk upload can be done entirely in memory by writing the CSV to a StringIO. I tried to avoid writing to disk by using a named pipe but couldn't get that working; I've read various things on the internet suggesting there may be a way to make it work, but I haven't followed that up.
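
For reference, a minimal sketch of the in-memory postgres path described above, using psycopg2's copy_expert with a StringIO; the connection string, table name, and rows are placeholders, not from this thread:

import io
import psycopg2

# Build the CSV in memory instead of writing it to disk.
buf = io.StringIO()
buf.write("1,foo\n2,bar\n")
buf.seek(0)

conn = psycopg2.connect("dbname=mydb")  # placeholder connection details
with conn, conn.cursor() as cur:
    # COPY ... FROM STDIN streams the in-memory CSV straight into the table.
    cur.copy_expert("COPY my_table FROM STDIN WITH (FORMAT csv)", buf)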

In this instance I've got a bit of .NET envy - the SqlBulkCopy class uses the same APIs as bcp and so gets the same performance, but directly from C# code. I haven't pursued the named pipe route as I think the proper way to do this is to implement the Tabular Data Stream protocol directly in Python, as SqlBulkCopy does in C#. I think that's doable, but it's a big project which I don't personally have time for at the moment.

Refs:


n0mn0m commented Sep 13, 2017

Can you give me some more insight into what made bcp feel icky? My current usage does feel like a hack, but I'm curious if you have more details on that. I'll look into MS-TDS; it sounds like an interesting opportunity, and I'll be curious to gauge the scale since you mentioned it being a big project. Today I implemented an mssql CopyToCSV. That said, I'm not in love with it: it works, but it feels like a hack, and a lot of that is the bcp component.

startup.py

import sqlalchemy as sa
from odo import odo

# params: URL-encoded ODBC connection string, built elsewhere
metadata = sa.MetaData(bind="mssql+pyodbc:///?odbc_connect=%s" % params)
tbl = sa.Table('details', metadata, schema='scheme', autoload=True)
odo(tbl, 'bulk_out.csv')

odo/backends/sql.py

# Lives in odo/backends/sql.py, where CopyToCSV and sqlalchemy's `compiles`
# decorator are already available; `os` also needs to be imported at module level.
@compiles(CopyToCSV, 'mssql')
def compile_copy_to_csv_mssql(element, compiler, **kwargs):
    selectable = element.element

    # Recover the raw ODBC settings from the engine URL so the bcp command
    # line can be built from them.
    odbc_connect = selectable.bind.url.query['odbc_connect']
    connection_elements = odbc_connect.replace('=', ';').split(';')
    connection_elements = dict(zip(connection_elements[0::2], connection_elements[1::2]))

    cmd_template = """bcp {database}.{schema}.{table} OUT {path} -S {server} -T -c -q"""
    cmd_statement = cmd_template.format(database=connection_elements['DATABASE'],
                                        schema=selectable.schema,
                                        table=selectable.name,
                                        path=element.path.replace('\\\\', '\\'),
                                        server=connection_elements['SERVER'])

    # Shell out to bcp to do the actual export; nothing is compiled to SQL here.
    os.system(cmd_statement)
    return ''

So there are a lot of reasons I don't like this. First, it doesn't actually use sqlalchemy for anything except to fit the compile-decorated function pattern that's already present. Because of that it feels hacky, and I have to use the connection string to build up the bcp command. It's also just issuing a command via os.system, which isn't a bad thing in itself, but is very different from the other compile_copy_to_csv_* functions. Finally, building up the bcp command is inflexible: if somebody needed a different delimiter or different flags, they would have to manually change this function's internals.

What is nice is that for somebody using odo for a SQL -> CSV table export it's very fast, the usage is still just odo(mssql_source, csv_dest), and it routes through the sqlalchemy compiler, which makes the discovery the same as the other sql.py CopyToCSV database functions.

That said, this was a first pass for me today, just digging into the odo internals a bit more and seeing if I could get this working. I need to spend more time working with the odo graph and its arrival at the compile_copy_to_csv_* functions. I also have notes to review from my conversation with Joe over the weekend, where he offered some suggestions and potentially a different way to route this instead of making it sqlalchemy-compiled.

If you have any suggestions or concerns please let me know, as I'm not married to this, but I want to provide some form of working SQL Server bulk operation for odo. That may be MS-TDS; I'll look into it this week and see what it is all about, as well as explore whether sqlalchemy already has an MS-TDS extension available.

@dhirschfeld

Can you give me some more insight on what made bcp feel icky?

Writing anything to disk is slow. Since there's no reason why you should have to do that, it's not a great solution IMHO (other than the fact that it works and gives a good performance boost). If bcp could accept data either from stdin or from a named pipe, you could avoid having to write anything to disk. The fact that limitations in bcp prevent that is icky.

@dhirschfeld

If you have any suggestions or concerns please let me know as I'm not married to this...

It's a perfectly good solution - I just think implementing the TDS protocol directly would be a better one, but that's also a lot more work, so this is a good improvement on what's currently available.

NB: I'd probably use subprocess.check_call so you know if it worked or not
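
As a hedged sketch of that suggestion, the os.system(cmd_statement) line in the snippet above could become a subprocess.check_call with the command as an argument list, so a non-zero bcp exit status raises instead of being silently swallowed (the flags are the same ones used above; nothing new is added):

import subprocess

cmd = [
    'bcp',
    '{0}.{1}.{2}'.format(connection_elements['DATABASE'], selectable.schema, selectable.name),
    'OUT', element.path.replace('\\\\', '\\'),
    '-S', connection_elements['SERVER'],
    '-T', '-c', '-q',
]
# Raises subprocess.CalledProcessError if bcp exits with a non-zero status.
subprocess.check_call(cmd)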


n0mn0m commented Sep 13, 2017

Got it, that makes a lot of sense. Thank you for all the references, insight and feedback. I'll dig into these and see if I can figure something different out this week.

@dhirschfeld

Like I say, this is a good fit for now. SqlBulkCopy for Python is just something on my wishlist, but I thought I'd document what I'd found in case anyone wanted to pick it up...


n0mn0m commented Sep 13, 2017

Understood. I'll work on a bcp in function for the mssql support in sql_csv.py and then set up a PR.

@nickolay

There's pymssql/pymssql#279 requesting bulk load support, which points to another library (https://github.com/zillow/ctds), which implements bulk insert, though it does not yet have fully working SQLAlchemy integration (zillow/ctds#6).
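
For anyone evaluating that route, here is a rough, unverified sketch of what a ctds-based bulk insert looks like; the server, credentials, and table name are placeholders, and the exact keyword arguments should be checked against the ctds documentation:

import ctds

# bulk_insert streams rows over the TDS bulk-copy API, so no intermediate CSV
# file or bcp subprocess is involved.
conn = ctds.connect('myserver', user='me', password='secret', database='mydb')
try:
    rows = [(1, 'foo'), (2, 'bar')]  # row tuples matching the target table's columns
    conn.bulk_insert('details', rows)  # table name is a placeholder
finally:
    conn.close()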


n0mn0m commented Aug 31, 2018

I don't think much is happening with odo today. I'll check out the referenced pymssql issue and the zillow implementation. Thanks @nickolay

n0mn0m closed this as completed Aug 31, 2018