Bulk Loaders and Exports #571

Closed
n0mn0m opened this issue Sep 1, 2017 · 10 comments

n0mn0m commented Sep 1, 2017

Hi,

I've used odo in some of my smaller projects and enjoy it, but for larger projects my current data migrations use pyodbc and calls to utilities like bcp and nzload because of the size of the data I'm importing/exporting.

I don't see much documentation around utilities like these in Odo, but I did see that pg_bulkload was not implemented yet.

Is there any movement in this space, and is there onboarding documentation for somebody who would like to assist or contribute work relating to these bulk import/export utilities? If that doesn't fit the project, do you have any ideas or suggestions for somebody interested in creating an extension library for odo with these utilities?

Thanks.

@dhirschfeld

I've previously used bcp by writing CSVs to spinning rust, and whilst it does get you very good speedups, the whole process felt a bit icky. In contrast, the postgres bulk upload can be done entirely in memory by writing the CSV to a StringIO. I tried to avoid writing to disk by using a named pipe but couldn't get that working; I've read various things on the internet suggesting there may be a way to make it work, but I haven't followed that up.
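
For reference, a minimal sketch of the in-memory postgres path described above, using psycopg2's copy_expert with a StringIO; the connection string, table name, and rows are placeholders, not from this thread:

import io
import psycopg2

# Build the CSV in memory instead of writing it to disk.
buf = io.StringIO()
buf.write("1,foo\n2,bar\n")
buf.seek(0)

conn = psycopg2.connect("dbname=mydb")  # placeholder connection details
with conn, conn.cursor() as cur:
    # COPY ... FROM STDIN streams the in-memory CSV straight into the table.
    cur.copy_expert("COPY my_table FROM STDIN WITH (FORMAT csv)", buf)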

In this instance I've got a bit of .NET envy - the SqlBulkCopy class uses the same APIs as bcp and so gets the same performance, but directly from C# code. I haven't pursued the named pipe route as I think the proper way to do this is to implement the Tabular Data Stream protocol directly in Python, as SqlBulkCopy does in C#. I think that's doable, but it's a big project which I don't personally have time for at the moment.

Refs:


n0mn0m commented Sep 13, 2017

Can you give me some more insight into what made bcp feel icky? My current usage does feel like a hack, but I'm curious if you have more details on that. I'll look into MS-TDS; it sounds like an interesting opportunity, and I'll be curious to gauge the scale since you mentioned it being a big project. Today I implemented an mssql CopyToCSV. That said, I'm not in love with it: it works, but it feels like a hack, and a lot of that is the bcp component.

startup.py

import sqlalchemy as sa
from odo import odo

# params: URL-encoded ODBC connection string, built elsewhere
metadata = sa.MetaData(bind="mssql+pyodbc:///?odbc_connect=%s" % params)
tbl = sa.Table('details', metadata, schema='scheme', autoload=True)
odo(tbl, 'bulk_out.csv')

odo/backends/sql.py

# Lives in odo/backends/sql.py, where CopyToCSV and sqlalchemy's `compiles`
# decorator are already available; `os` also needs to be imported at module level.
@compiles(CopyToCSV, 'mssql')
def compile_copy_to_csv_mssql(element, compiler, **kwargs):
    selectable = element.element

    # Recover the raw ODBC settings from the engine URL so the bcp command
    # line can be built from them.
    odbc_connect = selectable.bind.url.query['odbc_connect']
    connection_elements = odbc_connect.replace('=', ';').split(';')
    connection_elements = dict(zip(connection_elements[0::2], connection_elements[1::2]))

    cmd_template = """bcp {database}.{schema}.{table} OUT {path} -S {server} -T -c -q"""
    cmd_statement = cmd_template.format(database=connection_elements['DATABASE'],
                                        schema=selectable.schema,
                                        table=selectable.name,
                                        path=element.path.replace('\\\\', '\\'),
                                        server=connection_elements['SERVER'])

    # Shell out to bcp to do the actual export; nothing is compiled to SQL here.
    os.system(cmd_statement)
    return ''

So there are a lot of reasons I don't like this. First, it doesn't actually use sqlalchemy for anything except to fit the compile-decorated function pattern that's already present. Because of that it feels hacky, and I have to use the connection string to build up the bcp command. It's also just issuing a command via os.system, which isn't a bad thing in itself, but is very different from the other compile_copy_to_csv_* functions. Finally, building up the bcp command is inflexible: if somebody needed a different delimiter or different flags, they would have to manually change this function's internals.

What is nice is that for somebody using odo for a SQL -> CSV table export it's very fast, the usage is still just odo(mssql_source, csv_dest), and it routes through the sqlalchemy compiler, which makes the discovery the same as the other sql.py CopyToCSV database functions.

That said, this was a first pass for me today, just digging into the odo internals a bit more and seeing if I could get this working. I need to spend more time working with the odo graph and its arrival at the compile_copy_to_csv_* functions. I also have notes to review from my conversation with Joe over the weekend, where he offered some suggestions and potentially a different way to route this instead of making it sqlalchemy-compiled.

If you have any suggestions or concerns please let me know, as I'm not married to this, but I want to provide some form of working SQL Server bulk operation for odo. That may be MS-TDS; I'll look into it this week and see what it is all about, as well as explore whether sqlalchemy already has an MS-TDS extension available.

@dhirschfeld

Can you give me some more insight on what made bcp feel icky?

Writing anything to disk is slow. Since there's no reason why you should have to do that, it's not a great solution IMHO (other than the fact that it works and gives a good performance boost). If bcp could accept data either from stdin or from a named pipe, you could avoid having to write anything to disk. The fact that limitations in bcp prevent that is icky.

@dhirschfeld

If you have any suggestions or concerns please let me know as I'm not married to this...

It's a perfectly good solution - I just think implementing the TDS protocol directly would be a better one, but that's also a lot more work, so this is a good improvement on what's currently available.

NB: I'd probably use subprocess.check_call so you know if it worked or not
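
As a hedged sketch of that suggestion, the os.system(cmd_statement) line in the snippet above could become a subprocess.check_call with the command as an argument list, so a non-zero bcp exit status raises instead of being silently swallowed (the flags are the same ones used above; nothing new is added):

import subprocess

cmd = [
    'bcp',
    '{0}.{1}.{2}'.format(connection_elements['DATABASE'], selectable.schema, selectable.name),
    'OUT', element.path.replace('\\\\', '\\'),
    '-S', connection_elements['SERVER'],
    '-T', '-c', '-q',
]
# Raises subprocess.CalledProcessError if bcp exits with a non-zero status.
subprocess.check_call(cmd)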


n0mn0m commented Sep 13, 2017

Got it, that makes a lot of sense. Thank you for all the references, insight and feedback. I'll dig into these and see if I can figure something different out this week.

@dhirschfeld

Like I say, this is a good fit for now. SqlBulkCopy for Python is just something on my wishlist, but I thought I'd document what I'd found in case anyone wanted to pick it up...


n0mn0m commented Sep 13, 2017

Understood. I'll work on a bcp in function for the mssql support in sql_csv.py and then set up a PR.

@nickolay

There's pymssql/pymssql#279 requesting bulk load support, which points to another library (https://github.com/zillow/ctds), which implements bulk insert, though it does not yet have fully working SQLAlchemy integration (zillow/ctds#6).
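
For anyone evaluating that route, here is a rough, unverified sketch of what a ctds-based bulk insert looks like; the server, credentials, and table name are placeholders, and the exact keyword arguments should be checked against the ctds documentation:

import ctds

# bulk_insert streams rows over the TDS bulk-copy API, so no intermediate CSV
# file or bcp subprocess is involved.
conn = ctds.connect('myserver', user='me', password='secret', database='mydb')
try:
    rows = [(1, 'foo'), (2, 'bar')]  # row tuples matching the target table's columns
    conn.bulk_insert('details', rows)  # table name is a placeholder
finally:
    conn.close()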


n0mn0m commented Aug 31, 2018

I don't think much is happening with odo today. I'll check out the referenced pymssql issue and the zillow implementation. Thanks @nickolay

n0mn0m closed this as completed Aug 31, 2018