Adding target BigQuery #282
hey @jmriego. Adding target-bigquery would be awesome. If you're happy to do the coding part yourself then we can compare your and our experiences so far and I can give you some hints. Unfortunately we haven't had a chance to look into BigQuery in more detail, but if you're happy to contribute then we're happy to help where we can. If you're interested please reach me on the singer-io Slack channel; direct message me and we can discuss the details.
WIP prototype repo created at https://github.com/jmriego/pipelinewise-target-bigquery
Hi @koszti! Just wondering if there's anything you might want me to check about this target. We have been using it in production for 6 months, so it should be usable now.
I'm just curious what the plan is for this? I'm getting started with PipelineWise and BQ is my target. A PR would be great.
Sorry @stewartbryson, but because of the way PipelineWise works I can't really send a PR. Each target is a separate repository, and this target would eventually live in https://github.com/transferwise/pipelinewise-target-bigquery
That's just my ignorance... I get it now. Thanks for clearing it up. Again... just getting started with this, but it looks like that …
No, that's fine; you shouldn't need to fork the repo to use this target. I haven't added it to … You'd have to create that requirements file I mentioned earlier and run: …
Thanks for your feedback and activity in this thread. Making a fully completed target connector requires a bit more work than adding a new tap, but target-bigquery would be a great milestone and I can try to help push it forward. To create a complete target-bigquery we would ideally need to: 1) move the code into an official transferwise repository, 2) add fastsync-to-bigquery support, 3) add test cases and 4) add documentation.
Re point 1): @jmriego, I can create and move your code into https://github.com/transferwise/pipelinewise-target-bigquery . Is that fine with you? Once it's done we can link it to the main PPW by creating a new …

Re point 2): What data sources do you use? Did you experience slow initial syncs of large databases, say bigger than 10-20GB? Basically we can live without fastsync-to-bigquery, but I guess there will be some performance problems when syncing very big tables. What do you think, would that be a problem?

Re points 3) and 4): It looks straightforward and we can re-use existing test cases from other targets.
Sure, that makes perfect sense. Once we have that pipelinewise-target-bigquery repo I'll be able to open a proper PR. There are some things I'll need to update to catch up with all the changes that have been added to PipelineWise in the last months, plus adding the end-to-end tests and similar. After that repo is created I'll work on this PR. I'll leave it as WIP because there are some questions I'd want to ask you about.
It sounds like this might happen pretty quickly, and it's appreciated. Just in case it takes a bit, do you guys have a sample …
Source code moved to https://github.com/transferwise/pipelinewise-target-bigquery @jmriego, I added some common changes, like CircleCI integration and pylint, and switched to the PPW singer library, which gives the same logging experience across every tap and target. Please send further PRs as required. 🙇
I just created the PR, but there's still some work to do, mainly around the tests and documentation mentioned by @koszti above. At the moment I'm unable to build the PipelineWise Docker image, so I haven't even tested the current state of the fastsync code (related issue: https://github.com/transferwise/pipelinewise-tap-mysql/issues/22). As usual with BigQuery, the testing code that is compatible with all the other databases needs some rework here, mainly because BigQuery does not have the same DDL and INFORMATION_SCHEMA capabilities. I'll keep working on this PR.
@stewartbryson this is a sample target_bigquery.yml file. Hope it helps!
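The sample file itself did not survive the page extraction. Below is a hypothetical sketch of what a `target_bigquery.yml` might look like, modeled on the layout of other PipelineWise target YAML files; every key and placeholder value here is an assumption, not the file originally shared in the thread.

```yaml
# Hypothetical PipelineWise target definition for BigQuery.
# Key names mirror other PipelineWise targets; the real
# target-bigquery config may differ.
id: "bigquery"                  # unique target identifier
name: "BigQuery"                # human-readable name
type: "target-bigquery"         # connector type
db_conn:
  project_id: "my-gcp-project"  # GCP project (placeholder)
  dataset_id: "pipelinewise"    # default dataset (placeholder)
```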
@koszti, in our current usage of this target we have had to add some code to manage replication from Postgres to BigQuery for numbers that are too big for BigQuery to handle (Postgres can handle numbers bigger than …). In our case, tap_postgres had to be modified so that, when querying the database, it would turn numbers bigger than that into …
We can also consider handling this in target-bigquery instead. Maybe some other targets can deal with it and want to keep the original extracted values, for example target-s3-csv or pg-to-pg. What do you think, can we do this in the target instead? I remember target-snowflake also has some similar type/value conversion for values that only Snowflake can't handle, but we wanted to give the other targets a chance, so we implemented it in the target and not in the tap.
Sorry, I didn't do a great job of explaining that. I was referring to the fastsync to BigQuery. There are several modifications you need to make to your source query to make sure you don't export bad rows; for example, here for the dates: https://github.com/transferwise/pipelinewise/blob/master/pipelinewise/fastsync/commons/tap_mysql.py#L215 For BigQuery we also need to add a modification for numbers that are too big for it to handle. The way I was planning on doing that is to modify the …
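As a rough illustration of the clamping rule being discussed (not the actual fastsync code, which rewrites the source query instead): BigQuery's NUMERIC type holds 38 digits of precision with 9 decimal places, so values of 10^29 or larger in absolute value cannot be stored. A minimal Python sketch, where the helper name and the null-out behaviour are assumptions:

```python
from decimal import Decimal

# BigQuery NUMERIC: 38 digits of precision, 9 decimal places, so the
# largest representable value is just under 10^29. (Limit stated for
# illustration; the exact cutoff used in the PR may differ.)
BIGQUERY_NUMERIC_MAX = Decimal("99999999999999999999999999999.999999999")

def safe_numeric(value):
    """Return the value if BigQuery NUMERIC can store it, else None.

    Hypothetical helper: the real fastsync approach modifies the
    extraction query, but the clamping rule is the same idea.
    """
    if value is None:
        return None
    dec = Decimal(str(value))
    if abs(dec) > BIGQUERY_NUMERIC_MAX:
        return None  # value overflows BigQuery NUMERIC; drop it
    return dec
```

Whether out-of-range values should become NULL or be clamped to the maximum is a design choice the thread leaves open; nulling them at least avoids silently storing wrong numbers.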
Hi @koszti, I have had some problems getting this to work, but it seems ready now. It turns out Python 3.8 is not compatible with PipelineWise, and there were some issues with the Docker image, but it's all good now. I have been able to run fastsync and incremental replication with the image in the PR. I still have some work left to do, mainly around testing. Could you give me some instructions on how the end-to-end tests work? I can't really figure out where to start adding them and which env variables are needed or detected.
hey @jmriego, yes, I think adding an optional parameter of … Regarding testing, I'll add some hints to #445.
hi @koszti , just letting you know that I managed to get the docker testing working and wrote unit tests, integration tests and updated some docs for getting the target bigquery merged. I have rebased my changes on top of the current PipelineWise repo and also refactored some stuff to make it similar to the way other PipelineWise targets are organized. Let me know if there's anything else I need to do for this. Thanks in advance! |
First of all thanks for this amazing project. It really makes replication so much easier to do and to version control the configuration.
Are there any plans to add the BigQuery target?
I would be happy to add it myself, but I'm having some issues understanding where I should make changes. I have tried just adding the target from singer and the requirements.txt as instructed in the Contributing part of the docs, but I can't get it to work. I have problems with the data types, and the identifiers can't contain dashes. If anyone could help me with this, I'm happy to do it myself. Thanks!
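The dash problem mentioned above comes from BigQuery's identifier rules: column names may contain only letters, digits and underscores, and must start with a letter or underscore. A minimal sketch of the kind of name sanitisation a target would need (the helper name is hypothetical, not PipelineWise code):

```python
import re

def to_bigquery_identifier(name: str) -> str:
    """Turn an arbitrary tap stream/column name into a valid
    BigQuery identifier.

    Illustrative helper (not PipelineWise code): replaces any
    character outside [A-Za-z0-9_] with an underscore and prefixes
    an underscore if the name starts with a digit.
    """
    safe = re.sub(r"[^A-Za-z0-9_]", "_", name)
    if not re.match(r"^[A-Za-z_]", safe):
        safe = "_" + safe
    return safe
```

For example, a tap column called `order-items` would map to `order_items`; a real target would also need to handle the resulting risk of name collisions.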