
Adding target BigQuery #282

Open
jmriego opened this issue Dec 2, 2019 · 19 comments
Labels
enhancement New feature or request

Comments

@jmriego
Contributor

jmriego commented Dec 2, 2019

First of all, thanks for this amazing project. It makes replication so much easier to do, and the configuration easy to version control.

Are there any plans to add a BigQuery target?

I would be happy to add it myself, but I'm having some trouble understanding where I should make the changes. I have tried just adding the Singer target and the requirements.txt as instructed in the Contributing section of the docs, but I can't get it to work: I have problems with the data types, and the identifiers can't contain dashes.

If anyone could help me with this, I'm happy to do the work myself. Thanks!

@koszti
Contributor

koszti commented Dec 4, 2019

hey @jmriego. Adding target-bigquery would be awesome. If you're happy to do the coding part yourself, then we can compare notes on our experiences so far and I can give you some hints. Unfortunately we haven't had a chance to look into BigQuery in more detail, but if you're happy to contribute then we're happy to help where we can.

If you're interested, please reach out to me with a direct message on the singer-io Slack channel and we can discuss the details.

@koszti koszti added the enhancement New feature or request label Dec 15, 2019
@koszti
Contributor

koszti commented Dec 30, 2019

WiP prototype repo created at https://github.com/jmriego/pipelinewise-target-bigquery

@jmriego
Contributor Author

jmriego commented Jun 2, 2020

Hi @koszti! Just wondering if there's anything you might want me to check about this target. We have been using it in production for 6 months, so it should be usable now.

@stewartbryson

I'm just curious what the plan is for this. I'm getting started with PipelineWise and BigQuery is my target. A PR would be great.

@jmriego
Contributor Author

jmriego commented Jun 7, 2020

Sorry @stewartbryson, but the way PipelineWise works I can't really send a PR. Each target is a separate repository, and this target would eventually live at https://github.com/transferwise/pipelinewise-target-bigquery
For now, the way you can try it is by creating a singer-connectors/target-bigquery/requirements.txt that points to my repo.

@stewartbryson

> Sorry @stewartbryson, but the way PipelineWise works I can't really send a PR. Each target is a separate repository, and this target would eventually live at https://github.com/transferwise/pipelinewise-target-bigquery
> For now, the way you can try it is by creating a singer-connectors/target-bigquery/requirements.txt that points to my repo.

That's just my ignorance... I get it now, thanks for clearing it up. Again, I'm just getting started with this, but it looks like tap_properties.py would have to be updated to include it, is that right? Would I have to fork the main repo to make those modifications?

@jmriego
Contributor Author

jmriego commented Jun 7, 2020

No, that's fine, you shouldn't need to fork the repo to use this target. I haven't added it to tap_properties.py because the target works with the defaults.

You'd have to create the requirements file I mentioned earlier and run: ./install.sh --connectors=target-bigquery
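Those two steps can be sketched as follows. This is a minimal sketch: the repo URL is the WIP one from this thread, but the exact pip-style `git+https` requirement line is an assumption about how your requirements.txt should reference it.

```shell
# Hypothetical setup: point the connector at the WIP repository,
# then let PipelineWise's installer pick it up.
mkdir -p singer-connectors/target-bigquery
printf 'git+https://github.com/jmriego/pipelinewise-target-bigquery.git\n' \
  > singer-connectors/target-bigquery/requirements.txt
# then install just this connector:
# ./install.sh --connectors=target-bigquery
```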

@koszti
Contributor

koszti commented Jun 8, 2020

Thanks for your feedback and activity in this thread. Making a fully complete target connector requires a bit more work than adding a new tap, but target-bigquery would be a great milestone and I can try to help push it forward.

To create a complete target-bigquery we would ideally need to:

  1. Move the code under the PPW repo at https://github.com/transferwise/pipelinewise-target-bigquery
  2. Add Fast Sync components to initial-sync big tables here
  3. Add end-to-end tests in the main PipelineWise here
  4. Documentation and examples here. We need to document the differences from the common PPW Schema Changes and why schema changes work differently in pipelinewise-target-bigquery. Especially versioning columns.

Re point 1) @jmriego, I can create https://github.com/transferwise/pipelinewise-target-bigquery and move your code into it. Is that fine with you? Once it's done we can link it to the main PPW by creating a new singer-connectors/target-bigquery/requirements.txt

Re point 2) What data sources do you use? Did you experience slow initial syncs of databases bigger than 10-20 GB? Basically we can live without fastsync-to-bigquery, but I guess there will be some performance problems when syncing very big tables. What do you think, would that be a problem?

Re points 3) and 4): they look straightforward, and we can re-use existing test cases from other targets.

@jmriego
Contributor Author

jmriego commented Jun 9, 2020

Sure, that makes perfect sense.
I also created the fastsync components, as we were running into some speed issues when running it for big tables.

Once we have that pipelinewise-target-bigquery repo I'll be able to open a proper PR. There are some things I need to update to catch up with all the changes that have been added to PipelineWise in the last months, plus adding the end-to-end tests and similar.

After that repo is created I'll work on the PR. I'll leave it as WIP because there are some questions I'd like to ask you.

@stewartbryson

stewartbryson commented Jun 9, 2020

It sounds like this might happen pretty quickly, and it's appreciated. Just in case it takes a bit, do you have a sample target_bigquery.yml file? I've got everything else up and running in my Docker container.

@koszti
Contributor

koszti commented Jun 9, 2020

Source code moved to https://github.com/transferwise/pipelinewise-target-bigquery
v1.0.0 released to PyPI and available at https://pypi.org/project/pipelinewise-target-bigquery/

@jmriego, I added some common changes, like CircleCI integration and pylint, and switched to the PPW Singer library, which gives the same logging experience across every tap and target. Please send further PRs as required. 🙇

@jmriego
Contributor Author

jmriego commented Jun 14, 2020

I just created the PR but there's still some work to do mainly about tests and documentation as mentioned by @koszti above. At the moment I'm unable to build the PipelineWise docker image so I haven't even tested the current state of the code for fastsync (related issue https://github.com/transferwise/pipelinewise-tap-mysql/issues/22).

As usual with BigQuery, the testing code that is compatible with all the other databases needs some rework here, mainly because BigQuery does not have the same DDL and INFORMATION_SCHEMA capabilities. I'll keep working on this PR.

@jmriego
Contributor Author

jmriego commented Jun 14, 2020

@stewartbryson this is a sample target_bigquery.yml file. Hope it helps!

# ------------------------------------------------------------------------------
# General Properties
# ------------------------------------------------------------------------------
id: "bigquery"                        # Unique identifier of the target
name: "BigQuery"                      # Name of the target
type: "target-bigquery"               # !! THIS SHOULD NOT CHANGE !!


# ------------------------------------------------------------------------------
# Target - Data Warehouse connection details
# ------------------------------------------------------------------------------
db_conn:
  project_id: "???"
  dataset_id: "???"

@jmriego
Contributor Author

jmriego commented Jun 14, 2020

@koszti, in our current usage of this target we have had to add some code to manage replication from Postgres to BigQuery for numbers that are too big for BigQuery to handle (Postgres can store numbers bigger than 99999999999999999999999999999.999999999, the maximum value of BigQuery's NUMERIC type).

In our case, tap_postgres had to be modified so that, when querying the database, it would turn numbers bigger than that into NULL instead. Do you think this kind of code should be part of the PR?

@koszti
Contributor

koszti commented Jun 15, 2020

We can also consider handling this in target-bigquery instead. Maybe some other targets can deal with those values and want to keep the original extracted data, for example target-s3-csv or pg-to-pg.

What do you think, can we do this in the target instead?

I remember target-snowflake also has some similar type/value conversions for values that only Snowflake can't handle; we wanted to give other targets the chance to keep the original values, so we implemented it in the target and not in the tap.

@jmriego
Contributor Author

jmriego commented Jun 16, 2020

Sorry, I didn't do a great job of explaining that: I was referring to the fast sync to BigQuery.

There are several modifications you need to make to your source query to make sure you don't export bad rows. For example, here for the dates: https://github.com/transferwise/pipelinewise/blob/master/pipelinewise/fastsync/commons/tap_mysql.py#L215

For BigQuery we also need to add a modification for numbers that are too big for it to handle. The way I was planning to do that is to modify the get_table_columns function, adding a new optional parameter max_number that would export all numbers greater than that as NULL. Do you think that's an acceptable solution?
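To make the idea concrete, here is a minimal sketch of the kind of SQL rewriting being proposed. The helper name `safe_number_select`, the exact CASE expression, and the default constant are hypothetical illustrations, not the actual get_table_columns change:

```python
from decimal import Decimal

# Maximum value of BigQuery's NUMERIC type (38 digits of precision, scale 9),
# the limit discussed above.
BIGQUERY_NUMERIC_MAX = Decimal("99999999999999999999999999999.999999999")

def safe_number_select(column, max_number=BIGQUERY_NUMERIC_MAX):
    """Hypothetical helper: wrap a numeric column in the fastsync source
    query so values outside BigQuery's NUMERIC range are exported as NULL,
    mirroring the optional max_number parameter proposed for
    get_table_columns."""
    return (
        f"CASE WHEN ABS({column}) > {max_number} "
        f"THEN NULL ELSE {column} END AS {column}"
    )

print(safe_number_select("amount"))
```

Keeping the clamp in the generated source query (rather than post-processing rows in Python) means the database does the filtering, which matters for fast sync's bulk-export performance.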

@jmriego
Contributor Author

jmriego commented Jun 21, 2020

Hi @koszti ,

I have had some problems getting this to work, but it seems ready now. It turns out Python 3.8 is not compatible with PipelineWise, and there were some issues with the Docker image, but it's all good now. I have been able to run fast sync and incremental sync with the image in the PR. I still have some work left, mainly around testing. Could you give me some instructions on how the end-to-end tests work? I can't really figure out where to start adding them or which environment variables are needed or detected.

@koszti
Contributor

koszti commented Jul 4, 2020

hey @jmriego, yes, I think adding an optional max_number parameter to get_table_columns makes sense, and you can control the behaviour from the BigQuery-specific fastsync methods. Just please use the current logic as the default, so the behaviour will not change in other components.

Regarding testing, I'll add some hints to #445.

@jmriego
Contributor Author

jmriego commented May 29, 2021

hi @koszti ,

just letting you know that I managed to get the Docker testing working, wrote unit tests and integration tests, and updated some docs so target-bigquery can be merged.
It's quite a big PR, and it required some changes to the existing fastsync taps as mentioned, but I always kept the existing behaviour as the default. This would require first merging jmriego/pipelinewise-target-bigquery#12; once that's merged and that repo's version is changed, I'll update the target-bigquery requirements.txt file with the newer version number.

I have rebased my changes on top of the current PipelineWise repo and also refactored some things to make it similar to the way other PipelineWise targets are organized.

Let me know if there's anything else I need to do for this. Thanks in advance!
