
Adding target BigQuery #282

Open
jmriego opened this issue Dec 2, 2019 · 19 comments
Labels
enhancement New feature or request

Comments

@jmriego
Contributor

jmriego commented Dec 2, 2019

First of all, thanks for this amazing project. It makes replication so much easier to do, and the configuration easy to version control.

Are there any plans to add a BigQuery target?

I would be happy to add it myself, but I'm having some trouble understanding where I should make the changes. I have tried just adding the Singer target and the requirements.txt as instructed in the Contributing section of the docs, but I can't get it to work: I have problems with the data types, and the identifiers can't contain dashes.

If anyone could help me with this, I'm happy to do the work myself. Thanks!

@koszti
Contributor

koszti commented Dec 4, 2019

hey @jmriego. Adding target-bigquery would be awesome. If you're happy to do the coding part yourself, then we can compare notes on our experiences so far and I can give you some hints. Unfortunately we haven't had a chance to look into BigQuery in more detail, but if you're happy to contribute then we're happy to help where we can.

If you're interested, please reach out to me with a direct message on the singer-io Slack channel and we can discuss the details.

@koszti koszti added the enhancement New feature or request label Dec 15, 2019
@koszti
Contributor

koszti commented Dec 30, 2019

WiP prototype repo created at https://github.com/jmriego/pipelinewise-target-bigquery

@jmriego
Contributor Author

jmriego commented Jun 2, 2020

Hi @koszti! Just wondering if there's anything you might want me to check about this target. We have been using it in production for 6 months, so it should be usable now.

@stewartbryson

I'm just curious what the plan is for this. I'm getting started with PipelineWise and BigQuery is my target. A PR would be great.

@jmriego
Contributor Author

jmriego commented Jun 7, 2020

Sorry @stewartbryson, but the way PipelineWise works I can't really send a PR. Each target is a separate repository, and this target would eventually live at https://github.com/transferwise/pipelinewise-target-bigquery
For now, the way you can try it is by creating a singer-connectors/target-bigquery/requirements.txt that points to my repo.

@stewartbryson

> Sorry @stewartbryson, but the way PipelineWise works I can't really send a PR. Each target is a separate repository, and this target would eventually live at https://github.com/transferwise/pipelinewise-target-bigquery
> For now, the way you can try it is by creating a singer-connectors/target-bigquery/requirements.txt that points to my repo.

That's just my ignorance... I get it now, thanks for clearing it up. Again, I'm just getting started with this, but it looks like tap_properties.py would have to be updated to include it, is that right? Would I have to fork the main repo to make those modifications?

@jmriego
Contributor Author

jmriego commented Jun 7, 2020

No, that's fine, you shouldn't need to fork the repo to use this target. I haven't added it to tap_properties.py because the target works with the defaults.

You'd have to create the requirements file I mentioned earlier and run: ./install.sh --connectors=target-bigquery
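Those two steps can be sketched as follows. This is a minimal sketch: the repo URL is the WIP one from this thread, but the exact pip-style `git+https` requirement line is an assumption about how your requirements.txt should reference it.

```shell
# Hypothetical setup: point the connector at the WIP repository,
# then let PipelineWise's installer pick it up.
mkdir -p singer-connectors/target-bigquery
printf 'git+https://github.com/jmriego/pipelinewise-target-bigquery.git\n' \
  > singer-connectors/target-bigquery/requirements.txt
# then install just this connector:
# ./install.sh --connectors=target-bigquery
```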

@koszti
Contributor

koszti commented Jun 8, 2020

Thanks for your feedback and activity in this thread. Making a fully complete target connector requires a bit more work than adding a new tap, but target-bigquery would be a great milestone and I can try to help push it forward.

To create a complete target-bigquery we would ideally need to:

  1. Move the code under the PPW repo at https://github.com/transferwise/pipelinewise-target-bigquery
  2. Add Fast Sync components to initial-sync big tables here
  3. Add end-to-end tests in the main PipelineWise here
  4. Documentation and examples here. We need to document the differences from the common PPW Schema Changes and why schema changes work differently in pipelinewise-target-bigquery. Especially versioning columns.

Re point 1) @jmriego, I can create https://github.com/transferwise/pipelinewise-target-bigquery and move your code into it. Is that fine with you? Once it's done we can link it to the main PPW by creating a new singer-connectors/target-bigquery/requirements.txt

Re point 2) What data sources do you use? Did you experience slow initial syncs of databases bigger than 10-20 GB? Basically we can live without fastsync-to-bigquery, but I guess there will be some performance problems when syncing very big tables. What do you think, would that be a problem?

Re points 3) and 4): they look straightforward, and we can re-use existing test cases from other targets.

@jmriego
Contributor Author

jmriego commented Jun 9, 2020

Sure, that makes perfect sense.
I also created the fastsync components, as we were running into some speed issues when running it for big tables.

Once we have that pipelinewise-target-bigquery repo I'll be able to open a proper PR. There are some things I need to update to catch up with all the changes that have been added to PipelineWise in the last months, plus adding the end-to-end tests and similar.

After that repo is created I'll work on the PR. I'll leave it as WIP because there are some questions I'd like to ask you.

@stewartbryson

stewartbryson commented Jun 9, 2020

It sounds like this might happen pretty quickly, and it's appreciated. Just in case it takes a bit, do you have a sample target_bigquery.yml file? I've got everything else up and running in my Docker container.

@koszti
Contributor

koszti commented Jun 9, 2020

Source code moved to https://github.com/transferwise/pipelinewise-target-bigquery
v1.0.0 released to PyPI and available at https://pypi.org/project/pipelinewise-target-bigquery/

@jmriego, I added some common changes, like CircleCI integration and pylint, and switched to the PPW Singer library, which gives the same logging experience across every tap and target. Please send further PRs as required. 🙇

@jmriego
Contributor Author

jmriego commented Jun 14, 2020

I just created the PR but there's still some work to do mainly about tests and documentation as mentioned by @koszti above. At the moment I'm unable to build the PipelineWise docker image so I haven't even tested the current state of the code for fastsync (related issue https://github.com/transferwise/pipelinewise-tap-mysql/issues/22).

As usual with BigQuery, the testing code that is compatible with all the other databases needs some rework here, mainly because BigQuery does not have the same DDL and INFORMATION_SCHEMA capabilities. I'll keep working on this PR.

@jmriego
Contributor Author

jmriego commented Jun 14, 2020

@stewartbryson this is a sample target_bigquery.yml file. Hope it helps!

# ------------------------------------------------------------------------------
# General Properties
# ------------------------------------------------------------------------------
id: "bigquery"                        # Unique identifier of the target
name: "BigQuery"                      # Name of the target
type: "target-bigquery"               # !! THIS SHOULD NOT CHANGE !!


# ------------------------------------------------------------------------------
# Target - Data Warehouse connection details
# ------------------------------------------------------------------------------
db_conn:
  project_id: "???"
  dataset_id: "???"

@jmriego
Contributor Author

jmriego commented Jun 14, 2020

@koszti, in our current usage of this target we have had to add some code to manage replication from Postgres to BigQuery for numbers that are too big for BigQuery to handle (Postgres can store numbers bigger than 99999999999999999999999999999.999999999, the maximum value of BigQuery's NUMERIC type).

In our case, tap_postgres had to be modified so that, when querying the database, it would turn numbers bigger than that into NULL instead. Do you think this kind of code should be part of the PR?

@koszti
Contributor

koszti commented Jun 15, 2020

We can also consider handling this in target-bigquery instead. Maybe some other targets can deal with those values and want to keep the original extracted data, for example target-s3-csv or pg-to-pg.

What do you think, can we do this in the target instead?

I remember target-snowflake also has some similar type/value conversions for values that only Snowflake can't handle; we wanted to give other targets the chance to keep the original values, so we implemented it in the target and not in the tap.

@jmriego
Contributor Author

jmriego commented Jun 16, 2020

Sorry, I didn't do a great job of explaining that: I was referring to the fast sync to BigQuery.

There are several modifications you need to make to your source query to make sure you don't export bad rows. For example, here for the dates: https://github.com/transferwise/pipelinewise/blob/master/pipelinewise/fastsync/commons/tap_mysql.py#L215

For BigQuery we also need to add a modification for numbers that are too big for it to handle. The way I was planning to do that is to modify the get_table_columns function, adding a new optional parameter max_number that would export all numbers greater than that as NULL. Do you think that's an acceptable solution?
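To make the idea concrete, here is a minimal sketch of the kind of SQL rewriting being proposed. The helper name `safe_number_select`, the exact CASE expression, and the default constant are hypothetical illustrations, not the actual get_table_columns change:

```python
from decimal import Decimal

# Maximum value of BigQuery's NUMERIC type (38 digits of precision, scale 9),
# the limit discussed above.
BIGQUERY_NUMERIC_MAX = Decimal("99999999999999999999999999999.999999999")

def safe_number_select(column, max_number=BIGQUERY_NUMERIC_MAX):
    """Hypothetical helper: wrap a numeric column in the fastsync source
    query so values outside BigQuery's NUMERIC range are exported as NULL,
    mirroring the optional max_number parameter proposed for
    get_table_columns."""
    return (
        f"CASE WHEN ABS({column}) > {max_number} "
        f"THEN NULL ELSE {column} END AS {column}"
    )

print(safe_number_select("amount"))
```

Keeping the clamp in the generated source query (rather than post-processing rows in Python) means the database does the filtering, which matters for fast sync's bulk-export performance.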

@jmriego
Contributor Author

jmriego commented Jun 21, 2020

Hi @koszti ,

I have had some problems getting this to work, but it seems ready now. It turns out Python 3.8 is not compatible with PipelineWise, and there were some issues with the Docker image, but it's all good now. I have been able to run fast sync and incremental sync with the image in the PR. I still have some work left, mainly around testing. Could you give me some instructions on how the end-to-end tests work? I can't really figure out where to start adding them or which environment variables are needed or detected.

@koszti
Contributor

koszti commented Jul 4, 2020

hey @jmriego, yes, I think adding an optional max_number parameter to get_table_columns makes sense, and you can control the behaviour from the BigQuery-specific fastsync methods. Just please use the current logic as the default, so the behaviour will not change in other components.

Regarding testing, I'll add some hints to #445.

@jmriego
Contributor Author

jmriego commented May 29, 2021

hi @koszti ,

just letting you know that I managed to get the Docker testing working, wrote unit tests and integration tests, and updated some docs so target-bigquery can be merged.
It's quite a big PR, and it required some changes to the existing fastsync taps as mentioned, but I always kept the existing behaviour as the default. This would require first merging jmriego/pipelinewise-target-bigquery#12; once that's merged and that repo's version is changed, I'll update the target-bigquery requirements.txt file with the newer version number.

I have rebased my changes on top of the current PipelineWise repo and also refactored some things to make it similar to the way other PipelineWise targets are organized.

Let me know if there's anything else I need to do for this. Thanks in advance!
