
INITIALIZE VERTICA PROJECT: target-vertica as fork of pipelinewise-postgres-vertica #7

Merged
29 commits merged on Jun 18, 2021

Conversation

@jordan8037310 (Collaborator) commented May 22, 2021

Proposed changes

This PR is for the initial setup of the pipelinewise-target-vertica project.

  • This PR includes all relevant tests.

Types of changes

  • New feature (non-breaking change which adds functionality)

Checklist

  • Description above provides context of the change
  • I have added tests that prove my fix is effective or that my feature works
  • Unit tests for changes (not needed for documentation changes)
  • CI checks pass with my changes
  • Bumping version in setup.py is an individual PR and not mixed with feature or bugfix PRs
  • Commit message/PR title starts with [AP-NNNN] (if applicable. AP-NNNN = JIRA ID)
  • Branch name starts with AP-NNN (if applicable. AP-NNN = JIRA ID)
  • Commits follow "How to write a good git commit message"
  • Relevant documentation is updated including usage instructions

GUIDE TO TEST

Pipelinewise Guide

This guide covers the requirements and installation steps needed to get started with PipelineWise.

Table of contents

  1. Requirements
  2. Installation
  3. Creating Pipelines
  4. Running Pipelines
  5. Run Test in Docker Development Environment
  6. Issues

Requirements

  • Python 3.7 or 3.8 (recommended)

Python packages:

  • argparse 1.4.0
  • tabulate 0.8.2
  • PyYAML 5.3.1
  • ansible 2.7.16
  • Jinja2 2.11.3
  • joblib 1.0.0
  • PyMySQL 0.7.11
  • psycopg2-binary 2.8.6
  • snowflake-connector-python[pandas] 2.3.7
  • pipelinewise-singer-python 1.*
  • singer-encodings 0.0.*
  • python-dateutil 2.8.1 or below
  • messytables 0.15.*
  • python-pidfile 3.0.0
  • pre-commit 2.11.0
  • pymongo >=3.10, <3.12
  • tzlocal >=2.0, <2.2
  • slackclient >=2.7, <2.10
  • psutil 5.8.0
  • vertica-python 1.0.1

And the singer connector packages for PipelineWise.

Note: All the necessary packages will be installed automatically; the list above is just for reference.

Installation

  1. Clone the pipelinewise and target-vertica repos, then change into the pipelinewise repo directory.

    $ git clone -b fastsync-vertica https://github.com/full360/pipelinewise.git
    $ git clone -b target-vertica https://github.com/full360/pipelinewise-target-vertica.git
  2. Edit the requirements.txt for the singer target-vertica connector and add the path to the pipelinewise-target-vertica repo so that pip installs it from the local checkout.

    Clear the requirements.txt for vertica.

    $ > pipelinewise/singer-connectors/target-vertica/requirements.txt

    Get the current directory path and add /pipelinewise-target-vertica at the end. For example:

    $ pwd
    /Users/jordanryan/code/f360/pipelinewise

    This gives the complete path to the target-vertica repo, /Users/jordanryan/code/f360/pipelinewise/pipelinewise-target-vertica in this example. Use this path in the command below.

    $ echo "<path/to/pipelinewise-target-vertica>" >> pipelinewise/singer-connectors/target-vertica/requirements.txt

    If everything is done correctly, the path to the pipelinewise-target-vertica repo should appear in ~/pipelinewise/singer-connectors/target-vertica/requirements.txt. The path can also be added to the same file manually.
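
    Equivalently, the clearing and appending can be combined into one command, run from the directory that contains both cloned repos (a minimal sketch, assuming the two repos sit side by side as in the clone step above):

    $ echo "$(pwd)/pipelinewise-target-vertica" > pipelinewise/singer-connectors/target-vertica/requirements.txt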

  3. Run the install script, which installs the PipelineWise CLI and every supported singer connector into a separate virtual environment.

    $ cd ./pipelinewise
    $ ./install.sh --connectors=all --acceptlicenses
    
    (...installation usually takes 5-10 minutes...)
    
    To start CLI:
    $ source /Users/jordanryan/code/f360/drupal-office/target-vertica/pipelinewise/.virtualenvs/pipelinewise/bin/activate
    $ export PIPELINEWISE_HOME=/Users/jordanryan/code/f360/drupal-office/target-vertica/pipelinewise
    $ pipelinewise status

Once the install script has finished, activate the virtual environment with the command line tools and set the PIPELINEWISE_HOME environment variable, as displayed at the end of the install script output:

$ source /Users/jordanryan/code/f360/drupal-office/target-vertica/pipelinewise/.virtualenvs/pipelinewise/bin/activate
$ export PIPELINEWISE_HOME=/Users/jordanryan/code/f360/drupal-office/target-vertica/pipelinewise
$ pipelinewise status

Tap ID    Tap Type    Target ID    Target Type    Enabled    Status    Last Sync    Last Sync Result
--------  ----------  -----------  -------------  ---------  --------  -----------  ------------------
0 pipeline(s)

If you see the above output reporting 0 pipelines in the system, the installation completed successfully.
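
The activation and PIPELINEWISE_HOME export only apply to the current shell session. Optionally, they can be appended to a shell profile so they are set automatically (a minimal sketch, assuming bash and the example paths above):

$ echo 'source /Users/jordanryan/code/f360/drupal-office/target-vertica/pipelinewise/.virtualenvs/pipelinewise/bin/activate' >> ~/.bashrc
$ echo 'export PIPELINEWISE_HOME=/Users/jordanryan/code/f360/drupal-office/target-vertica/pipelinewise' >> ~/.bashrc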

Creating Pipelines

  1. After the installation, sample YAML files can be created for each of the supported connectors and then adjusted. Create the sample YAML files with the following command:

    $ cd ..
    $ pipelinewise init --name pipelinewise_samples

    This will create a pipelinewise_samples directory with samples for each supported component:

    └── pipelinewise_samples
        ├── README.md
        ├── config.yml
        ├── tap_jira.yml.sample
        ├── tap_kafka.yml.sample
        ├── tap_mysql_mariadb.yml.sample
        ├── tap_postgres.yml.sample
        ├── tap_s3_csv.yml.sample
        ├── tap_salesforce.yml.sample
        ├── tap_snowflake.yml.sample
        ├── tap_zendesk.yml.sample
        ├── target_postgres.yml.sample
        ├── target_redshift.yml.sample
        ├── target_s3_csv.yml.sample
        ├── target_snowflake.yml.sample
        └── target_vertica.yml.sample

    To create a new pipeline, enable at least one tap and one target by removing the .sample postfix from one tap_*.yml.sample file and one target_*.yml.sample file (see the rename sketch at the end of this step).

    Tap Template for S3 CSV

    ---
    
    # ------------------------------------------------------------------------------
    # General Properties
    # ------------------------------------------------------------------------------
    id: "csv_on_s3"                        # Unique identifier of the tap
    name: "Sample CSV files on S3"          # Name of the tap
    type: "tap-s3-csv"                     # !! THIS SHOULD NOT CHANGE !!
    owner: "[email protected]"              # Data owner to contact
    #send_alert: False                     # Optional: Disable all configured alerts on this tap
    add_metadata_columns: False
    
    
    # ------------------------------------------------------------------------------
    # Source (Tap) - S3 connection details
    # ------------------------------------------------------------------------------
    db_conn:
      # Credentials based authentication
      aws_access_key_id: <AWS_ACCESS_KEY_ID>          # Plain string or vault encrypted. Required for non-profile based auth. If not provided, AWS_ACCESS_KEY_ID environment variable will be used.
      aws_secret_access_key: <AWS_SECRET_ACCESS_KEY>  # Plain string or vault encrypted. Required for non-profile based auth. If not provided, AWS_SECRET_ACCESS_KEY environment variable will be used.
      bucket: "facet-testing"                         # S3 Bucket name
      start_date: "2000-01-01"                      # Files before this date will be excluded
      # fastsync_parallelism: 8                   # Optional: size of multiprocessing pool used by FastSync
                                                    #           Min: 1
                                                    #           Default: number of CPU cores
    
    # ------------------------------------------------------------------------------
    # Destination (Target) - Target properties
    # Connection details should be in the relevant target YAML file
    # ------------------------------------------------------------------------------
    target: "vertica"                       # ID of the target connector where the data will be loaded
    batch_size_rows: 20000                    # Batch size for the stream to optimise load performance
    stream_buffer_size: 0                     # In-memory buffer size (MB) between taps and targets for asynchronous data pipes
    default_target_schema: "s3_feeds"         # Target schema where the data will be loaded
    default_target_schema_select_permission:  # Optional: Grant SELECT on the schema and tables that are created
      - grp_power
    primary_key_required: False             # Optional: in case you want to load tables without key
                                              #            properties, uncomment this. Please note
                                              #            that files without primary keys will not
                                              #            be de-duplicated and could cause
                                              #            duplicates. Always try selecting
                                              #            a reasonable key from the CSV file
    
    
    # ------------------------------------------------------------------------------
    # Source to target Schema mapping
    # ------------------------------------------------------------------------------
    schemas:
      - source_schema: "s3_feeds" # This is mandatory but can be anything in this tap type
        target_schema: "s3_feeds" # Target schema in the destination Data Warehouse
    
        # List of CSV files to destination tables
        tables:
    
          # Every file in S3 bucket that matches the search pattern will be loaded into this table
          - table_name: "port_dim"
            s3_csv_mapping:
              search_pattern: "^CSV/atlas/port_dim.txt$"  # Required.
              search_prefix: ""                                       # Optional
              key_properties: []                                      # Optional
              delimiter: "|"                                          # Optional. Default: ','

    Target Template for Vertica

    ---
    
    # ------------------------------------------------------------------------------
    # General Properties
    # ------------------------------------------------------------------------------
    id: "vertica"                       # Unique identifier of the target
    name: "Vertica Data Warehouse"      # Name of the target
    type: "target-vertica"              # !! THIS SHOULD NOT CHANGE !!
    add_metadata_columns: False
    
    
    # ------------------------------------------------------------------------------
    # Target - Data Warehouse connection details
    # ------------------------------------------------------------------------------
    db_conn:
      host: "0.0.0.0"                    # Vertica host
      port: 5433                         # Vertica port
      user: "dbadmin"                    # Vertica user
      password: " "                      # Plain string or vault encrypted
      dbname: "docker"                   # Vertica database name
      #ssl: "true"                       # Optional: Using SSL via vertica sslmode 'require' option.
                                           #           If the server does not accept SSL connections or the client
                                           #           certificate is not recognized, the connection will fail
    

    Once you have renamed the files you need, edit the YAML files with your favourite text editor. Follow the instructions in the files to set database credentials and connection details, select tables to replicate, define source-to-target schema mapping, or add load time transformations. Check the detailed example replication here.
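
    As referenced above, renaming the sample files is all that is needed to enable a pipeline. A minimal sketch using the S3 CSV tap and Vertica target samples from this guide:

    $ cd pipelinewise_samples
    $ mv tap_s3_csv.yml.sample tap_s3_csv.yml
    $ mv target_vertica.yml.sample target_vertica.yml
    $ cd ..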

  2. Once you've configured the YAML files, import them with the following command to activate the pipelines.

    $ pipelinewise import --dir pipelinewise_samples
    
    ... detailed messages about import and discovery...
    
         -------------------------------------------------------
         IMPORTING YAML CONFIGS FINISHED
         -------------------------------------------------------
             Total targets to import        : 1
             Total taps to import           : 1
             Taps imported successfully     : 1
             Taps failed to import          : []
             Runtime                        : 0:00:01.835720
         -------------------------------------------------------

    At this point PipelineWise connects to and analyses every source database, discovering tables, columns and data types, and generates the required JSON files for the singer taps and targets into ~/.pipelinewise. PipelineWise uses this directory internally to track the state files for Key Based Incremental and Log Based replication (aka bookmarks), and this is also where the log files are created. Normally you only need to go into ~/.pipelinewise when you want to access the log files.
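
    For example, after a run the logs for a given tap/target pair can be inspected under directories named after the tap and target IDs (a sketch; the exact path matches the log locations printed in the run output later in this guide):

    $ ls ~/.pipelinewise/vertica/csv_on_s3/log/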

  3. Once the config YAML files are imported, you can see the new pipelines with the status command:

    $ pipelinewise status
    time=2021-05-19 13:15:38 logger_name=pipelinewise log_level=INFO message=Profiling mode not enabled
    Tap ID     Tap Type    Target ID    Target Type     Enabled    Status    Last Sync    Last Sync Result
    ---------  ----------  -----------  --------------  ---------  --------  -----------  ------------------
    csv_on_s3  tap-s3-csv  vertica      target-vertica  True       ready                  unknown

    At this point, you have successfully created your first pipeline in PipelineWise and it is now ready to run.

Running Pipelines

To run a pipeline, use the run_tap command with the --tap and --target arguments to specify which pipeline to run by ID.

$ pipelinewise run_tap --tap csv_on_s3 --target vertica
time=2021-05-19 13:16:33 logger_name=pipelinewise log_level=INFO message=Profiling mode not enabled
time=2021-05-19 13:16:33 logger_name=pipelinewise.cli.pipelinewise log_level=INFO message=Running csv_on_s3 tap in vertica target
time=2021-05-19 13:16:33 logger_name=pipelinewise.cli.pipelinewise log_level=INFO message=Table(s) selected to sync by fastsync: ['port_dim']
time=2021-05-19 13:16:33 logger_name=pipelinewise.cli.commands log_level=INFO message=Writing output into /Users/jordanryan/.pipelinewise/vertica/csv_on_s3/log/vertica-csv_on_s3-20210519_074633.fastsync.log
time=2021-05-19 13:16:44 logger_name=pipelinewise.cli.pipelinewise log_level=INFO message=Table(s) selected to sync by singer: ['port_dim']
time=2021-05-19 13:16:44 logger_name=pipelinewise.cli.commands log_level=INFO message=Writing output into /Users/jordanryan/.pipelinewise/vertica/csv_on_s3/log/vertica-csv_on_s3-20210519_074633.singer.log
time=2021-05-19 13:16:49 logger_name=pipelinewise.cli.pipelinewise log_level=INFO message=
-------------------------------------------------------
TAP RUN SUMMARY
-------------------------------------------------------
    Status  : SUCCESS
    Runtime : 0:00:11.735258
-------------------------------------------------------

You can check the status with the following command:

$ pipelinewise status
time=2021-05-19 13:20:53 logger_name=pipelinewise log_level=INFO message=Profiling mode not enabled
Tap ID     Tap Type    Target ID    Target Type     Enabled    Status    Last Sync            Last Sync Result
---------  ----------  -----------  --------------  ---------  --------  -------------------  ------------------
csv_on_s3  tap-s3-csv  vertica      target-vertica  True       ready     2021-05-19T07:50:39  success
1 pipeline(s)

Issues

  1. Critical

    • Header detection bug: the first row is taken as the header for CSV files that have no headers (issue on GitHub)
    • add_metadata_columns bug: metadata columns are created even if add_metadata_columns is set to False (issue on GitHub)

    Both of the above issues need to be fixed separately in fastsync tap-s3-csv and in the singer pipelinewise-tap-s3-csv.

  2. Moderate

    • Fastsync and ordinary sync result in different column types (issue on GitHub)

(cherry picked from commit cad4bca)
- Replace "postgres" with "vertica".
- Add vertica connector package "vertica-python".
- Change the package version to initial.

(cherry picked from commit 38421f0)
- Replace "postgres" with "vertica".
- Change SQL queries.
- Modify dbsync methods.
- A few other changes.

(cherry picked from commit 3deebcd)
- Separate all exceptions into one file.
- Organise all utility functions of streams and dbsync into one file.

(cherry picked from commit fb55004)
(cherry picked from commit 886b6e9)
- Replace "postgres" with "vertica".
- Change length in one of the test cases.
- Change the invalid input exception to the vertica copy error.

(cherry picked from commit 28574c6)
- Change the port for vertica.
- Replace "postgres" with "vertica".
- Change data types.

(cherry picked from commit 8a5d727)
(cherry picked from commit 46061b5)
(cherry picked from commit 1bb3a8e)
(cherry picked from commit 54a1f91)
(cherry picked from commit 64f1e30)
(cherry picked from commit 40651db)
(cherry picked from commit fdd301f)
- Remove leftover comments
- Remove completed TODOs
- Remove add_columns
- Remove unnecessary loggers
- Fix ssl config option for connection
- Fix a few pylint errors
- Set header to false for the copy statement

(cherry picked from commit ace52b5)
- Remove "add_columns" function
- Remove unnecessary import
- Fix a few pylint errors.
- Fix long varchar data type.
- Fix data types with length issue.
- Correct "integer" data type to "int".

(cherry picked from commit 16ccb74)
*Note: comment out a few badges as the PyPI links for vertica do not exist.*

(cherry picked from commit b63b99b)
(cherry picked from commit 18ab42a)
- Change author
- Change URL

(cherry picked from commit cc4d516)
(cherry picked from commit 8490926)
(cherry picked from commit 12cbbf3)
(cherry picked from commit ba813e8)
(cherry picked from commit 16aef8a)
(cherry picked from commit c91b9d0)
@vianel merged commit 6f00008 into master Jun 18, 2021