
INITIALIZE VERTICA PROJECT: target-vertica as fork of pipelinewise-postgres-vertica #7

Merged
29 commits merged on Jun 18, 2021

Conversation

@jordan8037310 (Collaborator) commented May 22, 2021

Proposed changes

This PR is for the initial setup of the pipelinewise-target-vertica project.

  • This PR includes all relevant tests.

Types of changes

  • New feature (non-breaking change which adds functionality)

Checklist

  • Description above provides context of the change
  • I have added tests that prove my fix is effective or that my feature works
  • Unit tests for changes (not needed for documentation changes)
  • CI checks pass with my changes
  • Bumping version in setup.py is an individual PR and not mixed with feature or bugfix PRs
  • Commit message/PR title starts with [AP-NNNN] (if applicable. AP-NNNN = JIRA ID)
  • Branch name starts with AP-NNN (if applicable. AP-NNN = JIRA ID)
  • Commits follow "How to write a good git commit message"
  • Relevant documentation is updated including usage instructions

GUIDE TO TEST

Pipelinewise Guide

This guide covers the requirements and installation steps needed to get started with PipelineWise.

Table of contents

  1. Requirements
  2. Installation
  3. Creating Pipelines
  4. Running Pipelines
  5. Run Test in Docker Development Environment
  6. Issues

Requirements

  • Python 3.7 or 3.8 (recommended)

Python packages:

  • argparse 1.4.0
  • tabulate 0.8.2
  • PyYAML 5.3.1
  • ansible 2.7.16
  • Jinja2 2.11.3
  • joblib 1.0.0
  • PyMySQL 0.7.11
  • psycopg2-binary 2.8.6
  • snowflake-connector-python[pandas] 2.3.7
  • pipelinewise-singer-python 1.*
  • singer-encodings 0.0.*
  • python-dateutil 2.8.1 or below
  • messytables 0.15.*
  • python-pidfile 3.0.0
  • pre-commit 2.11.0
  • pymongo >=3.10, <3.12
  • tzlocal >=2.0, <2.2
  • slackclient >=2.7, <2.10
  • psutil 5.8.0
  • vertica-python 1.0.1

And the singer connector packages for PipelineWise.

Note: All the necessary packages will be installed automatically; the list above is just for reference.

Installation

  1. Clone the pipelinewise and target-vertica repos, then change into the pipelinewise repo directory.

    $ git clone -b fastsync-vertica https://github.com/full360/pipelinewise.git
    $ git clone -b target-vertica https://github.com/full360/pipelinewise-target-vertica.git
  2. Edit the requirements.txt for the singer target-vertica connector and add the path to the pipelinewise-target-vertica repo so that pip installs it from the local checkout.

    Clear the requirements.txt for vertica.

    $ > pipelinewise/singer-connectors/target-vertica/requirements.txt

    Get the current directory path and add /pipelinewise-target-vertica at the end. For example:

    $ pwd
    /Users/jordanryan/code/f360/pipelinewise

    This gives the complete path to the target-vertica repo, /Users/jordanryan/code/f360/pipelinewise/pipelinewise-target-vertica in this example. Use this path in the command below.

    $ echo "<path/to/pipelinewise-target-vertica>" >> pipelinewise/singer-connectors/target-vertica/requirements.txt

    If everything is done correctly, the path to the pipelinewise-target-vertica repo should appear in ~/pipelinewise/singer-connectors/target-vertica/requirements.txt. The path can also be added to the same file manually.
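
    Equivalently, the clearing and appending can be combined into one command, run from the directory that contains both cloned repos (a minimal sketch, assuming the two repos sit side by side as in the clone step above):

    $ echo "$(pwd)/pipelinewise-target-vertica" > pipelinewise/singer-connectors/target-vertica/requirements.txt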

  3. Run the install script, which installs the PipelineWise CLI and every supported singer connector into a separate virtual environment.

    $ cd ./pipelinewise
    $ ./install.sh --connectors=all --acceptlicenses
    
    (...installation usually takes 5-10 minutes...)
    
    To start CLI:
    $ source /Users/jordanryan/code/f360/drupal-office/target-vertica/pipelinewise/.virtualenvs/pipelinewise/bin/activate
    $ export PIPELINEWISE_HOME=/Users/jordanryan/code/f360/drupal-office/target-vertica/pipelinewise
    $ pipelinewise status

Once the install script has finished, activate the virtual environment with the command line tools and set the PIPELINEWISE_HOME environment variable, as displayed at the end of the install script output:

$ source /Users/jordanryan/code/f360/drupal-office/target-vertica/pipelinewise/.virtualenvs/pipelinewise/bin/activate
$ export PIPELINEWISE_HOME=/Users/jordanryan/code/f360/drupal-office/target-vertica/pipelinewise
$ pipelinewise status

Tap ID    Tap Type    Target ID    Target Type    Enabled    Status    Last Sync    Last Sync Result
--------  ----------  -----------  -------------  ---------  --------  -----------  ------------------
0 pipeline(s)

If you see the above output reporting 0 pipelines in the system, the installation completed successfully.
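
The activation and PIPELINEWISE_HOME export only apply to the current shell session. Optionally, they can be appended to a shell profile so they are set automatically (a minimal sketch, assuming bash and the example paths above):

$ echo 'source /Users/jordanryan/code/f360/drupal-office/target-vertica/pipelinewise/.virtualenvs/pipelinewise/bin/activate' >> ~/.bashrc
$ echo 'export PIPELINEWISE_HOME=/Users/jordanryan/code/f360/drupal-office/target-vertica/pipelinewise' >> ~/.bashrc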

Creating Pipelines

  1. After the installation, sample YAML files can be created for each of the supported connectors and then adjusted. Create the sample YAML files with the following command:

    $ cd ..
    $ pipelinewise init --name pipelinewise_samples

    This will create a pipelinewise_samples directory with samples for each supported component:

    └── pipelinewise_samples
        ├── README.md
        ├── config.yml
        ├── tap_jira.yml.sample
        ├── tap_kafka.yml.sample
        ├── tap_mysql_mariadb.yml.sample
        ├── tap_postgres.yml.sample
        ├── tap_s3_csv.yml.sample
        ├── tap_salesforce.yml.sample
        ├── tap_snowflake.yml.sample
        ├── tap_zendesk.yml.sample
        ├── target_postgres.yml.sample
        ├── target_redshift.yml.sample
        ├── target_s3_csv.yml.sample
        ├── target_snowflake.yml.sample
        └── target_vertica.yml.sample

    To create a new pipeline, enable at least one tap and one target by removing the .sample postfix from one tap_*.yml.sample file and one target_*.yml.sample file (see the rename sketch at the end of this step).

    Tap Template for S3 CSV

    ---
    
    # ------------------------------------------------------------------------------
    # General Properties
    # ------------------------------------------------------------------------------
    id: "csv_on_s3"                        # Unique identifier of the tap
    name: "Sample CSV files on S3"          # Name of the tap
    type: "tap-s3-csv"                     # !! THIS SHOULD NOT CHANGE !!
    owner: "[email protected]"              # Data owner to contact
    #send_alert: False                     # Optional: Disable all configured alerts on this tap
    add_metadata_columns: False
    
    
    # ------------------------------------------------------------------------------
    # Source (Tap) - S3 connection details
    # ------------------------------------------------------------------------------
    db_conn:
      # Credentials based authentication
      aws_access_key_id: <AWS_ACCESS_KEY_ID>          # Plain string or vault encrypted. Required for non-profile based auth. If not provided, AWS_ACCESS_KEY_ID environment variable will be used.
      aws_secret_access_key: <AWS_SECRET_ACCESS_KEY>  # Plain string or vault encrypted. Required for non-profile based auth. If not provided, AWS_SECRET_ACCESS_KEY environment variable will be used.
      bucket: "facet-testing"                         # S3 Bucket name
      start_date: "2000-01-01"                      # Files before this date will be excluded
      # fastsync_parallelism: 8                   # Optional: size of multiprocessing pool used by FastSync
                                                    #           Min: 1
                                                    #           Default: number of CPU cores
    
    # ------------------------------------------------------------------------------
    # Destination (Target) - Target properties
    # Connection details should be in the relevant target YAML file
    # ------------------------------------------------------------------------------
    target: "vertica"                       # ID of the target connector where the data will be loaded
    batch_size_rows: 20000                    # Batch size for the stream to optimise load performance
    stream_buffer_size: 0                     # In-memory buffer size (MB) between taps and targets for asynchronous data pipes
    default_target_schema: "s3_feeds"         # Target schema where the data will be loaded
    default_target_schema_select_permission:  # Optional: Grant SELECT on the schema and tables that are created
      - grp_power
    primary_key_required: False             # Optional: in case you want to load tables without key
                                              #            properties, uncomment this. Please note
                                              #            that files without primary keys will not
                                              #            be de-duplicated and could cause
                                              #            duplicates. Always try selecting
                                              #            a reasonable key from the CSV file
    
    
    # ------------------------------------------------------------------------------
    # Source to target Schema mapping
    # ------------------------------------------------------------------------------
    schemas:
      - source_schema: "s3_feeds" # This is mandatory but can be anything in this tap type
        target_schema: "s3_feeds" # Target schema in the destination Data Warehouse
    
        # List of CSV files to destination tables
        tables:
    
          # Every file in S3 bucket that matches the search pattern will be loaded into this table
          - table_name: "port_dim"
            s3_csv_mapping:
              search_pattern: "^CSV/atlas/port_dim.txt$"  # Required.
              search_prefix: ""                                       # Optional
              key_properties: []                                      # Optional
              delimiter: "|"                                          # Optional. Default: ','

    Target Template for Vertica

    ---
    
    # ------------------------------------------------------------------------------
    # General Properties
    # ------------------------------------------------------------------------------
    id: "vertica"                       # Unique identifier of the target
    name: "Vertica Data Warehouse"      # Name of the target
    type: "target-vertica"              # !! THIS SHOULD NOT CHANGE !!
    add_metadata_columns: False
    
    
    # ------------------------------------------------------------------------------
    # Target - Data Warehouse connection details
    # ------------------------------------------------------------------------------
    db_conn:
      host: "0.0.0.0"                    # Vertica host
      port: 5433                         # Vertica port
      user: "dbadmin"                    # Vertica user
      password: " "                      # Plain string or vault encrypted
      dbname: "docker"                   # Vertica database name
      #ssl: "true"                       # Optional: Using SSL via vertica sslmode 'require' option.
                                           #           If the server does not accept SSL connections or the client
                                           #           certificate is not recognized, the connection will fail
    

    Once you have renamed the files you need, edit the YAML files with your favourite text editor. Follow the instructions in the files to set database credentials and connection details, select tables to replicate, define source-to-target schema mapping, or add load time transformations. Check the detailed example replication here.
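
    As referenced above, renaming the sample files is all that is needed to enable a pipeline. A minimal sketch using the S3 CSV tap and Vertica target samples from this guide:

    $ cd pipelinewise_samples
    $ mv tap_s3_csv.yml.sample tap_s3_csv.yml
    $ mv target_vertica.yml.sample target_vertica.yml
    $ cd ..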

  2. Once you've configured the YAML files, import them with the following command to activate the pipelines.

    $ pipelinewise import --dir pipelinewise_samples
    
    ... detailed messages about import and discovery...
    
         -------------------------------------------------------
         IMPORTING YAML CONFIGS FINISHED
         -------------------------------------------------------
             Total targets to import        : 1
             Total taps to import           : 1
             Taps imported successfully     : 1
             Taps failed to import          : []
             Runtime                        : 0:00:01.835720
         -------------------------------------------------------

    At this point PipelineWise connects to and analyses every source database, discovering tables, columns and data types, and generates the required JSON files for the singer taps and targets into ~/.pipelinewise. PipelineWise uses this directory internally to track the state files for Key Based Incremental and Log Based replication (aka bookmarks), and this is also where the log files are created. Normally you only need to go into ~/.pipelinewise when you want to access the log files.
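
    For example, after a run the logs for a given tap/target pair can be inspected under directories named after the tap and target IDs (a sketch; the exact path matches the log locations printed in the run output later in this guide):

    $ ls ~/.pipelinewise/vertica/csv_on_s3/log/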

  3. Once the config YAML files are imported, you can see the new pipelines with the status command:

    $ pipelinewise status
    time=2021-05-19 13:15:38 logger_name=pipelinewise log_level=INFO message=Profiling mode not enabled
    Tap ID     Tap Type    Target ID    Target Type     Enabled    Status    Last Sync    Last Sync Result
    ---------  ----------  -----------  --------------  ---------  --------  -----------  ------------------
    csv_on_s3  tap-s3-csv  vertica      target-vertica  True       ready                  unknown

    At this point, you have successfully created your first pipeline in PipelineWise and it is now ready to run.

Running Pipelines

To run a pipeline, use the run_tap command with the --tap and --target arguments to specify which pipeline to run by ID.

$ pipelinewise run_tap --tap csv_on_s3 --target vertica
time=2021-05-19 13:16:33 logger_name=pipelinewise log_level=INFO message=Profiling mode not enabled
time=2021-05-19 13:16:33 logger_name=pipelinewise.cli.pipelinewise log_level=INFO message=Running csv_on_s3 tap in vertica target
time=2021-05-19 13:16:33 logger_name=pipelinewise.cli.pipelinewise log_level=INFO message=Table(s) selected to sync by fastsync: ['port_dim']
time=2021-05-19 13:16:33 logger_name=pipelinewise.cli.commands log_level=INFO message=Writing output into /Users/jordanryan/.pipelinewise/vertica/csv_on_s3/log/vertica-csv_on_s3-20210519_074633.fastsync.log
time=2021-05-19 13:16:44 logger_name=pipelinewise.cli.pipelinewise log_level=INFO message=Table(s) selected to sync by singer: ['port_dim']
time=2021-05-19 13:16:44 logger_name=pipelinewise.cli.commands log_level=INFO message=Writing output into /Users/jordanryan/.pipelinewise/vertica/csv_on_s3/log/vertica-csv_on_s3-20210519_074633.singer.log
time=2021-05-19 13:16:49 logger_name=pipelinewise.cli.pipelinewise log_level=INFO message=
-------------------------------------------------------
TAP RUN SUMMARY
-------------------------------------------------------
    Status  : SUCCESS
    Runtime : 0:00:11.735258
-------------------------------------------------------

You can check the status with the following command:

$ pipelinewise status
time=2021-05-19 13:20:53 logger_name=pipelinewise log_level=INFO message=Profiling mode not enabled
Tap ID     Tap Type    Target ID    Target Type     Enabled    Status    Last Sync            Last Sync Result
---------  ----------  -----------  --------------  ---------  --------  -------------------  ------------------
csv_on_s3  tap-s3-csv  vertica      target-vertica  True       ready     2021-05-19T07:50:39  success
1 pipeline(s)

Issues

  1. Critical

    • Header detection bug: the first row is taken as the header for CSV files that have no headers (issue on GitHub)
    • add_metadata_columns bug: metadata columns are created even if add_metadata_columns is set to False (issue on GitHub)

    Both of the above issues need to be fixed separately in fastsync tap-s3-csv and in the singer pipelinewise-tap-s3-csv.

  2. Moderate

    • Fastsync and ordinary sync result in different column types (issue on GitHub)

(cherry picked from commit cad4bca)
- Replace "postgres" with "vertica".
- Add vertica connector package "vertica-python".
- Change the package version to initial.

(cherry picked from commit 38421f0)
- Replace "postgres" with "vertica".
- Change SQL queries.
- Modify dbsync methods.
- A few other changes.

(cherry picked from commit 3deebcd)
- Separate all exceptions into one file.
- Organise all utility functions of streams and dbsync into one file.

(cherry picked from commit fb55004)
(cherry picked from commit 886b6e9)
- Replace "postgres" with "vertica".
- Change length in one of the test cases.
- Change the invalid input exception to the vertica copy error.

(cherry picked from commit 28574c6)
- Change the port for vertica.
- Replace "postgres" with "vertica".
- Change data types.

(cherry picked from commit 8a5d727)
(cherry picked from commit 46061b5)
(cherry picked from commit 1bb3a8e)
(cherry picked from commit 54a1f91)
(cherry picked from commit 64f1e30)
(cherry picked from commit 40651db)
(cherry picked from commit fdd301f)
- Remove leftover comments
- Remove completed TODOs
- Remove add_columns
- Remove unnecessary loggers
- Fix ssl config option for connection
- Fix a few pylint errors
- Set header to false for the copy statement

(cherry picked from commit ace52b5)
- Remove "add_columns" function
- Remove unnecessary import
- Fix a few pylint errors.
- Fix long varchar data type.
- Fix data types with length issue.
- Correct "integer" data type to "int".

(cherry picked from commit 16ccb74)
*Note: comment out a few badges as the PyPI links for vertica do not exist.*

(cherry picked from commit b63b99b)
(cherry picked from commit 18ab42a)
- Change author
- Change URL

(cherry picked from commit cc4d516)
(cherry picked from commit 8490926)
(cherry picked from commit 12cbbf3)
(cherry picked from commit ba813e8)
(cherry picked from commit 16aef8a)
(cherry picked from commit c91b9d0)
@vianel merged commit 6f00008 into master Jun 18, 2021