
Python Classes

Keenan Berry edited this page Jun 25, 2020 · 5 revisions

This section describes the various Python classes and their methods used throughout this project. The goal is that future developers (or maintainers) can use this wiki to better understand the code and, consequently, add features and functionality as needed. Below you will find a detailed description of three classes and their methods.

Config

class Config

This class uses Airflow's BaseHook class to set connection details for the project's database, S3 bucket, and APIs. The Config class currently only accepts API connection details for the data.mo.gov API; however, additional connections can be configured as needed.

All connections must first be defined within Airflow; they can then be retrieved by running the following:

from airflow.hooks.base_hook import BaseHook

DATABASE_CONN = BaseHook.get_connection('postgres_default')

The _CONN connection objects are used to set the following class variables:

  • DATABASE_USERNAME string: PostgreSQL DB username
  • DATABASE_PASSWORD string: PostgreSQL DB password
  • DATABASE_PORT numeric: PostgreSQL DB port
  • DATABASE_NAME string: PostgreSQL DB name
  • S3_BUCKET string: AWS S3 bucket name
  • AWS_ACCESS_KEY_ID string: AWS access key
  • AWS_SECRET_ACCESS_KEY string: AWS secret access key
  • AWS_REGION_NAME string: AWS region name
  • API_TOKEN string: API access token
  • API_HOST string: API base URL
  • API_USER_EMAIL string: API user email
  • API_USER_PWD string: API user password
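How these class variables might be populated can be sketched as follows. The connection id postgres_default comes from the example above; the get_connection stand-in and the sample credential values are hypothetical placeholders for Airflow's BaseHook.get_connection:

```python
from types import SimpleNamespace

# Stand-in for airflow.hooks.base_hook.BaseHook.get_connection(); in the
# real project these details come from connections defined in the Airflow UI.
def get_connection(conn_id):
    fake_connections = {
        'postgres_default': SimpleNamespace(
            login='etl_user',        # hypothetical sample values
            password='secret',
            port=5432,
            schema='dashboard_db'),
    }
    return fake_connections[conn_id]

class Config:
    '''Maps Airflow connection fields onto the class variables listed above.'''
    _db = get_connection('postgres_default')
    DATABASE_USERNAME = _db.login
    DATABASE_PASSWORD = _db.password
    DATABASE_PORT = _db.port
    DATABASE_NAME = _db.schema
```

The S3 and API variables would be filled in the same way from their own Airflow connection ids.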

Scraper

class Scraper

This is a simple class that was built to extract and transform data. Originally designed to handle the fetching of data from a URL, this class also has methods that transform "static" files in an AWS S3 bucket.

url
  • The URL string from which the scraper will begin fetching data. This is a required argument; however, if the scraper is only used to transform static files in an S3 bucket (method: s3_transform_to_s3), the url parameter is not used and can be set to None.
Config
  • An instance of the Config class. This is a required argument of the Scraper class and is used to define the following class variables:
    • bucket_name string: name of the S3 bucket (data source/sink) to connect to
    • aws_access_key_id string: AWS access key
    • aws_secret_access_key string: AWS secret access key
    • api_token string: API token for API configured in Config class
    • api_user_email string: email associated with API
    • api_user_pwd string: password associated with API
connect_s3_sink(self)
  • Method to create the AWS S3 connection object s3_conn. Uses the various AWS class variables to make the connection. This connection must be made before using any other class method.
url_to_s3(self, filename, filters=None, nullstr='')
  • Method used to extract data from a URL and upload data as .csv file to the S3 bucket.
  • Arguments:
    • filename string: name given to data file uploaded to S3
    • filters dictionary: {column <string>: accepted values <list>} this argument is used to filter a specific column to only include a set of values
    • nullstr string: string assigned to null values upon writing to .csv file; only used when filtering
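The filters contract can be illustrated with a small sketch. The real method's internals are not shown here (it may well use pandas); apply_filters and the sample rows are hypothetical:

```python
def apply_filters(rows, filters):
    '''Keep only rows whose value in each filtered column is in the
    accepted list, mirroring the {column: accepted_values} shape of
    the filters argument (illustrative sketch only).'''
    if filters is None:
        return rows
    return [row for row in rows
            if all(row.get(col) in accepted
                   for col, accepted in filters.items())]

rows = [
    {'county': 'St. Louis', 'cases': '10'},
    {'county': 'Boone', 'cases': '3'},
]
# Keep only rows for a single county.
kept = apply_filters(rows, {'county': ['St. Louis']})
```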
api_to_s3(self, filename, table_name, limit=2000)
  • Method used to fetch data from an API and upload as .csv file to S3 bucket. Designed for specific use with data.mo.gov API.
  • Arguments:
    • filename string: name given to data file uploaded to S3
    • table_name string: name of data set fetched from API
    • limit numeric: number of records to fetch from API
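data.mo.gov is a Socrata open-data portal, so api_to_s3 presumably requests pages of records from a SODA endpoint. A sketch of how such a request URL could be built; build_soda_url and the dataset id abcd-1234 are hypothetical, while $limit and $offset are standard SODA paging parameters:

```python
def build_soda_url(api_host, table_name, limit=2000, offset=0):
    '''Build a Socrata (SODA) request URL for one page of records.
    api_host corresponds to Config.API_HOST and table_name to the
    table_name argument of api_to_s3 (paging logic is assumed).'''
    return (f"https://{api_host}/resource/{table_name}.json"
            f"?$limit={limit}&$offset={offset}")

# Third page of a 2000-record-per-page fetch.
url = build_soda_url('data.mo.gov', 'abcd-1234', limit=2000, offset=4000)
```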
url_transform_to_s3(self, filename, transformer, sep='|')
  • Method used to fetch data from a URL and transform that data before uploading the data to S3 bucket.
  • Arguments:
    • filename string: name given to data file uploaded to S3
    • transformer function: transformation function applied to data at URL (functions defined in and called from /scripts/url_transformers.py)
    • sep string: CSV file delimiter
s3_transform_to_s3(self, data, output_filename, resource_path, transformer, sep='|')
  • Method used to transform static data files within the S3 bucket. Method uses the data argument to identify which files need transformation before uploading the transformed files back to the S3 bucket.
  • Arguments:
    • data string: file or folder name in which the data for transformation can be found
    • output_filename string: name given to transformed data file uploaded to S3
    • resource_path string: path to project's /resources directory; used to load JSON files as dictionary objects that can be used to guide transformations
    • transformer function: transformation function applied to data in S3 (functions defined in and called from /scripts/s3_transformers.py)
    • sep string: CSV file delimiter
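A transformer is simply a function applied to the fetched data. As an illustration, here is a hypothetical reshaping transformer; the actual function signatures live in /scripts/url_transformers.py and /scripts/s3_transformers.py:

```python
def wide_to_long(rows, id_col, value_cols):
    '''Example transformer: unpivot value columns into (variable, value)
    pairs, a common reshaping step before loading to a database.
    Illustrative sketch only; not taken from the project's scripts.'''
    out = []
    for row in rows:
        for col in value_cols:
            out.append({id_col: row[id_col],
                        'variable': col,
                        'value': row[col]})
    return out

rows = [{'county': 'Boone', 'cases': 3, 'deaths': 1}]
long_rows = wide_to_long(rows, 'county', ['cases', 'deaths'])
```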

Database

class Database

A simple class used to upload .csv files from the project's S3 bucket to the PostgreSQL database. The main method of this class, csv_to_table, uses a SQL COPY command to quickly load data into the appropriate database tables.

Config
  • An instance of the Config class. This is a required argument of the Database class and is used to define the following class variables:
    • host string: PostgreSQL database connection string
    • username string: PostgreSQL database username
    • password string: PostgreSQL database password
    • port numeric: PostgreSQL database port
    • dbname string: PostgreSQL database name
    • bucket_name string: name of the S3 bucket (data source) to connect to
    • aws_access_key_id string: AWS access key
    • aws_secret_access_key string: AWS secret access key
connect(self)
  • Method to create the database connection object conn. Uses the various database class variables to make the connection. This connection must be made before loading data to the database.
connect_s3_source(self)
  • Method to create the AWS S3 connection object s3_conn. Uses the various AWS class variables to make the connection. This connection must be made before loading data to the database.
csv_to_table(self, filename, table_name, sep=',', nullstr='NaN')
  • Main method used to load data from flat files within the S3 bucket to database tables. Uses a PostgreSQL copy command to efficiently load data.
  • Arguments:
    • filename string: name of data file within S3 to be uploaded to the database
    • table_name string: name of the database table to copy data to
    • sep string: CSV file delimiter
    • nullstr string: string assigned to null values in CSV file
close(self)
  • Method that closes the PostgreSQL database connection. The AWS connection closes automatically after a set time.
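The COPY command behind csv_to_table can be sketched as a statement builder. The exact statement used by the project is an assumption; copy_sql and the table name my_table are hypothetical, and the real method would stream the S3 file into something like psycopg2's cursor.copy_expert():

```python
def copy_sql(table_name, sep=',', nullstr='NaN'):
    '''Build a PostgreSQL COPY ... FROM STDIN statement matching the
    sep and nullstr arguments of csv_to_table (sketch; whether the
    project's files carry a header row is an assumption).'''
    return (f"COPY {table_name} FROM STDIN "
            f"WITH (FORMAT csv, HEADER true, "
            f"DELIMITER '{sep}', NULL '{nullstr}')")

sql = copy_sql('my_table', sep='|', nullstr='NaN')
```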

Airflow Callables

This section shows the functions (aka "callables") that are used by Airflow's Python Operators throughout this workflow. The callables utilize the classes and methods described above. Tasks include downloading data from the web and saving it as a flat file in S3, copying flat files to database tables, and transforming static files to make the data more compatible with the database. There are five main callables; however, the number of these functions is expected to grow as additional data sources (with new input/output logic) are incorporated into the 211Dashboard pipeline.

def scrape_file(**kwargs):
    '''Calls the "url_to_s3" method of the Scraper class.'''
    s = Scraper(kwargs['url'], Config)
    s.connect_s3_sink()
    s.url_to_s3(filename=kwargs['filename'],
                filters=kwargs['filters'],
                nullstr=kwargs['nullstr'])

def scrape_api(**kwargs):
    '''Calls the "api_to_s3" method of the Scraper class.'''
    s = Scraper(kwargs['url'], Config)
    s.connect_s3_sink()
    s.api_to_s3(filename=kwargs['filename'],
                table_name=kwargs['table_name'],
                limit=kwargs['limit'])

def scrape_transform(**kwargs):
    '''Calls the "url_transform_to_s3" method of the Scraper class.'''
    s = Scraper(kwargs['url'], Config)
    s.connect_s3_sink()
    s.url_transform_to_s3(filename=kwargs['filename'],
                          transformer=kwargs['transformer'],
                          sep=kwargs['sep'])

def transform_static_s3(**kwargs):
    '''Calls the "s3_transform_to_s3" method of the Scraper class.'''
    s = Scraper(None, Config)  # url is not used here, so it is set to None
    s.connect_s3_sink()
    s.s3_transform_to_s3(data=kwargs['data'],
                         output_filename=kwargs['filename'],
                         resource_path=kwargs['resource_path'],
                         transformer=kwargs['transformer'],
                         sep=kwargs['sep'])

def load_file(**kwargs):
    '''Calls the "csv_to_table" method of the Database class.'''
    # NOTE: the target table must be truncated before this runs.
    db = Database(Config)
    db.connect()
    db.connect_s3_source()
    db.csv_to_table(filename=kwargs['filename'],
                    table_name=kwargs['table_name'],
                    sep=kwargs['sep'],
                    nullstr=kwargs['nullstr'])
    db.close()
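In a DAG, each callable is wired to a PythonOperator whose op_kwargs dictionary arrives as the callable's **kwargs. A minimal sketch of that handoff, using a stand-in callable (scrape_file_demo, the URL, and the file names are all hypothetical):

```python
def scrape_file_demo(**kwargs):
    '''Stand-in for scrape_file showing how op_kwargs arrive as **kwargs.'''
    return (kwargs['filename'], kwargs['filters'], kwargs['nullstr'])

# In a DAG this dict would be passed as
# PythonOperator(task_id=..., python_callable=scrape_file, op_kwargs=...).
op_kwargs = {'url': 'https://example.com/data.csv',
             'filename': 'covid_cases.csv',
             'filters': None,
             'nullstr': ''}
result = scrape_file_demo(**op_kwargs)
```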
