This repository contains the code necessary to build the data pipeline for PySQLshop, a fictional store. The code is designed for educational purposes and demonstrates how to set up and manage a data pipeline using Airflow and DBT (Data Build Tool) for data transformations with BigQuery as the target.
In this project, we use Docker to containerize Airflow, DBT, and related services. The DBT process uses the `dbt-bigquery` adapter to connect to Google BigQuery for data transformation tasks.
Before running this project, make sure you have the following prerequisites:
- Docker installed on your machine to build and run the containers.
- Docker Compose installed to manage multi-container Docker applications.
- A Google Cloud service account with access to BigQuery, and the corresponding JSON key file (a command-line sketch follows below):
  - Create a service account in Google Cloud with access to BigQuery.
  - Download the service account key JSON file; you will place it in the `keys` folder of this repository as `gcp-key.json` (see the setup steps below).
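If you prefer the command line, a rough sketch of the service account setup with the `gcloud` CLI is shown below. The project ID, service account name, and role are placeholders (assumptions, not values from this repository); a narrower role than `roles/bigquery.admin` may be more appropriate for your setup.

```bash
# Hedged sketch — project ID, service account name, and role are placeholders.
PROJECT_ID="your-gcp-project"   # replace with your GCP project ID
SA_NAME="dbt-webinar"           # hypothetical service account name

# Create the service account
gcloud iam service-accounts create "$SA_NAME" \
  --project "$PROJECT_ID" \
  --display-name "DBT webinar service account"

# Grant it access to BigQuery (pick a narrower role if you prefer)
gcloud projects add-iam-policy-binding "$PROJECT_ID" \
  --member "serviceAccount:${SA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com" \
  --role "roles/bigquery.admin"

# Download the JSON key file (it will be copied into the keys/ folder in a later step)
gcloud iam service-accounts keys create gcp-key.json \
  --iam-account "${SA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com"
```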
If you haven't already cloned the repository, you can do so by running the following commands:
git clone git@github.com:GADES-DATAENG/webinar.git
cd webinar
Before starting the services, you need to create a `.env` file with the required variables. Use the provided `.env.template` as a starting point:
cp .env.template .env
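The exact variable names are defined in `.env.template` itself, so rather than guessing them here, a quick way to see what needs filling in after copying the template is:

```bash
# Show the variables the template expects (skipping comment lines)
grep -v '^#' .env.template

# Then edit your copy and fill in the values
${EDITOR:-nano} .env
```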
After downloading your GCP service account JSON credentials file, place it in the `keys` folder with the name `gcp-key.json`.
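For example, assuming the key was saved to your `Downloads` folder (the source filename below is just an example):

```bash
# Create the keys folder if needed and copy the downloaded key into place.
# The source path/filename is hypothetical — use wherever your browser saved the key.
mkdir -p keys
cp ~/Downloads/my-service-account-key.json keys/gcp-key.json
```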
Before starting the services, you need to build the DBT Docker image. Run the following command inside the repository folder:
docker build -t dbt-core .
This will build the `dbt-core` image based on the `Dockerfile` in the repository.
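To confirm the image was built, you can list it with:

```bash
# The dbt-core image should show up in your local image list
docker images dbt-core
```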
Once the image is built, you can start the services (Airflow, DBT, and other dependencies) using Docker Compose. Run the following command:
docker-compose up -d
This command will start all the containers defined in the `docker-compose.yml` file. It will set up Airflow, DBT, and any necessary services, including the BigQuery integration.
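Once the stack is up, a couple of quick checks can confirm everything started; the service name used in the logs command below is an assumption, so check `docker-compose.yml` for the real names:

```bash
# List the containers started by the compose file and their status
docker-compose ps

# Tail the logs of a specific service (service name is an example)
docker-compose logs -f airflow-webserver

# The Airflow webserver exposes a health endpoint once it is ready
curl http://localhost:8080/health
```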
- Airflow Web UI: You can access the Airflow web interface at http://localhost:8080
  - Default login credentials are:
    - Username: `airflow`
    - Password: `airflow`
- DBT: The DBT transformations will run inside the DBT container, triggered by the Airflow DAG.
  - The DBT container uses `dbt-bigquery` to interact with Google BigQuery.
  - The service account key file (`gcp-key.json`) should be inside the `keys` folder. DBT will use this file to authenticate and interact with BigQuery.
Ensure that the key file is placed correctly in the repository folder as:
/webinar/keys/gcp-key.json
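To verify that DBT inside the container can actually authenticate against BigQuery with that key, `dbt debug` is a handy check; the container name below matches the manual-run example later in this README and is assumed to match your compose setup:

```bash
# Run dbt's built-in connection check inside the DBT container.
# Confirm the container name with `docker ps` if this one doesn't match.
docker exec -it dbt-container dbt debug
```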
Airflow will trigger the DBT transformations according to the defined DAGs. You can monitor task progress in the Airflow UI and check the logs for errors or successful runs.
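Besides the web UI, you can use the Airflow CLI inside the container to inspect and trigger DAGs; the service name and DAG id below are placeholders:

```bash
# List all DAGs known to Airflow (service name is an example — check docker-compose.yml)
docker-compose exec airflow-webserver airflow dags list

# Trigger a DAG manually (replace the DAG id with a real one from the list above)
docker-compose exec airflow-webserver airflow dags trigger my_dbt_pipeline
```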
If you need to run DBT manually inside the container, you can use the following command:
docker exec -it dbt-container dbt run
This will execute the DBT transformations inside the running container.
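A few useful variations of the manual run (the model name below is hypothetical):

```bash
# Build only one model and everything downstream of it (model name is an example)
docker exec -it dbt-container dbt run --select my_model+

# Run the project's data tests after the models are built
docker exec -it dbt-container dbt test
```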
- `/dags`: Contains the Airflow DAGs that control the data pipeline.
- `/dbt`: Contains the DBT models and configuration.
- `docker-compose.yml`: The Docker Compose configuration to run the services.
- `Dockerfile`: The Dockerfile for building the DBT container.
- `/keys/gcp-key.json`: Your Google Cloud service account JSON key file (not included in the repo for security reasons).
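A quick way to confirm that your local checkout matches this layout before starting the stack (assuming you have already created `.env` and copied the key):

```bash
# Fails loudly for any expected file or folder that is missing
ls -d dags dbt keys/gcp-key.json docker-compose.yml Dockerfile .env
```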
This project is for educational purposes. Please do not use it in production without proper security and configuration updates.