# Syllabus

## Overview

This 6-week program provides a hands-on introduction to Apache Hadoop and Spark programming using Python and cloud computing. The key components covered by the course include the Hadoop Distributed File System (HDFS), MapReduce using MRJob, Apache Hive, Apache Pig, and Apache Spark. Tools and platforms used include Docker, Amazon Web Services (AWS), and Databricks. In the first half of the program, students pull a pre-built Docker image and run most of the exercises locally in Docker containers. In the second half, students use their AWS and Databricks accounts to run cloud computing exercises. Students will need to bring their laptops to class. Detailed instructions will be provided ahead of time on how to pull and run the Docker image, how to connect to AWS and Databricks, and so on.

## Unit 1: Introduction to Hadoop

1. Data Engineering Toolkits
   - Running Linux using Docker containers
   - Linux CLI commands and bash scripts
   - Python basics
2. Hadoop and MapReduce
   - Big Data overview
   - HDFS
   - YARN
   - MapReduce

## Unit 2: MapReduce

1. MapReduce using MRJob, Part 1
   - Protocols for Input and Output
   - Filtering
2. MapReduce using MRJob, Part 2
   - Top-N
   - Inverted Index
   - Multi-step Jobs (a minimal MRJob sketch follows this list)
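
To preview the kind of code this unit builds toward, here is a minimal MRJob sketch of a two-step job that counts words and then keeps the ten most frequent, touching the Top-N and multi-step topics above. The input file name and the cutoff of ten are illustrative assumptions, not course requirements.

```python
# word_top10.py -- a minimal two-step MRJob sketch (illustrative, not
# course-provided code). Run locally with:  python word_top10.py input.txt
import re

from mrjob.job import MRJob
from mrjob.step import MRStep

WORD_RE = re.compile(r"[\w']+")


class MRTopWords(MRJob):
    def steps(self):
        # Step 1 counts words; step 2 ranks the counts.
        return [
            MRStep(mapper=self.mapper_get_words, reducer=self.reducer_sum),
            MRStep(reducer=self.reducer_top10),
        ]

    def mapper_get_words(self, _, line):
        for word in WORD_RE.findall(line):
            yield word.lower(), 1

    def reducer_sum(self, word, counts):
        # Emit everything under a single key so that one reducer sees
        # all (count, word) pairs in the next step.
        yield None, (sum(counts), word)

    def reducer_top10(self, _, count_word_pairs):
        for count, word in sorted(count_word_pairs, reverse=True)[:10]:
            yield word, count


if __name__ == "__main__":
    MRTopWords.run()
```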

## Unit 3: Apache Hive

1. Apache Hive, Part 1
   - Databases for Big Data
   - HiveQL and Querying Data
   - Windowing and Analytics Functions (a HiveQL sketch follows this list)
   - MapReduce Scripts
2. Apache Hive, Part 2
   - Tables in Hive
   - Managed Tables and External Tables
   - Storage Formats
   - Partitions and Buckets
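
As a small preview of the windowing topics above, the sketch below runs a HiveQL window query from Python. It assumes a reachable HiveServer2 endpoint, the third-party PyHive client, and a hypothetical `employees` table; none of these are course requirements.

```python
# Illustrative only: query Hive from Python via PyHive (an assumption,
# not part of the course toolkit). Requires a running HiveServer2.
from pyhive import hive

conn = hive.Connection(host="localhost", port=10000)  # assumed endpoint
cursor = conn.cursor()

# Windowing/analytics functions over a hypothetical `employees` table:
# per-department average salary and per-department salary rank.
cursor.execute(
    """
    SELECT dept, name, salary,
           AVG(salary) OVER (PARTITION BY dept)                      AS dept_avg,
           RANK()      OVER (PARTITION BY dept ORDER BY salary DESC) AS salary_rank
    FROM employees
    """
)
for row in cursor.fetchall():
    print(row)
```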

## Unit 4: Apache Pig

1. Apache Pig, Part 1
   - Overview
   - Pig Latin: Data Types
   - Pig Latin: Relational Operators
2. Apache Pig, Part 2
   - More Pig Latin: Relational Operators
   - More Pig Latin: Functions
   - Compiling Pig to MapReduce
   - The PARALLEL Clause
   - Join Optimizations (a small Pig Latin sketch follows this list)
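
To give a flavor of Pig Latin before this unit starts, here is a sketch that writes a tiny word-count script and runs it in Pig's local mode. It assumes a `pig` executable on the PATH (for example, inside the course Docker image); the file names are illustrative.

```python
# Illustrative only: generate and run a tiny Pig Latin word-count script.
# Assumes the `pig` executable is on the PATH (e.g. in the course's
# Docker image); `input.txt` is a hypothetical local file.
import subprocess

PIG_SCRIPT = """
lines   = LOAD 'input.txt' AS (line:chararray);
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS n;
DUMP counts;
"""

with open("wordcount.pig", "w") as f:
    f.write(PIG_SCRIPT)

# -x local runs Pig against the local filesystem instead of a cluster;
# on a cluster, the same script is compiled into MapReduce jobs.
subprocess.run(["pig", "-x", "local", "wordcount.pig"], check=True)
```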

## Unit 5: Apache Spark and AWS

1. Apache Spark: Spark Core
   - Spark Overview
   - Running Spark using Databricks Notebooks
   - Working with PySpark: RDDs
   - Transformations and Actions
2. Apache Spark: Spark SQL
   - Spark DataFrames
   - SQL Operations using Spark SQL
3. Apache Spark: Spark ML
   - ML Pipelines using PySpark
4. Amazon Elastic MapReduce
   - Overview
   - Amazon Web Services: IAM, EC2, S3
   - Creating an EMR Cluster
   - Submitting Jobs
   - Intro to the AWS CLI

Illustrative PySpark and EMR sketches previewing this unit follow below.
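
First, RDD transformations and actions: transformations such as `flatMap`, `map`, and `reduceByKey` are lazy, and nothing executes until an action such as `take` is called. The input path here is an illustrative assumption.

```python
# Illustrative RDD word count; `input.txt` is an assumed local file.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-preview").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("input.txt")                      # lazy transformation
counts = (
    lines.flatMap(lambda line: line.split())          # transformation
         .map(lambda word: (word.lower(), 1))         # transformation
         .reduceByKey(lambda a, b: a + b)             # transformation
)
print(counts.take(5))                                 # action: triggers execution
spark.stop()
```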
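
Second, the same aggregation expressed twice: once through the DataFrame API and once through Spark SQL over a temporary view. The sample rows are made up for illustration.

```python
# Illustrative DataFrame / Spark SQL comparison with made-up sample data.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sql-preview").getOrCreate()

df = spark.createDataFrame(
    [("alice", "eng", 100), ("bob", "eng", 80), ("carol", "ops", 90)],
    ["name", "dept", "score"],
)

# DataFrame API
df.groupBy("dept").agg(F.avg("score").alias("avg_score")).show()

# Equivalent query via Spark SQL on a temporary view
df.createOrReplaceTempView("people")
spark.sql("SELECT dept, AVG(score) AS avg_score FROM people GROUP BY dept").show()
spark.stop()
```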
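
Third, a minimal ML pipeline in the style of the Spark documentation's text-classification example: a `Tokenizer` and `HashingTF` feed features to `LogisticRegression`, and `Pipeline` chains the stages. The tiny training set is fabricated purely for illustration.

```python
# Illustrative ML pipeline with a fabricated toy dataset.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("ml-preview").getOrCreate()

training = spark.createDataFrame(
    [("spark is great", 1.0), ("hadoop mapreduce", 0.0),
     ("pyspark ml pipelines", 1.0), ("hive and pig", 0.0)],
    ["text", "label"],
)

tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashing_tf = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.001)
pipeline = Pipeline(stages=[tokenizer, hashing_tf, lr])

model = pipeline.fit(training)

test = spark.createDataFrame([("spark rocks",), ("pig latin",)], ["text"])
model.transform(test).select("text", "prediction").show()
spark.stop()
```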
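
Finally, while the unit introduces the AWS CLI itself, the equivalent call from Python goes through boto3, the AWS SDK for Python. The sketch below launches a small transient EMR cluster; the region, release label, instance types, and default IAM role names are all illustrative assumptions that must match your own AWS account setup.

```python
# Illustrative only: launch a small transient EMR cluster with boto3.
# Region, release label, instance types, and IAM role names are
# assumptions; they must exist in, and match, your AWS account.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="course-demo-cluster",
    ReleaseLabel="emr-6.10.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,  # shut down when steps finish
    },
    JobFlowRole="EMR_EC2_DefaultRole",   # default EC2 instance profile
    ServiceRole="EMR_DefaultRole",       # default EMR service role
)
print("Cluster ID:", response["JobFlowId"])
```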

## Project: Data Engineering Project