# Syllabus

## Overview

This 6-week program provides a hands-on introduction to Apache Hadoop and Spark programming using Python and cloud computing. The key components covered by the course include the Hadoop Distributed File System (HDFS), MapReduce using MRJob, Apache Hive, Apache Pig, and Apache Spark. Tools and platforms used include Docker, Amazon Web Services (AWS), and Databricks. In the first half of the program, students pull a pre-built Docker image and run most of the exercises locally in Docker containers. In the second half, students use their AWS and Databricks accounts to run cloud computing exercises. Students will need to bring their laptops to class. Detailed instructions will be provided ahead of time on how to pull and run the Docker image, how to connect to AWS and Databricks, and so on.

## Unit 1: Introduction to Hadoop

1. Data Engineering Toolkits
   - Running Linux using Docker containers
   - Linux CLI commands and bash scripts
   - Python basics
2. Hadoop and MapReduce
   - Big Data overview
   - HDFS
   - YARN
   - MapReduce

## Unit 2: MapReduce

1. MapReduce using MRJob, Part 1
   - Protocols for Input and Output
   - Filtering
2. MapReduce using MRJob, Part 2
   - Top-N
   - Inverted Index
   - Multi-step Jobs (a minimal MRJob sketch follows this list)
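
To preview the kind of code this unit builds toward, here is a minimal MRJob sketch of a two-step job that counts words and then keeps the ten most frequent, touching the Top-N and multi-step topics above. The input file name and the cutoff of ten are illustrative assumptions, not course requirements.

```python
# word_top10.py -- a minimal two-step MRJob sketch (illustrative, not
# course-provided code). Run locally with:  python word_top10.py input.txt
import re

from mrjob.job import MRJob
from mrjob.step import MRStep

WORD_RE = re.compile(r"[\w']+")


class MRTopWords(MRJob):
    def steps(self):
        # Step 1 counts words; step 2 ranks the counts.
        return [
            MRStep(mapper=self.mapper_get_words, reducer=self.reducer_sum),
            MRStep(reducer=self.reducer_top10),
        ]

    def mapper_get_words(self, _, line):
        for word in WORD_RE.findall(line):
            yield word.lower(), 1

    def reducer_sum(self, word, counts):
        # Emit everything under a single key so that one reducer sees
        # all (count, word) pairs in the next step.
        yield None, (sum(counts), word)

    def reducer_top10(self, _, count_word_pairs):
        for count, word in sorted(count_word_pairs, reverse=True)[:10]:
            yield word, count


if __name__ == "__main__":
    MRTopWords.run()
```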

## Unit 3: Apache Hive

1. Apache Hive, Part 1
   - Databases for Big Data
   - HiveQL and Querying Data
   - Windowing and Analytics Functions (a HiveQL sketch follows this list)
   - MapReduce Scripts
2. Apache Hive, Part 2
   - Tables in Hive
   - Managed Tables and External Tables
   - Storage Formats
   - Partitions and Buckets
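
As a small preview of the windowing topics above, the sketch below runs a HiveQL window query from Python. It assumes a reachable HiveServer2 endpoint, the third-party PyHive client, and a hypothetical `employees` table; none of these are course requirements.

```python
# Illustrative only: query Hive from Python via PyHive (an assumption,
# not part of the course toolkit). Requires a running HiveServer2.
from pyhive import hive

conn = hive.Connection(host="localhost", port=10000)  # assumed endpoint
cursor = conn.cursor()

# Windowing/analytics functions over a hypothetical `employees` table:
# per-department average salary and per-department salary rank.
cursor.execute(
    """
    SELECT dept, name, salary,
           AVG(salary) OVER (PARTITION BY dept)                      AS dept_avg,
           RANK()      OVER (PARTITION BY dept ORDER BY salary DESC) AS salary_rank
    FROM employees
    """
)
for row in cursor.fetchall():
    print(row)
```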

## Unit 4: Apache Pig

1. Apache Pig, Part 1
   - Overview
   - Pig Latin: Data Types
   - Pig Latin: Relational Operators
2. Apache Pig, Part 2
   - More Pig Latin: Relational Operators
   - More Pig Latin: Functions
   - Compiling Pig to MapReduce
   - The PARALLEL Clause
   - Join Optimizations (a small Pig Latin sketch follows this list)
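
To give a flavor of Pig Latin before this unit starts, here is a sketch that writes a tiny word-count script and runs it in Pig's local mode. It assumes a `pig` executable on the PATH (for example, inside the course Docker image); the file names are illustrative.

```python
# Illustrative only: generate and run a tiny Pig Latin word-count script.
# Assumes the `pig` executable is on the PATH (e.g. in the course's
# Docker image); `input.txt` is a hypothetical local file.
import subprocess

PIG_SCRIPT = """
lines   = LOAD 'input.txt' AS (line:chararray);
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS n;
DUMP counts;
"""

with open("wordcount.pig", "w") as f:
    f.write(PIG_SCRIPT)

# -x local runs Pig against the local filesystem instead of a cluster;
# on a cluster, the same script is compiled into MapReduce jobs.
subprocess.run(["pig", "-x", "local", "wordcount.pig"], check=True)
```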

## Unit 5: Apache Spark and AWS

1. Apache Spark: Spark Core
   - Spark Overview
   - Running Spark using Databricks Notebooks
   - Working with PySpark: RDDs
   - Transformations and Actions
2. Apache Spark: Spark SQL
   - Spark DataFrames
   - SQL Operations using Spark SQL
3. Apache Spark: Spark ML
   - ML Pipelines using PySpark
4. Amazon Elastic MapReduce
   - Overview
   - Amazon Web Services: IAM, EC2, S3
   - Creating an EMR Cluster
   - Submitting Jobs
   - Intro to the AWS CLI

Illustrative PySpark and EMR sketches previewing this unit follow below.
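
First, RDD transformations and actions: transformations such as `flatMap`, `map`, and `reduceByKey` are lazy, and nothing executes until an action such as `take` is called. The input path here is an illustrative assumption.

```python
# Illustrative RDD word count; `input.txt` is an assumed local file.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-preview").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("input.txt")                      # lazy transformation
counts = (
    lines.flatMap(lambda line: line.split())          # transformation
         .map(lambda word: (word.lower(), 1))         # transformation
         .reduceByKey(lambda a, b: a + b)             # transformation
)
print(counts.take(5))                                 # action: triggers execution
spark.stop()
```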
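
Second, the same aggregation expressed twice: once through the DataFrame API and once through Spark SQL over a temporary view. The sample rows are made up for illustration.

```python
# Illustrative DataFrame / Spark SQL comparison with made-up sample data.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sql-preview").getOrCreate()

df = spark.createDataFrame(
    [("alice", "eng", 100), ("bob", "eng", 80), ("carol", "ops", 90)],
    ["name", "dept", "score"],
)

# DataFrame API
df.groupBy("dept").agg(F.avg("score").alias("avg_score")).show()

# Equivalent query via Spark SQL on a temporary view
df.createOrReplaceTempView("people")
spark.sql("SELECT dept, AVG(score) AS avg_score FROM people GROUP BY dept").show()
spark.stop()
```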
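
Third, a minimal ML pipeline in the style of the Spark documentation's text-classification example: a `Tokenizer` and `HashingTF` feed features to `LogisticRegression`, and `Pipeline` chains the stages. The tiny training set is fabricated purely for illustration.

```python
# Illustrative ML pipeline with a fabricated toy dataset.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("ml-preview").getOrCreate()

training = spark.createDataFrame(
    [("spark is great", 1.0), ("hadoop mapreduce", 0.0),
     ("pyspark ml pipelines", 1.0), ("hive and pig", 0.0)],
    ["text", "label"],
)

tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashing_tf = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.001)
pipeline = Pipeline(stages=[tokenizer, hashing_tf, lr])

model = pipeline.fit(training)

test = spark.createDataFrame([("spark rocks",), ("pig latin",)], ["text"])
model.transform(test).select("text", "prediction").show()
spark.stop()
```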
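
Finally, while the unit introduces the AWS CLI itself, the equivalent call from Python goes through boto3, the AWS SDK for Python. The sketch below launches a small transient EMR cluster; the region, release label, instance types, and default IAM role names are all illustrative assumptions that must match your own AWS account setup.

```python
# Illustrative only: launch a small transient EMR cluster with boto3.
# Region, release label, instance types, and IAM role names are
# assumptions; they must exist in, and match, your AWS account.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="course-demo-cluster",
    ReleaseLabel="emr-6.10.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,  # shut down when steps finish
    },
    JobFlowRole="EMR_EC2_DefaultRole",   # default EC2 instance profile
    ServiceRole="EMR_DefaultRole",       # default EMR service role
)
print("Cluster ID:", response["JobFlowId"])
```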

## Project: Data Engineering Project