-
Useful Links:
-
Exam Tips:
- Used for big data processing, manipulation, analytics, indexing, transformation, and more (used as a part of data pipeline).
- If these are mentioned in the exam, EMR is probably being used:
- Spark
- HBase
- Presto
- Flink
- Hive
- Pig
- EMR deployment options:
- A transient cluster created for a specific function, then terminated.
- A long-running cluster that serves multiple workloads.
- A cluster you connect to and use interactively.
- A cluster for running SQL-like queries against Hive.
- Using S3-backed storage (EMRFS) allows the data to persist beyond the life of the cluster (a minimal launch sketch follows this list).
- By default, all instances in a cluster are placed within the same AZ for performance.
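- A minimal launch sketch, assuming boto3; the bucket, subnet, and job script names are placeholders, not values from the exam material. It shows a transient cluster that runs its steps, writes logs and output to S3, and then terminates.

```python
import boto3

# Minimal sketch of a transient EMR cluster launch (hypothetical names).
# Logs and job output go to S3, so they persist after the cluster terminates.
emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="transient-analytics-cluster",
    ReleaseLabel="emr-6.10.0",                       # example EMR release
    Applications=[{"Name": "Spark"}, {"Name": "Hive"}],
    LogUri="s3://my-example-bucket/emr-logs/",       # hypothetical bucket
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,                          # 1 master + 2 core
        "KeepJobFlowAliveWhenNoSteps": False,        # transient: terminate after steps
        "Ec2SubnetId": "subnet-0123456789abcdef0",   # hypothetical subnet
    },
    Steps=[{
        "Name": "example-spark-step",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-example-bucket/jobs/job.py"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    VisibleToAllUsers=True,
)
print(response["JobFlowId"])
```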
- Try to understand what Spark, Hive, and Pig do.
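- For a flavor of what Spark does (and of Hive-style SQL), here is a minimal PySpark sketch; the S3 paths and column names are made up. Hive would run the same kind of query as HiveQL over a table, while Pig would express it as a dataflow script instead.

```python
from pyspark.sql import SparkSession

# Minimal PySpark sketch (hypothetical S3 paths and columns). Spark
# distributes this work across the cluster's core/task nodes.
spark = SparkSession.builder.appName("emr-example").getOrCreate()

# Read raw data from S3-backed storage (persists beyond the cluster).
logs = spark.read.json("s3://my-example-bucket/raw-logs/")

# Register a temp view and run a Hive-style SQL aggregation.
logs.createOrReplaceTempView("logs")
daily_counts = spark.sql("""
    SELECT date, status, COUNT(*) AS requests
    FROM logs
    GROUP BY date, status
""")

# Write results back to S3 so they outlive the cluster.
daily_counts.write.mode("overwrite").parquet("s3://my-example-bucket/daily-counts/")
```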
- EMR Architecture:
- Each cluster has at least one node - the master node.
- The master node manages the cluster and monitors its health.
- It distributes workloads and acts as the NameNode for HDFS, coordinating MapReduce jobs.
- It is the node you SSH into.
- Clusters can have zero or more core nodes.
- They act as data nodes for HDFS.
- They run task trackers and can run map and reduce tasks in the cluster.
- Task nodes are optional:
- They have no HDFS involvement.
- They only run tasks; because they store no data, they can be added or removed without risk of data loss.
- This makes them ideal for Spot-based scaling (see the instance-group sketch below).
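- The master/core/task split maps onto EMR instance groups. Below is a minimal configuration sketch, with placeholder instance types, counts, and bid price, showing task nodes provisioned as Spot capacity.

```python
# Hypothetical instance-group layout for run_job_flow's Instances argument:
# one master, HDFS-bearing core nodes on On-Demand, and task nodes on Spot
# (safe to scale in and out because they hold no HDFS data).
instance_groups = [
    {
        "Name": "Master",
        "InstanceRole": "MASTER",
        "InstanceType": "m5.xlarge",
        "InstanceCount": 1,
        "Market": "ON_DEMAND",
    },
    {
        "Name": "Core",
        "InstanceRole": "CORE",        # HDFS DataNodes + task execution
        "InstanceType": "m5.xlarge",
        "InstanceCount": 2,
        "Market": "ON_DEMAND",
    },
    {
        "Name": "Task",
        "InstanceRole": "TASK",        # tasks only, no HDFS
        "InstanceType": "m5.xlarge",
        "InstanceCount": 4,
        "Market": "SPOT",
        "BidPrice": "0.10",            # placeholder bid (USD/hour)
    },
]
```

- This list would be passed as `Instances={"InstanceGroups": instance_groups, ...}` in place of the simple instance counts used in the earlier launch sketch.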
- EMR Cost and Performance Optimization:
- Performance:
- Launch the cluster as close to the data source as possible, preferably in the same region.
- Cost:
- Use the latest generation of each instance type.
- Only use Reserved Instances when you know the usage upfront.
- If you are using per-second billing, it can make more sense to run more instances for a shorter amount of time to process jobs quicker.
- The logic is that, without the old one-hour minimum charge, you can use more instances to finish a job faster and pay the same price or less (see the worked example below).
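- A quick back-of-the-envelope check of that logic, using a made-up $0.20/hour rate:

```python
# Illustrative only: the $0.20/hour rate is a placeholder, not a real price.
rate_per_hour = 0.20
rate_per_second = rate_per_hour / 3600

# Option A: 10 instances working for 60 minutes.
cost_a = 10 * 60 * 60 * rate_per_second   # = $2.00

# Option B: 20 instances finishing the same job in 30 minutes.
cost_b = 20 * 30 * 60 * rate_per_second   # = $2.00

# With per-second billing both cost the same, but Option B returns results
# twice as fast; under the old one-hour minimum, Option B would have been
# billed 20 full instance-hours ($4.00).
print(cost_a, cost_b)
```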