Email : pshin8@uic.edu
Job Running in Hadoop Pseudo Distributed Mode - https://youtu.be/yr6GhDT-DkY
Job Running in AWS EMR - https://youtu.be/3n63BBl1G-I
This project implements a pipeline that first shards the input data in Hadoop HDFS according to a shard size set in the application.conf file. It then runs the following MapReduce jobs on each shard (split) of the data:
- TokenizationJob
- SlidingWindowJob
- EmbeddingJob
- SemanticSimilarityJob
- StatisticsCollaterJob
The role of each MapReduce job is explained in comments in the code.
- Install Java 8 - `brew install homebrew/cask-versions/adoptopenjdk8`
- Install Scala 3.5.0 - `brew install sbt`, then install the Scala plugin in IntelliJ
- Install Hadoop 3.4.0 - `brew install hadoop`
- To configure Hadoop to run in pseudo-distributed mode, follow https://github.com/0x1DOCD00D/CS441_Fall2024/blob/main/Homeworks/MapReduceHadoopExampleProgram.md
- Clone the project.
- Run `sbt update` to load all the dependencies.
- Run `sbt clean compile` to build the project.
- Run `SBT_OPTS="-Xmx2G" sbt assembly` to build a fat JAR.
- Run `hdfs namenode -format` to format and initialize the NameNode.
- Run `start-dfs.sh` to start the NameNode and DataNodes.
- Run `start-yarn.sh` to start the ResourceManager and NodeManagers.
- Run `hdfs dfs -mkdir -p input` to create the input folder in HDFS.
- Run `hdfs dfs -put /path/to/input/data/file /hdfs/input/folder/path` to copy the input data into HDFS.
- Run `hadoop jar /path/to/fat/jar/file /hdfs/input/folder/path /hdfs/output/folder/path` to start the job.
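The upload and job submission steps above can be collected into one script. The following is a minimal sketch, not part of the project: the helper names `check_tool` and `run` and the `DRY_RUN` switch are hypothetical, and the paths are the same placeholders used in the steps. By default it only prints the commands it would execute.

```shell
#!/usr/bin/env bash
# Sketch of the run steps as one script. By default it only prints the
# commands; set DRY_RUN=0 to actually execute them.

# Placeholder paths copied from the steps above; replace them before a real run.
LOCAL_INPUT="${LOCAL_INPUT:-/path/to/input/data/file}"
HDFS_INPUT="${HDFS_INPUT:-/hdfs/input/folder/path}"
HDFS_OUTPUT="${HDFS_OUTPUT:-/hdfs/output/folder/path}"
FAT_JAR="${FAT_JAR:-/path/to/fat/jar/file}"
DRY_RUN="${DRY_RUN:-1}"

# Print "found: <tool>" or "missing: <tool>" depending on the PATH.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1"
  else
    echo "missing: $1"
  fi
}

# Echo a command, and execute it only when DRY_RUN=0.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then
    "$@"
  fi
}

# Report which of the required tools are installed.
for tool in java sbt hadoop hdfs; do
  check_tool "$tool"
done

# Upload the input and submit the job, stopping at the first failure.
run hdfs dfs -mkdir -p "$HDFS_INPUT" || exit 1
run hdfs dfs -put "$LOCAL_INPUT" "$HDFS_INPUT" || exit 1
run hadoop jar "$FAT_JAR" "$HDFS_INPUT" "$HDFS_OUTPUT" || exit 1
```

Saved under any name, it can be invoked as `DRY_RUN=0 bash <script name>` once the paths are set. MapReduce output formats normally refuse to write into an existing directory, so a rerun may need a fresh output path, unless the pipeline already handles this itself.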
The job output will be available in: /hdfs/output/folder/path
The final output file generated is: https://drive.google.com/file/d/1Y-IyzJ1B92Q4soBzN_ccUw5gIQWiAnAr/view?usp=drive_link
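To look at the output directly in HDFS, something like the following sketch may help. The function name `show_output` is hypothetical, and the `part-*` pattern assumes the jobs keep Hadoop's default output file naming, which this README does not confirm.

```shell
#!/usr/bin/env bash
# List the job output directory and print the first lines of its part files.
show_output() {
  out_dir="$1"
  if ! command -v hdfs >/dev/null 2>&1; then
    echo "hdfs not found on PATH" >&2
    return 1
  fi
  hdfs dfs -ls "$out_dir"
  # The glob is quoted so that HDFS expands it, not the local shell.
  hdfs dfs -cat "$out_dir/part-*" | head -n 20
}

# Placeholder output path from the steps above; pass a real path as the first argument.
show_output "${1:-/hdfs/output/folder/path}" || echo "could not read job output"
```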
If you encounter issues:
- Check logs for error messages
- Verify input data location and permissions
- Ensure your JAR includes all necessary dependencies
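A common cause of failures in pseudo-distributed mode is a daemon that did not start. The sketch below checks the output of `jps` (shipped with the JDK) for the four daemons that `start-dfs.sh` and `start-yarn.sh` are expected to launch. The function name `missing_daemons` is hypothetical.

```shell
#!/usr/bin/env bash
# Read `jps` output on stdin and print each expected daemon that is not listed.
missing_daemons() {
  jps_output="$(cat)"
  for daemon in NameNode DataNode ResourceManager NodeManager; do
    # -w matches whole words, so SecondaryNameNode does not count as NameNode.
    echo "$jps_output" | grep -qw "$daemon" || echo "$daemon"
  done
}

if command -v jps >/dev/null 2>&1; then
  jps | missing_daemons
else
  echo "jps not found on PATH (it ships with the JDK)"
fi
```

No output from `missing_daemons` means all four daemons are running. For a failed job, `yarn logs -applicationId <application id>` prints the container logs if log aggregation is enabled, and `jar tf /path/to/fat/jar/file` lists the contents of the JAR so you can confirm that the dependencies were packaged.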