Text Processing Pipeline

Name : Pooja Chandrakant Shinde

Email : pshin8@uic.edu

Videos

Job Running in Hadoop Pseudo Distributed Mode - https://youtu.be/yr6GhDT-DkY

Job Running in AWS EMR - https://youtu.be/3n63BBl1G-I

Text Processing Pipeline

This project implements a pipeline that first shards the input data in HDFS according to a shard size set in the application.conf file. It then runs the following MapReduce jobs on each shard (split) of the data:

TokenizationJob
SlidingWindowJob
EmbeddingJob
SemanticSimilarityJob
StatisticsCollaterJob

The role of each MapReduce job is explained in comments in the code.
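To get a feel for how the configured shard size affects the number of shards (and so the number of times the job chain runs), the arithmetic can be sketched as below. This is only an illustration: the helper name shard_count is hypothetical, and it assumes the shard size is measured in bytes, which may not match how application.conf defines it.

```shell
# Hypothetical helper, not part of the project.
# Estimates how many shards a file yields, assuming both sizes are in bytes.
shard_count() {
  local file_bytes=$1 shard_bytes=$2
  if [ "$shard_bytes" -le 0 ]; then
    echo "shard size must be positive" >&2
    return 1
  fi
  # Ceiling division: a partial final shard still counts as one shard.
  echo $(( (file_bytes + shard_bytes - 1) / shard_bytes ))
}
```

For example, shard_count 1000 256 reports 4 shards, because the last 232 bytes form a shard of their own.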

Setup

Install Java 8 - brew install homebrew/cask-versions/adoptopenjdk8
Install Scala 3.5.0 via sbt - brew install sbt - then install the Scala plugin in IntelliJ
Install Hadoop 3.4.0 - brew install hadoop
To configure Hadoop to run in pseudo-distributed mode, follow https://github.com/0x1DOCD00D/CS441_Fall2024/blob/main/Homeworks/MapReduceHadoopExampleProgram.md
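Before moving on, it can help to confirm that the installed tools are actually on your PATH. A minimal sketch follows; the function name check_tools is hypothetical and the list of tools to pass in is an assumption based on the steps in this README.

```shell
# Hypothetical helper, not part of the project.
# Reports which of the given commands are on PATH.
# Returns 0 when all are found, 1 if any is missing.
check_tools() {
  local missing=0 tool
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found: $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# Usage: check_tools java sbt hadoop hdfs
```

This only checks that each command exists; it does not verify versions, so java -version and hadoop version are still worth running by hand.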

Running the Job

Clone the project.
Run sbt update - to load all the dependencies
Run sbt clean compile - to build the project
Run SBT_OPTS="-Xmx2G" sbt assembly - to build a fat JAR
Run hdfs namenode -format - to format and initialize the NameNode (first-time setup only; formatting erases existing HDFS metadata)
Run start-dfs.sh - to start the NameNode and DataNode daemons
Run start-yarn.sh - to start the YARN ResourceManager and NodeManager daemons
Run hdfs dfs -mkdir -p input - to create the input folder in HDFS (a relative path resolves under your HDFS home directory)
Run hdfs dfs -put /path/to/input/data/file /hdfs/input/folder/path - to copy the input data into HDFS
Run hadoop jar /path/to/fat/jar/file /hdfs/input/folder/path /hdfs/output/folder/path - to run the pipeline
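The upload and submit steps above can be wrapped in a small script so they are easy to repeat. The sketch below is an assumption-laden convenience, not part of the project: the function names run and run_pipeline and the DRY_RUN variable are hypothetical, while the hdfs and hadoop commands are the ones listed in the steps.

```shell
# Hypothetical wrapper, not part of the project.
# Prints the command when DRY_RUN=1, otherwise executes it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Uploads the input file to HDFS and submits the job.
# Arguments: fat JAR, local input file, HDFS input dir, HDFS output dir.
run_pipeline() {
  local jar=$1 input_file=$2 hdfs_in=$3 hdfs_out=$4
  run hdfs dfs -mkdir -p "$hdfs_in" &&
  run hdfs dfs -put "$input_file" "$hdfs_in" &&
  run hadoop jar "$jar" "$hdfs_in" "$hdfs_out"
}

# Usage: run_pipeline /path/to/fat/jar/file /path/to/input/data/file /hdfs/input/folder/path /hdfs/output/folder/path
```

Setting DRY_RUN=1 before calling run_pipeline prints the three commands without touching the cluster, which is a cheap way to check the paths first. The steps are chained with && so a failed upload stops the job from being submitted.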

Output

The job output will be available in: /hdfs/output/folder/path

The final output file generated is: https://drive.google.com/file/d/1Y-IyzJ1B92Q4soBzN_ccUw5gIQWiAnAr/view?usp=drive_link

Troubleshooting

If you encounter issues:

Check logs for error messages
Verify input data location and permissions
Ensure your JAR includes all necessary dependencies
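Each of the checks above maps to a standard command. The sketch below prints one command per check so you can copy the ones you need; the function name diagnostic_commands is hypothetical, and yarn logs only returns output if log aggregation is enabled on your cluster, which is an assumption about your setup.

```shell
# Hypothetical helper, not part of the project.
# Prints one diagnostic command per troubleshooting item.
# Arguments: YARN application ID, HDFS input path, fat JAR path.
diagnostic_commands() {
  local app_id=$1 hdfs_in=$2 jar=$3
  echo "yarn logs -applicationId $app_id"  # container logs for the job
  echo "hdfs dfs -ls $hdfs_in"             # input files, owners, permissions
  echo "jar tf $jar"                       # entries bundled in the JAR
}

# Usage: diagnostic_commands <application-id> /hdfs/input/folder/path /path/to/fat/jar/file
```

The application ID is printed by hadoop jar when the job is submitted and is also shown in the ResourceManager web UI.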
