Reddit-Analyzer

Goal

Building this project gives you an insight analysis of Reddit posts and comments. You will build an end-to-end pipeline, from data ingestion to visualization.

Why do we need that?

A huge amount of data flows through social networks, and that much data offers endless potential to build something useful.

Requirements

To accomplish our goal we need a few things:

Data from Reddit
A way to stream that data
A way to manipulate the data while it streams
A way to analyze our data quickly
A way to provide those analytics to an end user

Data Pipeline

Step 1

Start the ZooKeeper and Kafka servers.
Together they let you stream large amounts of data reliably, quickly, and easily.

Move to /kafka and type

docker-compose up

on your terminal.

This will bring up both the ZooKeeper and Kafka servers.
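Before moving on, it can help to confirm that both services accept connections. The following is a minimal sketch using only the Python standard library. The ports (2181 for ZooKeeper, 9092 for Kafka) are the usual defaults and are an assumption here; check kafka/docker-compose.yml for the ports this project actually exposes.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Assumed default ports, not taken from this repository.
    for name, port in (("zookeeper", 2181), ("kafka", 9092)):
        state = "up" if is_port_open("localhost", port) else "not reachable"
        print(f"{name} (localhost:{port}): {state}")
```

An open port only shows that the process is listening. It does not prove that the broker has finished starting, so give it a few seconds before launching the connector.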

Step 2

Start the Reddit connector. It streams posts and comments from the desired subreddit (which you can set in the reddit.env file).

Move to /bin and type

./reddit_connector.sh

This will start a Docker container running the Kafka connector, which streams data into two different Kafka topics:

reddit-posts
reddit-comments
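To check that the connector is actually writing to these topics, you could read a few messages back. The sketch below assumes the third-party kafka-python package (pip install kafka-python), a broker reachable at localhost:9092, and message values that are UTF-8 text, possibly JSON. None of this is confirmed by this README; `peek` and `decode_message` are hypothetical helper names.

```python
import json

TOPICS = ("reddit-posts", "reddit-comments")


def decode_message(raw):
    """Decode a Kafka message value; fall back to plain text if it is not JSON."""
    if raw is None:
        return None
    text = raw.decode("utf-8", errors="replace")
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return text


def peek(topic: str, bootstrap: str = "localhost:9092", limit: int = 5) -> None:
    """Print up to `limit` messages from `topic`, starting from the oldest."""
    from kafka import KafkaConsumer  # third-party: kafka-python

    consumer = KafkaConsumer(
        topic,
        bootstrap_servers=bootstrap,   # assumed broker address
        auto_offset_reset="earliest",  # read from the start of the topic
        consumer_timeout_ms=10000,     # stop iterating after 10 s of silence
        value_deserializer=decode_message,
    )
    try:
        for count, message in enumerate(consumer, start=1):
            print(message.topic, message.value)
            if count >= limit:
                break
    finally:
        consumer.close()
```

Calling `peek("reddit-comments")` should print a handful of records if the connector is running. If it prints nothing, check the subreddit configured in reddit.env.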

Step 3

Start Elasticsearch and Kibana, which are used later for data indexing and visualization.
Elasticsearch is used to aggregate the streaming data quickly.

Move to /elasticsearch and type

docker-compose up
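As an illustration of the kind of aggregation Elasticsearch handles, here is a minimal sketch that builds a terms aggregation and posts it to the _search endpoint with the standard library. The base URL (http://localhost:9200), the index name, and the field name are assumptions for illustration only; this README does not state which index or mapping the pipeline uses.

```python
import json
import urllib.request


def top_terms_query(field: str, size: int = 10) -> dict:
    """Build a request body that returns only the most frequent values of `field`."""
    return {
        "size": 0,  # skip the matching documents, return aggregations only
        "aggs": {"top_terms": {"terms": {"field": field, "size": size}}},
    }


def search(index: str, body: dict, base_url: str = "http://localhost:9200") -> dict:
    """POST `body` to <base_url>/<index>/_search and return the parsed response."""
    request = urllib.request.Request(
        f"{base_url}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


if __name__ == "__main__":
    # Hypothetical index and field names.
    print(json.dumps(top_terms_query("subreddit", size=5), indent=2))
```

A terms aggregation needs a field that supports it, typically a keyword field, so the field name depends on how the documents end up being mapped.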

Step 4

Start Spark Streaming to perform data elaboration, cleaning, and predictions.

Elaboration steps

Fit our model on our training data
Build a Spark RDD containing just the interesting fields from the Kafka-streamed data
Convert the Unix epoch time to a timestamp
Apply some NLP
Predict using the trained model
Dump to JSON
Send to Elasticsearch
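The per-record part of the steps above can be sketched in plain Python, without Spark, to show the shape of the transformation. This is not the code in comment_stream.py. The field names follow Reddit's usual comment attributes and are assumptions, the tokenizer is a simple stand-in for "some NLP", and `predict` is a hypothetical callable standing in for the trained model.

```python
import json
import re
from datetime import datetime, timezone

# Assumed field names; the real connector output may differ.
FIELDS = ("author", "subreddit", "body", "created_utc")


def select_fields(record: dict) -> dict:
    """Keep only the fields of interest from a streamed record."""
    return {name: record.get(name) for name in FIELDS}


def epoch_to_timestamp(epoch_seconds: float) -> str:
    """Convert Unix epoch seconds to an ISO 8601 UTC timestamp."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).isoformat()


def tokenize(text: str) -> list:
    """Lowercase the text, drop URLs, and split it into word tokens."""
    text = re.sub(r"https?://\S+", " ", text.lower())
    return re.findall(r"[a-z']+", text)


def elaborate(record: dict, predict) -> str:
    """Run one record through the steps and return it as a JSON string."""
    doc = select_fields(record)
    doc["timestamp"] = epoch_to_timestamp(doc.pop("created_utc"))
    doc["tokens"] = tokenize(doc["body"] or "")
    doc["prediction"] = predict(doc["tokens"])  # stand-in for the trained model
    return json.dumps(doc)
```

In the streaming job, a function like `elaborate` would be mapped over each record before the resulting JSON documents are written to Elasticsearch.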

Move to /bin and type

./sparkSubmitPython.sh comment_stream.py "org.apache.spark:spark-streaming-kafka-0-8_2.11:2.4.6,org.elasticsearch:elasticsearch-hadoop:7.7.0"

Step 5

If your machine is still responsive after this, it's time to see in practice what we've achieved so far.
Go to Kibana to visualize the streaming data and see what is trending!

Guberlo/Reddit-Analyzer
