spark-samples

This repository holds sample Spark exercises completed for learning purposes. Each data set is a CSV or plain text file.

movie-similarities-genre-filter.py

Using a data file that maps users to movies with ratings, this script recommends movies that are similar to a chosen movie. We first import the data set as key/value pairs in which keys are users and values are tuples of the form (movieID, rating). We then join this RDD to itself to find every pair of movies rated by the same user. We filter out pairs in which both movies are the same and pairs whose movies have no genres in common. Next, we transform the RDD to drop the userID, using movie pairs as keys and the corresponding pairs of ratings as values. Finally, we compute the cosine similarity for each movie pair and print the movies that meet specified thresholds for both co-occurrence and similarity score.
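The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the script itself: the function names and the `genres_by_movie` lookup (a plain dict from movie ID to a set of genre names) are hypothetical, and keeping only pairs with `movie1 < movie2` is one common way to drop self-pairs that also removes mirrored duplicates.

```python
from math import sqrt


def is_ordered_pair(record):
    """True when the first movie ID is smaller than the second.

    Assumption: this drops self-pairs (A, A) and mirrored duplicates (B, A).
    """
    _, ((movie1, _), (movie2, _)) = record
    return movie1 < movie2


def shares_genre(record, genres_by_movie):
    """True when the two movies in a joined record share at least one genre."""
    _, ((movie1, _), (movie2, _)) = record
    genres1 = genres_by_movie.get(movie1, set())
    genres2 = genres_by_movie.get(movie2, set())
    return bool(genres1 & genres2)


def to_movie_pair(record):
    """Drop the user ID: key by (movie1, movie2), value is (rating1, rating2)."""
    _, ((movie1, rating1), (movie2, rating2)) = record
    return (movie1, movie2), (rating1, rating2)


def cosine_similarity(rating_pairs):
    """Return (similarity score, co-occurrence count) for (rating_x, rating_y) pairs."""
    num_pairs = 0
    sum_xx = sum_yy = sum_xy = 0.0
    for rating_x, rating_y in rating_pairs:
        sum_xx += rating_x * rating_x
        sum_yy += rating_y * rating_y
        sum_xy += rating_x * rating_y
        num_pairs += 1
    denominator = sqrt(sum_xx) * sqrt(sum_yy)
    score = sum_xy / denominator if denominator else 0.0
    return score, num_pairs


def movie_pair_similarities(ratings, genres_by_movie):
    """ratings is assumed to be an RDD of (user_id, (movie_id, rating))."""
    joined = ratings.join(ratings)  # every pair of movies rated by the same user
    kept = joined.filter(is_ordered_pair).filter(
        lambda record: shares_genre(record, genres_by_movie)
    )
    # Group all rating pairs per movie pair, then score each group.
    return kept.map(to_movie_pair).groupByKey().mapValues(cosine_similarity)
```

The result is an RDD of `((movie1, movie2), (score, count))`, which can then be filtered against the similarity and co-occurrence thresholds. For a large genre table, a broadcast variable would likely be a better fit than a dict captured in a closure.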

sales-by-customer.py

This is a simple aggregation grouped by customer, using a sample CSV file in which each row represents a retail order.
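A minimal sketch of such an aggregation is shown below. The column layout is an assumption made for illustration (customer ID in the first column, order amount in the third), as are the function names and the choice to sum the amounts; the actual sample file may differ.

```python
def parse_order(line):
    """Split one CSV row into (customer_id, amount).

    Assumption: column 0 is the customer ID and column 2 is the order amount.
    """
    fields = line.split(",")
    return int(fields[0]), float(fields[2])


def total_by_customer(lines):
    """lines is assumed to be an RDD of CSV rows, e.g. from sc.textFile(...)."""
    # Sum the order amounts for each customer ID.
    return lines.map(parse_order).reduceByKey(lambda a, b: a + b)
```

Calling `total_by_customer(sc.textFile(path)).collect()` on an existing `SparkContext` named `sc` would return a list of `(customer_id, total)` tuples.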

word-count.py

This script reads a text file and flattens it into an RDD of words. It counts the number of occurrences of each word and then displays the words in order of increasing occurrence count.
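The flatten, count, and sort steps might look roughly like this. The tokenizing rule (lowercase, then keep runs of letters, digits, and apostrophes) and the function names are assumptions for illustration; the script may split words differently.

```python
import re


def tokenize(line):
    """Lowercase a line and return its words (assumed tokenizing rule)."""
    return re.findall(r"[a-z0-9']+", line.lower())


def count_words(lines):
    """lines is assumed to be an RDD of text lines, e.g. from sc.textFile(...)."""
    return (
        lines.flatMap(tokenize)               # one record per word
        .map(lambda word: (word, 1))
        .reduceByKey(lambda a, b: a + b)      # occurrences per word
        .sortBy(lambda pair: pair[1])         # ascending by count
    )
```

Iterating over `count_words(sc.textFile(path)).collect()` then yields `(word, count)` tuples from least to most frequent.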
