Word2Vec-and-Topic-Analysis-Yelp-Reviews

Explores the Yelp dataset with LDA topic analysis and a Word2Vec model, using spaCy and Gensim.

Data:

This analysis is based on the Yelp dataset, specifically the yelp_academic_dataset_reviews.json file. The reviews were filtered by category. The filtering is done in extract_business_ids_from_database.ipynb (see Important Note below), which creates a text file containing the business ids whose reviews are included in the analysis.

Important Note: The business ids file was created using extract_business_ids_from_database.ipynb, which must be run in the folder for Yelp_sqlite_database. This project can also be run without filtering by making a small change to the config.yaml file; more on this below.

Setup:

Adjust the config.yaml file as needed. In the paths section, set the base_data folder, the path to the yelp_academic_dataset_reviews.json file and, if applicable, the path to the business_idx.txt file. If no filtering is desired, leave the business_idx.txt entry blank.
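To make the filtering behaviour concrete, here is a minimal sketch of how reviews could be restricted to a set of business ids, with a blank path meaning "keep everything". It is not the code in prep_text.py or utils.py. The function names are hypothetical, and it assumes the ids file holds one business id per line and the reviews file holds one JSON object per line with business_id and text fields.

```python
import json


def load_business_ids(ids_path):
    """Return a set of business ids, or None when no filtering is requested."""
    if not ids_path:  # blank config entry: no filtering
        return None
    with open(ids_path, encoding="utf-8") as handle:
        # Assumption: one business id per line; blank lines are ignored.
        return {line.strip() for line in handle if line.strip()}


def iter_review_texts(reviews_path, business_ids=None):
    """Yield the text of each review, optionally keeping only listed businesses."""
    with open(reviews_path, encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue
            review = json.loads(line)  # assumption: one JSON object per line
            if business_ids is None or review["business_id"] in business_ids:
                yield review["text"]
```

Streaming the file line by line keeps memory use low, which matters because the full reviews file is large.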

Replication:

1.) First we need to prepare the text for modeling. This can be done by running prep_text.py from the command line with

python prep_text.py config.yaml

The result can be seen in inspect_prepared_text.ipynb. Parameters can be changed in the 'data_prep' section of the config file.

2.) Optional Step: Next we search for the optimal number of topics by running search_for_best_num_topics.py from the command line with

python search_for_best_num_topics.py config.yaml

The results of the search can be seen in num_topics_search_results.ipynb. Parameters can be changed in the 'lda_tune' and 'lda' sections of the config file.
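The idea behind a topic-count search can be sketched as follows: train one model per candidate number of topics, score each, and keep the best. This is only an illustration, not the logic of search_for_best_num_topics.py. The function names are hypothetical, and using Gensim's c_v coherence as the score is an assumption; the script may use a different criterion.

```python
def search_num_topics(candidates, score_fn):
    """Score every candidate topic count and return (best_k, scores)."""
    scores = {k: score_fn(k) for k in candidates}
    best_k = max(scores, key=scores.get)  # highest score wins
    return best_k, scores


def gensim_coherence(num_topics, corpus, dictionary, texts):
    """Train one LDA model and return its c_v coherence (requires gensim)."""
    # Imported lazily so the search helper above works without gensim.
    from gensim.models import CoherenceModel, LdaModel

    lda = LdaModel(corpus=corpus, id2word=dictionary,
                   num_topics=num_topics, random_state=0)
    coherence = CoherenceModel(model=lda, texts=texts,
                               dictionary=dictionary, coherence="c_v")
    return coherence.get_coherence()
```

Given a bag-of-words corpus, a Gensim Dictionary, and the tokenized texts, a call could look like `search_num_topics(range(5, 55, 5), lambda k: gensim_coherence(k, corpus, dictionary, texts))`. Every candidate trains a full model, which is one reason this step is slow and optional.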

3.) At this stage we are ready to train the LDA model using the optimal number of topics obtained from step 2. This is done by running train_lda_model_prep_vis.py from the command line with

python train_lda_model_prep_vis.py config.yaml

The results, analysis and visualization can be seen in lda_yelp_reviews.ipynb. Parameters can be updated/changed in the 'lda' section of the config file.

4.) The final step is to train the word2vec model. To do this we can run the train_word2vec.py file from the command line with

python train_word2vec.py config.yaml

We can inspect the model results and analysis in Yelp_2_Vec_results.ipynb. Parameters can be changed in the 'word_2_vec' section of the config file.
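For orientation, here is a minimal sketch of training a Word2Vec model with Gensim from prepared text. It is not the contents of train_word2vec.py. The class and function names are hypothetical, the one-sentence-per-line input format is an assumption, and the parameter values are placeholders rather than the settings in config.yaml. The `vector_size` argument is the Gensim 4 name; Gensim 3 calls it `size`.

```python
class SentenceFile:
    """Re-iterable stream of token lists, one sentence per line of a text file."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Reopen the file on every pass; Word2Vec iterates over the data
        # several times, so a one-shot generator would not work.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                tokens = line.split()
                if tokens:
                    yield tokens


def train_word2vec(sentences, vector_size=100, window=5, min_count=5):
    """Train and return a gensim Word2Vec model (requires gensim >= 4)."""
    from gensim.models import Word2Vec  # imported lazily

    return Word2Vec(sentences=sentences, vector_size=vector_size,
                    window=window, min_count=min_count, workers=4)
```

After training, `model.wv.most_similar("pizza")` returns the nearest neighbours of a word, provided that word is in the learned vocabulary.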

Note: The .py files take a significant amount of time to run.

Attributions:

This project is based on this notebook, which is a great guide to using these NLP models in Python.

Other helpful links:
https://spacy.io/
https://radimrehurek.com/gensim/

Alkoopman85/Word2Vec-and-Topic-Analysis-Yelp-Reviews
