The goal of this project is to use NLP techniques such as question answering, sentiment analysis, word clouds, document similarity, and others to extract meaningful insights from Warren Buffett's annual letters to the Berkshire Hathaway shareholders.
Create a virtual environment named warren_venv:
$ python3 -m venv warren_venv -- for Linux and macOS
$ python -m venv warren_venv -- for Windows
After that, activate the Python virtual environment:
$ source warren_venv/bin/activate -- for Linux and macOS
$ warren_venv\Scripts\activate -- for Windows
Install the requirements
$ pip install -r requirements.txt
To run the notebook, first download the letters after 2000 at https://www.berkshirehathaway.com/letters/letters.html. Then change the parameter of the function get_letters_corpus_dict to the directory containing the letters, and run the desired cells of the notebook.
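As a rough illustration of the kind of dictionary the letters end up in, the sketch below builds a year-to-text mapping from a directory and pickles it. It is not the code behind get_letters_corpus_dict. The helper names build_letters_dict and save_letters_dict are hypothetical, and it assumes the letters were already converted to plain-text files whose names contain the four-digit year.

```python
import pickle
import re
from pathlib import Path


def build_letters_dict(letters_dir):
    """Hypothetical helper: map each year found in a file name to that file's text.

    Assumes plain-text files (*.txt) with a four-digit year in the name.
    """
    letters = {}
    for path in sorted(Path(letters_dir).glob("*.txt")):
        match = re.search(r"(19|20)\d{2}", path.stem)
        if match is None:
            continue  # skip files without a year in their name
        year = int(match.group(0))
        letters[year] = path.read_text(encoding="utf-8", errors="ignore")
    return letters


def save_letters_dict(letters, pickle_path):
    """Hypothetical helper: pickle the dictionary so it can be reused later."""
    with open(pickle_path, "wb") as handle:
        pickle.dump(letters, handle)
```

Calling build_letters_dict("letters/") and then save_letters_dict(letters, "letters.pkl") would produce a pickle file. Both paths are placeholders, and the structure the project's scripts actually expect may differ.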
You can get the documents most similar to the letter from a specific year by running doc_sim_main.py:
$ python doc_sim_main.py --algorithm <algorithm> --distance <distance> --path <path> --target <target> --number <number> --pretrained <pretrained>
Where:
- algorithm: Could be tfidf, word2vec, doc2vect or transformer
- distance: Could be cosine or euclidean
- path: Pickle path to the letters dict
- target: The target letter year
- number: The number of letters to return
- pretrained: The pretrained model to use with the transformer algorithm
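To make the tfidf and cosine options more concrete, here is a minimal sketch of ranking letters by TF-IDF cosine similarity using only the standard library. It is not the code in doc_sim_main.py. The function names, the tokenizer, and the smoothed IDF formula are assumptions chosen for illustration, and the input is assumed to be a dict mapping each year to the letter text.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def tfidf_vectors(docs):
    """Turn {year: text} into {year: {term: tf-idf weight}} (smoothed IDF)."""
    counts = {year: Counter(tokenize(text)) for year, text in docs.items()}
    n_docs = len(docs)
    doc_freq = Counter()
    for term_counts in counts.values():
        doc_freq.update(term_counts.keys())
    vectors = {}
    for year, term_counts in counts.items():
        total = sum(term_counts.values()) or 1
        vectors[year] = {
            term: (n / total) * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, n in term_counts.items()
        }
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(docs, target, number):
    """Return the `number` years most similar to `target`, best first."""
    vectors = tfidf_vectors(docs)
    scores = [
        (year, cosine(vectors[target], vector))
        for year, vector in vectors.items()
        if year != target
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:number]
```

With a dict named letters, most_similar(letters, 2008, 3) would return up to three (year, score) pairs. A Euclidean variant would rank by ascending distance instead of descending similarity.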
To see the full analysis of this code, read my Medium posts:
- https://medium.com/analytics-vidhya/best-nlp-algorithms-to-get-document-similarity-a5559244b23b
- https://medium.com/analytics-vidhya/using-nlp-to-get-inside-warren-buffet-mind-part-2-8e3557810a39
- https://medium.com/analytics-vidhya/using-nlp-to-get-inside-warren-buffet-mind-part-i-666d717d0c2e