GitHub - CarloArpini/2024-DataManagement-Project: This project contains materials and result for the 2023-2024 Data Management Course's project

README

Project by Carlo Arpini, Alisia Sallemi, Fabio Di Rocco.

How to Use:

This project aims at creating a database to answer the question: is there any correlation between posts on r/walstreetbets' subreddit and yahoo finance's active, gainers and losers stocks of the day?

To acquire data, the "Data_Acquisition_Notebook.ipynb" is present. Different functions are created and used in this notebook, to both acquire all data and extract relevant information. Its usage should be as follow: the notebook should be place in a directory containing the "company_tickers.json" file. It can then be executed and it contains all the relevant data to connect to a database on MongoDB and to a reddit APP which enables download of data, as well as containing different functions to scrape data from yahoo finance; passwords and access to both Reddit's APP and MongoDB's database are present. The user should start the notebook a few minutes before market open hours (NY time) and the notebook itself should then acquire all data up until market closed hours, acquiring Yahoo financial data every 5 minutes and Reddit's post data every 3 hours. If an acquisition fails, due to any factor, one can note down the "actual_time" variable and restart the cell updating "acquisitions" variable with the last acquisition performed, and then note down at which time it restarted.

It's mandatory to keep a table of records within the project where one notes down time of when the acquisition failed and restarted throughout the days. This table will prove useful in integrating the dataset.

This ensures anyone is able to enrich the dataset through the "cleaning_integration_script.ipynb", the second most important document, which itself contains various functions and methods that aim at deleting or inserting new documents on MongoDB's database, and should provide an almost complete dataset whenever the acquisition failed; be mindful o adjust variables in datetime values of failures kept in table of records.

At last, all queries used to try and reply to our research question are present in the "query_notebook.ipynb" file, which access our database with the usual credentials and searches through the enriched collections.

Because of all this the order of execution (and the most logical one) should be first acquisition notebook, then the cleaning_integration_script script and at last the query notebook.

LAST UPDATE JAN 2024

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
Data_Acquisition_Notebook.ipynb		Data_Acquisition_Notebook.ipynb
README.md		README.md
SCR-20240218-iwxm.png		SCR-20240218-iwxm.png
cleaning_integration_script.ipynb		cleaning_integration_script.ipynb
company_tickers.json		company_tickers.json
diagramma modificabile.png		diagramma modificabile.png
query_notebook.ipynb		query_notebook.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages