📊 AI Hype Decoded: Sentiment Spillover & Stock Price Prediction

FinHack 2026 — Case 4

Can AI-related text sentiment — both direct and spillover from related companies — predict whether a stock goes up or down in the 5 trading days after earnings?


What This Project Does

We build a sentiment-driven prediction system that tests whether AI news signals improve post-earnings stock return prediction. The key innovation is spillover — sentiment about NVDA might predict AMD's earnings move, not just NVDA's.

The 5-Model Progression

Each model adds complexity. The output table shows whether accuracy increases at each step:

| # | Model | Features | Method | Question Answered |
|---|-------|----------|--------|-------------------|
| 1 | Baseline | Market only | Logistic Regression | What can prices alone predict? |
| 2 | + Sentiment | Market + direct sentiment | Logistic Regression | Does sentiment help? |
| 3 | + Spillover | All features | Logistic Regression | Does cross-company sentiment help? |
| 4 | XGBoost | All features | Gradient Boosting | Do nonlinear interactions help? |
| 5 | LSTM | All features | Deep Learning | Does deep learning help? |
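
As a rough illustration of how the progression nests its inputs, the sketch below builds the column list each model would see. The feature names are the ones listed under Step 3 of this README; the `FEATURE_GROUPS` and `MODEL_SPECS` layout and the `features_for` helper are hypothetical and may not match what `src/04_model.py` actually does.

```python
# Feature names come from the Step 3 feature-group table in this README.
# The dict layout, MODEL_SPECS, and features_for are illustrative assumptions.
FEATURE_GROUPS = {
    "market": ["pre_ret_5d", "pre_ret_20d", "pre_vol_5d", "mcap_bucket"],
    "direct": ["sent_mean", "sent_count", "sent_trend", "sent_extreme"],
    "spillover": ["spill_sent_mean", "spill_sent_gap", "spill_negative_count"],
}

# Each model in the progression, paired with the feature groups it uses.
MODEL_SPECS = [
    ("Baseline", ["market"]),
    ("+ Sentiment", ["market", "direct"]),
    ("+ Spillover", ["market", "direct", "spillover"]),
    ("XGBoost", ["market", "direct", "spillover"]),
    ("LSTM", ["market", "direct", "spillover"]),
]


def features_for(groups):
    """Flatten the named feature groups into one ordered column list."""
    return [name for group in groups for name in FEATURE_GROUPS[group]]


for model_name, groups in MODEL_SPECS:
    print(f"{model_name:<12} {len(features_for(groups)):>2} features")
```

Models 1 to 3 differ only in their inputs, so any accuracy change between them can be attributed to the added feature group rather than to a change of method.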

Project Structure

finhack/
├── README.md                ← You are here
├── requirements.txt         ← Python dependencies
├── src/
│   ├── config.py            ← Stock universe, relationships, constants
│   ├── 01_collect_data.py   ← Step 1: Earnings dates, stock prices, news
│   ├── 02_sentiment.py      ← Step 2: FinBERT sentiment extraction
│   ├── 03_features.py       ← Step 3: Feature engineering (market, direct, spillover)
│   └── 04_model.py          ← Step 4: Train 5 models, print comparison table
├── dashboard/
│   └── app.py               ← Streamlit dashboard (4 tabs)
├── docs/
│   └── methodology.md       ← Full methodology writeup for presentation
└── data/                    ← Generated data files (created by pipeline, gitignored)

Quick Start

Prerequisites

  • Python 3.10+
  • pip
  • ~2GB disk space (for FinBERT model download)
  • Internet connection (for stock data + model download on first run)

1. Clone and install

git clone git@github.com:Palash-Mehta/finhack.git
cd finhack
pip install -r requirements.txt

2. Run the pipeline (4 steps, in order)

# Step 1: Collect stock prices, earnings dates, and news headlines (~2 min)
python src/01_collect_data.py

# Step 2: Run FinBERT sentiment on all headlines (~3-5 min on CPU)
python src/02_sentiment.py

# Step 3: Engineer features — market, direct sentiment, spillover (~1 min)
python src/03_features.py

# Step 4: Train all 5 models and print comparison table (~2-3 min)
python src/04_model.py

Each step saves output to data/ so you can re-run later steps without re-running earlier ones.

3. Launch the dashboard

streamlit run dashboard/app.py

Opens at http://localhost:8501 with four tabs:

  • 📈 Sentiment Timeline — Per-stock sentiment in the 7-day pre-earnings window
  • 🕸️ Spillover Network — Interactive graph of cross-company relationships
  • 🏆 Model Comparison — Side-by-side metrics + accuracy progression chart
  • 🔍 Prediction Explorer — Drill into individual earnings events

What Each Step Does

Step 1: 01_collect_data.py — Data Collection

| Data | Source | Output |
|------|--------|--------|
| Daily stock prices | Yahoo Finance (yfinance) | data/prices.parquet |
| Earnings dates | Yahoo Finance earnings calendar | data/earnings.csv |
| News headlines | Synthetic (see note below) | data/news.parquet |

⚠️ Note on news data: The prototype generates synthetic headlines for demo purposes. To use real data, replace the collect_news() function with API calls to NewsAPI, Polygon.io, or EODHD. The rest of the pipeline works identically.

Step 2: 02_sentiment.py — FinBERT Sentiment

  • Runs ProsusAI/finbert on every headline
  • FinBERT is BERT fine-tuned on financial text, so it reads phrases such as "beats expectations" as positive
  • Outputs a continuous score: P(positive) - P(negative), range [-1, +1]
  • Auto-detects GPU; falls back to CPU
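
The score in the third bullet can be sketched in a few lines. This is a minimal, dependency-free illustration that starts from three raw class logits; the `softmax` and `sentiment_score` helpers are hypothetical, and the label order is an assumption that should be read from the model's `config.id2label` rather than hard-coded.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sentiment_score(logits, labels=("positive", "negative", "neutral")):
    """Return P(positive) - P(negative), a value in [-1, +1].

    The default label order is an assumption; check it against the
    loaded model's config before relying on it.
    """
    probs = dict(zip(labels, softmax(logits)))
    return probs["positive"] - probs["negative"]


# Hand-picked logits for illustration only, not real model output.
print(sentiment_score([2.0, -1.0, 0.5]))
```

The neutral probability does not enter the score directly, but it shrinks the other two, so a mostly neutral headline lands near 0.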

Step 3: 03_features.py — Feature Engineering

Three feature groups per earnings event:

| Group | Features | Description |
|-------|----------|-------------|
| A: Market | pre_ret_5d, pre_ret_20d, pre_vol_5d, mcap_bucket | Price-based signals before earnings |
| B: Direct Sentiment | sent_mean, sent_count, sent_trend, sent_extreme | Sentiment about the target stock |
| C: Spillover | spill_sent_mean, spill_sent_gap, spill_negative_count | Sentiment about related companies |

Target: Binary — is the 5-day post-earnings return positive (1) or negative (0)?
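
To make the market features and the target concrete, here is a minimal sketch for a single earnings event. The exact definitions are assumptions, not taken from `src/03_features.py`: pre-earnings returns end at the close of the day before earnings, volatility is the sample standard deviation of the last five daily returns, and the target compares the close five trading days after earnings with the earnings-day close. `event_features` is a hypothetical helper.

```python
from statistics import stdev


def pct_change(start, end):
    """Simple return from one price to a later price."""
    return end / start - 1.0


def event_features(closes, t):
    """Compute market features and the binary target for one event.

    closes: daily closing prices in date order.
    t: index of the earnings day within closes.
    """
    if t < 21 or t + 5 >= len(closes):
        raise ValueError("not enough price history around the event")

    # Five daily returns ending at the close of the day before earnings.
    daily = [pct_change(closes[i - 1], closes[i]) for i in range(t - 5, t)]

    # Assumed target window: return accrued over days +1 through +5.
    fwd_ret = pct_change(closes[t], closes[t + 5])

    return {
        "pre_ret_5d": pct_change(closes[t - 6], closes[t - 1]),
        "pre_ret_20d": pct_change(closes[t - 21], closes[t - 1]),
        "pre_vol_5d": stdev(daily),
        "target": int(fwd_ret > 0),  # a flat return counts as 0
    }
```

Every feature reads only prices up to index `t - 1`, while the target reads only index `t` and later, which keeps the feature and label windows from overlapping.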

Step 4: 04_model.py — Modeling

Trains 5 models and prints a comparison table like:

══════════════════════════════════════════════════════════════════════════════
  MODEL COMPARISON — ACCURACY PROGRESSION (Test Set)
══════════════════════════════════════════════════════════════════════════════
  #    Model                                              Acc      F1     AUC   Δ Acc
  ---- -------------------------------------------------- ------- ------- ------- -------
  1    Baseline Logreg                                    0.XXXX  0.XXXX  0.XXXX    —
  2    Sentiment Logreg                                   0.XXXX  0.XXXX  0.XXXX +0.XXXX
  3    Spillover Logreg                                   0.XXXX  0.XXXX  0.XXXX +0.XXXX
  4    Xgboost                                            0.XXXX  0.XXXX  0.XXXX +0.XXXX
  5    Lstm                                               0.XXXX  0.XXXX  0.XXXX +0.XXXX
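
For readers who want to see what the Acc, F1, and Δ Acc columns mean, the sketch below computes them without any library. It assumes Δ Acc is measured against the previous row; the script may instead measure against the baseline. AUC is left out because it needs predicted probabilities rather than hard labels. The helper names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred):
    """F1 for the positive class (label 1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def with_deltas(rows):
    """Attach the accuracy change versus the previous row (assumption).

    rows: list of (model_name, accuracy) in progression order.
    Returns (model_name, accuracy, delta); delta is None for the first row.
    """
    out, prev = [], None
    for name, acc in rows:
        out.append((name, acc, None if prev is None else acc - prev))
        prev = acc
    return out
```

With only 20 tickers and a few quarters of test data, each test-set accuracy rests on a small number of events, so small Δ Acc values are worth reading with caution.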

Data Leakage Controls

This is the #1 thing judges will scrutinize. Our controls:

| Rule | Implementation |
|------|----------------|
| Sentiment window | Only news from [earnings_date - 7, earnings_date] used |
| Market features | Only price data from before earnings date |
| Target return | Computed from days [+1, +5] after earnings |
| Validation split | Chronological only: train on 2023, validate 2024-H1, test 2024-H2+ |
| No shuffling | Time-series split, never k-fold |
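
Two of these rules are easy to express as date checks. The sketch below is an illustration under stated assumptions: the cut-off constants mirror the split described above, anything on or before the end of 2023 is treated as training data, and the 7-day window is counted in calendar days. `split_name` and `in_sentiment_window` are hypothetical helpers, not functions from the pipeline.

```python
from datetime import date, timedelta

# Cut-offs implied by "train on 2023, validate 2024-H1, test 2024-H2+".
TRAIN_END = date(2023, 12, 31)
VAL_END = date(2024, 6, 30)


def split_name(earnings_date):
    """Assign an event to a split purely by its date (no shuffling)."""
    if earnings_date <= TRAIN_END:
        return "train"
    if earnings_date <= VAL_END:
        return "val"
    return "test"


def in_sentiment_window(news_date, earnings_date, days=7):
    """True if the headline falls in [earnings_date - days, earnings_date].

    Uses calendar days, which is an assumption; the pipeline may count
    trading days instead.
    """
    start = earnings_date - timedelta(days=days)
    return start <= news_date <= earnings_date


print(split_name(date(2024, 8, 1)))
print(in_sentiment_window(date(2024, 1, 18), date(2024, 1, 17)))
```

Because the split depends only on the event date, no later event can end up in the training set for an earlier one.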

Stock Universe (20 tickers)

| Tier | Stocks | Role |
|------|--------|------|
| Mega-cap AI | NVDA, MSFT, GOOGL, META, AMZN, AAPL | Primary sentiment generators |
| AI Infrastructure | AMD, AVGO, SMCI, MRVL, TSM | Hardware supply chain |
| AI Software | CRM, PLTR, SNOW, NOW, AI | Enterprise AI |
| AI Adjacent | TSLA, ORCL, IBM, INTC | Broader tech |

Cross-company relationships (for spillover) are defined in src/config.py via SUPPLY_CHAIN_LINKS — e.g., NVDA → AMD, SMCI, MRVL, AVGO, TSM.
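
The sketch below shows one way the spillover features could be derived from such a mapping. Only the NVDA entry is taken from this README; the dict layout, the feature definitions (mean of source sentiment, gap between source and own sentiment, count of negative sources), and both helper functions are assumptions made for illustration.

```python
from statistics import mean

# Only this one relationship is stated in the README; the dict shape
# (source ticker -> tickers it spills over to) is an assumption.
SUPPLY_CHAIN_LINKS = {"NVDA": ["AMD", "SMCI", "MRVL", "AVGO", "TSM"]}


def sources_for(target, links=SUPPLY_CHAIN_LINKS):
    """Tickers whose sentiment is assumed to spill over onto target."""
    return [src for src, dsts in links.items() if target in dsts]


def spillover_features(target, sent_mean_by_ticker, links=SUPPLY_CHAIN_LINKS):
    """Build spillover features from per-ticker mean sentiment scores."""
    related = [
        sent_mean_by_ticker[src]
        for src in sources_for(target, links)
        if src in sent_mean_by_ticker
    ]
    if not related:
        # No linked ticker has news in the window: use neutral values.
        return {
            "spill_sent_mean": 0.0,
            "spill_sent_gap": 0.0,
            "spill_negative_count": 0,
        }
    spill = mean(related)
    own = sent_mean_by_ticker.get(target, 0.0)
    return {
        "spill_sent_mean": spill,
        "spill_sent_gap": spill - own,  # sources minus target (assumed sign)
        "spill_negative_count": sum(s < 0 for s in related),
    }


# Made-up sentiment values for illustration only.
print(spillover_features("AMD", {"NVDA": -0.4, "AMD": 0.1}))
```

The per-ticker sentiment passed in should itself come from the same pre-earnings window as the direct features, otherwise the spillover group would reintroduce leakage.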


Key Files to Modify

| If you want to... | Edit this |
|-------------------|-----------|
| Add/remove stocks | src/config.py → STOCK_UNIVERSE |
| Change relationships | src/config.py → SUPPLY_CHAIN_LINKS |
| Use real news API | src/01_collect_data.py → collect_news() |
| Add features | src/03_features.py → add to compute functions + feature group lists |
| Tune models | src/04_model.py → hyperparameters in each train function |
| Change dashboard | dashboard/app.py |

For the Presentation

  • Read docs/methodology.md for the full writeup (can be adapted into slides)
  • The dashboard is the live demo — walk through all 4 tabs
  • The model comparison table is the punchline — show accuracy progression
  • Key narrative: "When NVDA sentiment drops, AMD stock reacts before AMD even reports. Can we capture that spillover signal?"

Troubleshooting

| Problem | Fix |
|---------|-----|
| ModuleNotFoundError | Run pip install -r requirements.txt |
| FileNotFoundError: data/... | Run the pipeline steps in order (01 → 02 → 03 → 04) |
| FinBERT download slow | First run downloads ~400MB model. Subsequent runs use cache. |
| CUDA out of memory | FinBERT falls back to CPU automatically. LSTM uses CPU by default on Mac. |
| Dashboard won't start | Make sure you've run all 4 pipeline steps first |
| yfinance rate limited | Wait a minute and re-run step 01. Or reduce ALL_TICKERS in config.py |
