FinHack 2026 — Case 4
Can AI-related text sentiment — both direct and spillover from related companies — predict whether a stock goes up or down in the 5 trading days after earnings?
We build a sentiment-driven prediction system that tests whether AI news signals improve post-earnings stock return prediction. The key innovation is spillover — sentiment about NVDA might predict AMD's earnings move, not just NVDA's.
Each model adds complexity. The output table shows whether accuracy increases at each step:
| # | Model | Features | Method | Question Answered |
|---|---|---|---|---|
| 1 | Baseline | Market only | Logistic Regression | What can prices alone predict? |
| 2 | + Sentiment | Market + direct sentiment | Logistic Regression | Does sentiment help? |
| 3 | + Spillover | All features | Logistic Regression | Does cross-company sentiment help? |
| 4 | XGBoost | All features | Gradient Boosting | Do nonlinear interactions help? |
| 5 | LSTM | All features | Deep Learning | Does deep learning help? |
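The way the first three models nest their feature sets can be sketched as below. The feature names are copied from the feature table later in this README; the names `FEATURE_GROUPS`, `features_for`, and `MODEL_FEATURES` are hypothetical and may not match what `src/04_model.py` actually uses.

```python
# Hypothetical helper names; feature lists copied from this README's feature table.
FEATURE_GROUPS = {
    "market": ["pre_ret_5d", "pre_ret_20d", "pre_vol_5d", "mcap_bucket"],
    "direct": ["sent_mean", "sent_count", "sent_trend", "sent_extreme"],
    "spillover": ["spill_sent_mean", "spill_sent_gap", "spill_negative_count"],
}


def features_for(groups):
    """Concatenate the feature names of the requested groups, in order."""
    return [name for group in groups for name in FEATURE_GROUPS[group]]


# Each model keeps everything the previous one had and adds one group.
MODEL_FEATURES = {
    "baseline": features_for(["market"]),
    "sentiment": features_for(["market", "direct"]),
    "spillover": features_for(["market", "direct", "spillover"]),
}

for model, cols in MODEL_FEATURES.items():
    print(f"{model}: {len(cols)} features")
```

Because each feature list is a superset of the previous one, any change in accuracy between models 1, 2, and 3 can be attributed to the newly added group rather than to a different method.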
finhack/
├── README.md ← You are here
├── requirements.txt ← Python dependencies
├── src/
│ ├── config.py ← Stock universe, relationships, constants
│ ├── 01_collect_data.py ← Step 1: Earnings dates, stock prices, news
│ ├── 02_sentiment.py ← Step 2: FinBERT sentiment extraction
│ ├── 03_features.py ← Step 3: Feature engineering (market, direct, spillover)
│ └── 04_model.py ← Step 4: Train 5 models, print comparison table
├── dashboard/
│ └── app.py ← Streamlit dashboard (4 tabs)
├── docs/
│ └── methodology.md ← Full methodology writeup for presentation
└── data/ ← Generated data files (created by pipeline, gitignored)
- Python 3.10+
- pip
- ~2GB disk space (for FinBERT model download)
- Internet connection (for stock data + model download on first run)
git clone git@github.com:Palash-Mehta/finhack.git
cd finhack
pip install -r requirements.txt
# Step 1: Collect stock prices, earnings dates, and news headlines (~2 min)
python src/01_collect_data.py
# Step 2: Run FinBERT sentiment on all headlines (~3-5 min on CPU)
python src/02_sentiment.py
# Step 3: Engineer features — market, direct sentiment, spillover (~1 min)
python src/03_features.py
# Step 4: Train all 5 models and print comparison table (~2-3 min)
python src/04_model.py
Each step saves its output to data/, so you can re-run later steps without re-running earlier ones.
streamlit run dashboard/app.py
Opens at http://localhost:8501 with four tabs:
- 📈 Sentiment Timeline — Per-stock sentiment in the 7-day pre-earnings window
- 🕸️ Spillover Network — Interactive graph of cross-company relationships
- 🏆 Model Comparison — Side-by-side metrics + accuracy progression chart
- 🔍 Prediction Explorer — Drill into individual earnings events
| Data | Source | Output |
|---|---|---|
| Daily stock prices | Yahoo Finance (yfinance) | data/prices.parquet |
| Earnings dates | Yahoo Finance earnings calendar | data/earnings.csv |
| News headlines | Synthetic (see note below) | data/news.parquet |
Note: the news headlines are synthetic. To use real news, replace the collect_news() function in src/01_collect_data.py with API calls to NewsAPI, Polygon.io, or EODHD. The rest of the pipeline works identically.
- Runs ProsusAI/finbert on every headline
- FinBERT is BERT fine-tuned on financial text — understands "beats expectations" = positive
- Outputs a continuous score: P(positive) - P(negative), range [-1, +1]
- Auto-detects GPU; falls back to CPU
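The score arithmetic can be illustrated without loading the model. This is a minimal sketch that assumes the classifier yields one logit per class and that the classes are labeled `positive`, `negative`, and `neutral`; the logits and the label order below are illustrative assumptions, not real FinBERT output or the code in `src/02_sentiment.py`.

```python
import math


def softmax(logits):
    """Convert raw class logits to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sentiment_score(probs):
    """Continuous score P(positive) - P(negative), bounded by [-1, +1]."""
    return probs["positive"] - probs["negative"]


# Illustrative logits only; the label order here is an assumption.
labels = ["positive", "negative", "neutral"]
probs = dict(zip(labels, softmax([2.0, 0.0, 1.0])))
score = sentiment_score(probs)
print(f"score = {score:+.4f}")
```

The neutral probability does not enter the score directly, but because the three probabilities sum to 1, a strongly neutral headline pulls the score toward 0.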
Three feature groups per earnings event:
| Group | Features | Description |
|---|---|---|
| A: Market | pre_ret_5d, pre_ret_20d, pre_vol_5d, mcap_bucket | Price-based signals before earnings |
| B: Direct Sentiment | sent_mean, sent_count, sent_trend, sent_extreme | Sentiment about the target stock |
| C: Spillover | spill_sent_mean, spill_sent_gap, spill_negative_count | Sentiment about related companies |
Target: Binary — is the 5-day post-earnings return positive (1) or negative (0)?
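A minimal sketch of how the binary target and one market feature could be derived from a list of daily closes. Several details are assumptions rather than facts from this README: the post-earnings return is taken close-to-close from day +1 to day +5, a return of exactly zero is labeled 0, and `pre_ret_5d` is measured over the 5 trading days ending the day before earnings. The function names are hypothetical.

```python
def post_earnings_label(closes, event_idx, horizon=5):
    """Return 1 if the close-to-close return from day +1 to day +horizon
    is positive, 0 otherwise, or None if there is not enough future data.

    closes: daily closing prices in chronological order.
    event_idx: index of the earnings date within closes.
    """
    start = event_idx + 1
    end = event_idx + horizon
    if end >= len(closes):
        return None
    ret = closes[end] / closes[start] - 1.0
    return 1 if ret > 0 else 0


def pre_earnings_return(closes, event_idx, lookback=5):
    """Return over the lookback trading days ending the day before earnings,
    so no price from the earnings date or later is touched."""
    end = event_idx - 1
    start = end - lookback
    if start < 0:
        return None
    return closes[end] / closes[start] - 1.0


# Toy price path, purely for illustration.
closes = [100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112]
event_idx = 6
print(pre_earnings_return(closes, event_idx))
print(post_earnings_label(closes, event_idx))
```

Returning `None` instead of a default value makes events with too little history or too little future data easy to drop before training.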
Trains 5 models and prints a comparison table like:
══════════════════════════════════════════════════════════════════════════════
MODEL COMPARISON — ACCURACY PROGRESSION (Test Set)
══════════════════════════════════════════════════════════════════════════════
# Model Acc F1 AUC Δ Acc
---- -------------------------------------------------- ------- ------- ------- -------
1 Baseline Logreg 0.XXXX 0.XXXX 0.XXXX —
2 Sentiment Logreg 0.XXXX 0.XXXX 0.XXXX +0.XXXX
3 Spillover Logreg 0.XXXX 0.XXXX 0.XXXX +0.XXXX
4 Xgboost 0.XXXX 0.XXXX 0.XXXX +0.XXXX
5 Lstm 0.XXXX 0.XXXX 0.XXXX +0.XXXX
Data leakage is the #1 thing judges will scrutinize. Our controls:
| Rule | Implementation |
|---|---|
| Sentiment window | Only news from [earnings_date - 7, earnings_date] used |
| Market features | Only price data from before earnings date |
| Target return | Computed from days [+1, +5] after earnings |
| Validation split | Chronological only — train on 2023, validate 2024-H1, test 2024-H2+ |
| No shuffling | Time-series split, never k-fold |
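The window and split rules in the table can be sketched as two small date checks. This assumes the 7-day window is counted in calendar days and is inclusive on both ends, as the bracket notation suggests; the function names and the `"train"`/`"val"`/`"test"` labels are hypothetical.

```python
from datetime import date, timedelta


def in_sentiment_window(news_date, earnings_date, days=7):
    """True if news_date falls in [earnings_date - days, earnings_date]."""
    return earnings_date - timedelta(days=days) <= news_date <= earnings_date


def assign_split(earnings_date):
    """Chronological split: 2023 train, 2024-H1 validation, 2024-H2 onward test."""
    if earnings_date.year == 2023:
        return "train"
    if date(2024, 1, 1) <= earnings_date <= date(2024, 6, 30):
        return "val"
    if earnings_date >= date(2024, 7, 1):
        return "test"
    return None  # before 2023: outside the ranges stated in the table


event = date(2024, 8, 28)
print(assign_split(event))
print(in_sentiment_window(date(2024, 8, 22), event))
print(in_sentiment_window(date(2024, 8, 29), event))
```

Assigning the split from the earnings date alone, with no random component, is what keeps every training event strictly earlier than every validation and test event.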
| Tier | Stocks | Role |
|---|---|---|
| Mega-cap AI | NVDA, MSFT, GOOGL, META, AMZN, AAPL | Primary sentiment generators |
| AI Infrastructure | AMD, AVGO, SMCI, MRVL, TSM | Hardware supply chain |
| AI Software | CRM, PLTR, SNOW, NOW, AI | Enterprise AI |
| AI Adjacent | TSLA, ORCL, IBM, INTC | Broader tech |
Cross-company relationships (for spillover) are defined in src/config.py via
SUPPLY_CHAIN_LINKS — e.g., NVDA → AMD, SMCI, MRVL, AVGO, TSM.
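One way the spillover features could be derived from such a mapping is sketched below. Only the NVDA entry comes from this README; the dict layout, the direction of influence (the key company's sentiment spills over to the listed companies), the feature definitions, and the sample sentiment values are all assumptions made for illustration.

```python
from statistics import mean

# Only this entry is taken from the README; the dict layout is assumed.
SUPPLY_CHAIN_LINKS = {"NVDA": ["AMD", "SMCI", "MRVL", "AVGO", "TSM"]}


def related_sources(target, links=SUPPLY_CHAIN_LINKS):
    """Companies whose sentiment is assumed to spill over onto target."""
    return [src for src, dsts in links.items() if target in dsts]


def spillover_features(target, sent_mean_by_ticker, links=SUPPLY_CHAIN_LINKS):
    """Assumed definitions:
    spill_sent_mean: mean sentiment of the related companies
    spill_sent_gap: spill_sent_mean minus the target's own mean sentiment
    spill_negative_count: number of related companies with negative sentiment
    """
    sources = [s for s in related_sources(target, links) if s in sent_mean_by_ticker]
    if not sources:
        return {"spill_sent_mean": 0.0, "spill_sent_gap": 0.0, "spill_negative_count": 0}
    values = [sent_mean_by_ticker[s] for s in sources]
    spill_mean = mean(values)
    return {
        "spill_sent_mean": spill_mean,
        "spill_sent_gap": spill_mean - sent_mean_by_ticker.get(target, 0.0),
        "spill_negative_count": sum(v < 0 for v in values),
    }


# Made-up pre-earnings sentiment means, for illustration only.
feats = spillover_features("AMD", {"NVDA": -0.4, "AMD": 0.1})
print(feats)
```

With these made-up inputs, AMD's only related source is NVDA, so negative NVDA sentiment shows up in all three spillover features even though AMD's own sentiment is mildly positive.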
| If you want to... | Edit this |
|---|---|
| Add/remove stocks | src/config.py → STOCK_UNIVERSE |
| Change relationships | src/config.py → SUPPLY_CHAIN_LINKS |
| Use real news API | src/01_collect_data.py → collect_news() |
| Add features | src/03_features.py → add to compute functions + feature group lists |
| Tune models | src/04_model.py → hyperparameters in each train function |
| Change dashboard | dashboard/app.py |
- Read docs/methodology.md for the full writeup (can be adapted into slides)
- The dashboard is the live demo: walk through all 4 tabs
- The model comparison table is the punchline — show accuracy progression
- Key narrative: "When NVDA sentiment drops, AMD stock reacts before AMD even reports. Can we capture that spillover signal?"
| Problem | Fix |
|---|---|
| ModuleNotFoundError | Run pip install -r requirements.txt |
| FileNotFoundError: data/... | Run the pipeline steps in order (01 → 02 → 03 → 04) |
| FinBERT download slow | First run downloads ~400MB model. Subsequent runs use cache. |
| CUDA out of memory | FinBERT falls back to CPU automatically. LSTM uses CPU by default on Mac. |
| Dashboard won't start | Make sure you've run all 4 pipeline steps first |
| yfinance rate limited | Wait a minute and re-run step 01. Or reduce ALL_TICKERS in config.py |