A machine learning project for predicting CS2 match outcomes using team statistics, player performance, and historical data.
```
emils-demos/
├── src/                      # Main source code package
│   ├── data/                 # Data preprocessing modules
│   │   ├── preprocess.py     # Main preprocessing pipeline
│   │   ├── team_mapping.py
│   │   ├── map_mapping.py
│   │   ├── cumulative_stats.py
│   │   ├── match_features.py
│   │   ├── final_features.py
│   │   ├── player_stats.py
│   │   └── utils.py
│   ├── run/                  # Training entrypoints
│   │   └── train.py          # Model training script
│   ├── eval/                 # Evaluation scripts and metrics
│   │   ├── metrics.py
│   │   ├── visualize.py
│   │   └── calibration.py
│   ├── inference/            # Match prediction and inference
│   │   ├── match_scraper.py
│   │   ├── historical_data.py
│   │   ├── fetch_data.py
│   │   ├── predict.py
│   │   └── betting.py
│   └── main.py               # Main CLI entrypoint
├── notebooks/                # Jupyter notebooks for exploration
│   ├── preprocess.ipynb
│   ├── model.ipynb
│   └── xgboost.ipynb
├── data/                     # Data directory (organized by type)
│   ├── raw/                  # Raw, unprocessed data
│   │   ├── team_results/     # Team match results CSV files
│   │   ├── player_results/   # Player weekly statistics CSV files
│   │   └── rankings/         # Ranking files
│   │       ├── hltv_team_rankings_original.csv
│   │       └── teams_peak_36.csv
│   ├── preprocessed/         # Processed data
│   │   ├── final_features.csv
│   │   ├── match_features.csv
│   │   ├── team_map_cumulative_stats.csv
│   │   └── team_opponent_cumulative_stats.csv
│   ├── mappings/             # Mapping/metadata files
│   │   ├── team_name_to_id.csv
│   │   └── map_name_to_id.csv
│   └── temp/                 # Temporary fetched match data (auto-generated)
├── models/                   # Trained models and calibration data
│   ├── xgboost_model.pkl
│   └── calibration_data.pkl
├── scripts/                  # One-off utility scripts
├── tests/                    # Unit and integration tests
├── plots/                    # Generated visualization plots
└── pyproject.toml            # Poetry configuration
```
The model uses the following features to predict match outcomes:
- Team vs Team Statistics: Cumulative wins/losses and win rate between two teams
- Map Performance: Each team's win rate and record on specific maps
- Global Rankings: HLTV ranking points for both teams at match time
- Win/Loss Streaks: Current winning and losing streaks for both teams
- Player Statistics: Overall rating, utility success, and opening rating for top 5 players per team (30 features total)
- Map One-Hot Encoding: One-hot encoded map IDs
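The mix of numeric statistics and a one-hot encoded map ID can be sketched as follows. The column names here are invented for illustration; the real schema is defined by the preprocessing pipeline in `src/data/`.

```python
import pandas as pd

# Hypothetical feature row for a single match -- column names are invented
# for this sketch; the real schema lives in final_features.csv.
match = {
    "h2h_wins_a": 3, "h2h_losses_a": 1,            # team-vs-team record
    "map_winrate_a": 0.62, "map_winrate_b": 0.55,  # per-map performance
    "ranking_points_a": 845, "ranking_points_b": 790,
    "win_streak_a": 4, "win_streak_b": 0,
    "map_id": 2,
}
row = pd.DataFrame([match])

# One-hot encode the map ID, as described in the feature list above
row = pd.get_dummies(row, columns=["map_id"], prefix="map")
```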
Using Poetry:

```
# Install Poetry if you haven't already
curl -sSL https://install.python-poetry.org | python3 -

# Install dependencies
poetry install

# Activate the virtual environment
poetry shell
```

Using pip:

```
# Create a virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install pandas numpy scikit-learn xgboost matplotlib seaborn cloudscraper beautifulsoup4 jupyter
```

The project provides a CLI for common tasks:
```
# Run the full preprocessing pipeline
python -m src.main preprocess

# Train the XGBoost model (includes evaluation metrics)
python -m src.main train --data data/preprocessed/final_features.csv

# Create visualizations (confusion matrix, ROC curve, feature importance)
python -m src.main visualize --data data/preprocessed/final_features.csv

# Fetch match data from an HLTV URL
python -m src.main fetch --url https://www.hltv.org/matches/2388125/spirit-vs-falcons-...

# Predict a match outcome from fetched data
python -m src.main predict --fetched-data data/temp/match_2388125.json

# Predict with betting analysis (prompts for odds)
python -m src.main predict --fetched-data data/temp/match_2388125.json --bet

# Predict a match outcome manually
python -m src.main predict --team-a "Team Spirit" --team-b "Team Falcons" --map "Mirage" --date "2025-01-15"

# Show project info
python -m src.main about
```

Preprocess Command:

```
python -m src.main preprocess [--project-root PATH] [--quiet]
```

Train Command:

```
python -m src.main train --data PATH [--no-balance] [--no-tuning] [--n-iter N] [--seed N] [--quiet]
```

Visualize Command:

```
python -m src.main visualize --data PATH [--output-dir DIR] [--no-train] [--model-path PATH] [--seed N] [--quiet] [--show]
```

Fetch Command:

```
# Fetch match data from an HLTV URL
python -m src.main fetch --url URL [--output PATH] [--quiet]
```

The fetch command:
- Scrapes match information and historical data from HLTV
- Saves fetched data to the `data/temp/` directory by default
- Outputs the full predict command to run on the fetched data
Predict Command:

```
# Using fetched data (recommended workflow)
python -m src.main predict --fetched-data PATH [--model-path PATH] [--bet] [--quiet]

# Using an HLTV URL (automatically extracts match info)
python -m src.main predict --url URL [--model-path PATH] [--bet] [--quiet]

# Manual specification
python -m src.main predict --team-a TEAM_A --team-b TEAM_B --map MAP_NAME --date YYYY-MM-DD [--model-path PATH] [--bet] [--quiet]
```

The predict command:
- Computes features for the match using historical data
- Loads a trained model and generates predictions
- Returns win probabilities for both teams and the predicted winner
- Shows model calibration accuracy for each prediction
- With the `--bet` flag: prompts for betting odds (American format: +100, -120, etc.) and calculates:
  - Expected Value (EV) for each team
  - Expected profit per unit bet
  - A betting recommendation (BET/AVOID)
  - The best available bet
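The EV arithmetic behind those outputs can be sketched like this. It is a minimal illustration of expected value from American odds; the project's `betting.py` may additionally weigh calibration accuracy into the probability.

```python
def american_to_decimal(odds: int) -> float:
    """Convert American odds (+100, -120, ...) to decimal odds."""
    return 1 + odds / 100 if odds > 0 else 1 + 100 / abs(odds)

def expected_value(p_win: float, odds: int) -> float:
    """Expected profit per unit staked, given a model win probability."""
    net_payout = american_to_decimal(odds) - 1  # profit if the bet wins
    return p_win * net_payout - (1 - p_win)

# A 60% model probability at +100 odds is a positive-EV bet:
# EV = 0.60 * 1.0 - 0.40 = +0.20 per unit staked
ev = expected_value(0.60, +100)
recommendation = "BET" if ev > 0 else "AVOID"
```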
The visualize command creates:
- Class distribution bar charts (before and after balancing)
- Confusion matrix heatmap
- ROC curve
- Precision-Recall curve
- Feature importance plot (top 20 features)
- Loss curves (training and validation log loss over boosting rounds)
- Accuracy curves (training and validation accuracy over boosting rounds)
Here's a complete workflow to preprocess data and train a model:
```
# Step 1: Run the preprocessing pipeline
# This creates all necessary intermediate files and final_features.csv
# Output: files in data/preprocessed/ and data/mappings/
python -m src.main preprocess

# Step 2: Train the model with hyperparameter tuning
# This performs RandomizedSearchCV to find the best hyperparameters, then trains the final model
# Output: trained model and evaluation metrics on the test set
python -m src.main train --data data/preprocessed/final_features.csv

# Step 3: Create visualizations
# This generates plots for the confusion matrix, ROC curve, precision-recall curve, and feature importance
# Output: PNG files saved to the plots/ directory
python -m src.main visualize --data data/preprocessed/final_features.csv
```

Note: The train command automatically evaluates the model on the test set and displays metrics after training. The visualize command can train a model or use an existing one to generate visualizations.
To test the complete pipeline from scratch:
```
# 1. Verify the data structure exists
ls -la data/raw/team_results/    # Should show team CSV files
ls -la data/raw/player_results/  # Should show player stats CSV files
ls -la data/raw/rankings/        # Should show ranking CSV files

# 2. Run preprocessing (this may take a few minutes)
python -m src.main preprocess

# Verify preprocessing outputs
ls -la data/preprocessed/  # Should show processed CSV files
ls -la data/mappings/      # Should show mapping CSV files

# 3. Train the model (this may take several minutes due to hyperparameter tuning)
# The train command automatically evaluates and displays metrics
python -m src.main train --data data/preprocessed/final_features.csv

# 4. (Optional) Train without hyperparameter tuning for faster testing
python -m src.main train --data data/preprocessed/final_features.csv --no-tuning

# 5. (Optional) Train without class balancing
python -m src.main train --data data/preprocessed/final_features.csv --no-balance

# 6. Fetch match data from HLTV
python -m src.main fetch --url https://www.hltv.org/matches/2388125/spirit-vs-falcons-...

# 7. Make predictions
python -m src.main predict --fetched-data data/temp/match_2388125.json

# 8. Make predictions with betting analysis
python -m src.main predict --fetched-data data/temp/match_2388125.json --bet
```

You can also use the modules directly in Python:
```python
from pathlib import Path

from src.data.preprocess import run_preprocessing_pipeline
from src.run.train import load_and_prepare_data, train_xgboost_model
from src.eval.metrics import evaluate_model

# Run preprocessing
run_preprocessing_pipeline()

# Load data and train the model
X_train, X_test, y_train, y_test = load_and_prepare_data(
    data_file=Path("data/preprocessed/final_features.csv"),
    balance_classes=True,
)
model, best_params = train_xgboost_model(
    X_train=X_train,
    y_train=y_train,
    X_test=X_test,
    y_test=y_test,
    hyperparameter_tuning=True,
)

# Evaluate
y_pred = model.predict(X_test)
evaluate_model(y_test.values, y_pred)
```

The notebooks in the notebooks/ directory provide interactive exploration:
- `preprocess.ipynb`: Data preprocessing and feature engineering
- `model.ipynb`: Neural network model training
- `xgboost.ipynb`: XGBoost model training with hyperparameter tuning
The preprocessing pipeline consists of five main steps:
1. Team Mapping: Create the team-name-to-ID mapping from `data/raw/rankings/teams_peak_36.csv`
2. Map Mapping: Extract the map-name-to-ID mapping from `data/raw/team_results/`
3. Opponent Statistics: Calculate cumulative wins/losses for each team against each opponent
4. Map Statistics: Calculate cumulative wins/losses for each team on each map
5. Match Features: Combine all statistics with rankings, streaks, and player stats to create the feature dataset
All intermediate files are saved to data/preprocessed/ and mapping files to data/mappings/.
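The cumulative-statistics steps amount to counting each team's prior results without letting a row see its own outcome. A minimal sketch with pandas, using toy data and invented column names:

```python
import pandas as pd

# Toy match log in chronological order; real inputs come from data/raw/team_results/
df = pd.DataFrame({
    "team":     ["A", "A", "A", "B"],
    "opponent": ["B", "B", "C", "A"],
    "won":      [1,   0,   1,   1],
})

# Cumulative wins against each opponent BEFORE the current match:
# subtracting the current result from the running total avoids leaking
# the match's own outcome into its features.
cum = df.groupby(["team", "opponent"])["won"].cumsum()
df["cum_wins_vs_opp"] = cum - df["won"]
```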
The project uses XGBoost (Gradient Boosting) for match prediction:
- Objective: Binary classification (team A wins vs loses)
- Hyperparameter Tuning: RandomizedSearchCV with 50 random combinations
- Evaluation: Accuracy, Precision, Recall, F1-Score, Confusion Matrix, ROC AUC, Precision-Recall AUC
- Validation: 80/20 train/test split for model evaluation
- Class Balancing: Optional undersampling to balance classes (enabled by default)
- Model Calibration: Calibration metrics calculated on test set for probability accuracy assessment
- Model Persistence: Trained models are saved to `models/xgboost_model.pkl` and calibration data to `models/calibration_data.pkl`
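One common way to compute per-probability-bin calibration figures of the kind stored alongside the model (the exact contents of `calibration_data.pkl` are an assumption here) is:

```python
import numpy as np

def calibration_by_bins(y_true, y_prob, n_bins=10):
    """Observed win rate per predicted-probability bin.

    Comparing bin centers against observed rates shows whether matches
    predicted at, say, ~70% actually resolve that way about 70% of the time.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Map each probability to a bin index in [0, n_bins - 1]
    idx = np.clip(np.digitize(y_prob, edges) - 1, 0, n_bins - 1)
    centers, rates = [], []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            centers.append((edges[b] + edges[b + 1]) / 2)
            rates.append(y_true[mask].mean())
    return centers, rates
```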
The project uses Python's logging module for all output. Log levels:
- INFO: Default level, shows progress and results
- WARNING: Warnings and non-critical issues
- ERROR: Errors during processing
- DEBUG: Detailed debugging information (use the `--quiet` flag to suppress)
Log format: `%(asctime)s - %(name)s - %(levelname)s - %(message)s`
Run the test suite with:

```
pytest tests/
```

The project follows PEP 8 style guidelines. Consider using black for code formatting:

```
pip install black
black src/
```

The data/ folder is organized into four main categories:
- `raw/`: Original, unprocessed data files
  - `team_results/`: Individual team match CSV files
  - `player_results/`: Player weekly statistics CSV files
  - `rankings/`: Team ranking files
- `preprocessed/`: Processed and intermediate data files
  - Cumulative statistics
  - Match features
  - Final feature sets ready for training
- `mappings/`: Metadata and mapping files
  - Team name to ID mappings
  - Map name to ID mappings
- `temp/`: Temporary fetched match data
  - JSON files containing scraped match data from HLTV
  - Auto-generated when using the `fetch` command
  - Can be safely deleted after use
The plots/ directory contains visualization outputs generated by the visualize command:
- `class_distribution.png`: Bar charts showing class distribution before and after balancing
- `confusion_matrix.png`: Confusion matrix heatmap showing model predictions
- `roc_curve.png`: ROC curve with AUC score
- `precision_recall_curve.png`: Precision-Recall curve
- `feature_importance.png`: Top 20 most important features
- `loss_curves.png`: Training and validation loss curves over boosting rounds
- `accuracy_curves.png`: Training and validation accuracy curves over boosting rounds
- `error_analysis.png`: Error analysis showing patterns in misclassified samples, including:
  - Error type distribution (false positives vs. false negatives)
  - Top features with the largest differences between correct and incorrect predictions
  - Confidence distribution comparison
  - Error rate by prediction probability bin
The models/ directory contains trained model artifacts:
- `xgboost_model.pkl`: Trained XGBoost model (saved after training)
- `calibration_data.pkl`: Model calibration metrics for probability accuracy assessment
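Both artifacts are plain pickle files; a round-trip sketch (the dict below stands in for a fitted model object, and the temporary path is used only for illustration):

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for a fitted XGBoost model; in the project this is the object
# written to models/xgboost_model.pkl by the train command.
artifact = {"note": "stand-in for a fitted model"}

path = Path(tempfile.mkdtemp()) / "xgboost_model.pkl"
with path.open("wb") as f:
    pickle.dump(artifact, f)

with path.open("rb") as f:
    reloaded = pickle.load(f)
```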
The `--bet` flag enables betting analysis for match predictions:
- Odds Format: Accepts American odds format (+100, -120, +150, etc.)
- Expected Value (EV): Calculates expected value based on model probabilities and calibration accuracy
- Expected Profit: Shows expected profit per unit bet
- Recommendations: Provides BET/AVOID recommendations for each team
- Best Bet: Identifies the most profitable betting option
Example:

```
python -m src.main predict --fetched-data data/temp/match_2388125.json --bet
# Prompts: Enter odds for Team A (e.g., +100 or -120): +100
# Prompts: Enter odds for Team B (e.g., +100 or -120): -130
# Output: Shows EV, expected profit, and betting recommendations for each map
```

See the LICENSE file for details.
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
- Data sources: HLTV team rankings and match results
- Player statistics from weekly performance data