A machine learning project for predicting CS2 match outcomes using team statistics, player performance, and historical data.
```
emils-demos/
├── src/                      # Main source code package
│   ├── data/                 # Data preprocessing modules
│   │   ├── preprocess.py     # Main preprocessing pipeline
│   │   ├── team_mapping.py
│   │   ├── map_mapping.py
│   │   ├── cumulative_stats.py
│   │   ├── match_features.py
│   │   ├── final_features.py
│   │   ├── player_stats.py
│   │   └── utils.py
│   ├── run/                  # Training entrypoints
│   │   └── train.py          # Model training script
│   ├── eval/                 # Evaluation scripts and metrics
│   │   ├── metrics.py
│   │   ├── visualize.py
│   │   └── calibration.py
│   ├── inference/            # Match prediction and inference
│   │   ├── match_scraper.py
│   │   ├── historical_data.py
│   │   ├── fetch_data.py
│   │   ├── predict.py
│   │   └── betting.py
│   └── main.py               # Main CLI entrypoint
├── notebooks/                # Jupyter notebooks for exploration
│   ├── preprocess.ipynb
│   ├── model.ipynb
│   └── xgboost.ipynb
├── data/                     # Data directory (organized by type)
│   ├── raw/                  # Raw, unprocessed data
│   │   ├── team_results/     # Team match results CSV files
│   │   ├── player_results/   # Player weekly statistics CSV files
│   │   └── rankings/         # Ranking files
│   │       ├── hltv_team_rankings_original.csv
│   │       └── teams_peak_36.csv
│   ├── preprocessed/         # Processed data
│   │   ├── final_features.csv
│   │   ├── match_features.csv
│   │   ├── team_map_cumulative_stats.csv
│   │   └── team_opponent_cumulative_stats.csv
│   ├── mappings/             # Mapping/metadata files
│   │   ├── team_name_to_id.csv
│   │   └── map_name_to_id.csv
│   └── temp/                 # Temporary fetched match data (auto-generated)
├── models/                   # Trained models and calibration data
│   ├── xgboost_model.pkl
│   └── calibration_data.pkl
├── scripts/                  # One-off utility scripts
├── tests/                    # Unit and integration tests
├── plots/                    # Generated visualization plots
└── pyproject.toml            # Poetry configuration
```
The model uses the following features to predict match outcomes:
- Team vs Team Statistics: Cumulative wins/losses and win rate between two teams
- Map Performance: Each team's win rate and record on specific maps
- Global Rankings: HLTV ranking points for both teams at match time
- Win/Loss Streaks: Current winning and losing streaks for both teams
- Player Statistics: Overall rating, utility success, and opening rating for top 5 players per team (30 features total)
- Map One-Hot Encoding: One-hot encoded map IDs
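The mix of numeric statistics and a one-hot encoded map ID can be sketched as follows. The column names here are invented for illustration; the real schema is defined by the preprocessing pipeline in `src/data/`.

```python
import pandas as pd

# Hypothetical feature row for a single match -- column names are invented
# for this sketch; the real schema lives in final_features.csv.
match = {
    "h2h_wins_a": 3, "h2h_losses_a": 1,            # team-vs-team record
    "map_winrate_a": 0.62, "map_winrate_b": 0.55,  # per-map performance
    "ranking_points_a": 845, "ranking_points_b": 790,
    "win_streak_a": 4, "win_streak_b": 0,
    "map_id": 2,
}
row = pd.DataFrame([match])

# One-hot encode the map ID, as described in the feature list above
row = pd.get_dummies(row, columns=["map_id"], prefix="map")
```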
Using Poetry:

```
# Install Poetry if you haven't already
curl -sSL https://install.python-poetry.org | python3 -

# Install dependencies
poetry install

# Activate the virtual environment
poetry shell
```

Using pip:

```
# Create a virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install pandas numpy scikit-learn xgboost matplotlib seaborn cloudscraper beautifulsoup4 jupyter
```

The project provides a CLI for common tasks:
```
# Run the full preprocessing pipeline
python -m src.main preprocess

# Train the XGBoost model (includes evaluation metrics)
python -m src.main train --data data/preprocessed/final_features.csv

# Create visualizations (confusion matrix, ROC curve, feature importance)
python -m src.main visualize --data data/preprocessed/final_features.csv

# Fetch match data from an HLTV URL
python -m src.main fetch --url https://www.hltv.org/matches/2388125/spirit-vs-falcons-...

# Predict a match outcome from fetched data
python -m src.main predict --fetched-data data/temp/match_2388125.json

# Predict with betting analysis (prompts for odds)
python -m src.main predict --fetched-data data/temp/match_2388125.json --bet

# Predict a match outcome manually
python -m src.main predict --team-a "Team Spirit" --team-b "Team Falcons" --map "Mirage" --date "2025-01-15"

# Show project info
python -m src.main about
```

Preprocess Command:

```
python -m src.main preprocess [--project-root PATH] [--quiet]
```

Train Command:

```
python -m src.main train --data PATH [--no-balance] [--no-tuning] [--n-iter N] [--seed N] [--quiet]
```

Visualize Command:

```
python -m src.main visualize --data PATH [--output-dir DIR] [--no-train] [--model-path PATH] [--seed N] [--quiet] [--show]
```

Fetch Command:

```
# Fetch match data from an HLTV URL
python -m src.main fetch --url URL [--output PATH] [--quiet]
```

The fetch command:
- Scrapes match information and historical data from HLTV
- Saves fetched data to the `data/temp/` directory by default
- Outputs the full predict command to run on the fetched data
Predict Command:

```
# Using fetched data (recommended workflow)
python -m src.main predict --fetched-data PATH [--model-path PATH] [--bet] [--quiet]

# Using an HLTV URL (automatically extracts match info)
python -m src.main predict --url URL [--model-path PATH] [--bet] [--quiet]

# Manual specification
python -m src.main predict --team-a TEAM_A --team-b TEAM_B --map MAP_NAME --date YYYY-MM-DD [--model-path PATH] [--bet] [--quiet]
```

The predict command:
- Computes features for the match using historical data
- Loads a trained model and generates predictions
- Returns win probabilities for both teams and the predicted winner
- Shows model calibration accuracy for each prediction
- With the `--bet` flag: prompts for betting odds (American format: +100, -120, etc.) and calculates:
  - Expected Value (EV) for each team
  - Expected profit per unit bet
  - A betting recommendation (BET/AVOID)
  - The best available bet
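The EV arithmetic behind those outputs can be sketched like this. It is a minimal illustration of expected value from American odds; the project's `betting.py` may additionally weigh calibration accuracy into the probability.

```python
def american_to_decimal(odds: int) -> float:
    """Convert American odds (+100, -120, ...) to decimal odds."""
    return 1 + odds / 100 if odds > 0 else 1 + 100 / abs(odds)

def expected_value(p_win: float, odds: int) -> float:
    """Expected profit per unit staked, given a model win probability."""
    net_payout = american_to_decimal(odds) - 1  # profit if the bet wins
    return p_win * net_payout - (1 - p_win)

# A 60% model probability at +100 odds is a positive-EV bet:
# EV = 0.60 * 1.0 - 0.40 = +0.20 per unit staked
ev = expected_value(0.60, +100)
recommendation = "BET" if ev > 0 else "AVOID"
```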
The visualize command creates:
- Class distribution bar charts (before and after balancing)
- Confusion matrix heatmap
- ROC curve
- Precision-Recall curve
- Feature importance plot (top 20 features)
- Loss curves (training and validation log loss over boosting rounds)
- Accuracy curves (training and validation accuracy over boosting rounds)
Here's a complete workflow to preprocess data and train a model:
```
# Step 1: Run the preprocessing pipeline
# This creates all necessary intermediate files and final_features.csv
# Output: files in data/preprocessed/ and data/mappings/
python -m src.main preprocess

# Step 2: Train the model with hyperparameter tuning
# This performs RandomizedSearchCV to find the best hyperparameters, then trains the final model
# Output: trained model and evaluation metrics on the test set
python -m src.main train --data data/preprocessed/final_features.csv

# Step 3: Create visualizations
# This generates plots for the confusion matrix, ROC curve, precision-recall curve, and feature importance
# Output: PNG files saved to the plots/ directory
python -m src.main visualize --data data/preprocessed/final_features.csv
```

Note: The train command automatically evaluates the model on the test set and displays metrics after training. The visualize command can train a model or use an existing one to generate visualizations.
To test the complete pipeline from scratch:
```
# 1. Verify the data structure exists
ls -la data/raw/team_results/    # Should show team CSV files
ls -la data/raw/player_results/  # Should show player stats CSV files
ls -la data/raw/rankings/        # Should show ranking CSV files

# 2. Run preprocessing (this may take a few minutes)
python -m src.main preprocess

# Verify preprocessing outputs
ls -la data/preprocessed/  # Should show processed CSV files
ls -la data/mappings/      # Should show mapping CSV files

# 3. Train the model (this may take several minutes due to hyperparameter tuning)
# The train command automatically evaluates and displays metrics
python -m src.main train --data data/preprocessed/final_features.csv

# 4. (Optional) Train without hyperparameter tuning for faster testing
python -m src.main train --data data/preprocessed/final_features.csv --no-tuning

# 5. (Optional) Train without class balancing
python -m src.main train --data data/preprocessed/final_features.csv --no-balance

# 6. Fetch match data from HLTV
python -m src.main fetch --url https://www.hltv.org/matches/2388125/spirit-vs-falcons-...

# 7. Make predictions
python -m src.main predict --fetched-data data/temp/match_2388125.json

# 8. Make predictions with betting analysis
python -m src.main predict --fetched-data data/temp/match_2388125.json --bet
```

You can also use the modules directly in Python:
```python
from pathlib import Path

from src.data.preprocess import run_preprocessing_pipeline
from src.run.train import load_and_prepare_data, train_xgboost_model
from src.eval.metrics import evaluate_model

# Run preprocessing
run_preprocessing_pipeline()

# Load data and train the model
X_train, X_test, y_train, y_test = load_and_prepare_data(
    data_file=Path("data/preprocessed/final_features.csv"),
    balance_classes=True,
)
model, best_params = train_xgboost_model(
    X_train=X_train,
    y_train=y_train,
    X_test=X_test,
    y_test=y_test,
    hyperparameter_tuning=True,
)

# Evaluate
y_pred = model.predict(X_test)
evaluate_model(y_test.values, y_pred)
```

The notebooks in the notebooks/ directory provide interactive exploration:
- `preprocess.ipynb`: Data preprocessing and feature engineering
- `model.ipynb`: Neural network model training
- `xgboost.ipynb`: XGBoost model training with hyperparameter tuning
The preprocessing pipeline consists of five main steps:
1. Team Mapping: Create the team-name-to-ID mapping from `data/raw/rankings/teams_peak_36.csv`
2. Map Mapping: Extract the map-name-to-ID mapping from `data/raw/team_results/`
3. Opponent Statistics: Calculate cumulative wins/losses for each team against each opponent
4. Map Statistics: Calculate cumulative wins/losses for each team on each map
5. Match Features: Combine all statistics with rankings, streaks, and player stats to create the feature dataset
All intermediate files are saved to data/preprocessed/ and mapping files to data/mappings/.
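The cumulative-statistics steps amount to counting each team's prior results without letting a row see its own outcome. A minimal sketch with pandas, using toy data and invented column names:

```python
import pandas as pd

# Toy match log in chronological order; real inputs come from data/raw/team_results/
df = pd.DataFrame({
    "team":     ["A", "A", "A", "B"],
    "opponent": ["B", "B", "C", "A"],
    "won":      [1,   0,   1,   1],
})

# Cumulative wins against each opponent BEFORE the current match:
# subtracting the current result from the running total avoids leaking
# the match's own outcome into its features.
cum = df.groupby(["team", "opponent"])["won"].cumsum()
df["cum_wins_vs_opp"] = cum - df["won"]
```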
The project uses XGBoost (Gradient Boosting) for match prediction:
- Objective: Binary classification (team A wins vs loses)
- Hyperparameter Tuning: RandomizedSearchCV with 50 random combinations
- Evaluation: Accuracy, Precision, Recall, F1-Score, Confusion Matrix, ROC AUC, Precision-Recall AUC
- Validation: 80/20 train/test split for model evaluation
- Class Balancing: Optional undersampling to balance classes (enabled by default)
- Model Calibration: Calibration metrics calculated on test set for probability accuracy assessment
- Model Persistence: Trained models are saved to `models/xgboost_model.pkl` and calibration data to `models/calibration_data.pkl`
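One common way to compute per-probability-bin calibration figures of the kind stored alongside the model (the exact contents of `calibration_data.pkl` are an assumption here) is:

```python
import numpy as np

def calibration_by_bins(y_true, y_prob, n_bins=10):
    """Observed win rate per predicted-probability bin.

    Comparing bin centers against observed rates shows whether matches
    predicted at, say, ~70% actually resolve that way about 70% of the time.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Map each probability to a bin index in [0, n_bins - 1]
    idx = np.clip(np.digitize(y_prob, edges) - 1, 0, n_bins - 1)
    centers, rates = [], []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            centers.append((edges[b] + edges[b + 1]) / 2)
            rates.append(y_true[mask].mean())
    return centers, rates
```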
The project uses Python's logging module for all output. Log levels:
- INFO: Default level, shows progress and results
- WARNING: Warnings and non-critical issues
- ERROR: Errors during processing
- DEBUG: Detailed debugging information (use the `--quiet` flag to suppress)
Log format: `%(asctime)s - %(name)s - %(levelname)s - %(message)s`
Run the test suite with:

```
pytest tests/
```

The project follows PEP 8 style guidelines. Consider using black for code formatting:

```
pip install black
black src/
```

The data/ folder is organized into four main categories:
- `raw/`: Original, unprocessed data files
  - `team_results/`: Individual team match CSV files
  - `player_results/`: Player weekly statistics CSV files
  - `rankings/`: Team ranking files
- `preprocessed/`: Processed and intermediate data files
  - Cumulative statistics
  - Match features
  - Final feature sets ready for training
- `mappings/`: Metadata and mapping files
  - Team name to ID mappings
  - Map name to ID mappings
- `temp/`: Temporary fetched match data
  - JSON files containing scraped match data from HLTV
  - Auto-generated when using the `fetch` command
  - Can be safely deleted after use
The plots/ directory contains visualization outputs generated by the visualize command:
- `class_distribution.png`: Bar charts showing class distribution before and after balancing
- `confusion_matrix.png`: Confusion matrix heatmap showing model predictions
- `roc_curve.png`: ROC curve with AUC score
- `precision_recall_curve.png`: Precision-Recall curve
- `feature_importance.png`: Top 20 most important features
- `loss_curves.png`: Training and validation loss curves over boosting rounds
- `accuracy_curves.png`: Training and validation accuracy curves over boosting rounds
- `error_analysis.png`: Error analysis showing patterns in misclassified samples, including:
  - Error type distribution (false positives vs. false negatives)
  - Top features with the largest differences between correct and incorrect predictions
  - Confidence distribution comparison
  - Error rate by prediction probability bin
The models/ directory contains trained model artifacts:
- `xgboost_model.pkl`: Trained XGBoost model (saved after training)
- `calibration_data.pkl`: Model calibration metrics for probability accuracy assessment
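Both artifacts are plain pickle files; a round-trip sketch (the dict below stands in for a fitted model object, and the temporary path is used only for illustration):

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for a fitted XGBoost model; in the project this is the object
# written to models/xgboost_model.pkl by the train command.
artifact = {"note": "stand-in for a fitted model"}

path = Path(tempfile.mkdtemp()) / "xgboost_model.pkl"
with path.open("wb") as f:
    pickle.dump(artifact, f)

with path.open("rb") as f:
    reloaded = pickle.load(f)
```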
The `--bet` flag enables betting analysis for match predictions:
- Odds Format: Accepts American odds format (+100, -120, +150, etc.)
- Expected Value (EV): Calculates expected value based on model probabilities and calibration accuracy
- Expected Profit: Shows expected profit per unit bet
- Recommendations: Provides BET/AVOID recommendations for each team
- Best Bet: Identifies the most profitable betting option
Example:

```
python -m src.main predict --fetched-data data/temp/match_2388125.json --bet
# Prompts: Enter odds for Team A (e.g., +100 or -120): +100
# Prompts: Enter odds for Team B (e.g., +100 or -120): -130
# Output: Shows EV, expected profit, and betting recommendations for each map
```

See the LICENSE file for details.
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
- Data sources: HLTV team rankings and match results
- Player statistics from weekly performance data