Implement comprehensive overfitting simulation for high-dimensional linear models by Copilot · Pull Request #12 · gsaco/High_Dimensional_Linear_Models

Copilot · 2025-09-05T16:01:13Z

This PR implements a complete educational simulation demonstrating the overfitting phenomenon in high-dimensional linear models across three programming languages: Python, Julia, and R.

Overview

The simulation addresses the classic bias-variance tradeoff by fitting polynomial regression models with increasing complexity to a nonlinear data generating process. It clearly demonstrates how model performance on training data can be misleading when evaluating generalization capability.

Data Generating Process

The simulation uses the specified nonlinear relationship:

f(X) = exp(4 * X) - 1
Y = f(X) + ε, where ε ~ N(0, σ²)
n = 1000 observations
X ~ Uniform(-0.5, 0.5)
Intercept parameter = 0 (as specified)

Key Features

Multi-Language Implementation

simulation_python.ipynb: Complete Python implementation using scikit-learn
simulation_julia.ipynb: Native Julia implementation with efficient linear algebra
simulation_r.ipynb: R implementation using base R and ggplot2
simulation.ipynb: Main simulation file (Python version)

Comprehensive Analysis

Each notebook tests polynomial regression with 1, 2, 5, 10, 20, 50, 100, 200, 500, 1000 features and calculates:

Training R²: Goodness of fit on training data (always increases)
Adjusted R²: R² penalized for model complexity (peaks then decreases)
Out-of-sample R²: Performance on 25% held-out test data (demonstrates overfitting)

Educational Visualizations

Three separate plots in each language show:

Training R² vs number of features (monotonic increase)
Adjusted R² vs number of features (inverted U-shape)
Out-of-sample R² vs number of features (peaks at optimal complexity)

Example Results

The simulation consistently shows the expected overfitting pattern:

Training R² increases from ~0.55 to ~0.93 as features increase
Test R² peaks at around 5 features (~0.937) then stabilizes
Adjusted R² decreases significantly with high feature counts due to complexity penalty

Features | Train R² | Adj R²  | Test R² | Overfitting
------------------------------------------------------
       1 |   0.5457 |  0.5451 |  0.5341 |      0.0116
       5 |   0.9316 |  0.9311 |  0.9366 |     -0.0050  ← Optimal
      50 |   0.9337 |  0.9290 |  0.9353 |     -0.0016
     200 |   0.9337 |  0.9096 |  0.9353 |     -0.0016

Supporting Infrastructure

requirements.txt: Python dependencies for easy setup
test_simulation.py: Basic validation of the implementation
comprehensive_test.py: Detailed analysis with overfitting metrics
create_plots.py: Standalone visualization generator
.gitignore: Proper version control excluding generated files
Updated README.md: Comprehensive documentation and usage instructions

Educational Value

This implementation provides a hands-on demonstration of fundamental concepts:

Bias-variance tradeoff: How model complexity affects generalization
Overfitting: Why high training accuracy doesn't guarantee good performance
Model selection: The importance of validation data in choosing optimal complexity
Cross-language comparison: How the same statistical concepts translate across different programming environments

The simulation serves as an excellent teaching tool for courses in machine learning, econometrics, and high-dimensional statistics, providing concrete evidence of theoretical concepts through reproducible code examples.

💡 You can make Copilot smarter by setting up custom instructions, customizing its development environment and configuring Model Context Protocol (MCP) servers. Learn more Copilot coding agent tips in the docs.

…thon, Julia, and R Co-authored-by: gsaco <[email protected]>

Co-authored-by: gsaco <[email protected]>

Initial plan

31ae0ee

Copilot AI assigned Copilot and gsaco Sep 5, 2025

Copilot started work on behalf of gsaco September 5, 2025 16:01 View session

Copilot AI and others added 2 commits September 5, 2025 16:08

Implement high-dimensional linear models overfitting simulation in Py…

e70c137

…thon, Julia, and R Co-authored-by: gsaco <[email protected]>

Add comprehensive testing, visualization, and project setup files

108dcae

Co-authored-by: gsaco <[email protected]>

Copilot AI requested a review from gsaco September 5, 2025 16:12

Copilot finished work on behalf of gsaco September 5, 2025 16:12

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement comprehensive overfitting simulation for high-dimensional linear models#12

Implement comprehensive overfitting simulation for high-dimensional linear models#12
Copilot wants to merge 3 commits intomainfrom
copilot/fix-5d9a92b9-d323-4b34-9e78-b367d809f933

Copilot AI commented Sep 5, 2025 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Copilot AI commented Sep 5, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Overview

Data Generating Process

Key Features

Multi-Language Implementation

Comprehensive Analysis

Educational Visualizations

Example Results

Supporting Infrastructure

Educational Value

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Copilot AI commented Sep 5, 2025 •

edited

Loading