Skip to content

Implement correct overfitting analysis with exponential DGP and no intercept#10

Draft
Copilot wants to merge 2 commits intogabriel-sacofrom
copilot/fix-33ef632e-7111-4dba-9cb6-bbb3ca63d134
Draft

Implement correct overfitting analysis with exponential DGP and no intercept#10
Copilot wants to merge 2 commits intogabriel-sacofrom
copilot/fix-33ef632e-7111-4dba-9cb6-bbb3ca63d134

Conversation

Copy link
Contributor

Copilot AI commented Sep 5, 2025

This PR implements the overfitting analysis as specified in the problem statement, replacing previous implementations with the correct data generating process and model specifications.

Key Changes

Data Generating Process: Implemented the specified DGP y = np.exp(4 * W) + e without intercept, where:

  • W ~ Uniform(0,1), sorted, n=1000 observations
  • e ~ Normal(0,1)
  • Uses seed=42 for reproducibility

Model Estimation: Uses LinearRegression(fit_intercept=False) to exclude intercept from polynomial regression models as required.

Feature Engineering: Creates polynomial features W¹, W², W³, ..., Wᵏ for k ∈ {1, 2, 5, 10, 20, 50, 100, 200, 500, 1000}.

Metrics Calculation: For each model complexity:

  • R² on full sample (in-sample fit quality)
  • Adjusted R² on full sample (complexity-penalized fit)
  • Out-of-sample R² using 75%/25% train/test split (generalization performance)

Visualization: Three separate plots showing each R² metric versus number of features, demonstrating:

  • R² (full sample): Monotonic increase with model complexity
  • Adjusted R²: Peaks at optimal complexity (~20 features) then declines
  • Out-of-sample R²: Classic inverted U-shape showing overfitting at high complexity

Results

The analysis successfully demonstrates the bias-variance tradeoff:

  • Simple models (1-2 features) underfit the exponential relationship
  • Moderate complexity (5-50 features) captures the pattern well
  • High complexity (500+ features) severely overfits, with out-of-sample R² becoming negative

Optimal model complexity: 20 features by both Adjusted R² (0.9949) and out-of-sample R² (0.9959) criteria.

File Changes

  • Replaced Python/scripts/part2_overfitting.ipynb with correct implementation
  • Removed duplicate/incorrect notebooks: part2_overfitting.py, part2_overfitting_corrected.ipynb, part2_overfitting_corrected_new.ipynb
  • Cleaned repository of unnecessary files as requested

The notebook is fully functional, tested, and produces the expected overfitting patterns demonstrating the fundamental machine learning concept of model complexity versus generalization performance.


💡 You can make Copilot smarter by setting up custom instructions, customizing its development environment and configuring Model Context Protocol (MCP) servers. Learn more Copilot coding agent tips in the docs.

Copilot AI changed the title [WIP] Usa estas librerias: import numpy as np import pandas as pd import matplotlib.pyplot as plt import seaborn as sns from sklearn.linear_model import LinearRegression from sklearn.model_selection import train_test_split from sklearn.metrics import r2_scor... Implement correct overfitting analysis with exponential DGP and no intercept Sep 5, 2025
Copilot AI requested a review from gsaco September 5, 2025 15:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants