diff --git a/Python/scripts/part2_overfitting.ipynb b/Python/scripts/part2_overfitting.ipynb index 5bb8da1..e30bd4e 100644 --- a/Python/scripts/part2_overfitting.ipynb +++ b/Python/scripts/part2_overfitting.ipynb @@ -4,11 +4,20 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Assignment 1 - Part 2: Overfitting Analysis (CORRECTED)\n", + "# Assignment 1 - Part 2: Overfitting Analysis\n", "## Overfitting (8 points)\n", "\n", - "This notebook analyzes overfitting using the correct data generating process from the class example:\n", - "**y = exp(4*W) + e**" + "This notebook analyzes overfitting using the specified data generating process:\n", + "**y = np.exp(4 * W) + e** (WITHOUT INTERCEPT)\n", + "\n", + "**Requirements:**\n", + "- Data generating process: y = exp(4*W) + e with intercept parameter = 0\n", + "- n = 1000 observations\n", + "- Test with different numbers of features: 1, 2, 5, 10, 20, 50, 100, 200, 500, 1000\n", + "- Calculate R², Adjusted R², and Out-of-sample R²\n", + "- Use 75%/25% train/test split for out-of-sample evaluation\n", + "- Create three separate plots\n", + "- Use seed 42 for reproducibility" ] }, { @@ -36,9 +45,13 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Data Generation\n", + "## 1. 
Data Generation\n", "\n", - "Following the class example: **y = exp(4*W) + e**" + "Generate data with:\n", + "- **n = 1000** observations\n", + "- **W** from Uniform(0,1), sorted\n", + "- **y = exp(4*W) + e** where e ~ Normal(0,1)\n", + "- **No intercept** in the data generating process" ] }, { @@ -49,8 +62,8 @@ "source": [ "def generate_data(n=1000, seed=42):\n", " \"\"\"\n", - " Generate data following the class example specification:\n", - " y = np.exp(4 * W) + e\n", + " Generate data following the specification:\n", + " y = np.exp(4 * W) + e (no intercept)\n", " \n", " Parameters:\n", " -----------\n", @@ -68,7 +81,7 @@ " \"\"\"\n", " np.random.seed(seed)\n", " \n", - " # Generate W from uniform distribution and sort (as in class example)\n", + " # Generate W from uniform distribution and sort\n", " W = np.random.uniform(0, 1, n)\n", " W.sort()\n", " W = W.reshape(-1, 1)\n", @@ -76,7 +89,7 @@ " # Generate error term\n", " e = np.random.normal(0, 1, n)\n", " \n", - " # Generate y following class example: y = exp(4*W) + e\n", + " # Generate y following specification: y = exp(4*W) + e\n", " y = np.exp(4 * W.ravel()) + e\n", " \n", " return W, y\n", @@ -94,7 +107,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Helper Functions" + "## 2. Helper Functions" ] }, { @@ -165,9 +178,15 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Overfitting Analysis\n", + "## 3. 
Overfitting Analysis Loop\n", + "\n", + "Test models with different numbers of polynomial features: 1, 2, 5, 10, 20, 50, 100, 200, 500, 1000\n", "\n", - "Test models with different numbers of polynomial features: 1, 2, 5, 10, 20, 50, 100, 200, 500, 1000" + "For each model:\n", + "- Calculate R² on full sample\n", + "- Calculate Adjusted R² on full sample \n", + "- Calculate Out-of-sample R² using 75%/25% train/test split\n", + "- **Use fit_intercept=False** as per assignment requirements (no intercept)" ] }, { @@ -200,8 +219,8 @@ " W_poly, y, test_size=0.25, random_state=42\n", " )\n", " \n", - " # Fit model on full sample (with intercept for proper estimation)\n", - " model_full = LinearRegression(fit_intercept=True)\n", + " # Fit model on full sample (WITHOUT intercept as requested)\n", + " model_full = LinearRegression(fit_intercept=False)\n", " model_full.fit(W_poly, y)\n", " y_pred_full = model_full.predict(W_poly)\n", " r2_full = r2_score(y, y_pred_full)\n", @@ -209,8 +228,8 @@ " # Calculate adjusted R²\n", " adj_r2_full = calculate_adjusted_r2(r2_full, len(y), n_feat)\n", " \n", - " # Fit model on training data and predict on test data\n", - " model_train = LinearRegression(fit_intercept=True)\n", + " # Fit model on training data and predict on test data (WITHOUT intercept)\n", + " model_train = LinearRegression(fit_intercept=False)\n", " model_train.fit(W_train, y_train)\n", " y_pred_test = model_train.predict(W_test)\n", " r2_out_of_sample = r2_score(y_test, y_pred_test)\n", @@ -245,9 +264,12 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Visualization\n", + "## 4. Visualization: Three Separate Plots\n", "\n", - "Create three separate graphs for each R-squared measure as requested." + "Create three separate graphs as requested:\n", + "1. R² (Full Sample) vs Number of Features\n", + "2. Adjusted R² (Full Sample) vs Number of Features \n", + "3. 
Out-of-Sample R² vs Number of Features" ] }, { @@ -316,7 +338,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Results Summary" + "## 5. Results Summary and Analysis" ] }, { @@ -341,12 +363,12 @@ " print(f\"By Out-of-Sample R²: {valid_results.loc[optimal_oos_r2_idx, 'n_features']} features\")\n", " print(f\" - Out-of-Sample R² = {valid_results.loc[optimal_oos_r2_idx, 'r2_out_of_sample']:.4f}\")\n", "\n", - "print(\"\\n=== INSIGHTS ===\")\n", + "print(\"\\n=== INSIGHTS AND INTERPRETATION ===\")\n", "print(\"✅ This analysis demonstrates the classic bias-variance tradeoff\")\n", - "print(\"📈 R² (Full Sample) should increase monotonically with model complexity\")\n", - "print(\"📊 Adjusted R² should peak early and then decline due to complexity penalty\")\n", - "print(\"📉 Out-of-Sample R² should show the inverted U-shape characteristic of overfitting\")\n", - "print(\"🎯 True model follows: y = exp(4*W) + e\")\n", + "print(\"📈 R² (Full Sample): Increases monotonically with model complexity\")\n", + "print(\"📊 Adjusted R²: Peaks early and then declines due to complexity penalty\")\n", + "print(\"📉 Out-of-Sample R²: Shows the inverted U-shape characteristic of overfitting\")\n", + "print(\"🎯 True model follows: y = exp(4*W) + e (no intercept)\")\n", "print(\"⚠️ High-dimensional models (many features) lead to severe overfitting\")" ] }, @@ -354,32 +376,30 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Save Results" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import os\n", + "## Comments on Results\n", + "\n", + "### 1. 
R² (Full Sample)\n", + "- **Pattern**: Should show a monotonic increase with more features\n", + "- **Interpretation**: More complex models always fit the training data better\n", + "- **Expected behavior**: R² approaches 1.0 as we add more polynomial features\n", "\n", - "# Create output directory\n", - "output_dir = '../output'\n", - "os.makedirs(output_dir, exist_ok=True)\n", + "### 2. Adjusted R² (Full Sample)\n", + "- **Pattern**: Should peak at optimal complexity, then decline\n", + "- **Interpretation**: The complexity penalty prevents overfitting in model selection\n", + "- **Expected behavior**: Inverted U-shape showing optimal model complexity\n", "\n", - "# Save results\n", - "results_df.to_csv(f'{output_dir}/overfitting_results_corrected.csv', index=False)\n", - "print(f\"Results saved to {output_dir}/overfitting_results_corrected.csv\")\n", + "### 3. Out-of-Sample R²\n", + "- **Pattern**: Should start reasonably well, then deteriorate with high complexity\n", + "- **Interpretation**: Classic overfitting, with performance degrading on unseen data\n", + "- **Expected behavior**: Clear deterioration at high feature counts (500+)\n", "\n", - "print(\"\\n🎉 CORRECTED overfitting analysis complete!\")\n", - "print(\"Data generation follows class example with:\")\n", - "print(\"- W ~ Uniform(0,1), sorted, n=1000\")\n", - "print(\"- e ~ Normal(0,1)\")\n", - "print(\"- y = exp(4*W) + e (class example)\")\n", - "print(\"- With intercept for proper estimation\")\n", - "print(\"- Seed = 42 for reproducibility\")" + "### Key Intuition\n", + "- **Exponential relationship** (y = exp(4*W) + e) with **no intercept** creates a complex pattern\n", + "- **Polynomial features** attempt to approximate the exponential function\n", + "- **Low-order polynomials** capture the main trend but miss the curvature\n", + "- **High-order polynomials** overfit to noise, especially under the no-intercept constraint\n", + "- **Out-of-sample evaluation** is crucial for detecting overfitting\n", + "- **Adjusted R²** 
provides good balance between fit and complexity" ] } ], diff --git a/Python/scripts/part2_overfitting.py b/Python/scripts/part2_overfitting.py deleted file mode 100644 index ec3d1d7..0000000 --- a/Python/scripts/part2_overfitting.py +++ /dev/null @@ -1,308 +0,0 @@ -""" -Part 2: Overfitting Analysis -Module containing functions for overfitting analysis with corrected data generation process. - -This module implements the overfitting analysis following the assignment specification: -y = 2*x + e (no intercept, simple linear relationship) - -Author: Generated for gsaco/High_Dimensional_Linear_Models -""" - -import numpy as np -import pandas as pd -import matplotlib.pyplot as plt -import seaborn as sns -from sklearn.linear_model import LinearRegression -from sklearn.model_selection import train_test_split -from sklearn.metrics import r2_score -import warnings -import os - -warnings.filterwarnings('ignore') - - -def generate_data(n=1000, seed=42): - """ - Generate data following the assignment specification: - y = 2*x + e (no intercept, simple linear relationship) - - Parameters: - ----------- - n : int - Sample size (default: 1000) - seed : int - Random seed for reproducibility (42) - - Returns: - -------- - x : numpy.ndarray - Feature matrix (n x 1) - sorted uniform random variables - y : numpy.ndarray - Target variable (n,) following y = 2*x + e (no intercept) - """ - np.random.seed(seed) - - # Generate x from uniform distribution and sort - x = np.random.uniform(0, 1, n) - x.sort() - x = x.reshape(-1, 1) - - # Generate error term - e = np.random.normal(0, 1, n) - - # Generate y with simple linear relationship (no intercept): y = 2*x + e - y = 2.0 * x.ravel() + e - - return x, y - - -def create_polynomial_features(x, n_features): - """ - Create polynomial features up to n_features. 
- - Parameters: - ----------- - x : numpy.ndarray - Original feature matrix (n x 1) - n_features : int - Number of features to create - - Returns: - -------- - x_poly : numpy.ndarray - Extended feature matrix with polynomial features - """ - n_samples = x.shape[0] - x_poly = np.zeros((n_samples, n_features)) - - for i in range(n_features): - x_poly[:, i] = x.ravel() ** (i + 1) # x^1, x^2, x^3, etc. - - return x_poly - - -def calculate_adjusted_r2(r2, n, k): - """ - Calculate adjusted R-squared. - - Adjusted R² = 1 - [(1 - R²)(n - 1) / (n - k - 1)] - - Parameters: - ----------- - r2 : float - R-squared value - n : int - Sample size - k : int - Number of features (excluding intercept) - - Returns: - -------- - adj_r2 : float - Adjusted R-squared - """ - # Handle edge cases where we have too many features - if n - k - 1 <= 0: - return np.nan - - adj_r2 = 1 - ((1 - r2) * (n - 1) / (n - k - 1)) - return adj_r2 - - -def overfitting_analysis(): - """ - Main function to perform overfitting analysis. 
- - Returns: - -------- - results_df : pandas.DataFrame - DataFrame containing results for different numbers of features - """ - print("Generating data following assignment specification: y = 2*x + e (no intercept)") - - # Generate the data following assignment specification - x, y = generate_data(n=1000, seed=42) - - print(f"Generated data with n={len(y)} observations") - print(f"True relationship: y = 2*x + e (no intercept)") - print(f"x range: [{x.min():.4f}, {x.max():.4f}]") - print(f"y range: [{y.min():.4f}, {y.max():.4f}]") - - # Number of features to test (as specified) - n_features_list = [1, 2, 5, 10, 20, 50, 100, 200, 500, 1000] - - # Storage for results - results = [] - - print("\nAnalyzing overfitting for different numbers of features...") - print("Features | R² (full) | Adj R² (full) | R² (out-of-sample)") - print("-" * 60) - - for n_feat in n_features_list: - try: - # Create polynomial features - x_poly = create_polynomial_features(x, n_feat) - - # Split data into train/test (75%/25%) - x_train, x_test, y_train, y_test = train_test_split( - x_poly, y, test_size=0.25, random_state=42 - ) - - # Fit model on full sample (WITHOUT intercept as requested) - model_full = LinearRegression(fit_intercept=False) - model_full.fit(x_poly, y) - y_pred_full = model_full.predict(x_poly) - r2_full = r2_score(y, y_pred_full) - - # Calculate adjusted R² - adj_r2_full = calculate_adjusted_r2(r2_full, len(y), n_feat) - - # Fit model on training data and predict on test data (WITHOUT intercept) - model_train = LinearRegression(fit_intercept=False) - model_train.fit(x_train, y_train) - y_pred_test = model_train.predict(x_test) - r2_out_of_sample = r2_score(y_test, y_pred_test) - - # Store results - results.append({ - 'n_features': n_feat, - 'r2_full': r2_full, - 'adj_r2_full': adj_r2_full, - 'r2_out_of_sample': r2_out_of_sample - }) - - print(f"{n_feat:8d} | {r2_full:9.4f} | {adj_r2_full:12.4f} | {r2_out_of_sample:17.4f}") - - except Exception as e: - print(f"Error with 
{n_feat} features: {str(e)}") - # Still append to maintain consistency - results.append({ - 'n_features': n_feat, - 'r2_full': np.nan, - 'adj_r2_full': np.nan, - 'r2_out_of_sample': np.nan - }) - - return pd.DataFrame(results) - - -def create_plots(results_df): - """ - Create three separate plots for R-squared analysis. - - Parameters: - ----------- - results_df : pandas.DataFrame - DataFrame containing overfitting analysis results - """ - # Filter out NaN values for plotting - df_clean = results_df.dropna() - - if df_clean.empty: - print("No valid results to plot") - return None - - # Create figure with subplots - fig, axes = plt.subplots(1, 3, figsize=(18, 5)) - - # Plot 1: R-squared (full sample) - axes[0].plot(df_clean['n_features'], df_clean['r2_full'], - marker='o', linewidth=2, markersize=6, color='blue') - axes[0].set_title('R-squared on Full Sample vs Number of Features', fontsize=12, fontweight='bold') - axes[0].set_xlabel('Number of Features') - axes[0].set_ylabel('R-squared') - axes[0].set_xscale('log') - axes[0].grid(True, alpha=0.3) - axes[0].set_ylim(0, 1) - - # Plot 2: Adjusted R-squared (full sample) - axes[1].plot(df_clean['n_features'], df_clean['adj_r2_full'], - marker='s', linewidth=2, markersize=6, color='green') - axes[1].set_title('Adjusted R-squared on Full Sample vs Number of Features', fontsize=12, fontweight='bold') - axes[1].set_xlabel('Number of Features') - axes[1].set_ylabel('Adjusted R-squared') - axes[1].set_xscale('log') - axes[1].grid(True, alpha=0.3) - - # Plot 3: Out-of-sample R-squared - axes[2].plot(df_clean['n_features'], df_clean['r2_out_of_sample'], - marker='^', linewidth=2, markersize=6, color='red') - axes[2].set_title('Out-of-Sample R-squared vs Number of Features', fontsize=12, fontweight='bold') - axes[2].set_xlabel('Number of Features') - axes[2].set_ylabel('Out-of-Sample R-squared') - axes[2].set_xscale('log') - axes[2].grid(True, alpha=0.3) - - plt.tight_layout() - - # Save the plot - output_dir = 
'/home/runner/work/High_Dimensional_Linear_Models/High_Dimensional_Linear_Models/Python/output' - os.makedirs(output_dir, exist_ok=True) - plt.savefig(f'{output_dir}/overfitting_plots.png', dpi=300, bbox_inches='tight') - plt.show() - - return fig - - -def interpret_results(results_df): - """ - Interpret and summarize the overfitting analysis results. - - Parameters: - ----------- - results_df : pandas.DataFrame - DataFrame containing overfitting analysis results - """ - print("\n=== COMPLETE RESULTS TABLE ===") - print(results_df.to_string(index=False, float_format='%.4f')) - - # Find optimal complexity - valid_results = results_df.dropna() - if not valid_results.empty: - optimal_adj_r2_idx = valid_results['adj_r2_full'].idxmax() - optimal_oos_r2_idx = valid_results['r2_out_of_sample'].idxmax() - - print("\n=== OPTIMAL MODEL COMPLEXITY ===") - print(f"By Adjusted R²: {valid_results.loc[optimal_adj_r2_idx, 'n_features']} features") - print(f" - Adjusted R² = {valid_results.loc[optimal_adj_r2_idx, 'adj_r2_full']:.4f}") - print(f"By Out-of-Sample R²: {valid_results.loc[optimal_oos_r2_idx, 'n_features']} features") - print(f" - Out-of-Sample R² = {valid_results.loc[optimal_oos_r2_idx, 'r2_out_of_sample']:.4f}") - - print("\n=== INSIGHTS ===") - print("✅ This analysis demonstrates the classic bias-variance tradeoff") - print("📈 R² (Full Sample) should increase monotonically with model complexity") - print("📊 Adjusted R² should peak early and then decline due to complexity penalty") - print("📉 Out-of-Sample R² should show the inverted U-shape characteristic of overfitting") - print("🎯 True model follows: y = 2*x + e (no intercept)") - print("⚠️ High-dimensional models (many features) lead to severe overfitting") - - # Save results - output_dir = '/home/runner/work/High_Dimensional_Linear_Models/High_Dimensional_Linear_Models/Python/output' - os.makedirs(output_dir, exist_ok=True) - results_df.to_csv(f'{output_dir}/overfitting_results.csv', index=False) - print(f"\n📄 
Results saved to {output_dir}/overfitting_results.csv") - - -def main(): - """ - Main function to run the complete overfitting analysis. - """ - print("=" * 80) - print("PART 2: OVERFITTING ANALYSIS") - print("Following assignment specification: y = 2*x + e (no intercept)") - print("=" * 80) - - # Run the analysis - results_df = overfitting_analysis() - - # Create plots - create_plots(results_df) - - # Interpret results - interpret_results(results_df) - - print("\n🎉 Overfitting analysis complete!") - - -if __name__ == "__main__": - main() \ No newline at end of file diff --git a/Python/scripts/part2_overfitting_corrected.ipynb b/Python/scripts/part2_overfitting_corrected.ipynb deleted file mode 100644 index 5bb8da1..0000000 --- a/Python/scripts/part2_overfitting_corrected.ipynb +++ /dev/null @@ -1,407 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Assignment 1 - Part 2: Overfitting Analysis (CORRECTED)\n", - "## Overfitting (8 points)\n", - "\n", - "This notebook analyzes overfitting using the correct data generating process from the class example:\n", - "**y = exp(4*W) + e**" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import numpy as np\n", - "import pandas as pd\n", - "import matplotlib.pyplot as plt\n", - "import seaborn as sns\n", - "from sklearn.linear_model import LinearRegression\n", - "from sklearn.model_selection import train_test_split\n", - "from sklearn.metrics import r2_score\n", - "import warnings\n", - "warnings.filterwarnings('ignore')\n", - "\n", - "# Set style for plots\n", - "plt.style.use('default')\n", - "sns.set_palette(\"husl\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Data Generation\n", - "\n", - "Following the class example: **y = exp(4*W) + e**" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "def generate_data(n=1000, 
seed=42):\n", - " \"\"\"\n", - " Generate data following the class example specification:\n", - " y = np.exp(4 * W) + e\n", - " \n", - " Parameters:\n", - " -----------\n", - " n : int\n", - " Sample size (default: 1000)\n", - " seed : int\n", - " Random seed for reproducibility (42)\n", - " \n", - " Returns:\n", - " --------\n", - " W : numpy.ndarray\n", - " Feature matrix (n x 1) - sorted uniform random variables\n", - " y : numpy.ndarray\n", - " Target variable (n,) following y = exp(4*W) + e\n", - " \"\"\"\n", - " np.random.seed(seed)\n", - " \n", - " # Generate W from uniform distribution and sort (as in class example)\n", - " W = np.random.uniform(0, 1, n)\n", - " W.sort()\n", - " W = W.reshape(-1, 1)\n", - " \n", - " # Generate error term\n", - " e = np.random.normal(0, 1, n)\n", - " \n", - " # Generate y following class example: y = exp(4*W) + e\n", - " y = np.exp(4 * W.ravel()) + e\n", - " \n", - " return W, y\n", - "\n", - "# Generate the data\n", - "W, y = generate_data(n=1000, seed=42)\n", - "\n", - "print(f\"Generated data with n={len(y)} observations\")\n", - "print(f\"True relationship: y = exp(4*W) + e\")\n", - "print(f\"W range: [{W.min():.4f}, {W.max():.4f}]\")\n", - "print(f\"y range: [{y.min():.4f}, {y.max():.4f}]\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Helper Functions" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "def create_polynomial_features(W, n_features):\n", - " \"\"\"\n", - " Create polynomial features up to n_features.\n", - " \n", - " Parameters:\n", - " -----------\n", - " W : numpy.ndarray\n", - " Original feature matrix (n x 1)\n", - " n_features : int\n", - " Number of features to create\n", - " \n", - " Returns:\n", - " --------\n", - " W_poly : numpy.ndarray\n", - " Extended feature matrix with polynomial features\n", - " \"\"\"\n", - " n_samples = W.shape[0]\n", - " W_poly = np.zeros((n_samples, n_features))\n", - " 
\n", - " for i in range(n_features):\n", - " W_poly[:, i] = W.ravel() ** (i + 1) # W^1, W^2, W^3, etc.\n", - " \n", - " return W_poly\n", - "\n", - "def calculate_adjusted_r2(r2, n, k):\n", - " \"\"\"\n", - " Calculate adjusted R-squared.\n", - " \n", - " Adjusted R² = 1 - [(1 - R²)(n - 1) / (n - k - 1)]\n", - " \n", - " Parameters:\n", - " -----------\n", - " r2 : float\n", - " R-squared value\n", - " n : int\n", - " Sample size\n", - " k : int\n", - " Number of features (excluding intercept)\n", - " \n", - " Returns:\n", - " --------\n", - " adj_r2 : float\n", - " Adjusted R-squared\n", - " \"\"\"\n", - " # Handle edge cases where we have too many features\n", - " if n - k - 1 <= 0:\n", - " return np.nan\n", - " \n", - " adj_r2 = 1 - ((1 - r2) * (n - 1) / (n - k - 1))\n", - " return adj_r2\n", - "\n", - "# Test the functions\n", - "W_poly_example = create_polynomial_features(W, 5)\n", - "print(f\"Original W shape: {W.shape}\")\n", - "print(f\"Polynomial features (5 features) shape: {W_poly_example.shape}\")\n", - "print(f\"Example adjusted R²: {calculate_adjusted_r2(0.8, 1000, 5):.4f}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Overfitting Analysis\n", - "\n", - "Test models with different numbers of polynomial features: 1, 2, 5, 10, 20, 50, 100, 200, 500, 1000" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "def overfitting_analysis():\n", - " \"\"\"\n", - " Main function to perform overfitting analysis.\n", - " \"\"\"\n", - " # Number of features to test (as specified)\n", - " n_features_list = [1, 2, 5, 10, 20, 50, 100, 200, 500, 1000]\n", - " \n", - " # Storage for results\n", - " results = []\n", - " \n", - " print(\"Analyzing overfitting for different numbers of features...\")\n", - " print(\"Features | R² (full) | Adj R² (full) | R² (out-of-sample)\")\n", - " print(\"-\" * 60)\n", - " \n", - " for n_feat in n_features_list:\n", - " try:\n", - " # Create 
polynomial features\n", - " W_poly = create_polynomial_features(W, n_feat)\n", - " \n", - " # Split data into train/test (75%/25%)\n", - " W_train, W_test, y_train, y_test = train_test_split(\n", - " W_poly, y, test_size=0.25, random_state=42\n", - " )\n", - " \n", - " # Fit model on full sample (with intercept for proper estimation)\n", - " model_full = LinearRegression(fit_intercept=True)\n", - " model_full.fit(W_poly, y)\n", - " y_pred_full = model_full.predict(W_poly)\n", - " r2_full = r2_score(y, y_pred_full)\n", - " \n", - " # Calculate adjusted R²\n", - " adj_r2_full = calculate_adjusted_r2(r2_full, len(y), n_feat)\n", - " \n", - " # Fit model on training data and predict on test data\n", - " model_train = LinearRegression(fit_intercept=True)\n", - " model_train.fit(W_train, y_train)\n", - " y_pred_test = model_train.predict(W_test)\n", - " r2_out_of_sample = r2_score(y_test, y_pred_test)\n", - " \n", - " # Store results\n", - " results.append({\n", - " 'n_features': n_feat,\n", - " 'r2_full': r2_full,\n", - " 'adj_r2_full': adj_r2_full,\n", - " 'r2_out_of_sample': r2_out_of_sample\n", - " })\n", - " \n", - " print(f\"{n_feat:8d} | {r2_full:9.4f} | {adj_r2_full:12.4f} | {r2_out_of_sample:17.4f}\")\n", - " \n", - " except Exception as e:\n", - " print(f\"Error with {n_feat} features: {str(e)}\")\n", - " # Still append to maintain consistency\n", - " results.append({\n", - " 'n_features': n_feat,\n", - " 'r2_full': np.nan,\n", - " 'adj_r2_full': np.nan,\n", - " 'r2_out_of_sample': np.nan\n", - " })\n", - " \n", - " return pd.DataFrame(results)\n", - "\n", - "# Run the analysis\n", - "results_df = overfitting_analysis()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Visualization\n", - "\n", - "Create three separate graphs for each R-squared measure as requested." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "def create_separate_plots(df_results):\n", - " \"\"\"\n", - " Create three separate plots for R-squared analysis.\n", - " \"\"\"\n", - " # Filter out NaN values for plotting\n", - " df_clean = df_results.dropna()\n", - " \n", - " if df_clean.empty:\n", - " print(\"No valid results to plot\")\n", - " return None\n", - " \n", - " # Create figure with subplots\n", - " fig, axes = plt.subplots(1, 3, figsize=(18, 5))\n", - " \n", - " # Plot 1: R-squared (full sample)\n", - " axes[0].plot(df_clean['n_features'], df_clean['r2_full'], \n", - " marker='o', linewidth=2, markersize=6, color='blue')\n", - " axes[0].set_title('R-squared on Full Sample vs Number of Features', fontsize=12, fontweight='bold')\n", - " axes[0].set_xlabel('Number of Features')\n", - " axes[0].set_ylabel('R-squared')\n", - " axes[0].set_xscale('log')\n", - " axes[0].grid(True, alpha=0.3)\n", - " axes[0].set_ylim(0, 1)\n", - " \n", - " # Plot 2: Adjusted R-squared (full sample)\n", - " axes[1].plot(df_clean['n_features'], df_clean['adj_r2_full'], \n", - " marker='s', linewidth=2, markersize=6, color='green')\n", - " axes[1].set_title('Adjusted R-squared on Full Sample vs Number of Features', fontsize=12, fontweight='bold')\n", - " axes[1].set_xlabel('Number of Features')\n", - " axes[1].set_ylabel('Adjusted R-squared')\n", - " axes[1].set_xscale('log')\n", - " axes[1].grid(True, alpha=0.3)\n", - " \n", - " # Plot 3: Out-of-sample R-squared\n", - " axes[2].plot(df_clean['n_features'], df_clean['r2_out_of_sample'], \n", - " marker='^', linewidth=2, markersize=6, color='red')\n", - " axes[2].set_title('Out-of-Sample R-squared vs Number of Features', fontsize=12, fontweight='bold')\n", - " axes[2].set_xlabel('Number of Features')\n", - " axes[2].set_ylabel('Out-of-Sample R-squared')\n", - " axes[2].set_xscale('log')\n", - " axes[2].grid(True, alpha=0.3)\n", - " \n", - " plt.tight_layout()\n", 
- " plt.show()\n", - " \n", - " return fig\n", - "\n", - "# Create the plots\n", - "fig = create_separate_plots(results_df)\n", - "\n", - "print(\"\\nThree separate plots created showing:\")\n", - "print(\"1. R² (Full Sample): Should show monotonic increase\")\n", - "print(\"2. Adjusted R² (Full Sample): Should show peak and decline due to complexity penalty\")\n", - "print(\"3. R² (Out-of-Sample): Should show the classic overfitting pattern (inverted U-shape)\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Results Summary" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Display complete results\n", - "print(\"\\n=== COMPLETE RESULTS TABLE ===\")\n", - "print(results_df.to_string(index=False, float_format='%.4f'))\n", - "\n", - "# Find optimal complexity\n", - "valid_results = results_df.dropna()\n", - "if not valid_results.empty:\n", - " optimal_adj_r2_idx = valid_results['adj_r2_full'].idxmax()\n", - " optimal_oos_r2_idx = valid_results['r2_out_of_sample'].idxmax()\n", - " \n", - " print(\"\\n=== OPTIMAL MODEL COMPLEXITY ===\")\n", - " print(f\"By Adjusted R²: {valid_results.loc[optimal_adj_r2_idx, 'n_features']} features\")\n", - " print(f\" - Adjusted R² = {valid_results.loc[optimal_adj_r2_idx, 'adj_r2_full']:.4f}\")\n", - " print(f\"By Out-of-Sample R²: {valid_results.loc[optimal_oos_r2_idx, 'n_features']} features\")\n", - " print(f\" - Out-of-Sample R² = {valid_results.loc[optimal_oos_r2_idx, 'r2_out_of_sample']:.4f}\")\n", - "\n", - "print(\"\\n=== INSIGHTS ===\")\n", - "print(\"✅ This analysis demonstrates the classic bias-variance tradeoff\")\n", - "print(\"📈 R² (Full Sample) should increase monotonically with model complexity\")\n", - "print(\"📊 Adjusted R² should peak early and then decline due to complexity penalty\")\n", - "print(\"📉 Out-of-Sample R² should show the inverted U-shape characteristic of overfitting\")\n", - "print(\"🎯 True model 
follows: y = exp(4*W) + e\")\n", - "print(\"⚠️ High-dimensional models (many features) lead to severe overfitting\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Save Results" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import os\n", - "\n", - "# Create output directory\n", - "output_dir = '../output'\n", - "os.makedirs(output_dir, exist_ok=True)\n", - "\n", - "# Save results\n", - "results_df.to_csv(f'{output_dir}/overfitting_results_corrected.csv', index=False)\n", - "print(f\"Results saved to {output_dir}/overfitting_results_corrected.csv\")\n", - "\n", - "print(\"\\n🎉 CORRECTED overfitting analysis complete!\")\n", - "print(\"Data generation follows class example with:\")\n", - "print(\"- W ~ Uniform(0,1), sorted, n=1000\")\n", - "print(\"- e ~ Normal(0,1)\")\n", - "print(\"- y = exp(4*W) + e (class example)\")\n", - "print(\"- With intercept for proper estimation\")\n", - "print(\"- Seed = 42 for reproducibility\")" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.11.5" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} \ No newline at end of file diff --git a/Python/scripts/part2_overfitting_corrected_new.ipynb b/Python/scripts/part2_overfitting_corrected_new.ipynb deleted file mode 100644 index 4e64576..0000000 --- a/Python/scripts/part2_overfitting_corrected_new.ipynb +++ /dev/null @@ -1,358 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Assignment 1 - Part 2: Overfitting Analysis (CORRECTED)\n", - "## Overfitting (8 points)\n", - "\n", - "This notebook analyzes overfitting using the assignment 
specification:\n",
-    "**y = 2*x + e (no intercept)**\n",
-    "\n",
-    "**Key requirements:**\n",
-    "- Data generating process with intercept parameter equal to zero\n",
-    "- Do not use intercept in model estimation \n",
-    "- Test with different numbers of features: 1, 2, 5, 10, 20, 50, 100, 200, 500, 1000\n",
-    "- Calculate R², Adjusted R², and Out-of-sample R²\n",
-    "- Use 75%/25% train/test split for out-of-sample evaluation\n",
-    "- Create three separate plots"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "import numpy as np\n",
-    "import pandas as pd\n",
-    "import matplotlib.pyplot as plt\n",
-    "import seaborn as sns\n",
-    "from sklearn.linear_model import LinearRegression\n",
-    "from sklearn.model_selection import train_test_split\n",
-    "from sklearn.metrics import r2_score\n",
-    "import warnings\n",
-    "warnings.filterwarnings('ignore')\n",
-    "\n",
-    "# Set style for plots\n",
-    "plt.style.use('seaborn-v0_8')\n",
-    "sns.set_palette(\"husl\")"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## 1. Data Generation (Following Assignment Specification)\n",
-    "\n",
-    "Generate data with:\n",
-    "- **n = 1000** observations\n",
-    "- **x** from Uniform(0,1), sorted\n",
-    "- **y = 2*x + e** where e ~ Normal(0,1)\n",
-    "- **No intercept** in the data generating process"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "def generate_data(n=1000, seed=42):\n",
-    "    \"\"\"\n",
-    "    Generate data following the assignment specification:\n",
-    "    y = 2*x + e (no intercept, simple linear relationship)\n",
-    "    \"\"\"\n",
-    "    np.random.seed(seed)\n",
-    "    \n",
-    "    # Generate x from uniform distribution and sort\n",
-    "    x = np.random.uniform(0, 1, n)\n",
-    "    x.sort()\n",
-    "    x = x.reshape(-1, 1)\n",
-    "    \n",
-    "    # Generate error term\n",
-    "    e = np.random.normal(0, 1, n)\n",
-    "    \n",
-    "    # Generate y with simple linear relationship (no intercept): y = 2*x + e\n",
-    "    y = 2.0 * x.ravel() + e\n",
-    "    \n",
-    "    return x, y\n",
-    "\n",
-    "# Generate the data\n",
-    "x, y = generate_data(n=1000, seed=42)\n",
-    "\n",
-    "print(f\"Generated data with n={len(y)} observations\")\n",
-    "print(f\"True relationship: y = 2*x + e (no intercept)\")\n",
-    "print(f\"x range: [{x.min():.4f}, {x.max():.4f}]\")\n",
-    "print(f\"y range: [{y.min():.4f}, {y.max():.4f}]\")"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## 2. Polynomial Feature Creation\n",
-    "\n",
-    "Create polynomial features x, x², x³, ..., xᵏ for different values of k."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "def create_polynomial_features(x, n_features):\n",
-    "    \"\"\"\n",
-    "    Create polynomial features up to n_features.\n",
-    "    \"\"\"\n",
-    "    n_samples = x.shape[0]\n",
-    "    x_poly = np.zeros((n_samples, n_features))\n",
-    "    \n",
-    "    for i in range(n_features):\n",
-    "        x_poly[:, i] = x.ravel() ** (i + 1)  # x^1, x^2, x^3, etc.\n",
-    "    \n",
-    "    return x_poly\n",
-    "\n",
-    "def calculate_adjusted_r2(r2, n, k):\n",
-    "    \"\"\"\n",
-    "    Calculate adjusted R-squared.\n",
-    "    Adjusted R² = 1 - [(1 - R²)(n - 1) / (n - k - 1)]\n",
-    "    \"\"\"\n",
-    "    # Handle edge cases where we have too many features\n",
-    "    if n - k - 1 <= 0:\n",
-    "        return np.nan\n",
-    "    \n",
-    "    adj_r2 = 1 - ((1 - r2) * (n - 1) / (n - k - 1))\n",
-    "    return adj_r2"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## 3. Overfitting Analysis\n",
-    "\n",
-    "Test models with different numbers of features: 1, 2, 5, 10, 20, 50, 100, 200, 500, 1000\n",
-    "\n",
-    "For each model:\n",
-    "- Calculate R² on full sample\n",
-    "- Calculate Adjusted R² on full sample \n",
-    "- Calculate Out-of-sample R² using 75%/25% train/test split\n",
-    "- **Use fit_intercept=False** as per assignment requirements"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# Number of features to test (as specified)\n",
-    "n_features_list = [1, 2, 5, 10, 20, 50, 100, 200, 500, 1000]\n",
-    "\n",
-    "# Storage for results\n",
-    "results = []\n",
-    "\n",
-    "print(\"Analyzing overfitting for different numbers of features...\")\n",
-    "print(\"Features | R² (full) | Adj R² (full) | R² (out-of-sample)\")\n",
-    "print(\"-\" * 60)\n",
-    "\n",
-    "for n_feat in n_features_list:\n",
-    "    try:\n",
-    "        # Create polynomial features\n",
-    "        x_poly = create_polynomial_features(x, n_feat)\n",
-    "        \n",
-    "        # Split data into train/test (75%/25%)\n",
-    "        x_train, x_test, y_train, y_test = train_test_split(\n",
-    "            x_poly, y, test_size=0.25, random_state=42\n",
-    "        )\n",
-    "        \n",
-    "        # Fit model on full sample (WITHOUT intercept as requested)\n",
-    "        model_full = LinearRegression(fit_intercept=False)\n",
-    "        model_full.fit(x_poly, y)\n",
-    "        y_pred_full = model_full.predict(x_poly)\n",
-    "        r2_full = r2_score(y, y_pred_full)\n",
-    "        \n",
-    "        # Calculate adjusted R²\n",
-    "        adj_r2_full = calculate_adjusted_r2(r2_full, len(y), n_feat)\n",
-    "        \n",
-    "        # Fit model on training data and predict on test data (WITHOUT intercept)\n",
-    "        model_train = LinearRegression(fit_intercept=False)\n",
-    "        model_train.fit(x_train, y_train)\n",
-    "        y_pred_test = model_train.predict(x_test)\n",
-    "        r2_out_of_sample = r2_score(y_test, y_pred_test)\n",
-    "        \n",
-    "        # Store results\n",
-    "        results.append({\n",
-    "            'n_features': n_feat,\n",
-    "            'r2_full': r2_full,\n",
-    "            'adj_r2_full': adj_r2_full,\n",
-    "            'r2_out_of_sample': r2_out_of_sample\n",
-    "        })\n",
-    "        \n",
-    "        print(f\"{n_feat:8d} | {r2_full:9.4f} | {adj_r2_full:12.4f} | {r2_out_of_sample:17.4f}\")\n",
-    "        \n",
-    "    except Exception as e:\n",
-    "        print(f\"Error with {n_feat} features: {str(e)}\")\n",
-    "        # Still append to maintain consistency\n",
-    "        results.append({\n",
-    "            'n_features': n_feat,\n",
-    "            'r2_full': np.nan,\n",
-    "            'adj_r2_full': np.nan,\n",
-    "            'r2_out_of_sample': np.nan\n",
-    "        })\n",
-    "\n",
-    "# Convert to DataFrame\n",
-    "results_df = pd.DataFrame(results)\n",
-    "print(\"\\n=== COMPLETE RESULTS TABLE ===\")\n",
-    "print(results_df.to_string(index=False, float_format='%.4f'))"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## 4. Visualization: Three Separate Plots\n",
-    "\n",
-    "Create three separate graphs as requested:\n",
-    "1. R² (Full Sample) vs Number of Features\n",
-    "2. Adjusted R² (Full Sample) vs Number of Features \n",
-    "3. Out-of-Sample R² vs Number of Features"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# Filter out NaN values for plotting\n",
-    "df_clean = results_df.dropna()\n",
-    "\n",
-    "# Create figure with subplots\n",
-    "fig, axes = plt.subplots(1, 3, figsize=(18, 5))\n",
-    "\n",
-    "# Plot 1: R-squared (full sample)\n",
-    "axes[0].plot(df_clean['n_features'], df_clean['r2_full'], \n",
-    "             marker='o', linewidth=2, markersize=6, color='blue')\n",
-    "axes[0].set_title('R-squared on Full Sample vs Number of Features', fontsize=12, fontweight='bold')\n",
-    "axes[0].set_xlabel('Number of Features')\n",
-    "axes[0].set_ylabel('R-squared')\n",
-    "axes[0].set_xscale('log')\n",
-    "axes[0].grid(True, alpha=0.3)\n",
-    "axes[0].set_ylim(0, 1)\n",
-    "\n",
-    "# Plot 2: Adjusted R-squared (full sample)\n",
-    "axes[1].plot(df_clean['n_features'], df_clean['adj_r2_full'], \n",
-    "             marker='s', linewidth=2, markersize=6, color='green')\n",
-    "axes[1].set_title('Adjusted R-squared on Full Sample vs Number of Features', fontsize=12, fontweight='bold')\n",
-    "axes[1].set_xlabel('Number of Features')\n",
-    "axes[1].set_ylabel('Adjusted R-squared')\n",
-    "axes[1].set_xscale('log')\n",
-    "axes[1].grid(True, alpha=0.3)\n",
-    "\n",
-    "# Plot 3: Out-of-sample R-squared\n",
-    "axes[2].plot(df_clean['n_features'], df_clean['r2_out_of_sample'], \n",
-    "             marker='^', linewidth=2, markersize=6, color='red')\n",
-    "axes[2].set_title('Out-of-Sample R-squared vs Number of Features', fontsize=12, fontweight='bold')\n",
-    "axes[2].set_xlabel('Number of Features')\n",
-    "axes[2].set_ylabel('Out-of-Sample R-squared')\n",
-    "axes[2].set_xscale('log')\n",
-    "axes[2].grid(True, alpha=0.3)\n",
-    "\n",
-    "plt.tight_layout()\n",
-    "plt.show()"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## 5. Results Interpretation\n",
-    "\n",
-    "### Findings:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# Find optimal complexity\n",
-    "valid_results = results_df.dropna()\n",
-    "if not valid_results.empty:\n",
-    "    optimal_adj_r2_idx = valid_results['adj_r2_full'].idxmax()\n",
-    "    optimal_oos_r2_idx = valid_results['r2_out_of_sample'].idxmax()\n",
-    "    \n",
-    "    print(\"=== OPTIMAL MODEL COMPLEXITY ===\")\n",
-    "    print(f\"By Adjusted R²: {valid_results.loc[optimal_adj_r2_idx, 'n_features']} features\")\n",
-    "    print(f\"  - Adjusted R² = {valid_results.loc[optimal_adj_r2_idx, 'adj_r2_full']:.4f}\")\n",
-    "    print(f\"By Out-of-Sample R²: {valid_results.loc[optimal_oos_r2_idx, 'n_features']} features\")\n",
-    "    print(f\"  - Out-of-Sample R² = {valid_results.loc[optimal_oos_r2_idx, 'r2_out_of_sample']:.4f}\")\n",
-    "\n",
-    "print(\"\\n=== INSIGHTS ===\")\n",
-    "print(\"✅ This analysis demonstrates the classic bias-variance tradeoff\")\n",
-    "print(\"📈 R² (Full Sample) increases monotonically with model complexity\")\n",
-    "print(\"📊 Adjusted R² peaks early and then declines due to complexity penalty\")\n",
-    "print(\"📉 Out-of-Sample R² shows the inverted U-shape characteristic of overfitting\")\n",
-    "print(\"🎯 True model follows: y = 2*x + e (no intercept)\")\n",
-    "print(\"⚠️ High-dimensional models (many features) lead to severe overfitting\")\n",
-    "print(\"\\n🔹 The simple linear relationship with no intercept produces reasonable R² values\")\n",
-    "print(\"🔹 Using fit_intercept=False follows the assignment specification\")\n",
-    "print(\"🔹 Results clearly show overfitting patterns without extreme values\")"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Comments on Results\n",
-    "\n",
-    "### 1. R² (Full Sample)\n",
-    "- **Pattern**: Monotonically increases from ~0.24 to ~0.28\n",
-    "- **Interpretation**: More complex models always fit the training data better\n",
-    "- **Expected behavior**: ✅ Confirmed\n",
-    "\n",
-    "### 2. Adjusted R² (Full Sample) \n",
-    "- **Pattern**: Peaks around 10 features (~0.25), then declines\n",
-    "- **Interpretation**: Complexity penalty prevents overfitting in model selection\n",
-    "- **Expected behavior**: ✅ Confirmed - shows inverted U-shape\n",
-    "\n",
-    "### 3. Out-of-Sample R²\n",
-    "- **Pattern**: Starts highest (~0.32), stays stable initially, then severely deteriorates\n",
-    "- **Interpretation**: Classic overfitting - performance degrades on unseen data with high complexity\n",
-    "- **Expected behavior**: ✅ Confirmed - clear overfitting at 500+ features\n",
-    "\n",
-    "### Key Intuition\n",
-    "- **Simple relationship** (y = 2x + e) with **no intercept** produces interpretable results\n",
-    "- **Polynomial features** create overfitting when k >> true model complexity\n",
-    "- **Out-of-sample evaluation** is crucial for detecting overfitting\n",
-    "- **Adjusted R²** provides a good balance between fit and complexity"
-   ]
-  }
- ],
- "metadata": {
-  "kernelspec": {
-   "display_name": "Python 3",
-   "language": "python",
-   "name": "python3"
-  },
-  "language_info": {
-   "codemirror_mode": {
-    "name": "ipython",
-    "version": 3
-   },
-   "file_extension": ".py",
-   "mimetype": "text/x-python",
-   "name": "python",
-   "nbconvert_exporter": "python",
-   "pygments_lexer": "ipython3",
-   "version": "3.8.5"
-  }
- },
- "nbformat": 4,
- "nbformat_minor": 4
-}
\ No newline at end of file
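For review purposes, the pipeline this patch moves to — DGP `y = exp(4*W) + e` with no intercept, polynomial features `W, W², …, Wᵏ`, `LinearRegression(fit_intercept=False)`, a 75%/25% split, and the adjusted R² penalty — can be sanity-checked outside the notebook. This is a minimal standalone sketch: the helper name `overfit_metrics` is mine, and it seeds via `np.random.default_rng` rather than the notebook's `np.random.seed`, so exact numbers will differ from the notebook's output.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

def generate_data(n=1000, seed=42):
    """DGP from the corrected notebook: y = exp(4*W) + e, no intercept."""
    rng = np.random.default_rng(seed)  # sketch uses the Generator API
    W = np.sort(rng.uniform(0.0, 1.0, n)).reshape(-1, 1)
    e = rng.normal(0.0, 1.0, n)
    return W, np.exp(4.0 * W.ravel()) + e

def overfit_metrics(W, y, n_feat):
    """Full-sample R2, adjusted R2, and out-of-sample R2 for W, W^2, ..., W^n_feat."""
    X = np.hstack([W ** (i + 1) for i in range(n_feat)])  # no intercept column
    full = LinearRegression(fit_intercept=False).fit(X, y)
    r2 = r2_score(y, full.predict(X))
    n, k = len(y), n_feat
    # Adjusted R2 = 1 - (1 - R2)(n - 1)/(n - k - 1), undefined when k >= n - 1
    adj = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1) if n - k - 1 > 0 else float("nan")
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)
    oos = r2_score(y_te, LinearRegression(fit_intercept=False).fit(X_tr, y_tr).predict(X_te))
    return r2, adj, oos

W, y = generate_data()
r2_1, adj_1, oos_1 = overfit_metrics(W, y, 1)
r2_20, adj_20, oos_20 = overfit_metrics(W, y, 20)
print(f"k=1 : R2={r2_1:.3f}  adjR2={adj_1:.3f}  OOS R2={oos_1:.3f}")
print(f"k=20: R2={r2_20:.3f}  adjR2={adj_20:.3f}  OOS R2={oos_20:.3f}")
```

Because the feature sets are nested, the full-sample R² can only rise as k grows, while out-of-sample R² is free to fall — the contrast the notebook's three plots are built to show.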