114 changes: 67 additions & 47 deletions Python/scripts/part2_overfitting.ipynb
@@ -4,11 +4,20 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Assignment 1 - Part 2: Overfitting Analysis (CORRECTED)\n",
"# Assignment 1 - Part 2: Overfitting Analysis\n",
"## Overfitting (8 points)\n",
"\n",
"This notebook analyzes overfitting using the correct data generating process from the class example:\n",
"**y = exp(4*W) + e**"
"This notebook analyzes overfitting using the specified data generating process:\n",
"**y = np.exp(4 * W) + e** (WITHOUT INTERCEPT)\n",
"\n",
"**Requirements:**\n",
"- Data generating process: y = exp(4*W) + e with intercept parameter = 0\n",
"- n = 1000 observations\n",
"- Test with different numbers of features: 1, 2, 5, 10, 20, 50, 100, 200, 500, 1000\n",
"- Calculate R², Adjusted R², and Out-of-sample R²\n",
"- Use 75%/25% train/test split for out-of-sample evaluation\n",
"- Create three separate plots\n",
"- Use seed 42 for reproducibility"
]
},
{
@@ -36,9 +45,13 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data Generation\n",
"## 1. Data Generation\n",
"\n",
"Following the class example: **y = exp(4*W) + e**"
"Generate data with:\n",
"- **n = 1000** observations\n",
"- **W** from Uniform(0,1), sorted\n",
"- **y = exp(4*W) + e** where e ~ Normal(0,1)\n",
"- **No intercept** in the data generating process"
]
},
{
@@ -49,8 +62,8 @@
"source": [
"def generate_data(n=1000, seed=42):\n",
" \"\"\"\n",
" Generate data following the class example specification:\n",
" y = np.exp(4 * W) + e\n",
" Generate data following the specification:\n",
" y = np.exp(4 * W) + e (no intercept)\n",
" \n",
" Parameters:\n",
" -----------\n",
@@ -68,15 +81,15 @@
" \"\"\"\n",
" np.random.seed(seed)\n",
" \n",
" # Generate W from uniform distribution and sort (as in class example)\n",
" # Generate W from uniform distribution and sort\n",
" W = np.random.uniform(0, 1, n)\n",
" W.sort()\n",
" W = W.reshape(-1, 1)\n",
" \n",
" # Generate error term\n",
" e = np.random.normal(0, 1, n)\n",
" \n",
" # Generate y following class example: y = exp(4*W) + e\n",
" # Generate y following specification: y = exp(4*W) + e\n",
" y = np.exp(4 * W.ravel()) + e\n",
" \n",
" return W, y\n",
@@ -94,7 +107,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Helper Functions"
"## 2. Helper Functions"
]
},
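The helper cell itself is collapsed in this diff. A minimal sketch of what `calculate_adjusted_r2` (called later as `calculate_adjusted_r2(r2_full, len(y), n_feat)`) could look like, assuming the standard intercept-adjusted definition plus a NaN guard for the degenerate case where features exhaust the observations:

```python
def calculate_adjusted_r2(r2, n, p):
    """Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1).

    Returns NaN when no residual degrees of freedom remain
    (n - p - 1 <= 0), e.g. 1000 features on 1000 observations.
    """
    if n - p - 1 <= 0:
        return float("nan")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)
```

The NaN guard is an assumption, but it is consistent with the `valid_results` filtering that appears in the summary cell below.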
{
@@ -165,9 +178,15 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Overfitting Analysis\n",
"## 3. Overfitting Analysis Loop\n",
"\n",
"Test models with different numbers of polynomial features: 1, 2, 5, 10, 20, 50, 100, 200, 500, 1000\n",
"\n",
"Test models with different numbers of polynomial features: 1, 2, 5, 10, 20, 50, 100, 200, 500, 1000"
"For each model:\n",
"- Calculate R² on full sample\n",
"- Calculate Adjusted R² on full sample \n",
"- Calculate Out-of-sample R² using 75%/25% train/test split\n",
"- **Use fit_intercept=False** as per assignment requirements (no intercept)"
]
},
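The feature-construction step is collapsed in the diff below. Assuming the design matrix stacks the powers W, W², …, W^k with no constant column (consistent with `fit_intercept=False`), a numpy-only sketch of building `W_poly`:

```python
import numpy as np

def make_poly_features(W, n_feat):
    """Columns W, W**2, ..., W**n_feat; no intercept column,
    matching the fit_intercept=False regressions below."""
    w = np.asarray(W).ravel()
    return np.column_stack([w ** k for k in range(1, n_feat + 1)])

W = np.linspace(0, 1, 8).reshape(-1, 1)  # small stand-in for the sorted W
W_poly = make_poly_features(W, 3)
print(W_poly.shape)  # (8, 3)
```

The function name and the raw-power basis are assumptions; the notebook may instead use sklearn's `PolynomialFeatures(degree=n_feat, include_bias=False)`, which produces the same columns for a single input variable.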
{
@@ -200,17 +219,17 @@
" W_poly, y, test_size=0.25, random_state=42\n",
" )\n",
" \n",
" # Fit model on full sample (with intercept for proper estimation)\n",
" model_full = LinearRegression(fit_intercept=True)\n",
" # Fit model on full sample (WITHOUT intercept as requested)\n",
" model_full = LinearRegression(fit_intercept=False)\n",
" model_full.fit(W_poly, y)\n",
" y_pred_full = model_full.predict(W_poly)\n",
" r2_full = r2_score(y, y_pred_full)\n",
" \n",
" # Calculate adjusted R²\n",
" adj_r2_full = calculate_adjusted_r2(r2_full, len(y), n_feat)\n",
" \n",
" # Fit model on training data and predict on test data\n",
" model_train = LinearRegression(fit_intercept=True)\n",
" # Fit model on training data and predict on test data (WITHOUT intercept)\n",
" model_train = LinearRegression(fit_intercept=False)\n",
" model_train.fit(W_train, y_train)\n",
" y_pred_test = model_train.predict(W_test)\n",
" r2_out_of_sample = r2_score(y_test, y_pred_test)\n",
@@ -245,9 +264,12 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Visualization\n",
"## 4. Visualization: Three Separate Plots\n",
"\n",
"Create three separate graphs for each R-squared measure as requested."
"Create three separate graphs as requested:\n",
"1. R² (Full Sample) vs Number of Features\n",
"2. Adjusted R² (Full Sample) vs Number of Features \n",
"3. Out-of-Sample R² vs Number of Features"
]
},
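The plotting cell is collapsed in the diff below. One way to produce the three separate figures, with illustrative values standing in for the loop's `results_df` (the log x-axis is an assumption, chosen because the feature counts span three orders of magnitude):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch also runs in scripts
import matplotlib.pyplot as plt

n_features = [1, 2, 5, 10, 20, 50, 100, 200, 500, 1000]
# Illustrative curves only; the real values come from results_df.
curves = {
    "R² (Full Sample)":          [0.62, 0.80, 0.90, 0.93, 0.95, 0.97, 0.98, 0.99, 0.995, 1.0],
    "Adjusted R² (Full Sample)": [0.62, 0.80, 0.90, 0.92, 0.93, 0.92, 0.90, 0.85, 0.70, float("nan")],
    "Out-of-Sample R²":          [0.60, 0.78, 0.88, 0.90, 0.89, 0.80, 0.40, -0.5, -5.0, -50.0],
}

for name, values in curves.items():
    fig, ax = plt.subplots(figsize=(7, 4))
    ax.plot(n_features, values, marker="o")
    ax.set_xscale("log")
    ax.set_xlabel("Number of features")
    ax.set_ylabel(name)
    ax.set_title(f"{name} vs Number of Features")
    fig.tight_layout()
```

One figure per metric (rather than three subplots) matches the "three separate plots" requirement stated at the top of the notebook.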
{
@@ -316,7 +338,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Results Summary"
"## 5. Results Summary and Analysis"
]
},
{
Expand All @@ -341,45 +363,43 @@
" print(f\"By Out-of-Sample R²: {valid_results.loc[optimal_oos_r2_idx, 'n_features']} features\")\n",
" print(f\" - Out-of-Sample R² = {valid_results.loc[optimal_oos_r2_idx, 'r2_out_of_sample']:.4f}\")\n",
"\n",
"print(\"\\n=== INSIGHTS ===\")\n",
"print(\"\\n=== INSIGHTS AND INTERPRETATION ===\")\n",
"print(\"✅ This analysis demonstrates the classic bias-variance tradeoff\")\n",
"print(\"📈 R² (Full Sample) should increase monotonically with model complexity\")\n",
"print(\"📊 Adjusted R² should peak early and then decline due to complexity penalty\")\n",
"print(\"📉 Out-of-Sample R² should show the inverted U-shape characteristic of overfitting\")\n",
"print(\"🎯 True model follows: y = exp(4*W) + e\")\n",
"print(\"📈 R² (Full Sample): Increases monotonically with model complexity\")\n",
"print(\"📊 Adjusted R²: Peaks early and then declines due to complexity penalty\")\n",
"print(\"📉 Out-of-Sample R²: Shows the inverted U-shape characteristic of overfitting\")\n",
"print(\"🎯 True model follows: y = exp(4*W) + e (no intercept)\")\n",
"print(\"⚠️ High-dimensional models (many features) lead to severe overfitting\")"
]
},
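The construction of `results_df` and the filtering step are collapsed above. Assuming a DataFrame with the column names used in the visible prints (`n_features`, `r2_out_of_sample`) plus a hypothetical `adj_r2_full` column for adjusted R², the optimal-model lookup might be:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the loop's results_df; adj_r2_full is an
# assumed column name for the adjusted R² values.
results_df = pd.DataFrame({
    "n_features":       [1, 2, 5, 1000],
    "adj_r2_full":      [0.60, 0.82, 0.79, np.nan],
    "r2_out_of_sample": [0.55, 0.80, 0.70, -50.0],
})

# Keep only rows where adjusted R² is defined (n - p - 1 > 0).
valid_results = results_df.dropna(subset=["adj_r2_full"]).reset_index(drop=True)

optimal_adj_idx = valid_results["adj_r2_full"].idxmax()
optimal_oos_r2_idx = valid_results["r2_out_of_sample"].idxmax()
print(valid_results.loc[optimal_adj_idx, "n_features"])     # 2
print(valid_results.loc[optimal_oos_r2_idx, "n_features"])  # 2
```

`optimal_oos_r2_idx` and `valid_results` appear in the notebook's own prints; the dropna-based filtering is an assumption about how `valid_results` is derived.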
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Save Results"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"## Comments on Results\n",
"\n",
"### 1. R² (Full Sample)\n",
"- **Pattern**: Should show monotonic increase with more features\n",
"- **Interpretation**: More complex models always fit the training data better\n",
"- **Expected behavior**: R² approaches 1.0 as we add more polynomial features\n",
"\n",
"# Create output directory\n",
"output_dir = '../output'\n",
"os.makedirs(output_dir, exist_ok=True)\n",
"### 2. Adjusted R² (Full Sample) \n",
"- **Pattern**: Should peak at optimal complexity, then decline\n",
"- **Interpretation**: Complexity penalty prevents overfitting in model selection\n",
"- **Expected behavior**: Inverted U-shape showing optimal model complexity\n",
"\n",
"# Save results\n",
"results_df.to_csv(f'{output_dir}/overfitting_results_corrected.csv', index=False)\n",
"print(f\"Results saved to {output_dir}/overfitting_results_corrected.csv\")\n",
"### 3. Out-of-Sample R²\n",
"- **Pattern**: Should start out reasonably high, then deteriorate with high complexity\n",
"- **Interpretation**: Classic overfitting - performance degrades on unseen data\n",
"- **Expected behavior**: Clear deterioration at high feature counts (500+)\n",
"\n",
"print(\"\\n🎉 CORRECTED overfitting analysis complete!\")\n",
"print(\"Data generation follows class example with:\")\n",
"print(\"- W ~ Uniform(0,1), sorted, n=1000\")\n",
"print(\"- e ~ Normal(0,1)\")\n",
"print(\"- y = exp(4*W) + e (class example)\")\n",
"print(\"- With intercept for proper estimation\")\n",
"print(\"- Seed = 42 for reproducibility\")"
"### Key Intuition\n",
"- **Exponential relationship** (y = exp(4*W) + e) with **no intercept** creates a complex pattern\n",
"- **Polynomial features** attempt to approximate the exponential function\n",
"- **Low-order polynomials** capture the main trend but miss the curvature\n",
"- **High-order polynomials** overfit to noise, especially under the no-intercept constraint\n",
"- **Out-of-sample evaluation** is crucial for detecting overfitting\n",
"- **Adjusted R²** provides good balance between fit and complexity"
]
}
],