114 changes: 67 additions & 47 deletions Python/scripts/part2_overfitting.ipynb
@@ -4,11 +4,20 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Assignment 1 - Part 2: Overfitting Analysis (CORRECTED)\n",
"# Assignment 1 - Part 2: Overfitting Analysis\n",
"## Overfitting (8 points)\n",
"\n",
"This notebook analyzes overfitting using the correct data generating process from the class example:\n",
"**y = exp(4*W) + e**"
"This notebook analyzes overfitting using the specified data generating process:\n",
"**y = np.exp(4 * W) + e** (WITHOUT INTERCEPT)\n",
"\n",
"**Requirements:**\n",
"- Data generating process: y = exp(4*W) + e with intercept parameter = 0\n",
"- n = 1000 observations\n",
"- Test with different numbers of features: 1, 2, 5, 10, 20, 50, 100, 200, 500, 1000\n",
"- Calculate R², Adjusted R², and Out-of-sample R²\n",
"- Use 75%/25% train/test split for out-of-sample evaluation\n",
"- Create three separate plots\n",
"- Use seed 42 for reproducibility"
]
},
{
@@ -36,9 +45,13 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data Generation\n",
"## 1. Data Generation\n",
"\n",
"Following the class example: **y = exp(4*W) + e**"
"Generate data with:\n",
"- **n = 1000** observations\n",
"- **W** from Uniform(0,1), sorted\n",
"- **y = exp(4*W) + e** where e ~ Normal(0,1)\n",
"- **No intercept** in the data generating process"
]
},
{
@@ -49,8 +62,8 @@
"source": [
"def generate_data(n=1000, seed=42):\n",
" \"\"\"\n",
" Generate data following the class example specification:\n",
" y = np.exp(4 * W) + e\n",
" Generate data following the specification:\n",
" y = np.exp(4 * W) + e (no intercept)\n",
" \n",
" Parameters:\n",
" -----------\n",
@@ -68,15 +81,15 @@
" \"\"\"\n",
" np.random.seed(seed)\n",
" \n",
" # Generate W from uniform distribution and sort (as in class example)\n",
" # Generate W from uniform distribution and sort\n",
" W = np.random.uniform(0, 1, n)\n",
" W.sort()\n",
" W = W.reshape(-1, 1)\n",
" \n",
" # Generate error term\n",
" e = np.random.normal(0, 1, n)\n",
" \n",
" # Generate y following class example: y = exp(4*W) + e\n",
" # Generate y following specification: y = exp(4*W) + e\n",
" y = np.exp(4 * W.ravel()) + e\n",
" \n",
" return W, y\n",
@@ -94,7 +107,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Helper Functions"
"## 2. Helper Functions"
]
},
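The helper cell itself is collapsed in this diff. A minimal sketch of what `calculate_adjusted_r2` (called later as `calculate_adjusted_r2(r2_full, len(y), n_feat)`) could look like, assuming the standard intercept-adjusted definition plus a NaN guard for the degenerate case where features exhaust the observations:

```python
def calculate_adjusted_r2(r2, n, p):
    """Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1).

    Returns NaN when no residual degrees of freedom remain
    (n - p - 1 <= 0), e.g. 1000 features on 1000 observations.
    """
    if n - p - 1 <= 0:
        return float("nan")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)
```

The NaN guard is an assumption, but it is consistent with the `valid_results` filtering that appears in the summary cell below.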
{
@@ -165,9 +178,15 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Overfitting Analysis\n",
"## 3. Overfitting Analysis Loop\n",
"\n",
"Test models with different numbers of polynomial features: 1, 2, 5, 10, 20, 50, 100, 200, 500, 1000\n",
"\n",
"Test models with different numbers of polynomial features: 1, 2, 5, 10, 20, 50, 100, 200, 500, 1000"
"For each model:\n",
"- Calculate R² on full sample\n",
"- Calculate Adjusted R² on full sample \n",
"- Calculate Out-of-sample R² using 75%/25% train/test split\n",
"- **Use fit_intercept=False** as per assignment requirements (no intercept)"
]
},
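The feature-construction step is collapsed in the diff below. Assuming the design matrix stacks the powers W, W², …, W^k with no constant column (consistent with `fit_intercept=False`), a numpy-only sketch of building `W_poly`:

```python
import numpy as np

def make_poly_features(W, n_feat):
    """Columns W, W**2, ..., W**n_feat; no intercept column,
    matching the fit_intercept=False regressions below."""
    w = np.asarray(W).ravel()
    return np.column_stack([w ** k for k in range(1, n_feat + 1)])

W = np.linspace(0, 1, 8).reshape(-1, 1)  # small stand-in for the sorted W
W_poly = make_poly_features(W, 3)
print(W_poly.shape)  # (8, 3)
```

The function name and the raw-power basis are assumptions; the notebook may instead use sklearn's `PolynomialFeatures(degree=n_feat, include_bias=False)`, which produces the same columns for a single input variable.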
{
@@ -200,17 +219,17 @@
" W_poly, y, test_size=0.25, random_state=42\n",
" )\n",
" \n",
" # Fit model on full sample (with intercept for proper estimation)\n",
" model_full = LinearRegression(fit_intercept=True)\n",
" # Fit model on full sample (WITHOUT intercept as requested)\n",
" model_full = LinearRegression(fit_intercept=False)\n",
" model_full.fit(W_poly, y)\n",
" y_pred_full = model_full.predict(W_poly)\n",
" r2_full = r2_score(y, y_pred_full)\n",
" \n",
" # Calculate adjusted R²\n",
" adj_r2_full = calculate_adjusted_r2(r2_full, len(y), n_feat)\n",
" \n",
" # Fit model on training data and predict on test data\n",
" model_train = LinearRegression(fit_intercept=True)\n",
" # Fit model on training data and predict on test data (WITHOUT intercept)\n",
" model_train = LinearRegression(fit_intercept=False)\n",
" model_train.fit(W_train, y_train)\n",
" y_pred_test = model_train.predict(W_test)\n",
" r2_out_of_sample = r2_score(y_test, y_pred_test)\n",
@@ -245,9 +264,12 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Visualization\n",
"## 4. Visualization: Three Separate Plots\n",
"\n",
"Create three separate graphs for each R-squared measure as requested."
"Create three separate graphs as requested:\n",
"1. R² (Full Sample) vs Number of Features\n",
"2. Adjusted R² (Full Sample) vs Number of Features \n",
"3. Out-of-Sample R² vs Number of Features"
]
},
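The plotting cell is collapsed in the diff below. One way to produce the three separate figures, with illustrative values standing in for the loop's `results_df` (the log x-axis is an assumption, chosen because the feature counts span three orders of magnitude):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch also runs in scripts
import matplotlib.pyplot as plt

n_features = [1, 2, 5, 10, 20, 50, 100, 200, 500, 1000]
# Illustrative curves only; the real values come from results_df.
curves = {
    "R² (Full Sample)":          [0.62, 0.80, 0.90, 0.93, 0.95, 0.97, 0.98, 0.99, 0.995, 1.0],
    "Adjusted R² (Full Sample)": [0.62, 0.80, 0.90, 0.92, 0.93, 0.92, 0.90, 0.85, 0.70, float("nan")],
    "Out-of-Sample R²":          [0.60, 0.78, 0.88, 0.90, 0.89, 0.80, 0.40, -0.5, -5.0, -50.0],
}

for name, values in curves.items():
    fig, ax = plt.subplots(figsize=(7, 4))
    ax.plot(n_features, values, marker="o")
    ax.set_xscale("log")
    ax.set_xlabel("Number of features")
    ax.set_ylabel(name)
    ax.set_title(f"{name} vs Number of Features")
    fig.tight_layout()
```

One figure per metric (rather than three subplots) matches the "three separate plots" requirement stated at the top of the notebook.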
{
@@ -316,7 +338,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Results Summary"
"## 5. Results Summary and Analysis"
]
},
{
Expand All @@ -341,45 +363,43 @@
" print(f\"By Out-of-Sample R²: {valid_results.loc[optimal_oos_r2_idx, 'n_features']} features\")\n",
" print(f\" - Out-of-Sample R² = {valid_results.loc[optimal_oos_r2_idx, 'r2_out_of_sample']:.4f}\")\n",
"\n",
"print(\"\\n=== INSIGHTS ===\")\n",
"print(\"\\n=== INSIGHTS AND INTERPRETATION ===\")\n",
"print(\"✅ This analysis demonstrates the classic bias-variance tradeoff\")\n",
"print(\"📈 R² (Full Sample) should increase monotonically with model complexity\")\n",
"print(\"📊 Adjusted R² should peak early and then decline due to complexity penalty\")\n",
"print(\"📉 Out-of-Sample R² should show the inverted U-shape characteristic of overfitting\")\n",
"print(\"🎯 True model follows: y = exp(4*W) + e\")\n",
"print(\"📈 R² (Full Sample): Increases monotonically with model complexity\")\n",
"print(\"📊 Adjusted R²: Peaks early and then declines due to complexity penalty\")\n",
"print(\"📉 Out-of-Sample R²: Shows the inverted U-shape characteristic of overfitting\")\n",
"print(\"🎯 True model follows: y = exp(4*W) + e (no intercept)\")\n",
"print(\"⚠️ High-dimensional models (many features) lead to severe overfitting\")"
]
},
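The construction of `results_df` and the filtering step are collapsed above. Assuming a DataFrame with the column names used in the visible prints (`n_features`, `r2_out_of_sample`) plus a hypothetical `adj_r2_full` column for adjusted R², the optimal-model lookup might be:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the loop's results_df; adj_r2_full is an
# assumed column name for the adjusted R² values.
results_df = pd.DataFrame({
    "n_features":       [1, 2, 5, 1000],
    "adj_r2_full":      [0.60, 0.82, 0.79, np.nan],
    "r2_out_of_sample": [0.55, 0.80, 0.70, -50.0],
})

# Keep only rows where adjusted R² is defined (n - p - 1 > 0).
valid_results = results_df.dropna(subset=["adj_r2_full"]).reset_index(drop=True)

optimal_adj_idx = valid_results["adj_r2_full"].idxmax()
optimal_oos_r2_idx = valid_results["r2_out_of_sample"].idxmax()
print(valid_results.loc[optimal_adj_idx, "n_features"])     # 2
print(valid_results.loc[optimal_oos_r2_idx, "n_features"])  # 2
```

`optimal_oos_r2_idx` and `valid_results` appear in the notebook's own prints; the dropna-based filtering is an assumption about how `valid_results` is derived.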
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Save Results"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"## Comments on Results\n",
"\n",
"### 1. R² (Full Sample)\n",
"- **Pattern**: Should show monotonic increase with more features\n",
"- **Interpretation**: More complex models always fit the training data better\n",
"- **Expected behavior**: R² approaches 1.0 as we add more polynomial features\n",
"\n",
"# Create output directory\n",
"output_dir = '../output'\n",
"os.makedirs(output_dir, exist_ok=True)\n",
"### 2. Adjusted R² (Full Sample) \n",
"- **Pattern**: Should peak at optimal complexity, then decline\n",
"- **Interpretation**: Complexity penalty prevents overfitting in model selection\n",
"- **Expected behavior**: Inverted U-shape showing optimal model complexity\n",
"\n",
"# Save results\n",
"results_df.to_csv(f'{output_dir}/overfitting_results_corrected.csv', index=False)\n",
"print(f\"Results saved to {output_dir}/overfitting_results_corrected.csv\")\n",
"### 3. Out-of-Sample R²\n",
"- **Pattern**: Should start out reasonably high, then deteriorate with high complexity\n",
"- **Interpretation**: Classic overfitting - performance degrades on unseen data\n",
"- **Expected behavior**: Clear deterioration at high feature counts (500+)\n",
"\n",
"print(\"\\n🎉 CORRECTED overfitting analysis complete!\")\n",
"print(\"Data generation follows class example with:\")\n",
"print(\"- W ~ Uniform(0,1), sorted, n=1000\")\n",
"print(\"- e ~ Normal(0,1)\")\n",
"print(\"- y = exp(4*W) + e (class example)\")\n",
"print(\"- With intercept for proper estimation\")\n",
"print(\"- Seed = 42 for reproducibility\")"
"### Key Intuition\n",
"- **Exponential relationship** (y = exp(4*W) + e) with **no intercept** creates a complex pattern\n",
"- **Polynomial features** attempt to approximate the exponential function\n",
"- **Low-order polynomials** capture the main trend but miss the curvature\n",
"- **High-order polynomials** overfit to noise, especially under the no-intercept constraint\n",
"- **Out-of-sample evaluation** is crucial for detecting overfitting\n",
"- **Adjusted R²** provides good balance between fit and complexity"
]
}
],