From 2be6a985de8b953e910a2b91053c91d50088c76c Mon Sep 17 00:00:00 2001 From: clippert Date: Mon, 5 May 2025 06:37:20 +0200 Subject: [PATCH 01/43] analytical ridge regression removed some redundand material from prior example --- .../analytical_solution_ridge.md | 34 ++++--------------- 1 file changed, 6 insertions(+), 28 deletions(-) diff --git a/book/chapter_calculus/analytical_solution_ridge.md b/book/chapter_calculus/analytical_solution_ridge.md index 351be76..bd0354d 100644 --- a/book/chapter_calculus/analytical_solution_ridge.md +++ b/book/chapter_calculus/analytical_solution_ridge.md @@ -94,7 +94,7 @@ class RidgeRegression: ## Example usage -We will use the Ridge regression implementation to fit a model to the maximum temperature data from the year 1900. The data is available in the `data_train` and `data_test` variables, which contain the training and testing datasets, respectively. We will fit a model based on three tanh basis functions to the data and evaluate its performance using Mean Squared Error (MSE). +We will use the Ridge regression implementation to fit a model to the maximum temperature data from the year 1900. We will fit a model based on three tanh basis functions with the fixed parameters defined before, without optimizing over the basis functions. The model is given by @@ -115,27 +115,14 @@ $$ a_1 = 0.1, \quad a_2 = 0.2, \quad a_3 = 0.3 \quad \text{and} \quad b_1 = -10, \quad b_2 = -50, \quad b_3 = -100.0 $$ -To streamline the implementation, we will collect the hyperparameters for all basis functions $\phi_i$ in a single matrix $\mathbf{W}_\phi$: -$$ - \mathbf{W}_\phi = \begin{pmatrix} - a_1 & a_2 & a_3 \\ - b_1 & b_2 & b_3 - \end{pmatrix} -$$ - -Using this notation, we can express the tanh basis functions as: - -$$ - \boldsymbol{\phi}(x; \mathbf{W}_\phi) = - \begin{pmatrix} - \tanh(\mathbf{W}_\phi[0,i] x + \mathbf{W}_\phi[1,i]) - \end{pmatrix}_{i=1}^3 -$$ +```{code-cell} ipython3 +:tags: [hide-input] -We implement the tanh basis functions in a class called `TanhBasis`. The class has two methods: `XW` and `transform`. The `XW` method computes the product of the input data and the weights, while the `transform` method computes the tanh basis functions. +import numpy as np +import pandas as pd +import matplotlib.pyplot as plt -```{code-cell} ipython3 import numpy as np class TanhBasis: @@ -151,16 +138,7 @@ class TanhBasis: def transform(self, x): """Compute the tanh basis functions.""" return np.tanh(self.XW(x)) -``` - -Let's use the `TanhBasis` class to fit a Ridge regression model to the maximum temperature data from the year 1900. We will use three tanh basis functions with the specified hyperparameters. 
- -```{code-cell} ipython3 -:tags: [hide-input] -import numpy as np -import pandas as pd -import matplotlib.pyplot as plt YEAR = 1900 def load_weather_data(year = None): """ From 4a278d3e3fe93896c7e6d88494a2260b3d42bfa0 Mon Sep 17 00:00:00 2001 From: clippert Date: Mon, 5 May 2025 06:40:29 +0200 Subject: [PATCH 02/43] analytical ridge regression renamed chapter --- book/_toc.yml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/book/_toc.yml b/book/_toc.yml index d45cc1a..b57d963 100644 --- a/book/_toc.yml +++ b/book/_toc.yml @@ -66,7 +66,7 @@ parts: - file: chapter_calculus/minima_first_order_condition title: First Order Condition - file: chapter_calculus/analytical_solution_ridge - title: Ridge Regression + title: Analytical Ridge Regression - file: chapter_calculus/line_search title: Line Search - file: chapter_calculus/hessian From 586b682ea3afdb62323da231327dfa93ecd8150c Mon Sep 17 00:00:00 2001 From: clippert Date: Mon, 5 May 2025 07:12:08 +0200 Subject: [PATCH 03/43] quadratic optimization --- .../analytical_solution_ridge.md | 67 ++++++++++++++++++- 1 file changed, 64 insertions(+), 3 deletions(-) diff --git a/book/chapter_calculus/analytical_solution_ridge.md b/book/chapter_calculus/analytical_solution_ridge.md index bd0354d..b08dc6f 100644 --- a/book/chapter_calculus/analytical_solution_ridge.md +++ b/book/chapter_calculus/analytical_solution_ridge.md @@ -10,9 +10,12 @@ kernelspec: language: python name: python3 --- -# Analytical Solution for Ridge Regression +# Ridge Regression as a Quadratic Optimization Problem + So far, we have optimized ridge regression using the gradient descent algorithm. -However, the first order condition tells us that at the minimum of the objective function, the gradient should vanish. We will use this knowledge to derive an analytical solution to the weights in ridge regression. +However, the first order condition tells us that at the minimum of the objective function, the gradient should vanish. We will use this knowledge to derive an analytical solution to the weights in ridge regression. We will show that Ridge Regression belongs to the set of quadratic Optimization Problems and will show how to solve quadratic optimization problems analytically. + +## Ride Regression The objective function for Ridge regression is given by: @@ -215,4 +218,62 @@ ax = plt.ylabel("Maximum Temperature - degree C") ax = plt.title("Year : %i N : %i" % (YEAR, N_train)) ``` We see that we obtain the nearly identical solution to the version using gradient descent. -However, in this version it would require some additional work to optimize over the basis function parameters. \ No newline at end of file +However, in this version it would require some additional work to optimize over the basis function parameters. + + +## Ridge Regression as a Qquadratic Optimization Problem + +Many problems in machine learning and statistics reduce to minimizing a **quadratic function** of the form + +$$ +f(\mathbf{w}) = \frac{1}{2} \mathbf{w}^\top \mathbf{A} \mathbf{w} - \mathbf{b}^\top \mathbf{w} +$$ + +where $\mathbf{A} \in \mathbb{R}^{d \times d}$ is a **symmetric positive definite** matrix, and $\mathbf{b} \in \mathbb{R}^d$. 
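To make the next step explicit, recall the two standard gradient identities for a symmetric matrix $\mathbf{A}$ (added here as a brief reminder):

$$
\nabla_{\mathbf{w}}\left(\tfrac{1}{2}\,\mathbf{w}^\top \mathbf{A}\,\mathbf{w}\right) = \mathbf{A}\,\mathbf{w},
\qquad
\nabla_{\mathbf{w}}\left(\mathbf{b}^\top \mathbf{w}\right) = \mathbf{b}.
$$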
The minimum of this function can be found analytically by setting the gradient to zero: + +$$ +\nabla f(\mathbf{w}) = \mathbf{A} \mathbf{w} - \mathbf{b} = 0 \quad \Rightarrow \quad \boxed{\mathbf{w} = \mathbf{A}^{-1} \mathbf{b}} +$$ + +--- + +### Ridge Regression as a Special Case + +The Ridge Regression objective can be rewritten in this general form. Starting from: + +$$ +f(\mathbf{w}) = \frac{1}{2} \|\mathbf{y} - \mathbf{Xw}\|^2_2 + \frac{\lambda}{2} \|\mathbf{w}\|^2_2 +$$ + +we expand the squared norm: + +$$ +f(\mathbf{w}) = \frac{1}{2} (\mathbf{y} - \mathbf{Xw})^\top (\mathbf{y} - \mathbf{Xw}) + \frac{\lambda}{2} \mathbf{w}^\top \mathbf{w} +$$ + +$$ += \frac{1}{2} \left[ \mathbf{y}^\top \mathbf{y} - 2 \mathbf{y}^\top \mathbf{Xw} + \mathbf{w}^\top \mathbf{X}^\top \mathbf{X} \mathbf{w} \right] + \frac{\lambda}{2} \mathbf{w}^\top \mathbf{w} +$$ + +Dropping the constant term $\frac{1}{2} \mathbf{y}^\top \mathbf{y}$, the expression becomes: + +$$ +f(\mathbf{w}) = \frac{1}{2} \mathbf{w}^\top (\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I}) \mathbf{w} - \mathbf{w}^\top \mathbf{X}^\top \mathbf{y} +$$ + +This matches the generalized quadratic form with: + +* $\mathbf{A} = \mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I}$ +* $\mathbf{b} = \mathbf{X}^\top \mathbf{y}$ + +Since $\mathbf{A}$ is symmetric and positive definite for $\lambda > 0$, the minimum is achieved at: + +$$ +\boxed{ +\mathbf{w} = (\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{X}^\top \mathbf{y} +} +$$ + +--- + +This perspective makes it clear that Ridge Regression is simply a **quadratic optimization problem with a symmetric positive definite matrix**, and therefore has a unique analytical solution. This also connects to broader optimization theory and prepares us to explore other models — including Bayesian linear regression, kernel methods, and even Newton’s method — through the lens of **solving linear systems**. From fb7579fc6df4ef3d2b5410ceab28bc3bebd4446c Mon Sep 17 00:00:00 2001 From: clippert Date: Mon, 5 May 2025 07:26:04 +0200 Subject: [PATCH 04/43] renamed section of ridge to quadratic optimization --- book/_toc.yml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/book/_toc.yml b/book/_toc.yml index b57d963..4a9ee51 100644 --- a/book/_toc.yml +++ b/book/_toc.yml @@ -66,7 +66,7 @@ parts: - file: chapter_calculus/minima_first_order_condition title: First Order Condition - file: chapter_calculus/analytical_solution_ridge - title: Analytical Ridge Regression + title: Quadratic Optimization - file: chapter_calculus/line_search title: Line Search - file: chapter_calculus/hessian From 634b4ed0aa0a81cfa03ce14d6ec83adc9e193a87 Mon Sep 17 00:00:00 2001 From: clippert Date: Mon, 5 May 2025 07:27:22 +0200 Subject: [PATCH 05/43] renamed section of ridge to quadratic optimization --- book/chapter_calculus/analytical_solution_ridge.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/book/chapter_calculus/analytical_solution_ridge.md b/book/chapter_calculus/analytical_solution_ridge.md index b08dc6f..8970305 100644 --- a/book/chapter_calculus/analytical_solution_ridge.md +++ b/book/chapter_calculus/analytical_solution_ridge.md @@ -221,7 +221,7 @@ We see that we obtain the nearly identical solution to the version using gradien However, in this version it would require some additional work to optimize over the basis function parameters. 
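As a quick sanity check of the closed-form solution, the following minimal sketch builds a small synthetic problem (the random data, shapes, and regularization strength below are illustrative assumptions, not the weather data used above), solves the linear system $(\mathbf{X}^\top\mathbf{X} + \lambda \mathbf{I})\,\mathbf{w} = \mathbf{X}^\top\mathbf{y}$, and confirms that the gradient vanishes at the solution:

```{code-cell} ipython3
import numpy as np

rng = np.random.default_rng(0)            # synthetic toy problem (assumption, not the weather data)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)
lam = 0.1                                 # illustrative regularization strength

A = X.T @ X + lam * np.eye(X.shape[1])
b = X.T @ y

w_closed = np.linalg.solve(A, b)          # prefer solving the system over forming an explicit inverse
print("w =", w_closed)
print("gradient norm at w:", np.linalg.norm(A @ w_closed - b))  # approximately 0 at the minimum
```

Solving the linear system directly with `np.linalg.solve` is numerically preferable to computing $\mathbf{A}^{-1}$ explicitly.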
-## Ridge Regression as a Qquadratic Optimization Problem +## Quadratic Optimization Problems Many problems in machine learning and statistics reduce to minimizing a **quadratic function** of the form From 7cfbd66227349fe84d1a73c6446418f88663c5c8 Mon Sep 17 00:00:00 2001 From: clippert Date: Mon, 5 May 2025 17:58:15 +0200 Subject: [PATCH 06/43] changed toc for remaining content --- book/_toc.yml | 15 +++++++++------ book/appendix/scalar-scalar_chain_rule.md | 11 +++++++++-- 2 files changed, 18 insertions(+), 8 deletions(-) diff --git a/book/_toc.yml b/book/_toc.yml index 4a9ee51..bff41b7 100644 --- a/book/_toc.yml +++ b/book/_toc.yml @@ -13,7 +13,7 @@ parts: chapters: # week 1 - file: chapter_ml_basics/intro - title: Machine Learning Basics + title: Machine Learning Problems sections: - file: chapter_ml_basics/classification title: Classification @@ -73,7 +73,8 @@ parts: title: Hessian # study the properties of matrices # - file: chapter_decompositions/overview_decompositions # chapter_linear_algebra/linear_algebra -# sections: +# title: Matrix Analysis and Decompositions +# sections: # - file: chapter_decompositions/eigenvectors # - file: chapter_decompositions/trace_determinant # - file: chapter_decompositions/orthogonal_matrices @@ -84,7 +85,12 @@ parts: # - file: chapter_decompositions/pseudoinverse # - file: chapter_decompositions/low_rank_approximation # - file: chapter_decompositions/matrix_norms +# - file: chapter_convexity/overview_convexity +# title: Convexity +# sections: +# - file: chapter_convexity/convexity # continue with second order optimization +# title: Second-Order Optimization # - file: chapter_calculus/newtons_method # title: Newton's Method # - file: chapter_taylor/minima_second_order_condition @@ -93,16 +99,13 @@ parts: # - file: chapter_calculus/orthogonal_projections # - file: chapter_taylor/overview_taylor # sections: -# - file: chapter_convexity/overview_convexity -# sections: -# - file: chapter_convexity/convexity # - file: chapter_optimization/overview_optimization # sections: # - file: chapter_optimization/optimization # - file: chapter_optimization/optimization_second_order # - file: chapter_optimization/bfgs -# - file: chapter_optimization/orthogonal_projection # - file: chapter_probability/overview_probability +# title: Probability and Random Variables # sections: # - file: chapter_probability/probability_basics # - file: chapter_probability/random_variables diff --git a/book/appendix/scalar-scalar_chain_rule.md b/book/appendix/scalar-scalar_chain_rule.md index a939512..7dd46f7 100644 --- a/book/appendix/scalar-scalar_chain_rule.md +++ b/book/appendix/scalar-scalar_chain_rule.md @@ -1,5 +1,8 @@ # The Chain Rule for Scalar-Scalar Functions +The **Chain Rule** is a fundamental theorem in calculus that describes how to differentiate composite functions. It states that if you have two functions, $f$ and $g$, and you want to differentiate their composition $f(g(x))$, you can do so by multiplying the derivative of $f$ evaluated at $g(x)$ by the derivative of $g$ evaluated at $x$. +This is particularly useful when dealing with functions that are composed of other functions, as it allows us to break down the differentiation process into manageable parts. + ::: {prf:theorem} scalar-scalar chain rule :label: thm-scalar-scalar-chain-rule-appendix :nonumber: @@ -33,7 +36,9 @@ Set $$ \Delta u = g(x_0+\Delta x)-g(x_0), $$ -so that $\Delta u\to0$ and $\tfrac{\Delta u}{\Delta x}\to g'(x_0)$ by differentiability of $g$. 
We now write +so that $\Delta u\to0$ and $\tfrac{\Delta u}{\Delta x}\to g'(x_0)$ by differentiability of $g$. + +We now write $$ \frac{f\bigl(g(x_0+\Delta x)\bigr)-f\bigl(g(x_0)\bigr)}{\Delta x} @@ -48,7 +53,9 @@ $$ \frac{f(u_0+\Delta u)-f(u_0)}{\Delta u} = f'(\xi). $$ -As $\Delta x\to0$, we have $\xi\to u_0$, and hence $f'(\xi)\to f'(u_0)$. Therefore +As $\Delta x\to0$, we have $\xi\to u_0$, and hence $f'(\xi)\to f'(u_0)$. + +Therefore $$ h'(x_0) From 7de6a28c5d1796727334bea38d5e4ee3de765841 Mon Sep 17 00:00:00 2001 From: clippert Date: Tue, 6 May 2025 23:03:04 +0200 Subject: [PATCH 07/43] added taylors theorem --- book/_toc.yml | 5 + book/appendix/Clairauts_theorem.md | 77 ++++ book/appendix/squeeze_theorem.md | 1 + book/chapter_calculus/hessian.md | 391 +++++++++++++++----- book/chapter_calculus/taylors_theorem.md | 432 +++++++++++++++++++++++ 5 files changed, 821 insertions(+), 85 deletions(-) create mode 100644 book/appendix/Clairauts_theorem.md create mode 100644 book/chapter_calculus/taylors_theorem.md diff --git a/book/_toc.yml b/book/_toc.yml index bff41b7..6148730 100644 --- a/book/_toc.yml +++ b/book/_toc.yml @@ -71,6 +71,9 @@ parts: title: Line Search - file: chapter_calculus/hessian title: Hessian + - file: chapter_calculus/taylors_theorem +# - file: chapter_calculus/irls +# title: Iteratively Re-Weighted Least Squares # study the properties of matrices # - file: chapter_decompositions/overview_decompositions # chapter_linear_algebra/linear_algebra # title: Matrix Analysis and Decompositions @@ -152,6 +155,8 @@ parts: title: First Fundamental Theorem of Calculus - file: appendix/second_fundamental_theorem_calculus title: Second Fundamental Theorem of Calculus + - file: appendix/Clairauts_theorem + title: Clairaut's Theorem - file: appendix/differentiation_rules title: Differentiation Rules # sections: diff --git a/book/appendix/Clairauts_theorem.md b/book/appendix/Clairauts_theorem.md new file mode 100644 index 0000000..16f3872 --- /dev/null +++ b/book/appendix/Clairauts_theorem.md @@ -0,0 +1,77 @@ +# Symmetry of Mixed Partial Derivatives (Clairaut’s Theorem) + +:::{prf:theorem} Clairaut Schwarz +:label: thm-Clairaut-appendix +:nonumber: + +Let $f: \mathbb{R}^2 \to \mathbb{R}$ be a function such that both mixed partial derivatives $\frac{\partial^2 f}{\partial x \partial y}$ and $\frac{\partial^2 f}{\partial y \partial x}$ exist and are **continuous** on an open set containing a point $(x_0, y_0)$ + +Then: + +$$ +\boxed{ +\frac{\partial^2 f}{\partial x \partial y}(x_0, y_0) = \frac{\partial^2 f}{\partial y \partial x}(x_0, y_0) +} +$$ + +That is, **the order of differentiation can be interchanged**. +::: + +## Intuition + +If a function is smooth enough (specifically, if the second-order partial derivatives exist and are continuous), then the "curvature" in the $x$ direction after differentiating in the $y$ direction is the same as the curvature in the $y$ direction after differentiating in the $x$ direction. + +--- + +## Proof Sketch + +We will sketch a proof using the **mean value theorem** and the definition of partial derivatives. Let’s assume that $f$ has continuous second partial derivatives in an open rectangle around the point $(x_0, y_0)$. 
+ +Define: + +$$ +F(h,k) = \frac{f(x_0 + h, y_0 + k) - f(x_0 + h, y_0) - f(x_0, y_0 + k) + f(x_0, y_0)}{hk} +$$ + +Then, as $h, k \to 0$, $F(h,k) \to \frac{\partial^2 f}{\partial y \partial x}(x_0, y_0)$ and also $F(h,k) \to \frac{\partial^2 f}{\partial x \partial y}(x_0, y_0)$, provided the second partial derivatives are continuous. + +### Step-by-step: + +1. By the **Mean Value Theorem**, the numerator of $F(h,k)$ can be interpreted as a finite difference approximation to a mixed partial derivative. +2. Using Taylor’s Theorem with remainder, or via integral representations of derivatives, one can show that: + + $$ + \lim_{(h,k) \to (0,0)} F(h,k) = \frac{\partial^2 f}{\partial x \partial y}(x_0, y_0) + $$ + + and also + + $$ + \lim_{(h,k) \to (0,0)} F(h,k) = \frac{\partial^2 f}{\partial y \partial x}(x_0, y_0) + $$ + + due to continuity of the second derivatives. +3. Hence, the limits agree and the mixed partials are equal. + +Therefore: + +$$ +\frac{\partial^2 f}{\partial x \partial y}(x_0, y_0) = \frac{\partial^2 f}{\partial y \partial x}(x_0, y_0) +$$ + +--- + +## When Clairaut's Theorem **Does Not Apply** + +If the second-order mixed partial derivatives exist but are **not continuous**, the symmetry may fail. A classic counterexample is: + +$$ +f(x, y) = +\begin{cases} +\frac{xy(x^2 - y^2)}{x^2 + y^2}, & \text{if } (x, y) \neq (0, 0) \\ +0, & \text{if } (x, y) = (0, 0) +\end{cases} +$$ + +This function has both mixed partial derivatives at the origin, but they are not equal because they are not continuous there. + diff --git a/book/appendix/squeeze_theorem.md b/book/appendix/squeeze_theorem.md index aadeb76..aab7f2b 100644 --- a/book/appendix/squeeze_theorem.md +++ b/book/appendix/squeeze_theorem.md @@ -14,6 +14,7 @@ kernelspec: :::{prf:theorem} Squeeze theorem :label: squeeze_theorem-appendix +:nonumber: Let $g(x), h(x), f(x)$ be functions defined near $c$. Suppose that there is an open interval around $c$, except possibly at $c$ itself, such that: diff --git a/book/chapter_calculus/hessian.md b/book/chapter_calculus/hessian.md index 96326c8..5f20568 100644 --- a/book/chapter_calculus/hessian.md +++ b/book/chapter_calculus/hessian.md @@ -12,8 +12,10 @@ kernelspec: --- # The Hessian -The **Hessian** matrix of a scalar-valued function $ f : \mathbb{R}^d \to \mathbb{R} $ is a square matrix -of second-order partial derivatives: +In one variable, the second derivative of a function is a number that tells us about the curvature of the function. +But in many variables, each partial derivative can change in many directions—so we need a **matrix** of second derivatives: + +The **Hessian** matrix of a scalar-valued function $ f : \mathbb{R}^d \to \mathbb{R} $ is a square matrix of second-order partial derivatives: $$\nabla^2 f(\mathbf{x}) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \dots & \frac{\partial^2 f}{\partial x_1 \partial x_d} \\ @@ -22,119 +24,338 @@ $$\nabla^2 f(\mathbf{x}) = \begin{bmatrix} \end{bmatrix}, \quad\text{i.e.,}\quad [\nabla^2 f]_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j} $$ -If the second partial derivatives are continuous (as they often are in optimization), then by **Clairaut's theorem**, -the order of differentiation can be interchanged: $ \frac{\partial^2 f}{\partial x_i \partial x_j} = \frac{\partial^2 f}{\partial x_j \partial x_i} $, -which implies that the Hessian matrix is symmetric. 
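The counterexample from the Clairaut discussion above can also be probed numerically. The following minimal sketch (added for illustration; the step sizes are arbitrary choices) estimates the two mixed partial derivatives of $f(x,y) = xy(x^2 - y^2)/(x^2 + y^2)$ at the origin by finite differences and obtains different values, roughly $-1$ and $+1$:

```{code-cell} ipython3
def f(x, y):
    # the classic counterexample: both mixed partials exist at the origin but are not continuous there
    if x == 0.0 and y == 0.0:
        return 0.0
    return x * y * (x**2 - y**2) / (x**2 + y**2)

h, k = 1e-6, 1e-3   # inner step for the first derivative, larger outer step for the second

fx = lambda y: (f(h, y) - f(-h, y)) / (2 * h)   # approximates df/dx along the y-axis
fy = lambda x: (f(x, h) - f(x, -h)) / (2 * h)   # approximates df/dy along the x-axis

print((fx(k) - fx(-k)) / (2 * k))   # derivative of df/dx in the y-direction at the origin: about -1
print((fy(k) - fy(-k)) / (2 * k))   # derivative of df/dy in the x-direction at the origin: about +1
```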
---- +:::{prf:theorem} Clairaut Schwarz +:label: thm-Clairaut +:nonumber: -## First-Order Taylor Expansion +Let $f: \mathbb{R}^d \to \mathbb{R}$ be a function such that both mixed partial derivatives $\frac{\partial^2 f}{\partial x_i \partial x_j}$ and $\frac{\partial^2 f}{\partial x_i \partial x_j}$ exist and are **continuous** on an open set containing a point $\mathbf{x}_0$ -Recall, that we can create a locally linear approximation to a function at a point $\mathbf{x}_0 \in \mathbb{R}^d $ using the gradient at $\nabla f(\mathbf{x}_0)$. +Then: -$$ f(\mathbf{x}) \approx f(\mathbf{x}_0) + \nabla f(\mathbf{x}_0)^\top (\mathbf{x} - \mathbf{x}_0) . $$ +$$ +\boxed{ +\frac{\partial^2 f}{\partial x_i \partial x_j}(\mathbf{x}_0) = \frac{\partial^2 f}{\partial y \partial x}(\mathbf{x}_0) +} +$$ -This affine approximation is also known as the **first-order Taylor approximation**. +That is, **the order of differentiation can be interchanged**. +::: -It agrees with the original function in value and gradient at the point $ \mathbf{x}_0 $. -## Second-Order Taylor Expansion +Clairut's Theorem implies that the Hessian matrix is symmetric. We provide a proof sketch in the appendix. -The Hessian appears naturally in the **second-order Taylor approximation** of a function around a point $ \mathbf{x}_0 \in \mathbb{R}^d $. -For a sufficiently smooth function $ f : \mathbb{R}^d \to \mathbb{R} $, we can approximate its values near $ \mathbf{x}_0 $ as: +## **Curvature in One Dimension** -$$ f(\mathbf{x}) \approx f(\mathbf{x}_0) + \nabla f(\mathbf{x}_0)^\top (\mathbf{x} - \mathbf{x}_0) + \frac{1}{2}(\mathbf{x} - \mathbf{x}_0)^\top \nabla^2 f(\mathbf{x}_0)(\mathbf{x} - \mathbf{x}_0). $$ +Recall the second derivative in one dimension: -This is a **local quadratic approximation** to the function. It agrees with the original function in value, gradient, and Hessian at the point $ \mathbf{x}_0 $. +* $f(x) = x^2$: curve is "smiling" ⇒ second derivative is positive ⇒ function is curving upward. +* $f(x) = -x^2$: curve is "frowning" ⇒ second derivative is negative ⇒ function is curving downward. +* Point: second derivative tells us **how the function curves**. -### Interpretation: -- The **gradient** term captures the linear behavior (slope) of the function near $ \mathbf{x}_0 $. -- The **Hessian** term captures the curvature — how the gradient changes in different directions. +```{code-cell} ipython3 +:tags: [hide-input] +import numpy as np +import matplotlib.pyplot as plt ---- +x = np.linspace(-2, 2, 400) +f1 = x**2 +f2 = -x**2 +f1_dd = np.full_like(x, 2) # Second derivative of x^2 +f2_dd = np.full_like(x, -2) # Second derivative of -x^2 + +fig, axes = plt.subplots(1, 2, figsize=(10, 4)) + +# Plot for f(x) = x^2 +axes[0].plot(x, f1, label='$f(x) = x^2$') +axes[0].plot(x, f1_dd, '--', label='$f\'\'(x) = 2$') +axes[0].set_title('Positive Curvature') +axes[0].legend() +axes[0].grid(True) + +# Plot for f(x) = -x^2 +axes[1].plot(x, f2, label='$f(x) = -x^2$') +axes[1].plot(x, f2_dd, '--', label='$f\'\'(x) = -2$') +axes[1].set_title('Negative Curvature') +axes[1].legend() +axes[1].grid(True) + +plt.suptitle("Second Derivative as Curvature in 1D", fontsize=14) +plt.tight_layout() +plt.show() +``` -We illustrate both the first-order and the second-order Taylor expansion using the following function. +The Hessian generalizes this intuition to multiple Dimensions. 
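To connect the symmetry statement to something concrete, here is a small symbolic sketch (added for illustration; the example function is an arbitrary smooth choice): it computes a Hessian with `sympy` and confirms that the mixed partial derivatives agree, as Clairaut's theorem predicts.

```{code-cell} ipython3
import sympy as sp

x, y = sp.symbols('x y')
f = sp.log(1 + x**2 + y**2)      # an arbitrary smooth example function (illustrative choice)

H = sp.hessian(f, (x, y))        # matrix of all second-order partial derivatives
print(sp.simplify(H[0, 1] - H[1, 0]))   # prints 0: the two mixed partials coincide
sp.simplify(H)
```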
-$$ f(x, y) = \log(1 + x^2 + y^2) $$ +## **Curvature in Two Dimensions** -We compute the first-order and second-order Taylor approximations at the point $ (x_0, y_0) = (0.3, 0.3) $. +Now, let's look at a simple 2D surface like: -The true function and the linear approximation match in value and gradient at the point $ (x_0, y_0)$ but differ elsewhere. Similarly, the quadratic approximation match in value, gradient, and Hessian at this point but differ elsewhere. +* $f(x, y) = x^2 + y^2$: bowl shape +* $f(x, y) = x^2 - y^2$: saddle shape ```{code-cell} ipython3 :tags: [hide-input] -import numpy as np -import matplotlib.pyplot as plt from mpl_toolkits.mplot3d import Axes3D +from matplotlib import cm -# Define the function and its gradient and Hessian -f = lambda x, y: np.log(1 + x**2 + y**2) -x0, y0 = 0.3, 0.3 - -# Compute value, gradient, and Hessian at (x0, y0) -r2 = x0**2 + y0**2 -f0 = np.log(1 + r2) -grad = np.array([2*x0, 2*y0]) / (1 + r2) -H = (2 / (1 + r2)) * np.eye(2) - (4 * np.outer([x0, y0], [x0, y0])) / (1 + r2)**2 - -# Taylor expansion up to second order -def f_taylor_first_order(x, y): - dx = x - x0 - dy = y - y0 - delta = np.array([dx, dy]) - return f0 + (grad @ delta).sum() - -# Taylor expansion up to second order -def f_taylor_second_order(x, y): - dx = x - x0 - dy = y - y0 - delta = np.array([dx, dy]) - return f0 + (grad @ delta).sum() + 0.5 * (delta @ H @ delta).sum() - -# Create grid for plotting -x_vals = np.linspace(x0-1, x0+1, 100) -y_vals = np.linspace(y0-1, y0+1, 100) -X, Y = np.meshgrid(x_vals, y_vals) -Z_true = f(X,Y) -Z_first = np.zeros(X.shape) -Z_second = np.zeros(X.shape) - -for i in range(X.shape[0]): - for j in range(X.shape[1]): - Z_first[i,j] = f_taylor_first_order(X[i,j],Y[i,j]) - Z_second[i,j] = f_taylor_second_order(X[i,j],Y[i,j]) - -# Plot both Taylor approximations -fig = plt.figure(figsize=(14, 6)) -ax1 = fig.add_subplot(1, 2, 1, projection='3d') -ax2 = fig.add_subplot(1, 2, 2, projection='3d') +x = np.linspace(-2, 2, 100) +y = np.linspace(-2, 2, 100) +X, Y = np.meshgrid(x, y) + +# Bowl: f(x, y) = x^2 + y^2 +Z_bowl = X**2 + Y**2 + +# Saddle: f(x, y) = x^2 - y^2 +Z_saddle = X**2 - Y**2 + +fig = plt.figure(figsize=(12, 5)) -true_surface1 = ax1.plot_surface(X, Y, Z_true, cmap='viridis', alpha=0.6) -approx_surface1 = ax1.plot_surface(X, Y, Z_first, cmap='coolwarm', alpha=0.7) -ax1.scatter(x0, y0, f0, color='red', s=50, label=r'$\mathbf{x}_0$') -ax1.set_title("First-Order Taylor Approximation") +# Bowl surface +ax1 = fig.add_subplot(1, 2, 1, projection='3d') +ax1.plot_surface(X, Y, Z_bowl, cmap=cm.viridis, alpha=0.9) +ax1.set_title("Bowl: $f(x, y) = x^2 + y^2$") ax1.set_xlabel("x") ax1.set_ylabel("y") -ax1.legend() -ax1.set_zlim([-0.5,2]) +ax1.set_zlabel("f(x, y)") +# Add annotations for curvature +ax1.text(0, 0, 0, '∂²f/∂x² = 2\n∂²f/∂y² = 2', fontsize=10) -true_surface2 = ax2.plot_surface(X, Y, Z_true, cmap='viridis', alpha=0.6) -approx_surface2 = ax2.plot_surface(X, Y, Z_second, cmap='coolwarm', alpha=0.7) -ax2.scatter(x0, y0, f0, color='red', s=50, label=r'$\mathbf{x}_0$') -ax2.set_title("Second-Order Taylor Approximation") +# Saddle surface +ax2 = fig.add_subplot(1, 2, 2, projection='3d') +ax2.plot_surface(X, Y, Z_saddle, cmap=cm.coolwarm, alpha=0.9) +ax2.set_title("Saddle: $f(x, y) = x^2 - y^2$") ax2.set_xlabel("x") ax2.set_ylabel("y") -ax2.legend() -ax2.set_zlim([-0.5,2]) +ax2.set_zlabel("f(x, y)") +ax2.text(0, 0, 0, '∂²f/∂x² = 2\n∂²f/∂y² = -2', fontsize=10) + +plt.suptitle("Curvature in 2D: Bowl vs Saddle", fontsize=14) +plt.tight_layout() +plt.show() 
+``` + +At each point, the function curves more or less in certain directions. The Hessian is a matrix that captures all this curvature information—it tells us how the slope (the gradient) changes in every direction. + +--- + +### **A Simple Example** + + +$$ +f(x, y) = 3x^2 + 2xy + y^2 +$$ + +* $\frac{\partial f}{\partial x} = 6x + 2y$ +* $\frac{\partial f}{\partial y} = 2x + 2y$ +* Hessian: + + $$ + \nabla^2 f = \begin{bmatrix} + 6 & 2 \\ + 2 & 2 + \end{bmatrix} + $$ + +Each entry corresponds to a second derivative—either in the x-direction, y-direction, or mixed for the off-diagonals. + +## Gradient Vector Fields +The **Hessian matrix** describes how the **gradient vector** changes as you move through space. Let's visualize this in a grid with arrows pointing in the direction of the gradient — i.e., where the function increases most steeply. + +```{code-cell} ipython3 +:tags: [hide-input] +x = np.linspace(-3, 3, 30) +y = np.linspace(-3, 3, 30) +X, Y = np.meshgrid(x, y) + +# Gradients +U_bowl = 2 * X +V_bowl = 2 * Y + +U_saddle = 2 * X +V_saddle = -2 * Y + +fig, axes = plt.subplots(1, 2, figsize=(12, 5)) + +# Bowl gradient field +axes[0].quiver(X, Y, U_bowl, V_bowl, color='green') +axes[0].set_title('Gradient Field: $f(x, y) = x^2 + y^2$') +axes[0].set_xlabel('x') +axes[0].set_ylabel('y') +axes[0].axis('equal') +axes[0].grid(True) +axes[0].set_ylim([-2.3,2.3]) +axes[0].set_xlim([-2.3,2.3]) + +# Saddle gradient field +axes[1].quiver(X, Y, U_saddle, V_saddle, color='blue') +axes[1].set_title('Gradient Field: $f(x, y) = x^2 - y^2$') +axes[1].set_xlabel('x') +axes[1].set_ylabel('y') +axes[1].axis('equal') +axes[1].grid(True) +axes[1].set_ylim([-2.3,2.3]) +axes[1].set_xlim([-2.3,2.3]) + + +plt.suptitle("Gradient Vector Fields Show How ∇f Changes", fontsize=14) plt.tight_layout() plt.show() ``` -This visualization shows how the first-order (left) and second-order (right) Taylor expansions approximate the original function locally around the point $ (0.3,0.3) $, but deviates farther away. Both approximations are shown in blue to red and the original function in yellow to green colors. +* The **gradient vector field** shows how gradients vary over space. +* The **Hessian** is the *rate of change of the gradient*—it tells you how steep the slope is getting in every direction. +* The direction and length of arrows = the **gradient vector** at each point. +* The **rate of change** of those arrows = what the **Hessian** captures. + +--- + +## 🔍 How This Works in the Two Examples + +### 🟢 **Bowl: $f(x, y) = x^2 + y^2$** + +* **Gradient**: $\nabla f(x, y) = [2x,\ 2y]$ +* **Hessian**: + + $$ + \nabla^2 f = \begin{bmatrix} + 2 & 0 \\ + 0 & 2 + \end{bmatrix} + $$ + +This means: + +* In the **x-direction**, the gradient increases by 2 units per unit of x. +* In the **y-direction**, the gradient increases by 2 units per unit of y. +* The gradient field shows arrows pointing radially outward—getting longer linearly with distance from the origin. +* This **linear increase** in slope is exactly what the constant entries (2) in the Hessian mean. -## Summary and Outlook -An advantage of a local quadratic approximation is that we can find its minimum analytically. -This idea lies at the heart of **Newton's method**. -The Hessian matrix also allows us also to better understand the properties of stationary points of a function and derive **second-order conditions of minima**. 
+### 🔵 **Saddle: $f(x, y) = x^2 - y^2$** + +* **Gradient**: $\nabla f(x, y) = [2x,\ -2y]$ +* **Hessian**: + + $$ + \nabla^2 f = \begin{bmatrix} + 2 & 0 \\ + 0 & -2 + \end{bmatrix} + $$ + +This means: + +* In the **x-direction**, the gradient increases at the same rate as before: 2 per unit of x. +* In the **y-direction**, the gradient **decreases** (negative rate): -2 per unit of y. +* The gradient field shows **outward arrows** in the x-direction, but **inward arrows** in the y-direction. +* That flip in sign in the **Hessian entry $\partial^2 f/\partial y^2 = -2$** explains why the gradient pulls you *toward* the origin in y. + +## 🧩 Optional Extension: The Hessian as Jacobian of the Gradient + +We can think of the Hessian as the **Jacobian of the gradient** — it's the matrix of all partial derivatives of the components of the gradient vector field. + +That is: + +$$ +\nabla f(x, y) = +\begin{bmatrix} +\frac{\partial f}{\partial x} \\ +\frac{\partial f}{\partial y} +\end{bmatrix} +\quad\Rightarrow\quad +\nabla^2 f(x, y) = \text{Jacobian}\left( \nabla f(x, y) \right) +$$ + +## Gradient Descent and the Hessian: Why Off-Diagonal Terms Matter + +### 🧠 Key Idea + +Gradient descent minimizes functions by moving in the direction **opposite the gradient**. + +For quadratic functions: + +$$ +f(x) = \frac{1}{2} x^\top A x +\quad \text{with gradient} \quad \nabla f(x) = A x +$$ + +Here, $A$ is the **Hessian matrix**, and it determines the **shape of level sets** and how gradient descent behaves. + +* If $A$ is diagonal → level sets are **axis-aligned ellipses** (or circles). +* If $A$ has off-diagonal elements → ellipses are **rotated**, and gradient descent struggles (zig-zags). + + +```{code-cell} ipython3 +:tags: [hide-input] +import numpy as np +import matplotlib.pyplot as plt + +def gradient_descent(A, x0, lr=0.1, steps=30): + traj = [x0] + x = x0 + for _ in range(steps): + grad = A @ x + x = x - lr * grad + traj.append(x) + return np.array(traj) + +def plot_descent(A, title, lr=0.1): + x = np.linspace(-100, 100, 100) + y = np.linspace(-100, 100, 100) + X, Y = np.meshgrid(x, y) + Z = 0.5 * (A[0,0]*X**2 + 2*A[0,1]*X*Y + A[1,1]*Y**2) + + fig, ax = plt.subplots(figsize=(6, 6)) + ax.contour(X, Y, Z, levels=40, cmap='viridis') + + x0 = np.array([80, 90]) + traj = gradient_descent(A, x0, lr=lr, steps=30) + ax.plot(traj[:,0], traj[:,1], 'ro--', label='GD Path') + + ax.set_title(title) + ax.set_xlabel('x') + ax.set_ylabel('y') + ax.set_aspect('equal') + ax.grid(True) + ax.legend() + plt.show() +``` + +### Case 1: Spherical Hessian (Identity Matrix) + +```{code-cell} ipython3 +A_sphere = np.array([[1, 0], [0, 1]]) +plot_descent(A_sphere, "Spherical Hessian: $A = I$") +``` + +* Level sets are circles. +* Gradient descent takes straight, efficient steps toward the minimum. + + +### Case 2: Anisotropic Hessian (Different Curvatures) + +```{code-cell} ipython3 +:tags: [hide-input] +A_aniso = np.array([[15, 0], [0, 1]]) +plot_descent(A_aniso, "Anisotropic Hessian: $A = \\mathrm{diag}(10, 1)$", lr=0.1) +``` + +* Level sets are stretched ellipses. +* Gradient descent zig-zags, especially in the steep direction. + +--- + +### Case 3: Skewed Hessian (Off-Diagonal Elements) + +```{code-cell} ipython3 +:tags: [hide-input] +A_skew = np.array([[10, 6], [6, 8]]) +plot_descent(A_skew, "Skewed Hessian", lr=0.1) +``` +$A = \begin{bmatrix} 10 & 6 \\ 6 & 8 \end{bmatrix}$ + +* Level sets are rotated ellipses. +* Gradient descent strongly zig-zags and converges slowly. 
+* The skew comes directly from the **off-diagonal elements in the Hessian**. -Before we will explore these two topics further, we first have to better understand the **properties of matrices** such as the Hessian. So let's turn to the topic of **matrix algebra**. +Off-diagonal terms in the Hessian rotate the level curves. Since gradient descent moves perpendicular to level curves, it zig-zags when these are skewed. This is one of the motivations for using **second-order methods** that take the Hessian into account. diff --git a/book/chapter_calculus/taylors_theorem.md b/book/chapter_calculus/taylors_theorem.md new file mode 100644 index 0000000..c31e0cf --- /dev/null +++ b/book/chapter_calculus/taylors_theorem.md @@ -0,0 +1,432 @@ +--- +jupytext: + text_representation: + extension: .md + format_name: myst + format_version: 0.13 + jupytext_version: 1.16.7 +kernelspec: + display_name: Python 3 + language: python + name: python3 +--- +# Taylor’s Theorem + +Polynomials provide a framework for function approximation. It turns out, that many functions can be approximated well by so-called Taylor polynomials and that for a large class of infintely differentiable functions this approximation can be exact. We call this class of functions *analytic*. + +We state and prove the first order Taylor's Theorem with remainder in the multivariable case and state it in the second order, as is typically encountered in machine learning and optimization contexts. + +We’ll first state the **single-variable version**, then the **multivariable** version (more relevant to gradient and Hessian-based methods), and give a **proof** for the single-variable case using the **mean value form** of the remainder. + +--- +:::{prf:theorem} Taylor’s Theorem with Remainder (Single Variable) +:label: thm-taylor-single +:nonumber: + +Let $f : \mathbb{R} \to \mathbb{R}$ be $(n+1)$-times continuously differentiable on an open interval containing $a$ and $x$. + +Then: + +$$ +f(x) = f(a) + f'(a)(x-a) + \frac{f''(a)}{2!}(x-a)^2 + \dots + \frac{f^{(n)}(a)}{n!}(x-a)^n + R_{n+1}(x) +$$ + +Where the **remainder term** is given by the **Lagrange form**: + +$$ +R_{n+1}(x) = \frac{f^{(n+1)}(\xi)}{(n+1)!}(x-a)^{n+1} +\quad \text{for some } \xi \in (a, x) +$$ +::: + +Let's visualize a function $f : \mathbb{R} \to \mathbb{R}$ along with its **Taylor approximations** of increasing degree $n = 1, 2, \dots, N$ centered at a point $a$. We overlay each approximation on the graph of the true function to show how the Taylor series converges. 
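Before the plot, here is one concrete instance written out by hand (added for concreteness), using the same $f(x) = \sin(x)$ that the code below visualizes. For $a = 0$ and $n = 5$, Taylor's theorem gives

$$
\sin(x) = x - \frac{x^3}{3!} + \frac{x^5}{5!} + R_6(x),
\qquad
|R_6(x)| = \left|\frac{f^{(6)}(\xi)}{6!}\,x^6\right| \le \frac{|x|^6}{720},
$$

since every derivative of $\sin$ is bounded by $1$ in absolute value, so the approximation error shrinks rapidly near the expansion point.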
+ +```{code-cell} ipython3 +:tags: [hide-input] +import numpy as np +import matplotlib.pyplot as plt +from math import factorial +import sympy as sp + +# Define the function f symbolically +x = sp.symbols('x') +f_expr = sp.sin(x) # Change this to any (n+1)-times differentiable function +f = sp.lambdify(x, f_expr, modules='numpy') + +# Taylor expansion at point a +a = 0 +N = 5 # Highest degree of Taylor polynomial to visualize +x_vals = np.linspace(-2*np.pi, 2*np.pi, 400) + +# Generate the Taylor polynomial of degree n +def taylor_poly(expr, a, n): + return sum((expr.diff(x, k).subs(x, a) / factorial(k)) * (x - a)**k for k in range(n+1)) + +# Plotting +fig, ax = plt.subplots(figsize=(10, 6)) +plt.plot(x_vals, f(x_vals), label='True function', color='black') + +colors = plt.cm.viridis(np.linspace(0, 1, N)) +for n in range(1, N+1): + taylor_expr = taylor_poly(f_expr, a, n) + taylor_func = sp.lambdify(x, taylor_expr, modules='numpy') + plt.plot(x_vals, taylor_func(x_vals), label=f'Taylor degree {n}', color=colors[n-1]) + +plt.axvline(a, color='gray', linestyle='--', alpha=0.5) +plt.title(f'Taylor Approximations of $f(x) = \sin(x)$ at $x = {a}$') +plt.xlabel('x') +plt.ylabel('f(x)') +plt.legend() +plt.grid(True) +ax.set_ylim([-1.7,1.7]) +ax.set_xlim([-6.1,6.1]) +plt.tight_layout() +plt.show() +``` + + +:::{prf:proof} (Single Variable, Lagrange Form of the Remainder) + +We want to prove that: + +$$ +f(x) = T_n(x) + R_{n+1}(x) = \sum_{k=0}^{n} \frac{f^{(k)}(a)}{k!}(x-a)^k + \frac{f^{(n+1)}(\xi)}{(n+1)!}(x - a)^{n+1} +\quad \text{for some } \xi \in (a, x) +$$ + +--- + +### Step 1: Define the Taylor Polynomial and Remainder + +Let + +$$ +T_n(t) = \sum_{k=0}^{n} \frac{f^{(k)}(a)}{k!}(t-a)^k +\quad \text{and} \quad +R_{n+1}(x) = f(x) - T_n(x) +$$ + +We want to find an expression for $R_{n+1}(x)$. + +--- + +### Step 2: Define an Auxiliary Function + +Define a function $\phi(t)$ that measures the difference between $f(t)$ and the Taylor polynomial + an extra term we choose to vanish at $t = x$: + +$$ +\phi(t) = f(t) - T_n(t) - \frac{f^{(n+1)}(x)}{(n+1)!}(t-a)^{n+1} +$$ + +We designed this function such that: + +* $\phi(a) = f(a) - T_n(a) - 0 = 0$ (since $T_n(a) = f(a)$) +* $\phi(x) = f(x) - T_n(x) - \frac{f^{(n+1)}(x)}{(n+1)!}(x - a)^{n+1}$ + +So $\phi(x) = R_{n+1}(x) - \frac{f^{(n+1)}(x)}{(n+1)!}(x - a)^{n+1}$ + +Now, the goal is to compare this to a function that we can analyze using **Cauchy's Mean Value Theorem**. + +--- + +### Step 3: Construct a Function with a Known Root + +Let: + +$$ +h(t) := (t - a)^{n+1} +$$ + +and define: + +$$ +F(t) := \phi(t) \quad \text{and} \quad G(t) := h(t) +$$ + +Both $F(t)$ and $G(t)$ are $C^1$ functions, and they vanish at $t = a$: $F(a) = G(a) = 0$ + +We now apply **Cauchy's Mean Value Theorem** to $F$ and $G$ on the interval $[a, x]$: + +> If $F$ and $G$ are differentiable and $G'(t) \neq 0$ on $(a, x)$, then there exists $\xi \in (a, x)$ such that: +> +> $$\frac{F(x) - F(a)}{G(x) - G(a)} = \frac{F'(\xi)}{G'(\xi)}$$ + +Apply it: + +* $F(x) - F(a) = \phi(x) - 0 = R_{n+1}(x) - \frac{f^{(n+1)}(x)}{(n+1)!}(x-a)^{n+1}$ +* $G(x) - G(a) = (x-a)^{n+1} - 0$ + +So: + +$$ +\frac{R_{n+1}(x) - \frac{f^{(n+1)}(x)}{(n+1)!}(x-a)^{n+1}}{(x-a)^{n+1}} = \frac{\phi'(\xi)}{(n+1)(\xi - a)^n} +$$ + +Compute $\phi'(t)$: + +* $\phi'(t) = f'(t) - T_n'(t) - \frac{f^{(n+1)}(x)}{(n+1)!} \cdot (n+1)(t - a)^n$ + +But recall that $T_n'(t) = \sum_{k=1}^n \frac{f^{(k)}(a)}{(k-1)!}(t - a)^{k-1}$, so $\phi'(t)$ behaves like a difference between $f'(t)$ and the Taylor expansion of $f'$. 
+ +But instead of continuing with $\phi(t)$, there's a **simpler and cleaner proof** using a function designed for **Lagrange’s form**. + +--- + +### Using Cauchy's Mean Value Theorem + +Let’s define: + +$$ +h(t) := f(t) - T_n(t) +\quad \text{and} \quad +g(t) := (t - a)^{n+1} +$$ + +Note: + +* $h(a) = 0$, because $T_n(a) = f(a)$ +* $g(a) = 0$ + +Then apply Cauchy’s Mean Value Theorem to $h$ and $g$ on $[a, x]$: + +There exists $\xi \in (a, x)$ such that: + +$$ +\frac{h(x)}{g(x)} = \frac{h'(\xi)}{g'(\xi)} +$$ + +Let’s compute: + +* $g(x) = (x - a)^{n+1}$, and $g'(\xi) = (n+1)(\xi - a)^n$ +* $h(x) = f(x) - T_n(x) = R_{n+1}(x)$ +* $h'(\xi) = f^{(n+1)}(\xi) \cdot \frac{(\xi - a)^n}{n!}$ (this is a known identity) + +Then: + +$$ +\frac{R_{n+1}(x)}{(x - a)^{n+1}} = \frac{f^{(n+1)}(\xi)}{(n+1)!} +\quad \Rightarrow \quad +R_{n+1}(x) = \frac{f^{(n+1)}(\xi)}{(n+1)!}(x - a)^{n+1} +$$ + +Q.E.D. + +::: +--- + +## Taylor Expansion in Multiple Variables + +Recall, that we can create a locally linear approximation to a function at a point $\mathbf{x}_0 \in \mathbb{R}^d $ using the gradient at $\nabla f(\mathbf{x}_0)$. + +$$ f(\mathbf{x}) \approx f(\mathbf{x}_0) + \nabla f(\mathbf{x}_0)^\top (\mathbf{x} - \mathbf{x}_0) . $$ + +This affine approximation is also known as the **first-order Taylor approximation**. +It agrees with the original function in value and gradient at the point $ \mathbf{x}_0 $. + +If you explicitly include the second-order term evaluated at $\mathbf{x}_0$, then you’re writing the **second-order Taylor expansion**: + +$$ +f(\mathbf{x}) \approx f(\mathbf{x}_0) + \nabla f(\mathbf{x}_0)^T (\mathbf{x} - \mathbf{x}_0) + \frac{1}{2} (\mathbf{x} - \mathbf{x}_0)^T \nabla^2 f(\mathbf{x}_0) (\mathbf{x} - \mathbf{x}_0) +$$ + +This is a **local quadratic approximation** to the function. It agrees with the original function in value, gradient, and Hessian at the point $ \mathbf{x}_0 $. +The Hessian appears naturally in the **second-order Taylor approximation** of a function around a point $ \mathbf{x}_0 \in \mathbb{R}^d $. + + +- The **gradient** term captures the linear behavior (slope) of the function near $ \mathbf{x}_0 $. +- The **Hessian** term captures the curvature — how the gradient changes in different directions. +- In this case, the remainder (if stated) would involve third derivatives, and the approximation is called **second-order** because you're explicitly using second-order information in the main approximation. + +--- + +We illustrate both the first-order and the second-order Taylor expansion using the following function. + +$$ f(x, y) = \log(1 + x^2 + y^2) $$ + +We compute the first-order and second-order Taylor approximations at the point $ (x_0, y_0) = (0.3, 0.3) $. + +The true function and the linear approximation match in value and gradient at the point $ (x_0, y_0)$ but differ elsewhere. Similarly, the quadratic approximation match in value, gradient, and Hessian at this point but differ elsewhere. 
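For reference, the closed-form gradient and Hessian that the code below evaluates at $(x_0, y_0)$ are (they follow by direct differentiation and can be checked against the implementation):

$$
\nabla f(x, y) = \frac{2}{1 + x^2 + y^2}
\begin{pmatrix} x \\ y \end{pmatrix},
\qquad
\nabla^2 f(x, y) = \frac{2}{1 + x^2 + y^2}\,\mathbf{I}
- \frac{4}{\bigl(1 + x^2 + y^2\bigr)^2}
\begin{pmatrix} x \\ y \end{pmatrix}
\begin{pmatrix} x & y \end{pmatrix}.
$$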
+ +```{code-cell} ipython3 +:tags: [hide-input] +import numpy as np +import matplotlib.pyplot as plt +from mpl_toolkits.mplot3d import Axes3D + +# Define the function and its gradient and Hessian +f = lambda x, y: np.log(1 + x**2 + y**2) +x0, y0 = 0.3, 0.3 + +# Compute value, gradient, and Hessian at (x0, y0) +r2 = x0**2 + y0**2 +f0 = np.log(1 + r2) +grad = np.array([2*x0, 2*y0]) / (1 + r2) +H = (2 / (1 + r2)) * np.eye(2) - (4 * np.outer([x0, y0], [x0, y0])) / (1 + r2)**2 + +# Taylor expansion up to second order +def f_taylor_first_order(x, y): + dx = x - x0 + dy = y - y0 + delta = np.array([dx, dy]) + return f0 + (grad @ delta).sum() + +# Taylor expansion up to second order +def f_taylor_second_order(x, y): + dx = x - x0 + dy = y - y0 + delta = np.array([dx, dy]) + return f0 + (grad @ delta).sum() + 0.5 * (delta @ H @ delta).sum() + +# Create grid for plotting +x_vals = np.linspace(x0-1, x0+1, 100) +y_vals = np.linspace(y0-1, y0+1, 100) +X, Y = np.meshgrid(x_vals, y_vals) +Z_true = f(X,Y) +Z_first = np.zeros(X.shape) +Z_second = np.zeros(X.shape) + +for i in range(X.shape[0]): + for j in range(X.shape[1]): + Z_first[i,j] = f_taylor_first_order(X[i,j],Y[i,j]) + Z_second[i,j] = f_taylor_second_order(X[i,j],Y[i,j]) + +# Plot both Taylor approximations +fig = plt.figure(figsize=(14, 6)) +ax1 = fig.add_subplot(1, 2, 1, projection='3d') +ax2 = fig.add_subplot(1, 2, 2, projection='3d') + +true_surface1 = ax1.plot_surface(X, Y, Z_true, cmap='viridis', alpha=0.6) +approx_surface1 = ax1.plot_surface(X, Y, Z_first, cmap='coolwarm', alpha=0.7) +ax1.scatter(x0, y0, f0, color='red', s=50, label=r'$\mathbf{x}_0$') +ax1.set_title("First-Order Taylor Approximation") +ax1.set_xlabel("x") +ax1.set_ylabel("y") +ax1.legend() +ax1.set_zlim([-0.5,2]) + +true_surface2 = ax2.plot_surface(X, Y, Z_true, cmap='viridis', alpha=0.6) +approx_surface2 = ax2.plot_surface(X, Y, Z_second, cmap='coolwarm', alpha=0.7) +ax2.scatter(x0, y0, f0, color='red', s=50, label=r'$\mathbf{x}_0$') +ax2.set_title("Second-Order Taylor Approximation") +ax2.set_xlabel("x") +ax2.set_ylabel("y") +ax2.legend() +ax2.set_zlim([-0.5,2]) + +plt.tight_layout() +plt.show() +``` + +This visualization shows how the first-order (left) and second-order (right) Taylor expansions approximate the original function locally around the point $ (0.3,0.3) $, but deviates farther away. Both approximations are shown in blue to red and the original function in yellow to green colors. + + +## Taylor's Theorem in Multiple Variables + +:::{prf:theorem} Taylor’s Theorem in Multiple Variables (Second-Order Remainder) +:label: thm-taylor-multiple-first +:nonumber: + +Let $f: \mathbb{R}^d \to \mathbb{R}$ be a function that is **three times continuously differentiable** on an open set $U \subset \mathbb{R}^d$. Let $\mathbf{x}_0 \in U$, and let $\mathbf{x} \in U$ be such that the **line segment** between $\mathbf{x}_0$ and $\mathbf{x}$ lies entirely in $U$. Then: + +$$ +f(\mathbf{x}) = f(\mathbf{x}_0) + \nabla f(\mathbf{x}_0)^T (\mathbf{x} - \mathbf{x}_0) + \frac{1}{2} (\mathbf{x} - \mathbf{x}_0)^T \nabla^2 f(\boldsymbol{\xi}) (\mathbf{x} - \mathbf{x}_0) +$$ + +for some point $\boldsymbol{\xi}$ on the open segment between $\mathbf{x}_0$ and $\mathbf{x}$. +::: + +This is the **first-order Taylor approximation** with **remainder in integral form or mean value form**. 
+ +We observe: + +* We’re approximating $f(\mathbf{x})$ using only the **first-order derivative**, but the **error** (or remainder) is controlled by the **second-order derivative**, specifically involving the Hessian at some intermediate point $\boldsymbol{\xi}$. +* Therefore, it's a **first-order approximation with a second-order remainder**. + +:::{prf:theorem} Taylor’s Theorem in Multiple Variables (Third-Order Integral Remainder) +:label: thm-taylor-multiple-second +:nonumber: + +Let $f: \mathbb{R}^d \to \mathbb{R}$ be **four times continuously differentiable** on an open set $U \subset \mathbb{R}^d$, and let $\mathbf{x}_0, \mathbf{x} \in U$ such that the line segment between them lies entirely in $U$. Then: + +$$ +f(\mathbf{x}) = f(\mathbf{x}_0) ++ \nabla f(\mathbf{x}_0)^T (\mathbf{x} - \mathbf{x}_0) ++ \frac{1}{2} (\mathbf{x} - \mathbf{x}_0)^T \nabla^2 f(\mathbf{x}_0) (\mathbf{x} - \mathbf{x}_0) ++ \frac{1}{6} \sum_{i,j,k=1}^d \frac{\partial^3 f}{\partial x_i \partial x_j \partial x_k}(\boldsymbol{\xi}) (x_i - x_{0,i})(x_j - x_{0,j})(x_k - x_{0,k}) +$$ + +for some $\boldsymbol{\xi}$ on the segment between $\mathbf{x}_0$ and $\mathbf{x}$. +::: + +**Notes on Higher-Order Remainders** + +* The **third-order term** involves a **third-order tensor** (all third partial derivatives), and the remainder is often written using **multi-index notation** or **tensor contraction**. +* For applications in optimization and machine learning, most practical Taylor approximations stop at **second-order**, because third- and higher-order terms are expensive to compute and rarely needed unless using higher-order optimization methods. + + +While in principle, one can state and prove Taylor's theorem for a remainder of arbitrary order, ee prove only the version of the theorem for the first order Taylor expansion with second-order remainder. + + +:::{prf:proof} Proof Sketch (Mean Value Form of the Remainder) + +Let’s define the path: + +$$ +\gamma(t) = \mathbf{x}_0 + t(\mathbf{x} - \mathbf{x}_0), \quad t \in [0,1] +$$ + +This is a straight-line path from $\mathbf{x}_0$ to $\mathbf{x}$. + +Define the composite function $g(t) = f(\gamma(t))$. Then $g : [0,1] \to \mathbb{R}$ is a function of one variable. + +Using the **chain rule**, we have: + +$$ +g'(t) = \nabla f(\gamma(t))^T (\mathbf{x} - \mathbf{x}_0) +$$ + +and + +$$ +g''(t) = (\mathbf{x} - \mathbf{x}_0)^T \nabla^2 f(\gamma(t)) (\mathbf{x} - \mathbf{x}_0) +$$ + +Now apply the **Taylor expansion of $g(t)$ around $t = 0$** with **Lagrange remainder** (from single-variable calculus): + +$$ +g(1) = g(0) + g'(0) + \frac{1}{2} g''(\tau) \quad \text{for some } \tau \in (0,1) +$$ + +Substitute back: + +* $g(0) = f(\mathbf{x}_0)$ +* $g'(0) = \nabla f(\mathbf{x}_0)^T (\mathbf{x} - \mathbf{x}_0)$ +* $g''(\tau) = (\mathbf{x} - \mathbf{x}_0)^T \nabla^2 f(\gamma(\tau)) (\mathbf{x} - \mathbf{x}_0)$ + +So: + +$$ +f(\mathbf{x}) = f(\mathbf{x}_0) + \nabla f(\mathbf{x}_0)^T (\mathbf{x} - \mathbf{x}_0) + \frac{1}{2} (\mathbf{x} - \mathbf{x}_0)^T \nabla^2 f(\boldsymbol{\xi}) (\mathbf{x} - \mathbf{x}_0) +$$ + +where $\boldsymbol{\xi} = \gamma(\tau)$ lies on the open segment between $\mathbf{x}_0$ and $\mathbf{x}$. + +Q.E.D. 
+::: + + + + + + +--- + +## 🔍 Summary + +| Expansion Type | Uses | Remainder Involves | +| -------------- | ------------------------------------------- | ---------------------------- | +| First-order | $f, \nabla f$ at $\mathbf{x}_0$ | Second derivatives (Hessian) | +| Second-order | $f, \nabla f, \nabla^2 f$ at $\mathbf{x}_0$ | Third derivatives | + +## Outlook +A nice property of second-order Taylor expansion is that the resulting function is a quadratic and that we know how to analytically solve quadratic optimization problems. This observation is the key idea behind Newton's method. Thus, similar to how linear approximation using the gradient (a.k.a. first-order Taylor expansion) was the basis for first-order optimization, the second order Taylor expansion will be the basis for second-order optimization methods. However, before we delve into second-order optimization, we have to study the properties of matrices such as the Hessian at a deeper level. \ No newline at end of file From adc99030cc253c7a7b6be90b3a717147d70773a3 Mon Sep 17 00:00:00 2001 From: clippert Date: Tue, 6 May 2025 23:05:03 +0200 Subject: [PATCH 08/43] added taylors theorem --- book/chapter_calculus/taylors_theorem.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/book/chapter_calculus/taylors_theorem.md b/book/chapter_calculus/taylors_theorem.md index c31e0cf..38505bd 100644 --- a/book/chapter_calculus/taylors_theorem.md +++ b/book/chapter_calculus/taylors_theorem.md @@ -429,4 +429,4 @@ Q.E.D. | Second-order | $f, \nabla f, \nabla^2 f$ at $\mathbf{x}_0$ | Third derivatives | ## Outlook -A nice property of second-order Taylor expansion is that the resulting function is a quadratic and that we know how to analytically solve quadratic optimization problems. This observation is the key idea behind Newton's method. Thus, similar to how linear approximation using the gradient (a.k.a. first-order Taylor expansion) was the basis for first-order optimization, the second order Taylor expansion will be the basis for second-order optimization methods. However, before we delve into second-order optimization, we have to study the properties of matrices such as the Hessian at a deeper level. \ No newline at end of file +A nice property of second-order Taylor expansion is that the resulting function is a quadratic and that we know how to analytically solve quadratic optimization problems. This observation is the key idea behind Newton's method. Thus, similarly to how linear approximation using the gradient (a.k.a. first-order Taylor expansion) was the basis for first-order optimization, the second order Taylor expansion will be the basis for second-order optimization methods. However, before we delve into second-order optimization, we have to study the properties of matrices such as the Hessian at a deeper level. 
\ No newline at end of file From c94175212b5bbb102f57c0bf5b0981d09e61aff6 Mon Sep 17 00:00:00 2001 From: clippert Date: Wed, 7 May 2025 06:07:50 +0200 Subject: [PATCH 09/43] Taylors theorem --- book/chapter_calculus/taylors_theorem.md | 59 +++++++++++++++++++++++- 1 file changed, 58 insertions(+), 1 deletion(-) diff --git a/book/chapter_calculus/taylors_theorem.md b/book/chapter_calculus/taylors_theorem.md index 38505bd..c66b9cb 100644 --- a/book/chapter_calculus/taylors_theorem.md +++ b/book/chapter_calculus/taylors_theorem.md @@ -84,6 +84,35 @@ plt.tight_layout() plt.show() ``` +## Big-O Form of Taylor's Remainder (Single Variable) + +:::{prf:corollary} +:label: thm-taylor-single-BigO +:nonumber: + +Let $f: \mathbb{R} \to \mathbb{R}$ be $(n+1)$-times continuously differentiable in a neighborhood of $a$. + +Then: + +$$ +f(x) = \sum_{k=0}^n \frac{f^{(k)}(a)}{k!}(x - a)^k + \mathcal{O}((x - a)^{n+1}) +\quad \text{as } x \to a +$$ + +::: + +This means: + +> There exists a constant $C$ and a neighborhood around $a$ such that + +$$ +|R_{n+1}(x)| \leq C |x - a|^{n+1} +\quad \text{as } x \to a +$$ + +The notation tells us that **the remainder vanishes at the same rate as $(x - a)^{n+1}$** as $x \to a$. + +Let's prove Taylor's Theorem with the exact expression for the remainder. :::{prf:proof} (Single Variable, Lagrange Form of the Remainder) @@ -358,14 +387,42 @@ $$ for some $\boldsymbol{\xi}$ on the segment between $\mathbf{x}_0$ and $\mathbf{x}$. ::: + **Notes on Higher-Order Remainders** * The **third-order term** involves a **third-order tensor** (all third partial derivatives), and the remainder is often written using **multi-index notation** or **tensor contraction**. * For applications in optimization and machine learning, most practical Taylor approximations stop at **second-order**, because third- and higher-order terms are expensive to compute and rarely needed unless using higher-order optimization methods. -While in principle, one can state and prove Taylor's theorem for a remainder of arbitrary order, ee prove only the version of the theorem for the first order Taylor expansion with second-order remainder. +:::{prf:theorem} Big-O Remainder in Multivariable Case +:label: thm-taylor-multiple-BigO +:nonumber: + +For $f: \mathbb{R}^d \to \mathbb{R}$, we can write: + +$$ +f(\mathbf{x}) = \sum_{|\alpha| \leq n} \frac{D^\alpha f(\mathbf{x}_0)}{\alpha!} (\mathbf{x} - \mathbf{x}_0)^\alpha + \mathcal{O}(\|\mathbf{x} - \mathbf{x}_0\|^{n+1}) +\quad \text{as } \mathbf{x} \to \mathbf{x}_0 +$$ + +Where: + +* $\alpha \in \mathbb{N}^d$ is a multi-index, +* $D^\alpha f$ is the partial derivative corresponding to $\alpha$, +* $(\mathbf{x} - \mathbf{x}_0)^\alpha = \prod_i (x_i - x_{0,i})^{\alpha_i}$, +* And $|\alpha| = \sum_i \alpha_i$. +::: + +* The **exact form** (Lagrange or integral remainder) gives precise values, but is often impractical. +* The **Big-O remainder** focuses on **how the error behaves**, not what it is exactly. +* This is especially useful in: + + * Error estimates + * Convergence analysis + * Algorithm design (e.g. gradient descent, Newton’s method) + +While we can state and prove Taylor's theorem for a remainder of arbitrary order, we prove only the version of the theorem for the first order Taylor expansion with second-order remainder. 
:::{prf:proof} Proof Sketch (Mean Value Form of the Remainder) From 88396074de4a7ecc5b9012d5a18072b57524d015 Mon Sep 17 00:00:00 2001 From: clippert Date: Wed, 7 May 2025 06:10:20 +0200 Subject: [PATCH 10/43] Taylors theorem --- book/chapter_calculus/taylors_theorem.md | 7 ++----- 1 file changed, 2 insertions(+), 5 deletions(-) diff --git a/book/chapter_calculus/taylors_theorem.md b/book/chapter_calculus/taylors_theorem.md index c66b9cb..8d8743c 100644 --- a/book/chapter_calculus/taylors_theorem.md +++ b/book/chapter_calculus/taylors_theorem.md @@ -104,11 +104,8 @@ $$ This means: > There exists a constant $C$ and a neighborhood around $a$ such that - -$$ -|R_{n+1}(x)| \leq C |x - a|^{n+1} -\quad \text{as } x \to a -$$ +> +> $$ |R_{n+1}(x)| \leq C |x - a|^{n+1} \quad \text{as } x \to a $$ The notation tells us that **the remainder vanishes at the same rate as $(x - a)^{n+1}$** as $x \to a$. From ba51855a62109af927ca54aed3c078b2ac9307c5 Mon Sep 17 00:00:00 2001 From: clippert Date: Wed, 7 May 2025 06:37:57 +0200 Subject: [PATCH 11/43] se ction on analytic functions in taylor --- book/chapter_calculus/taylors_theorem.md | 113 +++++++++++++++++++++++ 1 file changed, 113 insertions(+) diff --git a/book/chapter_calculus/taylors_theorem.md b/book/chapter_calculus/taylors_theorem.md index 8d8743c..702a978 100644 --- a/book/chapter_calculus/taylors_theorem.md +++ b/book/chapter_calculus/taylors_theorem.md @@ -238,8 +238,121 @@ $$ Q.E.D. ::: + +### Analytic Functions + +**Analytic functions** are intimately related to Taylor series and to the **remainder** behavior. + +### 🔍 What Is an Analytic Function? + +> A function $f : \mathbb{R} \to \mathbb{R}$ (or $f : \mathbb{R}^d \to \mathbb{R}$) is called **analytic at a point** $a$ if: +> +> The Taylor series of $f$ at $a$ **converges to** the function in a neighborhood of $a$: +> +> $$f(x) = \sum_{k=0}^{\infty} \frac{f^{(k)}(a)}{k!}(x - a)^k\quad \text{for all } x \text{ near } a$$ + +That is: + +* Not only does the Taylor series **exist** (i.e., $f$ is infinitely differentiable), +* But it **converges to the true function** (i.e., the remainder $R_n(x) \to 0$ as $n \to \infty$). + +### 🚫 Not All Smooth Functions Are Analytic + +An important subtlety: + +> There exist functions that are **infinitely differentiable** (smooth), but **not analytic**. + +For example, the function + +$$ +f(x) = \begin{cases} +e^{-1/x^2} & \text{if } x \neq 0 \\\\ +0 & \text{if } x = 0 +\end{cases} +$$ + +is **$C^\infty$** everywhere, but its **Taylor series at 0 is identically zero** (all derivatives vanish at 0) — even though the function is not identically zero. 
+ +```{code-cell} ipython3 +:tags: [hide-input] +import numpy as np +import matplotlib.pyplot as plt +from math import factorial + +# Define the smooth but non-analytic function +def f(x): + out = np.zeros_like(x) + nonzero = x != 0 + out[nonzero] = np.exp(-1 / x[nonzero]**2) + return out + +# Compute Taylor polynomials at x=0 (they are all zero) +def taylor_approx(x, n): + return np.zeros_like(x) # All derivatives at 0 are zero + +# Set up x-values +x_vals = np.linspace(-1, 1, 400) +f_vals = f(x_vals) + +# Plot the true function and several Taylor approximations +# Create the plot +fig, ax = plt.subplots(figsize=(10, 6)) +ax.plot(x_vals, f_vals, label='$f(x) = e^{-1/x^2}$ (extended by 0 at 0)', color='black') + +colors = plt.cm.viridis(np.linspace(0, 1, 5)) +for n, c in zip([1, 3, 5, 10, 20], colors): + plt.plot(x_vals, taylor_approx(x_vals, n), linestyle='--', color=c, label=f'Taylor degree {n}') + +plt.axvline(0, color='gray', linestyle='--', alpha=0.5) +plt.title('Smooth but Non-Analytic Function at $x = 0$') +plt.xlabel('x') +plt.ylabel('f(x)') +ax.set_ylim([-0.03, 0.2]) +ax.set_xlim([-0.75, 0.75]) +plt.legend() +plt.grid(True) +plt.tight_layout() +plt.show() +``` + +This function is **infinitely differentiable** (smooth) at $x = 0$, but **not analytic** there: all of its derivatives at 0 vanish, so every Taylor polynomial is the zero function. Yet the function is clearly nonzero for any $x \neq 0$. + +* The true function $f(x)$ (black curve), sharply rising near zero. +* All Taylor polynomials (dashed lines) are identically zero and fail to approximate the function anywhere except exactly at $x = 0$. + +So: +✅ smooth ≠ analytic +✅ analytic ⇒ smooth +❌ smooth ⇒ analytic + + +## 🔄 How This Relates to the Big-O Remainder + +* The **Big‑O bound** tells you that the remainder **goes to zero like $(x - a)^{n+1}$** near $a$, *for fixed $n$*. +* But to be **analytic**, you need: + + $$ + \lim_{n \to \infty} R_n(x) = 0 + \quad \text{for all } x \text{ in a neighborhood of } a + $$ + + i.e., convergence of the full infinite series, not just the rate of vanishing of each finite approximation. + +So, **Big-O bounds are necessary** (they control approximation error), but **not sufficient** for analyticity. You need the entire remainder sequence $R_n(x) \to 0$ for analytic behavior. + --- +## 🧠 Summary Table + +| Property | What It Implies | +| ----------------- | ------------------------------------------- | +| Smooth $C^\infty$ | All derivatives exist and are continuous | +| Analytic | Taylor series converges to function locally | +| Big-O remainder | Controls approximation error for fixed $n$ | +| $R_n(x) \to 0$ | Required for analyticity | + + + ## Taylor Expansion in Multiple Variables Recall, that we can create a locally linear approximation to a function at a point $\mathbf{x}_0 \in \mathbb{R}^d $ using the gradient at $\nabla f(\mathbf{x}_0)$. 
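As a small numerical companion to this statement (a sketch; the quadratic function and the point $\mathbf{x}_0$ below are arbitrary choices of ours), the error of the linear approximation $f(\mathbf{x}_0) + \nabla f(\mathbf{x}_0)^\top (\mathbf{x} - \mathbf{x}_0)$ shrinks quadratically with the step length, consistent with a second-order remainder:

```{code-cell} ipython3
import numpy as np

# Sketch: first-order (locally linear) Taylor approximation in two variables.
# f(x) = x1^2 + 3*x1*x2,  grad f(x) = (2*x1 + 3*x2, 3*x1).
def f(x):
    return x[0]**2 + 3 * x[0] * x[1]

def grad_f(x):
    return np.array([2 * x[0] + 3 * x[1], 3 * x[0]])

x0 = np.array([1.0, 2.0])
direction = np.array([0.7, -0.4])

for t in [1.0, 0.1, 0.01]:
    h = t * direction
    linear = f(x0) + grad_f(x0) @ h            # locally linear approximation
    error = abs(f(x0 + h) - linear)
    print(f"step length = {np.linalg.norm(h):.4f}   approximation error = {error:.2e}")
```

Shrinking the step by a factor of ten reduces the error by roughly a factor of one hundred, as expected for an $\mathcal{O}(\|\mathbf{x} - \mathbf{x}_0\|^{2})$ remainder.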
From ee719475aa676b566a8a90780d2d4495cb3252e3 Mon Sep 17 00:00:00 2001 From: clippert Date: Wed, 7 May 2025 15:29:22 +0200 Subject: [PATCH 12/43] fix typo in Clairuts theorem --- book/chapter_calculus/hessian.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/book/chapter_calculus/hessian.md b/book/chapter_calculus/hessian.md index 5f20568..85ced0c 100644 --- a/book/chapter_calculus/hessian.md +++ b/book/chapter_calculus/hessian.md @@ -29,13 +29,13 @@ $$\nabla^2 f(\mathbf{x}) = \begin{bmatrix} :label: thm-Clairaut :nonumber: -Let $f: \mathbb{R}^d \to \mathbb{R}$ be a function such that both mixed partial derivatives $\frac{\partial^2 f}{\partial x_i \partial x_j}$ and $\frac{\partial^2 f}{\partial x_i \partial x_j}$ exist and are **continuous** on an open set containing a point $\mathbf{x}_0$ +Let $f: \mathbb{R}^d \to \mathbb{R}$ be a function such that both mixed partial derivatives $\frac{\partial^2 f}{\partial x_i \partial x_j}$ and $\frac{\partial^2 f}{\partial x_j \partial x_i}$ exist and are **continuous** on an open set containing a point $\mathbf{x}_0$ Then: $$ \boxed{ -\frac{\partial^2 f}{\partial x_i \partial x_j}(\mathbf{x}_0) = \frac{\partial^2 f}{\partial y \partial x}(\mathbf{x}_0) +\frac{\partial^2 f}{\partial x_i \partial x_j}(\mathbf{x}_0) = \frac{\partial^2 f}{\partial x_j \partial x_i}(\mathbf{x}_0) } $$ From d7208b8d0c50887b2e21d359f41cea5c1cace73f Mon Sep 17 00:00:00 2001 From: clippert Date: Wed, 7 May 2025 17:05:26 +0200 Subject: [PATCH 13/43] increased degree of taylor approximation and moved approximation point a in plot to be more illustrative --- book/chapter_calculus/taylors_theorem.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/book/chapter_calculus/taylors_theorem.md b/book/chapter_calculus/taylors_theorem.md index 702a978..4ae571d 100644 --- a/book/chapter_calculus/taylors_theorem.md +++ b/book/chapter_calculus/taylors_theorem.md @@ -54,9 +54,9 @@ f_expr = sp.sin(x) # Change this to any (n+1)-times differentiable function f = sp.lambdify(x, f_expr, modules='numpy') # Taylor expansion at point a -a = 0 -N = 5 # Highest degree of Taylor polynomial to visualize -x_vals = np.linspace(-2*np.pi, 2*np.pi, 400) +a = 1 +N = 16 # Highest degree of Taylor polynomial to visualize +x_vals = np.linspace(-2*np.pi+a, 2*np.pi+a, 400) # Generate the Taylor polynomial of degree n def taylor_poly(expr, a, n): @@ -64,7 +64,7 @@ def taylor_poly(expr, a, n): # Plotting fig, ax = plt.subplots(figsize=(10, 6)) -plt.plot(x_vals, f(x_vals), label='True function', color='black') +plt.plot(x_vals, f(x_vals), label='True function', color='black', linewidth=5) colors = plt.cm.viridis(np.linspace(0, 1, N)) for n in range(1, N+1): @@ -78,8 +78,8 @@ plt.xlabel('x') plt.ylabel('f(x)') plt.legend() plt.grid(True) -ax.set_ylim([-1.7,1.7]) -ax.set_xlim([-6.1,6.1]) +ax.set_ylim([-2.7,2.7]) +ax.set_xlim([-2*np.pi+a, 2*np.pi+a]) plt.tight_layout() plt.show() ``` From 11ad6e91d157b26160c456b093ae9f73a44f316d Mon Sep 17 00:00:00 2001 From: clippert Date: Wed, 7 May 2025 17:06:50 +0200 Subject: [PATCH 14/43] increased degree of taylor approximation and moved approximation point a in plot to be more illustrative --- book/chapter_calculus/taylors_theorem.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/book/chapter_calculus/taylors_theorem.md b/book/chapter_calculus/taylors_theorem.md index 4ae571d..aea2f45 100644 --- a/book/chapter_calculus/taylors_theorem.md +++ b/book/chapter_calculus/taylors_theorem.md @@ -73,7 +73,7 @@ for 
n in range(1, N+1): plt.plot(x_vals, taylor_func(x_vals), label=f'Taylor degree {n}', color=colors[n-1]) plt.axvline(a, color='gray', linestyle='--', alpha=0.5) -plt.title(f'Taylor Approximations of $f(x) = \sin(x)$ at $x = {a}$') +plt.title(r'Taylor Approximations of $f(x) = \sin(x)$ at $x = {a}$') plt.xlabel('x') plt.ylabel('f(x)') plt.legend() From cb1b6fb40753d10dd703e442b8b71eedfc4fe036 Mon Sep 17 00:00:00 2001 From: clippert Date: Thu, 8 May 2025 05:04:25 +0200 Subject: [PATCH 15/43] increased degree of taylor approximation and moved approximation point a in plot to be more illustrative --- book/chapter_calculus/taylors_theorem.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/book/chapter_calculus/taylors_theorem.md b/book/chapter_calculus/taylors_theorem.md index aea2f45..c767c4a 100644 --- a/book/chapter_calculus/taylors_theorem.md +++ b/book/chapter_calculus/taylors_theorem.md @@ -64,7 +64,7 @@ def taylor_poly(expr, a, n): # Plotting fig, ax = plt.subplots(figsize=(10, 6)) -plt.plot(x_vals, f(x_vals), label='True function', color='black', linewidth=5) +plt.plot(x_vals, f(x_vals), label='True function', color='black', linewidth=6) colors = plt.cm.viridis(np.linspace(0, 1, N)) for n in range(1, N+1): From 5978199c197306577b6884c5580ef961ecdfee9d Mon Sep 17 00:00:00 2001 From: clippert Date: Thu, 8 May 2025 05:15:53 +0200 Subject: [PATCH 16/43] created stubs for next chapter decompositions --- book/_toc.yml | 26 +++++++++---------- .../chapter_decompositions/big_picture.md | 0 .../chapter_decompositions/eigenvectors.md | 0 .../chapter_decompositions/matrix_norms.md | 0 .../orthogonal_matrices.md | 0 .../overview_decompositions.md | 0 .../chapter_decompositions/psd_matrices.md | 0 .../chapter_decompositions/pseudoinverse.md | 0 .../chapter_decompositions/svd.md | 0 .../symmetric_matrices.md | 0 .../trace_determinant.md | 0 11 files changed, 13 insertions(+), 13 deletions(-) rename {drafts => book}/chapter_decompositions/big_picture.md (100%) rename {drafts => book}/chapter_decompositions/eigenvectors.md (100%) rename {drafts => book}/chapter_decompositions/matrix_norms.md (100%) rename {drafts => book}/chapter_decompositions/orthogonal_matrices.md (100%) rename {drafts => book}/chapter_decompositions/overview_decompositions.md (100%) rename {drafts => book}/chapter_decompositions/psd_matrices.md (100%) rename {drafts => book}/chapter_decompositions/pseudoinverse.md (100%) rename {drafts => book}/chapter_decompositions/svd.md (100%) rename {drafts => book}/chapter_decompositions/symmetric_matrices.md (100%) rename {drafts => book}/chapter_decompositions/trace_determinant.md (100%) diff --git a/book/_toc.yml b/book/_toc.yml index 6148730..f61881c 100644 --- a/book/_toc.yml +++ b/book/_toc.yml @@ -75,19 +75,19 @@ parts: # - file: chapter_calculus/irls # title: Iteratively Re-Weighted Least Squares # study the properties of matrices -# - file: chapter_decompositions/overview_decompositions # chapter_linear_algebra/linear_algebra -# title: Matrix Analysis and Decompositions -# sections: -# - file: chapter_decompositions/eigenvectors -# - file: chapter_decompositions/trace_determinant -# - file: chapter_decompositions/orthogonal_matrices -# - file: chapter_decompositions/symmetric_matrices -# - file: chapter_decompositions/psd_matrices -# - file: chapter_decompositions/svd -# - file: chapter_decompositions/big_picture -# - file: chapter_decompositions/pseudoinverse -# - file: chapter_decompositions/low_rank_approximation -# - file: chapter_decompositions/matrix_norms + - 
file: chapter_decompositions/overview_decompositions + title: Matrix Analysis and Decompositions + sections: + - file: chapter_decompositions/eigenvectors + - file: chapter_decompositions/trace_determinant + - file: chapter_decompositions/orthogonal_matrices + - file: chapter_decompositions/symmetric_matrices + - file: chapter_decompositions/psd_matrices + - file: chapter_decompositions/svd + - file: chapter_decompositions/big_picture + - file: chapter_decompositions/pseudoinverse + - file: chapter_decompositions/low_rank_approximation + - file: chapter_decompositions/matrix_norms # - file: chapter_convexity/overview_convexity # title: Convexity # sections: diff --git a/drafts/chapter_decompositions/big_picture.md b/book/chapter_decompositions/big_picture.md similarity index 100% rename from drafts/chapter_decompositions/big_picture.md rename to book/chapter_decompositions/big_picture.md diff --git a/drafts/chapter_decompositions/eigenvectors.md b/book/chapter_decompositions/eigenvectors.md similarity index 100% rename from drafts/chapter_decompositions/eigenvectors.md rename to book/chapter_decompositions/eigenvectors.md diff --git a/drafts/chapter_decompositions/matrix_norms.md b/book/chapter_decompositions/matrix_norms.md similarity index 100% rename from drafts/chapter_decompositions/matrix_norms.md rename to book/chapter_decompositions/matrix_norms.md diff --git a/drafts/chapter_decompositions/orthogonal_matrices.md b/book/chapter_decompositions/orthogonal_matrices.md similarity index 100% rename from drafts/chapter_decompositions/orthogonal_matrices.md rename to book/chapter_decompositions/orthogonal_matrices.md diff --git a/drafts/chapter_decompositions/overview_decompositions.md b/book/chapter_decompositions/overview_decompositions.md similarity index 100% rename from drafts/chapter_decompositions/overview_decompositions.md rename to book/chapter_decompositions/overview_decompositions.md diff --git a/drafts/chapter_decompositions/psd_matrices.md b/book/chapter_decompositions/psd_matrices.md similarity index 100% rename from drafts/chapter_decompositions/psd_matrices.md rename to book/chapter_decompositions/psd_matrices.md diff --git a/drafts/chapter_decompositions/pseudoinverse.md b/book/chapter_decompositions/pseudoinverse.md similarity index 100% rename from drafts/chapter_decompositions/pseudoinverse.md rename to book/chapter_decompositions/pseudoinverse.md diff --git a/drafts/chapter_decompositions/svd.md b/book/chapter_decompositions/svd.md similarity index 100% rename from drafts/chapter_decompositions/svd.md rename to book/chapter_decompositions/svd.md diff --git a/drafts/chapter_decompositions/symmetric_matrices.md b/book/chapter_decompositions/symmetric_matrices.md similarity index 100% rename from drafts/chapter_decompositions/symmetric_matrices.md rename to book/chapter_decompositions/symmetric_matrices.md diff --git a/drafts/chapter_decompositions/trace_determinant.md b/book/chapter_decompositions/trace_determinant.md similarity index 100% rename from drafts/chapter_decompositions/trace_determinant.md rename to book/chapter_decompositions/trace_determinant.md From 1595f12735fef71565fa711fcbf7b99512a3e5a7 Mon Sep 17 00:00:00 2001 From: clippert Date: Mon, 12 May 2025 09:20:47 +0200 Subject: [PATCH 17/43] monday lecture done. 
PLU section slightly messy still --- book/_toc.yml | 25 +- book/chapter_decompositions/big_picture.md | 31 +- book/chapter_decompositions/determinant.md | 97 ++ book/chapter_decompositions/eigenvectors.md | 741 +++++------- book/chapter_decompositions/matrix_norms.md | 57 +- book/chapter_decompositions/matrix_rank.md | 115 ++ .../orthogonal_matrices.md | 353 ++---- book/chapter_decompositions/pseudoinverse.md | 46 +- .../chapter_decompositions/row_equivalence.md | 1047 +++++++++++++++++ .../chapter_decompositions/square_matrices.md | 82 ++ book/chapter_decompositions/svd.md | 13 - .../symmetric_matrices.md | 621 +++++++--- book/chapter_decompositions/trace.md | 108 ++ .../trace_determinant.md | 51 - 14 files changed, 2349 insertions(+), 1038 deletions(-) create mode 100644 book/chapter_decompositions/determinant.md create mode 100644 book/chapter_decompositions/matrix_rank.md create mode 100644 book/chapter_decompositions/row_equivalence.md create mode 100644 book/chapter_decompositions/square_matrices.md create mode 100644 book/chapter_decompositions/trace.md delete mode 100644 book/chapter_decompositions/trace_determinant.md diff --git a/book/_toc.yml b/book/_toc.yml index f61881c..5bc2b09 100644 --- a/book/_toc.yml +++ b/book/_toc.yml @@ -76,18 +76,21 @@ parts: # title: Iteratively Re-Weighted Least Squares # study the properties of matrices - file: chapter_decompositions/overview_decompositions - title: Matrix Analysis and Decompositions + title: Matrix Analysis sections: - - file: chapter_decompositions/eigenvectors - - file: chapter_decompositions/trace_determinant - - file: chapter_decompositions/orthogonal_matrices - - file: chapter_decompositions/symmetric_matrices - - file: chapter_decompositions/psd_matrices - - file: chapter_decompositions/svd - - file: chapter_decompositions/big_picture - - file: chapter_decompositions/pseudoinverse - - file: chapter_decompositions/low_rank_approximation - - file: chapter_decompositions/matrix_norms + - file: chapter_decompositions/matrix_rank + - file: chapter_decompositions/determinant + - file: chapter_decompositions/row_equivalence + - file: chapter_decompositions/square_matrices +# - file: chapter_decompositions/eigenvectors +# - file: chapter_decompositions/trace +# - file: chapter_decompositions/orthogonal_matrices +# - file: chapter_decompositions/symmetric_matrices +# - file: chapter_decompositions/psd_matrices +# - file: chapter_decompositions/svd +# - file: chapter_decompositions/big_picture +# - file: chapter_decompositions/pseudoinverse +# - file: chapter_decompositions/matrix_norms # - file: chapter_convexity/overview_convexity # title: Convexity # sections: diff --git a/book/chapter_decompositions/big_picture.md b/book/chapter_decompositions/big_picture.md index 3f01694..07f132d 100644 --- a/book/chapter_decompositions/big_picture.md +++ b/book/chapter_decompositions/big_picture.md @@ -1,12 +1,23 @@ +--- +jupytext: + text_representation: + extension: .md + format_name: myst + format_version: 0.13 + jupytext_version: 1.16.7 +kernelspec: + display_name: Python 3 + language: python + name: python3 +--- ## The fundamental subspaces of a matrix -[]: # -[]: # The fundamental subspaces of a matrix $A$ are the four subspaces associated with the matrix and its transpose. These subspaces are important in linear algebra and numerical analysis, particularly in the context of solving linear systems and eigenvalue problems. -[]: # -[]: # 1. 
**Column Space (Range) of A**: The column space of a matrix $A$ is the set of all possible linear combinations of its columns. It represents the span of the columns of $A$ and is denoted as $\text{Col}(A)$ or $\text{Range}(A)$. -[]: # -[]: # 2. **Null Space (Kernel) of A**: The null space of a matrix $A$ is the set of all vectors $\mathbf{x}$ such that $A\mathbf{x} = \mathbf{0}$. It represents the solutions to the homogeneous equation associated with $A$ and is denoted as $\text{Null}(A)$ or $\text{Ker}(A)$. -[]: # -[]: # 3. **Row Space of A**: The row space of a matrix $A$ is the set of all possible linear combinations of its rows. It is equivalent to the column space of its transpose, $A^\top$, and is denoted as $\text{Row}(A)$ or $\text{Col}(A^\top)$. -[]: # -[]: # 4. **Left Null Space (Kernel) of A**: The left null space of a matrix $A$ is the set of all vectors $\mathbf{y}$ such that $A^\top\mathbf{y} = \mathbf{0}$. It represents the solutions to the homogeneous equation associated with $A^\top$ and is denoted as $\text{Null}(A^\top)$ or $\text{Ker}(A^\top)$. +The fundamental subspaces of a matrix $A$ are the four subspaces associated with the matrix and its transpose. These subspaces are important in linear algebra and numerical analysis, particularly in the context of solving linear systems and eigenvalue problems. + +1. **Column Space (Range) of A**: The column space of a matrix $A$ is the set of all possible linear combinations of its columns. It represents the span of the columns of $A$ and is denoted as $\text{Col}(A)$ or $\text{Range}(A)$. + +2. **Null Space (Kernel) of A**: The null space of a matrix $A$ is the set of all vectors $\mathbf{x}$ such that $A\mathbf{x} = \mathbf{0}$. It represents the solutions to the homogeneous equation associated with $A$ and is denoted as $\text{Null}(A)$ or $\text{Ker}(A)$. + +3. **Row Space of A**: The row space of a matrix $A$ is the set of all possible linear combinations of its rows. It is equivalent to the column space of its transpose, $A^\top$, and is denoted as $\text{Row}(A)$ or $\text{Col}(A^\top)$. + +4. **Left Null Space (Kernel) of A**: The left null space of a matrix $A$ is the set of all vectors $\mathbf{y}$ such that $A^\top\mathbf{y} = \mathbf{0}$. It represents the solutions to the homogeneous equation associated with $A^\top$ and is denoted as $\text{Null}(A^\top)$ or $\text{Ker}(A^\top)$. diff --git a/book/chapter_decompositions/determinant.md b/book/chapter_decompositions/determinant.md new file mode 100644 index 0000000..6bfee75 --- /dev/null +++ b/book/chapter_decompositions/determinant.md @@ -0,0 +1,97 @@ +--- +jupytext: + text_representation: + extension: .md + format_name: myst + format_version: 0.13 + jupytext_version: 1.16.7 +kernelspec: + display_name: Python 3 + language: python + name: python3 +--- +# Determinant + +The **determinant** of a square matrix can be defined in several +different ways. + +Let's illustrate the determinant geometrically. +The determinant can be considered a factor on the change of volume of a unit square after transformation. 
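Before the visualization below, a short numeric check (a sketch; the matrix is the same "Area > 1" example used in the following figure, and the shoelace computation is our own addition): the area of the image of the unit square equals $|\det \mathbf{A}|$.

```{code-cell} ipython3
import numpy as np

# Sketch: |det(A)| equals the area of the unit square after transformation.
A = np.array([[2.0, 0.5],
              [0.5, 1.5]])          # the "Area > 1" matrix from the figure below

# Corners of the unit square, transformed by A.
square = np.array([[0, 1, 1, 0],
                   [0, 0, 1, 1]], dtype=float)
x, y = A @ square

# Shoelace formula for the area of the transformed parallelogram.
area = 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

print(f"det(A)                     = {np.linalg.det(A):.2f}")
print(f"area of transformed square = {area:.2f}")
```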
+ +```{code-cell} ipython3 +:tags: [hide-input] +import numpy as np +import matplotlib.pyplot as plt +from matplotlib.patches import Polygon + +# Define matrices to show area effects +matrices = { + "Area = 1 (Identity)": np.array([[1, 0], [0, 1]]), + "Area > 1": np.array([[2, 0.5], [0.5, 1.5]]), + "Area < 0 (Flip)": np.array([[0, 1], [1, 0]]), + "Rotation (Area = 1)": np.array([[np.cos(np.pi/4), -np.sin(np.pi/4)], + [np.sin(np.pi/4), np.cos(np.pi/4)]]) +} + +# Unit square +square = np.array([[0, 0], + [1, 0], + [1, 1], + [0, 1], + [0, 0]]).T + +fig, axes = plt.subplots(1, 4, figsize=(20, 5)) + +for ax, (title, M) in zip(axes, matrices.items()): + transformed_square = M @ square + area = np.abs(np.linalg.det(M)) + det = np.linalg.det(M) + + # Plot original unit square + ax.plot(square[0], square[1], 'k--', label='Unit square') + ax.fill(square[0], square[1], facecolor='lightgray', alpha=0.4) + + # Plot transformed shape + ax.plot(transformed_square[0], transformed_square[1], 'b-', label='Transformed') + ax.fill(transformed_square[0], transformed_square[1], facecolor='skyblue', alpha=0.6) + + # Add vector arrows for columns of M + origin = np.array([[0, 0]]).T + for i in range(2): + vec = M[:, i] + ax.quiver(*origin, vec[0], vec[1], angles='xy', scale_units='xy', scale=1, color='red') + + ax.set_title(f"{title}\nDet = {det:.2f}, Area = {area:.2f}") + ax.set_xlim(-2, 3) + ax.set_ylim(-2, 3) + ax.set_aspect('equal') + ax.grid(True) + ax.legend() + +plt.suptitle("Geometric Interpretation of the Determinant (Area Scaling and Orientation)", fontsize=16) +plt.tight_layout(rect=[0, 0, 1, 0.93]) +plt.show() +``` + +1. **Identity**: No change — area = 1. +2. **Stretch**: Expands area — determinant > 1. +3. **Flip**: Reflects across the diagonal — determinant < 0. +4. **Rotation**: Rotates without distortion — determinant = 1. + +--- + +The determinant has several important properties: + +(i) $\det(\mathbf{I}) = 1$ + +(ii) $\det(\mathbf{A}^{\!\top\!}) = \det(\mathbf{A})$ + +(iii) $\det(\mathbf{A}\mathbf{B}) = \det(\mathbf{A})\det(\mathbf{B})$ + +(iv) $\det(\mathbf{A}^{-1}) = \det(\mathbf{A})^{-1}$ + +(v) $\det(\alpha\mathbf{A}) = \alpha^n \det(\mathbf{A})$ + +--- + + diff --git a/book/chapter_decompositions/eigenvectors.md b/book/chapter_decompositions/eigenvectors.md index 8486ba8..99e30e6 100644 --- a/book/chapter_decompositions/eigenvectors.md +++ b/book/chapter_decompositions/eigenvectors.md @@ -1,27 +1,113 @@ -## Eigenthings - -For a square matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$, there may +--- +jupytext: + text_representation: + extension: .md + format_name: myst + format_version: 0.13 + jupytext_version: 1.16.7 +kernelspec: + display_name: Python 3 + language: python + name: python3 +--- +# Eigenvalues and Eigenvectors + +For a *square matrix* $\mathbf{A} \in \mathbb{R}^{n \times n}$, there may be vectors which, when $\mathbf{A}$ is applied to them, are simply -scaled by some constant. We say that a nonzero vector -$\mathbf{x} \in \mathbb{R}^n$ is an **eigenvector** of $\mathbf{A}$ -corresponding to **eigenvalue** $\lambda$ if +scaled by some constant. + +We say that a nonzero vector $\mathbf{x} \in \mathbb{R}^n$ is an **eigenvector** of $\mathbf{A}$ corresponding to **eigenvalue** $\lambda$ if $$\mathbf{A}\mathbf{x} = \lambda\mathbf{x}$$ -The zero vector is excluded -from this definition because -$\mathbf{A}\mathbf{0} = \mathbf{0} = \lambda\mathbf{0}$ for every -$\lambda$. 
+The zero vector is excluded from this definition because +$\mathbf{A}\mathbf{0} = \mathbf{0} = \lambda\mathbf{0}$ +for every $\lambda$. + +--- + +Let's look at an example and how multiplication with a matrix $\mathbf{A}$ transforms vectors that lie on the unit circle and, in particular, how it changes it's eivenvectors during multiplication. + +$$ +\mathbf{A} = \begin{pmatrix}1.5 & 0.5 \\ 0.1 & 1.2\end{pmatrix} +$$ + +```{code-cell} ipython3 +:tags: [hide-input] +import numpy as np +import matplotlib.pyplot as plt + +# --- Base matrix --- +A = np.array([[1.5, 0.5], + [0.1, 1.2]]) + +# Compute eigenvalues and eigenvectors +eigvals, eigvecs = np.linalg.eig(A) + +from IPython.display import display, Markdown + +λ1, λ2 = eigvals +display(Markdown(f"The matrix has Eigenvalues λ₁ = {eigvals[0]:.2f}, λ₂ = {eigvals[1]:.2f}.")) + +square = np.array([[0, 1, 1, 0, 0], + [0, 0, 1, 1, 0]]) +transformed_square = A @ square + +# Unit circle for reference +theta = np.linspace(0, 2*np.pi, 100) +circle = np.stack((np.cos(theta), np.sin(theta)), axis=0) + +# Transformed unit circle +circle_transformed = A @ circle + +# Plot settings +fig, ax = plt.subplots(figsize=(6,6)) +ax.plot(circle[0], circle[1], ':', color='gray', label='Unit circle') +ax.plot(circle_transformed[0], circle_transformed[1], color='gray', linestyle='--', label='A ∘ circle') + +# Plot eigenvectors +for i in range(2): + vec = eigvecs[:, i] + ax.quiver(0, 0, vec[0], vec[1], angles='xy', scale_units='xy', scale=1.0, color='blue', width=0.01) + ax.quiver(0, 0, *(eigvals[i] * vec), angles='xy', scale_units='xy', scale=1.0, color='red', width=0.01) + ax.text(*(1.1 * vec), f"v{i+1}", color='blue') + ax.text(*(1.05 * eigvals[i] * vec), f"λ{i+1}·v{i+1}", color='red') + +# Axes +ax.axhline(0, color='gray', lw=1) +ax.axvline(0, color='gray', lw=1) +ax.set_aspect('equal') +ax.set_xlim(-2.1, 2.1) +ax.set_ylim(-2.1, 2.1) +ax.set_title("Eigenvectors are Invariant Directions") +ax.plot(square[0], square[1], 'g:', label='Original square') +ax.plot(transformed_square[0], transformed_square[1], 'g--', label='A ∘ square') + +ax.legend() +plt.grid(True) +plt.show() +``` +The visualization shows: +- The **original unit circle** (dashed black) +- The **transformed unit circle** under $\mathbf{A}$ (solid red) +- The **eigenvectors** in blue and their **scaled images** in green + +Note how the eigenvectors are aligned with the directions that remain unchanged in orientation under transformation — they are only scaled by their respective eigenvalues. + +--- We now give some useful results about how eigenvalues change after various manipulations. - -{prf:proposition} Eigenvalues and eigenvectors +:::{prf:proposition} Eigenvalues and Eigenvectors :label: eigenvalues_eigenvectors_properties +:nonumber: + Let $\mathbf{x}$ be an eigenvector of $\mathbf{A}$ with corresponding -eigenvalue $\lambda$. Then +eigenvalue $\lambda$. + +Then (i) For any $\gamma \in \mathbb{R}$, $\mathbf{x}$ is an eigenvector of $\mathbf{A} + \gamma\mathbf{I}$ with eigenvalue $\lambda + \gamma$. @@ -32,8 +118,88 @@ eigenvalue $\lambda$. Then (iii) $\mathbf{A}^k\mathbf{x} = \lambda^k\mathbf{x}$ for any $k \in \mathbb{Z}$ (where $\mathbf{A}^0 = \mathbf{I}$ by definition). +::: - +Below we illustrate the geometric meaning of Propositions (i)–(iii) using the same original matrix $\mathbf{A}$. 
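Alongside the plots announced above, a minimal numerical check of (i)–(iii) may be useful (a sketch; $\mathbf{A}$ is the example matrix from this page and $\gamma = 0.5$ matches the value used in the plotting code below):

```{code-cell} ipython3
import numpy as np

# Sketch: verify Proposition (i)-(iii) numerically for one eigenpair of A.
A = np.array([[1.5, 0.5],
              [0.1, 1.2]])
gamma = 0.5

eigvals, eigvecs = np.linalg.eig(A)
lam, v = eigvals[0], eigvecs[:, 0]

# (i)  (A + gamma*I) v = (lambda + gamma) v
print(np.allclose((A + gamma * np.eye(2)) @ v, (lam + gamma) * v))

# (ii) A^{-1} v = (1 / lambda) v   (A is invertible here)
print(np.allclose(np.linalg.inv(A) @ v, v / lam))

# (iii) A^k v = lambda^k v, for example k = 3
print(np.allclose(np.linalg.matrix_power(A, 3) @ v, lam**3 * v))
```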
+ +Each subplot shows: +- The **unit circle** (dashed black) +- The **circle transformed by the original matrix $\mathbf{A}$** (dotted gray) +- The **circle transformed by the modified matrix** (solid red) +- An **eigenvector of $\mathbf{A}$** (blue) +- The **eigenvector after transformation** by the modified matrix (red arrow) + +```{code-cell} ipython3 +:tags: [hide-input] +import numpy as np +import matplotlib.pyplot as plt + +def plot_eig_effect(ax, A_original, A_transformed, transformation_label, proposition_label, color_vec='red', color_circle='crimson',): + # Unit circle + theta = np.linspace(0, 2 * np.pi, 200) + circle = np.vstack((np.cos(theta), np.sin(theta))) + circle_A = A_original @ circle + circle_transformed = A_transformed @ circle + + # Eigenvectors and values of A_original + eigvals, eigvecs = np.linalg.eig(A_original) + + # Plot unit and transformed circles + ax.plot(circle[0], circle[1], 'k--', label='Unit Circle') + ax.plot(circle_A[0], circle_A[1], color='gray', linestyle=':', label='A ∘ Circle') + ax.plot(circle_transformed[0], circle_transformed[1], color=color_circle, label=transformation_label+' ∘ Circle') + + for i in range(2): + v = eigvecs[:, i] + v = v / np.linalg.norm(v) + Atrans_v = A_transformed @ v + + # Plot eigenvector and its transformed image + ax.quiver(0, 0, v[0], v[1], angles='xy', scale_units='xy', scale=1, color='blue', label=r'Eigenvector $\mathbf{v}$' if i == 0 else None) + ax.quiver(0, 0, Atrans_v[0], Atrans_v[1], angles='xy', scale_units='xy', scale=1, color=color_vec, label=transformation_label+r' ∘ $\mathbf{v}$' if i == 0 else None) + + # Formatting + ax.set_xlim(-3, 3) + ax.set_ylim(-3, 3) + ax.set_aspect('equal') + ax.axhline(0, color='gray', lw=0.5) + ax.axvline(0, color='gray', lw=0.5) + ax.set_title(proposition_label + transformation_label) + ax.grid(True) + +# --- Base matrix --- +A = np.array([[1.5, 0.5], + [0.1, 1.2]]) + +# --- Matrix variants --- +gamma = 0.5 +A_shifted = A + gamma * np.eye(2) +A_inv = np.linalg.inv(A) +A_sq = A @ A + +# --- Plotting --- +fig, axes = plt.subplots(1, 3, figsize=(18, 6)) + +plot_eig_effect(axes[0], A, A_shifted, "(A + γI)", proposition_label= "i): ") +plot_eig_effect(axes[1], A, A_inv, "A⁻¹", proposition_label= "ii): ") +plot_eig_effect(axes[2], A, A_sq, "A²", proposition_label= "iii): ") + +for ax in axes: + ax.legend() + +plt.suptitle("Eigenvector Transformations for Proposition (i)–(iii)", fontsize=16) +plt.tight_layout(rect=[0, 0, 1, 0.95]) +plt.show() + +``` +We observe that: +- The eigenvector direction is **invariant** (it doesn’t rotate) +- The **scaling changes** depending on the transformation: + - In (i), $\mathbf{A} + \gamma \mathbf{I}$ adds $\gamma$ to the eigenvalue. + - In (ii), $\mathbf{A}^{-1}$ inverts the eigenvalue. + - In (iii), $\mathbf{A}^2$ squares the eigenvalue. + +Note how the red-transformed circles deform differently in each panel, but the eigenvector stays aligned. :::{prf:proof} (i) follows readily: @@ -53,485 +219,148 @@ Then the general case $k \in \mathbb{Z}$ follows by combining the $k \geq 0$ case with (ii). 
◻ ::: -## Trace - -The **trace** of a square matrix is the sum of its diagonal entries: - -$$\operatorname{tr}(\mathbf{A}) = \sum_{i=1}^n A_{ii}$$ - -The trace has several nice -algebraic properties: - -(i) $\operatorname{tr}(\mathbf{A}+\mathbf{B}) = \operatorname{tr}(\mathbf{A}) + \operatorname{tr}(\mathbf{B})$ - -(ii) $\operatorname{tr}(\alpha\mathbf{A}) = \alpha\operatorname{tr}(\mathbf{A})$ -(iii) $\operatorname{tr}(\mathbf{A}^{\!\top\!}) = \operatorname{tr}(\mathbf{A})$ +## Eigenvectors can be real-valued or complex. +There's a deep and intuitive geometric distinction between linear maps that have **only real eigenvectors** and those that have **complex eigenvectors**. -(iv) $\operatorname{tr}(\mathbf{A}\mathbf{B}\mathbf{C}\mathbf{D}) = \operatorname{tr}(\mathbf{B}\mathbf{C}\mathbf{D}\mathbf{A}) = \operatorname{tr}(\mathbf{C}\mathbf{D}\mathbf{A}\mathbf{B}) = \operatorname{tr}(\mathbf{D}\mathbf{A}\mathbf{B}\mathbf{C})$ +Here’s a breakdown: -The first three properties follow readily from the definition. The last -is known as **invariance under cyclic permutations**. Note that the -matrices cannot be reordered arbitrarily, for example -$\operatorname{tr}(\mathbf{A}\mathbf{B}\mathbf{C}\mathbf{D}) \neq \operatorname{tr}(\mathbf{B}\mathbf{A}\mathbf{C}\mathbf{D})$ -in general. Also, there is nothing special about the product of four -matrices -- analogous rules hold for more or fewer matrices. -Interestingly, the trace of a matrix is equal to the sum of its -eigenvalues (repeated according to multiplicity): +### Real Eigenvectors → Maps That Stretch or Reflect Along Fixed Directions -$$\operatorname{tr}(\mathbf{A}) = \sum_i \lambda_i(\mathbf{A})$$ +If a matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$ has **only real eigenvalues and eigenvectors**, it means: -## Determinant +* There exist **real directions** in space that are preserved (up to scaling). +* The action of the matrix is intuitively: -The **determinant** of a square matrix can be defined in several -different confusing ways, none of which are particularly important for -our purposes; go look at an introductory linear algebra text (or -Wikipedia) if you need a definition. But it's good to know the -properties: + * **Scaling** (positive eigenvalues) + * **Reflection + scaling** (negative eigenvalues) +* You can visualize this as: -(i) $\det(\mathbf{I}) = 1$ - -(ii) $\det(\mathbf{A}^{\!\top\!}) = \det(\mathbf{A})$ - -(iii) $\det(\mathbf{A}\mathbf{B}) = \det(\mathbf{A})\det(\mathbf{B})$ - -(iv) $\det(\mathbf{A}^{-1}) = \det(\mathbf{A})^{-1}$ - -(v) $\det(\alpha\mathbf{A}) = \alpha^n \det(\mathbf{A})$ - -Interestingly, the determinant of a matrix is equal to the product of -its eigenvalues (repeated according to multiplicity): - -$$\det(\mathbf{A}) = \prod_i \lambda_i(\mathbf{A})$$ + * Pulling/stretching space along certain axes + * Possibly flipping directions -## Orthogonal matrices -A matrix $\mathbf{Q} \in \mathbb{R}^{n \times n}$ is said to be -**orthogonal** if its columns are pairwise orthonormal. This definition -implies that +### Complex Eigenvectors → Maps That Rotate or Spiral -$$\mathbf{Q}^{\!\top\!} \mathbf{Q} = \mathbf{Q}\mathbf{Q}^{\!\top\!} = \mathbf{I}$$ +If a matrix has **complex eigenvalues** and **no real eigenvectors**, it **cannot leave any real direction invariant**. This typically corresponds to: -or equivalently, $\mathbf{Q}^{\!\top\!} = \mathbf{Q}^{-1}$. 
A nice thing -about orthogonal matrices is that they preserve inner products: +* **Rotation** or **spiral** motion +* Sometimes **rotation + scaling** (when complex eigenvalues have modulus $\ne 1$) +* The action in real space: -$$(\mathbf{Q}\mathbf{x})^{\!\top\!}(\mathbf{Q}\mathbf{y}) = \mathbf{x}^{\!\top\!} \mathbf{Q}^{\!\top\!} \mathbf{Q}\mathbf{y} = \mathbf{x}^{\!\top\!} \mathbf{I}\mathbf{y} = \mathbf{x}^{\!\top\!}\mathbf{y}$$ + * **No real eigenvector** + * Points are **rotated** or **rotated and scaled** + * Repeated application creates **circular** or **spiraling trajectories** -A direct result of this fact is that they also preserve 2-norms: +#### Example: Stretching vs. Shearing vs. Rotation +* **Stretching**: scales space differently along the axes. The matrix has only real eigenvalues and eigenvectors. +* **Shearing**: shifts one axis direction while keeping the other fixed. The matrix has only real eigenvalues and eigenvectors. +* **Rotation**: turns everything around the origin. The matrix has only complex eigenvalues and eigenvectors. -$$\|\mathbf{Q}\mathbf{x}\|_2 = \sqrt{(\mathbf{Q}\mathbf{x})^{\!\top\!}(\mathbf{Q}\mathbf{x})} = \sqrt{\mathbf{x}^{\!\top\!}\mathbf{x}} = \|\mathbf{x}\|_2$$ +Each transformation is applied to a **unit square** and a **grid**, so you can clearly see how space is deformed under each linear map. +```{code-cell} ipython3 +:tags: [hide-input] +import numpy as np +import matplotlib.pyplot as plt -Therefore multiplication by an orthogonal matrix can be considered as a -transformation that preserves length, but may rotate or reflect the -vector about the origin. +def apply_transform(grid, matrix): + return np.tensordot(matrix, grid, axes=1) -## Symmetric matrices +def draw_transform(ax, matrix, title, color='red'): + # Draw original grid + x = np.linspace(-1, 1, 11) + y = np.linspace(-1, 1, 11) + for xi in x: + ax.plot([xi]*len(y), y, color='lightgray', lw=0.5) + for yi in y: + ax.plot(x, [yi]*len(x), color='lightgray', lw=0.5) -A matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$ is said to be -**symmetric** if it is equal to its own transpose -($\mathbf{A} = \mathbf{A}^{\!\top\!}$), meaning that $A_{ij} = A_{ji}$ -for all $(i,j)$. This definition seems harmless enough but turns out to -have some strong implications. We summarize the most important of these -as + # Draw transformed grid + for xi in x: + line = np.stack(([xi]*len(y), y)) + transformed = apply_transform(line, matrix) + ax.plot(transformed[0], transformed[1], color=color, lw=1) + for yi in y: + line = np.stack((x, [yi]*len(x))) + transformed = apply_transform(line, matrix) + ax.plot(transformed[0], transformed[1], color=color, lw=1) -*Theorem.* -(Spectral Theorem) If $\mathbf{A} \in \mathbb{R}^{n \times n}$ is -symmetric, then there exists an orthonormal basis for $\mathbb{R}^n$ -consisting of eigenvectors of $\mathbf{A}$. + # Draw unit square before and after + square = np.array([[0, 1, 1, 0, 0], + [0, 0, 1, 1, 0]]) + transformed_square = matrix @ square + ax.plot(square[0], square[1], 'k--', label='Original square') + ax.plot(transformed_square[0], transformed_square[1], 'k-', label='Transformed square') + ax.set_aspect('equal') + ax.set_xlim(-2, 2) + ax.set_ylim(-2, 2) + ax.set_title(title) + ax.legend() -The practical application of this theorem is a particular factorization -of symmetric matrices, referred to as the **eigendecomposition** or -**spectral decomposition**. 
Denote the orthonormal basis of eigenvectors -$\mathbf{q}_1, \dots, \mathbf{q}_n$ and their eigenvalues -$\lambda_1, \dots, \lambda_n$. Let $\mathbf{Q}$ be an orthogonal matrix -with $\mathbf{q}_1, \dots, \mathbf{q}_n$ as its columns, and -$\mathbf{\Lambda} = \operatorname{diag}(\lambda_1, \dots, \lambda_n)$. Since by -definition $\mathbf{A}\mathbf{q}_i = \lambda_i\mathbf{q}_i$ for every -$i$, the following relationship holds: +# Define transformation matrices +stretch = np.array([[1.5, 0], + [0, 0.5]]) -$$\mathbf{A}\mathbf{Q} = \mathbf{Q}\mathbf{\Lambda}$$ +shear = np.array([[1, 1], + [0, 1]]) -Right-multiplying -by $\mathbf{Q}^{\!\top\!}$, we arrive at the decomposition +theta = np.pi / 4 +rotation = np.array([[np.cos(theta), -np.sin(theta)], + [np.sin(theta), np.cos(theta)]]) -$$\mathbf{A} = \mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^{\!\top\!}$$ +# Plot all three +fig, axes = plt.subplots(1, 3, figsize=(15, 5)) +draw_transform(axes[0], stretch, "Stretching") +draw_transform(axes[1], shear, "Shearing") +draw_transform(axes[2], rotation, "Rotation") +plt.suptitle("Linear Transformations: Stretch vs Shear vs Rotation", fontsize=14) +plt.tight_layout(rect=[0, 0, 1, 0.95]) +plt.show() +``` -### Rayleigh quotients - -Let $\mathbf{A} \in \mathbb{R}^{n \times n}$ be a symmetric matrix. The -expression $\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x}$ is called a -**quadratic form**. - -There turns out to be an interesting connection between the quadratic -form of a symmetric matrix and its eigenvalues. This connection is -provided by the **Rayleigh quotient** - -$$R_\mathbf{A}(\mathbf{x}) = \frac{\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x}}{\mathbf{x}^{\!\top\!}\mathbf{x}}$$ - -The Rayleigh quotient has a couple of important properties which the -reader can (and should!) easily verify from the definition: - -(i) **Scale invariance**: for any vector $\mathbf{x} \neq \mathbf{0}$ - and any scalar $\alpha \neq 0$, - $R_\mathbf{A}(\mathbf{x}) = R_\mathbf{A}(\alpha\mathbf{x})$. +## Relationship between Eienvalues and Determinant +Interestingly, the determinant of a matrix is equal to the product of +its eigenvalues (repeated according to multiplicity): -(ii) If $\mathbf{x}$ is an eigenvector of $\mathbf{A}$ with eigenvalue - $\lambda$, then $R_\mathbf{A}(\mathbf{x}) = \lambda$. +$$\det(\mathbf{A}) = \prod_i \lambda_i(\mathbf{A})$$ -We can further show that the Rayleigh quotient is bounded by the largest -and smallest eigenvalues of $\mathbf{A}$. But first we will show a -useful special case of the final result. +This provides a means to find the eigenvalues by deriving the roots of the characteristic polynomial. -{prf:proposition} Rayleigh quotient bounds -:label: rayleigh_quotient_bounds -For any $\mathbf{x}$ such that $\|\mathbf{x}\|_2 = 1$, +:::{prf:corollary} Characteristic Polynomial +:label: trm-characteristic-polynomial +:nonumber: -$$\lambda_{\min}(\mathbf{A}) \leq \mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} \leq \lambda_{\max}(\mathbf{A})$$ +The eigenvalues of a matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$ are the **roots of its characteristic polynomial** defined as: -with equality if and only if $\mathbf{x}$ is a corresponding -eigenvector. +$$ +p(\lambda) = \det(\mathbf{A} - \lambda \mathbf{I}) +$$ +It is a degree-$n$ polynomial in $\lambda$, and its roots are precisely the eigenvalues of $\mathbf{A}$. -:::{prf:proof} -*Proof.* We show only the $\max$ case because the argument for the -$\min$ case is entirely analogous. 
- -Since $\mathbf{A}$ is symmetric, we can decompose it as -$\mathbf{A} = \mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^{\!\top\!}$. Then use -the change of variable $\mathbf{y} = \mathbf{Q}^{\!\top\!}\mathbf{x}$, -noting that the relationship between $\mathbf{x}$ and $\mathbf{y}$ is -one-to-one and that $\|\mathbf{y}\|_2 = 1$ since $\mathbf{Q}$ is -orthogonal. Hence - -$$\max_{\|\mathbf{x}\|_2 = 1} \mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} = \max_{\|\mathbf{y}\|_2 = 1} \mathbf{y}^{\!\top\!}\mathbf{\Lambda}\mathbf{y} = \max_{y_1^2+\dots+y_n^2=1} \sum_{i=1}^n \lambda_i y_i^2$$ - -Written this way, it is clear that $\mathbf{y}$ maximizes this -expression exactly if and only if it satisfies -$\sum_{i \in I} y_i^2 = 1$ where -$I = \{i : \lambda_i = \max_{j=1,\dots,n} \lambda_j = \lambda_{\max}(\mathbf{A})\}$ -and $y_j = 0$ for $j \not\in I$. That is, $I$ contains the index or -indices of the largest eigenvalue. In this case, the maximal value of -the expression is - -$$\sum_{i=1}^n \lambda_i y_i^2 = \sum_{i \in I} \lambda_i y_i^2 = \lambda_{\max}(\mathbf{A}) \sum_{i \in I} y_i^2 = \lambda_{\max}(\mathbf{A})$$ - -Then writing $\mathbf{q}_1, \dots, \mathbf{q}_n$ for the columns of -$\mathbf{Q}$, we have - -$$\mathbf{x} = \mathbf{Q}\mathbf{Q}^{\!\top\!}\mathbf{x} = \mathbf{Q}\mathbf{y} = \sum_{i=1}^n y_i\mathbf{q}_i = \sum_{i \in I} y_i\mathbf{q}_i$$ - -where we have used the matrix-vector product identity. - -Recall that $\mathbf{q}_1, \dots, \mathbf{q}_n$ are eigenvectors of -$\mathbf{A}$ and form an orthonormal basis for $\mathbb{R}^n$. Therefore -by construction, the set $\{\mathbf{q}_i : i \in I\}$ forms an -orthonormal basis for the eigenspace of $\lambda_{\max}(\mathbf{A})$. -Hence $\mathbf{x}$, which is a linear combination of these, lies in that -eigenspace and thus is an eigenvector of $\mathbf{A}$ corresponding to -$\lambda_{\max}(\mathbf{A})$. - -We have shown that -$\max_{\|\mathbf{x}\|_2 = 1} \mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} = \lambda_{\max}(\mathbf{A})$, -from which we have the general inequality -$\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} \leq \lambda_{\max}(\mathbf{A})$ -for all unit-length $\mathbf{x}$. ◻ ::: -By the scale invariance of the Rayleigh quotient, we immediately have as -a corollary (since -$\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} = R_{\mathbf{A}}(\mathbf{x})$ -for unit $\mathbf{x}$) - -{prf:theorem} Min-max theorem -*Theorem.* For all $\mathbf{x} \neq \mathbf{0}$, - -$$\lambda_{\min}(\mathbf{A}) \leq R_\mathbf{A}(\mathbf{x}) \leq \lambda_{\max}(\mathbf{A})$$ - -with equality if and only if $\mathbf{x}$ is a corresponding -eigenvector. - - -## Positive (semi-)definite matrices - -A symmetric matrix $\mathbf{A}$ is **positive semi-definite** if for all -$\mathbf{x} \in \mathbb{R}^n$, -$\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} \geq 0$. Sometimes people -write $\mathbf{A} \succeq 0$ to indicate that $\mathbf{A}$ is positive -semi-definite. - -A symmetric matrix $\mathbf{A}$ is **positive definite** if for all -nonzero $\mathbf{x} \in \mathbb{R}^n$, -$\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} > 0$. Sometimes people write -$\mathbf{A} \succ 0$ to indicate that $\mathbf{A}$ is positive definite. -Note that positive definiteness is a strictly stronger property than -positive semi-definiteness, in the sense that every positive definite -matrix is positive semi-definite but not vice-versa. - -These properties are related to eigenvalues in the following way. 
- -*Proposition.* -A symmetric matrix is positive semi-definite if and only if all of its -eigenvalues are nonnegative, and positive definite if and only if all of -its eigenvalues are positive. - - -*Proof.* Suppose $A$ is positive semi-definite, and let $\mathbf{x}$ be -an eigenvector of $\mathbf{A}$ with eigenvalue $\lambda$. Then - -$$0 \leq \mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} = \mathbf{x}^{\!\top\!}(\lambda\mathbf{x}) = \lambda\mathbf{x}^{\!\top\!}\mathbf{x} = \lambda\|\mathbf{x}\|_2^2$$ - -Since $\mathbf{x} \neq \mathbf{0}$ (by the assumption that it is an -eigenvector), we have $\|\mathbf{x}\|_2^2 > 0$, so we can divide both -sides by $\|\mathbf{x}\|_2^2$ to arrive at $\lambda \geq 0$. If -$\mathbf{A}$ is positive definite, the inequality above holds strictly, -so $\lambda > 0$. This proves one direction. - -To simplify the proof of the other direction, we will use the machinery -of Rayleigh quotients. Suppose that $\mathbf{A}$ is symmetric and all -its eigenvalues are nonnegative. Then for all -$\mathbf{x} \neq \mathbf{0}$, - -$$0 \leq \lambda_{\min}(\mathbf{A}) \leq R_\mathbf{A}(\mathbf{x})$$ - -Since $\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x}$ matches -$R_\mathbf{A}(\mathbf{x})$ in sign, we conclude that $\mathbf{A}$ is -positive semi-definite. If the eigenvalues of $\mathbf{A}$ are all -strictly positive, then $0 < \lambda_{\min}(\mathbf{A})$, whence it -follows that $\mathbf{A}$ is positive definite. ◻ - - -As an example of how these matrices arise, consider - -*Proposition.* -Suppose $\mathbf{A} \in \mathbb{R}^{m \times n}$. Then -$\mathbf{A}^{\!\top\!}\mathbf{A}$ is positive semi-definite. If -$\operatorname{null}(\mathbf{A}) = \{\mathbf{0}\}$, then -$\mathbf{A}^{\!\top\!}\mathbf{A}$ is positive definite. - - -*Proof.* For any $\mathbf{x} \in \mathbb{R}^n$, - -$$\mathbf{x}^{\!\top\!} (\mathbf{A}^{\!\top\!}\mathbf{A})\mathbf{x} = (\mathbf{A}\mathbf{x})^{\!\top\!}(\mathbf{A}\mathbf{x}) = \|\mathbf{A}\mathbf{x}\|_2^2 \geq 0$$ - -so $\mathbf{A}^{\!\top\!}\mathbf{A}$ is positive semi-definite. - -Note that $\|\mathbf{A}\mathbf{x}\|_2^2 = 0$ implies -$\|\mathbf{A}\mathbf{x}\|_2 = 0$, which in turn implies -$\mathbf{A}\mathbf{x} = \mathbf{0}$ (recall that this is a property of -norms). If $\operatorname{null}(\mathbf{A}) = \{\mathbf{0}\}$, -$\mathbf{A}\mathbf{x} = \mathbf{0}$ implies $\mathbf{x} = \mathbf{0}$, -so -$\mathbf{x}^{\!\top\!} (\mathbf{A}^{\!\top\!}\mathbf{A})\mathbf{x} = 0$ -if and only if $\mathbf{x} = \mathbf{0}$, and thus -$\mathbf{A}^{\!\top\!}\mathbf{A}$ is positive definite. ◻ - -Positive definite matrices are invertible (since their eigenvalues are -nonzero), whereas positive semi-definite matrices might not be. However, -if you already have a positive semi-definite matrix, it is possible to -perturb its diagonal slightly to produce a positive definite matrix. - -*Proposition.* -If $\mathbf{A}$ is positive semi-definite and $\epsilon > 0$, then -$\mathbf{A} + \epsilon\mathbf{I}$ is positive definite. - -*Proof.* Assuming $\mathbf{A}$ is positive semi-definite and -$\epsilon > 0$, we have for any $\mathbf{x} \neq \mathbf{0}$ that - -$$\mathbf{x}^{\!\top\!}(\mathbf{A}+\epsilon\mathbf{I})\mathbf{x} = \mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} + \epsilon\mathbf{x}^{\!\top\!}\mathbf{I}\mathbf{x} = \underbrace{\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x}}_{\geq 0} + \underbrace{\epsilon\|\mathbf{x}\|_2^2}_{> 0} > 0$$ - -as claimed. 
◻ - -An obvious but frequently useful consequence of the two propositions we -have just shown is that -$\mathbf{A}^{\!\top\!}\mathbf{A} + \epsilon\mathbf{I}$ is positive -definite (and in particular, invertible) for *any* matrix $\mathbf{A}$ -and any $\epsilon > 0$. - -### The geometry of positive definite quadratic forms - -A useful way to understand quadratic forms is by the geometry of their -level sets. A **level set** or **isocontour** of a function is the set -of all inputs such that the function applied to those inputs yields a -given output. Mathematically, the $c$-isocontour of $f$ is -$\{\mathbf{x} \in \operatorname{dom} f : f(\mathbf{x}) = c\}$. - -Let us consider the special case -$f(\mathbf{x}) = \mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x}$ where -$\mathbf{A}$ is a positive definite matrix. Since $\mathbf{A}$ is -positive definite, it has a unique matrix square root -$\mathbf{A}^{\frac{1}{2}} = \mathbf{Q}\mathbf{\Lambda}^{\frac{1}{2}}\mathbf{Q}^{\!\top\!}$, -where $\mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^{\!\top\!}$ is the -eigendecomposition of $\mathbf{A}$ and -$\mathbf{\Lambda}^{\frac{1}{2}} = \operatorname{diag}(\sqrt{\lambda_1}, \dots \sqrt{\lambda_n})$. -It is easy to see that this matrix $\mathbf{A}^{\frac{1}{2}}$ is -positive definite (consider its eigenvalues) and satisfies -$\mathbf{A}^{\frac{1}{2}}\mathbf{A}^{\frac{1}{2}} = \mathbf{A}$. Fixing -a value $c \geq 0$, the $c$-isocontour of $f$ is the set of -$\mathbf{x} \in \mathbb{R}^n$ such that - -$$c = \mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} = \mathbf{x}^{\!\top\!}\mathbf{A}^{\frac{1}{2}}\mathbf{A}^{\frac{1}{2}}\mathbf{x} = \|\mathbf{A}^{\frac{1}{2}}\mathbf{x}\|_2^2$$ - -where we have used the symmetry of $\mathbf{A}^{\frac{1}{2}}$. Making -the change of variable -$\mathbf{z} = \mathbf{A}^{\frac{1}{2}}\mathbf{x}$, we have the condition -$\|\mathbf{z}\|_2 = \sqrt{c}$. That is, the values $\mathbf{z}$ lie on a -sphere of radius $\sqrt{c}$. These can be parameterized as -$\mathbf{z} = \sqrt{c}\hat{\mathbf{z}}$ where $\hat{\mathbf{z}}$ has -$\|\hat{\mathbf{z}}\|_2 = 1$. Then since -$\mathbf{A}^{-\frac{1}{2}} = \mathbf{Q}\mathbf{\Lambda}^{-\frac{1}{2}}\mathbf{Q}^{\!\top\!}$, -we have - -$$\mathbf{x} = \mathbf{A}^{-\frac{1}{2}}\mathbf{z} = \mathbf{Q}\mathbf{\Lambda}^{-\frac{1}{2}}\mathbf{Q}^{\!\top\!}\sqrt{c}\hat{\mathbf{z}} = \sqrt{c}\mathbf{Q}\mathbf{\Lambda}^{-\frac{1}{2}}\tilde{\mathbf{z}}$$ - -where $\tilde{\mathbf{z}} = \mathbf{Q}^{\!\top\!}\hat{\mathbf{z}}$ also -satisfies $\|\tilde{\mathbf{z}}\|_2 = 1$ since $\mathbf{Q}$ is -orthogonal. Using this parameterization, we see that the solution set -$\{\mathbf{x} \in \mathbb{R}^n : f(\mathbf{x}) = c\}$ is the image of -the unit sphere -$\{\tilde{\mathbf{z}} \in \mathbb{R}^n : \|\tilde{\mathbf{z}}\|_2 = 1\}$ -under the invertible linear map -$\mathbf{x} = \sqrt{c}\mathbf{Q}\mathbf{\Lambda}^{-\frac{1}{2}}\tilde{\mathbf{z}}$. - -What we have gained with all these manipulations is a clear algebraic -understanding of the $c$-isocontour of $f$ in terms of a sequence of -linear transformations applied to a well-understood set. We begin with -the unit sphere, then scale every axis $i$ by -$\lambda_i^{-\frac{1}{2}}$, resulting in an axis-aligned ellipsoid. -Observe that the axis lengths of the ellipsoid are proportional to the -inverse square roots of the eigenvalues of $\mathbf{A}$. Hence larger -eigenvalues correspond to shorter axis lengths, and vice-versa. - -Then this axis-aligned ellipsoid undergoes a rigid transformation (i.e. 
-one that preserves length and angles, such as a rotation/reflection) -given by $\mathbf{Q}$. The result of this transformation is that the -axes of the ellipse are no longer along the coordinate axes in general, -but rather along the directions given by the corresponding eigenvectors. -To see this, consider the unit vector $\mathbf{e}_i \in \mathbb{R}^n$ -that has $[\mathbf{e}_i]_j = \delta_{ij}$. In the pre-transformed space, -this vector points along the axis with length proportional to -$\lambda_i^{-\frac{1}{2}}$. But after applying the rigid transformation -$\mathbf{Q}$, the resulting vector points in the direction of the -corresponding eigenvector $\mathbf{q}_i$, since - -$$\mathbf{Q}\mathbf{e}_i = \sum_{j=1}^n [\mathbf{e}_i]_j\mathbf{q}_j = \mathbf{q}_i$$ - -where we have used the matrix-vector product identity from earlier. - -In summary: the isocontours of -$f(\mathbf{x}) = \mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x}$ are -ellipsoids such that the axes point in the directions of the -eigenvectors of $\mathbf{A}$, and the radii of these axes are -proportional to the inverse square roots of the corresponding -eigenvalues. - -## Singular value decomposition - -Singular value decomposition (SVD) is a widely applicable tool in linear -algebra. Its strength stems partially from the fact that *every matrix* -$\mathbf{A} \in \mathbb{R}^{m \times n}$ has an SVD (even non-square -matrices)! The decomposition goes as follows: - -$$\mathbf{A} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^{\!\top\!}$$ - -where -$\mathbf{U} \in \mathbb{R}^{m \times m}$ and -$\mathbf{V} \in \mathbb{R}^{n \times n}$ are orthogonal matrices and -$\mathbf{\Sigma} \in \mathbb{R}^{m \times n}$ is a diagonal matrix with -the **singular values** of $\mathbf{A}$ (denoted $\sigma_i$) on its -diagonal. - -By convention, the singular values are given in non-increasing order, -i.e. - -$$\sigma_1 \geq \sigma_2 \geq \dots \geq \sigma_{\min(m,n)} \geq 0$$ - -Only the first $r$ singular values are nonzero, where $r$ is the rank of -$\mathbf{A}$. - -Observe that the SVD factors provide eigendecompositions for -$\mathbf{A}^{\!\top\!}\mathbf{A}$ and $\mathbf{A}\mathbf{A}^{\!\top\!}$: - -$$\begin{aligned} -\mathbf{A}^{\!\top\!}\mathbf{A} &= (\mathbf{U}\mathbf{\Sigma}\mathbf{V}^{\!\top\!})^{\!\top\!}\mathbf{U}\mathbf{\Sigma}\mathbf{V}^{\!\top\!} = \mathbf{V}\mathbf{\Sigma}^{\!\top\!}\mathbf{U}^{\!\top\!}\mathbf{U}\mathbf{\Sigma}\mathbf{V}^{\!\top\!} = \mathbf{V}\mathbf{\Sigma}^{\!\top\!}\mathbf{\Sigma}\mathbf{V}^{\!\top\!} \\ -\mathbf{A}\mathbf{A}^{\!\top\!} &= \mathbf{U}\mathbf{\Sigma}\mathbf{V}^{\!\top\!}(\mathbf{U}\mathbf{\Sigma}\mathbf{V}^{\!\top\!})^{\!\top\!} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^{\!\top\!}\mathbf{V}\mathbf{\Sigma}^{\!\top\!}\mathbf{U}^{\!\top\!} = \mathbf{U}\mathbf{\Sigma}\mathbf{\Sigma}^{\!\top\!}\mathbf{U}^{\!\top\!} -\end{aligned}$$ - -It follows immediately that the columns of $\mathbf{V}$ -(the **right-singular vectors** of $\mathbf{A}$) are eigenvectors of -$\mathbf{A}^{\!\top\!}\mathbf{A}$, and the columns of $\mathbf{U}$ (the -**left-singular vectors** of $\mathbf{A}$) are eigenvectors of -$\mathbf{A}\mathbf{A}^{\!\top\!}$. - -The matrices $\mathbf{\Sigma}^{\!\top\!}\mathbf{\Sigma}$ and -$\mathbf{\Sigma}\mathbf{\Sigma}^{\!\top\!}$ are not necessarily the same -size, but both are diagonal with the squared singular values -$\sigma_i^2$ on the diagonal (plus possibly some zeros). 
Thus the -singular values of $\mathbf{A}$ are the square roots of the eigenvalues -of $\mathbf{A}^{\!\top\!}\mathbf{A}$ (or equivalently, of -$\mathbf{A}\mathbf{A}^{\!\top\!}$)[^5]. - -## Some useful matrix identities - -### Matrix-vector product as linear combination of matrix columns - -*Proposition.* -Let $\mathbf{x} \in \mathbb{R}^n$ be a vector and -$\mathbf{A} \in \mathbb{R}^{m \times n}$ a matrix with columns -$\mathbf{a}_1, \dots, \mathbf{a}_n$. Then - -$$\mathbf{A}\mathbf{x} = \sum_{i=1}^n x_i\mathbf{a}_i$$ - -This identity is extremely useful in understanding linear operators in -terms of their matrices' columns. The proof is very simple (consider -each element of $\mathbf{A}\mathbf{x}$ individually and expand by -definitions) but it is a good exercise to convince yourself. - -### Sum of outer products as matrix-matrix product - -An **outer product** is an expression of the form -$\mathbf{a}\mathbf{b}^{\!\top\!}$, where $\mathbf{a} \in \mathbb{R}^m$ -and $\mathbf{b} \in \mathbb{R}^n$. By inspection it is not hard to see -that such an expression yields an $m \times n$ matrix such that - -$$[\mathbf{a}\mathbf{b}^{\!\top\!}]_{ij} = a_ib_j$$ - -It is not -immediately obvious, but the sum of outer products is actually -equivalent to an appropriate matrix-matrix product! We formalize this -statement as - -*Proposition.* -Let $\mathbf{a}_1, \dots, \mathbf{a}_k \in \mathbb{R}^m$ and -$\mathbf{b}_1, \dots, \mathbf{b}_k \in \mathbb{R}^n$. Then - -$$\sum_{\ell=1}^k \mathbf{a}_\ell\mathbf{b}_\ell^{\!\top\!} = \mathbf{A}\mathbf{B}^{\!\top\!}$$ - -where - -$$\mathbf{A} = \begin{bmatrix}\mathbf{a}_1 & \cdots & \mathbf{a}_k\end{bmatrix}, \hspace{0.5cm} \mathbf{B} = \begin{bmatrix}\mathbf{b}_1 & \cdots & \mathbf{b}_k\end{bmatrix}$$ - -*Proof.* For each $(i,j)$, we have +:::{prf:proof} +By definition, $\lambda$ is an **eigenvalue** of $\mathbf{A}$ if: -$$\left[\sum_{\ell=1}^k \mathbf{a}_\ell\mathbf{b}_\ell^{\!\top\!}\right]_{ij} = \sum_{\ell=1}^k [\mathbf{a}_\ell\mathbf{b}_\ell^{\!\top\!}]_{ij} = \sum_{\ell=1}^k [\mathbf{a}_\ell]_i[\mathbf{b}_\ell]_j = \sum_{\ell=1}^k A_{i\ell}B_{j\ell}$$ +$$ +\exists \, \mathbf{x} \neq \mathbf{0} \text{ such that } \mathbf{A} \mathbf{x} = \lambda \mathbf{x} +$$ -This last expression should be recognized as an inner product between -the $i$th row of $\mathbf{A}$ and the $j$th row of $\mathbf{B}$, or -equivalently the $j$th column of $\mathbf{B}^{\!\top\!}$. Hence by the -definition of matrix multiplication, it is equal to -$[\mathbf{A}\mathbf{B}^{\!\top\!}]_{ij}$. ◻ +Rewriting: -### Quadratic forms +$$ +(\mathbf{A} - \lambda \mathbf{I}) \mathbf{x} = \mathbf{0} +$$ -Let $\mathbf{A} \in \mathbb{R}^{n \times n}$ be a symmetric matrix, and -recall that the expression $\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x}$ -is called a quadratic form of $\mathbf{A}$. It is in some cases helpful -to rewrite the quadratic form in terms of the individual elements that -make up $\mathbf{A}$ and $\mathbf{x}$: +This is a homogeneous linear system. +A **nontrivial solution** exists **if and only if** the matrix $\mathbf{A} - \lambda \mathbf{I}$ is **not invertible**, which means: -$$\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} = \sum_{i=1}^n\sum_{j=1}^n A_{ij}x_ix_j$$ +$$ +\det(\mathbf{A} - \lambda \mathbf{I}) = 0 +$$ -This identity is valid for any square matrix (need not be symmetric), -although quadratic forms are usually only discussed in the context of -symmetric matrices. +Therefore, the **eigenvalues are the roots of the characteristic polynomial** $p(\lambda)$. 
+::: diff --git a/book/chapter_decompositions/matrix_norms.md b/book/chapter_decompositions/matrix_norms.md index cff10a6..3c28d81 100644 --- a/book/chapter_decompositions/matrix_norms.md +++ b/book/chapter_decompositions/matrix_norms.md @@ -1,44 +1,13 @@ -## Matrix Norms -[]: # -[]: # for metric in metrics: -[]: # clf = FlexibleNearestCentroidClassifier(metric=metric) -[]: # clf.fit(X_train, y_train) -[]: # predictions = clf.predict(X_test) -[]: # print(f"Accuracy with {metric} metric: {accuracy_score(y_test, predictions)}") -[]: # ``` -[]: # -[]: # --- -[]: # -[]: # ### Conclusion -[]: # -[]: # The choice of distance metric can significantly affect the performance of the nearest centroid classifier. By experimenting with different metrics, you can gain insights into how they influence classification boundaries and model performance. -[]: # -[]: # --- -[]: # -[]: # ### Further Reading -[]: # -[]: # - [Understanding Distance Metrics](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.pairwise_distances.html) -[]: # - [Nearest Centroid Classifier in Scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestCentroid.html) -[]: # -[]: # - [Distance Metrics in Machine Learning](https://towardsdatascience.com/distance-metrics-in-machine-learning-1f3b2a0c4d7e) -[]: # -[]: # --- -[]: # -[]: # ### References -[]: # -[]: # - Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. -[]: # - Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning. Springer. -[]: # - Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press. -[]: # -[]: # --- -[]: # -[]: # ### License -[]: # -[]: # This notebook is licensed under the [MIT License](https://opensource.org/licenses/MIT). -[]: # -[]: # --- -[]: # -[]: # ### Acknowledgments -[]: # -[]: # - [Scikit-learn](https://scikit-learn.org/stable/) for providing the machine learning library used in this notebook. -[]: # - [Matplotlib](https://matplotlib.org/) for the visualization tools used in this notebook. \ No newline at end of file +--- +jupytext: + text_representation: + extension: .md + format_name: myst + format_version: 0.13 + jupytext_version: 1.16.7 +kernelspec: + display_name: Python 3 + language: python + name: python3 +--- +# Matrix Norms diff --git a/book/chapter_decompositions/matrix_rank.md b/book/chapter_decompositions/matrix_rank.md new file mode 100644 index 0000000..c7e3f60 --- /dev/null +++ b/book/chapter_decompositions/matrix_rank.md @@ -0,0 +1,115 @@ +--- +jupytext: + text_representation: + extension: .md + format_name: myst + format_version: 0.13 + jupytext_version: 1.16.7 +kernelspec: + display_name: Python 3 + language: python + name: python3 +--- + +# Rank of a Matrix + +Let $\mathbf{A} \in \mathbb{R}^{m \times n}$ be a real matrix. + +The **rank** of $\mathbf{A}$, denoted $\operatorname{rank}(\mathbf{A})$, is defined as: + +$$ +\operatorname{rank}(\mathbf{A}) = \text{the maximum number of linearly independent rows or columns of } \mathbf{A} +$$ + +Equivalently, it's the **dimension of the image** (or column space) of $\mathbf{A}$: + +$$ +\operatorname{rank}(\mathbf{A}) = \dim(\operatorname{Im}(\mathbf{A})) = \dim(\text{Col}(\mathbf{A})) +$$ + +--- + +### ✅ Interpretations + +* **Column Rank**: The number of linearly independent **columns** +* **Row Rank**: The number of linearly independent **rows** + +> For all matrices, the **row rank equals the column rank**, even if $m \neq n$. 
This is a deep result in linear algebra. + +--- + +### ✅ Practical View + +To compute $\operatorname{rank}(\mathbf{A})$ in practice: + +* Reduce $\mathbf{A}$ to **row echelon form** (via Gaussian elimination) +* Count the number of **non-zero rows** + +--- + +### 🧠 Summary + +$$ +\boxed{ +\operatorname{rank}(\mathbf{A}) = \text{dimensionality of the space spanned by the columns (or rows) of } \mathbf{A} +} +$$ + + +```{code-cell} ipython3 +:tags: [hide-input] +import numpy as np +import matplotlib.pyplot as plt + +# Matrices of different ranks +A_full_rank = np.array([[3, 1], + [1, 2]]) + +A_rank_1 = np.array([[3, 6], + [1, 2]]) + +A_rank_0 = np.zeros((2, 2)) + +# Unit circle +theta = np.linspace(0, 2*np.pi, 100) +circle = np.stack((np.cos(theta), np.sin(theta))) + +# Unit square +square = np.array([[0, 1, 1, 0, 0], + [0, 0, 1, 1, 0]]) + +def plot_transformation(ax, A, title): + ax.set_title(title) + ax.axhline(0, color='gray', lw=0.5) + ax.axvline(0, color='gray', lw=0.5) + ax.set_xlim(-5, 5) + ax.set_ylim(-5, 5) + ax.set_aspect('equal') + ax.grid(True) + + # Plot transformed circle + ax.plot(circle[0], circle[1], "y:", label='Circle') + + + # Plot transformed circle + circ_trans = A @ circle + ax.plot(circ_trans[0], circ_trans[1], color='darkorange', label='A ∘ Circle') + + # Plot transformed square + sq_trans = A @ square + ax.plot(square[0], square[1], 'g:', label='Square') + ax.plot(sq_trans[0], sq_trans[1], color='green', label='A ∘ Square') + + ax.legend() + +# Plot +fig, axes = plt.subplots(1, 3, figsize=(18, 6)) + +plot_transformation(axes[0], A_full_rank, "Rank 2: Full Rank (ℝ² → ℝ²)") +plot_transformation(axes[1], A_rank_1, "Rank 1: Collapse to Line") +plot_transformation(axes[2], A_rank_0, "Rank 0: Collapse to Origin") + +plt.suptitle("Geometric Effect of Rank: Vectors, Circle, and Square Transformed", fontsize=16) +plt.tight_layout(rect=[0, 0, 1, 0.93]) +plt.show() +``` \ No newline at end of file diff --git a/book/chapter_decompositions/orthogonal_matrices.md b/book/chapter_decompositions/orthogonal_matrices.md index 9835778..34cb013 100644 --- a/book/chapter_decompositions/orthogonal_matrices.md +++ b/book/chapter_decompositions/orthogonal_matrices.md @@ -1,8 +1,22 @@ -## Orthogonal matrices +--- +jupytext: + text_representation: + extension: .md + format_name: myst + format_version: 0.13 + jupytext_version: 1.16.7 +kernelspec: + display_name: Python 3 + language: python + name: python3 +--- +# Orthogonal matrices A matrix $\mathbf{Q} \in \mathbb{R}^{n \times n}$ is said to be -**orthogonal** if its columns are pairwise orthonormal. This definition -implies that +**orthogonal** if its columns are pairwise orthonormal. + + +This definition implies that $$\mathbf{Q}^{\!\top\!} \mathbf{Q} = \mathbf{Q}\mathbf{Q}^{\!\top\!} = \mathbf{I}$$ @@ -19,300 +33,77 @@ Therefore multiplication by an orthogonal matrix can be considered as a transformation that preserves length, but may rotate or reflect the vector about the origin. -## Symmetric matrices - -A matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$ is said to be -**symmetric** if it is equal to its own transpose -($\mathbf{A} = \mathbf{A}^{\!\top\!}$), meaning that $A_{ij} = A_{ji}$ -for all $(i,j)$. This definition seems harmless enough but turns out to -have some strong implications. We summarize the most important of these -as - -*Theorem.* -(Spectral Theorem) If $\mathbf{A} \in \mathbb{R}^{n \times n}$ is -symmetric, then there exists an orthonormal basis for $\mathbb{R}^n$ -consisting of eigenvectors of $\mathbf{A}$. 
- -The practical application of this theorem is a particular factorization -of symmetric matrices, referred to as the **eigendecomposition** or -**spectral decomposition**. Denote the orthonormal basis of eigenvectors -$\mathbf{q}_1, \dots, \mathbf{q}_n$ and their eigenvalues -$\lambda_1, \dots, \lambda_n$. Let $\mathbf{Q}$ be an orthogonal matrix -with $\mathbf{q}_1, \dots, \mathbf{q}_n$ as its columns, and -$\mathbf{\Lambda} = \operatorname{diag}(\lambda_1, \dots, \lambda_n)$. Since by -definition $\mathbf{A}\mathbf{q}_i = \lambda_i\mathbf{q}_i$ for every -$i$, the following relationship holds: - -$$\mathbf{A}\mathbf{Q} = \mathbf{Q}\mathbf{\Lambda}$$ - -Right-multiplying -by $\mathbf{Q}^{\!\top\!}$, we arrive at the decomposition - -$$\mathbf{A} = \mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^{\!\top\!}$$ - -### Rayleigh quotients - -Let $\mathbf{A} \in \mathbb{R}^{n \times n}$ be a symmetric matrix. The -expression $\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x}$ is called a -**quadratic form**. - -There turns out to be an interesting connection between the quadratic -form of a symmetric matrix and its eigenvalues. This connection is -provided by the **Rayleigh quotient** - -$$R_\mathbf{A}(\mathbf{x}) = \frac{\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x}}{\mathbf{x}^{\!\top\!}\mathbf{x}}$$ - -The Rayleigh quotient has a couple of important properties which the -reader can (and should!) easily verify from the definition: - -(i) **Scale invariance**: for any vector $\mathbf{x} \neq \mathbf{0}$ - and any scalar $\alpha \neq 0$, - $R_\mathbf{A}(\mathbf{x}) = R_\mathbf{A}(\alpha\mathbf{x})$. - -(ii) If $\mathbf{x}$ is an eigenvector of $\mathbf{A}$ with eigenvalue - $\lambda$, then $R_\mathbf{A}(\mathbf{x}) = \lambda$. - -We can further show that the Rayleigh quotient is bounded by the largest -and smallest eigenvalues of $\mathbf{A}$. But first we will show a -useful special case of the final result. - -*Proposition.* -For any $\mathbf{x}$ such that $\|\mathbf{x}\|_2 = 1$, - -$$\lambda_{\min}(\mathbf{A}) \leq \mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} \leq \lambda_{\max}(\mathbf{A})$$ - -with equality if and only if $\mathbf{x}$ is a corresponding -eigenvector. - -*Proof.* We show only the $\max$ case because the argument for the -$\min$ case is entirely analogous. - -Since $\mathbf{A}$ is symmetric, we can decompose it as -$\mathbf{A} = \mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^{\!\top\!}$. Then use -the change of variable $\mathbf{y} = \mathbf{Q}^{\!\top\!}\mathbf{x}$, -noting that the relationship between $\mathbf{x}$ and $\mathbf{y}$ is -one-to-one and that $\|\mathbf{y}\|_2 = 1$ since $\mathbf{Q}$ is -orthogonal. Hence - -$$\max_{\|\mathbf{x}\|_2 = 1} \mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} = \max_{\|\mathbf{y}\|_2 = 1} \mathbf{y}^{\!\top\!}\mathbf{\Lambda}\mathbf{y} = \max_{y_1^2+\dots+y_n^2=1} \sum_{i=1}^n \lambda_i y_i^2$$ - -Written this way, it is clear that $\mathbf{y}$ maximizes this -expression exactly if and only if it satisfies -$\sum_{i \in I} y_i^2 = 1$ where -$I = \{i : \lambda_i = \max_{j=1,\dots,n} \lambda_j = \lambda_{\max}(\mathbf{A})\}$ -and $y_j = 0$ for $j \not\in I$. That is, $I$ contains the index or -indices of the largest eigenvalue. 
In this case, the maximal value of -the expression is - -$$\sum_{i=1}^n \lambda_i y_i^2 = \sum_{i \in I} \lambda_i y_i^2 = \lambda_{\max}(\mathbf{A}) \sum_{i \in I} y_i^2 = \lambda_{\max}(\mathbf{A})$$ - -Then writing $\mathbf{q}_1, \dots, \mathbf{q}_n$ for the columns of -$\mathbf{Q}$, we have - -$$\mathbf{x} = \mathbf{Q}\mathbf{Q}^{\!\top\!}\mathbf{x} = \mathbf{Q}\mathbf{y} = \sum_{i=1}^n y_i\mathbf{q}_i = \sum_{i \in I} y_i\mathbf{q}_i$$ - -where we have used the matrix-vector product identity. - -Recall that $\mathbf{q}_1, \dots, \mathbf{q}_n$ are eigenvectors of -$\mathbf{A}$ and form an orthonormal basis for $\mathbb{R}^n$. Therefore -by construction, the set $\{\mathbf{q}_i : i \in I\}$ forms an -orthonormal basis for the eigenspace of $\lambda_{\max}(\mathbf{A})$. -Hence $\mathbf{x}$, which is a linear combination of these, lies in that -eigenspace and thus is an eigenvector of $\mathbf{A}$ corresponding to -$\lambda_{\max}(\mathbf{A})$. - -We have shown that -$\max_{\|\mathbf{x}\|_2 = 1} \mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} = \lambda_{\max}(\mathbf{A})$, -from which we have the general inequality -$\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} \leq \lambda_{\max}(\mathbf{A})$ -for all unit-length $\mathbf{x}$. ◻ - - -By the scale invariance of the Rayleigh quotient, we immediately have as -a corollary (since -$\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} = R_{\mathbf{A}}(\mathbf{x})$ -for unit $\mathbf{x}$) - -*Theorem.* -(Min-max theorem) For all $\mathbf{x} \neq \mathbf{0}$, - -$$\lambda_{\min}(\mathbf{A}) \leq R_\mathbf{A}(\mathbf{x}) \leq \lambda_{\max}(\mathbf{A})$$ - -with equality if and only if $\mathbf{x}$ is a corresponding -eigenvector. - - -## Positive (semi-)definite matrices - -A symmetric matrix $\mathbf{A}$ is **positive semi-definite** if for all -$\mathbf{x} \in \mathbb{R}^n$, -$\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} \geq 0$. Sometimes people -write $\mathbf{A} \succeq 0$ to indicate that $\mathbf{A}$ is positive -semi-definite. - -A symmetric matrix $\mathbf{A}$ is **positive definite** if for all -nonzero $\mathbf{x} \in \mathbb{R}^n$, -$\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} > 0$. Sometimes people write -$\mathbf{A} \succ 0$ to indicate that $\mathbf{A}$ is positive definite. -Note that positive definiteness is a strictly stronger property than -positive semi-definiteness, in the sense that every positive definite -matrix is positive semi-definite but not vice-versa. - -These properties are related to eigenvalues in the following way. - -*Proposition.* -A symmetric matrix is positive semi-definite if and only if all of its -eigenvalues are nonnegative, and positive definite if and only if all of -its eigenvalues are positive. - - -*Proof.* Suppose $A$ is positive semi-definite, and let $\mathbf{x}$ be -an eigenvector of $\mathbf{A}$ with eigenvalue $\lambda$. Then - -$$0 \leq \mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} = \mathbf{x}^{\!\top\!}(\lambda\mathbf{x}) = \lambda\mathbf{x}^{\!\top\!}\mathbf{x} = \lambda\|\mathbf{x}\|_2^2$$ - -Since $\mathbf{x} \neq \mathbf{0}$ (by the assumption that it is an -eigenvector), we have $\|\mathbf{x}\|_2^2 > 0$, so we can divide both -sides by $\|\mathbf{x}\|_2^2$ to arrive at $\lambda \geq 0$. If -$\mathbf{A}$ is positive definite, the inequality above holds strictly, -so $\lambda > 0$. This proves one direction. - -To simplify the proof of the other direction, we will use the machinery -of Rayleigh quotients. Suppose that $\mathbf{A}$ is symmetric and all -its eigenvalues are nonnegative. 
Then for all -$\mathbf{x} \neq \mathbf{0}$, - -$$0 \leq \lambda_{\min}(\mathbf{A}) \leq R_\mathbf{A}(\mathbf{x})$$ - -Since $\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x}$ matches -$R_\mathbf{A}(\mathbf{x})$ in sign, we conclude that $\mathbf{A}$ is -positive semi-definite. If the eigenvalues of $\mathbf{A}$ are all -strictly positive, then $0 < \lambda_{\min}(\mathbf{A})$, whence it -follows that $\mathbf{A}$ is positive definite. ◻ - - -As an example of how these matrices arise, consider - -*Proposition.* -Suppose $\mathbf{A} \in \mathbb{R}^{m \times n}$. Then -$\mathbf{A}^{\!\top\!}\mathbf{A}$ is positive semi-definite. If -$\operatorname{null}(\mathbf{A}) = \{\mathbf{0}\}$, then -$\mathbf{A}^{\!\top\!}\mathbf{A}$ is positive definite. - - -*Proof.* For any $\mathbf{x} \in \mathbb{R}^n$, - -$$\mathbf{x}^{\!\top\!} (\mathbf{A}^{\!\top\!}\mathbf{A})\mathbf{x} = (\mathbf{A}\mathbf{x})^{\!\top\!}(\mathbf{A}\mathbf{x}) = \|\mathbf{A}\mathbf{x}\|_2^2 \geq 0$$ - -so $\mathbf{A}^{\!\top\!}\mathbf{A}$ is positive semi-definite. - -Note that $\|\mathbf{A}\mathbf{x}\|_2^2 = 0$ implies -$\|\mathbf{A}\mathbf{x}\|_2 = 0$, which in turn implies -$\mathbf{A}\mathbf{x} = \mathbf{0}$ (recall that this is a property of -norms). If $\operatorname{null}(\mathbf{A}) = \{\mathbf{0}\}$, -$\mathbf{A}\mathbf{x} = \mathbf{0}$ implies $\mathbf{x} = \mathbf{0}$, -so -$\mathbf{x}^{\!\top\!} (\mathbf{A}^{\!\top\!}\mathbf{A})\mathbf{x} = 0$ -if and only if $\mathbf{x} = \mathbf{0}$, and thus -$\mathbf{A}^{\!\top\!}\mathbf{A}$ is positive definite. ◻ - -Positive definite matrices are invertible (since their eigenvalues are -nonzero), whereas positive semi-definite matrices might not be. However, -if you already have a positive semi-definite matrix, it is possible to -perturb its diagonal slightly to produce a positive definite matrix. +```{code-cell} ipython3 +:tags: [hide-input] +import numpy as np +import matplotlib.pyplot as plt -*Proposition.* -If $\mathbf{A}$ is positive semi-definite and $\epsilon > 0$, then -$\mathbf{A} + \epsilon\mathbf{I}$ is positive definite. +# Asymmetrical vector set +vectors = np.array([[1, 0.5, -0.5], + [0, 1, 0.5]]) -*Proof.* Assuming $\mathbf{A}$ is positive semi-definite and -$\epsilon > 0$, we have for any $\mathbf{x} \neq \mathbf{0}$ that +# Orthogonal matrices +theta = np.pi / 4 +Q_rot = np.array([[np.cos(theta), -np.sin(theta)], + [np.sin(theta), np.cos(theta)]]) +Q_reflect = np.array([[1, 0], + [0, -1]]) -$$\mathbf{x}^{\!\top\!}(\mathbf{A}+\epsilon\mathbf{I})\mathbf{x} = \mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} + \epsilon\mathbf{x}^{\!\top\!}\mathbf{I}\mathbf{x} = \underbrace{\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x}}_{\geq 0} + \underbrace{\epsilon\|\mathbf{x}\|_2^2}_{> 0} > 0$$ +# Transform vectors +rotated_vectors = Q_rot @ vectors +reflected_vectors = Q_reflect @ vectors -as claimed. ◻ +# Unit square +square = np.array([[0, 1, 1, 0, 0], + [0, 0, 1, 1, 0]]) +square_rotated = Q_rot @ square +square_reflected = Q_reflect @ square -An obvious but frequently useful consequence of the two propositions we -have just shown is that -$\mathbf{A}^{\!\top\!}\mathbf{A} + \epsilon\mathbf{I}$ is positive -definite (and in particular, invertible) for *any* matrix $\mathbf{A}$ -and any $\epsilon > 0$. 
+# Plotting +fig, axes = plt.subplots(1, 3, figsize=(15, 5)) -### The geometry of positive definite quadratic forms +# Function to plot a frame +def plot_frame(ax, vecs, square, title, color): + ax.quiver(np.zeros(vecs.shape[1]), np.zeros(vecs.shape[1]), + vecs[0], vecs[1], angles='xy', scale_units='xy', scale=1, color=color) + ax.plot(square[0], square[1], 'k--', lw=1.5, label='Transformed Unit Square') + ax.fill(square[0], square[1], facecolor='lightgray', alpha=0.3) + ax.set_title(title) + ax.set_xlim(-2, 2) + ax.set_ylim(-2, 2) + ax.set_aspect('equal') + ax.axhline(0, color='gray', lw=0.5) + ax.axvline(0, color='gray', lw=0.5) + ax.grid(True) + ax.legend() -A useful way to understand quadratic forms is by the geometry of their -level sets. A **level set** or **isocontour** of a function is the set -of all inputs such that the function applied to those inputs yields a -given output. Mathematically, the $c$-isocontour of $f$ is -$\{\mathbf{x} \in \operatorname{dom} f : f(\mathbf{x}) = c\}$. +# Original +plot_frame(axes[0], vectors, square, "Original Vectors and Unit Square", 'blue') -Let us consider the special case -$f(\mathbf{x}) = \mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x}$ where -$\mathbf{A}$ is a positive definite matrix. Since $\mathbf{A}$ is -positive definite, it has a unique matrix square root -$\mathbf{A}^{\frac{1}{2}} = \mathbf{Q}\mathbf{\Lambda}^{\frac{1}{2}}\mathbf{Q}^{\!\top\!}$, -where $\mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^{\!\top\!}$ is the -eigendecomposition of $\mathbf{A}$ and -$\mathbf{\Lambda}^{\frac{1}{2}} = \operatorname{diag}(\sqrt{\lambda_1}, \dots \sqrt{\lambda_n})$. -It is easy to see that this matrix $\mathbf{A}^{\frac{1}{2}}$ is -positive definite (consider its eigenvalues) and satisfies -$\mathbf{A}^{\frac{1}{2}}\mathbf{A}^{\frac{1}{2}} = \mathbf{A}$. Fixing -a value $c \geq 0$, the $c$-isocontour of $f$ is the set of -$\mathbf{x} \in \mathbb{R}^n$ such that +# Rotation +plot_frame(axes[1], rotated_vectors, square_rotated, "Rotation (Orthogonal Q)", 'green') -$$c = \mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} = \mathbf{x}^{\!\top\!}\mathbf{A}^{\frac{1}{2}}\mathbf{A}^{\frac{1}{2}}\mathbf{x} = \|\mathbf{A}^{\frac{1}{2}}\mathbf{x}\|_2^2$$ +# Reflection +plot_frame(axes[2], reflected_vectors, square_reflected, "Reflection (Orthogonal Q)", 'red') -where we have used the symmetry of $\mathbf{A}^{\frac{1}{2}}$. Making -the change of variable -$\mathbf{z} = \mathbf{A}^{\frac{1}{2}}\mathbf{x}$, we have the condition -$\|\mathbf{z}\|_2 = \sqrt{c}$. That is, the values $\mathbf{z}$ lie on a -sphere of radius $\sqrt{c}$. These can be parameterized as -$\mathbf{z} = \sqrt{c}\hat{\mathbf{z}}$ where $\hat{\mathbf{z}}$ has -$\|\hat{\mathbf{z}}\|_2 = 1$. Then since -$\mathbf{A}^{-\frac{1}{2}} = \mathbf{Q}\mathbf{\Lambda}^{-\frac{1}{2}}\mathbf{Q}^{\!\top\!}$, -we have +plt.suptitle("Orthogonal Transformations: Vectors and Unit Square", fontsize=16) +plt.tight_layout(rect=[0, 0, 1, 0.93]) +plt.show() +``` -$$\mathbf{x} = \mathbf{A}^{-\frac{1}{2}}\mathbf{z} = \mathbf{Q}\mathbf{\Lambda}^{-\frac{1}{2}}\mathbf{Q}^{\!\top\!}\sqrt{c}\hat{\mathbf{z}} = \sqrt{c}\mathbf{Q}\mathbf{\Lambda}^{-\frac{1}{2}}\tilde{\mathbf{z}}$$ +This enhanced visualization shows how **orthogonal transformations** affect both: -where $\tilde{\mathbf{z}} = \mathbf{Q}^{\!\top\!}\hat{\mathbf{z}}$ also -satisfies $\|\tilde{\mathbf{z}}\|_2 = 1$ since $\mathbf{Q}$ is -orthogonal. 
Using this parameterization, we see that the solution set -$\{\mathbf{x} \in \mathbb{R}^n : f(\mathbf{x}) = c\}$ is the image of -the unit sphere -$\{\tilde{\mathbf{z}} \in \mathbb{R}^n : \|\tilde{\mathbf{z}}\|_2 = 1\}$ -under the invertible linear map -$\mathbf{x} = \sqrt{c}\mathbf{Q}\mathbf{\Lambda}^{-\frac{1}{2}}\tilde{\mathbf{z}}$. +* A set of **asymmetric vectors**, and -What we have gained with all these manipulations is a clear algebraic -understanding of the $c$-isocontour of $f$ in terms of a sequence of -linear transformations applied to a well-understood set. We begin with -the unit sphere, then scale every axis $i$ by -$\lambda_i^{-\frac{1}{2}}$, resulting in an axis-aligned ellipsoid. -Observe that the axis lengths of the ellipsoid are proportional to the -inverse square roots of the eigenvalues of $\mathbf{A}$. Hence larger -eigenvalues correspond to shorter axis lengths, and vice-versa. +* The **unit square**, which is preserved in shape and size but transformed in orientation: -Then this axis-aligned ellipsoid undergoes a rigid transformation (i.e. -one that preserves length and angles, such as a rotation/reflection) -given by $\mathbf{Q}$. The result of this transformation is that the -axes of the ellipse are no longer along the coordinate axes in general, -but rather along the directions given by the corresponding eigenvectors. -To see this, consider the unit vector $\mathbf{e}_i \in \mathbb{R}^n$ -that has $[\mathbf{e}_i]_j = \delta_{ij}$. In the pre-transformed space, -this vector points along the axis with length proportional to -$\lambda_i^{-\frac{1}{2}}$. But after applying the rigid transformation -$\mathbf{Q}$, the resulting vector points in the direction of the -corresponding eigenvector $\mathbf{q}_i$, since +* **Left**: The original setup with vectors and the unit square. -$$\mathbf{Q}\mathbf{e}_i = \sum_{j=1}^n [\mathbf{e}_i]_j\mathbf{q}_j = \mathbf{q}_i$$ +* **Middle**: A **rotation** — vectors and the square are rotated without distortion. -where we have used the matrix-vector product identity from earlier. +* **Right**: A **reflection** — vectors and the square are flipped, but all lengths and angles remain unchanged. -In summary: the isocontours of -$f(\mathbf{x}) = \mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x}$ are -ellipsoids such that the axes point in the directions of the -eigenvectors of $\mathbf{A}$, and the radii of these axes are -proportional to the inverse square roots of the corresponding -eigenvalues. +✅ This highlights that orthogonal matrices are **distance- and angle-preserving**, making them key to rigid transformations like rotations and reflections. +Would you like to include a numerical check that verifies length and angle invariance? diff --git a/book/chapter_decompositions/pseudoinverse.md b/book/chapter_decompositions/pseudoinverse.md index af58d9b..1b016db 100644 --- a/book/chapter_decompositions/pseudoinverse.md +++ b/book/chapter_decompositions/pseudoinverse.md @@ -1,23 +1,37 @@ -## Moore-Penrose Pseudoinverse -The Moore-Penrose pseudoinverse is a generalization of the matrix inverse that can be applied to non-square or singular matrices. It is denoted as \( A^+ \) for a matrix \( A \). The pseudoinverse satisfies the following properties: -1. **Existence**: The pseudoinverse exists for any matrix \( A \). 
+--- +jupytext: + text_representation: + extension: .md + format_name: myst + format_version: 0.13 + jupytext_version: 1.16.7 +kernelspec: + display_name: Python 3 + language: python + name: python3 +--- +# Moore-Penrose Pseudoinverse +The Moore-Penrose pseudoinverse is a generalization of the matrix inverse that can be applied to non-square or singular matrices. It is denoted as $ A^+ $ for a matrix $ A $. The pseudoinverse satisfies the following properties: +1. **Existence**: The pseudoinverse exists for any matrix $ A $. 2. **Uniqueness**: The pseudoinverse is unique. 3. **Properties**: - - \( A A^+ A = A \) - - \( A^+ A A^+ = A^+ \) - - \( (A A^+)^\top = A A^+ \) - - \( (A^+ A)^\top = A^+ A \) -4. **Rank**: The rank of \( A^+ \) is equal to the rank of \( A \). -5. **Singular Value Decomposition (SVD)**: The pseudoinverse can be computed using the singular value decomposition of \( A \). If \( A = U \Sigma V^\top \), where \( U \) and \( V \) are orthogonal matrices and \( \Sigma \) is a diagonal matrix with singular values, then: - \[ + - $ A A^+ A = A $ + - $ A^+ A A^+ = A^+ $ + - $ (A A^+)^\top = A A^+ $ + - $ (A^+ A)^\top = A^+ A $ +4. **Rank**: The rank of $ A^+ $ is equal to the rank of $ A $. +5. **Singular Value Decomposition (SVD)**: The pseudoinverse can be computed using the singular value decomposition of $ A $. If $ A = U \Sigma V^\top $, where $ U $ and $ V $ are orthogonal matrices and $ \Sigma $ is a diagonal matrix with singular values, then: + + $$ A^+ = V \Sigma^+ U^\top - \] - where \( \Sigma^+ \) is obtained by taking the reciprocal of the non-zero singular values in \( \Sigma \) and transposing the resulting matrix. + $$ + where $ \Sigma^+ $ is obtained by taking the reciprocal of the non-zero singular values in $ \Sigma $ and transposing the resulting matrix. 6. **Applications**: The pseudoinverse is used in various applications, including solving linear systems, least squares problems, and in machine learning algorithms such as linear regression. -7. **Least Squares Solution**: The pseudoinverse provides a least squares solution to the equation \( Ax = b \) when \( A \) is not square or has no unique solution. The least squares solution is given by: - \[ +7. **Least Squares Solution**: The pseudoinverse provides a least squares solution to the equation $ Ax = b $ when $ A $ is not square or has no unique solution. The least squares solution is given by: + + $$ x = A^+ b - \] -8. **Geometric Interpretation**: The pseudoinverse can be interpreted geometrically as the projection of a vector onto the column space of \( A \). + $$ +8. **Geometric Interpretation**: The pseudoinverse can be interpreted geometrically as the projection of a vector onto the column space of $ A $. 9. **Computational Considerations**: The computation of the pseudoinverse can be done efficiently using numerical methods, such as the SVD, especially for large matrices. 10. **Limitations**: The pseudoinverse may not be suitable for all applications, especially when the matrix is ill-conditioned or has a high condition number. 
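Properties 5 and 7 above are easy to check numerically. The following is a small illustrative sketch, not part of the original text: the matrix `A`, the vector `b`, and the variable names are chosen only for this example, and the sketch assumes `A` has full column rank so that every singular value is nonzero.

```{code-cell} ipython3
import numpy as np

# A tall matrix with full column rank: no ordinary inverse exists
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
b = np.array([1.0, 2.0, 2.5])

# Build A^+ from the SVD: A = U Sigma V^T  =>  A^+ = V Sigma^+ U^T
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Sigma_plus = np.diag(1.0 / s)            # reciprocals of the nonzero singular values
A_plus = Vt.T @ Sigma_plus @ U.T

# Agrees with NumPy's built-in pseudoinverse
print(np.allclose(A_plus, np.linalg.pinv(A)))

# Defining properties: A A^+ A = A and A^+ A A^+ = A^+
print(np.allclose(A @ A_plus @ A, A), np.allclose(A_plus @ A @ A_plus, A_plus))

# Least-squares solution of A x = b
x = A_plus @ b
print("x:", x)
```

For rank-deficient matrices the same recipe applies, except that only the nonzero singular values are inverted; `np.linalg.pinv` handles this with a tolerance on the singular values.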
diff --git a/book/chapter_decompositions/row_equivalence.md b/book/chapter_decompositions/row_equivalence.md new file mode 100644 index 0000000..2823d90 --- /dev/null +++ b/book/chapter_decompositions/row_equivalence.md @@ -0,0 +1,1047 @@ +--- +jupytext: + text_representation: + extension: .md + format_name: myst + format_version: 0.13 + jupytext_version: 1.16.7 +kernelspec: + display_name: Python 3 + language: python + name: python3 +--- +# Gaussian Elimination and the PLU Decomposition + +Gaussian elimination is one of the most fundamental algorithms in linear algebra. +It provides a systematic procedure for simplifying matrices using elementary row operations and lies at the heart of solving linear systems, computing inverses, determining rank, and understanding matrix structure. + +This section introduces the core concepts and forms related to Gaussian elimination: + +* **Row Echelon Form (REF)**: A simplified form of a matrix that resembles an upper-triangular structure. REF is sufficient for solving linear systems using **back substitution**. +* **Reduced Row Echelon Form (RREF)**: A further simplified and canonical form where each pivot is 1 and the only nonzero entry in its column. RREF enables direct reading of solutions to linear systems. +* **Row Equivalence**: The idea that matrices related through row operations preserve important properties such as solvability and rank. +* **Gaussian Elimination**: The algorithm used to transform matrices into REF, using a sequence of elementary row operations. + +Throughout this section, we will define these forms, illustrate them with examples, and demonstrate how they relate to one another and to solving equations of the form $\mathbf{A}\mathbf{x} = \mathbf{b}$. +We will also discuss when a matrix is invertible based on its row-reduced form, and how to use back substitution after performing Gaussian elimination. + +This foundation is essential for understanding many areas of applied linear algebra, from numerical methods to machine learning. + + +--- + +## Elementary Row Operations + +One of the most important facts underlying **Gaussian elimination** is that the following **elementary row operations** do not change the solution set of a linear system. +That is, if we apply these operations to both the matrix $\mathbf{A}$ and the right-hand side $\mathbf{b}$ in the system $\mathbf{A}\mathbf{x} = \mathbf{b}$, we obtain an **equivalent system** with the **same solutions**. + +1. **Swap** two rows + $R_i \leftrightarrow R_j$ + +2. **Scale** a row by a nonzero scalar + $R_i \to \alpha R_i$, $\alpha \neq 0$ + +3. **Add a multiple** of one row to another + $R_i \to R_i + \beta R_j$ + +--- + +:::{prf:theorem} Elementary Row Operations Preserve Solution Sets +:label: trm-elementary-rop-operations + +Let $\mathbf{A} \mathbf{x} = \mathbf{b}$ be a system of linear equations. + +If we apply a finite sequence of **elementary row operations** to both $\mathbf{A}$ and $\mathbf{b}$, the resulting system has the **same solution set** as the original. 
+ +That is, $\mathbf{A} \sim \mathbf{A}'$ implies: + +$$ +\mathbf{A} \mathbf{x} = \mathbf{b} \iff \mathbf{A}' \mathbf{x} = \mathbf{b}' +$$ + +::: + +:::{prf:proof} + +Each elementary row operation corresponds to **left-multiplication** of both sides of the equation by an **invertible matrix** $\mathbf{C}$: + +$$ +\mathbf{A} \mathbf{x} = \mathbf{b} \iff \mathbf{C}\mathbf{A} \mathbf{x} = \mathbf{C}\mathbf{b} +$$ + +* Swapping rows ↔ permutation matrix +* Scaling a row ↔ diagonal matrix +* Adding a multiple of one row to another ↔ elementary row matrix + +Since invertible matrices preserve linear equivalence, applying these operations preserves the solution set. +::: + +## Row Echelon Form (REF) (🇩🇪 **Zeilen-Stufenform**) + +A matrix is said to be in **row echelon form (REF)** if it satisfies the following conditions: + +1. **All zero rows** (if any) appear **at the bottom** of the matrix. +2. The **leading entry** (or pivot) of each nonzero row is strictly to the **right** of the leading entry of the row above it. +3. All entries **below** a pivot are **zero**. + +--- + +The following is a matrix in row echelon form: + +$$ +\begin{bmatrix} +1 & 2 & 3 \\ +0 & 1 & 4 \\ +0 & 0 & 5 +\end{bmatrix} +$$ + +But this is **not** in row echelon form (pivot in row 3 is not to the right of the pivot in row 2): + +$$ +\begin{bmatrix} +1 & 2 & 3 \\ +0 & 0 & 1 \\ +0 & 1 & 0 +\end{bmatrix} +$$ + +A matrix in **REF** is a "staircase" matrix with each nonzero row starting further to the right, and all entries below each pivot are zero: + +$$ +\boxed{ +\text{REF = Upper triangular-like form from which back substitution is possible} +} +$$ + + +## Reduced Row Echelon Form (RREF) + +A matrix is in **reduced row echelon form (RREF)** if it satisfies **all the conditions of row echelon form (REF)**, *plus* two additional conditions: + +--- + +### Conditions for RREF + +1. **REF conditions**: + + * All nonzero rows are above any all-zero rows. + * Each leading (nonzero) entry of a row is strictly to the right of the leading entry of the row above it. + * All entries below a pivot are zero. + +2. **Additional conditions**: + + * Each **pivot is equal to 1** (i.e. all leading entries are 1). + * Each pivot is the **only nonzero entry in its column**. + +--- + +$$ +\begin{bmatrix} +1 & 0 & 2 \\ +0 & 1 & -3 \\ +0 & 0 & 0 +\end{bmatrix} +$$ + +This is in **RREF** because: + +* Each pivot is 1. +* Each pivot column contains only one nonzero entry (the pivot itself). +* Pivots step to the right as you go down the rows. +* Zero row is at the bottom. + +--- + +$$ +\begin{bmatrix} +2 & 4 & 6 \\ +0 & 3 & 9 \\ +0 & 0 & 1 +\end{bmatrix} +$$ + +This is in **REF**, but not RREF: + +* It satisfies the "staircase" structure. +* But pivots are not 1. +* There are other nonzero entries in pivot columns. + +--- + +## REF vs. RREF: Key Differences + +| Feature | REF | RREF | +| -------------------------------- | ---------------- | ---- | +| Zero rows at bottom | ✅ | ✅ | +| Pivots step to the right | ✅ | ✅ | +| Zeros below pivots | ✅ | ✅ | +| Pivots are 1 | ❌ (not required) | ✅ | +| Zeros **above and below** pivots | ❌ | ✅ | + +--- + +## Gaussian Elimination + +**Gaussian elimination** is a method for solving systems of linear equations by systematically transforming the coefficient matrix into **row echelon form** (REF) using the **elementary row operations** defined above. 
+ +It is one of the fundamental algorithms in linear algebra and underlies techniques such as solving $\mathbf{A}\mathbf{x} = \mathbf{b}$, computing the rank, and inverting matrices. +If we track the elementary operations of Gaussian Elimination, we obtain the PLU decomposition of a $\mathbf{A}$. + +--- + +### Goal + +Transform a matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ into an **upper triangular matrix** (or REF), such that all entries below the pivot positions (leading entries in each row) are zero. + +--- + +### Steps of Gaussian Elimination + +1. **Identify the leftmost column** that contains a nonzero entry (the pivot column). +2. **Swap rows** (if necessary) so that the pivot entry is at the top of the current submatrix. +3. **Normalize** the pivot row so the pivot equals 1 (optional — standard Gaussian elimination doesn’t require this). +4. **Eliminate below the pivot**: Subtract suitable multiples of the pivot row from rows below to make all entries in the pivot column below the pivot zero. +5. **Move to the submatrix** that lies below and to the right, and repeat until the entire matrix is in row echelon form. + + +--- + + +### Example: Gaussian Elimination by Hand + +Start with: + +$$ +\mathbf{A} = +\begin{bmatrix} +2 & 4 & -2 \\ +4 & 9 & -3 \\ +-2 & -1 & 7 +\end{bmatrix} +$$ + +--- + +**Step 1: Normalize the first pivot** + +Pivot is $A_{11} = 2$. We'll eliminate below it. + +Eliminate row 2: + +$$ +R_2 \leftarrow R_2 - 2 \cdot R_1 +$$ + +$$ +\begin{bmatrix} +2 & 4 & -2 \\ +0 & 1 & 1 \\ +-2 & -1 & 7 +\end{bmatrix} +$$ + +Eliminate row 3: + +$$ +R_3 \leftarrow R_3 + R_1 +$$ + +$$ +\begin{bmatrix} +2 & 4 & -2 \\ +0 & 1 & 1 \\ +0 & 3 & 5 +\end{bmatrix} +$$ + +--- + +**Step 2: Eliminate below second pivot** + +Pivot at $A_{22} = 1$. Eliminate below. + +$$ +R_3 \leftarrow R_3 - 3 \cdot R_2 +$$ + +$$ +\begin{bmatrix} +2 & 4 & -2 \\ +0 & 1 & 1 \\ +0 & 0 & 2 +\end{bmatrix} +$$ + +--- + +**Step 3: Normalize all pivots (optional, for RREF)** + +We can divide each pivot row to make the pivots equal to 1: + +$$ +R_1 \leftarrow \frac{1}{2} R_1, \quad +R_3 \leftarrow \frac{1}{2} R_3 +$$ + +$$ +\begin{bmatrix} +1 & 2 & -1 \\ +0 & 1 & 1 \\ +0 & 0 & 1 +\end{bmatrix} +$$ + +--- + +Final Result: Row Echelon Form (REF) + +$$ +\boxed{ +\begin{bmatrix} +1 & 2 & -1 \\ +0 & 1 & 1 \\ +0 & 0 & 1 +\end{bmatrix} +} +$$ + +This is the **upper triangular matrix** resulting from Gaussian elimination. + +## Solving Linear Systems via Gaussian Elimination + +To solve a linear system $\mathbf{A}\mathbf{x} = \mathbf{b}$ using **Gaussian elimination followed by back substitution**, it's not enough to row-reduce $\mathbf{A}$ alone — we must also apply the **same row operations** to the right-hand side vector $\mathbf{b}$. This gives a consistent system that we can solve in the transformed space. + +Gaussian elimination turns the system: + +$$ +\mathbf{A} \mathbf{x} = \mathbf{b} +$$ + +into an equivalent upper-triangular system: + +$$ +\mathbf{U} \mathbf{x} = \mathbf{c} +$$ + +where: + +* $\mathbf{U}$ is the row-reduced form of $\mathbf{A}$ (usually REF) +* $\mathbf{c}$ is the result of applying the **same row operations** to $\mathbf{b}$ + +Only with this consistent transformation can back substitution be applied. + +### Example Linear System + +Let’s extend the example to **solve a system** $\mathbf{A} \mathbf{x} = \mathbf{b}$ using **Gaussian elimination + back substitution**. 
+ +Given: + +$$ +\mathbf{A} = +\begin{bmatrix} +2 & 4 & -2 \\ +4 & 9 & -3 \\ +-2 & -1 & 7 +\end{bmatrix}, \quad +\mathbf{b} = +\begin{bmatrix} +2 \\ +8 \\ +10 +\end{bmatrix} +$$ + +We want to solve: + +$$ +\mathbf{A} \mathbf{x} = \mathbf{b} +$$ + +--- + +**Step 1: Augmented Matrix** + +Form the augmented matrix $[\mathbf{A} \mid \mathbf{b}]$: + +$$ +\begin{bmatrix} +2 & 4 & -2 & \big| & 2 \\ +4 & 9 & -3 & \big| & 8 \\ +-2 & -1 & 7 & \big| & 10 +\end{bmatrix} +$$ + +--- + +**Step 2: Apply Gaussian Elimination** + +**Eliminate below pivot (row 1)** + +* $R_2 \leftarrow R_2 - 2 \cdot R_1$ +* $R_3 \leftarrow R_3 + R_1$ + +$$ +\begin{bmatrix} +2 & 4 & -2 & \big| & 2 \\ +0 & 1 & 1 & \big| & 4 \\ +0 & 3 & 5 & \big| & 12 +\end{bmatrix} +$$ + +**Eliminate below pivot (row 2)** + +* $R_3 \leftarrow R_3 - 3 \cdot R_2$ + +$$ +\begin{bmatrix} +2 & 4 & -2 & \big| & 2 \\ +0 & 1 & 1 & \big| & 4 \\ +0 & 0 & 2 & \big| & 0 +\end{bmatrix} +$$ + +**Normalize pivots (optional)** + +* $R_1 \leftarrow \frac{1}{2} R_1$ +* $R_3 \leftarrow \frac{1}{2} R_3$ + +$$ +\boxed{ +\begin{bmatrix} +1 & 2 & -1 & \big| & 1 \\ +0 & 1 & 1 & \big| & 4 \\ +0 & 0 & 1 & \big| & 0 +\end{bmatrix} +} +$$ + +This is the system in **row echelon form**. + +--- + +**Step 3: Back Substitution** + +Let the system be: + +$$ +\begin{aligned} +x_1 + 2x_2 - x_3 &= 1 \quad \text{(Row 1)} \\ +\quad\;\;\;\; x_2 + x_3 &= 4 \quad \text{(Row 2)} \\ +\quad\quad\quad\quad\; x_3 &= 0 \quad \text{(Row 3)} +\end{aligned} +$$ + +Back-substitute from bottom to top: + +1. $x_3 = 0$ +2. $x_2 + x_3 = 4 \Rightarrow x_2 = 4$ +3. $x_1 + 2x_2 - x_3 = 1 \Rightarrow x_1 + 8 = 1 \Rightarrow x_1 = -7$ + +--- + +Final Solution + +$$ +\boxed{ +\mathbf{x} = +\begin{bmatrix} +-7 \\ +4 \\ +0 +\end{bmatrix} +} +$$ + + + +--- + +## Interpretation of Solving Systems by Gaussian Elimination + +Think of it as row-reducing the **augmented matrix**: + +$$ +[\mathbf{A} \mid \mathbf{b}] \quad \longrightarrow \quad [\mathbf{U} \mid \mathbf{c}] +$$ + +You solve the simplified system $\mathbf{U} \mathbf{x} = \mathbf{c}$, not the original one. + +--- + +$$ +\boxed{ +\text{Gaussian elimination modifies both } \mathbf{A} \text{ and } \mathbf{b} \text{ together.} +} +$$ + + +Then you can solve $\mathbf{A} \mathbf{x} = \mathbf{b}$ via **back substitution**. + +--- + +## Back Substitution + +Once a matrix has been transformed into **row echelon form (REF)** using Gaussian elimination, we can solve a system of equations $\mathbf{A}\mathbf{x} = \mathbf{b}$ using **back substitution**. + +This method proceeds from the bottom row of the triangular system upward, solving for each variable one at a time. + +--- + +### General Idea + +Suppose, after Gaussian elimination, we have the augmented system: + +$$ +\begin{aligned} +x_3 &= c_3 \\ +x_2 + a_{23}x_3 &= c_2 \\ +x_1 + a_{12}x_2 + a_{13}x_3 &= c_1 +\end{aligned} +$$ + +We can compute the solution in reverse order: + +1. Solve for $x_3$ from the last equation. +2. Plug $x_3$ into the second equation and solve for $x_2$. +3. Plug $x_2$ and $x_3$ into the first equation to solve for $x_1$. 
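This bottom-up sweep can also be written as a short loop. The sketch below is illustrative rather than part of the original text: the helper name `back_substitution` is made up here, and it assumes an upper-triangular matrix with nonzero diagonal entries. It is applied to the triangular system of the worked example that follows.

```{code-cell} ipython3
import numpy as np

def back_substitution(U, c):
    """Solve U x = c for upper-triangular U with nonzero diagonal, from the last row up."""
    n = U.shape[0]
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        # subtract the already-known part of the row, then divide by the pivot
        x[i] = (c[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]
    return x

# Triangular system used in the example below
U_demo = np.array([[1.0, 2.0, -1.0],
                   [0.0, 1.0,  3.0],
                   [0.0, 0.0,  2.0]])
c_demo = np.array([5.0, 4.0, 6.0])
print(back_substitution(U_demo, c_demo))   # expected: [18. -5.  3.]
```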
+ +--- + +### Back Substitution Example + +Let’s solve: + +$$ +\begin{bmatrix} +1 & 2 & -1 \\ +0 & 1 & 3 \\ +0 & 0 & 2 +\end{bmatrix} +\begin{bmatrix} +x_1 \\ x_2 \\ x_3 +\end{bmatrix} += +\begin{bmatrix} +5 \\ 4 \\ 6 +\end{bmatrix} +$$ + +We solve from the bottom up: + +* $x_3 = \frac{6}{2} = 3$ +* $x_2 + 3 \cdot 3 = 4 \Rightarrow x_2 = 4 - 9 = -5$ +* $x_1 + 2(-5) - 3 = 5 \Rightarrow x_1 = 5 + 10 + 3 = 18$ + +So the solution is: + +$$ +\boxed{ +\mathbf{x} = +\begin{bmatrix} +18 \\ +-5 \\ +3 +\end{bmatrix} +} +$$ + + +--- + +## Pivot Columns and Free Variables + +When we reduce a matrix to **row echelon form (REF)** or **reduced row echelon form (RREF)**, the position of **pivots** in the matrix gives us direct insight into the structure of the solution set of the system $\mathbf{A}\mathbf{x} = \mathbf{b}$. + +--- + +### ✅ Pivot Columns and Basic Variables + +* A **pivot** is the first nonzero entry in a row of REF or RREF. +* The **columns** of the original matrix $\mathbf{A}$ that contain pivots are called **pivot columns**. +* The variables corresponding to pivot columns are called **basic variables**. + + * These are the variables you solve for directly using back substitution. + +--- + +### 🆓 Non-Pivot Columns and Free Variables + +* The **columns that do not contain a pivot** are called **free columns**. +* The variables corresponding to these columns are called **free variables**. + + * They can take on arbitrary values. + * The values of basic variables depend on the free variables. + +--- + +### 🧠 Solution Structure + +The presence or absence of pivot positions determines the nature of the solution: + +| Situation | Interpretation | +| --------------------------------------------------------------------------------- | ---------------------------------------------- | +| Pivot in every column of $\mathbf{A}$ | **Unique solution** (if consistent) | +| Some columns with no pivot | **Infinitely many solutions** (free variables) | +| Inconsistent system (e.g., row of zeros in $\mathbf{A}$, nonzero in $\mathbf{b}$) | **No solution** | + +--- + +### 🔢 Example + +Suppose we reduce the augmented matrix to RREF: + +$$ +\left[ +\begin{array}{ccc|c} +1 & 0 & 2 & 3 \\ +0 & 1 & -1 & 1 \\ +0 & 0 & 0 & 0 +\end{array} +\right] +$$ + +* Pivots in columns 1 and 2 → $x_1$ and $x_2$ are **basic** +* No pivot in column 3 → $x_3$ is a **free variable** +* The solution has the form: + + $$ + \begin{aligned} + x_1 &= 3 - 2x_3 \\ + x_2 &= 1 + x_3 \\ + x_3 &\text{ is free} + \end{aligned} + $$ + +This system has **infinitely many solutions**, parameterized by $x_3$. + +--- + +### 🧩 Summary + +$$ +\boxed{ +\text{Pivot columns } \leftrightarrow \text{ basic variables}, \quad +\text{Non-pivot columns } \leftrightarrow \text{ free variables} +} +$$ + +Understanding this structure helps us: + +* Determine how many solutions exist +* Express the solution set explicitly +* Identify the degrees of freedom in underdetermined systems + +--- + + +## Row-Equivalent Matrices + +Two matrices $\mathbf{A}, \mathbf{B} \in \mathbb{R}^{m \times n}$ are called **row-equivalent** if one can be transformed into the other using a finite sequence of the **elementary row operations** defined above. + +--- + +### Notation + +We write: + +$$ +\mathbf{A} \sim \mathbf{B} +$$ + +to denote that $\mathbf{A}$ is **row-equivalent** to $\mathbf{B}$. + +--- + +### Intuition + +* Row-equivalence preserves the **solution set** of the linear system $\mathbf{A} \mathbf{x} = \mathbf{b}$. 
+* It **does not** change the **row space**, and hence **preserves the rank**. +* A matrix is row-equivalent to the **identity matrix** $\mathbf{I}$ if and only if it is **invertible** (for square matrices). + +--- + +### Summary + +$$ +\boxed{ +\mathbf{A} \sim \mathbf{B} \iff \text{you can get from one to the other by row operations} +} +$$ + +### Example +Here are the step-by-step row operations showing that the matrix + +$$ +\mathbf{A} = \begin{bmatrix} +2 & 1 & -1 \\ +-3 & -1 & 2 \\ +-2 & 1 & 2 +\end{bmatrix} +$$ + +is **row-equivalent** to the identity matrix $\mathbf{I}$. +This confirms that $\mathbf{A}$ is **invertible**. + +Here are the steps of the row reduction process rendered with corresponding **elementary row operations** and resulting matrices: + +--- + +**Step 0**: Start with matrix $\mathbf{A}$ + +$$ +\mathbf{A} = +\begin{bmatrix} +2 & 1 & -1 \\ +-3 & -1 & 2 \\ +-2 & 1 & 2 +\end{bmatrix} +$$ + +--- + +**Step 1**: Normalize row 1 + +$R_1 \leftarrow \frac{1}{2} R_1$ + +$$ +\begin{bmatrix} +1 & 0.5 & -0.5 \\ +-3 & -1 & 2 \\ +-2 & 1 & 2 +\end{bmatrix} +$$ + +--- + +**Step 2**: Eliminate entries below pivot in column 1 + +$R_2 \leftarrow R_2 + 3 R_1$ +$R_3 \leftarrow R_3 + 2 R_1$ + +$$ +\begin{bmatrix} +1 & 0.5 & -0.5 \\ +0 & 0.5 & 0.5 \\ +0 & 2 & 1 +\end{bmatrix} +$$ + +--- + +**Step 3**: Normalize row 2 + +$R_2 \leftarrow \frac{1}{0.5} R_2 = 2 R_2$ + +$$ +\begin{bmatrix} +1 & 0.5 & -0.5 \\ +0 & 1 & 1 \\ +0 & 2 & 1 +\end{bmatrix} +$$ + +--- + +**Step 4**: Eliminate below pivot in column 2 + +$R_3 \leftarrow R_3 - 2 R_2$ + +$$ +\begin{bmatrix} +1 & 0.5 & -0.5 \\ +0 & 1 & 1 \\ +0 & 0 & -1 +\end{bmatrix} +$$ + +--- + +**Step 5**: Normalize row 3 + +$R_3 \leftarrow -1 \cdot R_3$ + +$$ +\begin{bmatrix} +1 & 0.5 & -0.5 \\ +0 & 1 & 1 \\ +0 & 0 & 1 +\end{bmatrix} +$$ + +--- + +We have row-reduced $\mathbf{A}$ to **row echelon form**, which is the identity matrix after further elimination above the pivots (not shown). + +Hence: + +$$ +\mathbf{A} \sim \mathbf{I} \quad \Rightarrow \quad \textbf{A is invertible.} +$$ + +--- + +## Matrix Inversion via Gaussian Elimination + +Gaussian elimination can not only be used to solve systems $\mathbf{A}\mathbf{x} = \mathbf{b}$, but also to **compute the inverse** of a square matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$, **if it exists**. + +This is done by augmenting $\mathbf{A}$ with the identity matrix $\mathbf{I}$, and applying row operations to reduce $\mathbf{A}$ to $\mathbf{I}$. + +The operations that convert $\mathbf{A}$ into $\mathbf{I}$ will simultaneously convert $\mathbf{I}$ into $\mathbf{A}^{-1}$. +This approach is a **constructive proof** of invertibility. + +--- + +### Procedure + +1. Form the augmented matrix $[\mathbf{A} \mid \mathbf{I}]$. +2. Apply **Gaussian elimination** to row-reduce the left side to the identity matrix. +3. If successful, the right side will become $\mathbf{A}^{-1}$. +4. If the left side cannot be reduced to identity (e.g. a zero row appears), $\mathbf{A}$ is **not invertible**. + +--- + +$$ +\boxed{ +[\mathbf{A} \mid \mathbf{I}] \longrightarrow [\mathbf{I} \mid \mathbf{A}^{-1}] \quad \text{via Gaussian elimination}. 
+} +$$ + +--- + +## PLU Decomposition + +Every square matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$ has a **PLU decomposition** (or **LU decomposition with partial pivoting**): + +$$ +\boxed{ +\mathbf{A} = \mathbf{P} \mathbf{L} \mathbf{U} +} +$$ + +* $\mathbf{P}^\top$ is a **permutation matrix** (a rearrangement of the identity matrix) that tracks the row swaps +* $\mathbf{L}$ is **lower triangular** with unit diagonal (contains the elimination multipliers). +* $\mathbf{U}$ is **upper triangular** (result of Gaussian elimination). + +As $\mathbf{P}$ is a permuation matrix $\mathbf{P}^{-1} = \mathbf{P}^{\top}$, we can alternatively write + +$$ +\mathbf{P}^\top \mathbf{A} = \mathbf{L} \mathbf{U} +$$ + +The PLU decomposition +* always exists for any square matrix. +* is used in stable numerical solvers. +* is efficient for solving systems and computing inverses. + +--- + +### Example PLU Decomposition + +Let + +$$ +\mathbf{A} = +\begin{bmatrix} +0 & 2 \\ +1 & 4 +\end{bmatrix} +$$ + +To eliminate below the pivot, we need to **swap rows**, since $A_{11} = 0$. +The permutation matrix is: + +$$ +\mathbf{P} = +\begin{bmatrix} +0 & 1 \\ +1 & 0 +\end{bmatrix}, \quad +\mathbf{P}^\top \mathbf{A} = +\begin{bmatrix} +1 & 4 \\ +0 & 2 +\end{bmatrix} +$$ + +Then: + +$$ +\mathbf{L} = +\begin{bmatrix} +1 & 0 \\ +0 & 1 +\end{bmatrix}, \quad +\mathbf{U} = +\begin{bmatrix} +1 & 4 \\ +0 & 2 +\end{bmatrix} +$$ + +So: + +$$ +\mathbf{P}^\top \mathbf{A} = \mathbf{L} \mathbf{U} +$$ + +--- + +```{code-cell} ipython3 +import numpy as np +from scipy.linalg import lu + +# Example matrix +A = np.array([[0, 2, 1], + [1, 1, 0], + [2, 1, 1]], dtype=float) + +# Perform PLU decomposition +P, L, U = lu(A) + +print("P.T @ A:\n", P.T @ A) +print("L @ U:\n", L @ U) +``` + +This uses **SciPy's `lu` function**, returning: + +* $\mathbf{P}$: permutation matrix +* $\mathbf{L}$: unit lower triangular matrix +* $\mathbf{U}$: upper triangular matrix + such that: + +$$ +\text{LU decomposition with pivoting gives } \mathbf{P}, \mathbf{L}, \mathbf{U} \text{ such that } \mathbf{A} = \mathbf{P}\mathbf{L} \mathbf{U} +$$ + +--- + +To solve a linear system $\mathbf{A} \mathbf{x} = \mathbf{b}$ **given the PLU decomposition** of $\mathbf{A}$, that is: + +$$ +\boxed{ +\mathbf{P}^\top \mathbf{A} = \mathbf{L} \mathbf{U} +} +$$ + +You solve the system in **three steps**: + +**1. Permute the right-hand side** + +Multiply both sides by $\mathbf{P}^\top$ to align with the decomposition: + +$$ +\mathbf{P}^\top \mathbf{A} \mathbf{x} = \mathbf{P}^\top \mathbf{b} +\Rightarrow \mathbf{L} \mathbf{U} \mathbf{x} = \mathbf{P}^\top \mathbf{b} +$$ + +Let: + +$$ +\mathbf{c} = \mathbf{P}^\top \mathbf{b} +$$ + +--- + +**2. Forward substitution** + +Solve for intermediate vector $\mathbf{y}$ in: + +$$ +\mathbf{L} \mathbf{y} = \mathbf{c} +$$ + +Since $\mathbf{L}$ is lower triangular, this can be done **top-down**. + +--- + +**3. Backward substitution** + +Solve for final solution $\mathbf{x}$ in: + +$$ +\mathbf{U} \mathbf{x} = \mathbf{y} +$$ + +Since $\mathbf{U}$ is upper triangular, this can be done **bottom-up**. 
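Step 2 mirrors the back-substitution loop sketched earlier, sweeping top-down instead of bottom-up. Below is an illustrative pure-NumPy version; the helper name `forward_substitution` and the small test system are assumptions made for this sketch, not part of the original text. The SciPy-based solve using `solve_triangular` follows in the next code cell.

```{code-cell} ipython3
import numpy as np

def forward_substitution(L_mat, rhs):
    """Solve L_mat y = rhs for lower-triangular L_mat with nonzero diagonal, top-down."""
    n = L_mat.shape[0]
    y = np.zeros(n)
    for i in range(n):
        # subtract contributions of the already-computed entries, then divide by the pivot
        y[i] = (rhs[i] - L_mat[i, :i] @ y[:i]) / L_mat[i, i]
    return y

# Small check on a unit lower-triangular system
L_demo = np.array([[1.0, 0.0, 0.0],
                   [0.5, 1.0, 0.0],
                   [2.0, 3.0, 1.0]])
rhs_demo = np.array([2.0, 3.0, 16.0])
y_demo = forward_substitution(L_demo, rhs_demo)
print(y_demo, np.allclose(L_demo @ y_demo, rhs_demo))
```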
+
+---
+
+```{code-cell} ipython3
+from scipy.linalg import solve_triangular
+# Right-hand side of the equation
+b = np.array([4, 2, 6], dtype=float)
+
+# Step 1: permute b (c = P^T b, since P^T A = L U)
+c = P.T @ b
+
+# Step 2: solve L y = c by forward substitution
+y = solve_triangular(L, c, lower=True)
+
+# Step 3: solve U x = y by back substitution
+x = solve_triangular(U, y)
+
+print("Solution x:", x)
+```
+
+---
+
+$$
+\boxed{
+\mathbf{A} \mathbf{x} = \mathbf{b} \quad \Rightarrow \quad
+\mathbf{P}^\top \mathbf{A} \mathbf{x} = \mathbf{L} \mathbf{U} \mathbf{x} = \mathbf{P}^\top \mathbf{b}
+}
+$$
+
+Solve via:
+
+* $\mathbf{L} \mathbf{y} = \mathbf{P}^\top \mathbf{b}$ (forward)
+* $\mathbf{U} \mathbf{x} = \mathbf{y}$ (backward)
+
+This is numerically efficient and stable, especially for **repeated solves** with the same $\mathbf{A}$.
+
+## Summary: Why Gaussian Elimination and Matrix Forms Matter in Machine Learning
+
+Understanding Gaussian elimination and matrix forms like **REF** and **RREF** is more than a theoretical exercise — it builds essential intuition and computational tools for many areas of **machine learning**.
+
+Here’s how these concepts directly relate:
+
+### Solving Linear Systems
+
+Many machine learning algorithms boil down to solving systems of linear equations. For example:
+
+* In **linear regression**, the optimal weights minimize a quadratic loss and satisfy the **normal equations**, which are linear:
+
+  $$
+  (\mathbf{X}^\top \mathbf{X}) \mathbf{w} = \mathbf{X}^\top \mathbf{y}
+  $$
+
+  Gaussian elimination provides an efficient way to solve these equations, especially for small- to medium-scale problems.
+
+---
+
+### Understanding Rank and Feature Spaces
+
+* The **rank** of a data matrix $\mathbf{X}$ tells us the number of **linearly independent features**.
+* Low-rank matrices appear naturally in **dimensionality reduction** (e.g. PCA), **collaborative filtering**, and **matrix completion**.
+* Detecting whether features are redundant, or whether a system is under- or overdetermined, comes down to understanding row operations and rank.
+
+---
+
+### Interpreting Model Structure
+
+* Matrices in **RREF** reveal directly which variables (features) are **pivotal** to a system — a perspective that underlies **feature selection**, **interpretability**, and **symbolic regression**.
+* Understanding when systems have **unique**, **infinite**, or **no solutions** helps us reason about **well-posedness** and **overfitting** in models.
+
+---
+
+### Numerical Stability and Preconditioning
+
+* Even when not done directly, Gaussian elimination underpins many **numerical algorithms** (e.g., LU decomposition, QR factorization).
+* These are used in optimization, iterative solvers, and deep learning libraries for computing gradients, inverses, and solving systems in a stable way.
+
+---
+
+### Big Picture
+
+> While machine learning often uses **high-level abstractions** and **automatic solvers**, understanding how these methods work at the matrix level helps build **intuition**, **debugging skills**, and **mathematical fluency** for real-world modeling.
diff --git a/book/chapter_decompositions/square_matrices.md b/book/chapter_decompositions/square_matrices.md
new file mode 100644
index 0000000..b1a74d7
--- /dev/null
+++ b/book/chapter_decompositions/square_matrices.md
@@ -0,0 +1,82 @@
+# Fundamental Equivalences for Square matrices
+For a square matrix, several seemingly different statements are in fact **equivalent**: they form a **core set of if-and-only-if conditions** that characterize invertibility. 
Let's list them more formally and then prove that they are all equivalent: + +--- +:::{prf:theorem} Fundamental Equivalences +:label: trm-fundamental-equivalences +:nonumber: + +Let $\mathbf{A} \in \mathbb{R}^{n \times n}$. The following statements are **equivalent** — that is, they are all true or all false together: + +1. $\mathbf{A}$ is **invertible** +2. $\det(\mathbf{A}) \neq 0$ +3. $\mathbf{A}$ is **full-rank**, i.e., $\operatorname{rank}(\mathbf{A}) = n$ +4. The **columns** of $\mathbf{A}$ are **linearly independent** +5. The **rows** of $\mathbf{A}$ are **linearly independent** +6. $\mathbf{A}$ is **row-equivalent** to the identity matrix +7. The system $\mathbf{A}\mathbf{x} = \mathbf{b}$ has a **unique solution for every $\mathbf{b} \in \mathbb{R}^n$** + +::: + + +:::{prf:proof} + +We'll prove the chain of implications in a **circular fashion**, which implies all are equivalent. + +--- + +**(1) ⇒ (2): Invertible ⇒ Determinant nonzero** + +If $\mathbf{A}^{-1}$ exists, then + +$$ +\det(\mathbf{A} \mathbf{A}^{-1}) = \det(\mathbf{I}) = 1 = \det(\mathbf{A}) \det(\mathbf{A}^{-1}) \Rightarrow \det(\mathbf{A}) \neq 0 +$$ + +--- + +**(2) ⇒ (3): $\det(\mathbf{A}) \neq 0$ ⇒ Full-rank** + +A square matrix has full rank $\iff$ its rows/columns span $\mathbb{R}^n$, and this happens exactly when $\det(\mathbf{A}) \neq 0$. + +If $\operatorname{rank}(\mathbf{A}) < n$, then one row or column is linearly dependent, making $\det(\mathbf{A}) = 0$. + +--- + +**(3) ⇒ (4) and (5): Full-rank ⇒ Linear independence of rows and columns** + +A matrix with rank $n$ must have linearly independent rows and columns by the definition of rank. + +--- + +**(4) ⇒ (6): Independent columns ⇒ Row-equivalent to identity** + +If the columns are linearly independent, Gaussian elimination can reduce $\mathbf{A}$ to the identity matrix $\mathbf{I}$ using row operations. + +This means $\mathbf{A}$ is row-equivalent to $\mathbf{I}$. + +--- + +**(6) ⇒ (7): Row-equivalent to $\mathbf{I}$ ⇒ Unique solution for all $\mathbf{b}$** + +If $\mathbf{A} \sim \mathbf{I}$, then solving $\mathbf{A} \mathbf{x} = \mathbf{b}$ is equivalent to solving $\mathbf{I} \mathbf{x} = \mathbf{b}'$, which always has the unique solution $\mathbf{x} = \mathbf{b}'$. + +--- + +**(7) ⇒ (1): Unique solution for all $\mathbf{b}$ ⇒ $\mathbf{A}$ is invertible** + +If $\mathbf{A} \mathbf{x} = \mathbf{b}$ has a **unique solution for all** $\mathbf{b}$, then the inverse mapping $\mathbf{b} \mapsto \mathbf{x}$ is well-defined and linear, so $\mathbf{A}^{-1}$ exists. + +--- + +**Conclusion** + +All these statements are **equivalent**: + +$$ +\boxed{ +\mathbf{A} \text{ invertible } \iff \det(\mathbf{A}) \neq 0 \iff \operatorname{rank}(\mathbf{A}) = n \iff \text{cols/rows lin. independent} \iff \mathbf{A} \sim \mathbf{I} \iff \text{unique solution for all } \mathbf{b} +} +$$ + +::: \ No newline at end of file diff --git a/book/chapter_decompositions/svd.md b/book/chapter_decompositions/svd.md index d212eda..72f1eb2 100644 --- a/book/chapter_decompositions/svd.md +++ b/book/chapter_decompositions/svd.md @@ -94,16 +94,3 @@ equivalently the $j$th column of $\mathbf{B}^{\!\top\!}$. Hence by the definition of matrix multiplication, it is equal to $[\mathbf{A}\mathbf{B}^{\!\top\!}]_{ij}$. ◻ -### Quadratic forms - -Let $\mathbf{A} \in \mathbb{R}^{n \times n}$ be a symmetric matrix, and -recall that the expression $\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x}$ -is called a quadratic form of $\mathbf{A}$. 
It is in some cases helpful -to rewrite the quadratic form in terms of the individual elements that -make up $\mathbf{A}$ and $\mathbf{x}$: - -$$\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} = \sum_{i=1}^n\sum_{j=1}^n A_{ij}x_ix_j$$ - -This identity is valid for any square matrix (need not be symmetric), -although quadratic forms are usually only discussed in the context of -symmetric matrices. diff --git a/book/chapter_decompositions/symmetric_matrices.md b/book/chapter_decompositions/symmetric_matrices.md index a0b5276..0db52c7 100644 --- a/book/chapter_decompositions/symmetric_matrices.md +++ b/book/chapter_decompositions/symmetric_matrices.md @@ -1,9 +1,23 @@ -## Symmetric matrices +--- +jupytext: + text_representation: + extension: .md + format_name: myst + format_version: 0.13 + jupytext_version: 1.16.7 +kernelspec: + display_name: Python 3 + language: python + name: python3 +--- +# Symmetric matrices A matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$ is said to be **symmetric** if it is equal to its own transpose ($\mathbf{A} = \mathbf{A}^{\!\top\!}$), meaning that $A_{ij} = A_{ji}$ -for all $(i,j)$. This definition seems harmless enough but turns out to +for all $(i,j)$. + +This definition seems harmless enough but turns out to have some strong implications. We summarize the most important of these as @@ -29,12 +43,109 @@ by $\mathbf{Q}^{\!\top\!}$, we arrive at the decomposition $$\mathbf{A} = \mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^{\!\top\!}$$ -### Rayleigh quotients +```{code-cell} ipython3 +:tags: [hide-input] +import numpy as np +import matplotlib.pyplot as plt + +# Define a symmetric matrix +A = np.array([[3, 1], + [1, 2]]) + +# Eigendecomposition +eigvals, eigvecs = np.linalg.eigh(A) # eigh guarantees real symmetric matrix handling +Λ = np.diag(eigvals) +Q = eigvecs + +# Confirm A = Q Λ Qᵀ +A_reconstructed = Q @ Λ @ Q.T + +# Create unit circle +theta = np.linspace(0, 2*np.pi, 100) +circle = np.stack((np.cos(theta), np.sin(theta))) + +# Transformations +circle_stretched = Λ @ circle +circle_eigen_transformed = Q @ circle_stretched +circle_direct = A @ circle + +# Plotting +fig, axes = plt.subplots(1, 3, figsize=(18, 6)) + +# Original unit circle +axes[0].plot(circle[0], circle[1], 'k--', label='Unit Circle') +axes[0].set_title("Original Space") +axes[0].set_xlim(-3, 3) +axes[0].set_ylim(-3, 3) + +# Stretch along eigenbasis +axes[1].plot(circle_stretched[0], circle_stretched[1], 'r-', label='Stretched (Λ)') +axes[1].quiver(0, 0, Λ[0, 0], 0, angles='xy', scale_units='xy', scale=1, color='blue', label='λ₁q₁') +axes[1].quiver(0, 0, 0, Λ[1, 1], angles='xy', scale_units='xy', scale=1, color='green', label='λ₂q₂') +axes[1].set_title("Stretch in Eigenbasis") +axes[1].set_xlim(-3, 3) +axes[1].set_ylim(-3, 3) +axes[1].legend() + +# Transform via Q Λ Qᵀ +axes[2].plot(circle_direct[0], circle_direct[1], 'purple', label='A ∘ circle') +axes[2].plot(circle_eigen_transformed[0], circle_eigen_transformed[1], 'orange', linestyle='--', label='QΛQᵀ ∘ circle') +axes[2].quiver(0, 0, *eigvecs[:, 0]*eigvals[0], angles='xy', scale_units='xy', scale=1, color='blue') +axes[2].quiver(0, 0, *eigvecs[:, 1]*eigvals[1], angles='xy', scale_units='xy', scale=1, color='green') +axes[2].set_title("Transformation by Symmetric A") +axes[2].set_xlim(-3, 3) +axes[2].set_ylim(-3, 3) +axes[2].legend() + +for ax in axes: + ax.set_aspect('equal') + ax.axhline(0, color='gray', lw=0.5) + ax.axvline(0, color='gray', lw=0.5) + ax.grid(True) + +plt.suptitle("Geometric Intuition of the Spectral Decomposition for Symmetric 
Matrices", fontsize=16) +plt.tight_layout(rect=[0, 0, 1, 0.93]) +plt.show() +``` + +This visualization gives geometric insight into the **spectral theorem for symmetric matrices**: + +1. **Left Panel** – The original unit circle in $\mathbb{R}^2$. +2. **Middle Panel** – The action of the diagonal matrix $\Lambda$ in the **eigenbasis**: stretching along coordinate axes defined by eigenvectors. +3. **Right Panel** – The full symmetric transformation $\mathbf{A} = \mathbf{Q} \Lambda \mathbf{Q}^\top$: + + * This first rotates into the eigenbasis (via $\mathbf{Q}^\top$), + * Then stretches (via $\Lambda$), + * Then rotates back (via $\mathbf{Q}$). + * Both $\mathbf{A} \circ \text{circle}$ and $\mathbf{Q}\Lambda\mathbf{Q}^\top \circ \text{circle}$ overlap perfectly. + +✅ This illustrates how symmetric matrices are always diagonalizable with **orthogonal eigenvectors**, and why they never induce rotation — only **axis-aligned stretching in some rotated basis**. + -Let $\mathbf{A} \in \mathbb{R}^{n \times n}$ be a symmetric matrix. The -expression $\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x}$ is called a + +### Quadratic forms + +Let $\mathbf{A} \in \mathbb{R}^{n \times n}$ be a symmetric matrix. + +The expression $\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x}$ is called a **quadratic form**. + +Let $\mathbf{A} \in \mathbb{R}^{n \times n}$ be a symmetric matrix, and +recall that the expression $\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x}$ +is called a quadratic form of $\mathbf{A}$. It is in some cases helpful +to rewrite the quadratic form in terms of the individual elements that +make up $\mathbf{A}$ and $\mathbf{x}$: + +$$\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} = \sum_{i=1}^n\sum_{j=1}^n A_{ij}x_ix_j$$ + +This identity is valid for any square matrix (need not be symmetric), +although quadratic forms are usually only discussed in the context of +symmetric matrices. + +### Rayleigh quotients + + There turns out to be an interesting connection between the quadratic form of a symmetric matrix and its eigenvalues. This connection is provided by the **Rayleigh quotient** @@ -120,178 +231,376 @@ $$\lambda_{\min}(\mathbf{A}) \leq R_\mathbf{A}(\mathbf{x}) \leq \lambda_{\max}(\ with equality if and only if $\mathbf{x}$ is a corresponding eigenvector. +```{code-cell} ipython3 +:tags: [hide-input] +import numpy as np +import matplotlib.pyplot as plt -## Positive (semi-)definite matrices +# Define symmetric matrix +A = np.array([[2, 1], + [1, 3]]) -A symmetric matrix $\mathbf{A}$ is **positive semi-definite** if for all -$\mathbf{x} \in \mathbb{R}^n$, -$\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} \geq 0$. Sometimes people -write $\mathbf{A} \succeq 0$ to indicate that $\mathbf{A}$ is positive -semi-definite. +# Eigenvalues and eigenvectors +eigvals, eigvecs = np.linalg.eigh(A) +λ_min, λ_max = eigvals -A symmetric matrix $\mathbf{A}$ is **positive definite** if for all -nonzero $\mathbf{x} \in \mathbb{R}^n$, -$\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} > 0$. Sometimes people write -$\mathbf{A} \succ 0$ to indicate that $\mathbf{A}$ is positive definite. -Note that positive definiteness is a strictly stronger property than -positive semi-definiteness, in the sense that every positive definite -matrix is positive semi-definite but not vice-versa. +# Generate unit circle points +theta = np.linspace(0, 2*np.pi, 300) +circle = np.stack((np.cos(theta), np.sin(theta))) -These properties are related to eigenvalues in the following way. 
+# Rayleigh quotient computation +R = np.einsum('ij,ji->i', circle.T @ A, circle) # x^T A x +R /= np.einsum('ij,ji->i', circle.T, circle) # x^T x -*Proposition.* -A symmetric matrix is positive semi-definite if and only if all of its -eigenvalues are nonnegative, and positive definite if and only if all of -its eigenvalues are positive. +# Rayleigh extrema +idx_min = np.argmin(R) +idx_max = np.argmax(R) +x_min = circle[:, idx_min] +x_max = circle[:, idx_max] +# Prepare grid for quadratic form level sets +x = np.linspace(-2, 2, 400) +y = np.linspace(-2, 2, 400) +X, Y = np.meshgrid(x, y) +XY = np.stack((X, Y), axis=-1) +Z = np.einsum('...i,ij,...j->...', XY, A, XY) +levels = np.linspace(np.min(Z), np.max(Z), 20) -*Proof.* Suppose $A$ is positive semi-definite, and let $\mathbf{x}$ be -an eigenvector of $\mathbf{A}$ with eigenvalue $\lambda$. Then +# Create combined figure +fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6)) -$$0 \leq \mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} = \mathbf{x}^{\!\top\!}(\lambda\mathbf{x}) = \lambda\mathbf{x}^{\!\top\!}\mathbf{x} = \lambda\|\mathbf{x}\|_2^2$$ +# Left: Rayleigh quotient on unit circle +sc = ax1.scatter(circle[0], circle[1], c=R, cmap='viridis', s=10) +ax1.quiver(0, 0, x_min[0], x_min[1], color='red', scale=1, scale_units='xy', angles='xy', label='argmin R(x)') +ax1.quiver(0, 0, x_max[0], x_max[1], color='orange', scale=1, scale_units='xy', angles='xy', label='argmax R(x)') +for i in range(2): + eigvec = eigvecs[:, i] + ax1.quiver(0, 0, eigvec[0], eigvec[1], color='black', alpha=0.5, scale=1, scale_units='xy', angles='xy', width=0.008) +ax1.set_title("Rayleigh Quotient on the Unit Circle") +ax1.set_aspect('equal') +ax1.set_xlim(-1.1, 1.1) +ax1.set_ylim(-1.1, 1.1) +ax1.grid(True) +ax1.legend() +plt.colorbar(sc, ax=ax1, label="Rayleigh Quotient $R_A(\\mathbf{x})$") -Since $\mathbf{x} \neq \mathbf{0}$ (by the assumption that it is an -eigenvector), we have $\|\mathbf{x}\|_2^2 > 0$, so we can divide both -sides by $\|\mathbf{x}\|_2^2$ to arrive at $\lambda \geq 0$. If -$\mathbf{A}$ is positive definite, the inequality above holds strictly, -so $\lambda > 0$. This proves one direction. +# Right: Level sets of quadratic form +contour = ax2.contour(X, Y, Z, levels=levels, cmap='viridis') +ax2.clabel(contour, inline=True, fontsize=8, fmt="%.1f") +ax2.set_title("Level Sets of $\\mathbf{x}^\\top \\mathbf{A} \\mathbf{x}$") +ax2.set_xlabel("$x_1$") +ax2.set_ylabel("$x_2$") +ax2.axhline(0, color='gray', lw=0.5) +ax2.axvline(0, color='gray', lw=0.5) +for i in range(2): + vec = eigvecs[:, i] * np.sqrt(eigvals[i]) + ax2.quiver(0, 0, vec[0], vec[1], color='red', scale=1, scale_units='xy', angles='xy', width=0.01, label=f"$\\mathbf{{q}}_{i+1}$") +ax2.set_aspect('equal') +ax2.legend() -To simplify the proof of the other direction, we will use the machinery -of Rayleigh quotients. Suppose that $\mathbf{A}$ is symmetric and all -its eigenvalues are nonnegative. Then for all -$\mathbf{x} \neq \mathbf{0}$, +plt.suptitle("Rayleigh Quotient and Quadratic Form Level Sets", fontsize=16) +plt.tight_layout(rect=[0, 0, 1, 0.93]) +plt.show() +``` -$$0 \leq \lambda_{\min}(\mathbf{A}) \leq R_\mathbf{A}(\mathbf{x})$$ +This combined visualization brings together the **Rayleigh quotient** and the **level sets of the quadratic form** $\mathbf{x}^\top \mathbf{A} \mathbf{x}$: -Since $\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x}$ matches -$R_\mathbf{A}(\mathbf{x})$ in sign, we conclude that $\mathbf{A}$ is -positive semi-definite. 
If the eigenvalues of $\mathbf{A}$ are all -strictly positive, then $0 < \lambda_{\min}(\mathbf{A})$, whence it -follows that $\mathbf{A}$ is positive definite. ◻ +* **Left panel**: Rayleigh quotient $R_\mathbf{A}(\mathbf{x})$ on the unit circle + * Color shows how the value varies with direction. + * Extremes occur at eigenvector directions (marked with arrows). -As an example of how these matrices arise, consider +* **Right panel**: Level sets (contours) of the quadratic form -*Proposition.* -Suppose $\mathbf{A} \in \mathbb{R}^{m \times n}$. Then -$\mathbf{A}^{\!\top\!}\mathbf{A}$ is positive semi-definite. If -$\operatorname{null}(\mathbf{A}) = \{\mathbf{0}\}$, then -$\mathbf{A}^{\!\top\!}\mathbf{A}$ is positive definite. + * Elliptical shapes aligned with eigenvectors. + * Red vectors indicate principal axes (scaled eigenvectors). +Together, these panels illustrate how the **direction of a vector determines how strongly it is scaled** by the symmetric matrix, and how this scaling relates to the matrix's **eigenstructure**. -*Proof.* For any $\mathbf{x} \in \mathbb{R}^n$, +✅ As guaranteed by the **Min–Max Theorem**, the maximum and minimum of the Rayleigh quotient occur precisely at the **eigenvectors corresponding to the largest and smallest eigenvalues**. -$$\mathbf{x}^{\!\top\!} (\mathbf{A}^{\!\top\!}\mathbf{A})\mathbf{x} = (\mathbf{A}\mathbf{x})^{\!\top\!}(\mathbf{A}\mathbf{x}) = \|\mathbf{A}\mathbf{x}\|_2^2 \geq 0$$ -so $\mathbf{A}^{\!\top\!}\mathbf{A}$ is positive semi-definite. -Note that $\|\mathbf{A}\mathbf{x}\|_2^2 = 0$ implies -$\|\mathbf{A}\mathbf{x}\|_2 = 0$, which in turn implies -$\mathbf{A}\mathbf{x} = \mathbf{0}$ (recall that this is a property of -norms). If $\operatorname{null}(\mathbf{A}) = \{\mathbf{0}\}$, -$\mathbf{A}\mathbf{x} = \mathbf{0}$ implies $\mathbf{x} = \mathbf{0}$, -so -$\mathbf{x}^{\!\top\!} (\mathbf{A}^{\!\top\!}\mathbf{A})\mathbf{x} = 0$ -if and only if $\mathbf{x} = \mathbf{0}$, and thus -$\mathbf{A}^{\!\top\!}\mathbf{A}$ is positive definite. ◻ +--- -Positive definite matrices are invertible (since their eigenvalues are -nonzero), whereas positive semi-definite matrices might not be. However, -if you already have a positive semi-definite matrix, it is possible to -perturb its diagonal slightly to produce a positive definite matrix. +## ✅ Theorem: Real symmetric matrices cannot produce rotation -*Proposition.* -If $\mathbf{A}$ is positive semi-definite and $\epsilon > 0$, then -$\mathbf{A} + \epsilon\mathbf{I}$ is positive definite. - -*Proof.* Assuming $\mathbf{A}$ is positive semi-definite and -$\epsilon > 0$, we have for any $\mathbf{x} \neq \mathbf{0}$ that - -$$\mathbf{x}^{\!\top\!}(\mathbf{A}+\epsilon\mathbf{I})\mathbf{x} = \mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} + \epsilon\mathbf{x}^{\!\top\!}\mathbf{I}\mathbf{x} = \underbrace{\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x}}_{\geq 0} + \underbrace{\epsilon\|\mathbf{x}\|_2^2}_{> 0} > 0$$ - -as claimed. ◻ - -An obvious but frequently useful consequence of the two propositions we -have just shown is that -$\mathbf{A}^{\!\top\!}\mathbf{A} + \epsilon\mathbf{I}$ is positive -definite (and in particular, invertible) for *any* matrix $\mathbf{A}$ -and any $\epsilon > 0$. - -### The geometry of positive definite quadratic forms - -A useful way to understand quadratic forms is by the geometry of their -level sets. A **level set** or **isocontour** of a function is the set -of all inputs such that the function applied to those inputs yields a -given output. 
Mathematically, the $c$-isocontour of $f$ is -$\{\mathbf{x} \in \operatorname{dom} f : f(\mathbf{x}) = c\}$. - -Let us consider the special case -$f(\mathbf{x}) = \mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x}$ where -$\mathbf{A}$ is a positive definite matrix. Since $\mathbf{A}$ is -positive definite, it has a unique matrix square root -$\mathbf{A}^{\frac{1}{2}} = \mathbf{Q}\mathbf{\Lambda}^{\frac{1}{2}}\mathbf{Q}^{\!\top\!}$, -where $\mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^{\!\top\!}$ is the -eigendecomposition of $\mathbf{A}$ and -$\mathbf{\Lambda}^{\frac{1}{2}} = \operatorname{diag}(\sqrt{\lambda_1}, \dots \sqrt{\lambda_n})$. -It is easy to see that this matrix $\mathbf{A}^{\frac{1}{2}}$ is -positive definite (consider its eigenvalues) and satisfies -$\mathbf{A}^{\frac{1}{2}}\mathbf{A}^{\frac{1}{2}} = \mathbf{A}$. Fixing -a value $c \geq 0$, the $c$-isocontour of $f$ is the set of -$\mathbf{x} \in \mathbb{R}^n$ such that - -$$c = \mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} = \mathbf{x}^{\!\top\!}\mathbf{A}^{\frac{1}{2}}\mathbf{A}^{\frac{1}{2}}\mathbf{x} = \|\mathbf{A}^{\frac{1}{2}}\mathbf{x}\|_2^2$$ - -where we have used the symmetry of $\mathbf{A}^{\frac{1}{2}}$. Making -the change of variable -$\mathbf{z} = \mathbf{A}^{\frac{1}{2}}\mathbf{x}$, we have the condition -$\|\mathbf{z}\|_2 = \sqrt{c}$. That is, the values $\mathbf{z}$ lie on a -sphere of radius $\sqrt{c}$. These can be parameterized as -$\mathbf{z} = \sqrt{c}\hat{\mathbf{z}}$ where $\hat{\mathbf{z}}$ has -$\|\hat{\mathbf{z}}\|_2 = 1$. Then since -$\mathbf{A}^{-\frac{1}{2}} = \mathbf{Q}\mathbf{\Lambda}^{-\frac{1}{2}}\mathbf{Q}^{\!\top\!}$, -we have - -$$\mathbf{x} = \mathbf{A}^{-\frac{1}{2}}\mathbf{z} = \mathbf{Q}\mathbf{\Lambda}^{-\frac{1}{2}}\mathbf{Q}^{\!\top\!}\sqrt{c}\hat{\mathbf{z}} = \sqrt{c}\mathbf{Q}\mathbf{\Lambda}^{-\frac{1}{2}}\tilde{\mathbf{z}}$$ - -where $\tilde{\mathbf{z}} = \mathbf{Q}^{\!\top\!}\hat{\mathbf{z}}$ also -satisfies $\|\tilde{\mathbf{z}}\|_2 = 1$ since $\mathbf{Q}$ is -orthogonal. Using this parameterization, we see that the solution set -$\{\mathbf{x} \in \mathbb{R}^n : f(\mathbf{x}) = c\}$ is the image of -the unit sphere -$\{\tilde{\mathbf{z}} \in \mathbb{R}^n : \|\tilde{\mathbf{z}}\|_2 = 1\}$ -under the invertible linear map -$\mathbf{x} = \sqrt{c}\mathbf{Q}\mathbf{\Lambda}^{-\frac{1}{2}}\tilde{\mathbf{z}}$. - -What we have gained with all these manipulations is a clear algebraic -understanding of the $c$-isocontour of $f$ in terms of a sequence of -linear transformations applied to a well-understood set. We begin with -the unit sphere, then scale every axis $i$ by -$\lambda_i^{-\frac{1}{2}}$, resulting in an axis-aligned ellipsoid. -Observe that the axis lengths of the ellipsoid are proportional to the -inverse square roots of the eigenvalues of $\mathbf{A}$. Hence larger -eigenvalues correspond to shorter axis lengths, and vice-versa. - -Then this axis-aligned ellipsoid undergoes a rigid transformation (i.e. -one that preserves length and angles, such as a rotation/reflection) -given by $\mathbf{Q}$. The result of this transformation is that the -axes of the ellipse are no longer along the coordinate axes in general, -but rather along the directions given by the corresponding eigenvectors. -To see this, consider the unit vector $\mathbf{e}_i \in \mathbb{R}^n$ -that has $[\mathbf{e}_i]_j = \delta_{ij}$. In the pre-transformed space, -this vector points along the axis with length proportional to -$\lambda_i^{-\frac{1}{2}}$. 
But after applying the rigid transformation -$\mathbf{Q}$, the resulting vector points in the direction of the -corresponding eigenvector $\mathbf{q}_i$, since - -$$\mathbf{Q}\mathbf{e}_i = \sum_{j=1}^n [\mathbf{e}_i]_j\mathbf{q}_j = \mathbf{q}_i$$ - -where we have used the matrix-vector product identity from earlier. - -In summary: the isocontours of -$f(\mathbf{x}) = \mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x}$ are -ellipsoids such that the axes point in the directions of the -eigenvectors of $\mathbf{A}$, and the radii of these axes are -proportional to the inverse square roots of the corresponding -eigenvalues. +### 🧾 Statement + +Let $\mathbf{A} \in \mathbb{R}^{n \times n}$ be a **real symmetric matrix**. Then: + +> The linear transformation $\mathbf{x} \mapsto \mathbf{A}\mathbf{x}$ **does not rotate** vectors — i.e., it cannot produce a transformation that changes the direction of a vector **without preserving its span**. + +In particular: + +* The transformation **does not rotate angles** +* The transformation has a basis of **orthogonal eigenvectors** +* Therefore, all action is **stretching/compressing along fixed directions**, not rotation + +--- + +## 🧠 Intuition + +Rotation mixes directions. But symmetric matrices: + +* Have **real eigenvalues** +* Are **orthogonally diagonalizable** +* Have **mutually orthogonal eigenvectors** + +So the matrix acts by **scaling along fixed orthogonal axes**, without changing the direction between basis vectors — i.e., no twisting, hence no rotation. + +--- + +## ✏️ Proof (2D case, generalizes easily) + +Let $\mathbf{A} \in \mathbb{R}^{2 \times 2}$ be symmetric: + +$$ +\mathbf{A} = \begin{pmatrix} a & b \\ b & d \end{pmatrix} +$$ + +We’ll show that $\mathbf{A}$ cannot produce a true rotation. + +### Step 1: Diagonalize $\mathbf{A}$ + +Because $\mathbf{A}$ is real symmetric, there exists an orthogonal matrix $\mathbf{Q}$ and diagonal $\mathbf{\Lambda}$ such that: + +$$ +\mathbf{A} = \mathbf{Q} \mathbf{\Lambda} \mathbf{Q}^\top +$$ + +That is, $\mathbf{A}$ acts as: + +* A rotation (or reflection) $\mathbf{Q}^\top$ +* A stretch along axes $\mathbf{\Lambda}$ +* A second rotation (or reflection) $\mathbf{Q}$ + +But since $\mathbf{Q}$ and $\mathbf{Q}^\top$ cancel out geometrically (they are transposes of each other), this results in: + +> A transformation that **scales but does not rotate** relative to the basis of eigenvectors. + +### Step 2: Show $\mathbf{A}$ preserves alignment + +Let $\mathbf{v}$ be any eigenvector of $\mathbf{A}$. Then: + +$$ +\mathbf{A} \mathbf{v} = \lambda \mathbf{v} +$$ + +So $\mathbf{v}$ is **mapped to a scalar multiple of itself** — its **direction doesn’t change**. + +Because $\mathbb{R}^2$ has two linearly independent eigenvectors (since symmetric matrices are always diagonalizable), **no vector is rotated out of its original span** — just scaled. + +Hence, the transformation only **stretches**, **compresses**, or **reflects**, but never rotates. + +--- + +## 🚫 Counterexample: Rotation matrix is not symmetric + +The rotation matrix: + +$$ +\mathbf{R}_\theta = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} +$$ + +is **not symmetric** unless $\theta = 0$ or $\pi$, where it reduces to identity or negation. + +It **does not** have real eigenvectors (except at those degenerate angles), and it **rotates** all directions. 
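The contrast is easy to check numerically. Below is a minimal sketch (assuming NumPy; the symmetric matrix is an arbitrary illustrative choice, and the rotation is $\mathbf{R}_\theta$ from above with $\theta = 45°$), comparing the spectrum of a symmetric matrix with that of a rotation matrix:

```{code-cell} ipython3
import numpy as np

# A generic symmetric matrix: real eigenvalues, orthonormal eigenvectors
S = np.array([[2.0, 1.0],
              [1.0, 3.0]])
evals_S, evecs_S = np.linalg.eigh(S)
print("symmetric eigenvalues:", evals_S)                       # real numbers
print("Q^T Q = I:", np.allclose(evecs_S.T @ evecs_S, np.eye(2)))

# A 45-degree rotation matrix: not symmetric, no real eigenvectors
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
evals_R, _ = np.linalg.eig(R)
print("rotation eigenvalues:", evals_R)                        # complex conjugate pair
```

The symmetric matrix yields real eigenvalues and an orthonormal eigenbasis, while the rotation matrix yields a complex conjugate pair — consistent with the argument above.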

---

## ✅ Conclusion

**Rotation requires asymmetry.**

If a linear transformation rotates vectors (changes direction without preserving alignment), the matrix must be **non-symmetric**.

---

## ✅ Corollary

A matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$ can perform rotation **only if**:

* It is **not symmetric**, and
* It has **complex eigenvalues** (at least in 2D rotation)

A natural follow-up question is whether every non-symmetric matrix at least admits an eigen-decomposition. The answer is:

> ❗️**Not all non-symmetric matrices have an eigen-decomposition over $\mathbb{R}$ or even $\mathbb{C}$.**

Let's unpack what this means.

---

## ✅ What is an Eigen-Decomposition?

An **eigen-decomposition** of a matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$ means:

$$
\mathbf{A} = \mathbf{V} \mathbf{\Lambda} \mathbf{V}^{-1}
$$

Where:

* $\mathbf{\Lambda}$ is a diagonal matrix of eigenvalues
* $\mathbf{V}$ contains eigenvectors as columns
* $\mathbf{V}^{-1}$ exists (i.e., $\mathbf{V}$ is invertible)

This decomposition is **only possible when $\mathbf{A}$ is diagonalizable**.

---

## ❌ Not All Matrices Are Diagonalizable

A matrix is **not diagonalizable** if:

* It **does not have enough linearly independent eigenvectors** (i.e., the geometric multiplicity is smaller than the algebraic multiplicity)

This can happen even if all the eigenvalues are real!

### 🔴 Example (Defective Matrix):

$$
\mathbf{A} = \begin{pmatrix}
1 & 1 \\
0 & 1
\end{pmatrix}
$$

* Eigenvalue: $\lambda = 1$ (with algebraic multiplicity 2)
* But only **one** linearly independent eigenvector
* So it **cannot be diagonalized**

---

## ✅ When Does a Matrix Have an Eigen-Decomposition?

| Matrix Type                    | Diagonalizable? | Notes                                            |
| ------------------------------ | --------------- | ------------------------------------------------ |
| Symmetric (real)               | ✅ Always        | Eigen-decomposition with orthogonal eigenvectors |
| Diagonalizable (in general)    | ✅ Yes           | Can write $A = V \Lambda V^{-1}$                 |
| Defective (non-diagonalizable) | ❌ No            | Needs Jordan form instead                        |

---

## 🔁 Jordan Decomposition: The General Replacement

If a matrix is **not diagonalizable**, it still has a **Jordan decomposition**:

$$
\mathbf{A} = \mathbf{P} \mathbf{J} \mathbf{P}^{-1}
$$

Where:

* $\mathbf{J}$ is **block diagonal**: eigenvalues plus possible **Jordan blocks**
* This captures **generalized eigenvectors**

So **every square matrix** has a **Jordan decomposition**, but **not every matrix** has an eigen-decomposition.

---

## ✅ Summary

* **Symmetric matrices**: always have an eigen-decomposition (with real, orthogonal eigenvectors)
* **Non-symmetric matrices**:

  * May have a complete eigen-decomposition (if diagonalizable)
  * May **not**, if they are **defective**
* In the general case, you must use the **Jordan form**

A matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$ has **complex eigenvalues or eigenvectors** when:

### ✅ 1. The matrix is **not symmetric** (i.e., $\mathbf{A} \ne \mathbf{A}^\top$)

* Real symmetric matrices **always** have **real** eigenvalues and orthogonal eigenvectors.
* Non-symmetric real matrices can have complex eigenvalues and eigenvectors.

### ✅ 2. The **characteristic polynomial** has **complex roots**

For example, consider the 90° rotation matrix

$$
\mathbf{A} = \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix}
$$

Its characteristic polynomial is:

$$
\det(\mathbf{A} - \lambda \mathbf{I}) = \lambda^2 + 1
$$

The roots are:

$$
\lambda = \pm i
$$

So it has **pure imaginary eigenvalues**, and its eigenvectors are also **complex**.
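Both situations can be confirmed numerically. A small sketch (assuming NumPy; it checks the defective matrix from above and a 90° rotation):

```{code-cell} ipython3
import numpy as np

# Defective matrix: eigenvalue 1 repeated, but only one eigenvector direction
A_def = np.array([[1.0, 1.0],
                  [0.0, 1.0]])
lam = 1.0
# geometric multiplicity = dim null(A - lam*I) = n - rank(A - lam*I)
geo_mult = 2 - np.linalg.matrix_rank(A_def - lam * np.eye(2))
print("geometric multiplicity of lambda=1:", geo_mult)   # 1 < algebraic multiplicity 2

# 90-degree rotation: complex eigenvalues, no real eigenvectors
A_rot = np.array([[0.0, -1.0],
                  [1.0,  0.0]])
print("rotation eigenvalues:", np.linalg.eig(A_rot)[0])  # +1j and -1j (up to ordering)
```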
+## ✅ Quick Answer: + +The eigenvectors and their transformed versions $\mathbf{A} \mathbf{v} = \lambda \mathbf{v}$ **are** parallel — **but only in complex vector space** $\mathbb{C}^n$. + +In **real space**, we usually visualize: + +* The **real part** of a complex vector: $\mathrm{Re}(\mathbf{v})$ +* The **imaginary part** of a complex vector: $\mathrm{Im}(\mathbf{v})$ + +But neither of these alone is invariant under multiplication by $\lambda \in \mathbb{C}$. So when you look at: + +$$ +\mathbf{v} = \mathrm{Re}(\mathbf{v}) + i \cdot \mathrm{Im}(\mathbf{v}) +$$ + +and apply $\mathbf{A}$, what you see in the real plane is: + +$$ +\mathrm{Re}(\mathbf{A} \mathbf{v}) \quad \text{vs.} \quad \mathrm{Re}(\lambda \mathbf{v}) +$$ + +These are **not scalar multiples** of $\mathrm{Re}(\mathbf{v})$ or $\mathrm{Im}(\mathbf{v})$, because complex scaling **mixes real and imaginary components** — unless $\lambda$ is real. + +--- + +## 🔍 Example + +Say: + +$$ +\lambda = a + ib, \quad \mathbf{v} = \begin{pmatrix} x + iy \\ z + iw \end{pmatrix} +$$ + +Then: + +$$ +\lambda \mathbf{v} = (a + ib)(\text{real} + i \cdot \text{imag}) = \text{mix of real and imaginary} +$$ + +So $\mathbf{A} \mathbf{v} = \lambda \mathbf{v}$, but $\mathrm{Re}(\mathbf{A} \mathbf{v})$ will **not be parallel** to $\mathrm{Re}(\mathbf{v})$ alone — it's a rotated and scaled mixture. + +--- + +## 🧠 Bottom Line + +> **Eigenvectors and their transformations are parallel in $\mathbb{C}^n$, but not necessarily in $\mathbb{R}^n$.** + + +> Note: The eigenvectors and their transformations are parallel in complex space, but their real and imaginary parts generally point in different directions due to complex scaling (rotation + stretch). + +--- + +## 🧠 Intuition + +* Complex eigenvalues often indicate **rotational behavior** in linear dynamical systems. +* The matrix above rotates vectors by 90° and has no real direction that stays on its span after transformation — hence no real eigenvectors. + +--- + +## 🔄 Summary +| Matrix Type | Eigenvalues | Eigenvectors | +| ------------------ | ----------------- | ----------------- | +| Symmetric real | Real | Real & orthogonal | +| Non-symmetric real | Real or complex | Real or complex | +| Complex (any) | Complex (general) | Complex (general) | diff --git a/book/chapter_decompositions/trace.md b/book/chapter_decompositions/trace.md new file mode 100644 index 0000000..2eb75a5 --- /dev/null +++ b/book/chapter_decompositions/trace.md @@ -0,0 +1,108 @@ +--- +jupytext: + text_representation: + extension: .md + format_name: myst + format_version: 0.13 + jupytext_version: 1.16.7 +kernelspec: + display_name: Python 3 + language: python + name: python3 +--- +# Trace + +The **trace** of a square matrix is the sum of its diagonal entries: + +$$\operatorname{tr}(\mathbf{A}) = \sum_{i=1}^n A_{ii}$$ + +The trace has several nice algebraic properties: + +(i) $\operatorname{tr}(\mathbf{A}+\mathbf{B}) = \operatorname{tr}(\mathbf{A}) + \operatorname{tr}(\mathbf{B})$ + +(ii) $\operatorname{tr}(\alpha\mathbf{A}) = \alpha\operatorname{tr}(\mathbf{A})$ + +(iii) $\operatorname{tr}(\mathbf{A}^{\!\top\!}) = \operatorname{tr}(\mathbf{A})$ + +(iv) $\operatorname{tr}(\mathbf{A}\mathbf{B}\mathbf{C}\mathbf{D}) = \operatorname{tr}(\mathbf{B}\mathbf{C}\mathbf{D}\mathbf{A}) = \operatorname{tr}(\mathbf{C}\mathbf{D}\mathbf{A}\mathbf{B}) = \operatorname{tr}(\mathbf{D}\mathbf{A}\mathbf{B}\mathbf{C})$ + +The first three properties follow readily from the definition. +The last is known as **invariance under cyclic permutations**. 
+Note that the matrices cannot be reordered arbitrarily, for example +$\operatorname{tr}(\mathbf{A}\mathbf{B}\mathbf{C}\mathbf{D}) \neq \operatorname{tr}(\mathbf{B}\mathbf{A}\mathbf{C}\mathbf{D})$ +in general. +Also, there is nothing special about the product of four matrices -- analogous rules hold for more or fewer matrices. + + +```{code-cell} ipython3 +:tags: [hide-input] +import numpy as np +import matplotlib.pyplot as plt + +# Set up matrices for demonstration +A = np.array([[2, 1], + [0, 3]]) +B = np.array([[1, -1], + [2, 0]]) +alpha = 2.5 + +# Compute traces +trace_A = np.trace(A) +trace_B = np.trace(B) +trace_A_plus_B = np.trace(A + B) +trace_alphaA = np.trace(alpha * A) +trace_AT = np.trace(A.T) + +# Cyclic permutation example +C = np.array([[0, 2], + [1, 1]]) +D = np.array([[1, 1], + [0, -1]]) + +product_1 = A @ B @ C @ D +product_2 = B @ C @ D @ A +product_3 = C @ D @ A @ B +product_4 = D @ A @ B @ C + +traces = [ + np.trace(product_1), + np.trace(product_2), + np.trace(product_3), + np.trace(product_4), +] + +# Plotting +fig, axes = plt.subplots(2, 2, figsize=(12, 10)) + +# (i) Linearity +axes[0, 0].bar(['tr(A)', 'tr(B)', 'tr(A+B)'], [trace_A, trace_B, trace_A_plus_B], + color=['blue', 'green', 'purple']) +axes[0, 0].set_title("Linearity: tr(A + B) = tr(A) + tr(B)") +axes[0, 0].axhline(trace_A + trace_B, color='gray', linestyle='--', label='Expected tr(A) + tr(B)') +axes[0, 0].legend() + +# (ii) Scalar multiplication +axes[0, 1].bar(['tr(A)', 'tr(αA)'], [trace_A, trace_alphaA], color=['blue', 'orange']) +axes[0, 1].axhline(alpha * trace_A, color='gray', linestyle='--', label='Expected α·tr(A)') +axes[0, 1].set_title("Scaling: tr(αA) = α·tr(A)") +axes[0, 1].legend() + +# (iii) Transpose invariance +axes[1, 0].bar(['tr(A)', 'tr(Aᵀ)'], [trace_A, trace_AT], color=['blue', 'red']) +axes[1, 0].set_title("Transpose: tr(Aᵀ) = tr(A)") + +# (iv) Cyclic permutation invariance +axes[1, 1].bar(['ABCD', 'BCDA', 'CDAB', 'DABC'], traces, color='teal') +axes[1, 1].axhline(traces[0], color='gray', linestyle='--', label='Invariant trace') +axes[1, 1].set_title("Cyclic Permutation: tr(ABCD) = tr(BCDA) = ...") +axes[1, 1].legend() + +plt.suptitle("Visualizing the Properties of the Trace Operator", fontsize=16) +plt.tight_layout(rect=[0, 0, 1, 0.95]) +plt.show() + +``` + +Interestingly, the trace of a matrix is equal to the sum of its eigenvalues (repeated according to multiplicity): + +$$\operatorname{tr}(\mathbf{A}) = \sum_i \lambda_i(\mathbf{A})$$ diff --git a/book/chapter_decompositions/trace_determinant.md b/book/chapter_decompositions/trace_determinant.md deleted file mode 100644 index 882c968..0000000 --- a/book/chapter_decompositions/trace_determinant.md +++ /dev/null @@ -1,51 +0,0 @@ -## Trace - -The **trace** of a square matrix is the sum of its diagonal entries: - -$$\operatorname{tr}(\mathbf{A}) = \sum_{i=1}^n A_{ii}$$ - -The trace has several nice -algebraic properties: - -(i) $\operatorname{tr}(\mathbf{A}+\mathbf{B}) = \operatorname{tr}(\mathbf{A}) + \operatorname{tr}(\mathbf{B})$ - -(ii) $\operatorname{tr}(\alpha\mathbf{A}) = \alpha\operatorname{tr}(\mathbf{A})$ - -(iii) $\operatorname{tr}(\mathbf{A}^{\!\top\!}) = \operatorname{tr}(\mathbf{A})$ - -(iv) $\operatorname{tr}(\mathbf{A}\mathbf{B}\mathbf{C}\mathbf{D}) = \operatorname{tr}(\mathbf{B}\mathbf{C}\mathbf{D}\mathbf{A}) = \operatorname{tr}(\mathbf{C}\mathbf{D}\mathbf{A}\mathbf{B}) = \operatorname{tr}(\mathbf{D}\mathbf{A}\mathbf{B}\mathbf{C})$ - -The first three properties follow readily from the definition. 
The last -is known as **invariance under cyclic permutations**. Note that the -matrices cannot be reordered arbitrarily, for example -$\operatorname{tr}(\mathbf{A}\mathbf{B}\mathbf{C}\mathbf{D}) \neq \operatorname{tr}(\mathbf{B}\mathbf{A}\mathbf{C}\mathbf{D})$ -in general. Also, there is nothing special about the product of four -matrices -- analogous rules hold for more or fewer matrices. - -Interestingly, the trace of a matrix is equal to the sum of its -eigenvalues (repeated according to multiplicity): - -$$\operatorname{tr}(\mathbf{A}) = \sum_i \lambda_i(\mathbf{A})$$ - -## Determinant - -The **determinant** of a square matrix can be defined in several -different confusing ways, none of which are particularly important for -our purposes; go look at an introductory linear algebra text (or -Wikipedia) if you need a definition. But it's good to know the -properties: - -(i) $\det(\mathbf{I}) = 1$ - -(ii) $\det(\mathbf{A}^{\!\top\!}) = \det(\mathbf{A})$ - -(iii) $\det(\mathbf{A}\mathbf{B}) = \det(\mathbf{A})\det(\mathbf{B})$ - -(iv) $\det(\mathbf{A}^{-1}) = \det(\mathbf{A})^{-1}$ - -(v) $\det(\alpha\mathbf{A}) = \alpha^n \det(\mathbf{A})$ - -Interestingly, the determinant of a matrix is equal to the product of -its eigenvalues (repeated according to multiplicity): - -$$\det(\mathbf{A}) = \prod_i \lambda_i(\mathbf{A})$$ \ No newline at end of file From 35251647548d1591db8597593580ce5ad8ef2bfe Mon Sep 17 00:00:00 2001 From: clippert Date: Tue, 13 May 2025 23:59:15 +0200 Subject: [PATCH 18/43] week 5 up to quadratic forms --- book/_toc.yml | 8 +- .../Rayleigh_quotients.md | 461 ++++++++++++++ book/chapter_decompositions/determinant.md | 237 +++++++- book/chapter_decompositions/eigenvectors.md | 312 ++++++---- .../orthogonal_matrices.md | 66 +- .../symmetric_matrices.md | 571 +----------------- book/chapter_decompositions/trace.md | 74 +-- 7 files changed, 995 insertions(+), 734 deletions(-) create mode 100644 book/chapter_decompositions/Rayleigh_quotients.md diff --git a/book/_toc.yml b/book/_toc.yml index 5bc2b09..3ed1d30 100644 --- a/book/_toc.yml +++ b/book/_toc.yml @@ -82,10 +82,10 @@ parts: - file: chapter_decompositions/determinant - file: chapter_decompositions/row_equivalence - file: chapter_decompositions/square_matrices -# - file: chapter_decompositions/eigenvectors -# - file: chapter_decompositions/trace -# - file: chapter_decompositions/orthogonal_matrices -# - file: chapter_decompositions/symmetric_matrices + - file: chapter_decompositions/trace + - file: chapter_decompositions/eigenvectors + - file: chapter_decompositions/orthogonal_matrices + - file: chapter_decompositions/symmetric_matrices # - file: chapter_decompositions/psd_matrices # - file: chapter_decompositions/svd # - file: chapter_decompositions/big_picture diff --git a/book/chapter_decompositions/Rayleigh_quotients.md b/book/chapter_decompositions/Rayleigh_quotients.md new file mode 100644 index 0000000..894bce8 --- /dev/null +++ b/book/chapter_decompositions/Rayleigh_quotients.md @@ -0,0 +1,461 @@ +# Rayleigh quotients + + +There turns out to be an interesting connection between the quadratic +form of a symmetric matrix and its eigenvalues. This connection is +provided by the **Rayleigh quotient** + +$$R_\mathbf{A}(\mathbf{x}) = \frac{\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x}}{\mathbf{x}^{\!\top\!}\mathbf{x}}$$ + +The Rayleigh quotient has a couple of important properties which the +reader can (and should!) 
easily verify from the definition: + +(i) **Scale invariance**: for any vector $\mathbf{x} \neq \mathbf{0}$ + and any scalar $\alpha \neq 0$, + $R_\mathbf{A}(\mathbf{x}) = R_\mathbf{A}(\alpha\mathbf{x})$. + +(ii) If $\mathbf{x}$ is an eigenvector of $\mathbf{A}$ with eigenvalue + $\lambda$, then $R_\mathbf{A}(\mathbf{x}) = \lambda$. + +We can further show that the Rayleigh quotient is bounded by the largest +and smallest eigenvalues of $\mathbf{A}$. But first we will show a +useful special case of the final result. + +*Proposition.* +For any $\mathbf{x}$ such that $\|\mathbf{x}\|_2 = 1$, + +$$\lambda_{\min}(\mathbf{A}) \leq \mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} \leq \lambda_{\max}(\mathbf{A})$$ + +with equality if and only if $\mathbf{x}$ is a corresponding +eigenvector. + +*Proof.* We show only the $\max$ case because the argument for the +$\min$ case is entirely analogous. + +Since $\mathbf{A}$ is symmetric, we can decompose it as +$\mathbf{A} = \mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^{\!\top\!}$. Then use +the change of variable $\mathbf{y} = \mathbf{Q}^{\!\top\!}\mathbf{x}$, +noting that the relationship between $\mathbf{x}$ and $\mathbf{y}$ is +one-to-one and that $\|\mathbf{y}\|_2 = 1$ since $\mathbf{Q}$ is +orthogonal. Hence + +$$\max_{\|\mathbf{x}\|_2 = 1} \mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} = \max_{\|\mathbf{y}\|_2 = 1} \mathbf{y}^{\!\top\!}\mathbf{\Lambda}\mathbf{y} = \max_{y_1^2+\dots+y_n^2=1} \sum_{i=1}^n \lambda_i y_i^2$$ + +Written this way, it is clear that $\mathbf{y}$ maximizes this +expression exactly if and only if it satisfies +$\sum_{i \in I} y_i^2 = 1$ where +$I = \{i : \lambda_i = \max_{j=1,\dots,n} \lambda_j = \lambda_{\max}(\mathbf{A})\}$ +and $y_j = 0$ for $j \not\in I$. That is, $I$ contains the index or +indices of the largest eigenvalue. In this case, the maximal value of +the expression is + +$$\sum_{i=1}^n \lambda_i y_i^2 = \sum_{i \in I} \lambda_i y_i^2 = \lambda_{\max}(\mathbf{A}) \sum_{i \in I} y_i^2 = \lambda_{\max}(\mathbf{A})$$ + +Then writing $\mathbf{q}_1, \dots, \mathbf{q}_n$ for the columns of +$\mathbf{Q}$, we have + +$$\mathbf{x} = \mathbf{Q}\mathbf{Q}^{\!\top\!}\mathbf{x} = \mathbf{Q}\mathbf{y} = \sum_{i=1}^n y_i\mathbf{q}_i = \sum_{i \in I} y_i\mathbf{q}_i$$ + +where we have used the matrix-vector product identity. + +Recall that $\mathbf{q}_1, \dots, \mathbf{q}_n$ are eigenvectors of +$\mathbf{A}$ and form an orthonormal basis for $\mathbb{R}^n$. Therefore +by construction, the set $\{\mathbf{q}_i : i \in I\}$ forms an +orthonormal basis for the eigenspace of $\lambda_{\max}(\mathbf{A})$. +Hence $\mathbf{x}$, which is a linear combination of these, lies in that +eigenspace and thus is an eigenvector of $\mathbf{A}$ corresponding to +$\lambda_{\max}(\mathbf{A})$. + +We have shown that +$\max_{\|\mathbf{x}\|_2 = 1} \mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} = \lambda_{\max}(\mathbf{A})$, +from which we have the general inequality +$\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} \leq \lambda_{\max}(\mathbf{A})$ +for all unit-length $\mathbf{x}$. ◻ + + +By the scale invariance of the Rayleigh quotient, we immediately have as +a corollary (since +$\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} = R_{\mathbf{A}}(\mathbf{x})$ +for unit $\mathbf{x}$) + +*Theorem.* +(Min-max theorem) For all $\mathbf{x} \neq \mathbf{0}$, + +$$\lambda_{\min}(\mathbf{A}) \leq R_\mathbf{A}(\mathbf{x}) \leq \lambda_{\max}(\mathbf{A})$$ + +with equality if and only if $\mathbf{x}$ is a corresponding +eigenvector. 
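Before the visualization below, here is a quick numerical sanity check of properties (i) and (ii) — a minimal sketch assuming NumPy, with an arbitrary symmetric example matrix:

```{code-cell} ipython3
import numpy as np

def rayleigh(A, x):
    """Rayleigh quotient R_A(x) = x^T A x / x^T x."""
    return (x @ A @ x) / (x @ x)

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
rng = np.random.default_rng(0)
x = rng.normal(size=2)

# (i) scale invariance: R_A(x) == R_A(alpha * x) for any alpha != 0
print(np.isclose(rayleigh(A, x), rayleigh(A, 7.3 * x)))

# (ii) at an eigenvector, the Rayleigh quotient equals the eigenvalue
evals, evecs = np.linalg.eigh(A)
print(np.isclose(rayleigh(A, evecs[:, 0]), evals[0]))
```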
+ +```{code-cell} ipython3 +:tags: [hide-input] +import numpy as np +import matplotlib.pyplot as plt + +# Define symmetric matrix +A = np.array([[2, 1], + [1, 3]]) + +# Eigenvalues and eigenvectors +eigvals, eigvecs = np.linalg.eigh(A) +λ_min, λ_max = eigvals + +# Generate unit circle points +theta = np.linspace(0, 2*np.pi, 300) +circle = np.stack((np.cos(theta), np.sin(theta))) + +# Rayleigh quotient computation +R = np.einsum('ij,ji->i', circle.T @ A, circle) # x^T A x +R /= np.einsum('ij,ji->i', circle.T, circle) # x^T x + +# Rayleigh extrema +idx_min = np.argmin(R) +idx_max = np.argmax(R) +x_min = circle[:, idx_min] +x_max = circle[:, idx_max] + +# Prepare grid for quadratic form level sets +x = np.linspace(-2, 2, 400) +y = np.linspace(-2, 2, 400) +X, Y = np.meshgrid(x, y) +XY = np.stack((X, Y), axis=-1) +Z = np.einsum('...i,ij,...j->...', XY, A, XY) +levels = np.linspace(np.min(Z), np.max(Z), 20) + +# Create combined figure +fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6)) + +# Left: Rayleigh quotient on unit circle +sc = ax1.scatter(circle[0], circle[1], c=R, cmap='viridis', s=10) +ax1.quiver(0, 0, x_min[0], x_min[1], color='red', scale=1, scale_units='xy', angles='xy', label='argmin R(x)') +ax1.quiver(0, 0, x_max[0], x_max[1], color='orange', scale=1, scale_units='xy', angles='xy', label='argmax R(x)') +for i in range(2): + eigvec = eigvecs[:, i] + ax1.quiver(0, 0, eigvec[0], eigvec[1], color='black', alpha=0.5, scale=1, scale_units='xy', angles='xy', width=0.008) +ax1.set_title("Rayleigh Quotient on the Unit Circle") +ax1.set_aspect('equal') +ax1.set_xlim(-1.1, 1.1) +ax1.set_ylim(-1.1, 1.1) +ax1.grid(True) +ax1.legend() +plt.colorbar(sc, ax=ax1, label="Rayleigh Quotient $R_A(\\mathbf{x})$") + +# Right: Level sets of quadratic form +contour = ax2.contour(X, Y, Z, levels=levels, cmap='viridis') +ax2.clabel(contour, inline=True, fontsize=8, fmt="%.1f") +ax2.set_title("Level Sets of $\\mathbf{x}^\\top \\mathbf{A} \\mathbf{x}$") +ax2.set_xlabel("$x_1$") +ax2.set_ylabel("$x_2$") +ax2.axhline(0, color='gray', lw=0.5) +ax2.axvline(0, color='gray', lw=0.5) +for i in range(2): + vec = eigvecs[:, i] * np.sqrt(eigvals[i]) + ax2.quiver(0, 0, vec[0], vec[1], color='red', scale=1, scale_units='xy', angles='xy', width=0.01, label=f"$\\mathbf{{q}}_{i+1}$") +ax2.set_aspect('equal') +ax2.legend() + +plt.suptitle("Rayleigh Quotient and Quadratic Form Level Sets", fontsize=16) +plt.tight_layout(rect=[0, 0, 1, 0.93]) +plt.show() +``` + +This combined visualization brings together the **Rayleigh quotient** and the **level sets of the quadratic form** $\mathbf{x}^\top \mathbf{A} \mathbf{x}$: + +* **Left panel**: Rayleigh quotient $R_\mathbf{A}(\mathbf{x})$ on the unit circle + + * Color shows how the value varies with direction. + * Extremes occur at eigenvector directions (marked with arrows). + +* **Right panel**: Level sets (contours) of the quadratic form + + * Elliptical shapes aligned with eigenvectors. + * Red vectors indicate principal axes (scaled eigenvectors). + +Together, these panels illustrate how the **direction of a vector determines how strongly it is scaled** by the symmetric matrix, and how this scaling relates to the matrix's **eigenstructure**. + +✅ As guaranteed by the **Min–Max Theorem**, the maximum and minimum of the Rayleigh quotient occur precisely at the **eigenvectors corresponding to the largest and smallest eigenvalues**. 
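As a complement to the plots, the bound itself is easy to test by brute force. A small sketch (assuming NumPy; it reuses the same matrix as in the figure above and evaluates the Rayleigh quotient on many random directions):

```{code-cell} ipython3
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
lam_min, lam_max = np.linalg.eigvalsh(A)          # ascending order

rng = np.random.default_rng(1)
X = rng.normal(size=(2, 10_000))                  # 10,000 random directions as columns
R = np.einsum('ij,ij->j', X, A @ X) / np.einsum('ij,ij->j', X, X)

print("lambda_min =", lam_min, " min R =", R.min())
print("max R =", R.max(), " lambda_max =", lam_max)
print(np.all(R >= lam_min - 1e-12) and np.all(R <= lam_max + 1e-12))
```

Every sampled value of $R_\mathbf{A}(\mathbf{x})$ stays inside $[\lambda_{\min}, \lambda_{\max}]$, and the sampled extremes approach the two eigenvalues.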
+ + + +--- + +## ✅ Theorem: Real symmetric matrices cannot produce rotation + +### 🧾 Statement + +Let $\mathbf{A} \in \mathbb{R}^{n \times n}$ be a **real symmetric matrix**. Then: + +> The linear transformation $\mathbf{x} \mapsto \mathbf{A}\mathbf{x}$ **does not rotate** vectors — i.e., it cannot produce a transformation that changes the direction of a vector **without preserving its span**. + +In particular: + +* The transformation **does not rotate angles** +* The transformation has a basis of **orthogonal eigenvectors** +* Therefore, all action is **stretching/compressing along fixed directions**, not rotation + +--- + +## 🧠 Intuition + +Rotation mixes directions. But symmetric matrices: + +* Have **real eigenvalues** +* Are **orthogonally diagonalizable** +* Have **mutually orthogonal eigenvectors** + +So the matrix acts by **scaling along fixed orthogonal axes**, without changing the direction between basis vectors — i.e., no twisting, hence no rotation. + +--- + +## ✏️ Proof (2D case, generalizes easily) + +Let $\mathbf{A} \in \mathbb{R}^{2 \times 2}$ be symmetric: + +$$ +\mathbf{A} = \begin{pmatrix} a & b \\ b & d \end{pmatrix} +$$ + +We’ll show that $\mathbf{A}$ cannot produce a true rotation. + +### Step 1: Diagonalize $\mathbf{A}$ + +Because $\mathbf{A}$ is real symmetric, there exists an orthogonal matrix $\mathbf{Q}$ and diagonal $\mathbf{\Lambda}$ such that: + +$$ +\mathbf{A} = \mathbf{Q} \mathbf{\Lambda} \mathbf{Q}^\top +$$ + +That is, $\mathbf{A}$ acts as: + +* A rotation (or reflection) $\mathbf{Q}^\top$ +* A stretch along axes $\mathbf{\Lambda}$ +* A second rotation (or reflection) $\mathbf{Q}$ + +But since $\mathbf{Q}$ and $\mathbf{Q}^\top$ cancel out geometrically (they are transposes of each other), this results in: + +> A transformation that **scales but does not rotate** relative to the basis of eigenvectors. + +### Step 2: Show $\mathbf{A}$ preserves alignment + +Let $\mathbf{v}$ be any eigenvector of $\mathbf{A}$. Then: + +$$ +\mathbf{A} \mathbf{v} = \lambda \mathbf{v} +$$ + +So $\mathbf{v}$ is **mapped to a scalar multiple of itself** — its **direction doesn’t change**. + +Because $\mathbb{R}^2$ has two linearly independent eigenvectors (since symmetric matrices are always diagonalizable), **no vector is rotated out of its original span** — just scaled. + +Hence, the transformation only **stretches**, **compresses**, or **reflects**, but never rotates. + +--- + +## 🚫 Counterexample: Rotation matrix is not symmetric + +The rotation matrix: + +$$ +\mathbf{R}_\theta = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} +$$ + +is **not symmetric** unless $\theta = 0$ or $\pi$, where it reduces to identity or negation. + +It **does not** have real eigenvectors (except at those degenerate angles), and it **rotates** all directions. + +--- + +## ✅ Conclusion + +**Rotation requires asymmetry.** + +If a linear transformation rotates vectors (changes direction without preserving alignment), the matrix must be **non-symmetric**. + +--- + +## ✅ Corollary + +A matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$ can perform rotation **only if**: + +* It is **not symmetric**, and +* It has **complex eigenvalues** (at least in 2D rotation) + +Excellent and important question. The answer is: + +> ❗️**Not all non-symmetric matrices have an eigen-decomposition over $\mathbb{R}$ or even $\mathbb{C}$.** + +Let’s unpack what this means. + +--- + +## ✅ What is an Eigen-Decomposition? 

An **eigen-decomposition** of a matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$ means:

$$
\mathbf{A} = \mathbf{V} \mathbf{\Lambda} \mathbf{V}^{-1}
$$

Where:

* $\mathbf{\Lambda}$ is a diagonal matrix of eigenvalues
* $\mathbf{V}$ contains eigenvectors as columns
* $\mathbf{V}^{-1}$ exists (i.e., $\mathbf{V}$ is invertible)

This decomposition is **only possible when $\mathbf{A}$ is diagonalizable**.

---

## ❌ Not All Matrices Are Diagonalizable

A matrix is **not diagonalizable** if:

* It **does not have enough linearly independent eigenvectors** (i.e., the geometric multiplicity is smaller than the algebraic multiplicity)

This can happen even if all the eigenvalues are real!

### 🔴 Example (Defective Matrix):

$$
\mathbf{A} = \begin{pmatrix}
1 & 1 \\
0 & 1
\end{pmatrix}
$$

* Eigenvalue: $\lambda = 1$ (with algebraic multiplicity 2)
* But only **one** linearly independent eigenvector
* So it **cannot be diagonalized**

---

## ✅ When Does a Matrix Have an Eigen-Decomposition?

| Matrix Type                    | Diagonalizable? | Notes                                            |
| ------------------------------ | --------------- | ------------------------------------------------ |
| Symmetric (real)               | ✅ Always        | Eigen-decomposition with orthogonal eigenvectors |
| Diagonalizable (in general)    | ✅ Yes           | Can write $A = V \Lambda V^{-1}$                 |
| Defective (non-diagonalizable) | ❌ No            | Needs Jordan form instead                        |

---

## 🔁 Jordan Decomposition: The General Replacement

If a matrix is **not diagonalizable**, it still has a **Jordan decomposition**:

$$
\mathbf{A} = \mathbf{P} \mathbf{J} \mathbf{P}^{-1}
$$

Where:

* $\mathbf{J}$ is **block diagonal**: eigenvalues plus possible **Jordan blocks**
* This captures **generalized eigenvectors**

So **every square matrix** has a **Jordan decomposition**, but **not every matrix** has an eigen-decomposition.

---

## ✅ Summary

* **Symmetric matrices**: always have an eigen-decomposition (with real, orthogonal eigenvectors)
* **Non-symmetric matrices**:

  * May have a complete eigen-decomposition (if diagonalizable)
  * May **not**, if they are **defective**
* In the general case, you must use the **Jordan form**

A matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$ has **complex eigenvalues or eigenvectors** when:

### ✅ 1. The matrix is **not symmetric** (i.e., $\mathbf{A} \ne \mathbf{A}^\top$)

* Real symmetric matrices **always** have **real** eigenvalues and orthogonal eigenvectors.
* Non-symmetric real matrices can have complex eigenvalues and eigenvectors.

### ✅ 2. The **characteristic polynomial** has **complex roots**

For example, consider the 90° rotation matrix

$$
\mathbf{A} = \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix}
$$

Its characteristic polynomial is:

$$
\det(\mathbf{A} - \lambda \mathbf{I}) = \lambda^2 + 1
$$

The roots are:

$$
\lambda = \pm i
$$

So it has **pure imaginary eigenvalues**, and its eigenvectors are also **complex**.

## ✅ Quick Answer: Are complex eigenvectors still parallel to their images?

The eigenvectors and their transformed versions $\mathbf{A} \mathbf{v} = \lambda \mathbf{v}$ **are** parallel — **but only in complex vector space** $\mathbb{C}^n$.

In **real space**, we usually visualize:

* The **real part** of a complex vector: $\mathrm{Re}(\mathbf{v})$
* The **imaginary part** of a complex vector: $\mathrm{Im}(\mathbf{v})$

But neither of these alone is invariant under multiplication by $\lambda \in \mathbb{C}$.
So when you look at: + +$$ +\mathbf{v} = \mathrm{Re}(\mathbf{v}) + i \cdot \mathrm{Im}(\mathbf{v}) +$$ + +and apply $\mathbf{A}$, what you see in the real plane is: + +$$ +\mathrm{Re}(\mathbf{A} \mathbf{v}) \quad \text{vs.} \quad \mathrm{Re}(\lambda \mathbf{v}) +$$ + +These are **not scalar multiples** of $\mathrm{Re}(\mathbf{v})$ or $\mathrm{Im}(\mathbf{v})$, because complex scaling **mixes real and imaginary components** — unless $\lambda$ is real. + +--- + +## 🔍 Example + +Say: + +$$ +\lambda = a + ib, \quad \mathbf{v} = \begin{pmatrix} x + iy \\ z + iw \end{pmatrix} +$$ + +Then: + +$$ +\lambda \mathbf{v} = (a + ib)(\text{real} + i \cdot \text{imag}) = \text{mix of real and imaginary} +$$ + +So $\mathbf{A} \mathbf{v} = \lambda \mathbf{v}$, but $\mathrm{Re}(\mathbf{A} \mathbf{v})$ will **not be parallel** to $\mathrm{Re}(\mathbf{v})$ alone — it's a rotated and scaled mixture. + +--- + +## 🧠 Bottom Line + +> **Eigenvectors and their transformations are parallel in $\mathbb{C}^n$, but not necessarily in $\mathbb{R}^n$.** + + +> Note: The eigenvectors and their transformations are parallel in complex space, but their real and imaginary parts generally point in different directions due to complex scaling (rotation + stretch). + +--- + +## 🧠 Intuition + +* Complex eigenvalues often indicate **rotational behavior** in linear dynamical systems. +* The matrix above rotates vectors by 90° and has no real direction that stays on its span after transformation — hence no real eigenvectors. + +--- + +## 🔄 Summary + +| Matrix Type | Eigenvalues | Eigenvectors | +| ------------------ | ----------------- | ----------------- | +| Symmetric real | Real | Real & orthogonal | +| Non-symmetric real | Real or complex | Real or complex | +| Complex (any) | Complex (general) | Complex (general) | + diff --git a/book/chapter_decompositions/determinant.md b/book/chapter_decompositions/determinant.md index 6bfee75..c2b43fd 100644 --- a/book/chapter_decompositions/determinant.md +++ b/book/chapter_decompositions/determinant.md @@ -12,11 +12,29 @@ kernelspec: --- # Determinant -The **determinant** of a square matrix can be defined in several -different ways. +The **determinant** is a scalar quantity associated with any square matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$. It encodes important geometric and algebraic information about the transformation represented by $\mathbf{A}$. + +**Geometrically**, the determinant tells us: + +* How the matrix $\mathbf{A}$ **scales volume**: + The absolute value $|\det(\mathbf{A})|$ is the **volume-scaling factor** for the linear transformation $\mathbf{x} \mapsto \mathbf{A}\mathbf{x}$. +* Whether the transformation **preserves or flips orientation**: + If $\det(\mathbf{A}) > 0$, the transformation preserves orientation; if $\det(\mathbf{A}) < 0$, it reverses it (like a reflection). + +**Algebraically**, the determinant can be defined as: + +$$ +\det(\mathbf{A}) = \sum_{\sigma \in S_n} \operatorname{sgn}(\sigma) \cdot a_{1\sigma(1)} a_{2\sigma(2)} \cdots a_{n\sigma(n)} +$$ + +where: + +* The sum is over all permutations $\sigma$ of $\{1, 2, \dots, n\}$, +* $\operatorname{sgn}(\sigma)$ is $+1$ or $-1$ depending on the parity of the permutation. + +This formula is **computationally expensive** and confusing, but conceptually important: it captures how the determinant depends on all possible signed products of entries, each taken once from a distinct row and column. Let's illustrate the determinant geometrically. 
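Before the geometric picture, the permutation formula above can be checked directly on a small matrix. The following brute-force sketch (assuming NumPy and the standard library's `itertools`; it is only feasible for tiny $n$ because of the $n!$ terms) compares the signed permutation sum with `np.linalg.det`:

```{code-cell} ipython3
import numpy as np
from itertools import permutations

def det_by_permutations(A):
    """Sum over all permutations: sign(sigma) * prod_i A[i, sigma(i)]."""
    n = A.shape[0]
    total = 0.0
    for sigma in permutations(range(n)):
        # the parity (number of inversions) of the permutation gives the sign
        inversions = sum(1 for i in range(n) for j in range(i + 1, n)
                         if sigma[i] > sigma[j])
        total += (-1) ** inversions * np.prod([A[i, sigma[i]] for i in range(n)])
    return total

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 10.0]])
print(det_by_permutations(A), np.linalg.det(A))   # both approximately -3
```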
-The determinant can be considered a factor on the change of volume of a unit square after transformation. ```{code-cell} ipython3 :tags: [hide-input] @@ -78,6 +96,12 @@ plt.show() 3. **Flip**: Reflects across the diagonal — determinant < 0. 4. **Rotation**: Rotates without distortion — determinant = 1. +## What is the Determinant? + +The **determinant** of a matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$ is a scalar that describes how $\mathbf{A}$ scales space. +Algebraically, it is defined by a signed sum over all permutations of the matrix’s entries. +Geometrically, it quantifies the change in **signed volume** of a unit shape under transformation by $\mathbf{A}$. If the determinant is zero, the transformation collapses the volume entirely, and $\mathbf{A}$ is singular (non-invertible). + --- The determinant has several important properties: @@ -93,5 +117,212 @@ The determinant has several important properties: (v) $\det(\alpha\mathbf{A}) = \alpha^n \det(\mathbf{A})$ --- +## Practical Computation of the Determinant + +The algebraic definition of the determinant is computationally expensive. +In practice, we compute the determinant using property (iii) and a matrix factorization such as the **PLU decomposition**: + +$$ +\mathbf{A} = \mathbf{P} \mathbf{L} \mathbf{U} +$$ + +where $\mathbf{P}$ is a permutation matrix, $\mathbf{L}$ is a unit lower triangular matrix, and $\mathbf{U}$ is an upper triangular matrix. + +:::{prf:theorem} Triangular Matrix Determinant +:label: trm-triangular-determinant +:nonumber: + +Let $\mathbf{T} \in \mathbb{R}^{n \times n}$ be a **triangular matrix**, either upper or lower triangular. + +Then: + +$$ +\boxed{ +\det(\mathbf{T}) = \prod_{i=1}^n T_{ii} +} +$$ +::: + + +Then, + +$$ +\boxed{ +\det(\mathbf{A}) = \det(\mathbf{P}) \cdot \det(\mathbf{L}) \cdot \det(\mathbf{U}) +} +$$ + +Since: + +* $\det(\mathbf{L}) = 1$ (if unit lower triangular), +* $\det(\mathbf{U}) = \prod_{i=1}^n u_{ii}$, +* $\det(\mathbf{P}) = (-1)^s$, where $s$ is the number of row swaps, + +this method reduces determinant computation to $\mathcal{O}(n)$ operations after LU decomposition. +As the cost for the LU decomposition is $\mathcal{O}(n^3),$ the total cost of computing the determinant is $\mathcal{O}(n^3).$ + +## Cofactor Expansion: Definition + +**Cofactor expansion** (also called **Laplace expansion**) gives a **recursive definition** of the determinant. + +Let $\mathbf{A} \in \mathbb{R}^{n \times n}$ be a square matrix. + +Then the **determinant** of $\mathbf{A}$ can be computed by expanding along **any row or column**. + +For simplicity, we’ll define it for the **first column**. + +$$ +\boxed{ +\det(\mathbf{A}) = \sum_{i=1}^{n} (-1)^{i+1} \cdot A_{i1} \cdot \det(\mathbf{A}^{(i,1)}) +} +$$ + +Where: + +* $A_{i1}$ is the entry in row $i$, column 1 +* $\mathbf{A}^{(i,1)}$ is the **minor** matrix obtained by deleting row $i$ and column 1 from $\mathbf{A}$ +* $(-1)^{i+1}$ is the **sign** factor for alternating signs (from the **checkerboard sign pattern**) +* $(-1)^{i+j} \cdot \det(\mathbf{A}^{(i,j)})$ is called the **cofactor** of $A_{ij}$ + +This formula recursively reduces the computation of a determinant to smaller and smaller submatrices. 
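The recursion translates directly into code. Here is a minimal sketch (assuming NumPy; the cost grows factorially, so it is only practical for small matrices) that expands along the first column and agrees with `np.linalg.det` on the 3×3 matrix used in the worked example below:

```{code-cell} ipython3
import numpy as np

def det_cofactor(A):
    """Recursive Laplace (cofactor) expansion along the first column."""
    n = A.shape[0]
    if n == 1:
        return A[0, 0]
    total = 0.0
    for i in range(n):
        # minor: delete row i and column 0
        minor = np.delete(np.delete(A, i, axis=0), 0, axis=1)
        total += (-1) ** i * A[i, 0] * det_cofactor(minor)
    return total

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0]])
print(det_cofactor(A), np.linalg.det(A))   # both approximately 0
```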
+ +--- + +### Cofactor Expantion Example (3×3 Matrix) + +Let: + +$$ +\mathbf{A} = +\begin{bmatrix} +1 & 2 & 3 \\ +4 & 5 & 6 \\ +7 & 8 & 9 +\end{bmatrix} +$$ + +Expand along the **first column**: + +$$ +\det(\mathbf{A}) = +(+1) \cdot 1 \cdot +\begin{vmatrix} +5 & 6 \\ +8 & 9 +\end{vmatrix} +- 4 \cdot +\begin{vmatrix} +2 & 3 \\ +8 & 9 +\end{vmatrix} ++ 7 \cdot +\begin{vmatrix} +2 & 3 \\ +5 & 6 +\end{vmatrix} +$$ + +Now compute the 2×2 determinants: + +$$ +\det(\mathbf{A}) = +1 \cdot (5 \cdot 9 - 6 \cdot 8) +- 4 \cdot (2 \cdot 9 - 3 \cdot 8) ++ 7 \cdot (2 \cdot 6 - 3 \cdot 5) +$$ + +$$ += 1 \cdot (-3) - 4 \cdot (-6) + 7 \cdot (-3) += -3 + 24 - 21 = 0 +$$ + +So: + +$$ +\boxed{\det(\mathbf{A}) = 0} +$$ + + +:::{prf:proof} via Laplace Expansion / Cofactor Expansion + +We’ll prove this for **upper triangular** matrices by induction on the matrix size $n$. The same argument applies symmetrically for lower triangular matrices. + +--- + +### Base Case: $n = 1$ + +Let $\mathbf{T} = [t_{11}]$. Then clearly: + +$$ +\det(\mathbf{T}) = t_{11} += \prod_{i=1}^1 T_{ii} +$$ + +The base case holds. + +--- + +### Inductive Step + +Assume the result holds for $(n-1) \times (n-1)$ upper triangular matrices. + +Now let $\mathbf{T} \in \mathbb{R}^{n \times n}$ be upper triangular. That means all entries below the diagonal are zero: + +$$ +\mathbf{T} = +\begin{bmatrix} +t_{11} & t_{12} & \dots & t_{1n} \\ +0 & t_{22} & \dots & t_{2n} \\ +\vdots & \vdots & \ddots & \vdots \\ +0 & 0 & \dots & t_{nn} +\end{bmatrix} +$$ + +Use **cofactor expansion** along the first column. Since the only nonzero entry in the first column is $t_{11}$, we have: + +$$ +\det(\mathbf{T}) = t_{11} \cdot \det(\mathbf{T}^{(1,1)}) +$$ + +Where $\mathbf{T}^{(1,1)}$ is the $(n-1)\times(n-1)$ matrix obtained by deleting row 1 and column 1. But: + +* $\mathbf{T}^{(1,1)}$ is still upper triangular. +* By the inductive hypothesis: + + $$ + \det(\mathbf{T}^{(1,1)}) = \prod_{i=2}^{n} T_{ii} + $$ + +So: + +$$ +\det(\mathbf{T}) = t_{11} \cdot \prod_{i=2}^{n} T_{ii} += \prod_{i=1}^{n} T_{ii} +$$ + +The inductive step holds. + +--- + +### Conclusion + +By induction, for any upper (or lower) triangular matrix $\mathbf{T} \in \mathbb{R}^{n \times n}$, + +$$ +\boxed{ +\det(\mathbf{T}) = \prod_{i=1}^n T_{ii} +} +$$ + +::: + +* The determinant accumulates only the diagonal entries because **each pivot is isolated**, and all other paths in the expansion have zero entries. +* This result is frequently used in: + + * Computing determinants from **LU decomposition** + * Checking invertibility efficiently + * Proving properties of **eigenvalues** and **characteristic polynomials** + + diff --git a/book/chapter_decompositions/eigenvectors.md b/book/chapter_decompositions/eigenvectors.md index 99e30e6..76f453e 100644 --- a/book/chapter_decompositions/eigenvectors.md +++ b/book/chapter_decompositions/eigenvectors.md @@ -16,7 +16,7 @@ For a *square matrix* $\mathbf{A} \in \mathbb{R}^{n \times n}$, there may be vectors which, when $\mathbf{A}$ is applied to them, are simply scaled by some constant. 
-We say that a nonzero vector $\mathbf{x} \in \mathbb{R}^n$ is an **eigenvector** of $\mathbf{A}$ corresponding to **eigenvalue** $\lambda$ if +A nonzero vector $\mathbf{x} \in \mathbb{C}^n$ is an **eigenvector** of $\mathbf{A}$ corresponding to **eigenvalue** $\lambda \in \mathbb{C}$ if $$\mathbf{A}\mathbf{x} = \lambda\mathbf{x}$$ @@ -24,9 +24,10 @@ The zero vector is excluded from this definition because $\mathbf{A}\mathbf{0} = \mathbf{0} = \lambda\mathbf{0}$ for every $\lambda$. ---- +Eigenvalues and eigenvectors can be complex numbers, even if $\mathbf{A}$ is real-valued. +We will provide a high-level discussion of the conditions below. -Let's look at an example and how multiplication with a matrix $\mathbf{A}$ transforms vectors that lie on the unit circle and, in particular, how it changes it's eivenvectors during multiplication. +First, let's look at an example and how multiplication with a matrix $\mathbf{A}$ transforms vectors that lie on the unit circle and, in particular, how it changes it's eivenvectors during multiplication. $$ \mathbf{A} = \begin{pmatrix}1.5 & 0.5 \\ 0.1 & 1.2\end{pmatrix} @@ -94,6 +95,105 @@ The visualization shows: Note how the eigenvectors are aligned with the directions that remain unchanged in orientation under transformation — they are only scaled by their respective eigenvalues. +## Eigenvectors can be real-valued or complex. + +Here’s a breakdown of the geometric distinction between linear maps that have **only real eigenvectors** and those that have **complex eigenvectors**: + +### Real Eigenvectors → Maps That Stretch or Reflect Along Fixed Directions + +If a matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$ has **only real eigenvalues and eigenvectors**, it means: + +* There exist **real directions** in space that are preserved (up to scaling). +* The action of the matrix is intuitively: + + * **Scaling** (positive eigenvalues) + * **Reflection + scaling** (negative eigenvalues) +* You can visualize this as: + + * Pulling/stretching space along certain axes + * Possibly flipping directions + + +### Complex Eigenvectors → Maps That Rotate or Spiral + +If a matrix has **complex eigenvalues** and **no real eigenvectors**, it **cannot leave any real direction invariant**. + +This typically corresponds to: + +* **Rotation** or **spiral** motion +* Sometimes **rotation + scaling** (when complex eigenvalues have modulus $\ne 1$) +* The action in real space: + + * **No real eigenvector** + * Points are **rotated** or **rotated and scaled** + * Repeated application creates **circular** or **spiraling trajectories** + +#### Example: Stretching vs. Shearing vs. Rotation +* **Stretching**: scales space differently along the axes. The matrix has only real eigenvalues and eigenvectors. +* **Shearing**: shifts one axis direction while keeping the other fixed. The matrix has only real eigenvalues and eigenvectors. +* **Rotation**: turns everything around the origin. The matrix has only complex eigenvalues and eigenvectors. + +Each transformation is applied to a **unit square** and a **grid**, so you can clearly see how space is deformed under each linear map. 
+```{code-cell} ipython3 +:tags: [hide-input] +import numpy as np +import matplotlib.pyplot as plt + +def apply_transform(grid, matrix): + return np.tensordot(matrix, grid, axes=1) + +def draw_transform(ax, matrix, title, color='red'): + # Draw original grid + x = np.linspace(-1, 1, 11) + y = np.linspace(-1, 1, 11) + for xi in x: + ax.plot([xi]*len(y), y, color='lightgray', lw=0.5) + for yi in y: + ax.plot(x, [yi]*len(x), color='lightgray', lw=0.5) + + # Draw transformed grid + for xi in x: + line = np.stack(([xi]*len(y), y)) + transformed = apply_transform(line, matrix) + ax.plot(transformed[0], transformed[1], color=color, lw=1) + for yi in y: + line = np.stack((x, [yi]*len(x))) + transformed = apply_transform(line, matrix) + ax.plot(transformed[0], transformed[1], color=color, lw=1) + + # Draw unit square before and after + square = np.array([[0, 1, 1, 0, 0], + [0, 0, 1, 1, 0]]) + transformed_square = matrix @ square + ax.plot(square[0], square[1], 'k--', label='Original square') + ax.plot(transformed_square[0], transformed_square[1], 'k-', label='Transformed square') + ax.set_aspect('equal') + ax.set_xlim(-2, 2) + ax.set_ylim(-2, 2) + ax.set_title(title) + ax.legend() + +# Define transformation matrices +stretch = np.array([[1.5, 0], + [0, 0.5]]) + +shear = np.array([[1, 1], + [0, 1]]) + +theta = np.pi / 4 +rotation = np.array([[np.cos(theta), -np.sin(theta)], + [np.sin(theta), np.cos(theta)]]) + +# Plot all three +fig, axes = plt.subplots(1, 3, figsize=(15, 5)) +draw_transform(axes[0], stretch, "Stretching") +draw_transform(axes[1], shear, "Shearing") +draw_transform(axes[2], rotation, "Rotation") +plt.suptitle("Linear Transformations: Stretch vs Shear vs Rotation", fontsize=14) +plt.tight_layout(rect=[0, 0, 1, 0.95]) +plt.show() +``` + --- We now give some useful results about how eigenvalues change after @@ -201,7 +301,8 @@ We observe that: Note how the red-transformed circles deform differently in each panel, but the eigenvector stays aligned. -:::{prf:proof} +:::{prf:proof} + (i) follows readily: $$(\mathbf{A} + \gamma\mathbf{I})\mathbf{x} = \mathbf{A}\mathbf{x} + \gamma\mathbf{I}\mathbf{x} = \lambda\mathbf{x} + \gamma\mathbf{x} = (\lambda + \gamma)\mathbf{x}$$ @@ -220,147 +321,154 @@ $k \geq 0$ case with (ii). ◻ ::: -## Eigenvectors can be real-valued or complex. -There's a deep and intuitive geometric distinction between linear maps that have **only real eigenvectors** and those that have **complex eigenvectors**. -Here’s a breakdown: +## Relationship between Eigenvalues and Determinant +Interestingly, the determinant of a matrix is equal to the product of +its eigenvalues (repeated according to multiplicity): -### Real Eigenvectors → Maps That Stretch or Reflect Along Fixed Directions +$$\det(\mathbf{A}) = \prod_{i=1}^n \lambda_i(\mathbf{A})$$ -If a matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$ has **only real eigenvalues and eigenvectors**, it means: +This provides a means to find the eigenvalues by deriving the roots of the characteristic polynomial. -* There exist **real directions** in space that are preserved (up to scaling). 
-* The action of the matrix is intuitively: +:::{prf:corollary} Characteristic Polynomial +:label: trm-characteristic-polynomial +:nonumber: - * **Scaling** (positive eigenvalues) - * **Reflection + scaling** (negative eigenvalues) -* You can visualize this as: +The eigenvalues of a matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$ are the **roots of its characteristic polynomial** defined as: - * Pulling/stretching space along certain axes - * Possibly flipping directions +$$ +p(\lambda) = \det(\mathbf{A} - \lambda \mathbf{I}) +$$ +It is a degree-$n$ polynomial in $\lambda$, and its roots are precisely the eigenvalues of $\mathbf{A}$. -### Complex Eigenvectors → Maps That Rotate or Spiral +::: -If a matrix has **complex eigenvalues** and **no real eigenvectors**, it **cannot leave any real direction invariant**. This typically corresponds to: +:::{prf:proof} Characteristic Polynomial -* **Rotation** or **spiral** motion -* Sometimes **rotation + scaling** (when complex eigenvalues have modulus $\ne 1$) -* The action in real space: +By definition, $\lambda$ is an **eigenvalue** of $\mathbf{A}$ if: - * **No real eigenvector** - * Points are **rotated** or **rotated and scaled** - * Repeated application creates **circular** or **spiraling trajectories** +$$ +\exists \, \mathbf{x} \neq \mathbf{0} \text{ such that } \mathbf{A} \mathbf{x} = \lambda \mathbf{x} +$$ -#### Example: Stretching vs. Shearing vs. Rotation -* **Stretching**: scales space differently along the axes. The matrix has only real eigenvalues and eigenvectors. -* **Shearing**: shifts one axis direction while keeping the other fixed. The matrix has only real eigenvalues and eigenvectors. -* **Rotation**: turns everything around the origin. The matrix has only complex eigenvalues and eigenvectors. +Rewriting: -Each transformation is applied to a **unit square** and a **grid**, so you can clearly see how space is deformed under each linear map. -```{code-cell} ipython3 -:tags: [hide-input] -import numpy as np -import matplotlib.pyplot as plt +$$ +(\mathbf{A} - \lambda \mathbf{I}) \mathbf{x} = \mathbf{0} +$$ -def apply_transform(grid, matrix): - return np.tensordot(matrix, grid, axes=1) +This is a homogeneous linear system. +A **nontrivial solution** exists **if and only if** the matrix $\mathbf{A} - \lambda \mathbf{I}$ is **not invertible**, which according to the **fundamental equivalences for square matrices** is equivalent to: -def draw_transform(ax, matrix, title, color='red'): - # Draw original grid - x = np.linspace(-1, 1, 11) - y = np.linspace(-1, 1, 11) - for xi in x: - ax.plot([xi]*len(y), y, color='lightgray', lw=0.5) - for yi in y: - ax.plot(x, [yi]*len(x), color='lightgray', lw=0.5) +$$ +\det(\mathbf{A} - \lambda \mathbf{I}) = 0 +$$ - # Draw transformed grid - for xi in x: - line = np.stack(([xi]*len(y), y)) - transformed = apply_transform(line, matrix) - ax.plot(transformed[0], transformed[1], color=color, lw=1) - for yi in y: - line = np.stack((x, [yi]*len(x))) - transformed = apply_transform(line, matrix) - ax.plot(transformed[0], transformed[1], color=color, lw=1) +Therefore, the **eigenvalues are the roots of the characteristic polynomial** $p(\lambda)$. 
+::: - # Draw unit square before and after - square = np.array([[0, 1, 1, 0, 0], - [0, 0, 1, 1, 0]]) - transformed_square = matrix @ square - ax.plot(square[0], square[1], 'k--', label='Original square') - ax.plot(transformed_square[0], transformed_square[1], 'k-', label='Transformed square') - ax.set_aspect('equal') - ax.set_xlim(-2, 2) - ax.set_ylim(-2, 2) - ax.set_title(title) - ax.legend() +$$ +\mathbf{A} - \lambda \mathbf{I} = +\begin{bmatrix} +a_{11} - \lambda & a_{12} & \cdots & a_{1n} \\ +a_{21} & a_{22} - \lambda & \cdots & a_{2n} \\ +\vdots & \vdots & \ddots & \vdots \\ +a_{n1} & a_{n2} & \cdots & a_{nn} - \lambda +\end{bmatrix}. +$$ -# Define transformation matrices -stretch = np.array([[1.5, 0], - [0, 0.5]]) +Taking the determinant of this matrix yields a **polynomial in $\lambda$**. -shear = np.array([[1, 1], - [0, 1]]) +Each term in the determinant expansion is a product of $n$ entries, and due to the linearity in $\lambda$ of each diagonal term, the highest degree term in $\lambda$ is $(-\lambda)^n$. -theta = np.pi / 4 -rotation = np.array([[np.cos(theta), -np.sin(theta)], - [np.sin(theta), np.cos(theta)]]) +Hence: -# Plot all three -fig, axes = plt.subplots(1, 3, figsize=(15, 5)) -draw_transform(axes[0], stretch, "Stretching") -draw_transform(axes[1], shear, "Shearing") -draw_transform(axes[2], rotation, "Rotation") -plt.suptitle("Linear Transformations: Stretch vs Shear vs Rotation", fontsize=14) -plt.tight_layout(rect=[0, 0, 1, 0.95]) -plt.show() -``` +$$ +p(\lambda) = \det(\mathbf{A} - \lambda \mathbf{I}) = (-1)^n \lambda^n + c_{n-1} \lambda^{n-1} + \cdots + c_1 \lambda + c_0, +$$ -## Relationship between Eienvalues and Determinant -Interestingly, the determinant of a matrix is equal to the product of -its eigenvalues (repeated according to multiplicity): +for some coefficients $c_i \in \mathbb{R}$. -$$\det(\mathbf{A}) = \prod_i \lambda_i(\mathbf{A})$$ +Thus, $p(\lambda)$ is a **monic polynomial** of degree $n$. -This provides a means to find the eigenvalues by deriving the roots of the characteristic polynomial. +### **Example: Characteristic Polynomial of a 2×2 Matrix** -:::{prf:corollary} Characteristic Polynomial -:label: trm-characteristic-polynomial -:nonumber: +Here is the full derivation of the **characteristic polynomial** for a general $2 \times 2$ matrix, step by step: -The eigenvalues of a matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$ are the **roots of its characteristic polynomial** defined as: + +Let $$ -p(\lambda) = \det(\mathbf{A} - \lambda \mathbf{I}) +\mathbf{A} = \begin{bmatrix} +a & b \\ +c & d +\end{bmatrix} \in \mathbb{R}^{2 \times 2}. $$ -It is a degree-$n$ polynomial in $\lambda$, and its roots are precisely the eigenvalues of $\mathbf{A}$. +We want to compute the characteristic polynomial: -::: +$$ +p(\lambda) = \det(\mathbf{A} - \lambda \mathbf{I}). +$$ -:::{prf:proof} -By definition, $\lambda$ is an **eigenvalue** of $\mathbf{A}$ if: +#### Step 1: Subtract $\lambda \mathbf{I}$ $$ -\exists \, \mathbf{x} \neq \mathbf{0} \text{ such that } \mathbf{A} \mathbf{x} = \lambda \mathbf{x} +\mathbf{A} - \lambda \mathbf{I} = +\begin{bmatrix} +a - \lambda & b \\ +c & d - \lambda +\end{bmatrix}. $$ -Rewriting: +#### Step 2: Compute the determinant $$ -(\mathbf{A} - \lambda \mathbf{I}) \mathbf{x} = \mathbf{0} +p(\lambda) = \det(\mathbf{A} - \lambda \mathbf{I}) += (a - \lambda)(d - \lambda) - bc. $$ -This is a homogeneous linear system. 
-A **nontrivial solution** exists **if and only if** the matrix $\mathbf{A} - \lambda \mathbf{I}$ is **not invertible**, which means: +#### Step 3: Expand the polynomial $$ -\det(\mathbf{A} - \lambda \mathbf{I}) = 0 +p(\lambda) += ad - a\lambda - d\lambda + \lambda^2 - bc += \lambda^2 - (a + d)\lambda + (ad - bc). $$ -Therefore, the **eigenvalues are the roots of the characteristic polynomial** $p(\lambda)$. -::: +--- + +### **Interpretation** + +So the characteristic polynomial is: + +$$ +p(\lambda) = \lambda^2 - \mathrm{tr}(\mathbf{A})\lambda + \det(\mathbf{A}), +$$ + +where: + +* $\mathrm{tr}(\mathbf{A}) = a + d$ is the **trace**, +* $\det(\mathbf{A}) = ad - bc$ is the **determinant**. + +--- + +### **Eigenvalues** + +The eigenvalues are the roots of this quadratic polynomial: + +$$ +\lambda_{1,2} = \frac{1}{2} \left( \mathrm{tr}(\mathbf{A}) \pm \sqrt{ \mathrm{tr}(\mathbf{A})^2 - 4 \det(\mathbf{A}) } \right). +$$ + +## Relationship between the Trace of a Matrix and its Eigenvalues + +Interestingly, the trace of a matrix $\mathbf{A}\in\mathbb{R}^{n \times n}$ is equal to the sum of its eigenvalues (repeated according to multiplicity): + +$$\operatorname{tr}(\mathbf{A}) = \sum_{i=1}^n \lambda_i(\mathbf{A})$$ + +Note that this sum yields a real value even holds if $\mathbf{A}$ has complex eigenvalues. +The reason is that complex eigenvalues always appear in conjugate pairs. diff --git a/book/chapter_decompositions/orthogonal_matrices.md b/book/chapter_decompositions/orthogonal_matrices.md index 34cb013..262758a 100644 --- a/book/chapter_decompositions/orthogonal_matrices.md +++ b/book/chapter_decompositions/orthogonal_matrices.md @@ -20,8 +20,9 @@ This definition implies that $$\mathbf{Q}^{\!\top\!} \mathbf{Q} = \mathbf{Q}\mathbf{Q}^{\!\top\!} = \mathbf{I}$$ -or equivalently, $\mathbf{Q}^{\!\top\!} = \mathbf{Q}^{-1}$. A nice thing -about orthogonal matrices is that they preserve inner products: +or equivalently, $\mathbf{Q}^{\!\top\!} = \mathbf{Q}^{-1}$. + +A nice thing about orthogonal matrices is that they preserve inner products: $$(\mathbf{Q}\mathbf{x})^{\!\top\!}(\mathbf{Q}\mathbf{y}) = \mathbf{x}^{\!\top\!} \mathbf{Q}^{\!\top\!} \mathbf{Q}\mathbf{y} = \mathbf{x}^{\!\top\!} \mathbf{I}\mathbf{y} = \mathbf{x}^{\!\top\!}\mathbf{y}$$ @@ -105,5 +106,64 @@ This enhanced visualization shows how **orthogonal transformations** affect both ✅ This highlights that orthogonal matrices are **distance- and angle-preserving**, making them key to rigid transformations like rotations and reflections. -Would you like to include a numerical check that verifies length and angle invariance? 
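
A minimal numerical check of these invariances, using an arbitrary 30° rotation as the orthogonal matrix and two random test vectors (all chosen here purely for illustration):

```{code-cell} ipython3
import numpy as np

rng = np.random.default_rng(0)

# An example orthogonal matrix: rotation by 30 degrees
theta = np.pi / 6
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

x = rng.standard_normal(2)
y = rng.standard_normal(2)

print(np.allclose(Q.T @ Q, np.eye(2)))                       # Q^T Q = I
print(np.isclose(np.linalg.norm(Q @ x), np.linalg.norm(x)))  # lengths preserved
print(np.isclose(np.dot(Q @ x, Q @ y), np.dot(x, y)))        # inner products, hence angles, preserved
```

All three checks should print `True`, confirming the length- and angle-preservation discussed above.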
+--- +:::{prf:theorem} Determinant of an Orthogonal Matrix +:label: thm-determinant-orthogonal-matrix +:nonumber: + +Let $\mathbf{Q} \in \mathbb{R}^{n \times n}$ be an **orthogonal matrix**, meaning: + +$$ +\mathbf{Q}^\top \mathbf{Q} = \mathbf{I} +$$ + +Then: + +$$ +\boxed{ +\det(\mathbf{Q}) = \pm 1 +} +$$ +::: + +:::{prf:proof} + +We start with the identity: + +$$ +\mathbf{Q}^\top \mathbf{Q} = \mathbf{I} +$$ + +Now take the determinant of both sides: + +$$ +\det(\mathbf{Q}^\top \mathbf{Q}) = \det(\mathbf{I}) = 1 +$$ + +Using the **multiplicativity of determinants** and the fact that $\det(\mathbf{Q}^\top) = \det(\mathbf{Q})$ (since $\det(\mathbf{A}^\top) = \det(\mathbf{A})$): + +$$ +\det(\mathbf{Q}^\top) \cdot \det(\mathbf{Q}) = (\det(\mathbf{Q}))^2 = 1 +$$ + +Taking square roots: + +$$ +\boxed{ +\det(\mathbf{Q}) = \pm 1 +} +$$ + +Thus, the determinant of any orthogonal matrix is either $+1$ (rotation) or $-1$ (reflection). + +$\quad \blacksquare$ +::: +--- + +## 🧠 Interpretation + +* **$\det(\mathbf{Q}) = 1$**: The transformation preserves orientation — e.g., **rotation**. +* **$\det(\mathbf{Q}) = -1$**: The transformation flips orientation — e.g., **reflection**. + +This theorem is foundational in rigid body transformations, 3D graphics, PCA, and more. diff --git a/book/chapter_decompositions/symmetric_matrices.md b/book/chapter_decompositions/symmetric_matrices.md index 0db52c7..1480819 100644 --- a/book/chapter_decompositions/symmetric_matrices.md +++ b/book/chapter_decompositions/symmetric_matrices.md @@ -17,23 +17,33 @@ A matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$ is said to be ($\mathbf{A} = \mathbf{A}^{\!\top\!}$), meaning that $A_{ij} = A_{ji}$ for all $(i,j)$. -This definition seems harmless enough but turns out to -have some strong implications. We summarize the most important of these -as +This definition seems harmless but turns out to +have some strong implications. -*Theorem.* -(Spectral Theorem) If $\mathbf{A} \in \mathbb{R}^{n \times n}$ is +## Spectral Decopmosition + +:::{prf:theorem} Spectral Theorem +:label: trm-spectral-decomposition +:nonumber: + +If $\mathbf{A} \in \mathbb{R}^{n \times n}$ is symmetric, then there exists an orthonormal basis for $\mathbb{R}^n$ consisting of eigenvectors of $\mathbf{A}$. +::: The practical application of this theorem is a particular factorization of symmetric matrices, referred to as the **eigendecomposition** or -**spectral decomposition**. Denote the orthonormal basis of eigenvectors +**spectral decomposition**. + +Denote the orthonormal basis of eigenvectors $\mathbf{q}_1, \dots, \mathbf{q}_n$ and their eigenvalues -$\lambda_1, \dots, \lambda_n$. Let $\mathbf{Q}$ be an orthogonal matrix -with $\mathbf{q}_1, \dots, \mathbf{q}_n$ as its columns, and -$\mathbf{\Lambda} = \operatorname{diag}(\lambda_1, \dots, \lambda_n)$. Since by -definition $\mathbf{A}\mathbf{q}_i = \lambda_i\mathbf{q}_i$ for every +$\lambda_1, \dots, \lambda_n$. 
+ +Let $\mathbf{Q}$ be an orthogonal matrix with $\mathbf{q}_1, \dots, \mathbf{q}_n$ as its columns, and + +$$\mathbf{\Lambda} = \operatorname{diag}(\lambda_1, \dots, \lambda_n).$$ + +Since by definition $\mathbf{A}\mathbf{q}_i = \lambda_i\mathbf{q}_i$ for every $i$, the following relationship holds: $$\mathbf{A}\mathbf{Q} = \mathbf{Q}\mathbf{\Lambda}$$ @@ -43,86 +53,6 @@ by $\mathbf{Q}^{\!\top\!}$, we arrive at the decomposition $$\mathbf{A} = \mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^{\!\top\!}$$ -```{code-cell} ipython3 -:tags: [hide-input] -import numpy as np -import matplotlib.pyplot as plt - -# Define a symmetric matrix -A = np.array([[3, 1], - [1, 2]]) - -# Eigendecomposition -eigvals, eigvecs = np.linalg.eigh(A) # eigh guarantees real symmetric matrix handling -Λ = np.diag(eigvals) -Q = eigvecs - -# Confirm A = Q Λ Qᵀ -A_reconstructed = Q @ Λ @ Q.T - -# Create unit circle -theta = np.linspace(0, 2*np.pi, 100) -circle = np.stack((np.cos(theta), np.sin(theta))) - -# Transformations -circle_stretched = Λ @ circle -circle_eigen_transformed = Q @ circle_stretched -circle_direct = A @ circle - -# Plotting -fig, axes = plt.subplots(1, 3, figsize=(18, 6)) - -# Original unit circle -axes[0].plot(circle[0], circle[1], 'k--', label='Unit Circle') -axes[0].set_title("Original Space") -axes[0].set_xlim(-3, 3) -axes[0].set_ylim(-3, 3) - -# Stretch along eigenbasis -axes[1].plot(circle_stretched[0], circle_stretched[1], 'r-', label='Stretched (Λ)') -axes[1].quiver(0, 0, Λ[0, 0], 0, angles='xy', scale_units='xy', scale=1, color='blue', label='λ₁q₁') -axes[1].quiver(0, 0, 0, Λ[1, 1], angles='xy', scale_units='xy', scale=1, color='green', label='λ₂q₂') -axes[1].set_title("Stretch in Eigenbasis") -axes[1].set_xlim(-3, 3) -axes[1].set_ylim(-3, 3) -axes[1].legend() - -# Transform via Q Λ Qᵀ -axes[2].plot(circle_direct[0], circle_direct[1], 'purple', label='A ∘ circle') -axes[2].plot(circle_eigen_transformed[0], circle_eigen_transformed[1], 'orange', linestyle='--', label='QΛQᵀ ∘ circle') -axes[2].quiver(0, 0, *eigvecs[:, 0]*eigvals[0], angles='xy', scale_units='xy', scale=1, color='blue') -axes[2].quiver(0, 0, *eigvecs[:, 1]*eigvals[1], angles='xy', scale_units='xy', scale=1, color='green') -axes[2].set_title("Transformation by Symmetric A") -axes[2].set_xlim(-3, 3) -axes[2].set_ylim(-3, 3) -axes[2].legend() - -for ax in axes: - ax.set_aspect('equal') - ax.axhline(0, color='gray', lw=0.5) - ax.axvline(0, color='gray', lw=0.5) - ax.grid(True) - -plt.suptitle("Geometric Intuition of the Spectral Decomposition for Symmetric Matrices", fontsize=16) -plt.tight_layout(rect=[0, 0, 1, 0.93]) -plt.show() -``` - -This visualization gives geometric insight into the **spectral theorem for symmetric matrices**: - -1. **Left Panel** – The original unit circle in $\mathbb{R}^2$. -2. **Middle Panel** – The action of the diagonal matrix $\Lambda$ in the **eigenbasis**: stretching along coordinate axes defined by eigenvectors. -3. **Right Panel** – The full symmetric transformation $\mathbf{A} = \mathbf{Q} \Lambda \mathbf{Q}^\top$: - - * This first rotates into the eigenbasis (via $\mathbf{Q}^\top$), - * Then stretches (via $\Lambda$), - * Then rotates back (via $\mathbf{Q}$). - * Both $\mathbf{A} \circ \text{circle}$ and $\mathbf{Q}\Lambda\mathbf{Q}^\top \circ \text{circle}$ overlap perfectly. - -✅ This illustrates how symmetric matrices are always diagonalizable with **orthogonal eigenvectors**, and why they never induce rotation — only **axis-aligned stretching in some rotated basis**. 
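
A quick numerical sanity check of this factorization, as a sketch with an arbitrarily chosen symmetric matrix (`np.linalg.eigh` handles the symmetric case and returns real eigenvalues together with orthonormal eigenvectors):

```{code-cell} ipython3
import numpy as np

# Small symmetric matrix, chosen only for illustration
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])

eigvals, Q = np.linalg.eigh(A)   # eigenvalues (ascending) and orthonormal eigenvectors as columns
Lambda = np.diag(eigvals)

print(np.allclose(Q.T @ Q, np.eye(2)))   # columns of Q are orthonormal
print(np.allclose(Q @ Lambda @ Q.T, A))  # A = Q Λ Qᵀ
```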
- - - ### Quadratic forms Let $\mathbf{A} \in \mathbb{R}^{n \times n}$ be a symmetric matrix. @@ -143,464 +73,3 @@ This identity is valid for any square matrix (need not be symmetric), although quadratic forms are usually only discussed in the context of symmetric matrices. -### Rayleigh quotients - - -There turns out to be an interesting connection between the quadratic -form of a symmetric matrix and its eigenvalues. This connection is -provided by the **Rayleigh quotient** - -$$R_\mathbf{A}(\mathbf{x}) = \frac{\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x}}{\mathbf{x}^{\!\top\!}\mathbf{x}}$$ - -The Rayleigh quotient has a couple of important properties which the -reader can (and should!) easily verify from the definition: - -(i) **Scale invariance**: for any vector $\mathbf{x} \neq \mathbf{0}$ - and any scalar $\alpha \neq 0$, - $R_\mathbf{A}(\mathbf{x}) = R_\mathbf{A}(\alpha\mathbf{x})$. - -(ii) If $\mathbf{x}$ is an eigenvector of $\mathbf{A}$ with eigenvalue - $\lambda$, then $R_\mathbf{A}(\mathbf{x}) = \lambda$. - -We can further show that the Rayleigh quotient is bounded by the largest -and smallest eigenvalues of $\mathbf{A}$. But first we will show a -useful special case of the final result. - -*Proposition.* -For any $\mathbf{x}$ such that $\|\mathbf{x}\|_2 = 1$, - -$$\lambda_{\min}(\mathbf{A}) \leq \mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} \leq \lambda_{\max}(\mathbf{A})$$ - -with equality if and only if $\mathbf{x}$ is a corresponding -eigenvector. - -*Proof.* We show only the $\max$ case because the argument for the -$\min$ case is entirely analogous. - -Since $\mathbf{A}$ is symmetric, we can decompose it as -$\mathbf{A} = \mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^{\!\top\!}$. Then use -the change of variable $\mathbf{y} = \mathbf{Q}^{\!\top\!}\mathbf{x}$, -noting that the relationship between $\mathbf{x}$ and $\mathbf{y}$ is -one-to-one and that $\|\mathbf{y}\|_2 = 1$ since $\mathbf{Q}$ is -orthogonal. Hence - -$$\max_{\|\mathbf{x}\|_2 = 1} \mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} = \max_{\|\mathbf{y}\|_2 = 1} \mathbf{y}^{\!\top\!}\mathbf{\Lambda}\mathbf{y} = \max_{y_1^2+\dots+y_n^2=1} \sum_{i=1}^n \lambda_i y_i^2$$ - -Written this way, it is clear that $\mathbf{y}$ maximizes this -expression exactly if and only if it satisfies -$\sum_{i \in I} y_i^2 = 1$ where -$I = \{i : \lambda_i = \max_{j=1,\dots,n} \lambda_j = \lambda_{\max}(\mathbf{A})\}$ -and $y_j = 0$ for $j \not\in I$. That is, $I$ contains the index or -indices of the largest eigenvalue. In this case, the maximal value of -the expression is - -$$\sum_{i=1}^n \lambda_i y_i^2 = \sum_{i \in I} \lambda_i y_i^2 = \lambda_{\max}(\mathbf{A}) \sum_{i \in I} y_i^2 = \lambda_{\max}(\mathbf{A})$$ - -Then writing $\mathbf{q}_1, \dots, \mathbf{q}_n$ for the columns of -$\mathbf{Q}$, we have - -$$\mathbf{x} = \mathbf{Q}\mathbf{Q}^{\!\top\!}\mathbf{x} = \mathbf{Q}\mathbf{y} = \sum_{i=1}^n y_i\mathbf{q}_i = \sum_{i \in I} y_i\mathbf{q}_i$$ - -where we have used the matrix-vector product identity. - -Recall that $\mathbf{q}_1, \dots, \mathbf{q}_n$ are eigenvectors of -$\mathbf{A}$ and form an orthonormal basis for $\mathbb{R}^n$. Therefore -by construction, the set $\{\mathbf{q}_i : i \in I\}$ forms an -orthonormal basis for the eigenspace of $\lambda_{\max}(\mathbf{A})$. -Hence $\mathbf{x}$, which is a linear combination of these, lies in that -eigenspace and thus is an eigenvector of $\mathbf{A}$ corresponding to -$\lambda_{\max}(\mathbf{A})$. 
- -We have shown that -$\max_{\|\mathbf{x}\|_2 = 1} \mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} = \lambda_{\max}(\mathbf{A})$, -from which we have the general inequality -$\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} \leq \lambda_{\max}(\mathbf{A})$ -for all unit-length $\mathbf{x}$. ◻ - - -By the scale invariance of the Rayleigh quotient, we immediately have as -a corollary (since -$\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} = R_{\mathbf{A}}(\mathbf{x})$ -for unit $\mathbf{x}$) - -*Theorem.* -(Min-max theorem) For all $\mathbf{x} \neq \mathbf{0}$, - -$$\lambda_{\min}(\mathbf{A}) \leq R_\mathbf{A}(\mathbf{x}) \leq \lambda_{\max}(\mathbf{A})$$ - -with equality if and only if $\mathbf{x}$ is a corresponding -eigenvector. - -```{code-cell} ipython3 -:tags: [hide-input] -import numpy as np -import matplotlib.pyplot as plt - -# Define symmetric matrix -A = np.array([[2, 1], - [1, 3]]) - -# Eigenvalues and eigenvectors -eigvals, eigvecs = np.linalg.eigh(A) -λ_min, λ_max = eigvals - -# Generate unit circle points -theta = np.linspace(0, 2*np.pi, 300) -circle = np.stack((np.cos(theta), np.sin(theta))) - -# Rayleigh quotient computation -R = np.einsum('ij,ji->i', circle.T @ A, circle) # x^T A x -R /= np.einsum('ij,ji->i', circle.T, circle) # x^T x - -# Rayleigh extrema -idx_min = np.argmin(R) -idx_max = np.argmax(R) -x_min = circle[:, idx_min] -x_max = circle[:, idx_max] - -# Prepare grid for quadratic form level sets -x = np.linspace(-2, 2, 400) -y = np.linspace(-2, 2, 400) -X, Y = np.meshgrid(x, y) -XY = np.stack((X, Y), axis=-1) -Z = np.einsum('...i,ij,...j->...', XY, A, XY) -levels = np.linspace(np.min(Z), np.max(Z), 20) - -# Create combined figure -fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6)) - -# Left: Rayleigh quotient on unit circle -sc = ax1.scatter(circle[0], circle[1], c=R, cmap='viridis', s=10) -ax1.quiver(0, 0, x_min[0], x_min[1], color='red', scale=1, scale_units='xy', angles='xy', label='argmin R(x)') -ax1.quiver(0, 0, x_max[0], x_max[1], color='orange', scale=1, scale_units='xy', angles='xy', label='argmax R(x)') -for i in range(2): - eigvec = eigvecs[:, i] - ax1.quiver(0, 0, eigvec[0], eigvec[1], color='black', alpha=0.5, scale=1, scale_units='xy', angles='xy', width=0.008) -ax1.set_title("Rayleigh Quotient on the Unit Circle") -ax1.set_aspect('equal') -ax1.set_xlim(-1.1, 1.1) -ax1.set_ylim(-1.1, 1.1) -ax1.grid(True) -ax1.legend() -plt.colorbar(sc, ax=ax1, label="Rayleigh Quotient $R_A(\\mathbf{x})$") - -# Right: Level sets of quadratic form -contour = ax2.contour(X, Y, Z, levels=levels, cmap='viridis') -ax2.clabel(contour, inline=True, fontsize=8, fmt="%.1f") -ax2.set_title("Level Sets of $\\mathbf{x}^\\top \\mathbf{A} \\mathbf{x}$") -ax2.set_xlabel("$x_1$") -ax2.set_ylabel("$x_2$") -ax2.axhline(0, color='gray', lw=0.5) -ax2.axvline(0, color='gray', lw=0.5) -for i in range(2): - vec = eigvecs[:, i] * np.sqrt(eigvals[i]) - ax2.quiver(0, 0, vec[0], vec[1], color='red', scale=1, scale_units='xy', angles='xy', width=0.01, label=f"$\\mathbf{{q}}_{i+1}$") -ax2.set_aspect('equal') -ax2.legend() - -plt.suptitle("Rayleigh Quotient and Quadratic Form Level Sets", fontsize=16) -plt.tight_layout(rect=[0, 0, 1, 0.93]) -plt.show() -``` - -This combined visualization brings together the **Rayleigh quotient** and the **level sets of the quadratic form** $\mathbf{x}^\top \mathbf{A} \mathbf{x}$: - -* **Left panel**: Rayleigh quotient $R_\mathbf{A}(\mathbf{x})$ on the unit circle - - * Color shows how the value varies with direction. 
- * Extremes occur at eigenvector directions (marked with arrows). - -* **Right panel**: Level sets (contours) of the quadratic form - - * Elliptical shapes aligned with eigenvectors. - * Red vectors indicate principal axes (scaled eigenvectors). - -Together, these panels illustrate how the **direction of a vector determines how strongly it is scaled** by the symmetric matrix, and how this scaling relates to the matrix's **eigenstructure**. - -✅ As guaranteed by the **Min–Max Theorem**, the maximum and minimum of the Rayleigh quotient occur precisely at the **eigenvectors corresponding to the largest and smallest eigenvalues**. - - - ---- - -## ✅ Theorem: Real symmetric matrices cannot produce rotation - -### 🧾 Statement - -Let $\mathbf{A} \in \mathbb{R}^{n \times n}$ be a **real symmetric matrix**. Then: - -> The linear transformation $\mathbf{x} \mapsto \mathbf{A}\mathbf{x}$ **does not rotate** vectors — i.e., it cannot produce a transformation that changes the direction of a vector **without preserving its span**. - -In particular: - -* The transformation **does not rotate angles** -* The transformation has a basis of **orthogonal eigenvectors** -* Therefore, all action is **stretching/compressing along fixed directions**, not rotation - ---- - -## 🧠 Intuition - -Rotation mixes directions. But symmetric matrices: - -* Have **real eigenvalues** -* Are **orthogonally diagonalizable** -* Have **mutually orthogonal eigenvectors** - -So the matrix acts by **scaling along fixed orthogonal axes**, without changing the direction between basis vectors — i.e., no twisting, hence no rotation. - ---- - -## ✏️ Proof (2D case, generalizes easily) - -Let $\mathbf{A} \in \mathbb{R}^{2 \times 2}$ be symmetric: - -$$ -\mathbf{A} = \begin{pmatrix} a & b \\ b & d \end{pmatrix} -$$ - -We’ll show that $\mathbf{A}$ cannot produce a true rotation. - -### Step 1: Diagonalize $\mathbf{A}$ - -Because $\mathbf{A}$ is real symmetric, there exists an orthogonal matrix $\mathbf{Q}$ and diagonal $\mathbf{\Lambda}$ such that: - -$$ -\mathbf{A} = \mathbf{Q} \mathbf{\Lambda} \mathbf{Q}^\top -$$ - -That is, $\mathbf{A}$ acts as: - -* A rotation (or reflection) $\mathbf{Q}^\top$ -* A stretch along axes $\mathbf{\Lambda}$ -* A second rotation (or reflection) $\mathbf{Q}$ - -But since $\mathbf{Q}$ and $\mathbf{Q}^\top$ cancel out geometrically (they are transposes of each other), this results in: - -> A transformation that **scales but does not rotate** relative to the basis of eigenvectors. - -### Step 2: Show $\mathbf{A}$ preserves alignment - -Let $\mathbf{v}$ be any eigenvector of $\mathbf{A}$. Then: - -$$ -\mathbf{A} \mathbf{v} = \lambda \mathbf{v} -$$ - -So $\mathbf{v}$ is **mapped to a scalar multiple of itself** — its **direction doesn’t change**. - -Because $\mathbb{R}^2$ has two linearly independent eigenvectors (since symmetric matrices are always diagonalizable), **no vector is rotated out of its original span** — just scaled. - -Hence, the transformation only **stretches**, **compresses**, or **reflects**, but never rotates. - ---- - -## 🚫 Counterexample: Rotation matrix is not symmetric - -The rotation matrix: - -$$ -\mathbf{R}_\theta = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} -$$ - -is **not symmetric** unless $\theta = 0$ or $\pi$, where it reduces to identity or negation. - -It **does not** have real eigenvectors (except at those degenerate angles), and it **rotates** all directions. 
- ---- - -## ✅ Conclusion - -**Rotation requires asymmetry.** - -If a linear transformation rotates vectors (changes direction without preserving alignment), the matrix must be **non-symmetric**. - ---- - -## ✅ Corollary - -A matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$ can perform rotation **only if**: - -* It is **not symmetric**, and -* It has **complex eigenvalues** (at least in 2D rotation) - -Excellent and important question. The answer is: - -> ❗️**Not all non-symmetric matrices have an eigen-decomposition over $\mathbb{R}$ or even $\mathbb{C}$.** - -Let’s unpack what this means. - ---- - -## ✅ What is an Eigen-Decomposition? - -An **eigen-decomposition** of a matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$ means: - -$$ -\mathbf{A} = \mathbf{V} \mathbf{\Lambda} \mathbf{V}^{-1} -$$ - -Where: - -* $\mathbf{\Lambda}$ is a diagonal matrix of eigenvalues -* $\mathbf{V}$ contains eigenvectors as columns -* $\mathbf{V}^{-1}$ exists (i.e., $\mathbf{V}$ is invertible) - -This decomposition is **only possible when $\mathbf{A}$ is diagonalizable**. - ---- - -## ❌ Not All Matrices Are Diagonalizable - -A matrix is **not diagonalizable** if: - -* It **does not have enough linearly independent eigenvectors** (i.e., the geometric multiplicity < algebraic multiplicity) - -This can happen even if all the eigenvalues are real! - -### 🔴 Example (Defective Matrix): - -$$ -\mathbf{A} = \begin{pmatrix} -1 & 1 \\ -0 & 1 -\end{pmatrix} -$$ - -* Eigenvalue: $\lambda = 1$ -* But only **one** linearly independent eigenvector -* So it **cannot be diagonalized** - ---- - -## ✅ When Does a Matrix Have an Eigen-Decomposition? - -| Matrix Type | Diagonalizable? | Notes | -| ------------------------------ | --------------- | ------------------------------------------------ | -| Symmetric (real) | ✅ Always | Eigen-decomposition with orthogonal eigenvectors | -| Diagonalizable (in general) | ✅ Yes | Can write $A = V \Lambda V^{-1}$ | -| Defective (non-diagonalizable) | ❌ No | Needs Jordan form instead | - ---- - -## 🔁 Jordan Decomposition: The General Replacement - -If a matrix is **not diagonalizable**, it still has a **Jordan decomposition**: - -$$ -\mathbf{A} = \mathbf{P} \mathbf{J} \mathbf{P}^{-1} -$$ - -Where: - -* $\mathbf{J}$ is **block diagonal**: eigenvalues + possible **Jordan blocks** -* This captures **generalized eigenvectors** - -So **every square matrix** has a **Jordan decomposition**, but **not every one has an eigen-decomposition**. - ---- - -## ✅ Summary - -* **Symmetric matrices**: always have an eigen-decomposition (with real, orthogonal eigenvectors) -* **Non-symmetric matrices**: - - * May have a complete eigen-decomposition (if diagonalizable) - * May **not**, if they are **defective** -* In the general case, you must use **Jordan form** - -A matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$ has **complex eigenvalues or eigenvectors** when: - -### ✅ 1. The matrix is **not symmetric** (i.e., $\mathbf{A} \ne \mathbf{A}^\top$) - -* Real symmetric matrices **always** have **real** eigenvalues and orthogonal eigenvectors. -* Non-symmetric real matrices can have complex eigenvalues and eigenvectors. - -### ✅ 2. The **characteristic polynomial** has **complex roots** - -For example, consider: - -$$ -\mathbf{A} = \begin{pmatrix} 0 & -2 \\ 1 & 0 \end{pmatrix} -$$ - -Its characteristic polynomial is: - -$$ -\det(\mathbf{A} - \lambda \mathbf{I}) = \lambda^2 + 1 -$$ - -The roots are: - -$$ -\lambda = \pm i -$$ - -So it has **pure imaginary eigenvalues**, and its eigenvectors are also **complex**. 
-## ✅ Quick Answer: - -The eigenvectors and their transformed versions $\mathbf{A} \mathbf{v} = \lambda \mathbf{v}$ **are** parallel — **but only in complex vector space** $\mathbb{C}^n$. - -In **real space**, we usually visualize: - -* The **real part** of a complex vector: $\mathrm{Re}(\mathbf{v})$ -* The **imaginary part** of a complex vector: $\mathrm{Im}(\mathbf{v})$ - -But neither of these alone is invariant under multiplication by $\lambda \in \mathbb{C}$. So when you look at: - -$$ -\mathbf{v} = \mathrm{Re}(\mathbf{v}) + i \cdot \mathrm{Im}(\mathbf{v}) -$$ - -and apply $\mathbf{A}$, what you see in the real plane is: - -$$ -\mathrm{Re}(\mathbf{A} \mathbf{v}) \quad \text{vs.} \quad \mathrm{Re}(\lambda \mathbf{v}) -$$ - -These are **not scalar multiples** of $\mathrm{Re}(\mathbf{v})$ or $\mathrm{Im}(\mathbf{v})$, because complex scaling **mixes real and imaginary components** — unless $\lambda$ is real. - ---- - -## 🔍 Example - -Say: - -$$ -\lambda = a + ib, \quad \mathbf{v} = \begin{pmatrix} x + iy \\ z + iw \end{pmatrix} -$$ - -Then: - -$$ -\lambda \mathbf{v} = (a + ib)(\text{real} + i \cdot \text{imag}) = \text{mix of real and imaginary} -$$ - -So $\mathbf{A} \mathbf{v} = \lambda \mathbf{v}$, but $\mathrm{Re}(\mathbf{A} \mathbf{v})$ will **not be parallel** to $\mathrm{Re}(\mathbf{v})$ alone — it's a rotated and scaled mixture. - ---- - -## 🧠 Bottom Line - -> **Eigenvectors and their transformations are parallel in $\mathbb{C}^n$, but not necessarily in $\mathbb{R}^n$.** - - -> Note: The eigenvectors and their transformations are parallel in complex space, but their real and imaginary parts generally point in different directions due to complex scaling (rotation + stretch). - ---- - -## 🧠 Intuition - -* Complex eigenvalues often indicate **rotational behavior** in linear dynamical systems. -* The matrix above rotates vectors by 90° and has no real direction that stays on its span after transformation — hence no real eigenvectors. - ---- - -## 🔄 Summary - -| Matrix Type | Eigenvalues | Eigenvectors | -| ------------------ | ----------------- | ----------------- | -| Symmetric real | Real | Real & orthogonal | -| Non-symmetric real | Real or complex | Real or complex | -| Complex (any) | Complex (general) | Complex (general) | - diff --git a/book/chapter_decompositions/trace.md b/book/chapter_decompositions/trace.md index 2eb75a5..a3cef18 100644 --- a/book/chapter_decompositions/trace.md +++ b/book/chapter_decompositions/trace.md @@ -27,82 +27,14 @@ The trace has several nice algebraic properties: (iv) $\operatorname{tr}(\mathbf{A}\mathbf{B}\mathbf{C}\mathbf{D}) = \operatorname{tr}(\mathbf{B}\mathbf{C}\mathbf{D}\mathbf{A}) = \operatorname{tr}(\mathbf{C}\mathbf{D}\mathbf{A}\mathbf{B}) = \operatorname{tr}(\mathbf{D}\mathbf{A}\mathbf{B}\mathbf{C})$ The first three properties follow readily from the definition. + The last is known as **invariance under cyclic permutations**. + Note that the matrices cannot be reordered arbitrarily, for example $\operatorname{tr}(\mathbf{A}\mathbf{B}\mathbf{C}\mathbf{D}) \neq \operatorname{tr}(\mathbf{B}\mathbf{A}\mathbf{C}\mathbf{D})$ in general. -Also, there is nothing special about the product of four matrices -- analogous rules hold for more or fewer matrices. 
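
A short numerical illustration of these rules, as a sketch with randomly generated matrices (the sizes and seed are arbitrary):

```{code-cell} ipython3
import numpy as np

rng = np.random.default_rng(0)
A, B, C, D = (rng.standard_normal((3, 3)) for _ in range(4))

t = np.trace(A @ B @ C @ D)
print(np.isclose(t, np.trace(B @ C @ D @ A)))  # cyclic permutation: equal
print(np.isclose(t, np.trace(C @ D @ A @ B)))  # cyclic permutation: equal
print(np.isclose(t, np.trace(B @ A @ C @ D)))  # non-cyclic reordering: in general not equal
```

The first two checks should print `True`, while the last one will typically print `False` for randomly drawn matrices.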
- - -```{code-cell} ipython3 -:tags: [hide-input] -import numpy as np -import matplotlib.pyplot as plt - -# Set up matrices for demonstration -A = np.array([[2, 1], - [0, 3]]) -B = np.array([[1, -1], - [2, 0]]) -alpha = 2.5 - -# Compute traces -trace_A = np.trace(A) -trace_B = np.trace(B) -trace_A_plus_B = np.trace(A + B) -trace_alphaA = np.trace(alpha * A) -trace_AT = np.trace(A.T) - -# Cyclic permutation example -C = np.array([[0, 2], - [1, 1]]) -D = np.array([[1, 1], - [0, -1]]) -product_1 = A @ B @ C @ D -product_2 = B @ C @ D @ A -product_3 = C @ D @ A @ B -product_4 = D @ A @ B @ C - -traces = [ - np.trace(product_1), - np.trace(product_2), - np.trace(product_3), - np.trace(product_4), -] - -# Plotting -fig, axes = plt.subplots(2, 2, figsize=(12, 10)) - -# (i) Linearity -axes[0, 0].bar(['tr(A)', 'tr(B)', 'tr(A+B)'], [trace_A, trace_B, trace_A_plus_B], - color=['blue', 'green', 'purple']) -axes[0, 0].set_title("Linearity: tr(A + B) = tr(A) + tr(B)") -axes[0, 0].axhline(trace_A + trace_B, color='gray', linestyle='--', label='Expected tr(A) + tr(B)') -axes[0, 0].legend() - -# (ii) Scalar multiplication -axes[0, 1].bar(['tr(A)', 'tr(αA)'], [trace_A, trace_alphaA], color=['blue', 'orange']) -axes[0, 1].axhline(alpha * trace_A, color='gray', linestyle='--', label='Expected α·tr(A)') -axes[0, 1].set_title("Scaling: tr(αA) = α·tr(A)") -axes[0, 1].legend() - -# (iii) Transpose invariance -axes[1, 0].bar(['tr(A)', 'tr(Aᵀ)'], [trace_A, trace_AT], color=['blue', 'red']) -axes[1, 0].set_title("Transpose: tr(Aᵀ) = tr(A)") - -# (iv) Cyclic permutation invariance -axes[1, 1].bar(['ABCD', 'BCDA', 'CDAB', 'DABC'], traces, color='teal') -axes[1, 1].axhline(traces[0], color='gray', linestyle='--', label='Invariant trace') -axes[1, 1].set_title("Cyclic Permutation: tr(ABCD) = tr(BCDA) = ...") -axes[1, 1].legend() - -plt.suptitle("Visualizing the Properties of the Trace Operator", fontsize=16) -plt.tight_layout(rect=[0, 0, 1, 0.95]) -plt.show() +Also, there is nothing special about the product of four matrices -- analogous rules hold for more or fewer matrices. 
-``` -Interestingly, the trace of a matrix is equal to the sum of its eigenvalues (repeated according to multiplicity): -$$\operatorname{tr}(\mathbf{A}) = \sum_i \lambda_i(\mathbf{A})$$ From b13e57498d6b5b21a94e103e65d86b9bcad75dfb Mon Sep 17 00:00:00 2001 From: clippert Date: Wed, 14 May 2025 00:01:40 +0200 Subject: [PATCH 19/43] merged w05 --- book/_toc.yml | 1 + 1 file changed, 1 insertion(+) diff --git a/book/_toc.yml b/book/_toc.yml index 3ed1d30..69d98e9 100644 --- a/book/_toc.yml +++ b/book/_toc.yml @@ -86,6 +86,7 @@ parts: - file: chapter_decompositions/eigenvectors - file: chapter_decompositions/orthogonal_matrices - file: chapter_decompositions/symmetric_matrices +# - file: chapter_decompositions/Rayleigh_quotients # - file: chapter_decompositions/psd_matrices # - file: chapter_decompositions/svd # - file: chapter_decompositions/big_picture From 9399e325aae5726d537d15d87a35f4c5e52b07ca Mon Sep 17 00:00:00 2001 From: clippert Date: Wed, 14 May 2025 15:08:59 +0200 Subject: [PATCH 20/43] material w06 --- book/_toc.yml | 10 +-- .../Rayleigh_quotients.md | 89 +++++++------------ book/chapter_decompositions/eigenvectors.md | 4 + .../symmetric_matrices.md | 2 +- 4 files changed, 40 insertions(+), 65 deletions(-) diff --git a/book/_toc.yml b/book/_toc.yml index 69d98e9..52b0024 100644 --- a/book/_toc.yml +++ b/book/_toc.yml @@ -83,13 +83,13 @@ parts: - file: chapter_decompositions/row_equivalence - file: chapter_decompositions/square_matrices - file: chapter_decompositions/trace - - file: chapter_decompositions/eigenvectors + - file: chapter_decompositions/eigenvectors # end week 05 - file: chapter_decompositions/orthogonal_matrices - file: chapter_decompositions/symmetric_matrices -# - file: chapter_decompositions/Rayleigh_quotients -# - file: chapter_decompositions/psd_matrices -# - file: chapter_decompositions/svd -# - file: chapter_decompositions/big_picture +# - file: chapter_decompositions/Rayleigh_quotients # skip for now + - file: chapter_decompositions/psd_matrices # PCA as example + - file: chapter_decompositions/svd # + - file: chapter_decompositions/big_picture # - file: chapter_decompositions/pseudoinverse # - file: chapter_decompositions/matrix_norms # - file: chapter_convexity/overview_convexity diff --git a/book/chapter_decompositions/Rayleigh_quotients.md b/book/chapter_decompositions/Rayleigh_quotients.md index 894bce8..1ab9a43 100644 --- a/book/chapter_decompositions/Rayleigh_quotients.md +++ b/book/chapter_decompositions/Rayleigh_quotients.md @@ -1,5 +1,4 @@ -# Rayleigh quotients - +# Rayleigh Quotients There turns out to be an interesting connection between the quadratic form of a symmetric matrix and its eigenvalues. This connection is @@ -21,23 +20,32 @@ We can further show that the Rayleigh quotient is bounded by the largest and smallest eigenvalues of $\mathbf{A}$. But first we will show a useful special case of the final result. -*Proposition.* +:::{prf:theorem} Bound Rayleigh Quotient +:label: trm-bound-Rayleigh-quotient +:nonumber: + For any $\mathbf{x}$ such that $\|\mathbf{x}\|_2 = 1$, $$\lambda_{\min}(\mathbf{A}) \leq \mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} \leq \lambda_{\max}(\mathbf{A})$$ with equality if and only if $\mathbf{x}$ is a corresponding eigenvector. +::: -*Proof.* We show only the $\max$ case because the argument for the +:::{prf:proof} +We show only the $\max$ case because the argument for the $\min$ case is entirely analogous. 
Since $\mathbf{A}$ is symmetric, we can decompose it as -$\mathbf{A} = \mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^{\!\top\!}$. Then use +$\mathbf{A} = \mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^{\!\top\!}$. + +Then use the change of variable $\mathbf{y} = \mathbf{Q}^{\!\top\!}\mathbf{x}$, noting that the relationship between $\mathbf{x}$ and $\mathbf{y}$ is one-to-one and that $\|\mathbf{y}\|_2 = 1$ since $\mathbf{Q}$ is -orthogonal. Hence +orthogonal. + +Hence $$\max_{\|\mathbf{x}\|_2 = 1} \mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} = \max_{\|\mathbf{y}\|_2 = 1} \mathbf{y}^{\!\top\!}\mathbf{\Lambda}\mathbf{y} = \max_{y_1^2+\dots+y_n^2=1} \sum_{i=1}^n \lambda_i y_i^2$$ @@ -45,7 +53,9 @@ Written this way, it is clear that $\mathbf{y}$ maximizes this expression exactly if and only if it satisfies $\sum_{i \in I} y_i^2 = 1$ where $I = \{i : \lambda_i = \max_{j=1,\dots,n} \lambda_j = \lambda_{\max}(\mathbf{A})\}$ -and $y_j = 0$ for $j \not\in I$. That is, $I$ contains the index or +and $y_j = 0$ for $j \not\in I$. + +That is, $I$ contains the index or indices of the largest eigenvalue. In this case, the maximal value of the expression is @@ -59,8 +69,9 @@ $$\mathbf{x} = \mathbf{Q}\mathbf{Q}^{\!\top\!}\mathbf{x} = \mathbf{Q}\mathbf{y} where we have used the matrix-vector product identity. Recall that $\mathbf{q}_1, \dots, \mathbf{q}_n$ are eigenvectors of -$\mathbf{A}$ and form an orthonormal basis for $\mathbb{R}^n$. Therefore -by construction, the set $\{\mathbf{q}_i : i \in I\}$ forms an +$\mathbf{A}$ and form an orthonormal basis for $\mathbb{R}^n$. + +Therefore by construction, the set $\{\mathbf{q}_i : i \in I\}$ forms an orthonormal basis for the eigenspace of $\lambda_{\max}(\mathbf{A})$. Hence $\mathbf{x}$, which is a linear combination of these, lies in that eigenspace and thus is an eigenvector of $\mathbf{A}$ corresponding to @@ -71,20 +82,24 @@ $\max_{\|\mathbf{x}\|_2 = 1} \mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} = \lambda from which we have the general inequality $\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} \leq \lambda_{\max}(\mathbf{A})$ for all unit-length $\mathbf{x}$. ◻ - +::: By the scale invariance of the Rayleigh quotient, we immediately have as a corollary (since $\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} = R_{\mathbf{A}}(\mathbf{x})$ for unit $\mathbf{x}$) -*Theorem.* -(Min-max theorem) For all $\mathbf{x} \neq \mathbf{0}$, +:::{prf:theorem} Min-Max Theorem +:label: trm-min-max +:nonumber: + +For all $\mathbf{x} \neq \mathbf{0}$, $$\lambda_{\min}(\mathbf{A}) \leq R_\mathbf{A}(\mathbf{x}) \leq \lambda_{\max}(\mathbf{A})$$ with equality if and only if $\mathbf{x}$ is a corresponding eigenvector. +::: ```{code-cell} ipython3 :tags: [hide-input] @@ -236,7 +251,9 @@ But since $\mathbf{Q}$ and $\mathbf{Q}^\top$ cancel out geometrically (they are ### Step 2: Show $\mathbf{A}$ preserves alignment -Let $\mathbf{v}$ be any eigenvector of $\mathbf{A}$. Then: +Let $\mathbf{v}$ be any eigenvector of $\mathbf{A}$. + +Then: $$ \mathbf{A} \mathbf{v} = \lambda \mathbf{v} @@ -279,52 +296,6 @@ A matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$ can perform rotation **only if * It is **not symmetric**, and * It has **complex eigenvalues** (at least in 2D rotation) -Excellent and important question. The answer is: - -> ❗️**Not all non-symmetric matrices have an eigen-decomposition over $\mathbb{R}$ or even $\mathbb{C}$.** - -Let’s unpack what this means. - ---- - -## ✅ What is an Eigen-Decomposition? 
- -An **eigen-decomposition** of a matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$ means: - -$$ -\mathbf{A} = \mathbf{V} \mathbf{\Lambda} \mathbf{V}^{-1} -$$ - -Where: - -* $\mathbf{\Lambda}$ is a diagonal matrix of eigenvalues -* $\mathbf{V}$ contains eigenvectors as columns -* $\mathbf{V}^{-1}$ exists (i.e., $\mathbf{V}$ is invertible) - -This decomposition is **only possible when $\mathbf{A}$ is diagonalizable**. - ---- - -## ❌ Not All Matrices Are Diagonalizable - -A matrix is **not diagonalizable** if: - -* It **does not have enough linearly independent eigenvectors** (i.e., the geometric multiplicity < algebraic multiplicity) - -This can happen even if all the eigenvalues are real! - -### 🔴 Example (Defective Matrix): - -$$ -\mathbf{A} = \begin{pmatrix} -1 & 1 \\ -0 & 1 -\end{pmatrix} -$$ - -* Eigenvalue: $\lambda = 1$ -* But only **one** linearly independent eigenvector -* So it **cannot be diagonalized** --- diff --git a/book/chapter_decompositions/eigenvectors.md b/book/chapter_decompositions/eigenvectors.md index 76f453e..a21104e 100644 --- a/book/chapter_decompositions/eigenvectors.md +++ b/book/chapter_decompositions/eigenvectors.md @@ -463,6 +463,8 @@ $$ \lambda_{1,2} = \frac{1}{2} \left( \mathrm{tr}(\mathbf{A}) \pm \sqrt{ \mathrm{tr}(\mathbf{A})^2 - 4 \det(\mathbf{A}) } \right). $$ + + ## Relationship between the Trace of a Matrix and its Eigenvalues Interestingly, the trace of a matrix $\mathbf{A}\in\mathbb{R}^{n \times n}$ is equal to the sum of its eigenvalues (repeated according to multiplicity): @@ -472,3 +474,5 @@ $$\operatorname{tr}(\mathbf{A}) = \sum_{i=1}^n \lambda_i(\mathbf{A})$$ Note that this sum yields a real value even holds if $\mathbf{A}$ has complex eigenvalues. The reason is that complex eigenvalues always appear in conjugate pairs. + + diff --git a/book/chapter_decompositions/symmetric_matrices.md b/book/chapter_decompositions/symmetric_matrices.md index 1480819..5115bd2 100644 --- a/book/chapter_decompositions/symmetric_matrices.md +++ b/book/chapter_decompositions/symmetric_matrices.md @@ -60,7 +60,6 @@ Let $\mathbf{A} \in \mathbb{R}^{n \times n}$ be a symmetric matrix. The expression $\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x}$ is called a **quadratic form**. - Let $\mathbf{A} \in \mathbb{R}^{n \times n}$ be a symmetric matrix, and recall that the expression $\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x}$ is called a quadratic form of $\mathbf{A}$. It is in some cases helpful @@ -73,3 +72,4 @@ This identity is valid for any square matrix (need not be symmetric), although quadratic forms are usually only discussed in the context of symmetric matrices. +We have seen quadratic forms in the context of quadratic optimization problems, where the goal was to minimize a quadratic form. 
\ No newline at end of file From 9f36e599b253e2de04d1c14b55d569d66a0923ee Mon Sep 17 00:00:00 2001 From: Arman Beykmohammadi Date: Thu, 15 May 2025 21:40:52 +0200 Subject: [PATCH 21/43] week 1 and 2 sheet solutions added to the book --- book/_toc.yml | 7 + book/appendix/Exercise Sheet 1 Solutions.md | 60 +++ book/appendix/Exercise Sheet 2 Solutions.md | 422 ++++++++++++++++++++ book/appendix/Exercise Sheet Solutions.md | 6 + 4 files changed, 495 insertions(+) create mode 100644 book/appendix/Exercise Sheet 1 Solutions.md create mode 100644 book/appendix/Exercise Sheet 2 Solutions.md create mode 100644 book/appendix/Exercise Sheet Solutions.md diff --git a/book/_toc.yml b/book/_toc.yml index 69d98e9..1a7b5e4 100644 --- a/book/_toc.yml +++ b/book/_toc.yml @@ -163,6 +163,13 @@ parts: title: Clairaut's Theorem - file: appendix/differentiation_rules title: Differentiation Rules + - file: appendix/Exercise Sheet Solutions.md + title: Exercise Sheet Solutions + sections: + - file: appendix/Exercise Sheet 1 Solutions.md + title: Exercise Sheet 1 Solutions + - file: appendix/Exercise Sheet 2 Solutions.md + title: Exercise Sheet 2 Solutions # sections: # - file: appendix/proof_vector_spaces # title: Vector Spaces diff --git a/book/appendix/Exercise Sheet 1 Solutions.md b/book/appendix/Exercise Sheet 1 Solutions.md new file mode 100644 index 0000000..779cd0e --- /dev/null +++ b/book/appendix/Exercise Sheet 1 Solutions.md @@ -0,0 +1,60 @@ +# Exercise Sheet 1 Solutions + + +### 1. +#### (a) +Take any \(v_1=(a,b)\) and \(v_2=(c,d)\) in \(V\); then \(b=3a+1\) and \(d=3c+1\). +Their sum is +\[ +v_1+v_2=(a+c,\;b+d)=(a+c,\;3a+1+3c+1)=\bigl(a+c,\;3(a+c)+2\bigr), +\] +which **does not** satisfy \(b+d=3(a+c)+1\). Hence \(V\) is *not* closed under addition ⇒ **not a vector space**. +(Equivalently, the additive identity \((0,0)\notin V\), violating axiom V1.) + +#### (b) +All axioms except **distributivity over scalar addition** fail: + +Take \(v=(a,b)\) and scalars \(\alpha,\beta\in\mathbb R\). +\[ +(\alpha+\beta)\,v=((\alpha+\beta)a,\;b), +\quad +\alpha v+\beta v=(\alpha a,\;b)+(\beta a,\;b)=((\alpha+\beta)a,\;2b). +\] +Unless \(b=0\), the second component differs, so +\((\alpha+\beta)v\neq\alpha v+\beta v\). +Therefore \(V\) is **not** a vector space. + + +### 2. +#### (a) +*Zero vector:* \((0,0)\) satisfies \(0=2\cdot0\). +*Closure (addition):* if \(y_1=2x_1\) and \(y_2=2x_2\), then +\[ +y_1+y_2 = 2(x_1+x_2). +\] +*Closure (scalar mult.):* for \(\alpha\in\mathbb R\), +\[ +\alpha(x,y)=(\alpha x,\;2\alpha x). +\] +All three conditions hold ⇒ \(W\) **is a subspace**. + +#### (b) +Pick \((x,y)\in W\) with \(x>0\) and any negative scalar \(\alpha<0\). +Then +\[ +\alpha(x,y)=(\alpha x,\;\alpha y), +\] +and \(\alpha x<0\). Thus \(\alpha(x,y)\notin W\). +Not closed under scalar multiplication ⇒ **not a subspace**. + + +### 3. +For \(x=(a,b)\), \(y=(c,d)\) and scalars \(\alpha,\beta\): +\[ +T(\alpha x+\beta y)=\bigl((\alpha a+\beta c)^{2},\;\alpha b+\beta d\bigr), +\] +\[ +\alpha T(x)+\beta T(y)=\bigl(\alpha^{2}a^{2}+\beta^{2}c^{2},\;\alpha b+\beta d\bigr). +\] +The first components differ unless \(a c=0\) or \(\alpha\beta=0\). +Hence \(T\) **violates additivity/homogeneity ⇒ not linear**. diff --git a/book/appendix/Exercise Sheet 2 Solutions.md b/book/appendix/Exercise Sheet 2 Solutions.md new file mode 100644 index 0000000..89cf37b --- /dev/null +++ b/book/appendix/Exercise Sheet 2 Solutions.md @@ -0,0 +1,422 @@ +# Exercise Sheet 2 Solutions + + +### 1. 
+#### (a) +Let +\[ +f(x, y) = +\begin{cases} +\dfrac{x \sin y}{x^2 + y^2} & \text{if } (x, y) \neq (0, 0) \\ +0 & \text{if } (x, y) = (0, 0) +\end{cases} +\] + +We are asked to examine the **continuity of \( f \) in \( \mathbb{R}^2 \)**. + +*Remark* +We say that a single-variable function \( f : \mathbb{R} \rightarrow \mathbb{R} \) is **continuous at a point** \( a \in \mathbb{R} \) if + +\[ +\lim_{x \to a} f(x) = f(a) +\] + +*Extension to Two Variables* +Similarly, for a function of two variables \( f : \mathbb{R}^2 \rightarrow \mathbb{R} \), we say that \( f \) is continuous at the point \( (a, b) \in \mathbb{R}^2 \) if + +\[ +\lim_{(x, y) \to (a, b)} f(x, y) = f(a, b) +\] + +So, to study the continuity of \( f \), we need to check whether this limit exists and equals the value of the function at that point. + +*Strategy* +To determine the **existence of** +\[ +\lim_{(x, y) \to (0, 0)} f(x, y), +\] +we must examine whether the limit exists and is the same **along all possible directions** towards \( (0, 0) \). + +*Direction 1: Along the x-axis* +We approach \( (0, 0) \) along the **x-axis**, i.e., set \( y = 0 \). +Then: + +\[ +f(x, 0) = \frac{x \sin(0)}{x^2 + 0^2} = \frac{0}{x^2} = 0 \quad \text{for all } x \neq 0 +\] + +\[ +\Rightarrow \lim_{(x, y) \to (0, 0)} f(x, y) = \lim_{x \to 0} f(x, 0) = 0 +\] + +*Direction 2: Along the y-axis* +Let \( x = 0 \), then: + +\[ +f(0, y) = \frac{0 \cdot \sin(y)}{0 + y^2} = 0 +\] + +\[ +\Rightarrow \lim_{(x, y) \to (0, 0)} f(x, y) = \lim_{y \to 0} f(0, y) = 0 +\] + +*Direction 3: Along \( y = x \)* +Now we approach the origin along a different line, say \( y = x \): + +\[ +f(x, x) = \frac{x \sin x}{x^2 + x^2} = \frac{x \sin x}{2x^2} = \frac{\sin x}{2x} +\] + +\[ +\Rightarrow \lim_{(x, y) \to (0, 0)} f(x, y) = \lim_{x \to 0} \frac{\sin x}{2x} = \frac{1}{2} +\] + +Since this limit \(\frac{1}{2} \neq 0\), the two-dimensional limit +\[ +\lim_{(x, y) \to (0, 0)} f(x, y) +\] +**does not exist**, and hence \( f(x, y) \) is **not continuous** at the point \( (0, 0) \). + +#### (b) +*Partial derivatives of \( f \) at point \( (0, 0) \)* +If we have a function of two variables +\[ +f : \mathbb{R}^2 \rightarrow \mathbb{R}, \quad (x, y) \mapsto f(x, y) +\] +then the partial derivative of \( f \) with respect to \( x \) at \( (a, b) \) is defined as: + +\[ +f_x(a, b) = \lim_{h \to 0} \frac{f(a + h, b) - f(a, b)}{h} +\] + +and similarly, the partial derivative of \( f \) with respect to \( y \) at \( (a, b) \) is: + +\[ +f_y(a, b) = \lim_{h \to 0} \frac{f(a, b + h) - f(a, b)}{h} +\] + +*Compute partial derivatives at \( (0, 0) \)* +-Partial derivative with respect to \( x \): +\[ +f_x(0, 0) = \lim_{h \to 0} \frac{f(h, 0) - f(0, 0)}{h} += \lim_{h \to 0} \frac{0 - 0}{h} = 0 +\] + +-Partial derivative with respect to \( y \): + +\[ +f_y(0, 0) = \lim_{h \to 0} \frac{f(0, h) - f(0, 0)}{h} += \lim_{h \to 0} \frac{0 - 0}{h} = 0 +\] + +#### (c) +*At which points is \( f \) differentiable?* +To determine where the function \( f \) is differentiable, we use the following **theorem**: + +*Remark (Theorem)* +If \( f \) is a continuous function in an open set \( U \), +and has **continuous partial derivatives** at \( U \), +then \( f \) is **continuously differentiable** at all points in \( U \). + +Let \( U = \mathbb{R}^2 \setminus \{(0, 0)\} \). +The function \( f(x, y) = \dfrac{x \sin y}{x^2 + y^2} \) is continuous at all points in \( U \). 
+ +Now we examine the partial derivatives of \( f \): + +*Compute \( \dfrac{\partial f}{\partial x} \) and \( \dfrac{\partial f}{\partial y} \)* +\[ +\frac{\partial}{\partial x} \left( \frac{x \sin y}{x^2 + y^2} \right) += \frac{(x^2 + y^2)\sin y - 2x^2 \sin y}{(x^2 + y^2)^2} += \frac{(y^2 - x^2)\sin y}{(x^2 + y^2)^2} +\] + +\[ +\frac{\partial}{\partial y} \left( \frac{x \sin y}{x^2 + y^2} \right) += \frac{x \cos y (x^2 + y^2) - 2x y \sin y}{(x^2 + y^2)^2} +\] + +These are rational functions where the **numerator and denominator** are composed of continuous functions, and the **denominator only vanishes at the origin** \( (0, 0) \). +Thus, the partial derivatives are continuous **everywhere in \( U \)**. + +*Conclusion* +So, based on the theorem, function \( f \) is **differentiable at all points except** the origin, that is, point \( (0, 0) \). + + +### 2. +#### (a) +Let the function \( f(z) = \exp\left(-\dfrac{1}{2} z\right) \), +where \( z = g(y) = y^\top S^{-1} y \), +and \( y = h(x) = x - u \), +with: + +- \( x, u \in \mathbb{R}^D \) +- \( S \in \mathbb{R}^{D \times D} \) + +*Chain Rule* +Based on the chain rule, we have: + +\[ +\frac{df}{dx} = \frac{df}{dz} \cdot \frac{dz}{dy} \cdot \frac{dy}{dx} +\] + +*Step 1: Note the functions and their domains* +- \( y = h(x) = x - u \) → maps \( \mathbb{R}^D \to \mathbb{R}^D \) +- \( z = g(y) = y^\top S^{-1} y \) → maps \( \mathbb{R}^D \to \mathbb{R} \) +- \( f(z) = e^{- \frac{1}{2} z} \) → maps \( \mathbb{R} \to \mathbb{R} \) + +So the full composition is: +\[ +x \mapsto y = x - u \mapsto z = y^\top S^{-1} y \mapsto f(z) = e^{- \frac{1}{2} z} +\] + +*Step 2: Compute \( \dfrac{dy}{dx} \)* +Since \( y = x - u \), the Jacobian \( \dfrac{dy}{dx} \) is: + +\[ +\frac{dy}{dx} = I_{D \times D} +\quad \text{(identity matrix)} +\] + +*Step 3: Compute \( \dfrac{dz}{dy} \)* +We have \( z = y^\top S^{-1} y \). +Using gradient rules for quadratic forms: + +\[ +\frac{d}{dy} (y^\top A y) = y^\top (A + A^\top) +\] + +Apply this: + +\[ +\frac{dz}{dy} = y^\top (S^{-1} + (S^{-1})^\top) +\quad \in \mathbb{R}^{1 \times D} +\] + +*Step 4: Compute \( \dfrac{df}{dz} \)* +\[ +f(z) = e^{- \frac{1}{2} z} +\quad \Rightarrow \quad +\frac{df}{dz} = -\frac{1}{2} e^{- \frac{1}{2} z} +\quad \in \mathbb{R} +\] + +*Final Result* +\[ +\frac{df}{dx} = -\frac{1}{2} e^{- \frac{1}{2} z} \cdot y^\top (S^{-1} + (S^{-1})^\top) +\quad \in \mathbb{R}^{1 \times D} +\] + +#### (b) +Let + +\[ +f(z) = \tanh(z), \quad z = Ax + b +\] + +where: + +- \( x \in \mathbb{R}^N \) +- \( A \in \mathbb{R}^{M \times N} \) +- \( b \in \mathbb{R}^M \) + +*Apply Chain Rule* +\[ +\frac{df}{dx} = \frac{df}{dz} \cdot \frac{dz}{dx} +\] + +*Step 1: Understand \( z = Ax + b \)* +We note: + +\[ +z = Ax + b \in \mathbb{R}^M +\Rightarrow \frac{dz}{dx} = A \in \mathbb{R}^{M \times N} +\] + +*Step 2: Compute \( \dfrac{df}{dz} \)* +We have: + +\[ +f(z) = +\begin{bmatrix} +\tanh(z_1) \\ +\tanh(z_2) \\ +\vdots \\ +\tanh(z_M) +\end{bmatrix} +\] + +So the Jacobian of \( f \) is diagonal: + +\[ +\frac{df}{dz} = +\begin{bmatrix} +\text{sech}^2(z_1) & & \\ +& \ddots & \\ +& & \text{sech}^2(z_M) +\end{bmatrix} +\in \mathbb{R}^{M \times M} +\] + +*Final Result* +```math +\frac{df}{dx} += +\operatorname{diag}\left( + \text{sech}^2(z_1),\ + \text{sech}^2(z_2),\ + \dots,\ + \text{sech}^2(z_M) +\right) +\cdot A +\quad \in \mathbb{R}^{M \times N} +``` + +### 3. +#### (a) +Let +```math +x^{(0)} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \quad \eta = 1 +``` +and perform two steps of **gradient descent**. 
+

The update rule for gradient descent is:

```math
x^{(i+1)} = x^{(i)} - \eta \nabla f(x^{(i)})
```

So two steps of the gradient descent algorithm are:

```math
\text{Step 1:} \quad x^{(1)} = x^{(0)} - \eta \nabla f(x^{(0)})
```

```math
\text{Step 2:} \quad x^{(2)} = x^{(1)} - \eta \nabla f(x^{(1)})
```

Given the gradient:

```math
\nabla f = \begin{bmatrix}
x_1 + 2 \\
2x_2 + 1
\end{bmatrix}
```

We compute:

```math
x^{(0)} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}
\quad \Rightarrow \quad
\nabla f(x^{(0)}) = \begin{bmatrix} 2 \\ 1 \end{bmatrix}
```

*Step 1:*
```math
x^{(1)} = x^{(0)} - 1 \cdot \nabla f(x^{(0)}) =
\begin{bmatrix} 0 \\ 0 \end{bmatrix}
- \begin{bmatrix} 2 \\ 1 \end{bmatrix}
= \begin{bmatrix} -2 \\ -1 \end{bmatrix}
```

```math
\nabla f(x^{(1)}) =
\begin{bmatrix} -2 + 2 \\ -2 + 1 \end{bmatrix}
= \begin{bmatrix} 0 \\ -1 \end{bmatrix}
```

*Step 2:*
```math
x^{(2)} = x^{(1)} - 1 \cdot \nabla f(x^{(1)}) =
\begin{bmatrix} -2 \\ -1 \end{bmatrix}
- \begin{bmatrix} 0 \\ -1 \end{bmatrix}
= \begin{bmatrix} -2 \\ 0 \end{bmatrix}
```

#### (b)
Will the gradient descent procedure from part (a) converge to the minimizer \( x^* \)? Why or why not? How can we fix it?

Let’s look at the values over the iterations:

```math
x^{(0)} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \quad
x^{(1)} = \begin{bmatrix} -2 \\ -1 \end{bmatrix}, \quad
x^{(2)} = \begin{bmatrix} -2 \\ 0 \end{bmatrix}, \quad
x^* = \begin{bmatrix} -2 \\ -0.5 \end{bmatrix}
```

And:

```math
\nabla f(x^{(0)}) = \begin{bmatrix} 2 \\ 1 \end{bmatrix}, \quad
\nabla f(x^{(1)}) = \begin{bmatrix} 0 \\ -1 \end{bmatrix}, \quad
\nabla f(x^{(2)}) = \begin{bmatrix} 0 \\ 1 \end{bmatrix}, \quad
\nabla f(x^*) = \begin{bmatrix} 0 \\ 0 \end{bmatrix}
```

We observe that **gradient descent does not converge** to \( x^* \). Why?

Because the gradients do **not shrink toward zero** as the iterations proceed.
Let’s examine the **partial derivatives**:

```math
\frac{\partial f}{\partial x_1} \big|_{x^{(0)}} = 2, \quad
\frac{\partial f}{\partial x_1} \big|_{x^{(1)}} = 0, \quad
\frac{\partial f}{\partial x_1} \big|_{x^{(2)}} = 0
```

```math
\frac{\partial f}{\partial x_2} \big|_{x^{(0)}} = 1, \quad
\frac{\partial f}{\partial x_2} \big|_{x^{(1)}} = -1, \quad
\frac{\partial f}{\partial x_2} \big|_{x^{(2)}} = 1
```

Since \( x^* \) is a minimum and \( \nabla f(x^*) = 0 \), we expect the GD algorithm to converge to \( x^* \) only if the partial derivatives shrink toward zero along the iterates.

The first coordinate behaves well: \( \partial f / \partial x_1 \) is already \( 0 \) after one step. In the second coordinate, however, \( \partial f / \partial x_2 \) keeps alternating between \( -1 \) and \( 1 \): GD **jumps over the minimum** at every step because the learning rate \( \eta = 1 \) is **too high**, so the iterates oscillate between \( x_2 = 0 \) and \( x_2 = -1 \). If we **decrease** the learning rate, convergence improves.
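
This can be confirmed numerically. The sketch below is not part of the original solution; it uses only the gradient given above and runs a few iterations for \( \eta = 1 \) and \( \eta = 0.5 \):

```python
# Sketch: iterate x_{k+1} = x_k - eta * grad_f(x_k) with grad_f(x) = (x_1 + 2, 2*x_2 + 1)
# and compare eta = 1 (oscillation in x_2) with eta = 0.5 (convergence).
import numpy as np

def grad_f(x):
    return np.array([x[0] + 2.0, 2.0 * x[1] + 1.0])

for eta in (1.0, 0.5):
    x = np.zeros(2)
    print(f"eta = {eta}")
    for k in range(6):
        x = x - eta * grad_f(x)
        print(f"  x^({k + 1}) = {x}")
```

With \( \eta = 1 \), the second coordinate keeps bouncing between \( 0 \) and \( -1 \); with \( \eta = 0.5 \), the iterates approach \( x^* = (-2, -0.5) \), matching the hand computation that follows.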
+ +*Trying smaller learning rates:* +Let’s try \( \eta = 0.5 \): + +```math +x^{(0)} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \quad +\nabla f(x^{(0)}) = \begin{bmatrix} 2 \\ 1 \end{bmatrix} +``` + +*Step 1:* +```math +x^{(1)} = x^{(0)} - 0.5 \cdot \nabla f(x^{(0)}) += \begin{bmatrix} 0 \\ 0 \end{bmatrix} +- 0.5 \cdot \begin{bmatrix} 2 \\ 1 \end{bmatrix} += \begin{bmatrix} -1 \\ -0.5 \end{bmatrix} +``` + +```math +\nabla f(x^{(1)}) = \begin{bmatrix} 1 \\ 0 \end{bmatrix} +``` + +*Step 2:* +```math +x^{(2)} = x^{(1)} - 0.5 \cdot \nabla f(x^{(1)}) += \begin{bmatrix} -1 \\ -0.5 \end{bmatrix} +- 0.5 \cdot \begin{bmatrix} 1 \\ 0 \end{bmatrix} += \begin{bmatrix} -1.5 \\ -0.5 \end{bmatrix} +``` + +Now we see that the GD algorithm converges towards: + +```math +x^* = \begin{bmatrix} -2 \\ -0.5 \end{bmatrix} +``` + +with gradients: + +```math +\nabla f(x^{(0)}) = \begin{bmatrix} 2 \\ 1 \end{bmatrix}, \quad +\nabla f(x^{(1)}) = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \quad +\nabla f(x^{(2)}) = \begin{bmatrix} 0.5 \\ 0 \end{bmatrix}, \quad +\nabla f(x^*) = \begin{bmatrix} 0 \\ 0 \end{bmatrix} +``` + +✔️ So a smaller \( \eta \) leads to proper convergence! diff --git a/book/appendix/Exercise Sheet Solutions.md b/book/appendix/Exercise Sheet Solutions.md new file mode 100644 index 0000000..cadaa1e --- /dev/null +++ b/book/appendix/Exercise Sheet Solutions.md @@ -0,0 +1,6 @@ +# Exercise Solutions + +In this appendix, we provide worked-out solutions to the weekly exercise sheets accompanying the *Mathematics for Machine Learning* course. These solutions are designed to reinforce understanding of the theoretical material covered in the main chapters. Each solution sheet contains detailed step-by-step derivations and justifications. + +```{tableofcontents} +``` From ebe39a5162e90a45eb1f40830971359ac34ed6e4 Mon Sep 17 00:00:00 2001 From: clippert Date: Thu, 15 May 2025 22:05:32 +0200 Subject: [PATCH 22/43] draft material w05 --- book/_toc.yml | 2 +- .../Rayleigh_quotients.md | 301 +++--------------- book/chapter_decompositions/psd_matrices.md | 105 ++++-- 3 files changed, 116 insertions(+), 292 deletions(-) diff --git a/book/_toc.yml b/book/_toc.yml index 52b0024..e81c955 100644 --- a/book/_toc.yml +++ b/book/_toc.yml @@ -86,7 +86,7 @@ parts: - file: chapter_decompositions/eigenvectors # end week 05 - file: chapter_decompositions/orthogonal_matrices - file: chapter_decompositions/symmetric_matrices -# - file: chapter_decompositions/Rayleigh_quotients # skip for now + - file: chapter_decompositions/Rayleigh_quotients # skip for now - file: chapter_decompositions/psd_matrices # PCA as example - file: chapter_decompositions/svd # - file: chapter_decompositions/big_picture diff --git a/book/chapter_decompositions/Rayleigh_quotients.md b/book/chapter_decompositions/Rayleigh_quotients.md index 1ab9a43..976e143 100644 --- a/book/chapter_decompositions/Rayleigh_quotients.md +++ b/book/chapter_decompositions/Rayleigh_quotients.md @@ -1,13 +1,15 @@ # Rayleigh Quotients -There turns out to be an interesting connection between the quadratic -form of a symmetric matrix and its eigenvalues. This connection is -provided by the **Rayleigh quotient** +There turns out to be an interesting connection between the quadratic form of a symmetric matrix and its eigenvalues. 
+This connection is provided by the **Rayleigh quotient** -$$R_\mathbf{A}(\mathbf{x}) = \frac{\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x}}{\mathbf{x}^{\!\top\!}\mathbf{x}}$$ +> $$R_\mathbf{A}(\mathbf{x}) = \frac{\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x}}{\mathbf{x}^{\!\top\!}\mathbf{x}}$$ -The Rayleigh quotient has a couple of important properties which the -reader can (and should!) easily verify from the definition: +The Rayleigh quotient has a couple of important properties: + +:::{prf:lemma} Properties of the Rayleigh Quotient +:label: trm-Rayleigh-properties +:nonumber: (i) **Scale invariance**: for any vector $\mathbf{x} \neq \mathbf{0}$ and any scalar $\alpha \neq 0$, @@ -15,10 +17,23 @@ reader can (and should!) easily verify from the definition: (ii) If $\mathbf{x}$ is an eigenvector of $\mathbf{A}$ with eigenvalue $\lambda$, then $R_\mathbf{A}(\mathbf{x}) = \lambda$. +::: + +:::{prf:proof} +(i) + + $$R_\mathbf{A}(\alpha\mathbf{x}) = \frac{(\alpha\mathbf{x})^{\!\top\!}\mathbf{A}(\alpha\mathbf{x})}{(\alpha\mathbf{x})^{\!\top\!}(\alpha\mathbf{x})} = \frac{\alpha^2}{\alpha^2}\frac{\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x}}{\mathbf{x}^{\!\top\!}\mathbf{x}}=R_\mathbf{A}(\mathbf{x}).$$ + +(ii) Let $\mathbf{x}$ be an eigenvector of $\mathbf{A}$ with eigenvalue + $\lambda$, then + + $$R_\mathbf{A}(\mathbf{x})= \frac{\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x}}{\mathbf{x}^{\!\top\!}\mathbf{x}} = \frac{\mathbf{x}^{\!\top\!}(\lambda\mathbf{x})}{\mathbf{x}^{\!\top\!}\mathbf{x}}=\lambda\frac{\mathbf{x}^{\!\top\!}\mathbf{x}}{\mathbf{x}^{\!\top\!}\mathbf{x}} = \lambda.$$ +::: We can further show that the Rayleigh quotient is bounded by the largest -and smallest eigenvalues of $\mathbf{A}$. But first we will show a -useful special case of the final result. +and smallest eigenvalues of $\mathbf{A}$. + +But first we will show a useful special case of the final result. :::{prf:theorem} Bound Rayleigh Quotient :label: trm-bound-Rayleigh-quotient @@ -28,11 +43,11 @@ For any $\mathbf{x}$ such that $\|\mathbf{x}\|_2 = 1$, $$\lambda_{\min}(\mathbf{A}) \leq \mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} \leq \lambda_{\max}(\mathbf{A})$$ -with equality if and only if $\mathbf{x}$ is a corresponding -eigenvector. +with equality if and only if $\mathbf{x}$ is a corresponding eigenvector. ::: :::{prf:proof} + We show only the $\max$ case because the argument for the $\min$ case is entirely analogous. @@ -56,7 +71,9 @@ $I = \{i : \lambda_i = \max_{j=1,\dots,n} \lambda_j = \lambda_{\max}(\mathbf{A}) and $y_j = 0$ for $j \not\in I$. That is, $I$ contains the index or -indices of the largest eigenvalue. In this case, the maximal value of +indices of the largest eigenvalue. + +In this case, the maximal value of the expression is $$\sum_{i=1}^n \lambda_i y_i^2 = \sum_{i \in I} \lambda_i y_i^2 = \lambda_{\max}(\mathbf{A}) \sum_{i \in I} y_i^2 = \lambda_{\max}(\mathbf{A})$$ @@ -73,6 +90,7 @@ $\mathbf{A}$ and form an orthonormal basis for $\mathbb{R}^n$. Therefore by construction, the set $\{\mathbf{q}_i : i \in I\}$ forms an orthonormal basis for the eigenspace of $\lambda_{\max}(\mathbf{A})$. + Hence $\mathbf{x}$, which is a linear combination of these, lies in that eigenspace and thus is an eigenvector of $\mathbf{A}$ corresponding to $\lambda_{\max}(\mathbf{A})$. 
@@ -81,13 +99,11 @@ We have shown that $\max_{\|\mathbf{x}\|_2 = 1} \mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} = \lambda_{\max}(\mathbf{A})$, from which we have the general inequality $\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} \leq \lambda_{\max}(\mathbf{A})$ -for all unit-length $\mathbf{x}$. ◻ +for all unit-length $\mathbf{x}$. ◻ ::: By the scale invariance of the Rayleigh quotient, we immediately have as -a corollary (since -$\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} = R_{\mathbf{A}}(\mathbf{x})$ -for unit $\mathbf{x}$) +a corollary :::{prf:theorem} Min-Max Theorem :label: trm-min-max @@ -101,6 +117,19 @@ with equality if and only if $\mathbf{x}$ is a corresponding eigenvector. ::: +:::{prf:proof} + +Let $\mathbf{x}\neq \boldsymbol{0},$ then + +$R_\mathbf{A}(\mathbf{x}) = \frac{\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x}}{\mathbf{x}^{\!\top\!}\mathbf{x}} = \frac{\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x}}{\|\mathbf{x}\|^2} = (\frac{\mathbf{x}}{\|\mathbf{x}\|})^{\!\top\!}\mathbf{A}(\frac{\mathbf{x}}{\|\mathbf{x}\|})$ + +Thus, minimimum and maximum of the Rayleigh quotient are identical to minimum and maximum of the squared form $\mathbf{y}\mathbf{A}\mathbf{y}$ for the unit-norm vector $\mathbf{y}=\mathbf{x}/\|\mathbf{x}\|$: + +$$\lambda_{\min}(\mathbf{A}) \leq R_\mathbf{A}(\mathbf{x}) \leq \lambda_{\max}(\mathbf{A})$$ + +◻ +::: + ```{code-cell} ipython3 :tags: [hide-input] import numpy as np @@ -188,245 +217,3 @@ This combined visualization brings together the **Rayleigh quotient** and the ** Together, these panels illustrate how the **direction of a vector determines how strongly it is scaled** by the symmetric matrix, and how this scaling relates to the matrix's **eigenstructure**. ✅ As guaranteed by the **Min–Max Theorem**, the maximum and minimum of the Rayleigh quotient occur precisely at the **eigenvectors corresponding to the largest and smallest eigenvalues**. - - - ---- - -## ✅ Theorem: Real symmetric matrices cannot produce rotation - -### 🧾 Statement - -Let $\mathbf{A} \in \mathbb{R}^{n \times n}$ be a **real symmetric matrix**. Then: - -> The linear transformation $\mathbf{x} \mapsto \mathbf{A}\mathbf{x}$ **does not rotate** vectors — i.e., it cannot produce a transformation that changes the direction of a vector **without preserving its span**. - -In particular: - -* The transformation **does not rotate angles** -* The transformation has a basis of **orthogonal eigenvectors** -* Therefore, all action is **stretching/compressing along fixed directions**, not rotation - ---- - -## 🧠 Intuition - -Rotation mixes directions. But symmetric matrices: - -* Have **real eigenvalues** -* Are **orthogonally diagonalizable** -* Have **mutually orthogonal eigenvectors** - -So the matrix acts by **scaling along fixed orthogonal axes**, without changing the direction between basis vectors — i.e., no twisting, hence no rotation. - ---- - -## ✏️ Proof (2D case, generalizes easily) - -Let $\mathbf{A} \in \mathbb{R}^{2 \times 2}$ be symmetric: - -$$ -\mathbf{A} = \begin{pmatrix} a & b \\ b & d \end{pmatrix} -$$ - -We’ll show that $\mathbf{A}$ cannot produce a true rotation. 
- -### Step 1: Diagonalize $\mathbf{A}$ - -Because $\mathbf{A}$ is real symmetric, there exists an orthogonal matrix $\mathbf{Q}$ and diagonal $\mathbf{\Lambda}$ such that: - -$$ -\mathbf{A} = \mathbf{Q} \mathbf{\Lambda} \mathbf{Q}^\top -$$ - -That is, $\mathbf{A}$ acts as: - -* A rotation (or reflection) $\mathbf{Q}^\top$ -* A stretch along axes $\mathbf{\Lambda}$ -* A second rotation (or reflection) $\mathbf{Q}$ - -But since $\mathbf{Q}$ and $\mathbf{Q}^\top$ cancel out geometrically (they are transposes of each other), this results in: - -> A transformation that **scales but does not rotate** relative to the basis of eigenvectors. - -### Step 2: Show $\mathbf{A}$ preserves alignment - -Let $\mathbf{v}$ be any eigenvector of $\mathbf{A}$. - -Then: - -$$ -\mathbf{A} \mathbf{v} = \lambda \mathbf{v} -$$ - -So $\mathbf{v}$ is **mapped to a scalar multiple of itself** — its **direction doesn’t change**. - -Because $\mathbb{R}^2$ has two linearly independent eigenvectors (since symmetric matrices are always diagonalizable), **no vector is rotated out of its original span** — just scaled. - -Hence, the transformation only **stretches**, **compresses**, or **reflects**, but never rotates. - ---- - -## 🚫 Counterexample: Rotation matrix is not symmetric - -The rotation matrix: - -$$ -\mathbf{R}_\theta = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} -$$ - -is **not symmetric** unless $\theta = 0$ or $\pi$, where it reduces to identity or negation. - -It **does not** have real eigenvectors (except at those degenerate angles), and it **rotates** all directions. - ---- - -## ✅ Conclusion - -**Rotation requires asymmetry.** - -If a linear transformation rotates vectors (changes direction without preserving alignment), the matrix must be **non-symmetric**. - ---- - -## ✅ Corollary - -A matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$ can perform rotation **only if**: - -* It is **not symmetric**, and -* It has **complex eigenvalues** (at least in 2D rotation) - - ---- - -## ✅ When Does a Matrix Have an Eigen-Decomposition? - -| Matrix Type | Diagonalizable? | Notes | -| ------------------------------ | --------------- | ------------------------------------------------ | -| Symmetric (real) | ✅ Always | Eigen-decomposition with orthogonal eigenvectors | -| Diagonalizable (in general) | ✅ Yes | Can write $A = V \Lambda V^{-1}$ | -| Defective (non-diagonalizable) | ❌ No | Needs Jordan form instead | - ---- - -## 🔁 Jordan Decomposition: The General Replacement - -If a matrix is **not diagonalizable**, it still has a **Jordan decomposition**: - -$$ -\mathbf{A} = \mathbf{P} \mathbf{J} \mathbf{P}^{-1} -$$ - -Where: - -* $\mathbf{J}$ is **block diagonal**: eigenvalues + possible **Jordan blocks** -* This captures **generalized eigenvectors** - -So **every square matrix** has a **Jordan decomposition**, but **not every one has an eigen-decomposition**. - ---- - -## ✅ Summary - -* **Symmetric matrices**: always have an eigen-decomposition (with real, orthogonal eigenvectors) -* **Non-symmetric matrices**: - - * May have a complete eigen-decomposition (if diagonalizable) - * May **not**, if they are **defective** -* In the general case, you must use **Jordan form** - -A matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$ has **complex eigenvalues or eigenvectors** when: - -### ✅ 1. The matrix is **not symmetric** (i.e., $\mathbf{A} \ne \mathbf{A}^\top$) - -* Real symmetric matrices **always** have **real** eigenvalues and orthogonal eigenvectors. 
-* Non-symmetric real matrices can have complex eigenvalues and eigenvectors. - -### ✅ 2. The **characteristic polynomial** has **complex roots** - -For example, consider: - -$$ -\mathbf{A} = \begin{pmatrix} 0 & -2 \\ 1 & 0 \end{pmatrix} -$$ - -Its characteristic polynomial is: - -$$ -\det(\mathbf{A} - \lambda \mathbf{I}) = \lambda^2 + 1 -$$ - -The roots are: - -$$ -\lambda = \pm i -$$ - -So it has **pure imaginary eigenvalues**, and its eigenvectors are also **complex**. -## ✅ Quick Answer: - -The eigenvectors and their transformed versions $\mathbf{A} \mathbf{v} = \lambda \mathbf{v}$ **are** parallel — **but only in complex vector space** $\mathbb{C}^n$. - -In **real space**, we usually visualize: - -* The **real part** of a complex vector: $\mathrm{Re}(\mathbf{v})$ -* The **imaginary part** of a complex vector: $\mathrm{Im}(\mathbf{v})$ - -But neither of these alone is invariant under multiplication by $\lambda \in \mathbb{C}$. So when you look at: - -$$ -\mathbf{v} = \mathrm{Re}(\mathbf{v}) + i \cdot \mathrm{Im}(\mathbf{v}) -$$ - -and apply $\mathbf{A}$, what you see in the real plane is: - -$$ -\mathrm{Re}(\mathbf{A} \mathbf{v}) \quad \text{vs.} \quad \mathrm{Re}(\lambda \mathbf{v}) -$$ - -These are **not scalar multiples** of $\mathrm{Re}(\mathbf{v})$ or $\mathrm{Im}(\mathbf{v})$, because complex scaling **mixes real and imaginary components** — unless $\lambda$ is real. - ---- - -## 🔍 Example - -Say: - -$$ -\lambda = a + ib, \quad \mathbf{v} = \begin{pmatrix} x + iy \\ z + iw \end{pmatrix} -$$ - -Then: - -$$ -\lambda \mathbf{v} = (a + ib)(\text{real} + i \cdot \text{imag}) = \text{mix of real and imaginary} -$$ - -So $\mathbf{A} \mathbf{v} = \lambda \mathbf{v}$, but $\mathrm{Re}(\mathbf{A} \mathbf{v})$ will **not be parallel** to $\mathrm{Re}(\mathbf{v})$ alone — it's a rotated and scaled mixture. - ---- - -## 🧠 Bottom Line - -> **Eigenvectors and their transformations are parallel in $\mathbb{C}^n$, but not necessarily in $\mathbb{R}^n$.** - - -> Note: The eigenvectors and their transformations are parallel in complex space, but their real and imaginary parts generally point in different directions due to complex scaling (rotation + stretch). - ---- - -## 🧠 Intuition - -* Complex eigenvalues often indicate **rotational behavior** in linear dynamical systems. -* The matrix above rotates vectors by 90° and has no real direction that stays on its span after transformation — hence no real eigenvectors. - ---- - -## 🔄 Summary - -| Matrix Type | Eigenvalues | Eigenvectors | -| ------------------ | ----------------- | ----------------- | -| Symmetric real | Real | Real & orthogonal | -| Non-symmetric real | Real or complex | Real or complex | -| Complex (any) | Complex (general) | Complex (general) | - diff --git a/book/chapter_decompositions/psd_matrices.md b/book/chapter_decompositions/psd_matrices.md index ab14ce8..91006b6 100644 --- a/book/chapter_decompositions/psd_matrices.md +++ b/book/chapter_decompositions/psd_matrices.md @@ -1,62 +1,80 @@ -## Positive (semi-)definite matrices +# Positive (semi-)definite matrices -A symmetric matrix $\mathbf{A}$ is **positive semi-definite** if for all -$\mathbf{x} \in \mathbb{R}^n$, -$\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} \geq 0$. Sometimes people -write $\mathbf{A} \succeq 0$ to indicate that $\mathbf{A}$ is positive +>A symmetric matrix $\mathbf{A}$ is **positive semi-definite** if for all $\mathbf{x} \in \mathbb{R}^n$, $\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} \geq 0$. 
+> +>Sometimes people write $\mathbf{A} \succeq 0$ to indicate that $\mathbf{A}$ is positive semi-definite. -A symmetric matrix $\mathbf{A}$ is **positive definite** if for all -nonzero $\mathbf{x} \in \mathbb{R}^n$, -$\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} > 0$. Sometimes people write -$\mathbf{A} \succ 0$ to indicate that $\mathbf{A}$ is positive definite. +> A symmetric matrix $\mathbf{A}$ is **positive definite** if for all nonzero $\mathbf{x} \in \mathbb{R}^n$, $\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} > 0$. +> +>Sometimes people write $\mathbf{A} \succ 0$ to indicate that $\mathbf{A}$ is positive definite. + Note that positive definiteness is a strictly stronger property than positive semi-definiteness, in the sense that every positive definite matrix is positive semi-definite but not vice-versa. These properties are related to eigenvalues in the following way. -*Proposition.* +:::{prf:proposition} Eigenvalues of Positive Definite Matrices +:label: trm-psd-eigenvalues +:nonumber: A symmetric matrix is positive semi-definite if and only if all of its eigenvalues are nonnegative, and positive definite if and only if all of its eigenvalues are positive. +::: - -*Proof.* Suppose $A$ is positive semi-definite, and let $\mathbf{x}$ be +:::{prf:proof} +Suppose $A$ is positive semi-definite, and let $\mathbf{x}$ be an eigenvector of $\mathbf{A}$ with eigenvalue $\lambda$. Then $$0 \leq \mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} = \mathbf{x}^{\!\top\!}(\lambda\mathbf{x}) = \lambda\mathbf{x}^{\!\top\!}\mathbf{x} = \lambda\|\mathbf{x}\|_2^2$$ Since $\mathbf{x} \neq \mathbf{0}$ (by the assumption that it is an eigenvector), we have $\|\mathbf{x}\|_2^2 > 0$, so we can divide both -sides by $\|\mathbf{x}\|_2^2$ to arrive at $\lambda \geq 0$. If -$\mathbf{A}$ is positive definite, the inequality above holds strictly, -so $\lambda > 0$. This proves one direction. +sides by $\|\mathbf{x}\|_2^2$ to arrive at $\lambda \geq 0$. + +If $\mathbf{A}$ is positive definite, the inequality above holds strictly, +so $\lambda > 0$. + +This proves one direction. To simplify the proof of the other direction, we will use the machinery -of Rayleigh quotients. Suppose that $\mathbf{A}$ is symmetric and all -its eigenvalues are nonnegative. Then for all +of Rayleigh quotients. + +Suppose that $\mathbf{A}$ is symmetric and all +its eigenvalues are nonnegative. + +Then for all $\mathbf{x} \neq \mathbf{0}$, $$0 \leq \lambda_{\min}(\mathbf{A}) \leq R_\mathbf{A}(\mathbf{x})$$ Since $\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x}$ matches $R_\mathbf{A}(\mathbf{x})$ in sign, we conclude that $\mathbf{A}$ is -positive semi-definite. If the eigenvalues of $\mathbf{A}$ are all +positive semi-definite. + +If the eigenvalues of $\mathbf{A}$ are all strictly positive, then $0 < \lambda_{\min}(\mathbf{A})$, whence it follows that $\mathbf{A}$ is positive definite. ◻ - +::: As an example of how these matrices arise, consider -*Proposition.* -Suppose $\mathbf{A} \in \mathbb{R}^{m \times n}$. Then -$\mathbf{A}^{\!\top\!}\mathbf{A}$ is positive semi-definite. If -$\operatorname{null}(\mathbf{A}) = \{\mathbf{0}\}$, then +:::{prf:proposition} Gram Matrices +:label: trm-gram-matrices +:nonumber: + +Suppose $\mathbf{A} \in \mathbb{R}^{m \times n}$. + +Then $\mathbf{A}^{\!\top\!}\mathbf{A}$ is positive semi-definite. + +If $\operatorname{null}(\mathbf{A}) = \{\mathbf{0}\}$, then $\mathbf{A}^{\!\top\!}\mathbf{A}$ is positive definite. 
+::: +:::{prf:proof} -*Proof.* For any $\mathbf{x} \in \mathbb{R}^n$, +For any $\mathbf{x} \in \mathbb{R}^n$, $$\mathbf{x}^{\!\top\!} (\mathbf{A}^{\!\top\!}\mathbf{A})\mathbf{x} = (\mathbf{A}\mathbf{x})^{\!\top\!}(\mathbf{A}\mathbf{x}) = \|\mathbf{A}\mathbf{x}\|_2^2 \geq 0$$ @@ -65,28 +83,37 @@ so $\mathbf{A}^{\!\top\!}\mathbf{A}$ is positive semi-definite. Note that $\|\mathbf{A}\mathbf{x}\|_2^2 = 0$ implies $\|\mathbf{A}\mathbf{x}\|_2 = 0$, which in turn implies $\mathbf{A}\mathbf{x} = \mathbf{0}$ (recall that this is a property of -norms). If $\operatorname{null}(\mathbf{A}) = \{\mathbf{0}\}$, +norms). + +If $\operatorname{null}(\mathbf{A}) = \{\mathbf{0}\}$, $\mathbf{A}\mathbf{x} = \mathbf{0}$ implies $\mathbf{x} = \mathbf{0}$, so $\mathbf{x}^{\!\top\!} (\mathbf{A}^{\!\top\!}\mathbf{A})\mathbf{x} = 0$ if and only if $\mathbf{x} = \mathbf{0}$, and thus $\mathbf{A}^{\!\top\!}\mathbf{A}$ is positive definite. ◻ +::: Positive definite matrices are invertible (since their eigenvalues are nonzero), whereas positive semi-definite matrices might not be. However, if you already have a positive semi-definite matrix, it is possible to perturb its diagonal slightly to produce a positive definite matrix. -*Proposition.* +:::{prf:proposition} +:label: trm-A-plus-eps +:nonumber: + If $\mathbf{A}$ is positive semi-definite and $\epsilon > 0$, then $\mathbf{A} + \epsilon\mathbf{I}$ is positive definite. +::: -*Proof.* Assuming $\mathbf{A}$ is positive semi-definite and +:::{prf:proof} +Assuming $\mathbf{A}$ is positive semi-definite and $\epsilon > 0$, we have for any $\mathbf{x} \neq \mathbf{0}$ that $$\mathbf{x}^{\!\top\!}(\mathbf{A}+\epsilon\mathbf{I})\mathbf{x} = \mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} + \epsilon\mathbf{x}^{\!\top\!}\mathbf{I}\mathbf{x} = \underbrace{\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x}}_{\geq 0} + \underbrace{\epsilon\|\mathbf{x}\|_2^2}_{> 0} > 0$$ as claimed. ◻ +::: An obvious but frequently useful consequence of the two propositions we have just shown is that @@ -104,12 +131,15 @@ $\{\mathbf{x} \in \operatorname{dom} f : f(\mathbf{x}) = c\}$. Let us consider the special case $f(\mathbf{x}) = \mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x}$ where -$\mathbf{A}$ is a positive definite matrix. Since $\mathbf{A}$ is +$\mathbf{A}$ is a positive definite matrix. + +Since $\mathbf{A}$ is positive definite, it has a unique matrix square root $\mathbf{A}^{\frac{1}{2}} = \mathbf{Q}\mathbf{\Lambda}^{\frac{1}{2}}\mathbf{Q}^{\!\top\!}$, where $\mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^{\!\top\!}$ is the eigendecomposition of $\mathbf{A}$ and $\mathbf{\Lambda}^{\frac{1}{2}} = \operatorname{diag}(\sqrt{\lambda_1}, \dots \sqrt{\lambda_n})$. + It is easy to see that this matrix $\mathbf{A}^{\frac{1}{2}}$ is positive definite (consider its eigenvalues) and satisfies $\mathbf{A}^{\frac{1}{2}}\mathbf{A}^{\frac{1}{2}} = \mathbf{A}$. Fixing @@ -121,10 +151,16 @@ $$c = \mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} = \mathbf{x}^{\!\top\!}\mathbf{A where we have used the symmetry of $\mathbf{A}^{\frac{1}{2}}$. Making the change of variable $\mathbf{z} = \mathbf{A}^{\frac{1}{2}}\mathbf{x}$, we have the condition -$\|\mathbf{z}\|_2 = \sqrt{c}$. That is, the values $\mathbf{z}$ lie on a -sphere of radius $\sqrt{c}$. These can be parameterized as +$\|\mathbf{z}\|_2 = \sqrt{c}$. + +That is, the values $\mathbf{z}$ lie on a +sphere of radius $\sqrt{c}$. + +These can be parameterized as $\mathbf{z} = \sqrt{c}\hat{\mathbf{z}}$ where $\hat{\mathbf{z}}$ has -$\|\hat{\mathbf{z}}\|_2 = 1$. 
Then since +$\|\hat{\mathbf{z}}\|_2 = 1$. + +Then since $\mathbf{A}^{-\frac{1}{2}} = \mathbf{Q}\mathbf{\Lambda}^{-\frac{1}{2}}\mathbf{Q}^{\!\top\!}$, we have @@ -132,7 +168,9 @@ $$\mathbf{x} = \mathbf{A}^{-\frac{1}{2}}\mathbf{z} = \mathbf{Q}\mathbf{\Lambda}^ where $\tilde{\mathbf{z}} = \mathbf{Q}^{\!\top\!}\hat{\mathbf{z}}$ also satisfies $\|\tilde{\mathbf{z}}\|_2 = 1$ since $\mathbf{Q}$ is -orthogonal. Using this parameterization, we see that the solution set +orthogonal. + +Using this parameterization, we see that the solution set $\{\mathbf{x} \in \mathbb{R}^n : f(\mathbf{x}) = c\}$ is the image of the unit sphere $\{\tilde{\mathbf{z}} \in \mathbb{R}^n : \|\tilde{\mathbf{z}}\|_2 = 1\}$ @@ -164,11 +202,10 @@ $$\mathbf{Q}\mathbf{e}_i = \sum_{j=1}^n [\mathbf{e}_i]_j\mathbf{q}_j = \mathbf{q where we have used the matrix-vector product identity from earlier. -In summary: the isocontours of +**In summary:** the isocontours of $f(\mathbf{x}) = \mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x}$ are ellipsoids such that the axes point in the directions of the eigenvectors of $\mathbf{A}$, and the radii of these axes are proportional to the inverse square roots of the corresponding eigenvalues. - From 4fa9a00e2c0a5de8a156d466bd701caec7184300 Mon Sep 17 00:00:00 2001 From: clippert Date: Fri, 16 May 2025 12:56:13 +0200 Subject: [PATCH 23/43] added PCA section --- book/_toc.yml | 4 +- book/chapter_decompositions/pca.md | 425 ++++++++++++++++++++ book/chapter_decompositions/psd_matrices.md | 3 + drafts/example_genetics/pca_genetics.md | 34 +- 4 files changed, 435 insertions(+), 31 deletions(-) create mode 100644 book/chapter_decompositions/pca.md diff --git a/book/_toc.yml b/book/_toc.yml index e81c955..b906480 100644 --- a/book/_toc.yml +++ b/book/_toc.yml @@ -87,7 +87,9 @@ parts: - file: chapter_decompositions/orthogonal_matrices - file: chapter_decompositions/symmetric_matrices - file: chapter_decompositions/Rayleigh_quotients # skip for now - - file: chapter_decompositions/psd_matrices # PCA as example + - file: chapter_decompositions/psd_matrices + - file: chapter_decompositions/pca # PCA as example for the eigenvalue decomposition of a psd matrix + title: Principal Components Analysis - file: chapter_decompositions/svd # - file: chapter_decompositions/big_picture # - file: chapter_decompositions/pseudoinverse diff --git a/book/chapter_decompositions/pca.md b/book/chapter_decompositions/pca.md new file mode 100644 index 0000000..94a40e3 --- /dev/null +++ b/book/chapter_decompositions/pca.md @@ -0,0 +1,425 @@ +--- +jupytext: + text_representation: + extension: .md + format_name: myst + format_version: 0.13 + jupytext_version: 1.16.7 +kernelspec: + display_name: math4ml + language: python + name: python3 +--- ++++ {"slideshow": {"slide_type": "slide"}} + +# Principal Components Analysis + + +Pricnipal Components Analysis (PCA) performs the orthogonal projection of the data onto a lower dimensional linear space. The goal is to find the directions (principal components) in which the variance of the data is maximized. +An alternative definition of PCA is based on minimizing the sum-of-sqares of the projection errors. + +## Formal definition + +Given a dataset $\mathbf{X} \in \mathbb{R}^{N \times D}$ (rows are samples, columns are features), we aim to find an orthonormal basis $\mathbf{U}_k \in \mathbb{R}^{D \times k}$, $k < D$, such that the projection of the data onto the subspace spanned by $\mathbf{U}_k$ captures **as much variance** (energy) as possible. 
+ +In the following example, we visualize how PCA both minimizes reconstruction error in the original space and extracts a lower-dimensional, variance-preserving representation. + +```{code-cell} ipython3 +:tags: [hide-input] + +import numpy as np +import pandas as pd +import matplotlib.pyplot as plt +from pysnptools.snpreader import Bed +from mpl_toolkits.mplot3d import Axes3D + +# Generate synthetic 3D data +np.random.seed(42) +n_samples = 20 +covariance_3d = np.array([ + [5, 0.5, 0.7], + [0.5, 1, 0], + [0.7, 0, 10] +]) +rotation_3d = np.linalg.cholesky(covariance_3d) +data_3d = np.random.randn(n_samples, 3) @ rotation_3d.T + +# Center the data +mean_3d = np.mean(data_3d, axis=0) +data_centered_3d = data_3d - mean_3d + +# Compute SVD +U, S, Vt = np.linalg.svd(data_centered_3d, full_matrices=False) +V = Vt.T +S2 = S[:2] / np.sqrt(n_samples) +V2 = V[:, :2] + +# Project and reconstruct +proj_2d = data_centered_3d @ V2 +recon_3d = (proj_2d @ V2.T) + mean_3d[np.newaxis, :] + +# Create a mesh grid for the 2D PCA plane +grid_range = np.linspace(-2, 2, 100) +xx, yy = np.meshgrid(grid_range, grid_range) +plane_points = np.stack([xx.ravel(), yy.ravel()], axis=1) +plane_points *= S2[np.newaxis, :] +plane_3d = mean_3d[np.newaxis, :] + (plane_points @ V2.T) + +# Plot: 3D PCA + 2D Projection with principal components added in 2D view + +fig = plt.figure(figsize=(16, 6)) + +# 3D plot +ax1 = fig.add_subplot(121, projection='3d') +ax1.scatter(data_3d[:, 0], data_3d[:, 1], data_3d[:, 2], alpha=0.2, label='Original Data') +ax1.scatter(recon_3d[:, 0], recon_3d[:, 1], recon_3d[:, 2], alpha=0.6, label='Projected (Reconstructed) Points') +for i in range(n_samples): + ax1.plot( + [data_3d[i, 0], recon_3d[i, 0]], + [data_3d[i, 1], recon_3d[i, 1]], + [data_3d[i, 2], recon_3d[i, 2]], + 'gray', lw=0.5, alpha=0.5 + ) +ax1.plot_trisurf(plane_3d[:, 0], plane_3d[:, 1], plane_3d[:, 2], alpha=0.3, color='orange') +origin = mean_3d +ax1.quiver(*origin, *V[:, 0]*S2[0]*2, color='r', lw=2) +ax1.quiver(*origin, *V[:, 1]*S2[1]*2, color='blue', lw=2) + +ax1.set_title("PCA in 3D: Projection onto First Two PCs") +ax1.set_xlabel("X") +ax1.set_ylabel("Y") +ax1.set_zlabel("Z") +ax1.legend() + +# 2D projection plot +ax2 = fig.add_subplot(122) +ax2.scatter(proj_2d[:, 0], proj_2d[:, 1], alpha=0.8, c='orange', label='2D Projection') +# draw PC directions +ax2.plot([0, S2[0]*2], [0, 0], color='r', lw=2, label='1st PC') # x-axis +ax2.plot([0, 0], [0, S2[1]*2], color='blue', lw=2, label='2nd PC') # y-axis +ax2.axhline(0, color='gray', lw=0.5) +ax2.axvline(0, color='gray', lw=0.5) +ax2.set_title("Data Projected onto First Two Principal Components") +ax2.set_xlabel("1st PC") +ax2.set_ylabel("2nd PC") +ax2.axis('equal') +ax2.grid(True) +ax2.legend() + +plt.tight_layout() +plt.show() +``` +* **Left panel**: The original 3D data, its projection onto the best-fit 2D PCA plane (orange), and reconstruction lines showing projection error. +* **Right panel**: The same data projected onto the first two principal components, visualized in 2D. + +### Step 1: Center the Data + +We begin by centering the dataset so that the empirical mean is 0: + +$$ +\bar{\mathbf{x}} = \frac{1}{N} \sum_{i=1}^N \mathbf{x}_i, \quad \mathbf{X}_{\text{centered}} = \mathbf{X} - \mathbf{1}_N \bar{\mathbf{x}}^\top +$$ + +Define $\mathbf{X} \leftarrow \mathbf{X}_{\text{centered}}$ for the rest of the derivation. + +--- + +### Step 2: Define the Projection + +Let $\mathbf{U}_k \in \mathbb{R}^{D \times k}$ be an orthonormal matrix: $\mathbf{U}_k^\top \mathbf{U}_k = \mathbf{I}_k$. 
+ +Project each sample $\mathbf{x}_i \in \mathbb{R}^D$ onto the subspace: + +$$ +\mathbf{z}_i = \mathbf{U}_k^\top \mathbf{x}_i \quad \text{(coordinates in the new basis)} +$$ + +$$ +\hat{\mathbf{x}}_i = \mathbf{U}_k \mathbf{z}_i = \mathbf{U}_k \mathbf{U}_k^\top \mathbf{x}_i \quad \text{(projected vector)} +$$ + +The projection matrix is: + +$$ +\mathbf{P} = \mathbf{U}_k \mathbf{U}_k^\top +$$ + +--- + +### Step 3: Define the Reconstruction Error + +We want to **minimize** the total squared reconstruction error (projection error): + +$$ +\sum_{i=1}^N \left\| \mathbf{x}_i - \hat{\mathbf{x}}_i \right\|^2 += \sum_{i=1}^N \left\| \mathbf{x}_i - \mathbf{U}_k \mathbf{U}_k^\top \mathbf{x}_i \right\|^2 +$$ + +In matrix form: + +$$ +\mathcal{L}(\mathbf{U}_k) = \left\| \mathbf{X} - \mathbf{X} \mathbf{U}_k \mathbf{U}_k^\top \right\|_F^2 +$$ + +where $\|\cdot\|_F$ denotes the Frobenius norm. + + +--- + +### Step 4: Reformulate as a Maximization Problem + +Instead of minimizing reconstruction error, we **maximize the variance (energy) retained**: + +$$ +\text{maximize } \text{tr}\left( \mathbf{U}_k^\top \mathbf{X}^\top \mathbf{X} \mathbf{U}_k \right) \quad \text{subject to } \mathbf{U}_k^\top \mathbf{U}_k = \mathbf{I} +$$ + +This comes from noting: + +$$ +\|\mathbf{X} \mathbf{U}_k\|_F^2 = \sum_{i=1}^N \|\mathbf{U}_k^\top \mathbf{x}_i\|^2 = \text{tr}\left( \mathbf{U}_k^\top \mathbf{X}^\top \mathbf{X} \mathbf{U}_k \right) +$$ + +--- + +### Step 5: Solve Using the Spectral Theorem + +Let $\mathbf{X}^\top \mathbf{X} = \mathbf{M} \in \mathbb{R}^{D \times D}$. This matrix is symmetric and positive semidefinite. + +By the **spectral theorem**, there exists an orthonormal basis of eigenvectors $\mathbf{u}_1, \dots, \mathbf{u}_D$ with eigenvalues $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_D \ge 0$, such that: + +$$ +\mathbf{M} = \mathbf{X}^\top \mathbf{X} = \mathbf{U} \Lambda \mathbf{U}^\top +$$ + +Choose $\mathbf{U}_k = [\mathbf{u}_1, \dots, \mathbf{u}_k]$ to maximize $\text{tr}( \mathbf{U}_k^\top \mathbf{M} \mathbf{U}_k )$. + +This is optimal because trace is maximized by choosing eigenvectors with **largest** eigenvalues (known from Rayleigh-Ritz and Courant-Fischer principles). + +--- + +### Step 6: Compute PCA via SVD (Optional) + +Rather than computing $\mathbf{X}^\top \mathbf{X}$, you can also directly compute the **Singular Value Decomposition** of $\mathbf{X}$: + +$$ +\mathbf{X} = \mathbf{U} \Sigma \mathbf{V}^\top +$$ + +* $\mathbf{U} \in \mathbb{R}^{N \times N}$ +* $\Sigma \in \mathbb{R}^{N \times D}$ +* $\mathbf{V} \in \mathbb{R}^{D \times D}$ + +Then the principal components are the **first $k$ columns** of $\mathbf{V}$, and: + +$$ +\mathbf{Z} = \mathbf{X} \mathbf{V}_k +$$ + +is the reduced representation. + +--- + +## PCA Derivation Summary + +- **Input**: Centered data matrix \(\mathbf{X} \in \mathbb{R}^{N \times D}\) +- **Goal**: Find orthonormal matrix \(\mathbf{U}_k \in \mathbb{R}^{D \times k}\) that captures most variance +- **Solution**: Maximize \( \text{tr}(\mathbf{U}_k^\top \mathbf{X}^\top \mathbf{X} \mathbf{U}_k) \), subject to \( \mathbf{U}_k^\top \mathbf{U}_k = \mathbf{I} \) +- **Optimal**: Columns of \(\mathbf{U}_k\) are top \(k\) eigenvectors of \( \mathbf{X}^\top \mathbf{X} \) +- **Projection**: \( \mathbf{Z} = \mathbf{X} \mathbf{U}_k \) +- **Reconstruction**: \( \tilde{\mathbf{X}} = \mathbf{Z} \mathbf{U}_k^\top \) + +## PCA algorithm step by step + +1. Calculate the mean of the data +$$ \mathbf{\bar{x}} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{x}_i $$ + +2. 
Calculate the covariance matrix $\mathbf{S}$ of the data: +$$ \mathbf{S} = \frac{1}{N-1} \sum_{i=1}^{N} (\mathbf{x}_i - \mathbf{\bar{x}})(\mathbf{x}_i - \mathbf{\bar{x}})^T $$ + +Both the mean and the covariance matrix are calculated by `empirical_covariance` function. + +3. Calculate the eigenvalues $\lambda_i$ and eigenvectors $\mathbf{u}_i$ of the covariance matrix $\mathbf{S}$ +4. Sort the eigenvalues in descending order and then sort the eigenvectors accordingly. Create a principal components matrix $\mathbf{U}$ by taking the first $k$ eigenvectors, where $k$ is the number of dimensions we want to keep. + This step is implemented in the `fit` method of the `PCA` class. + 5. To project the data onto the new space, we can use the following formula: +$$ \mathbf{Y} = \mathbf{X} \cdot \mathbf{U} $$ +This step is implemented in the `transform` method of the `PCA` class. + +6. To reconstruct the data, we can use the following formula: +$$ \mathbf{\tilde{X}} = \mathbf{Y} \cdot \mathbf{U}^T + \mathbf{\bar{x}} $$ +This step is implemented in the `inverse_transform` method of the `PCA` class. + +Note that recontructing the data will not give us the original data: $\mathbf{X} \neq \mathbf{\tilde{X}}$. + +## Implementation + +For the PCA algorithm we implement `empirical_covariance` method that would be usef do calculating the covariance of the data. + +```{code-cell} ipython3 +def empirical_covariance(X): + """ + Calculates the empirical covariance matrix for a given dataset. + + Parameters: + X (numpy.ndarray): A 2D numpy array where rows represent samples and columns represent features. + + Returns: + tuple: A tuple containing the mean of the dataset and the covariance matrix. + """ + N = X.shape[0] # Number of samples + mean = X.mean(axis=0) # Calculate the mean of each feature + X_centered = X - mean[np.newaxis, :] # Center the data by subtracting the mean + covariance = X_centered.T @ X_centered / (N - 1) # Compute the covariance matrix + return mean, covariance +``` + +We also impmlement `PCA` class with `fit`, `transform` and `reverse_transform` methods. + +```{code-cell} ipython3 +class PCA: + def __init__(self, k=None): + """ + Initializes the PCA class without any components. + + Parameters: + k (int, optional): Number of principal components to use. + """ + self.pc_variances = None # Eigenvalues of the covariance matrix + self.principal_components = None # Eigenvectors of the covariance matrix + self.mean = None # Mean of the dataset + self.k = k # the number of dimensions + + def fit(self, X): + """ + Fit the PCA model to the dataset by computing the covariance matrix and its eigen decomposition. + + Parameters: + X (numpy.ndarray): The data to fit the model on. + """ + self.mean, covariance = empirical_covariance(X=X) + eig_values, eig_vectors = np.linalg.eigh(covariance) # Compute eigenvalues and eigenvectors + self.pc_variances = eig_values[::-1] # the eigenvalues are returned by eigh in ascending order. We want them in descending order (largest first) + self.principal_components = eig_vectors[:, ::-1] # the eigenvectors in same order as eingevalues + if self.k is not None: + self.pc_variances = self.pc_variances[:self.k] + self.principal_components = self.principal_components[:,:self.k] + + def transform(self, X): + """ + Transform the data into the principal component space. + + Parameters: + X (numpy.ndarray): Data to transform. + + Returns: + numpy.ndarray: Transformed data. 
+ """ + X_centered = X - self.mean + return X_centered @ self.principal_components + + def reverse_transform(self, Z): + """ + Transform data back to its original space. + + Parameters: + Z (numpy.ndarray): Transformed data to invert. + + Returns: + numpy.ndarray: Data in its original space. + """ + return Z @ self.principal_components.T + self.mean + + def variance_explained(self): + """ + Returns the amount of variance explained by the first k principal components. + + Returns: + numpy.ndarray: Variances explained by the first k components. + """ + return self.pc_variances +``` + +In the example below, we will use the PCA algorithm to reduce the dimensionality of a genetic dataset from the 1000 genomes project [1,2]. + +[1] Auton, A. et al. A global reference for human genetic variation. Nature 526, 68–74 (2015) + +[2] Altshuler, D. M. et al. Integrating common and rare genetic variation in diverse human populations. Nature 467, 52–58 (2010) + +After reducing the dimensionality, we will plot the results and examine whether clusters of ancestries are visible. + +We consider five ancestries in the dataset: + +- **EUR** - European +- **AFR** - African +- **EAS** - East Asian +- **SAS** - South Asian +- **AMR** - Native American + +```{code-cell} ipython3 +:tags: [hide-input] +snpreader = Bed('./genetic_data/example2.bed', count_A1=True) +data = snpreader.read() +print(data.shape) +# y includes our labels and x includes our features +labels = pd.read_csv("./genetic_data/1kg_annotations_edit.txt", sep="\t", index_col="Sample") +list1 = data.iid[:,1].tolist() #list with the Sample numbers present in genetic dataset +labels = labels[labels.index.isin(list1)] #filter labels DataFrame so it only contains the sampleIDs present in genetic data +y = labels.SuperPopulation # EUR, AFR, AMR, EAS, SAS +X = data.val[:, ~np.isnan(data.val).any(axis=0)] #load genetic data to X, removing NaN values +pca = PCA() +pca.fit(X=X) + +X_pc = pca.transform(X) +X_reconstruction_full = pca.reverse_transform(X_pc) +print("L1 reconstruction error for full PCA : %.4E " % (np.absolute(X - X_reconstruction_full).sum())) + +for rank in range(5): #more correct: X_pc.shape[1]+1 + pca_lowrank = PCA(k=rank) + pca_lowrank.fit(X=X) + X_lowrank = pca_lowrank.transform(X) + X_reconstruction = pca_lowrank.reverse_transform(X_lowrank) + print("L1 reconstruction error for rank %i PCA : %.4E " % (rank, np.absolute(X - X_reconstruction).sum())) + +fig = plt.figure() +plt.plot(X_pc[y=="EUR"][:,0], X_pc[y=="EUR"][:,1],'.', alpha = 0.3) +plt.plot(X_pc[y=="AFR"][:,0], X_pc[y=="AFR"][:,1],'.', alpha = 0.3) +plt.plot(X_pc[y=="EAS"][:,0], X_pc[y=="EAS"][:,1],'.', alpha = 0.3) +plt.plot(X_pc[y=="AMR"][:,0], X_pc[y=="AMR"][:,1],'.', alpha = 0.3) +plt.plot(X_pc[y=="SAS"][:,0], X_pc[y=="SAS"][:,1],'.', alpha = 0.3) +plt.xlabel("PC 1") +plt.ylabel("PC 2") +plt.legend(["EUR", "AFR","EAS","AMR","SAS"]) + +fig2 = plt.figure() +plt.plot(X_pc[y=="EUR"][:,0], X_pc[y=="EUR"][:,2],'.', alpha = 0.3) +plt.plot(X_pc[y=="AFR"][:,0], X_pc[y=="AFR"][:,2],'.', alpha = 0.3) +plt.plot(X_pc[y=="EAS"][:,0], X_pc[y=="EAS"][:,2],'.', alpha = 0.3) +plt.plot(X_pc[y=="AMR"][:,0], X_pc[y=="AMR"][:,2],'.', alpha = 0.3) +plt.plot(X_pc[y=="SAS"][:,0], X_pc[y=="SAS"][:,2],'.', alpha = 0.3) +plt.xlabel("PC 1") +plt.ylabel("PC 3") +plt.legend(["EUR", "AFR","EAS","AMR","SAS"]) + + +fig3 = plt.figure() +plt.plot(X_pc[y=="EUR"][:,1], X_pc[y=="EUR"][:,2],'.', alpha = 0.3) +plt.plot(X_pc[y=="AFR"][:,1], X_pc[y=="AFR"][:,2],'.', alpha = 0.3) +plt.plot(X_pc[y=="EAS"][:,1], 
X_pc[y=="EAS"][:,2],'.', alpha = 0.3) +plt.plot(X_pc[y=="AMR"][:,1], X_pc[y=="AMR"][:,2],'.', alpha = 0.3) +plt.plot(X_pc[y=="SAS"][:,1], X_pc[y=="SAS"][:,2],'.', alpha = 0.3) +plt.xlabel("PC 2") +plt.ylabel("PC 3") +plt.legend(["EUR", "AFR","EAS","AMR","SAS"]) + +fig4 = plt.figure() +plt.plot(pca.variance_explained()) +plt.xlabel("PC dimension") +plt.ylabel("variance explained") + +fig4 = plt.figure() +plt.plot(pca.variance_explained().cumsum() / pca.variance_explained().sum()) +plt.xlabel("PC dimension") +plt.ylabel("cumulative fraction of variance explained") +plt.show() +``` \ No newline at end of file diff --git a/book/chapter_decompositions/psd_matrices.md b/book/chapter_decompositions/psd_matrices.md index 91006b6..10047f4 100644 --- a/book/chapter_decompositions/psd_matrices.md +++ b/book/chapter_decompositions/psd_matrices.md @@ -209,3 +209,6 @@ eigenvectors of $\mathbf{A}$, and the radii of these axes are proportional to the inverse square roots of the corresponding eigenvalues. + +To demonstrate the eigenvalue decomposition of a positive semi-definite matrix, we will be looking at Principal Component Analysis (PCA) algorithm in the next section. The algorithm is a technique used for applications like dimensionality reduction, lossy data compression, feature extraction and data visualization. + diff --git a/drafts/example_genetics/pca_genetics.md b/drafts/example_genetics/pca_genetics.md index 50763b2..4fadf6e 100644 --- a/drafts/example_genetics/pca_genetics.md +++ b/drafts/example_genetics/pca_genetics.md @@ -37,7 +37,6 @@ An alternative definition of PCA is based on minimizing the sum-of-sqares of the For the PCA algorithm we implement `empirical_covariance` method that would be usef do calculating the covariance of the data. We also impmlement `PCA` class with `fit`, `transform` and `reverse_transform` methods. ```{code-cell} ipython3 - def empirical_covariance(X): """ Calculates the empirical covariance matrix for a given dataset. @@ -56,11 +55,8 @@ def empirical_covariance(X): ``` -+++ {"slideshow": {"slide_type": "subslide"}} - ```{code-cell} ipython3 - class PCA: def __init__(self, k=None): """ @@ -83,9 +79,8 @@ class PCA: """ self.mean, covariance = empirical_covariance(X=X) eig_values, eig_vectors = np.linalg.eigh(covariance) # Compute eigenvalues and eigenvectors - order = np.argsort(eig_values)[::-1] # Get indices of eigenvalues in descending order - self.pc_variances = eig_values[order] # Sort the eigenvalues - self.principal_components = eig_vectors[:, order] # Sort the eigenvectors + self.pc_variances = eig_values[::-1] # the eigenvalues are returned by eigh in ascending order. We want them in descending order (largest first) + self.principal_components = eig_vectors[:, ::-1] # the eigenvectors in same order as eingevalues if self.k is not None: self.pc_variances = self.pc_variances[:self.k] self.principal_components = self.principal_components[:,:self.k] @@ -125,8 +120,6 @@ class PCA: return self.pc_variances ``` -+++ {"slideshow": {"slide_type": "slide"}} - In the example below, we will use the PCA algorithm to reduce the dimensionality of a genetic dataset from the 1000 genomes project [1,2]. [1] Auton, A. et al. A global reference for human genetic variation. 
Nature 526, 68–74 (2015) @@ -143,11 +136,8 @@ We consider five ancestries in the dataset: - **SAS** - South Asian - **AMR** - Native American - -+++ {"slideshow": {"slide_type": "subslide"}} - - ```{code-cell} ipython3 +:tags: [hide-input] snpreader = Bed('./genetic_data/example2.bed', count_A1=True) data = snpreader.read() print(data.shape) @@ -157,14 +147,6 @@ list1 = data.iid[:,1].tolist() #list with the Sample numbers present in genetic labels = labels[labels.index.isin(list1)] #filter labels DataFrame so it only contains the sampleIDs present in genetic data y = labels.SuperPopulation # EUR, AFR, AMR, EAS, SAS X = data.val[:, ~np.isnan(data.val).any(axis=0)] #load genetic data to X, removing NaN values -``` - - -+++ {"slideshow": {"slide_type": "subslide"}} - - -```{code-cell} ipython3 - pca = PCA() pca.fit(X=X) @@ -178,13 +160,6 @@ for rank in range(5): #more correct: X_pc.shape[1]+1 X_lowrank = pca_lowrank.transform(X) X_reconstruction = pca_lowrank.reverse_transform(X_lowrank) print("L1 reconstruction error for rank %i PCA : %.4E " % (rank, np.absolute(X - X_reconstruction).sum())) -``` - - -+++ {"slideshow": {"slide_type": "subslide"}} - - -```{code-cell} ipython3 fig = plt.figure() plt.plot(X_pc[y=="EUR"][:,0], X_pc[y=="EUR"][:,1],'.', alpha = 0.3) @@ -227,5 +202,4 @@ plt.plot(pca.variance_explained().cumsum() / pca.variance_explained().sum()) plt.xlabel("PC dimension") plt.ylabel("cumulative fraction of variance explained") plt.show() -``` - +``` \ No newline at end of file From 6aadc18690a6aade7daf8ce6859dbd8c7bc52657 Mon Sep 17 00:00:00 2001 From: clippert Date: Mon, 19 May 2025 14:43:29 -0400 Subject: [PATCH 24/43] fixed bugs --- book/_toc.yml | 3 +- .../Rayleigh_quotients.md | 219 --------- book/chapter_decompositions/big_picture.md | 2 +- book/chapter_decompositions/matrix_norms.md | 13 - book/chapter_decompositions/matrix_rank.md | 6 +- book/chapter_decompositions/pca.md | 31 +- book/chapter_decompositions/psd_matrices.md | 37 +- book/chapter_decompositions/pseudoinverse.md | 37 -- book/chapter_decompositions/svd.md | 4 +- .../RBF_kernel_Positive_Definite.md | 436 ++++++++++++++++++ .../spectral_theorem_self-adjoint.md | 273 +++++++++++ 11 files changed, 750 insertions(+), 311 deletions(-) delete mode 100644 book/chapter_decompositions/Rayleigh_quotients.md delete mode 100644 book/chapter_decompositions/matrix_norms.md delete mode 100644 book/chapter_decompositions/pseudoinverse.md create mode 100644 drafts/chapter_decompositions/RBF_kernel_Positive_Definite.md create mode 100644 drafts/chapter_decompositions/spectral_theorem_self-adjoint.md diff --git a/book/_toc.yml b/book/_toc.yml index b906480..7e54824 100644 --- a/book/_toc.yml +++ b/book/_toc.yml @@ -86,12 +86,13 @@ parts: - file: chapter_decompositions/eigenvectors # end week 05 - file: chapter_decompositions/orthogonal_matrices - file: chapter_decompositions/symmetric_matrices - - file: chapter_decompositions/Rayleigh_quotients # skip for now + # - file: chapter_decompositions/Rayleigh_quotients # skip for now - file: chapter_decompositions/psd_matrices - file: chapter_decompositions/pca # PCA as example for the eigenvalue decomposition of a psd matrix title: Principal Components Analysis - file: chapter_decompositions/svd # - file: chapter_decompositions/big_picture +# - file: chapter_decompositions/RBF_kernel_Positive_Definite # - file: chapter_decompositions/pseudoinverse # - file: chapter_decompositions/matrix_norms # - file: chapter_convexity/overview_convexity diff --git 
a/book/chapter_decompositions/Rayleigh_quotients.md b/book/chapter_decompositions/Rayleigh_quotients.md deleted file mode 100644 index 976e143..0000000 --- a/book/chapter_decompositions/Rayleigh_quotients.md +++ /dev/null @@ -1,219 +0,0 @@ -# Rayleigh Quotients - -There turns out to be an interesting connection between the quadratic form of a symmetric matrix and its eigenvalues. -This connection is provided by the **Rayleigh quotient** - -> $$R_\mathbf{A}(\mathbf{x}) = \frac{\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x}}{\mathbf{x}^{\!\top\!}\mathbf{x}}$$ - -The Rayleigh quotient has a couple of important properties: - -:::{prf:lemma} Properties of the Rayleigh Quotient -:label: trm-Rayleigh-properties -:nonumber: - -(i) **Scale invariance**: for any vector $\mathbf{x} \neq \mathbf{0}$ - and any scalar $\alpha \neq 0$, - $R_\mathbf{A}(\mathbf{x}) = R_\mathbf{A}(\alpha\mathbf{x})$. - -(ii) If $\mathbf{x}$ is an eigenvector of $\mathbf{A}$ with eigenvalue - $\lambda$, then $R_\mathbf{A}(\mathbf{x}) = \lambda$. -::: - -:::{prf:proof} -(i) - - $$R_\mathbf{A}(\alpha\mathbf{x}) = \frac{(\alpha\mathbf{x})^{\!\top\!}\mathbf{A}(\alpha\mathbf{x})}{(\alpha\mathbf{x})^{\!\top\!}(\alpha\mathbf{x})} = \frac{\alpha^2}{\alpha^2}\frac{\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x}}{\mathbf{x}^{\!\top\!}\mathbf{x}}=R_\mathbf{A}(\mathbf{x}).$$ - -(ii) Let $\mathbf{x}$ be an eigenvector of $\mathbf{A}$ with eigenvalue - $\lambda$, then - - $$R_\mathbf{A}(\mathbf{x})= \frac{\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x}}{\mathbf{x}^{\!\top\!}\mathbf{x}} = \frac{\mathbf{x}^{\!\top\!}(\lambda\mathbf{x})}{\mathbf{x}^{\!\top\!}\mathbf{x}}=\lambda\frac{\mathbf{x}^{\!\top\!}\mathbf{x}}{\mathbf{x}^{\!\top\!}\mathbf{x}} = \lambda.$$ -::: - -We can further show that the Rayleigh quotient is bounded by the largest -and smallest eigenvalues of $\mathbf{A}$. - -But first we will show a useful special case of the final result. - -:::{prf:theorem} Bound Rayleigh Quotient -:label: trm-bound-Rayleigh-quotient -:nonumber: - -For any $\mathbf{x}$ such that $\|\mathbf{x}\|_2 = 1$, - -$$\lambda_{\min}(\mathbf{A}) \leq \mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} \leq \lambda_{\max}(\mathbf{A})$$ - -with equality if and only if $\mathbf{x}$ is a corresponding eigenvector. -::: - -:::{prf:proof} - -We show only the $\max$ case because the argument for the -$\min$ case is entirely analogous. - -Since $\mathbf{A}$ is symmetric, we can decompose it as -$\mathbf{A} = \mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^{\!\top\!}$. - -Then use -the change of variable $\mathbf{y} = \mathbf{Q}^{\!\top\!}\mathbf{x}$, -noting that the relationship between $\mathbf{x}$ and $\mathbf{y}$ is -one-to-one and that $\|\mathbf{y}\|_2 = 1$ since $\mathbf{Q}$ is -orthogonal. - -Hence - -$$\max_{\|\mathbf{x}\|_2 = 1} \mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} = \max_{\|\mathbf{y}\|_2 = 1} \mathbf{y}^{\!\top\!}\mathbf{\Lambda}\mathbf{y} = \max_{y_1^2+\dots+y_n^2=1} \sum_{i=1}^n \lambda_i y_i^2$$ - -Written this way, it is clear that $\mathbf{y}$ maximizes this -expression exactly if and only if it satisfies -$\sum_{i \in I} y_i^2 = 1$ where -$I = \{i : \lambda_i = \max_{j=1,\dots,n} \lambda_j = \lambda_{\max}(\mathbf{A})\}$ -and $y_j = 0$ for $j \not\in I$. - -That is, $I$ contains the index or -indices of the largest eigenvalue. 
- -In this case, the maximal value of -the expression is - -$$\sum_{i=1}^n \lambda_i y_i^2 = \sum_{i \in I} \lambda_i y_i^2 = \lambda_{\max}(\mathbf{A}) \sum_{i \in I} y_i^2 = \lambda_{\max}(\mathbf{A})$$ - -Then writing $\mathbf{q}_1, \dots, \mathbf{q}_n$ for the columns of -$\mathbf{Q}$, we have - -$$\mathbf{x} = \mathbf{Q}\mathbf{Q}^{\!\top\!}\mathbf{x} = \mathbf{Q}\mathbf{y} = \sum_{i=1}^n y_i\mathbf{q}_i = \sum_{i \in I} y_i\mathbf{q}_i$$ - -where we have used the matrix-vector product identity. - -Recall that $\mathbf{q}_1, \dots, \mathbf{q}_n$ are eigenvectors of -$\mathbf{A}$ and form an orthonormal basis for $\mathbb{R}^n$. - -Therefore by construction, the set $\{\mathbf{q}_i : i \in I\}$ forms an -orthonormal basis for the eigenspace of $\lambda_{\max}(\mathbf{A})$. - -Hence $\mathbf{x}$, which is a linear combination of these, lies in that -eigenspace and thus is an eigenvector of $\mathbf{A}$ corresponding to -$\lambda_{\max}(\mathbf{A})$. - -We have shown that -$\max_{\|\mathbf{x}\|_2 = 1} \mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} = \lambda_{\max}(\mathbf{A})$, -from which we have the general inequality -$\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} \leq \lambda_{\max}(\mathbf{A})$ -for all unit-length $\mathbf{x}$. ◻ -::: - -By the scale invariance of the Rayleigh quotient, we immediately have as -a corollary - -:::{prf:theorem} Min-Max Theorem -:label: trm-min-max -:nonumber: - -For all $\mathbf{x} \neq \mathbf{0}$, - -$$\lambda_{\min}(\mathbf{A}) \leq R_\mathbf{A}(\mathbf{x}) \leq \lambda_{\max}(\mathbf{A})$$ - -with equality if and only if $\mathbf{x}$ is a corresponding -eigenvector. -::: - -:::{prf:proof} - -Let $\mathbf{x}\neq \boldsymbol{0},$ then - -$R_\mathbf{A}(\mathbf{x}) = \frac{\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x}}{\mathbf{x}^{\!\top\!}\mathbf{x}} = \frac{\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x}}{\|\mathbf{x}\|^2} = (\frac{\mathbf{x}}{\|\mathbf{x}\|})^{\!\top\!}\mathbf{A}(\frac{\mathbf{x}}{\|\mathbf{x}\|})$ - -Thus, minimimum and maximum of the Rayleigh quotient are identical to minimum and maximum of the squared form $\mathbf{y}\mathbf{A}\mathbf{y}$ for the unit-norm vector $\mathbf{y}=\mathbf{x}/\|\mathbf{x}\|$: - -$$\lambda_{\min}(\mathbf{A}) \leq R_\mathbf{A}(\mathbf{x}) \leq \lambda_{\max}(\mathbf{A})$$ - -◻ -::: - -```{code-cell} ipython3 -:tags: [hide-input] -import numpy as np -import matplotlib.pyplot as plt - -# Define symmetric matrix -A = np.array([[2, 1], - [1, 3]]) - -# Eigenvalues and eigenvectors -eigvals, eigvecs = np.linalg.eigh(A) -λ_min, λ_max = eigvals - -# Generate unit circle points -theta = np.linspace(0, 2*np.pi, 300) -circle = np.stack((np.cos(theta), np.sin(theta))) - -# Rayleigh quotient computation -R = np.einsum('ij,ji->i', circle.T @ A, circle) # x^T A x -R /= np.einsum('ij,ji->i', circle.T, circle) # x^T x - -# Rayleigh extrema -idx_min = np.argmin(R) -idx_max = np.argmax(R) -x_min = circle[:, idx_min] -x_max = circle[:, idx_max] - -# Prepare grid for quadratic form level sets -x = np.linspace(-2, 2, 400) -y = np.linspace(-2, 2, 400) -X, Y = np.meshgrid(x, y) -XY = np.stack((X, Y), axis=-1) -Z = np.einsum('...i,ij,...j->...', XY, A, XY) -levels = np.linspace(np.min(Z), np.max(Z), 20) - -# Create combined figure -fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6)) - -# Left: Rayleigh quotient on unit circle -sc = ax1.scatter(circle[0], circle[1], c=R, cmap='viridis', s=10) -ax1.quiver(0, 0, x_min[0], x_min[1], color='red', scale=1, scale_units='xy', angles='xy', label='argmin R(x)') -ax1.quiver(0, 0, x_max[0], x_max[1], 
color='orange', scale=1, scale_units='xy', angles='xy', label='argmax R(x)') -for i in range(2): - eigvec = eigvecs[:, i] - ax1.quiver(0, 0, eigvec[0], eigvec[1], color='black', alpha=0.5, scale=1, scale_units='xy', angles='xy', width=0.008) -ax1.set_title("Rayleigh Quotient on the Unit Circle") -ax1.set_aspect('equal') -ax1.set_xlim(-1.1, 1.1) -ax1.set_ylim(-1.1, 1.1) -ax1.grid(True) -ax1.legend() -plt.colorbar(sc, ax=ax1, label="Rayleigh Quotient $R_A(\\mathbf{x})$") - -# Right: Level sets of quadratic form -contour = ax2.contour(X, Y, Z, levels=levels, cmap='viridis') -ax2.clabel(contour, inline=True, fontsize=8, fmt="%.1f") -ax2.set_title("Level Sets of $\\mathbf{x}^\\top \\mathbf{A} \\mathbf{x}$") -ax2.set_xlabel("$x_1$") -ax2.set_ylabel("$x_2$") -ax2.axhline(0, color='gray', lw=0.5) -ax2.axvline(0, color='gray', lw=0.5) -for i in range(2): - vec = eigvecs[:, i] * np.sqrt(eigvals[i]) - ax2.quiver(0, 0, vec[0], vec[1], color='red', scale=1, scale_units='xy', angles='xy', width=0.01, label=f"$\\mathbf{{q}}_{i+1}$") -ax2.set_aspect('equal') -ax2.legend() - -plt.suptitle("Rayleigh Quotient and Quadratic Form Level Sets", fontsize=16) -plt.tight_layout(rect=[0, 0, 1, 0.93]) -plt.show() -``` - -This combined visualization brings together the **Rayleigh quotient** and the **level sets of the quadratic form** $\mathbf{x}^\top \mathbf{A} \mathbf{x}$: - -* **Left panel**: Rayleigh quotient $R_\mathbf{A}(\mathbf{x})$ on the unit circle - - * Color shows how the value varies with direction. - * Extremes occur at eigenvector directions (marked with arrows). - -* **Right panel**: Level sets (contours) of the quadratic form - - * Elliptical shapes aligned with eigenvectors. - * Red vectors indicate principal axes (scaled eigenvectors). - -Together, these panels illustrate how the **direction of a vector determines how strongly it is scaled** by the symmetric matrix, and how this scaling relates to the matrix's **eigenstructure**. - -✅ As guaranteed by the **Min–Max Theorem**, the maximum and minimum of the Rayleigh quotient occur precisely at the **eigenvectors corresponding to the largest and smallest eigenvalues**. diff --git a/book/chapter_decompositions/big_picture.md b/book/chapter_decompositions/big_picture.md index 07f132d..4f6735e 100644 --- a/book/chapter_decompositions/big_picture.md +++ b/book/chapter_decompositions/big_picture.md @@ -10,7 +10,7 @@ kernelspec: language: python name: python3 --- -## The fundamental subspaces of a matrix +# The fundamental subspaces of a matrix The fundamental subspaces of a matrix $A$ are the four subspaces associated with the matrix and its transpose. These subspaces are important in linear algebra and numerical analysis, particularly in the context of solving linear systems and eigenvalue problems. 
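
The paragraph above only names the four fundamental subspaces. As a minimal sketch (not part of the patched chapter; the matrix `A` and the rank tolerance below are chosen purely for illustration), orthonormal bases for all four can be read off from the SVD in NumPy:

```{code-cell} ipython3
import numpy as np

# Illustrative 3x4 matrix of rank 2 (third row = first row + second row)
A = np.array([[1., 2., 0., 1.],
              [2., 4., 1., 3.],
              [3., 6., 1., 4.]])

U, s, Vt = np.linalg.svd(A)
tol = max(A.shape) * np.finfo(float).eps * s.max()
r = int((s > tol).sum())          # numerical rank

col_space  = U[:, :r]             # basis of the column space of A
left_null  = U[:, r:]             # basis of the null space of A^T
row_space  = Vt[:r, :].T          # basis of the row space (column space of A^T)
null_space = Vt[r:, :].T          # basis of the null space of A

print("rank:", r)
print("A @ null_space is (numerically) zero:", np.allclose(A @ null_space, 0))
print("A.T @ left_null is (numerically) zero:", np.allclose(A.T @ left_null, 0))
```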
diff --git a/book/chapter_decompositions/matrix_norms.md b/book/chapter_decompositions/matrix_norms.md deleted file mode 100644 index 3c28d81..0000000 --- a/book/chapter_decompositions/matrix_norms.md +++ /dev/null @@ -1,13 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.16.7 -kernelspec: - display_name: Python 3 - language: python - name: python3 ---- -# Matrix Norms diff --git a/book/chapter_decompositions/matrix_rank.md b/book/chapter_decompositions/matrix_rank.md index c7e3f60..5066210 100644 --- a/book/chapter_decompositions/matrix_rank.md +++ b/book/chapter_decompositions/matrix_rank.md @@ -29,7 +29,7 @@ $$ --- -### ✅ Interpretations +## ✅ Interpretations * **Column Rank**: The number of linearly independent **columns** * **Row Rank**: The number of linearly independent **rows** @@ -38,7 +38,7 @@ $$ --- -### ✅ Practical View +## ✅ Practical View To compute $\operatorname{rank}(\mathbf{A})$ in practice: @@ -47,7 +47,7 @@ To compute $\operatorname{rank}(\mathbf{A})$ in practice: --- -### 🧠 Summary +## 🧠 Summary $$ \boxed{ diff --git a/book/chapter_decompositions/pca.md b/book/chapter_decompositions/pca.md index 94a40e3..4190b33 100644 --- a/book/chapter_decompositions/pca.md +++ b/book/chapter_decompositions/pca.md @@ -10,11 +10,8 @@ kernelspec: language: python name: python3 --- -+++ {"slideshow": {"slide_type": "slide"}} - # Principal Components Analysis - Pricnipal Components Analysis (PCA) performs the orthogonal projection of the data onto a lower dimensional linear space. The goal is to find the directions (principal components) in which the variance of the data is maximized. An alternative definition of PCA is based on minimizing the sum-of-sqares of the projection errors. @@ -196,30 +193,6 @@ Choose $\mathbf{U}_k = [\mathbf{u}_1, \dots, \mathbf{u}_k]$ to maximize $\text{t This is optimal because trace is maximized by choosing eigenvectors with **largest** eigenvalues (known from Rayleigh-Ritz and Courant-Fischer principles). ---- - -### Step 6: Compute PCA via SVD (Optional) - -Rather than computing $\mathbf{X}^\top \mathbf{X}$, you can also directly compute the **Singular Value Decomposition** of $\mathbf{X}$: - -$$ -\mathbf{X} = \mathbf{U} \Sigma \mathbf{V}^\top -$$ - -* $\mathbf{U} \in \mathbb{R}^{N \times N}$ -* $\Sigma \in \mathbb{R}^{N \times D}$ -* $\mathbf{V} \in \mathbb{R}^{D \times D}$ - -Then the principal components are the **first $k$ columns** of $\mathbf{V}$, and: - -$$ -\mathbf{Z} = \mathbf{X} \mathbf{V}_k -$$ - -is the reduced representation. 
- ---- - ## PCA Derivation Summary - **Input**: Centered data matrix \(\mathbf{X} \in \mathbb{R}^{N \times D}\) @@ -358,11 +331,11 @@ We consider five ancestries in the dataset: ```{code-cell} ipython3 :tags: [hide-input] -snpreader = Bed('./genetic_data/example2.bed', count_A1=True) +snpreader = Bed('../../datasets/genetic_data_1kg/example2.bed', count_A1=True) data = snpreader.read() print(data.shape) # y includes our labels and x includes our features -labels = pd.read_csv("./genetic_data/1kg_annotations_edit.txt", sep="\t", index_col="Sample") +labels = pd.read_csv("../../datasets/genetic_data_1kg/1kg_annotations_edit.txt", sep="\t", index_col="Sample") list1 = data.iid[:,1].tolist() #list with the Sample numbers present in genetic dataset labels = labels[labels.index.isin(list1)] #filter labels DataFrame so it only contains the sampleIDs present in genetic data y = labels.SuperPopulation # EUR, AFR, AMR, EAS, SAS diff --git a/book/chapter_decompositions/psd_matrices.md b/book/chapter_decompositions/psd_matrices.md index 10047f4..90fdf61 100644 --- a/book/chapter_decompositions/psd_matrices.md +++ b/book/chapter_decompositions/psd_matrices.md @@ -1,3 +1,15 @@ +--- +jupytext: + text_representation: + extension: .md + format_name: myst + format_version: 0.13 + jupytext_version: 1.16.7 +kernelspec: + display_name: Python 3 + language: python + name: python3 +--- # Positive (semi-)definite matrices >A symmetric matrix $\mathbf{A}$ is **positive semi-definite** if for all $\mathbf{x} \in \mathbb{R}^n$, $\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} \geq 0$. @@ -25,7 +37,8 @@ its eigenvalues are positive. :::{prf:proof} Suppose $A$ is positive semi-definite, and let $\mathbf{x}$ be -an eigenvector of $\mathbf{A}$ with eigenvalue $\lambda$. Then +an eigenvector of $\mathbf{A}$ with eigenvalue $\lambda$. +Then $$0 \leq \mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} = \mathbf{x}^{\!\top\!}(\lambda\mathbf{x}) = \lambda\mathbf{x}^{\!\top\!}\mathbf{x} = \lambda\|\mathbf{x}\|_2^2$$ @@ -58,7 +71,8 @@ strictly positive, then $0 < \lambda_{\min}(\mathbf{A})$, whence it follows that $\mathbf{A}$ is positive definite. ◻ ::: -As an example of how these matrices arise, consider +## Gram matrices +In many machine learning algorithms, especially those involving regression, classification, or kernel methods, we frequently work with **data matrices** $\mathbf{A} \in \mathbb{R}^{m \times n}$, where each **row** represents a sample and each **column** a feature. From such matrices, we often compute **matrices of inner products** like $\mathbf{A}^\top \mathbf{A}$. These matrices — called **Gram matrices** — encode the pairwise **similarity between features** (or, in kernelized settings, between samples), and play a central role in optimization problems such as least squares, ridge regression, and principal component analysis. :::{prf:proposition} Gram Matrices :label: trm-gram-matrices @@ -93,9 +107,15 @@ if and only if $\mathbf{x} = \mathbf{0}$, and thus $\mathbf{A}^{\!\top\!}\mathbf{A}$ is positive definite. ◻ ::: +We observe that kernel matrices computed for all pairs of instances in a data set are positive semi definite. In fact, many kernel functions, like for example the RBF kernel, guarantee positive definiteness of the kernel matrix as long as all data points are pairwise distinct. + + +## Invertibility + Positive definite matrices are invertible (since their eigenvalues are -nonzero), whereas positive semi-definite matrices might not be. 
However, -if you already have a positive semi-definite matrix, it is possible to +nonzero), whereas positive semi-definite matrices might not be. + +However, if you already have a positive semi-definite matrix, it is possible to perturb its diagonal slightly to produce a positive definite matrix. :::{prf:proposition} @@ -121,12 +141,15 @@ $\mathbf{A}^{\!\top\!}\mathbf{A} + \epsilon\mathbf{I}$ is positive definite (and in particular, invertible) for *any* matrix $\mathbf{A}$ and any $\epsilon > 0$. -### The geometry of positive definite quadratic forms +## The geometry of positive definite quadratic forms A useful way to understand quadratic forms is by the geometry of their -level sets. A **level set** or **isocontour** of a function is the set +level sets. +A **level set** or **isocontour** of a function is the set of all inputs such that the function applied to those inputs yields a -given output. Mathematically, the $c$-isocontour of $f$ is +given output. + +Mathematically, the $c$-isocontour of $f$ is $\{\mathbf{x} \in \operatorname{dom} f : f(\mathbf{x}) = c\}$. Let us consider the special case diff --git a/book/chapter_decompositions/pseudoinverse.md b/book/chapter_decompositions/pseudoinverse.md deleted file mode 100644 index 1b016db..0000000 --- a/book/chapter_decompositions/pseudoinverse.md +++ /dev/null @@ -1,37 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.16.7 -kernelspec: - display_name: Python 3 - language: python - name: python3 ---- -# Moore-Penrose Pseudoinverse -The Moore-Penrose pseudoinverse is a generalization of the matrix inverse that can be applied to non-square or singular matrices. It is denoted as $ A^+ $ for a matrix $ A $. The pseudoinverse satisfies the following properties: -1. **Existence**: The pseudoinverse exists for any matrix $ A $. -2. **Uniqueness**: The pseudoinverse is unique. -3. **Properties**: - - $ A A^+ A = A $ - - $ A^+ A A^+ = A^+ $ - - $ (A A^+)^\top = A A^+ $ - - $ (A^+ A)^\top = A^+ A $ -4. **Rank**: The rank of $ A^+ $ is equal to the rank of $ A $. -5. **Singular Value Decomposition (SVD)**: The pseudoinverse can be computed using the singular value decomposition of $ A $. If $ A = U \Sigma V^\top $, where $ U $ and $ V $ are orthogonal matrices and $ \Sigma $ is a diagonal matrix with singular values, then: - - $$ - A^+ = V \Sigma^+ U^\top - $$ - where $ \Sigma^+ $ is obtained by taking the reciprocal of the non-zero singular values in $ \Sigma $ and transposing the resulting matrix. -6. **Applications**: The pseudoinverse is used in various applications, including solving linear systems, least squares problems, and in machine learning algorithms such as linear regression. -7. **Least Squares Solution**: The pseudoinverse provides a least squares solution to the equation $ Ax = b $ when $ A $ is not square or has no unique solution. The least squares solution is given by: - - $$ - x = A^+ b - $$ -8. **Geometric Interpretation**: The pseudoinverse can be interpreted geometrically as the projection of a vector onto the column space of $ A $. -9. **Computational Considerations**: The computation of the pseudoinverse can be done efficiently using numerical methods, such as the SVD, especially for large matrices. -10. **Limitations**: The pseudoinverse may not be suitable for all applications, especially when the matrix is ill-conditioned or has a high condition number. 
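
The file deleted here (and re-added as a stub in a later patch of this series) describes the SVD construction $A^+ = V \Sigma^+ U^\top$ and the least-squares use of the pseudoinverse. For reference, a minimal sketch of both, with an arbitrary full-column-rank matrix, might look as follows:

```{code-cell} ipython3
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(6, 3))   # tall matrix; full column rank for this illustration
b = rng.normal(size=6)

# Pseudoinverse via the SVD (all singular values are nonzero here)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_pinv = Vt.T @ np.diag(1.0 / s) @ U.T

print(np.allclose(A_pinv, np.linalg.pinv(A)))   # agrees with NumPy's pinv
print(np.allclose(A @ A_pinv @ A, A))           # defining property A A^+ A = A

# Least-squares solution of A x ~ b
x = A_pinv @ b
print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0]))
```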
diff --git a/book/chapter_decompositions/svd.md b/book/chapter_decompositions/svd.md index 72f1eb2..60bd36a 100644 --- a/book/chapter_decompositions/svd.md +++ b/book/chapter_decompositions/svd.md @@ -1,4 +1,4 @@ -## Singular value decomposition +# Singular value decomposition Singular value decomposition (SVD) is a widely applicable tool in linear algebra. Its strength stems partially from the fact that *every matrix* @@ -46,6 +46,8 @@ $\mathbf{A}\mathbf{A}^{\!\top\!}$)[^5]. ## Some useful matrix identities +In the following, we present a number of important identities for the SVD. + ### Matrix-vector product as linear combination of matrix columns *Proposition.* diff --git a/drafts/chapter_decompositions/RBF_kernel_Positive_Definite.md b/drafts/chapter_decompositions/RBF_kernel_Positive_Definite.md new file mode 100644 index 0000000..6b5e410 --- /dev/null +++ b/drafts/chapter_decompositions/RBF_kernel_Positive_Definite.md @@ -0,0 +1,436 @@ +--- +jupytext: + text_representation: + extension: .md + format_name: myst + format_version: 0.13 + jupytext_version: 1.16.7 +kernelspec: + display_name: Python 3 + language: python + name: python3 +--- +# RBF Kernel Positive Definite + +In this chapter, we will state and prove Mercer's theorem, showing that a set of kernel's, so called Mercer kernels exist that represent infinite dimensional reproducing kernel Hilbert spaces. Mercer kernels always produce positive definite kernel matrices. + +## Mercer's Theorem + +Mercer’s Theorem is a cornerstone in understanding **positive-definite kernels** and their representation in **reproducing kernel Hilbert spaces (RKHS)** — foundational for kernel methods like SVMs and kernel PCA. + +Below is a careful statement and proof outline of **Mercer’s Theorem**, suitable for a course that has covered eigenvalues, symmetric matrices, and function spaces. + +--- + +## 📜 Mercer’s Theorem (Simplified Version) + +Let $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a **symmetric, continuous, positive semi-definite kernel** function on a **compact domain** $\mathcal{X} \subset \mathbb{R}^d$. + +> Then there exists an **orthonormal basis** $\{\phi_i\}_{i=1}^\infty$ of $L^2(\mathcal{X})$, and **non-negative eigenvalues** $\{\lambda_i\}_{i=1}^\infty$, such that: +> +> $$ +> k(x, x') = \sum_{i=1}^{\infty} \lambda_i \phi_i(x) \phi_i(x') \quad \text{with convergence in } L^2(\mathcal{X} \times \mathcal{X}) +> $$ +> +> Furthermore, the integral operator: +> +> $$ +> (Tf)(x) := \int_{\mathcal{X}} k(x, x') f(x') dx' +> $$ +> +> is **compact, self-adjoint**, and **positive semi-definite** on $L^2(\mathcal{X})$. + +--- + +## 🧠 Intuition + +Mercer’s Theorem says: + +* A symmetric, continuous, PSD kernel defines a **nice integral operator** on functions. +* That operator has a **spectral decomposition**, just like symmetric matrices do. +* The kernel function $k(x, x')$ can be written as a **sum over eigenfunctions** of this operator, just like how a Gram matrix can be decomposed as $K = \sum \lambda_i u_i u_i^\top$. 
+ +This justifies using **feature maps** $\phi_i(x) = \sqrt{\lambda_i} \psi_i(x)$ and writing: + +$$ +k(x, x') = \langle \phi(x), \phi(x') \rangle_{\ell^2} +$$ + +--- + +## ✍️ Sketch of the Proof + +### Step 1: Define the Integral Operator + +Given a kernel $k(x, x')$, define an operator $T$ on $L^2(\mathcal{X})$ by: + +$$ +(Tf)(x) = \int_{\mathcal{X}} k(x, x') f(x') dx' +$$ + +* $T$ is **linear** +* $T$ is **self-adjoint** since $k(x, x') = k(x', x)$ +* $T$ is **compact**, due to continuity of $k$ on a compact domain + +### Step 2: Apply the Spectral Theorem for Compact Self-Adjoint Operators + +From functional analysis: + +* $T$ has an orthonormal basis of eigenfunctions $\{\phi_i\}_{i=1}^\infty$ +* Corresponding eigenvalues $\lambda_i \geq 0$ (since $T$ is PSD) + +### Step 3: Represent the Kernel + +Show that: + +$$ +k(x, x') = \sum_{i=1}^\infty \lambda_i \phi_i(x) \phi_i(x') +$$ + +This expansion converges **absolutely and uniformly** on $\mathcal{X} \times \mathcal{X}$ if $k$ is continuous. + +### Step 4: Show PSD and Feature Map Representation + +From the expansion, define the map: + +$$ +\phi(x) := \left( \sqrt{\lambda_1} \phi_1(x), \sqrt{\lambda_2} \phi_2(x), \dots \right) +$$ + +Then: + +$$ +k(x, x') = \langle \phi(x), \phi(x') \rangle_{\ell^2} +$$ + +So the kernel is **an inner product in an infinite-dimensional Hilbert space** — justifying its use in kernel methods. + +--- + +## ✅ Summary Box + + +**Mercer's Theorem (simplified)** + +Let $ k(x, x') $ be a continuous, symmetric, positive semi-definite kernel on a compact domain $ \mathcal{X} \subset \mathbb{R}^d $. + +Then there exist orthonormal functions $ \phi_i \in L^2(\mathcal{X}) $ and eigenvalues $ \lambda_i \geq 0 $ such that: + +$$ +k(x, x') = \sum_{i=1}^{\infty} \lambda_i \phi_i(x) \phi_i(x') +$$ +with convergence in $ L^2(\mathcal{X} \times \mathcal{X}) $. + +Moreover, $ k $ defines a compact, self-adjoint, PSD operator on $ L^2(\mathcal{X}) $. + + + + +## 🧠 Setup: What is the RBF kernel? + +Let $\mathbf{x}_1, \dots, \mathbf{x}_n \in \mathbb{R}^d$ be a set of data points. + +The **RBF kernel** (also called Gaussian kernel) is defined as: + +$$ +k(\mathbf{x}, \mathbf{x}') = \exp\left(-\gamma \|\mathbf{x} - \mathbf{x}'\|^2\right) +\quad \text{for } \gamma > 0 +$$ + +The **RBF kernel matrix** $\mathbf{K} \in \mathbb{R}^{n \times n}$ has entries: + +$$ +\mathbf{K}_{ij} = k(\mathbf{x}_i, \mathbf{x}_j) +$$ + +--- + +## ✅ Claim + +> The RBF kernel matrix $\mathbf{K}$ is **positive semi-definite** for all $\gamma > 0$, i.e., for any $\mathbf{c} \in \mathbb{R}^n$, + +$$ +\mathbf{c}^\top \mathbf{K} \mathbf{c} \geq 0 +$$ + +Moreover, if all $\mathbf{x}_i$ are distinct, then $\mathbf{K}$ is **positive definite**. + +--- + +## ✍️ Proof (via Mercer's Theorem / Fourier Representation) + +The RBF kernel is a special case of a **positive-definite kernel** as characterized by Mercer's theorem, but here’s a more constructive argument: + +### Step 1: Express the kernel as an inner product in an infinite-dimensional feature space. 
+ +Let’s define the feature map $\phi: \mathbb{R}^d \to \ell^2$ via: + +$$ +\phi(\mathbf{x}) = \left( \sqrt{a_k} \, \psi_k(\mathbf{x}) \right)_{k=1}^{\infty} +$$ + +such that: + +$$ +k(\mathbf{x}, \mathbf{x}') = \langle \phi(\mathbf{x}), \phi(\mathbf{x}') \rangle +$$ + +It is known (e.g., via Taylor expansion or Fourier basis) that the RBF kernel corresponds to an **inner product in an infinite-dimensional Hilbert space**, and hence: + +$$ +k(\mathbf{x}_i, \mathbf{x}_j) = \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle +\Rightarrow +\mathbf{K}_{ij} = \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle +$$ + +Then, for any $\mathbf{c} \in \mathbb{R}^n$: + +$$ +\mathbf{c}^\top \mathbf{K} \mathbf{c} += \sum_{i,j=1}^n c_i c_j \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle += \left\| \sum_{i=1}^n c_i \phi(\mathbf{x}_i) \right\|^2 \geq 0 +$$ + +✅ Hence, $\mathbf{K}$ is **positive semi-definite**. + +--- + +### 🚀 Positive definiteness + +If the $\mathbf{x}_i$ are **pairwise distinct**, then the feature vectors $\phi(\mathbf{x}_i)$ are **linearly independent** in the Hilbert space, and the only way for the sum to vanish is $\mathbf{c} = 0$. Hence: + +$$ +\mathbf{c}^\top \mathbf{K} \mathbf{c} > 0 \quad \text{for all } \mathbf{c} \ne 0 +$$ + +✅ So $\mathbf{K}$ is **positive definite** if all data points are distinct. + +--- + +### 📦 Summary + + +**Proposition**: The RBF kernel matrix $ \mathbf{K}_{ij} = \exp(-\gamma \|\mathbf{x}_i - \mathbf{x}_j\|^2) $ is positive semi-definite for all $\gamma > 0$, and positive definite if all $\mathbf{x}_i$ are distinct. + +**Proof sketch**: The kernel function is an inner product in a Hilbert space, so the Gram matrix $ \mathbf{K} $ has the form $ \mathbf{K} = \Phi \Phi^\top $, which is always PSD. + +## **proof by induction** that the **RBF kernel matrix is positive semi-definite**, based on verifying the PSD property for matrices of increasing size. This approach is constructive, concrete, and aligns well with students familiar with induction and Gram matrices. + + +--- + +## 🧩 Goal + +Let $K \in \mathbb{R}^{n \times n}$, with entries: + +$$ +K_{ij} = \exp(-\gamma \|\mathbf{x}_i - \mathbf{x}_j\|^2) +$$ + +We aim to prove: + +> For any $n \in \mathbb{N}$, $K$ is **positive semi-definite**, i.e., for all $\mathbf{c} \in \mathbb{R}^n$: + +$$ +\mathbf{c}^\top K \mathbf{c} \geq 0 +$$ + +--- + +## 🧠 Strategy: Induction on Matrix Size $n$ + +Let’s prove it by **induction on $n$**, the number of input points $\mathbf{x}_1, \dots, \mathbf{x}_n \in \mathbb{R}^d$. + +--- + +### 🧱 Base Case $n = 1$ + +We have: + +$$ +K = [1] \quad \text{since } \|\mathbf{x}_1 - \mathbf{x}_1\|^2 = 0 \Rightarrow K_{11} = \exp(0) = 1 +$$ + +Then for any $c \in \mathbb{R}$: + +$$ +c^\top K c = c^2 \cdot 1 = c^2 \geq 0 +$$ + +✅ Base case holds. + +--- + +### 🔁 Inductive Hypothesis + +Assume that for some $n$, the kernel matrix $K_n \in \mathbb{R}^{n \times n}$ formed from $\mathbf{x}_1, \dots, \mathbf{x}_n$ is **positive semi-definite**. 
+ +--- + +### 🔄 Inductive Step: $n+1$ + +We add a new point $\mathbf{x}_{n+1}$ and form the $(n+1) \times (n+1)$ matrix $K_{n+1}$: + +$$ +K_{n+1} = +\begin{bmatrix} +K_n & \mathbf{k} \\ +\mathbf{k}^\top & 1 +\end{bmatrix} +$$ + +where: + +* $K_n \in \mathbb{R}^{n \times n}$ is the existing RBF matrix (assumed PSD) +* $\mathbf{k} \in \mathbb{R}^n$, with entries $k_i = \exp(-\gamma \|\mathbf{x}_i - \mathbf{x}_{n+1}\|^2)$ +* The bottom-right entry is $k(\mathbf{x}_{n+1}, \mathbf{x}_{n+1}) = 1$ + +Let $\mathbf{c} \in \mathbb{R}^{n+1}$, split as: + +$$ +\mathbf{c} = \begin{bmatrix} \mathbf{a} \\ b \end{bmatrix}, \quad \mathbf{a} \in \mathbb{R}^n, \ b \in \mathbb{R} +$$ + +Then: + +$$ +\mathbf{c}^\top K_{n+1} \mathbf{c} = +\begin{bmatrix} \mathbf{a}^\top & b \end{bmatrix} +\begin{bmatrix} +K_n & \mathbf{k} \\ +\mathbf{k}^\top & 1 +\end{bmatrix} +\begin{bmatrix} \mathbf{a} \\ b \end{bmatrix} += \mathbf{a}^\top K_n \mathbf{a} + 2b \mathbf{k}^\top \mathbf{a} + b^2 +$$ + +Let’s define: + +$$ +f(b) = \mathbf{a}^\top K_n \mathbf{a} + 2b \mathbf{k}^\top \mathbf{a} + b^2 += \left( b + \mathbf{k}^\top \mathbf{a} \right)^2 + \left( \mathbf{a}^\top K_n \mathbf{a} - (\mathbf{k}^\top \mathbf{a})^2 \right) +$$ + +Note: + +* The first term $\left(b + \mathbf{k}^\top \mathbf{a} \right)^2 \geq 0$ +* By the **Cauchy-Schwarz inequality**, if $K_n$ is a Gram matrix (as is the case here), then: + + $$ + (\mathbf{k}^\top \mathbf{a})^2 \leq \mathbf{a}^\top K_n \mathbf{a} + \Rightarrow \mathbf{a}^\top K_n \mathbf{a} - (\mathbf{k}^\top \mathbf{a})^2 \geq 0 + $$ + +✅ Therefore, $f(b) \geq 0$ for all $\mathbf{a}, b$, i.e., $K_{n+1}$ is PSD. + +--- + +### ✅ Conclusion + +By induction, all RBF kernel matrices $K_n \in \mathbb{R}^{n \times n}$ are **positive semi-definite** for all $n$. + +--- + +### 📦 Summary + +**Theorem**: RBF kernel matrices are positive semi-definite for all n and all γ > 0. + +**Proof**: By induction on the number of data points n, using the structure of the kernel matrix +and properties of quadratic forms and Cauchy-Schwarz inequality. + +## EXpanding the Cauchy-Schwarz step in the proof +### 🎯 The Step in Question + +In the inductive proof of PSD for the RBF kernel matrix, we reached this expression for any vector $\mathbf{c} = \begin{bmatrix} \mathbf{a} \\ b \end{bmatrix} \in \mathbb{R}^{n+1}$: + +$$ +\mathbf{c}^\top K_{n+1} \mathbf{c} = \left( b + \mathbf{k}^\top \mathbf{a} \right)^2 + \left( \mathbf{a}^\top K_n \mathbf{a} - (\mathbf{k}^\top \mathbf{a})^2 \right) +$$ + +We want to argue that: + +$$ +\mathbf{a}^\top K_n \mathbf{a} - (\mathbf{k}^\top \mathbf{a})^2 \geq 0 +$$ + +This is the **Cauchy-Schwarz step** — and here’s what it means. + +--- + +## 🧠 Setting + +* $K_n \in \mathbb{R}^{n \times n}$ is an **RBF kernel matrix**: + + $$ + K_n = \left[ \exp(-\gamma \|\mathbf{x}_i - \mathbf{x}_j\|^2) \right]_{i,j=1}^n + $$ +* The vector $\mathbf{k} \in \mathbb{R}^n$ has entries $k_i = \exp(-\gamma \|\mathbf{x}_i - \mathbf{x}_{n+1}\|^2)$ + +We assume (by the inductive hypothesis) that $K_n$ is **positive semi-definite**, which means it is a **Gram matrix**: it can be written as + +$$ +K_n = \Phi \Phi^\top +$$ + +for some (possibly infinite-dimensional) feature map $\phi(\mathbf{x})$, where: + +$$ +\Phi = +\begin{bmatrix} +\phi(\mathbf{x}_1)^\top \\ +\vdots \\ +\phi(\mathbf{x}_n)^\top +\end{bmatrix} +\in \mathbb{R}^{n \times d} +$$ + +and $\phi(\mathbf{x}_i) \in \mathbb{R}^d$ or a Hilbert space. 
+ +--- + +## ✅ Step Explained Using Inner Products + +Let’s define: + +* $\mathbf{u} = \sum_{i=1}^n a_i \phi(\mathbf{x}_i)$ +* $\mathbf{v} = \phi(\mathbf{x}_{n+1})$ + +Then: + +* $\mathbf{a}^\top K_n \mathbf{a} = \|\mathbf{u}\|^2$ +* $\mathbf{k}^\top \mathbf{a} = \langle \mathbf{u}, \mathbf{v} \rangle$, since $k_i = \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_{n+1}) \rangle$ + +Now apply the **Cauchy–Schwarz inequality** in the inner product space: + +$$ +|\langle \mathbf{u}, \mathbf{v} \rangle|^2 \leq \|\mathbf{u}\|^2 \cdot \|\mathbf{v}\|^2 +$$ + +In our case: + +* $\mathbf{a}^\top K_n \mathbf{a} = \|\mathbf{u}\|^2$ +* $(\mathbf{k}^\top \mathbf{a})^2 = |\langle \mathbf{u}, \mathbf{v} \rangle|^2$ +* $\|\mathbf{v}\|^2 = k(\mathbf{x}_{n+1}, \mathbf{x}_{n+1}) = 1$ + +So: + +$$ +(\mathbf{k}^\top \mathbf{a})^2 \leq \mathbf{a}^\top K_n \mathbf{a} +\quad \Rightarrow \quad +\mathbf{a}^\top K_n \mathbf{a} - (\mathbf{k}^\top \mathbf{a})^2 \geq 0 +$$ + +✅ This guarantees that the second term in our decomposition is **non-negative**, which is what we needed to conclude PSD. + +--- + +### 📌 Summary + +* The RBF kernel matrix $K_n$ is a **Gram matrix**: $K_n = \Phi \Phi^\top$ +* So any quadratic form $\mathbf{a}^\top K_n \mathbf{a}$ is a **squared norm**: $\|\sum a_i \phi(\mathbf{x}_i)\|^2$ +* The dot product with $\phi(\mathbf{x}_{n+1})$ is **bounded** by Cauchy-Schwarz: + + $$ + (\mathbf{k}^\top \mathbf{a})^2 = |\langle \mathbf{u}, \mathbf{v} \rangle|^2 \leq \|\mathbf{u}\|^2 + $$ + diff --git a/drafts/chapter_decompositions/spectral_theorem_self-adjoint.md b/drafts/chapter_decompositions/spectral_theorem_self-adjoint.md new file mode 100644 index 0000000..6818c98 --- /dev/null +++ b/drafts/chapter_decompositions/spectral_theorem_self-adjoint.md @@ -0,0 +1,273 @@ +--- +jupytext: + text_representation: + extension: .md + format_name: myst + format_version: 0.13 + jupytext_version: 1.16.7 +kernelspec: + display_name: Python 3 + language: python + name: python3 +--- +# 📜 Spectral Theorem for Compact Self-Adjoint Operators + +:::{prf:theorem} Compact, Self-Adjoint Linear Operator +:label: def-compact-selfadjoint-operator +:nonumber: + +Let $ \mathcal{H} $ be a Hilbert space. + +A linear operator $ T : \mathcal{H} \to \mathcal{H} $ is called a **compact, self-adjoint linear operator** if it satisfies the following properties: + +1. **Linearity**: + + $$ + T(\alpha f + \beta g) = \alpha T(f) + \beta T(g) + \quad \text{for all } f, g \in \mathcal{H}, \ \alpha, \beta \in \mathbb{R} \text{ (or } \mathbb{C} \text{)} + $$ + +2. **Self-Adjointness**: + + $$ + \langle T f, g \rangle = \langle f, T g \rangle + \quad \text{for all } f, g \in \mathcal{H} + $$ + +3. **Compactness**: + For every bounded sequence $ \{f_n\} \subset \mathcal{H} $, the sequence $ \{T f_n\} $ has a **convergent subsequence** in $ \mathcal{H} $. +::: + +Here is a clear and formal example — written as a MyST proof block — of a **compact, self-adjoint linear operator** on the Hilbert space $L^2([0, 1])$, using an **integral operator with a continuous symmetric kernel**. 
+ +--- + + +:::{prf:theorem} Integral Operator on $ L^2([0, 1]) $ +:label: ex-integral-operator-compact-selfadjoint +:nonumber: + +Let $ \mathcal{H} = L^2([0, 1]) $ and let $ k : [0, 1] \times [0, 1] \to \mathbb{R} $ be a **continuous**, **symmetric** function, i.e., + +$$ +k(x, y) = k(y, x) \quad \text{for all } x, y \in [0, 1] +$$ + +Define the operator $ T : \mathcal{H} \to \mathcal{H} $ by: + +$$ +(Tf)(x) = \int_0^1 k(x, y) f(y) \, dy +$$ + +Then $ T $ is a **compact, self-adjoint linear operator**: + +- **Linearity**: follows directly from the linearity of the integral. +- **Self-adjointness**: for all $ f, g \in L^2([0, 1]) $, + +$$ +\langle T f, g \rangle = \int_0^1 \left( \int_0^1 k(x, y) f(y) \, dy \right) g(x) \, dx += \int_0^1 f(y) \left( \int_0^1 k(x, y) g(x) \, dx \right) dy += \langle f, T g \rangle +$$ + +by symmetry of $ k(x, y) $. + +- **Compactness**: Since $ k $ is continuous on a compact domain $ [0, 1]^2 $, the operator $ T $ is compact (by the Arzelà–Ascoli theorem or the Hilbert–Schmidt theorem). + +Thus, $ T $ satisfies all the conditions of a compact, self-adjoint linear operator. +::: + + +:::{prf:theorem} RBF Kernel Operator on $ L^2([0, 1]) $ +:label: ex-rbf-kernel-operator +:nonumber: + +Let $ \mathcal{H} = L^2([0, 1]) $, and let $ \gamma > 0 $. Define the kernel: + +$$ +k(x, y) = \exp(-\gamma (x - y)^2) +$$ + +This is the **Radial Basis Function (RBF) kernel**, which is: + +- **continuous** on $ [0, 1]^2 $, +- **symmetric**, i.e., $ k(x, y) = k(y, x) $, +- **positive definite**, meaning it induces a positive semi-definite kernel matrix for any finite sample. + +Then the integral operator + +$$ +(Tf)(x) = \int_0^1 \exp(-\gamma (x - y)^2) f(y) \, dy +$$ + +defines a **compact, self-adjoint linear operator** on $ L^2([0, 1]) $. + +::: + +:::{prf:theorem} Brownian Motion Kernel Operator on $ L^2([0, 1]) $ +:label: ex-min-kernel-operator +:nonumber: + +Let $ k(x, y) = \min(x, y) $, defined on $ [0, 1] \times [0, 1] $. This kernel is: + +- **continuous** and **symmetric**: $ \min(x, y) = \min(y, x) $ +- **positive semi-definite**: it corresponds to the covariance function of standard Brownian motion. + +The integral operator: + +$$ +(Tf)(x) = \int_0^1 \min(x, y) f(y) \, dy +$$ + +is known as the **Volterra operator** associated with Brownian motion. It is: + +- **linear** +- **self-adjoint** (via symmetry of $ \min(x, y) $) +- **compact**, since it is a Hilbert–Schmidt operator with square-integrable kernel. + +Thus, it is a **compact, self-adjoint linear operator** on $ L^2([0, 1]) $. + +::: + + + + +Let $\mathcal{H}$ be a real or complex **Hilbert space**, and let +$T : \mathcal{H} \to \mathcal{H}$ be a **compact, self-adjoint linear operator**. + +> Then: +> +> 1. There exists an **orthonormal basis** $\{\phi_i\}_{i \in \mathbb{N}}$ of $\overline{\operatorname{im}(T)} \subseteq \mathcal{H}$ consisting of **eigenvectors of $T$**. +> +> 2. The corresponding eigenvalues $\{\lambda_i\} \subset \mathbb{R}$ are real, with $\lambda_i \to 0$. +> +> 3. $T$ has at most countably many non-zero eigenvalues, and each non-zero eigenvalue has **finite multiplicity**. +> +> 4. For all $f \in \mathcal{H}$, we have: +> +> $$ +> T f = \sum_{i=1}^\infty \lambda_i \langle f, \phi_i \rangle \phi_i +> $$ +> +> where the sum converges in norm (i.e., in $\mathcal{H}$). + +--- + +## 🧠 Intuition + +* Compactness of $T$ is like “finite rank behavior” at infinity. +* Self-adjointness ensures that the eigenvalues are real, and eigenvectors for distinct eigenvalues are orthogonal. 
+* The spectrum of $T$ consists of **eigenvalues only**, accumulating at 0. +* We can **diagonalize** $T$ in an orthonormal eigenbasis — exactly like symmetric matrices. + +--- + +## ✍️ Sketch of the Proof + +We split the proof into a sequence of known results. + +--- + +### 1. **Existence of a Maximum Eigenvalue** + +Let $T$ be compact and self-adjoint. Define: + +$$ +\lambda_1 = \sup_{\|f\| = 1} \langle Tf, f \rangle +$$ + +This is the **Rayleigh quotient**, and it gives the largest eigenvalue in magnitude. The supremum is **attained** (due to compactness), and the maximizer $f_1$ satisfies: + +$$ +Tf_1 = \lambda_1 f_1 +$$ + +--- + +### 2. **Orthogonalization and Iteration (like Gram-Schmidt)** + +Define $\mathcal{H}_1 = \{f \in \mathcal{H} : \langle f, f_1 \rangle = 0\}$. Restrict $T$ to $\mathcal{H}_1$, where it remains compact and self-adjoint. Then find the next eigenpair $(\lambda_2, f_2)$, and repeat. + +This gives an **orthonormal sequence** of eigenfunctions $\{f_i\}$ with real eigenvalues $\lambda_i \to 0$, due to compactness. + +--- + +### 3. **Convergence of Spectral Expansion** + +For any $f \in \mathcal{H}$, let: + +$$ +f = \sum_{i=1}^\infty \langle f, \phi_i \rangle \phi_i + f_\perp +$$ + +where $f_\perp \in \ker(T)$. Then: + +$$ +Tf = \sum_{i=1}^\infty \lambda_i \langle f, \phi_i \rangle \phi_i +$$ + +The convergence is in $\mathcal{H}$-norm, using Parseval's identity and the fact that $\lambda_i \to 0$. + +--- + +### ✅ Summary Box + +**Spectral Theorem (Compact Self-Adjoint Operators)** + +Let $ T : \mathcal{H} \to \mathcal{H} $ be compact and self-adjoint. + +Then there exists an orthonormal basis $ \{\phi_i\} \subset \mathcal{H} $ consisting of eigenvectors of $ T $, with corresponding real eigenvalues $ \lambda_i \to 0 $, such that: + +$$ +T f = \sum_{i=1}^\infty \lambda_i \langle f, \phi_i \rangle \phi_i +\quad \text{for all } f \in \mathcal{H} +$$ + + +--- + +This result is the infinite-dimensional generalization of the fact that a real symmetric matrix has an orthonormal eigenbasis and can be diagonalized. + +```{code-cell} ipython3 +:tags: [hide-input] +import numpy as np +import matplotlib.pyplot as plt + +# Define domain +n = 200 +x = np.linspace(0, 1, n) +X, Y = np.meshgrid(x, x) + +# Define RBF kernel +gamma = 50 +rbf_kernel = np.exp(-gamma * (X - Y) ** 2) + +# Define min kernel +min_kernel = np.minimum(X, Y) + +# Plotting both kernels side-by-side +fig, axs = plt.subplots(1, 2, figsize=(14, 5)) + +# Plot RBF kernel +im0 = axs[0].imshow(rbf_kernel, extent=[0,1,0,1], origin='lower', cmap='viridis') +axs[0].set_title('RBF Kernel $k(x,y) = \exp(-\\gamma (x-y)^2)$') +axs[0].set_xlabel('x') +axs[0].set_ylabel('y') +fig.colorbar(im0, ax=axs[0]) + +# Plot min kernel +im1 = axs[1].imshow(min_kernel, extent=[0,1,0,1], origin='lower', cmap='viridis') +axs[1].set_title('Min Kernel $k(x,y) = \min(x, y)$') +axs[1].set_xlabel('x') +axs[1].set_ylabel('y') +fig.colorbar(im1, ax=axs[1]) + +plt.tight_layout() +plt.show() +``` +This visualization shows two symmetric, continuous kernels defined on $[0, 1]^2$, each inducing a compact, self-adjoint integral operator on $L^2([0, 1])$: + +* **Left panel**: The RBF kernel $k(x, y) = \exp(-\gamma(x - y)^2)$, concentrated along the diagonal where $x \approx y$, modeling local similarity. +* **Right panel**: The Brownian motion kernel $k(x, y) = \min(x, y)$, forming a triangular structure that accumulates information from the origin. 
+ +Both kernels generate PSD Gram matrices and operators with eigenfunction decompositions — perfect for illustrating Mercer's theorem in practice. From 53e548fd6185d27e4b3900b9302057d32b901e84 Mon Sep 17 00:00:00 2001 From: clippert Date: Tue, 20 May 2025 05:17:19 -0400 Subject: [PATCH 25/43] added back fixed version of Rayleih quotients no warnings compilation --- book/_toc.yml | 2 +- .../Rayleigh_quotients.md | 231 ++++++++++++++++++ 2 files changed, 232 insertions(+), 1 deletion(-) create mode 100644 book/chapter_decompositions/Rayleigh_quotients.md diff --git a/book/_toc.yml b/book/_toc.yml index 7e54824..e040d36 100644 --- a/book/_toc.yml +++ b/book/_toc.yml @@ -86,7 +86,7 @@ parts: - file: chapter_decompositions/eigenvectors # end week 05 - file: chapter_decompositions/orthogonal_matrices - file: chapter_decompositions/symmetric_matrices - # - file: chapter_decompositions/Rayleigh_quotients # skip for now + - file: chapter_decompositions/Rayleigh_quotients # skip for now - file: chapter_decompositions/psd_matrices - file: chapter_decompositions/pca # PCA as example for the eigenvalue decomposition of a psd matrix title: Principal Components Analysis diff --git a/book/chapter_decompositions/Rayleigh_quotients.md b/book/chapter_decompositions/Rayleigh_quotients.md new file mode 100644 index 0000000..9dcec84 --- /dev/null +++ b/book/chapter_decompositions/Rayleigh_quotients.md @@ -0,0 +1,231 @@ +--- +jupytext: + text_representation: + extension: .md + format_name: myst + format_version: 0.13 + jupytext_version: 1.16.7 +kernelspec: + display_name: Python 3 + language: python + name: python3 +--- +# Rayleigh Quotients + +There turns out to be an interesting connection between the quadratic form of a symmetric matrix and its eigenvalues. +This connection is provided by the **Rayleigh quotient** + +> $$R_\mathbf{A}(\mathbf{x}) = \frac{\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x}}{\mathbf{x}^{\!\top\!}\mathbf{x}}$$ + +The Rayleigh quotient has a couple of important properties: + +:::{prf:lemma} Properties of the Rayleigh Quotient +:label: trm-Rayleigh-properties +:nonumber: + +(i) **Scale invariance**: for any vector $\mathbf{x} \neq \mathbf{0}$ + and any scalar $\alpha \neq 0$, + $R_\mathbf{A}(\mathbf{x}) = R_\mathbf{A}(\alpha\mathbf{x})$. + +(ii) If $\mathbf{x}$ is an eigenvector of $\mathbf{A}$ with eigenvalue + $\lambda$, then $R_\mathbf{A}(\mathbf{x}) = \lambda$. +::: + +:::{prf:proof} +(i) + + $$R_\mathbf{A}(\alpha\mathbf{x}) = \frac{(\alpha\mathbf{x})^{\!\top\!}\mathbf{A}(\alpha\mathbf{x})}{(\alpha\mathbf{x})^{\!\top\!}(\alpha\mathbf{x})} = \frac{\alpha^2}{\alpha^2}\frac{\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x}}{\mathbf{x}^{\!\top\!}\mathbf{x}}=R_\mathbf{A}(\mathbf{x}).$$ + +(ii) Let $\mathbf{x}$ be an eigenvector of $\mathbf{A}$ with eigenvalue + $\lambda$, then + + $$R_\mathbf{A}(\mathbf{x})= \frac{\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x}}{\mathbf{x}^{\!\top\!}\mathbf{x}} = \frac{\mathbf{x}^{\!\top\!}(\lambda\mathbf{x})}{\mathbf{x}^{\!\top\!}\mathbf{x}}=\lambda\frac{\mathbf{x}^{\!\top\!}\mathbf{x}}{\mathbf{x}^{\!\top\!}\mathbf{x}} = \lambda.$$ +::: + +We can further show that the Rayleigh quotient is bounded by the largest +and smallest eigenvalues of $\mathbf{A}$. + +But first we will show a useful special case of the final result. 
+ +:::{prf:theorem} Bound Rayleigh Quotient +:label: trm-bound-Rayleigh-quotient +:nonumber: + +For any $\mathbf{x}$ such that $\|\mathbf{x}\|_2 = 1$, + +$$\lambda_{\min}(\mathbf{A}) \leq \mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} \leq \lambda_{\max}(\mathbf{A})$$ + +with equality if and only if $\mathbf{x}$ is a corresponding eigenvector. +::: + +:::{prf:proof} + +We show only the $\max$ case because the argument for the +$\min$ case is entirely analogous. + +Since $\mathbf{A}$ is symmetric, we can decompose it as +$\mathbf{A} = \mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^{\!\top\!}$. + +Then use +the change of variable $\mathbf{y} = \mathbf{Q}^{\!\top\!}\mathbf{x}$, +noting that the relationship between $\mathbf{x}$ and $\mathbf{y}$ is +one-to-one and that $\|\mathbf{y}\|_2 = 1$ since $\mathbf{Q}$ is +orthogonal. + +Hence + +$$\max_{\|\mathbf{x}\|_2 = 1} \mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} = \max_{\|\mathbf{y}\|_2 = 1} \mathbf{y}^{\!\top\!}\mathbf{\Lambda}\mathbf{y} = \max_{y_1^2+\dots+y_n^2=1} \sum_{i=1}^n \lambda_i y_i^2$$ + +Written this way, it is clear that $\mathbf{y}$ maximizes this +expression exactly if and only if it satisfies +$\sum_{i \in I} y_i^2 = 1$ where +$I = \{i : \lambda_i = \max_{j=1,\dots,n} \lambda_j = \lambda_{\max}(\mathbf{A})\}$ +and $y_j = 0$ for $j \not\in I$. + +That is, $I$ contains the index or +indices of the largest eigenvalue. + +In this case, the maximal value of +the expression is + +$$\sum_{i=1}^n \lambda_i y_i^2 = \sum_{i \in I} \lambda_i y_i^2 = \lambda_{\max}(\mathbf{A}) \sum_{i \in I} y_i^2 = \lambda_{\max}(\mathbf{A})$$ + +Then writing $\mathbf{q}_1, \dots, \mathbf{q}_n$ for the columns of +$\mathbf{Q}$, we have + +$$\mathbf{x} = \mathbf{Q}\mathbf{Q}^{\!\top\!}\mathbf{x} = \mathbf{Q}\mathbf{y} = \sum_{i=1}^n y_i\mathbf{q}_i = \sum_{i \in I} y_i\mathbf{q}_i$$ + +where we have used the matrix-vector product identity. + +Recall that $\mathbf{q}_1, \dots, \mathbf{q}_n$ are eigenvectors of +$\mathbf{A}$ and form an orthonormal basis for $\mathbb{R}^n$. + +Therefore by construction, the set $\{\mathbf{q}_i : i \in I\}$ forms an +orthonormal basis for the eigenspace of $\lambda_{\max}(\mathbf{A})$. + +Hence $\mathbf{x}$, which is a linear combination of these, lies in that +eigenspace and thus is an eigenvector of $\mathbf{A}$ corresponding to +$\lambda_{\max}(\mathbf{A})$. + +We have shown that +$\max_{\|\mathbf{x}\|_2 = 1} \mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} = \lambda_{\max}(\mathbf{A})$, +from which we have the general inequality +$\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} \leq \lambda_{\max}(\mathbf{A})$ +for all unit-length $\mathbf{x}$. ◻ +::: + +By the scale invariance of the Rayleigh quotient, we immediately have as +a corollary + +:::{prf:theorem} Min-Max Theorem +:label: trm-min-max +:nonumber: + +For all $\mathbf{x} \neq \mathbf{0}$, + +$$\lambda_{\min}(\mathbf{A}) \leq R_\mathbf{A}(\mathbf{x}) \leq \lambda_{\max}(\mathbf{A})$$ + +with equality if and only if $\mathbf{x}$ is a corresponding +eigenvector. 
+::: + +:::{prf:proof} + +Let $\mathbf{x}\neq \boldsymbol{0},$ then + +$R_\mathbf{A}(\mathbf{x}) = \frac{\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x}}{\mathbf{x}^{\!\top\!}\mathbf{x}} = \frac{\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x}}{\|\mathbf{x}\|^2} = (\frac{\mathbf{x}}{\|\mathbf{x}\|})^{\!\top\!}\mathbf{A}(\frac{\mathbf{x}}{\|\mathbf{x}\|})$ + +Thus, minimimum and maximum of the Rayleigh quotient are identical to minimum and maximum of the squared form $\mathbf{y}\mathbf{A}\mathbf{y}$ for the unit-norm vector $\mathbf{y}=\mathbf{x}/\|\mathbf{x}\|$: + +$$\lambda_{\min}(\mathbf{A}) \leq R_\mathbf{A}(\mathbf{x}) \leq \lambda_{\max}(\mathbf{A})$$ + +◻ +::: + +```{code-cell} ipython3 +:tags: [hide-input] +import numpy as np +import matplotlib.pyplot as plt + +# Define symmetric matrix +A = np.array([[2, 1], + [1, 3]]) + +# Eigenvalues and eigenvectors +eigvals, eigvecs = np.linalg.eigh(A) +λ_min, λ_max = eigvals + +# Generate unit circle points +theta = np.linspace(0, 2*np.pi, 300) +circle = np.stack((np.cos(theta), np.sin(theta))) + +# Rayleigh quotient computation +R = np.einsum('ij,ji->i', circle.T @ A, circle) # x^T A x +R /= np.einsum('ij,ji->i', circle.T, circle) # x^T x + +# Rayleigh extrema +idx_min = np.argmin(R) +idx_max = np.argmax(R) +x_min = circle[:, idx_min] +x_max = circle[:, idx_max] + +# Prepare grid for quadratic form level sets +x = np.linspace(-2, 2, 400) +y = np.linspace(-2, 2, 400) +X, Y = np.meshgrid(x, y) +XY = np.stack((X, Y), axis=-1) +Z = np.einsum('...i,ij,...j->...', XY, A, XY) +levels = np.linspace(np.min(Z), np.max(Z), 20) + +# Create combined figure +fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6)) + +# Left: Rayleigh quotient on unit circle +sc = ax1.scatter(circle[0], circle[1], c=R, cmap='viridis', s=10) +ax1.quiver(0, 0, x_min[0], x_min[1], color='red', scale=1, scale_units='xy', angles='xy', label='argmin R(x)') +ax1.quiver(0, 0, x_max[0], x_max[1], color='orange', scale=1, scale_units='xy', angles='xy', label='argmax R(x)') +for i in range(2): + eigvec = eigvecs[:, i] + ax1.quiver(0, 0, eigvec[0], eigvec[1], color='black', alpha=0.5, scale=1, scale_units='xy', angles='xy', width=0.008) +ax1.set_title("Rayleigh Quotient on the Unit Circle") +ax1.set_aspect('equal') +ax1.set_xlim(-1.1, 1.1) +ax1.set_ylim(-1.1, 1.1) +ax1.grid(True) +ax1.legend() +plt.colorbar(sc, ax=ax1, label="Rayleigh Quotient $R_A(\\mathbf{x})$") + +# Right: Level sets of quadratic form +contour = ax2.contour(X, Y, Z, levels=levels, cmap='viridis') +ax2.clabel(contour, inline=True, fontsize=8, fmt="%.1f") +ax2.set_title("Level Sets of $\\mathbf{x}^\\top \\mathbf{A} \\mathbf{x}$") +ax2.set_xlabel("$x_1$") +ax2.set_ylabel("$x_2$") +ax2.axhline(0, color='gray', lw=0.5) +ax2.axvline(0, color='gray', lw=0.5) +for i in range(2): + vec = eigvecs[:, i] * np.sqrt(eigvals[i]) + ax2.quiver(0, 0, vec[0], vec[1], color='red', scale=1, scale_units='xy', angles='xy', width=0.01, label=f"$\\mathbf{{q}}_{i+1}$") +ax2.set_aspect('equal') +ax2.legend() + +plt.suptitle("Rayleigh Quotient and Quadratic Form Level Sets", fontsize=16) +plt.tight_layout(rect=[0, 0, 1, 0.93]) +plt.show() +``` + +This combined visualization brings together the **Rayleigh quotient** and the **level sets of the quadratic form** $\mathbf{x}^\top \mathbf{A} \mathbf{x}$: + +* **Left panel**: Rayleigh quotient $R_\mathbf{A}(\mathbf{x})$ on the unit circle + + * Color shows how the value varies with direction. + * Extremes occur at eigenvector directions (marked with arrows). 
+ +* **Right panel**: Level sets (contours) of the quadratic form + + * Elliptical shapes aligned with eigenvectors. + * Red vectors indicate principal axes (scaled eigenvectors). + +Together, these panels illustrate how the **direction of a vector determines how strongly it is scaled** by the symmetric matrix, and how this scaling relates to the matrix's **eigenstructure**. + +✅ As guaranteed by the **Min–Max Theorem**, the maximum and minimum of the Rayleigh quotient occur precisely at the **eigenvectors corresponding to the largest and smallest eigenvalues**. From 85f8c3460cf02cbd32be62bbe64e72b17568a35f Mon Sep 17 00:00:00 2001 From: clippert Date: Tue, 20 May 2025 11:34:24 -0400 Subject: [PATCH 26/43] added stubs of matrix_norms and pseudoinverse --- book/_toc.yml | 4 +-- book/chapter_decompositions/matrix_norms.md | 13 +++++++ book/chapter_decompositions/pseudoinverse.md | 37 ++++++++++++++++++++ 3 files changed, 52 insertions(+), 2 deletions(-) create mode 100644 book/chapter_decompositions/matrix_norms.md create mode 100644 book/chapter_decompositions/pseudoinverse.md diff --git a/book/_toc.yml b/book/_toc.yml index e040d36..a4e56cc 100644 --- a/book/_toc.yml +++ b/book/_toc.yml @@ -93,8 +93,8 @@ parts: - file: chapter_decompositions/svd # - file: chapter_decompositions/big_picture # - file: chapter_decompositions/RBF_kernel_Positive_Definite -# - file: chapter_decompositions/pseudoinverse -# - file: chapter_decompositions/matrix_norms + - file: chapter_decompositions/pseudoinverse + - file: chapter_decompositions/matrix_norms # - file: chapter_convexity/overview_convexity # title: Convexity # sections: diff --git a/book/chapter_decompositions/matrix_norms.md b/book/chapter_decompositions/matrix_norms.md new file mode 100644 index 0000000..3c28d81 --- /dev/null +++ b/book/chapter_decompositions/matrix_norms.md @@ -0,0 +1,13 @@ +--- +jupytext: + text_representation: + extension: .md + format_name: myst + format_version: 0.13 + jupytext_version: 1.16.7 +kernelspec: + display_name: Python 3 + language: python + name: python3 +--- +# Matrix Norms diff --git a/book/chapter_decompositions/pseudoinverse.md b/book/chapter_decompositions/pseudoinverse.md new file mode 100644 index 0000000..1b016db --- /dev/null +++ b/book/chapter_decompositions/pseudoinverse.md @@ -0,0 +1,37 @@ +--- +jupytext: + text_representation: + extension: .md + format_name: myst + format_version: 0.13 + jupytext_version: 1.16.7 +kernelspec: + display_name: Python 3 + language: python + name: python3 +--- +# Moore-Penrose Pseudoinverse +The Moore-Penrose pseudoinverse is a generalization of the matrix inverse that can be applied to non-square or singular matrices. It is denoted as $ A^+ $ for a matrix $ A $. The pseudoinverse satisfies the following properties: +1. **Existence**: The pseudoinverse exists for any matrix $ A $. +2. **Uniqueness**: The pseudoinverse is unique. +3. **Properties**: + - $ A A^+ A = A $ + - $ A^+ A A^+ = A^+ $ + - $ (A A^+)^\top = A A^+ $ + - $ (A^+ A)^\top = A^+ A $ +4. **Rank**: The rank of $ A^+ $ is equal to the rank of $ A $. +5. **Singular Value Decomposition (SVD)**: The pseudoinverse can be computed using the singular value decomposition of $ A $. If $ A = U \Sigma V^\top $, where $ U $ and $ V $ are orthogonal matrices and $ \Sigma $ is a diagonal matrix with singular values, then: + + $$ + A^+ = V \Sigma^+ U^\top + $$ + where $ \Sigma^+ $ is obtained by taking the reciprocal of the non-zero singular values in $ \Sigma $ and transposing the resulting matrix. +6. 
**Applications**: The pseudoinverse is used in various applications, including solving linear systems, least squares problems, and in machine learning algorithms such as linear regression. +7. **Least Squares Solution**: The pseudoinverse provides a least squares solution to the equation $ Ax = b $ when $ A $ is not square or has no unique solution. The least squares solution is given by: + + $$ + x = A^+ b + $$ +8. **Geometric Interpretation**: The pseudoinverse can be interpreted geometrically as the projection of a vector onto the column space of $ A $. +9. **Computational Considerations**: The computation of the pseudoinverse can be done efficiently using numerical methods, such as the SVD, especially for large matrices. +10. **Limitations**: The pseudoinverse may not be suitable for all applications, especially when the matrix is ill-conditioned or has a high condition number. From fd4934df98dd81ed758266ac293b3868606f6d72 Mon Sep 17 00:00:00 2001 From: clippert Date: Tue, 20 May 2025 11:50:05 -0400 Subject: [PATCH 27/43] stubs for convexity and mercers theorem --- book/_toc.yml | 10 +++++----- {drafts => book}/chapter_convexity/convexity.md | 0 .../chapter_convexity/overview_convexity.md | 0 .../RBF_kernel_Positive_Definite.md | 0 4 files changed, 5 insertions(+), 5 deletions(-) rename {drafts => book}/chapter_convexity/convexity.md (100%) rename {drafts => book}/chapter_convexity/overview_convexity.md (100%) rename {drafts => book}/chapter_decompositions/RBF_kernel_Positive_Definite.md (100%) diff --git a/book/_toc.yml b/book/_toc.yml index a4e56cc..09b603e 100644 --- a/book/_toc.yml +++ b/book/_toc.yml @@ -92,13 +92,13 @@ parts: title: Principal Components Analysis - file: chapter_decompositions/svd # - file: chapter_decompositions/big_picture -# - file: chapter_decompositions/RBF_kernel_Positive_Definite + - file: chapter_decompositions/RBF_kernel_Positive_Definite - file: chapter_decompositions/pseudoinverse - file: chapter_decompositions/matrix_norms -# - file: chapter_convexity/overview_convexity -# title: Convexity -# sections: -# - file: chapter_convexity/convexity + - file: chapter_convexity/overview_convexity + title: Convexity + sections: + - file: chapter_convexity/convexity # continue with second order optimization # title: Second-Order Optimization # - file: chapter_calculus/newtons_method diff --git a/drafts/chapter_convexity/convexity.md b/book/chapter_convexity/convexity.md similarity index 100% rename from drafts/chapter_convexity/convexity.md rename to book/chapter_convexity/convexity.md diff --git a/drafts/chapter_convexity/overview_convexity.md b/book/chapter_convexity/overview_convexity.md similarity index 100% rename from drafts/chapter_convexity/overview_convexity.md rename to book/chapter_convexity/overview_convexity.md diff --git a/drafts/chapter_decompositions/RBF_kernel_Positive_Definite.md b/book/chapter_decompositions/RBF_kernel_Positive_Definite.md similarity index 100% rename from drafts/chapter_decompositions/RBF_kernel_Positive_Definite.md rename to book/chapter_decompositions/RBF_kernel_Positive_Definite.md From e255adba4e8f47f9daf1bd303de9afa28aca12b3 Mon Sep 17 00:00:00 2001 From: clippert Date: Tue, 20 May 2025 11:52:21 -0400 Subject: [PATCH 28/43] stubs for representer theorem --- book/_toc.yml | 1 + 1 file changed, 1 insertion(+) diff --git a/book/_toc.yml b/book/_toc.yml index 09b603e..5e87562 100644 --- a/book/_toc.yml +++ b/book/_toc.yml @@ -93,6 +93,7 @@ parts: - file: chapter_decompositions/svd # - file: chapter_decompositions/big_picture - 
file: chapter_decompositions/RBF_kernel_Positive_Definite
+    - file: chapter_decompositions/representer_theorem
     - file: chapter_decompositions/pseudoinverse
     - file: chapter_decompositions/matrix_norms
     - file: chapter_convexity/overview_convexity

From 5d16a17a24bbf3ee8e5b7ddf691928aaf3588071 Mon Sep 17 00:00:00 2001
From: Arman Beykmohammadi
Date: Wed, 21 May 2025 22:17:03 +0200
Subject: [PATCH 29/43] week 3 solutions added to the book

---
 book/_toc.yml                               |   2 +
 book/appendix/Exercise Sheet 3 Solutions.md | 225 ++++++++++++++++++++
 2 files changed, 227 insertions(+)
 create mode 100644 book/appendix/Exercise Sheet 3 Solutions.md

diff --git a/book/_toc.yml b/book/_toc.yml
index 1a7b5e4..1867fe4 100644
--- a/book/_toc.yml
+++ b/book/_toc.yml
@@ -170,6 +170,8 @@ parts:
         title: Exercise Sheet 1 Solutions
       - file: appendix/Exercise Sheet 2 Solutions.md
         title: Exercise Sheet 2 Solutions
+      - file: appendix/Exercise Sheet 3 Solutions.md
+        title: Exercise Sheet 3 Solutions
 #      sections:
 #      - file: appendix/proof_vector_spaces
 #        title: Vector Spaces
diff --git a/book/appendix/Exercise Sheet 3 Solutions.md b/book/appendix/Exercise Sheet 3 Solutions.md
new file mode 100644
index 0000000..a7780a3
--- /dev/null
+++ b/book/appendix/Exercise Sheet 3 Solutions.md
@@ -0,0 +1,225 @@
# Exercise Sheet 3 Solutions

### 1.
#### (a)
Let

\[
f : \mathbb{R}^2 \to \mathbb{R}, \quad f(x, y) = 9x^2 - y^3 + 9xy
\]

We are asked to compute the **Hessian matrix** at the point \( (x, y) = (3, -3) \).


*Step 1: Compute second-order partial derivatives*

To compute the Hessian matrix, we first compute the first-order partial derivatives:

\[
\frac{\partial f}{\partial x} = 18x + 9y, \quad
\frac{\partial f}{\partial y} = -3y^2 + 9x
\]

Then we compute the second-order partial derivatives:

\[
\frac{\partial^2 f}{\partial x^2} = 18, \quad
\frac{\partial^2 f}{\partial y^2} = -6y, \quad
\frac{\partial^2 f}{\partial x \partial y} = \frac{\partial^2 f}{\partial y \partial x} = 9
\]

At the point \( (3, -3) \), we evaluate:

\[
\frac{\partial^2 f}{\partial x^2}(3, -3) = 18, \quad
\frac{\partial^2 f}{\partial y^2}(3, -3) = -6(-3) = 18, \quad
\frac{\partial^2 f}{\partial x \partial y}(3, -3) = 9
\]

*Step 2: Form the Hessian matrix*

```math
H_{(3, -3)} =
\begin{bmatrix}
\frac{\partial^2 f}{\partial x^2} & \frac{\partial^2 f}{\partial x \partial y} \\
\frac{\partial^2 f}{\partial y \partial x} & \frac{\partial^2 f}{\partial y^2}
\end{bmatrix}
=
\begin{bmatrix}
18 & 9 \\
9 & 18
\end{bmatrix}
```

#### (b)
We recall the following definitions and propositions:

- **Definition**: A symmetric matrix \( A \) is **positive definite** if for all non-zero vectors \( a \), we have \( a^T A a > 0 \).
- **Proposition**: A symmetric matrix is positive definite **if and only if** all its **eigenvalues** are positive.


To compute the eigenvalues, solve the characteristic equation:

\[
\det(H - \lambda I) = 0
\Rightarrow
\begin{vmatrix}
18 - \lambda & 9 \\
9 & 18 - \lambda
\end{vmatrix}
= (18 - \lambda)^2 - 81 = 0
\]

Simplifying:

\[
(18 - \lambda)^2 = 81 \Rightarrow 18 - \lambda = \pm 9
\Rightarrow \lambda = 9, \ 27
\]


Since both eigenvalues are **positive**, the Hessian matrix at the point \( (3, -3) \) is **positive definite**, so \( f \) is strictly convex in a neighbourhood of this point. Note that positive definiteness of the Hessian alone does not make \( (3, -3) \) a local minimum: the gradient \( \nabla f(3, -3) = (27, 0) \) does not vanish, so \( (3, -3) \) is not a critical point of \( f \).


### 2.

Let \( f(x) = x \cdot \ln(x) \) be defined on the interval \( [1, e^2] \). 
+ +#### (a) +To apply the Mean Value Theorem (MVT), we must verify that: + +- \( f \) is **continuous** on \( [1, e^2] \) +- \( f \) is **differentiable** on \( (1, e^2) \) + +Since \( f(x) = x \ln(x) \) is a product of continuous and differentiable functions for \( x > 0 \), both conditions are satisfied. + + +#### (b) + +We compute: + +\[ +f(e^2) = e^2 \cdot \ln(e^2) = e^2 \cdot 2 = 2e^2 +\] +\[ +f(1) = 1 \cdot \ln(1) = 0 +\] + +Hence, the average rate of change is: + +\[ +\frac{f(e^2) - f(1)}{e^2 - 1} = \frac{2e^2}{e^2 - 1} +\] + +Next, compute the derivative: + +\[ +f'(x) = \frac{d}{dx}[x \ln(x)] = \ln(x) + 1 +\] + +We solve: + +\[ +f'(c) = \ln(c) + 1 = \frac{2e^2}{e^2 - 1} +\Rightarrow \ln(c) = \frac{2e^2}{e^2 - 1} - 1 = \frac{e^2 + 1}{e^2 - 1} +\Rightarrow c = \exp\left( \frac{e^2 + 1}{e^2 - 1} \right) +\] + +#### (c) +The Mean Value Theorem states that there exists a point \( c \in (1, e^2) \) where the **instantaneous rate of change** \( f'(c) \) equals the **average rate of change** over the interval: + +\[ +f'(c) = \frac{f(e^2) - f(1)}{e^2 - 1} +\] + +Geometrically, this means the **tangent line** to the curve at \( x = c \) is **parallel** to the **secant line** connecting the endpoints \( (1, f(1)) \) and \( (e^2, f(e^2)) \). + + +### 3. +#### (a) + +We compute the derivatives at \( x = 0 \): + +- \( f(x) = \arctan(x) \) +- \( f'(x) = \frac{1}{1+x^2} \Rightarrow f'(0) = 1 \) +- \( f''(x) = \frac{-2x}{(1+x^2)^2} \Rightarrow f''(0) = 0 \) +- \( f^{(3)}(x) = \frac{6x^2 - 2}{(1 + x^2)^3} \Rightarrow f^{(3)}(0) = -2 \) + +Hence, the third-degree Taylor polynomial is: + +\[ +P_3(x) = f(0) + f'(0)x + \frac{f''(0)}{2!}x^2 + \frac{f^{(3)}(0)}{3!}x^3 += 0 + x + 0 - \frac{2}{6}x^3 = x - \frac{1}{3}x^3 +\] + +#### (b) +The Lagrange remainder is: + +\[ +R_3(x) = \frac{f^{(4)}(c)}{4!} x^4 = \frac{f^{(4)}(c)}{24} x^4 \quad \text{for some } c \in (0, x) +\] + +We previously found: + +\[ +f^{(4)}(c) = \frac{24c(1 - c^2)}{(1 + c^2)^4} +\] + +So, + +\[ +R_3(x) = \frac{c(1 - c^2)}{(1 + c^2)^4} x^4 \quad \text{for some } c \in (0, x) +\] + +#### (c) + +Our goal is to show: + +```math +|R_3(x)| = \left| \frac{c(1 - c^2)}{(1 + c^2)^4} x^4 \right| \leq \frac{x^4}{4(1 + c^2)^2} +\quad \text{for } c \in (0, 1) +``` + +We simplify the absolute value of the remainder: + +```math +|R_3(x)| = \frac{c(1 - c^2)}{(1 + c^2)^4} x^4 +``` + +So we now want to **prove** that: + +```math +\frac{c(1 - c^2)}{(1 + c^2)^4} \leq \frac{1}{4(1 + c^2)^2} +``` + +Now we multiply both sides by \( (1 + c^2)^4 \) (which is strictly positive): + +```math +4c(1 - c^2) \leq (1 + c^2)^2 +``` + +Now expand both sides: + +**Left-hand side:** +```math +4c(1 - c^2) = 4c - 4c^3 +``` + +**Right-hand side:** +```math +(1 + c^2)^2 = 1 + 2c^2 + c^4 +``` + +So we want to show: +```math +4c - 4c^3 \leq 1 + 2c^2 + c^4 +\quad \Leftrightarrow \quad +c^4 + 4c^3 + 2c^2 - 4c + 1 \geq 0 +``` + +Now factor the left-hand side: +```math +c^4 + 4c^3 + 2c^2 - 4c + 1 = (c^2 + 2c - 1)^2 \geq 0 +``` + +This inequality clearly holds for all c including \( c \in (0, 1) \), so the bound is valid. 
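
A quick numerical sanity check of these solutions (not part of the exercise sheet; the evaluation grid and the relaxed bound $x^4/4$ used for part 3(c) are choices made only for this check) can be run with NumPy:

```{code-cell} ipython3
import numpy as np

# Exercise 1: eigenvalues of the Hessian at (3, -3)
H = np.array([[18., 9.], [9., 18.]])
print(np.linalg.eigvalsh(H))                 # [ 9. 27.]

# Exercise 2: the MVT point c satisfies f'(c) = (f(e^2) - f(1)) / (e^2 - 1)
c = np.exp((np.e**2 + 1) / (np.e**2 - 1))
f_prime = np.log(c) + 1
secant_slope = 2 * np.e**2 / (np.e**2 - 1)
print(np.isclose(f_prime, secant_slope))     # True

# Exercise 3: |arctan(x) - (x - x^3/3)| stays below the relaxed bound x^4/4 on (0, 1]
xs = np.linspace(1e-3, 1.0, 1000)
remainder = np.abs(np.arctan(xs) - (xs - xs**3 / 3))
print(np.all(remainder <= xs**4 / 4))        # True
```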
From cc9cfcbb7d27bbd33e9700724ae683f1c879d161 Mon Sep 17 00:00:00 2001 From: clippert Date: Thu, 22 May 2025 16:46:11 -0400 Subject: [PATCH 30/43] separated convex sets and functions into files --- book/_toc.yml | 3 +- .../{convexity.md => convex_functions.md} | 42 +------------------ book/chapter_convexity/convex_sets | 31 ++++++++++++++ book/chapter_convexity/overview_convexity.md | 9 +++- 4 files changed, 42 insertions(+), 43 deletions(-) rename book/chapter_convexity/{convexity.md => convex_functions.md} (89%) create mode 100644 book/chapter_convexity/convex_sets diff --git a/book/_toc.yml b/book/_toc.yml index 5e87562..21097ea 100644 --- a/book/_toc.yml +++ b/book/_toc.yml @@ -99,7 +99,8 @@ parts: - file: chapter_convexity/overview_convexity title: Convexity sections: - - file: chapter_convexity/convexity + - file: chapter_convexity/convex_sets + - file: chapter_convexity/convex_functions # continue with second order optimization # title: Second-Order Optimization # - file: chapter_calculus/newtons_method diff --git a/book/chapter_convexity/convexity.md b/book/chapter_convexity/convex_functions.md similarity index 89% rename from book/chapter_convexity/convexity.md rename to book/chapter_convexity/convex_functions.md index 5f78c83..2ca0b2e 100644 --- a/book/chapter_convexity/convexity.md +++ b/book/chapter_convexity/convex_functions.md @@ -1,44 +1,4 @@ -## Convexity - -**Convexity** is a term that pertains to both sets and functions. For -functions, there are different degrees of convexity, and how convex a -function is tells us a lot about its minima: do they exist, are they -unique, how quickly can we find them using optimization algorithms, etc. -In this section, we present basic results regarding convexity, strict -convexity, and strong convexity. - -### Convex sets - -::: center -![image](../figures/convex-set.png) -A convex set -::: - -::: center -![image](../figures/nonconvex-set.png) -A non-convex set -::: - -A set $\mathcal{X} \subseteq \mathbb{R}^d$ is **convex** if - -$$t\mathbf{x} + (1-t)\mathbf{y} \in \mathcal{X}$$ - -for all -$\mathbf{x}, \mathbf{y} \in \mathcal{X}$ and all $t \in [0,1]$. - -Geometrically, this means that all the points on the line segment -between any two points in $\mathcal{X}$ are also in $\mathcal{X}$. See -Figure [1](#fig:convexset){reference-type="ref" -reference="fig:convexset"} for a visual. - -Why do we care whether or not a set is convex? We will see later that -the nature of minima can depend greatly on whether or not the feasible -set is convex. Undesirable pathological results can occur when we allow -the feasible set to be arbitrary, so for proofs we will need to assume -that it is convex. Fortunately, we often want to minimize over all of -$\mathbb{R}^d$, which is easily seen to be a convex set. - -### Basics of convex functions +# Basics of convex functions In the remainder of this section, assume $f : \mathbb{R}^d \to \mathbb{R}$ unless otherwise noted. We'll start diff --git a/book/chapter_convexity/convex_sets b/book/chapter_convexity/convex_sets new file mode 100644 index 0000000..4b3fa4c --- /dev/null +++ b/book/chapter_convexity/convex_sets @@ -0,0 +1,31 @@ +# Convex sets + +::: center +![image](../figures/convex-set.png) +A convex set +::: + +::: center +![image](../figures/nonconvex-set.png) +A non-convex set +::: + +A set $\mathcal{X} \subseteq \mathbb{R}^d$ is **convex** if + +$$t\mathbf{x} + (1-t)\mathbf{y} \in \mathcal{X}$$ + +for all +$\mathbf{x}, \mathbf{y} \in \mathcal{X}$ and all $t \in [0,1]$. 
+ +Geometrically, this means that all the points on the line segment +between any two points in $\mathcal{X}$ are also in $\mathcal{X}$. See +Figure [1](#fig:convexset){reference-type="ref" +reference="fig:convexset"} for a visual. + +Why do we care whether or not a set is convex? We will see later that +the nature of minima can depend greatly on whether or not the feasible +set is convex. Undesirable pathological results can occur when we allow +the feasible set to be arbitrary, so for proofs we will need to assume +that it is convex. Fortunately, we often want to minimize over all of +$\mathbb{R}^d$, which is easily seen to be a convex set. + diff --git a/book/chapter_convexity/overview_convexity.md b/book/chapter_convexity/overview_convexity.md index 42f5a7c..547fb0c 100644 --- a/book/chapter_convexity/overview_convexity.md +++ b/book/chapter_convexity/overview_convexity.md @@ -1,5 +1,12 @@ # Convexity -Convexity is a property of a function that describes the shape of its graph. + +**Convexity** is a term that pertains to both sets and functions. For +functions, there are different degrees of convexity, and how convex a +function is tells us a lot about its minima: do they exist, are they +unique, how quickly can we find them using optimization algorithms, etc. + +In this section, we present basic results regarding convexity, strict +convexity, and strong convexity. ```{tableofcontents} ``` \ No newline at end of file From 31a921aef0be5db90bd2e77a631ac50016b55d03 Mon Sep 17 00:00:00 2001 From: clippert Date: Thu, 22 May 2025 16:47:06 -0400 Subject: [PATCH 31/43] separated convex sets and functions into files --- book/chapter_convexity/{convex_sets => convex_sets.md} | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename book/chapter_convexity/{convex_sets => convex_sets.md} (100%) diff --git a/book/chapter_convexity/convex_sets b/book/chapter_convexity/convex_sets.md similarity index 100% rename from book/chapter_convexity/convex_sets rename to book/chapter_convexity/convex_sets.md From ae56088416f20ee6942f12a299a10a0d191f2f30 Mon Sep 17 00:00:00 2001 From: clippert Date: Fri, 23 May 2025 10:15:47 +0200 Subject: [PATCH 32/43] fixed propostion label --- book/chapter_convexity/convex_functions.md | 154 +++++++++++++++------ book/chapter_convexity/convex_sets.md | 15 +- book/chapter_decompositions/pca.md | 13 +- 3 files changed, 127 insertions(+), 55 deletions(-) diff --git a/book/chapter_convexity/convex_functions.md b/book/chapter_convexity/convex_functions.md index 2ca0b2e..f689893 100644 --- a/book/chapter_convexity/convex_functions.md +++ b/book/chapter_convexity/convex_functions.md @@ -19,8 +19,7 @@ A function $f$ is **strongly convex with parameter $m$** (or $$\mathbf{x} \mapsto f(\mathbf{x}) - \frac{m}{2}\|\mathbf{x}\|_2^2$$ -is -convex. +is convex. These conditions are given in increasing order of strength; strong convexity implies strict convexity which implies convexity. @@ -39,7 +38,7 @@ Strict convexity means that the graph of $f$ lies strictly above the line segment, except at the segment endpoints. (So actually the function in the figure appears to be strictly convex.) -### Consequences of convexity +## Consequences of convexity Why do we care if a function is (strictly/strongly) convex? @@ -47,16 +46,23 @@ Basically, our various notions of convexity have implications about the nature of minima. It should not be surprising that the stronger conditions tell us more about the minima. -*Proposition.* -Let $\mathcal{X}$ be a convex set. 
If $f$ is convex, then any local -minimum of $f$ in $\mathcal{X}$ is also a global minimum. +:::{prf:proposition} Minima of convex functions +:label: prop-convex-minima +:nonumber: +Let $\mathcal{X}$ be a convex set. + +If $f$ is convex, then any local minimum of $f$ in $\mathcal{X}$ is also a global minimum. +::: + +:::{prf:proof} +Suppose $f$ is convex, and let $\mathbf{x}^*$ be a local +minimum of $f$ in $\mathcal{X}$. +Then for some neighborhood $N \subseteq \mathcal{X}$ about $\mathbf{x}^*$, we have +$f(\mathbf{x}) \geq f(\mathbf{x}^*)$ for all $\mathbf{x} \in N$. -*Proof.* Suppose $f$ is convex, and let $\mathbf{x}^*$ be a local -minimum of $f$ in $\mathcal{X}$. Then for some neighborhood -$N \subseteq \mathcal{X}$ about $\mathbf{x}^*$, we have -$f(\mathbf{x}) \geq f(\mathbf{x}^*)$ for all $\mathbf{x} \in N$. Suppose +Suppose towards a contradiction that there exists $\tilde{\mathbf{x}} \in \mathcal{X}$ such that $f(\tilde{\mathbf{x}}) < f(\mathbf{x}^*)$. @@ -78,16 +84,22 @@ above inequality, a contradiction. It follows that $f(\mathbf{x}^*) \leq f(\mathbf{x})$ for all $\mathbf{x} \in \mathcal{X}$, so $\mathbf{x}^*$ is a global minimum of $f$ in $\mathcal{X}$. ◻ +::: + +:::{prf:proposition} Minima stricly convex functions +:label: prop-minima-striclty-convex +:nonumber: +Let $\mathcal{X}$ be a convex set. -*Proposition.* -Let $\mathcal{X}$ be a convex set. If $f$ is strictly convex, then there +If $f$ is strictly convex, then there exists at most one local minimum of $f$ in $\mathcal{X}$. Consequently, if it exists it is the unique global minimum of $f$ in $\mathcal{X}$. +::: +:::{prf:proof} - -*Proof.* The second sentence follows from the first, so all we must show +The second sentence follows from the first, so all we must show is that if a local minimum exists in $\mathcal{X}$ then it is unique. Suppose $\mathbf{x}^*$ is a local minimum of $f$ in $\mathcal{X}$, and @@ -105,12 +117,14 @@ of $f$, $$f(\mathbf{x}(t)) < tf(\mathbf{x}^*) + (1-t)f(\tilde{\mathbf{x}}) = tf(\mathbf{x}^*) + (1-t)f(\mathbf{x}^*) = f(\mathbf{x}^*)$$ -for all $t \in (0,1)$. But this contradicts the fact that $\mathbf{x}^*$ +for all $t \in (0,1)$. + +But this contradicts the fact that $\mathbf{x}^*$ is a global minimum. Therefore if $\tilde{\mathbf{x}}$ is a local minimum of $f$ in $\mathcal{X}$, then $\tilde{\mathbf{x}} = \mathbf{x}^*$, so $\mathbf{x}^*$ is the unique minimum in $\mathcal{X}$. ◻ - +::: It is worthwhile to examine how the feasible set affects the optimization problem. We will see why the assumption that $\mathcal{X}$ @@ -118,6 +132,7 @@ is convex is needed in the results above. Consider the function $f(x) = x^2$, which is a strictly convex function. The unique global minimum of this function in $\mathbb{R}$ is $x = 0$. + But let's see what happens when we change the feasible set $\mathcal{X}$. @@ -138,7 +153,7 @@ $\mathcal{X}$. non-convex, and we can see that there are two global minima ($x = \pm 1$). -### Showing that a function is convex +## Showing that a function is convex Hopefully the previous section has convinced the reader that convexity is an important property. Next we turn to the issue of showing that a @@ -146,12 +161,17 @@ function is (strictly/strongly) convex. It is of course possible (in principle) to directly show that the condition in the definition holds, but this is usually not the easiest way. -*Proposition.* +:::{prf:proposition} Norms +:label: prop-norms-convex +:nonumber: + Norms are convex. +::: +:::{prf:proof} -*Proof.* Let $\|\cdot\|$ be a norm on a vector space $V$. 
Then for all +Let $\|\cdot\|$ be a norm on a vector space $V$. Then for all $\mathbf{x}, \mathbf{y} \in V$ and $t \in [0,1]$, $$\|t\mathbf{x} + (1-t)\mathbf{y}\| \leq \|t\mathbf{x}\| + \|(1-t)\mathbf{y}\| = |t|\|\mathbf{x}\| + |1-t|\|\mathbf{y}\| = t\|\mathbf{x}\| + (1-t)\|\mathbf{y}\|$$ @@ -159,21 +179,31 @@ $$\|t\mathbf{x} + (1-t)\mathbf{y}\| \leq \|t\mathbf{x}\| + \|(1-t)\mathbf{y}\| = where we have used respectively the triangle inequality, the homogeneity of norms, and the fact that $t$ and $1-t$ are nonnegative. Hence $\|\cdot\|$ is convex. ◻ +::: + +:::{prf:proposition} Gradient of Convex Functions +:label: prop-convex-functions-graph +:nonumber: +Suppose $f$ is differentiable. -*Proposition.* -Suppose $f$ is differentiable. Then $f$ is convex if and only if +Then $f$ is convex if and only if $$f(\mathbf{y}) \geq f(\mathbf{x}) + \langle \nabla f(\mathbf{x}), \mathbf{y} - \mathbf{x} \rangle$$ for all $\mathbf{x}, \mathbf{y} \in \operatorname{dom} f$. +::: +:::{prf:proof} -*Proof.* To-do. ◻ +To-do. ◻ +::: +:::{prf:proposition} Hessian of Convex Functions +:label: prop-Hessian-convex +:nonumber: -*Proposition.* Suppose $f$ is twice differentiable. Then (i) $f$ is convex if and only if $\nabla^2 f(\mathbf{x}) \succeq 0$ for @@ -185,18 +215,24 @@ Suppose $f$ is twice differentiable. Then (iii) $f$ is $m$-strongly convex if and only if $\nabla^2 f(\mathbf{x}) \succeq mI$ for all $\mathbf{x} \in \operatorname{dom} f$. +::: +:::{prf:proof} +Omitted. ◻ +::: -*Proof.* Omitted. ◻ - +:::{prf:proposition} Scaling Convex Functions +:label: prop-scaling-convex-functions +:nonumber: -*Proposition.* If $f$ is convex and $\alpha \geq 0$, then $\alpha f$ is convex. +::: +:::{prf:proof} -*Proof.* Suppose $f$ is convex and $\alpha \geq 0$. Then for all +Suppose $f$ is convex and $\alpha \geq 0$. Then for all $\mathbf{x}, \mathbf{y} \in \operatorname{dom}(\alpha f) = \operatorname{dom} f$, $$\begin{aligned} @@ -207,15 +243,24 @@ $$\begin{aligned} \end{aligned}$$ so $\alpha f$ is convex. ◻ +::: -*Proposition.* -If $f$ and $g$ are convex, then $f+g$ is convex. Furthermore, if $g$ is +:::{prf:proposition} Sum of Convex Functions +:label: prop-sum-convex-functions +:nonumber: + +If $f$ and $g$ are convex, then $f+g$ is convex. + +Furthermore, if $g$ is strictly convex, then $f+g$ is strictly convex, and if $g$ is $m$-strongly convex, then $f+g$ is $m$-strongly convex. +::: +:::{prf:proof} -*Proof.* Suppose $f$ and $g$ are convex. Then for all +Suppose $f$ and $g$ are convex. +Then for all $\mathbf{x}, \mathbf{y} \in \operatorname{dom} (f+g) = \operatorname{dom} f \cap \operatorname{dom} g$, $$\begin{aligned} @@ -234,34 +279,44 @@ convex. If $g$ is $m$-strongly convex, then the function $h(\mathbf{x}) \equiv g(\mathbf{x}) - \frac{m}{2}\|\mathbf{x}\|_2^2$ is -convex, so $f+h$ is convex. But +convex, so $f+h$ is convex. + +But $$(f+h)(\mathbf{x}) \equiv f(\mathbf{x}) + h(\mathbf{x}) \equiv f(\mathbf{x}) + g(\mathbf{x}) - \frac{m}{2}\|\mathbf{x}\|_2^2 \equiv (f+g)(\mathbf{x}) - \frac{m}{2}\|\mathbf{x}\|_2^2$$ so $f+g$ is $m$-strongly convex. ◻ +::: +:::{prf:proposition} Weighted Sum of Convex Functions +:label: prop-convex-functions-weighted-sum +:nonumber: -*Proposition.* If $f_1, \dots, f_n$ are convex and $\alpha_1, \dots, \alpha_n \geq 0$, then $$\sum_{i=1}^n \alpha_i f_i$$ is convex. +::: +:::{prf:proof} +Follows from the previous two propositions by induction. 
◻ +::: +:::{prf:proposition} Combination of Affine and Convex Functions +:label: prop-linear-convex +:nonumber: -*Proof.* Follows from the previous two propositions by induction. ◻ - - -*Proposition.* If $f$ is convex, then $g(\mathbf{x}) \equiv f(\mathbf{A}\mathbf{x} + \mathbf{b})$ is convex for any appropriately-sized $\mathbf{A}$ and $\mathbf{b}$. +::: +:::{prf:proof} -*Proof.* Suppose $f$ is convex and $g$ is defined like so. Then for all +Suppose $f$ is convex and $g$ is defined like so. Then for all $\mathbf{x}, \mathbf{y} \in \operatorname{dom} g$, $$\begin{aligned} @@ -274,15 +329,20 @@ g(t\mathbf{x} + (1-t)\mathbf{y}) &= f(\mathbf{A}(t\mathbf{x} + (1-t)\mathbf{y}) \end{aligned}$$ Thus $g$ is convex. ◻ +::: +:::{prf:proposition} Maximum of Convex Functions +:label: prop-max-convex-functions +:nonumber: -*Proposition.* If $f$ and $g$ are convex, then $h(\mathbf{x}) \equiv \max\{f(\mathbf{x}), g(\mathbf{x})\}$ is convex. +::: +:::{prf:proof} -*Proof.* Suppose $f$ and $g$ are convex and $h$ is defined like so. Then +Suppose $f$ and $g$ are convex and $h$ is defined like so. Then for all $\mathbf{x}, \mathbf{y} \in \operatorname{dom} h$, $$\begin{aligned} @@ -299,7 +359,7 @@ $\max\{a,b\} \leq \max\{c,d\}$. In the second inequality we have used the fact that $\max\{a+b, c+d\} \leq \max\{a,c\} + \max\{b,d\}$. Thus $h$ is convex. ◻ - +::: ### Examples @@ -312,22 +372,26 @@ Functions that are convex but not strictly convex: (i) $f(\mathbf{x}) = \mathbf{w}^{\!\top\!}\mathbf{x} + \alpha$ for any $\mathbf{w} \in \mathbb{R}^d, \alpha \in \mathbb{R}$. Such a function is called an **affine function**, and it is both convex and - concave. (In fact, a function is affine if and only if it is both - convex and concave.) Note that linear functions and constant + concave. + (In fact, a function is affine if and only if it is both convex and concave.) + Note that linear functions and constant functions are special cases of affine functions. (ii) $f(\mathbf{x}) = \|\mathbf{x}\|_1$ Functions that are strictly but not strongly convex: -(i) $f(x) = x^4$. This example is interesting because it is strictly +(i) $f(x) = x^4$. +This example is interesting because it is strictly convex but you cannot show this fact via a second-order argument (since $f''(0) = 0$). -(ii) $f(x) = \exp(x)$. This example is interesting because it's bounded +(ii) $f(x) = \exp(x)$. +This example is interesting because it's bounded below but has no local minimum. -(iii) $f(x) = -\log x$. This example is interesting because it's +(iii) $f(x) = -\log x$. +This example is interesting because it's strictly convex but not bounded below. Functions that are strongly convex: diff --git a/book/chapter_convexity/convex_sets.md b/book/chapter_convexity/convex_sets.md index 4b3fa4c..8601fda 100644 --- a/book/chapter_convexity/convex_sets.md +++ b/book/chapter_convexity/convex_sets.md @@ -18,14 +18,15 @@ for all $\mathbf{x}, \mathbf{y} \in \mathcal{X}$ and all $t \in [0,1]$. Geometrically, this means that all the points on the line segment -between any two points in $\mathcal{X}$ are also in $\mathcal{X}$. See -Figure [1](#fig:convexset){reference-type="ref" +between any two points in $\mathcal{X}$ are also in $\mathcal{X}$. + +See Figure [1](#fig:convexset){reference-type="ref" reference="fig:convexset"} for a visual. -Why do we care whether or not a set is convex? We will see later that -the nature of minima can depend greatly on whether or not the feasible -set is convex. 
Undesirable pathological results can occur when we allow -the feasible set to be arbitrary, so for proofs we will need to assume -that it is convex. Fortunately, we often want to minimize over all of +Why do we care whether or not a set is convex? We will see later that the nature of minima can depend greatly on whether or not the feasible set is convex. +Undesirable pathological results can occur when we allow +the feasible set to be arbitrary, so for proofs we will need to assume that it is convex. + +Fortunately, we often want to minimize over all of $\mathbb{R}^d$, which is easily seen to be a convex set. diff --git a/book/chapter_decompositions/pca.md b/book/chapter_decompositions/pca.md index 4190b33..cc06f65 100644 --- a/book/chapter_decompositions/pca.md +++ b/book/chapter_decompositions/pca.md @@ -205,21 +205,28 @@ This is optimal because trace is maximized by choosing eigenvectors with **large ## PCA algorithm step by step 1. Calculate the mean of the data + $$ \mathbf{\bar{x}} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{x}_i $$ 2. Calculate the covariance matrix $\mathbf{S}$ of the data: + $$ \mathbf{S} = \frac{1}{N-1} \sum_{i=1}^{N} (\mathbf{x}_i - \mathbf{\bar{x}})(\mathbf{x}_i - \mathbf{\bar{x}})^T $$ Both the mean and the covariance matrix are calculated by `empirical_covariance` function. 3. Calculate the eigenvalues $\lambda_i$ and eigenvectors $\mathbf{u}_i$ of the covariance matrix $\mathbf{S}$ -4. Sort the eigenvalues in descending order and then sort the eigenvectors accordingly. Create a principal components matrix $\mathbf{U}$ by taking the first $k$ eigenvectors, where $k$ is the number of dimensions we want to keep. - This step is implemented in the `fit` method of the `PCA` class. - 5. To project the data onto the new space, we can use the following formula: + +4. Sort the eigenvalues in descending order and then sort the eigenvectors accordingly. +Create a principal components matrix $\mathbf{U}$ by taking the first $k$ eigenvectors, where $k$ is the number of dimensions we want to keep. +This step is implemented in the `fit` method of the `PCA` class. + +5. To project the data onto the new space, we can use the following formula: + $$ \mathbf{Y} = \mathbf{X} \cdot \mathbf{U} $$ This step is implemented in the `transform` method of the `PCA` class. 6. To reconstruct the data, we can use the following formula: + $$ \mathbf{\tilde{X}} = \mathbf{Y} \cdot \mathbf{U}^T + \mathbf{\bar{x}} $$ This step is implemented in the `inverse_transform` method of the `PCA` class. From dad61a0f3f095f799f19a12636ee758ce68bd28f Mon Sep 17 00:00:00 2001 From: clippert Date: Fri, 23 May 2025 10:33:14 +0200 Subject: [PATCH 33/43] fixed propostion label --- book/chapter_convexity/convex_functions.md | 100 +++++++++++++++++++-- 1 file changed, 91 insertions(+), 9 deletions(-) diff --git a/book/chapter_convexity/convex_functions.md b/book/chapter_convexity/convex_functions.md index f689893..585f30e 100644 --- a/book/chapter_convexity/convex_functions.md +++ b/book/chapter_convexity/convex_functions.md @@ -1,3 +1,15 @@ +--- +jupytext: + text_representation: + extension: .md + format_name: myst + format_version: 0.13 + jupytext_version: 1.16.7 +kernelspec: + display_name: math4ml + language: python + name: python3 +--- # Basics of convex functions In the remainder of this section, assume @@ -24,19 +36,89 @@ is convex. These conditions are given in increasing order of strength; strong convexity implies strict convexity which implies convexity. 
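These definitions can also be probed numerically on a simple example. The following cell is a minimal sanity check added for illustration: it samples random pairs of points and verifies the defining inequality for $f(x) = x^2$, both directly and after subtracting $\frac{m}{2}x^2$ with $m = 2$.

```{code-cell} ipython3
# A small numerical sanity check of the definitions above (added for illustration):
# we sample random points and verify the convexity inequality for f(x) = x^2,
# and check that f(x) - (m/2) x^2 with m = 2 is still convex (2-strong convexity).
import numpy as np

rng = np.random.default_rng(0)

def satisfies_convexity(f, n_trials=10_000, tol=1e-12):
    x, y = rng.uniform(-5, 5, size=(2, n_trials))
    t = rng.uniform(0, 1, size=n_trials)
    lhs = f(t * x + (1 - t) * y)
    rhs = t * f(x) + (1 - t) * f(y)
    return bool(np.all(lhs <= rhs + tol))

f = lambda x: x**2
m = 2.0
print(satisfies_convexity(f))                              # f is convex
print(satisfies_convexity(lambda x: f(x) - m / 2 * x**2))  # f is m-strongly convex
```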
- ::: center -![What convex functions look like](../figures/convex-function.png) -What convex functions look like -::: + + +## Geometric interpretation +The following figure illustrates the three types of convexity: Geometrically, convexity means that the line segment between two points -on the graph of $f$ lies on or above the graph itself. See Figure -[2](#fig:convexfunction){reference-type="ref" -reference="fig:convexfunction"} for a visual. +on the graph of $f$ lies on or above the graph itself. + +```{code-cell} ipython3 +:tags: [hide-input] +import numpy as np +import matplotlib.pyplot as plt + +# Define a convex function +f = lambda x: x**2 + +# Define x values and compute y +x = np.linspace(-2, 2, 400) +y = f(x) + +# Choose two points on the graph +x1, x2 = -1.5, 1.0 +y1, y2 = f(x1), f(x2) + +# Compute the line segment between the two points +t = np.linspace(0, 1, 100) +xt = t * x1 + (1 - t) * x2 +yt_line = t * y1 + (1 - t) * y2 + +# Plot the function and the line segment +plt.figure(figsize=(8, 6)) +plt.plot(x, y, label=r'$f(x) = x^2$', color='blue') +plt.plot(xt, yt_line, 'r--', label='Line segment') +plt.plot([x1, x2], [y1, y2], 'ro') # endpoints +plt.title("Geometric Interpretation of Convexity") +plt.xlabel("x") +plt.ylabel("f(x)") +plt.legend() +plt.grid(True) +plt.tight_layout() +plt.show() +``` Strict convexity means that the graph of $f$ lies strictly above the -line segment, except at the segment endpoints. (So actually the function -in the figure appears to be strictly convex.) +line segment, except at the segment endpoints. +(So actually the function in the figure appears to be strictly convex.) + + +```{code-cell} ipython3 +:tags: [hide-input] +import numpy as np +import matplotlib.pyplot as plt + +# Define x values +x = np.linspace(-2, 2, 400) + +# Define three functions: convex, strictly convex, and strongly convex +f1 = lambda x: np.abs(x) # convex but not strictly convex +f2 = lambda x: x**4 # strictly convex but not strongly convex +f3 = lambda x: x**2 + 1 # strongly convex + +# Evaluate functions +y1 = f1(x) +y2 = f2(x) +y3 = f3(x) + +# Plot the functions +plt.figure(figsize=(10, 6)) +plt.plot(x, y1, label=r'$f(x) = |x|$ (Convex)', linestyle='--') +plt.plot(x, y2, label=r'$f(x) = x^4$ (Strictly Convex)', linestyle='-.') +plt.plot(x, y3, label=r'$f(x) = x^2 + 1$ (Strongly Convex)', linestyle='-') +plt.title("Examples of Convex, Strictly Convex, and Strongly Convex Functions") +plt.xlabel("x") +plt.ylabel("f(x)") +plt.legend() +plt.grid(True) +plt.tight_layout() +plt.show() +``` +* A **convex but not strictly convex** function $f(x) = |x|$ +* A **strictly convex but not strongly convex** function $f(x) = x^4$ +* A **strongly convex** function $f(x) = x^2 + 1$ + ## Consequences of convexity From d69dfe2e2146faf6f5e783e8a10fcaabeda34300 Mon Sep 17 00:00:00 2001 From: clippert Date: Fri, 23 May 2025 10:36:39 +0200 Subject: [PATCH 34/43] created figures for convexity and removed all errors and warnings --- book/chapter_convexity/convex_sets.md | 93 ++++++++++++++++++++++++--- 1 file changed, 83 insertions(+), 10 deletions(-) diff --git a/book/chapter_convexity/convex_sets.md b/book/chapter_convexity/convex_sets.md index 8601fda..74fe709 100644 --- a/book/chapter_convexity/convex_sets.md +++ b/book/chapter_convexity/convex_sets.md @@ -1,14 +1,89 @@ +--- +jupytext: + text_representation: + extension: .md + format_name: myst + format_version: 0.13 + jupytext_version: 1.16.7 +kernelspec: + display_name: math4ml + language: python + name: python3 +--- # Convex sets -::: 
center -![image](../figures/convex-set.png) -A convex set -::: +```{code-cell} ipython3 +:tags: [hide-input] +import matplotlib.pyplot as plt +import numpy as np -::: center -![image](../figures/nonconvex-set.png) -A non-convex set -::: +# Generate a convex polygon (e.g., a convex hull of some points) +points = np.array([ + [1, 1], [2, 3], [4, 4], [6, 3], [5, 1], [3, 0] +]) + +# Choose two points inside the polygon +A = np.array([2.5, 2]) +B = np.array([4.5, 2.5]) + +# Line segment between A and B +t = np.linspace(0, 1, 100) +segment = np.outer(1 - t, A) + np.outer(t, B) + +# Plot the convex set and the line segment +plt.figure(figsize=(8, 6)) +plt.fill(points[:, 0], points[:, 1], alpha=0.3, label="Convex Set", edgecolor='blue') +plt.plot(segment[:, 0], segment[:, 1], 'r--', label="Line segment AB") +plt.plot(*A, 'ro', label="Point A") +plt.plot(*B, 'go', label="Point B") + +plt.title("A Convex Set") +plt.xlabel("x") +plt.ylabel("y") +plt.axis("equal") +plt.grid(True) +plt.legend() +plt.tight_layout() +plt.show() + +``` + + +```{code-cell} ipython3 +:tags: [hide-input] +import matplotlib.pyplot as plt +import numpy as np + +# Define a non-convex polygon (e.g., a simple star shape or concave polygon) +points = np.array([ + [1, 1], [2, 3], [3, 1.5], [4, 3], [5, 1], [3, 0] +]) + +# Choose two points inside the set where the connecting line goes outside +A = np.array([2, 2]) +B = np.array([4, 2]) + +# Line segment between A and B +t = np.linspace(0, 1, 100) +segment = np.outer(1 - t, A) + np.outer(t, B) + +# Plot the non-convex set and the line segment +plt.figure(figsize=(8, 6)) +plt.fill(points[:, 0], points[:, 1], alpha=0.3, label="Non-Convex Set", edgecolor='blue') +plt.plot(segment[:, 0], segment[:, 1], 'r--', label="Line segment AB") +plt.plot(*A, 'ro', label="Point A") +plt.plot(*B, 'go', label="Point B") + +plt.title("A Non-Convex Set") +plt.xlabel("x") +plt.ylabel("y") +plt.axis("equal") +plt.grid(True) +plt.legend() +plt.tight_layout() +plt.show() + +``` A set $\mathcal{X} \subseteq \mathbb{R}^d$ is **convex** if @@ -20,8 +95,6 @@ $\mathbf{x}, \mathbf{y} \in \mathcal{X}$ and all $t \in [0,1]$. Geometrically, this means that all the points on the line segment between any two points in $\mathcal{X}$ are also in $\mathcal{X}$. -See Figure [1](#fig:convexset){reference-type="ref" -reference="fig:convexset"} for a visual. Why do we care whether or not a set is convex? We will see later that the nature of minima can depend greatly on whether or not the feasible set is convex. Undesirable pathological results can occur when we allow From 902b8594c1a7a1baa4f157e50477e90b7bd0d356 Mon Sep 17 00:00:00 2001 From: clippert Date: Sun, 25 May 2025 12:08:13 +0200 Subject: [PATCH 35/43] changed format of proof in svd --- book/chapter_convexity/convex_sets.md | 3 +++ book/chapter_decompositions/svd.md | 16 ++++++++++++---- 2 files changed, 15 insertions(+), 4 deletions(-) diff --git a/book/chapter_convexity/convex_sets.md b/book/chapter_convexity/convex_sets.md index 74fe709..69cecba 100644 --- a/book/chapter_convexity/convex_sets.md +++ b/book/chapter_convexity/convex_sets.md @@ -47,6 +47,8 @@ plt.tight_layout() plt.show() ``` +This figure visualizes a **convex set**: a polygon where the line segment connecting any two points within the set (e.g., points A and B) lies entirely inside the set. The red dashed line confirms this key geometric property of convex sets. 
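The same property can be checked numerically. The following cell is added for illustration; it recreates the polygon and the two points from the figure above and confirms that the whole segment between them lies inside the set.

```{code-cell} ipython3
# A complementary numerical check (added for illustration): every point on the
# segment between A and B lies inside the convex polygon used in the figure above.
import numpy as np
from matplotlib.path import Path

polygon = Path(np.array([
    [1, 1], [2, 3], [4, 4], [6, 3], [5, 1], [3, 0]
]))
A, B = np.array([2.5, 2.0]), np.array([4.5, 2.5])

t = np.linspace(0, 1, 200)[:, None]
segment = (1 - t) * A + t * B
print(bool(polygon.contains_points(segment).all()))   # True for a convex set
```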
+ ```{code-cell} ipython3 @@ -84,6 +86,7 @@ plt.tight_layout() plt.show() ``` +This figure illustrates a **non-convex set**: a shape where the line segment between two points inside the set (A and B) partially lies **outside** the set. This violation of the convexity condition distinguishes non-convex sets from convex ones. A set $\mathcal{X} \subseteq \mathbb{R}^d$ is **convex** if diff --git a/book/chapter_decompositions/svd.md b/book/chapter_decompositions/svd.md index 60bd36a..adaa756 100644 --- a/book/chapter_decompositions/svd.md +++ b/book/chapter_decompositions/svd.md @@ -3,7 +3,9 @@ Singular value decomposition (SVD) is a widely applicable tool in linear algebra. Its strength stems partially from the fact that *every matrix* $\mathbf{A} \in \mathbb{R}^{m \times n}$ has an SVD (even non-square -matrices)! The decomposition goes as follows: +matrices)! + +The decomposition goes as follows: $$\mathbf{A} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^{\!\top\!}$$ @@ -76,7 +78,10 @@ immediately obvious, but the sum of outer products is actually equivalent to an appropriate matrix-matrix product! We formalize this statement as -*Proposition.* +:::{prf:proposition} +:label: prop-sum-outer-products +:nonumber: + Let $\mathbf{a}_1, \dots, \mathbf{a}_k \in \mathbb{R}^m$ and $\mathbf{b}_1, \dots, \mathbf{b}_k \in \mathbb{R}^n$. Then @@ -85,8 +90,11 @@ $$\sum_{\ell=1}^k \mathbf{a}_\ell\mathbf{b}_\ell^{\!\top\!} = \mathbf{A}\mathbf{ where $$\mathbf{A} = \begin{bmatrix}\mathbf{a}_1 & \cdots & \mathbf{a}_k\end{bmatrix}, \hspace{0.5cm} \mathbf{B} = \begin{bmatrix}\mathbf{b}_1 & \cdots & \mathbf{b}_k\end{bmatrix}$$ +::: -*Proof.* For each $(i,j)$, we have +:::{prf:proof} + +For each $(i,j)$, we have $$\left[\sum_{\ell=1}^k \mathbf{a}_\ell\mathbf{b}_\ell^{\!\top\!}\right]_{ij} = \sum_{\ell=1}^k [\mathbf{a}_\ell\mathbf{b}_\ell^{\!\top\!}]_{ij} = \sum_{\ell=1}^k [\mathbf{a}_\ell]_i[\mathbf{b}_\ell]_j = \sum_{\ell=1}^k A_{i\ell}B_{j\ell}$$ @@ -95,4 +103,4 @@ the $i$th row of $\mathbf{A}$ and the $j$th row of $\mathbf{B}$, or equivalently the $j$th column of $\mathbf{B}^{\!\top\!}$. Hence by the definition of matrix multiplication, it is equal to $[\mathbf{A}\mathbf{B}^{\!\top\!}]_{ij}$. 
◻ - +::: From fef720f7ec1d6057a936bf8c389a85275eb1e3a0 Mon Sep 17 00:00:00 2001 From: clippert Date: Sun, 25 May 2025 21:39:41 +0200 Subject: [PATCH 36/43] material for w06 --- book/_toc.yml | 22 +- book/chapter_decompositions/matrix_norms.md | 327 ++++++++++++++++++ .../orthogonal_projections.md | 206 +++++++++++ book/chapter_decompositions/pseudoinverse.md | 26 +- 4 files changed, 558 insertions(+), 23 deletions(-) create mode 100644 book/chapter_decompositions/orthogonal_projections.md diff --git a/book/_toc.yml b/book/_toc.yml index 21097ea..98438c0 100644 --- a/book/_toc.yml +++ b/book/_toc.yml @@ -86,21 +86,23 @@ parts: - file: chapter_decompositions/eigenvectors # end week 05 - file: chapter_decompositions/orthogonal_matrices - file: chapter_decompositions/symmetric_matrices - - file: chapter_decompositions/Rayleigh_quotients # skip for now + - file: chapter_decompositions/Rayleigh_quotients + - file: chapter_decompositions/matrix_norms - file: chapter_decompositions/psd_matrices - file: chapter_decompositions/pca # PCA as example for the eigenvalue decomposition of a psd matrix title: Principal Components Analysis - file: chapter_decompositions/svd # - - file: chapter_decompositions/big_picture - - file: chapter_decompositions/RBF_kernel_Positive_Definite - - file: chapter_decompositions/representer_theorem +# - file: chapter_decompositions/RBF_kernel_Positive_Definite - file: chapter_decompositions/pseudoinverse - - file: chapter_decompositions/matrix_norms - - file: chapter_convexity/overview_convexity - title: Convexity - sections: - - file: chapter_convexity/convex_sets - - file: chapter_convexity/convex_functions + - file: chapter_decompositions/orthogonal_projections + - file: chapter_decompositions/big_picture + title: Fundamental Subspaces +# - file: chapter_decompositions/representer_theorem +# - file: chapter_convexity/overview_convexity +# title: Convexity +# sections: +# - file: chapter_convexity/convex_sets +# - file: chapter_convexity/convex_functions # continue with second order optimization # title: Second-Order Optimization # - file: chapter_calculus/newtons_method diff --git a/book/chapter_decompositions/matrix_norms.md b/book/chapter_decompositions/matrix_norms.md index 3c28d81..36ebd8b 100644 --- a/book/chapter_decompositions/matrix_norms.md +++ b/book/chapter_decompositions/matrix_norms.md @@ -11,3 +11,330 @@ kernelspec: name: python3 --- # Matrix Norms + +Matrix norms provide a way to measure the "size" or "magnitude" of a matrix. They are used throughout machine learning and numerical analysis—for example, to quantify approximation error, assess convergence in optimization algorithms, or bound the spectral properties of linear transformations. + +## Definition + +A **matrix norm** is a function $ \|\cdot\| : \mathbb{R}^{m \times n} \to \mathbb{R} $ satisfying the following properties for all matrices $ \mathbf{A}, +\mathbf{B} \in \mathbb{R}^{m \times n} $ and all scalars $ \alpha \in \mathbb{R} $: + +1. **Non-negativity**: $ \|\mathbf{A}\| \geq 0 $ +2. **Definiteness**: $ \|\mathbf{A}\| = 0 \iff \mathbf{A} = 0 $ +3. **Homogeneity**: $ \|\alpha \mathbf{A}\| = |\alpha| \cdot \|\mathbf{A}\| $ +4. **Triangle inequality**: $ \|\mathbf{A} + \mathbf{B}\| \leq \|\mathbf{A}\| + \|\mathbf{B}\| $ + + +## Common Matrix Norms + +### 1. **Frobenius Norm** + +Defined by: + +$$ +\|\mathbf{A}\|_F = \sqrt{\sum_{i,j} A_{ij}^2} = \sqrt{\mathrm{tr}(\mathbf{A}^\top \mathbf{A})} +$$ + +It treats the matrix as a vector in $ \mathbb{R}^{mn} $. + +### 2. 
**Induced (Operator) Norms** + +Given a vector norm $ \|\cdot\| $, the **induced matrix norm** is: + +$$ +\|\mathbf{A}\| = \sup_{\mathbf{x} \neq 0} \frac{\|\mathbf{A} \mathbf{x}\|}{\|\mathbf{x}\|} = \sup_{\|\mathbf{x}\| = 1} \|\mathbf{A} \mathbf{x}\| +$$ + +Examples: +- **Spectral norm**: Induced by the Euclidean norm $ \|\cdot\|_2 $. + + Equal to the largest singular value of $ \mathbf{A}. $ +- **$ \ell_1 $ norm**: Maximum absolute column sum. +- **$ \ell_\infty $ norm**: Maximum absolute row sum. + + +## Properties + +- Induced norms satisfy the **submultiplicative property**: + +$$ +\|\mathbf{A}\mathbf{B}\| \leq \|\mathbf{A}\| \cdot \|\mathbf{B}\| +$$ + +- For the Frobenius norm: + +$$ +\|\mathbf{A}\mathbf{B}\|_F \leq \|\mathbf{A}\|_F \cdot \|\mathbf{B}\|_F +$$ + +- All norms on a finite-dimensional vector space are equivalent (they define the same topology), but may differ in scaling. + + +## Applications in Machine Learning + +- In **optimization**, norms define constraints (e.g., Lasso uses $ \ell_1 $-norm penalty). +- In **regularization**, norms quantify complexity of parameter matrices (e.g., weight decay with $ \ell_2 $-norm). +- In **spectral methods**, matrix norms bound approximation error (e.g., spectral norm bounds for generalization). + + +## Visual Comparison (2D case) + +In 2D, vector norms induce different geometries: +- $ \ell_2 $: circular level sets +- $ \ell_1 $: diamond-shaped level sets +- $ \ell_\infty $: square level sets + +This influences which directions are favored in optimization and which vectors are "small" under a given norm. + +```{code-cell} ipython3 +:tags: [hide-input] +import numpy as np +import matplotlib.pyplot as plt + +# Define example matrix A +A = np.array([[2, 1], + [1, 3]]) + +# Create unit circles in different norms +theta = np.linspace(0, 2 * np.pi, 400) +circle = np.stack([np.cos(theta), np.sin(theta)], axis=1) + +# l1 unit ball boundary (diamond) +l1_vectors = [] +for v in circle: + norm = np.sum(np.abs(v)) + l1_vectors.append(v / norm) +l1_vectors = np.array(l1_vectors) + +# l2 unit ball (circle) +l2_vectors = circle + +# linf unit ball boundary (square) +linf_vectors = [] +for v in circle: + norm = np.max(np.abs(v)) + linf_vectors.append(v / norm) +linf_vectors = np.array(linf_vectors) + +# Apply matrix A to each set +l1_transformed = l1_vectors @ A.T +l2_transformed = l2_vectors @ A.T +linf_transformed = linf_vectors @ A.T + +# Plot +fig, ax = plt.subplots(1, 3, figsize=(15, 5)) + +# l1 norm effect +ax[0].plot(l1_vectors[:, 0], l1_vectors[:, 1], label='Original') +ax[0].plot(l1_transformed[:, 0], l1_transformed[:, 1], label='Transformed') +ax[0].set_title(r'$\ell_1$ Norm Unit Ball $\rightarrow A$') +ax[0].axis('equal') +ax[0].grid(True) +ax[0].legend() + +# l2 norm effect +ax[1].plot(l2_vectors[:, 0], l2_vectors[:, 1], label='Original') +ax[1].plot(l2_transformed[:, 0], l2_transformed[:, 1], label='Transformed') +ax[1].set_title(r'$\ell_2$ Norm Unit Ball $\rightarrow A$') +ax[1].axis('equal') +ax[1].grid(True) +ax[1].legend() + +# linf norm effect +ax[2].plot(linf_vectors[:, 0], linf_vectors[:, 1], label='Original') +ax[2].plot(linf_transformed[:, 0], linf_transformed[:, 1], label='Transformed') +ax[2].set_title(r'$\ell_\infty$ Norm Unit Ball $\rightarrow A$') +ax[2].axis('equal') +ax[2].grid(True) +ax[2].legend() + +plt.tight_layout() +plt.show() + +``` + + +Let’s give formal **definitions and proofs** for several commonly used **induced matrix norms**, also known as **operator norms**, derived from vector norms. 
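Before turning to the formal proofs, here is a quick numerical sanity check of the three formulas just stated. It is added for illustration only and uses NumPy's built-in induced norms (`np.linalg.norm` with `ord` equal to `1`, `2`, and `np.inf`) together with the matrix $\mathbf{A}$ from the plot above.

```{code-cell} ipython3
# A quick numerical sanity check (added for illustration) of the induced-norm
# formulas stated above, using the same matrix A as in the plot.
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

print(np.isclose(np.abs(A).sum(axis=0).max(), np.linalg.norm(A, 1)))        # max column sum
print(np.isclose(np.abs(A).sum(axis=1).max(), np.linalg.norm(A, np.inf)))   # max row sum
print(np.isclose(np.linalg.svd(A, compute_uv=False).max(), np.linalg.norm(A, 2)))  # largest singular value
```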
+ +--- + +## 📚 Setting + +Let $\|\cdot\|$ be a **vector norm** on $\mathbb{R}^n$, and define the **induced matrix norm** for $\mathbf{A} \in \mathbb{R}^{m \times n}$ as: + +$$ +\|\mathbf{A}\| = \sup_{\mathbf{x} \neq 0} \frac{\|\mathbf{A} \mathbf{x}\|}{\|\mathbf{x}\|} = \sup_{\|\mathbf{x}\| = 1} \|\mathbf{A} \mathbf{x}\| +$$ + +We’ll now state and prove specific formulas for induced norms when the underlying vector norm is: + +* $\ell_1$ +* $\ell_\infty$ +* $\ell_2$ (spectral norm) + +--- + +## 1. Induced $\ell_1$ Norm + +**Claim**: +If $\|\mathbf{x}\| = \|\mathbf{x}\|_1$, then: + +$$ +\|\mathbf{A}\|_1 = \max_{1 \leq j \leq n} \sum_{i=1}^m |A_{ij}| +\quad \text{(maximum absolute column sum)} +$$ + +You're absolutely right — the proof is missing a key justification: **how to construct a vector $\mathbf{x}$ that attains the bound**, i.e., that the triangle inequality becomes an equality. + +Here's the improved proof with that step made explicit. + +--- + +## 1. Induced $\ell_1$ Norm + +**Claim**: +If $\|\mathbf{x}\| = \|\mathbf{x}\|_1$, then: + +$$ +\|\mathbf{A}\|_1 = \max_{1 \leq j \leq n} \sum_{i=1}^m |A_{ij}| +\quad \text{(maximum absolute column sum)} +$$ + +--- + +:::{prf:proof} + +Let $\mathbf{A} = [a_{ij}] \in \mathbb{R}^{m \times n}$. + +Then by the definition of the induced norm: + +$$ +\|\mathbf{A}\|_1 = \sup_{\|\mathbf{x}\|_1 = 1} \|\mathbf{A} \mathbf{x}\|_1 += \sup_{\|\mathbf{x}\|_1 = 1} \sum_{i=1}^m \left| \sum_{j=1}^n a_{ij} x_j \right| +$$ + +Apply the triangle inequality inside the absolute value: + +$$ +\leq \sup_{\|\mathbf{x}\|_1 = 1} \sum_{i=1}^m \sum_{j=1}^n |a_{ij}| \cdot |x_j| += \sup_{\|\mathbf{x}\|_1 = 1} \sum_{j=1}^n |x_j| \left( \sum_{i=1}^m |a_{ij}| \right) +$$ + +Let us define the **column sums**: + +$$ +c_j := \sum_{i=1}^m |a_{ij}| +$$ + +Then the expression becomes: + +$$ +\sum_{j=1}^n |x_j| c_j \leq \max_j c_j \cdot \sum_{j=1}^n |x_j| = \max_j c_j +$$ + +since $\sum_{j=1}^n |x_j| = \|\mathbf{x}\|_1 = 1$, and this is a convex combination of the $c_j$. + +### Attainment of the Maximum + +Let $j^* \in \{1, \dots, n\}$ be the index of the column with maximum sum: + +$$ +c_{j^*} = \max_j \sum_i |a_{ij}| +$$ + +Now choose the **standard basis vector** $\mathbf{e}_{j^*} \in \mathbb{R}^n$, where: + +$$ +(\mathbf{e}_{j^*})_j = \begin{cases} +1, & j = j^* \\\\ +0, & j \neq j^* +\end{cases} +$$ + +Then $\|\mathbf{e}_{j^*}\|_1 = 1$, and: + +$$ +\|\mathbf{A} \mathbf{e}_{j^*}\|_1 = \sum_{i=1}^m \left| a_{i j^*} \right| = c_{j^*} +$$ + +So the upper bound is **achieved**, and we conclude: + +$$ +\|\mathbf{A}\|_1 = \max_j \sum_i |a_{ij}| +$$ + +QED. +::: +--- + +## 2. Induced $\ell_\infty$ Norm + +**Claim**: +If $\|\mathbf{x}\| = \|\mathbf{x}\|_\infty$, then: + +$$ +\|\mathbf{A}\|_\infty = \max_{1 \leq i \leq m} \sum_{j=1}^n |A_{ij}| +\quad \text{(maximum absolute row sum)} +$$ + +:::{prf:proof} + +Let $\|\mathbf{x}\|_\infty = 1$. + +Then: + +$$ +\|A \mathbf{x}\|_\infty = \max_{i} \left| \sum_j a_{ij} x_j \right| +\leq \max_i \sum_j |a_{ij}||x_j| \leq \max_i \sum_j |a_{ij}| +$$ + +Equality is achieved by choosing $x_j = \operatorname{sign}(a_{ij^*})$ at the row $i^*$ with largest sum. So: + +$$ +\|\mathbf{A}\|_\infty = \max_i \sum_j |a_{ij}| +$$ + +QED. +::: + +## 3. 
Induced $\ell_2$ Norm (Spectral Norm) + +**Claim**: +If $\|\cdot\| = \|\cdot\|_2$, then: + +$$ +\|\mathbf{A}\|_2 = \sigma_{\max}(\mathbf{A}) = \sqrt{\lambda_{\max}(\mathbf{A}^\top \mathbf{A})} +$$ + +where $\sigma_{\max}(\mathbf{A})$ is the **largest singular value** of $\mathbf{A}$, and $\lambda_{\max}$ denotes the largest eigenvalue. + +:::{prf:proof} + +$$ +\|\mathbf{A}\|_2 = \sup_{\|\mathbf{x}\|_2 = 1} \|\mathbf{A} \mathbf{x}\|_2 += \sup_{\|\mathbf{x}\|_2 = 1} \sqrt{(\mathbf{A} \mathbf{x})^\top (\mathbf{A} \mathbf{x})} += \sup_{\|\mathbf{x}\|_2 = 1} \sqrt{\mathbf{x}^\top \mathbf{A}^\top \mathbf{A} \mathbf{x}} +$$ + +This is the **Rayleigh quotient** of $\mathbf{A}^\top \mathbf{A}$, a symmetric PSD matrix. + +So: + +$$ +\|\mathbf{A}\|_2 = \sqrt{\lambda_{\max}(\mathbf{A}^\top \mathbf{A})} +$$ + +QED. +::: +--- + +## Summary Table + +| Vector Norm | Induced Matrix Norm | | +| ------------- | ----------------------------|----------------------------- | +| $\ell_1$ | Max column sum: | $\max_j \sum_i a_{ij}$ | +| $\ell_\infty$ | Max row sum: | $\max_i \sum_j a_{ij}$ | +| $\ell_2$ | Largest singular value: |$\sqrt{\lambda_{\max}(A^\top A)}$ | + + + diff --git a/book/chapter_decompositions/orthogonal_projections.md b/book/chapter_decompositions/orthogonal_projections.md new file mode 100644 index 0000000..52de739 --- /dev/null +++ b/book/chapter_decompositions/orthogonal_projections.md @@ -0,0 +1,206 @@ +--- +jupytext: + text_representation: + extension: .md + format_name: myst + format_version: 0.13 + jupytext_version: 1.16.7 +kernelspec: + display_name: Python 3 + language: python + name: python3 +--- +# Orthogonal projections + +We now consider a particular kind of optimization problem that is +particularly well-understood and can often be solved in closed form: +given some point $\mathbf{x}$ in an inner product space $V$, find the +closest point to $\mathbf{x}$ in a subspace $S$ of $V$. This process is +referred to as **projection onto a subspace**. 
+ +The following diagram should make it geometrically clear that, at least +in Euclidean space, the solution is intimately related to orthogonality +and the Pythagorean theorem: + +```{code-cell} ipython3 +:tags: [hide-input] +# Re-import required packages due to kernel reset +import numpy as np +import matplotlib.pyplot as plt + +# Define subspace S spanned by vector e1 +e1 = np.array([1, 2]) +e1 = e1 / np.linalg.norm(e1) # normalize to make it orthonormal + +# Define arbitrary point x not in the subspace +x = np.array([2, 1]) + +# Compute projection of x onto the subspace spanned by e1 +x_proj = np.dot(x, e1) * e1 + +# Define a second point y in the subspace (for triangle) +y = 3 * e1 + +# Set up plot +fig, ax = plt.subplots(figsize=(6, 6)) + +# Draw vectors +origin = np.array([0, 0]) +ax.quiver(*origin, *x, angles='xy', scale_units='xy', scale=1, color='blue', label=r'$\mathbf{x}$') +ax.quiver(*origin, *x_proj, angles='xy', scale_units='xy', scale=1, color='green', label=r'$\mathbf{y}^* = P\mathbf{x}$') +ax.quiver(*origin, *y, angles='xy', scale_units='xy', scale=1, color='gray', alpha=0.5, label=r'$\mathbf{y} \in S$') + +# Draw dashed lines to form triangle +ax.plot([x[0], x_proj[0]], [x[1], x_proj[1]], 'k--', lw=1) +ax.plot([y[0], x[0]], [y[1], x[1]], 'k--', lw=1) +ax.plot([y[0], x_proj[0]], [y[1], x_proj[1]], 'k--', lw=1) + +# Annotate +ax.text(*(x + 0.2), r'$\mathbf{x}$', fontsize=12) +ax.text(*(x_proj + 0.2), r'$\mathbf{y}^*$', fontsize=12) +ax.text(*(y + 0.2), r'$\mathbf{y}$', fontsize=12) + +# Draw subspace line +line_extent = np.linspace(-3, 3, 100) +s_line = np.outer(line_extent, e1) +ax.plot(s_line[:, 0], s_line[:, 1], 'r-', lw=1, label=r'Subspace $S$') + +# Formatting +ax.set_xlim(-1, 4) +ax.set_ylim(-1, 4) +ax.set_aspect('equal') +ax.grid(True) +ax.legend() +ax.set_title(r"Orthogonal Projection of $\mathbf{x}$ onto Subspace $S$") + +plt.tight_layout() +plt.show() +``` +In this diagram, the blue vector $\mathbf{x}$ is an arbitrary point in the +inner product space $V$, the green vector $\mathbf{y}^* = P\mathbf{x}$ is +the projection of $\mathbf{x}$ onto the subspace $S$, and the gray vector +$\mathbf{y}$ is an arbitrary point in $S$. The dashed lines form a right +triangle with $\mathbf{x}$, $\mathbf{y}^*$, and $\mathbf{y}$ as vertices. +The right triangle formed by these three points illustrates the +relationship between the projection and orthogonality: the line segment +from $\mathbf{x}$ to $\mathbf{y}^*$ is perpendicular to the subspace $S$, +and the distance from $\mathbf{x}$ to $\mathbf{y}^*$ is the shortest +distance from $\mathbf{x}$ to any point in $S$. This is a direct +consequence of the Pythagorean theorem, which states that in a right +triangle, the square of the length of the hypotenuse (in this case, +$\|\mathbf{x}-\mathbf{y}\|$) is equal to the sum of the squares of the +lengths of the other two sides (in this case, $\|\mathbf{x}-\mathbf{y}^*\|$ and $\|\mathbf{y}^*-\mathbf{y}\|$). + + +Here $\mathbf{y}$ is an arbitrary element of the subspace $S$, and +$\mathbf{y}^*$ is the point in $S$ such that $\mathbf{x}-\mathbf{y}^*$ +is perpendicular to $S$. The hypotenuse of a right triangle (in this +case $\|\mathbf{x}-\mathbf{y}\|$) is always longer than either of the +legs (in this case $\|\mathbf{x}-\mathbf{y}^*\|$ and +$\|\mathbf{y}^*-\mathbf{y}\|$), and when $\mathbf{y} \neq \mathbf{y}^*$ +there always exists such a triangle between $\mathbf{x}$, $\mathbf{y}$, +and $\mathbf{y}^*$. 
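The configuration in this figure can also be verified numerically. The cell below is added for illustration and recreates the subspace direction and the point $\mathbf{x}$ from the plot; it checks that the residual is orthogonal to $S$ and that no sampled point of $S$ is closer to $\mathbf{x}$ than $\mathbf{y}^*$.

```{code-cell} ipython3
# A short numerical check of the picture above (added for illustration):
# the residual x - y* is orthogonal to S, and y* is at least as close to x
# as any other sampled point of S.
import numpy as np

e1 = np.array([1.0, 2.0]) / np.linalg.norm([1.0, 2.0])   # unit vector spanning S
x = np.array([2.0, 1.0])
y_star = (x @ e1) * e1                                    # orthogonal projection of x onto S

print(bool(np.isclose((x - y_star) @ e1, 0.0)))           # residual is perpendicular to S

alphas = np.linspace(-5, 5, 1001)
candidates = alphas[:, None] * e1                         # other points y in S
distances = np.linalg.norm(x - candidates, axis=1)
print(bool(distances.min() >= np.linalg.norm(x - y_star) - 1e-9))
```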
+ +Our intuition from Euclidean space suggests that the closest point to +$\mathbf{x}$ in $S$ has the perpendicularity property described above, +and we now show that this is indeed the case. + +:::{prf:proposition} +:label: prop-unique-minimizer +:nonumber: +Suppose $\mathbf{x} \in V$ and $\mathbf{y} \in S$. Then $\mathbf{y}^*$ +is the unique minimizer of $\|\mathbf{x}-\mathbf{y}\|$ over +$\mathbf{y} \in S$ if and only if $\mathbf{x}-\mathbf{y}^* \perp S$. +::: + + +:::{prf:proof} +$(\implies)$ Suppose $\mathbf{y}^*$ is the unique minimizer of +$\|\mathbf{x}-\mathbf{y}\|$ over $\mathbf{y} \in S$. That is, +$\|\mathbf{x}-\mathbf{y}^*\| \leq \|\mathbf{x}-\mathbf{y}\|$ for all +$\mathbf{y} \in S$, with equality only if $\mathbf{y} = \mathbf{y}^*$. +Fix $\mathbf{v} \in S$ and observe that + +$$\begin{aligned} +g(t) &:= \|\mathbf{x}-\mathbf{y}^*+t\mathbf{v}\|^2 \\ +&= \langle \mathbf{x}-\mathbf{y}^*+t\mathbf{v}, \mathbf{x}-\mathbf{y}^*+t\mathbf{v} \rangle \\ +&= \langle \mathbf{x}-\mathbf{y}^*, \mathbf{x}-\mathbf{y}^* \rangle - 2t\langle \mathbf{x}-\mathbf{y}^*, \mathbf{v} \rangle + t^2\langle \mathbf{v}, \mathbf{v} \rangle \\ +&= \|\mathbf{x}-\mathbf{y}^*\|^2 - 2t\langle \mathbf{x}-\mathbf{y}^*, \mathbf{v} \rangle + t^2\|\mathbf{v}\|^2 +\end{aligned}$$ + +must have a minimum at $t = 0$ as a consequence of this +assumption. Thus + +$$0 = g'(0) = \left.-2\langle \mathbf{x}-\mathbf{y}^*, \mathbf{v} \rangle + 2t\|\mathbf{v}\|^2\right|_{t=0} = -2\langle \mathbf{x}-\mathbf{y}^*, \mathbf{v} \rangle$$ + +giving $\mathbf{x}-\mathbf{y}^* \perp \mathbf{v}$. Since $\mathbf{v}$ +was arbitrary in $S$, we have $\mathbf{x}-\mathbf{y}^* \perp S$ as +claimed. + +$(\impliedby)$ Suppose $\mathbf{x}-\mathbf{y}^* \perp S$. Observe that +for any $\mathbf{y} \in S$, $\mathbf{y}^*-\mathbf{y} \in S$ because +$\mathbf{y}^* \in S$ and $S$ is closed under subtraction. Under the +hypothesis, $\mathbf{x}-\mathbf{y}^* \perp \mathbf{y}^*-\mathbf{y}$, so +by the Pythagorean theorem, + +$$\|\mathbf{x}-\mathbf{y}\| = \|\mathbf{x}-\mathbf{y}^*+\mathbf{y}^*-\mathbf{y}\| = \|\mathbf{x}-\mathbf{y}^*\| + \|\mathbf{y}^*-\mathbf{y}\| \geq \|\mathbf{x} - \mathbf{y}^*\|$$ + +and in fact the inequality is strict when $\mathbf{y} \neq \mathbf{y}^*$ +since this implies $\|\mathbf{y}^*-\mathbf{y}\| > 0$. Thus +$\mathbf{y}^*$ is the unique minimizer of $\|\mathbf{x}-\mathbf{y}\|$ +over $\mathbf{y} \in S$. ◻ +::: + +Since a unique minimizer in $S$ can be found for any $\mathbf{x} \in V$, +we can define an operator + +$$P\mathbf{x} = \operatorname{argmin}_{\mathbf{y} \in S} \|\mathbf{x}-\mathbf{y}\|$$ + +Observe that $P\mathbf{y} = \mathbf{y}$ for any $\mathbf{y} \in S$, +since $\mathbf{y}$ has distance zero from itself and every other point +in $S$ has positive distance from $\mathbf{y}$. Thus +$P(P\mathbf{x}) = P\mathbf{x}$ for any $\mathbf{x}$ (i.e., $P^2 = P$) +because $P\mathbf{x} \in S$. The identity $P^2 = P$ is actually one of +the defining properties of a **projection**, the other being linearity. + +An immediate consequence of the previous result is that +$\mathbf{x} - P\mathbf{x} \perp S$ for any $\mathbf{x} \in V$, and +conversely that $P$ is the unique operator that satisfies this property +for all $\mathbf{x} \in V$. For this reason, $P$ is known as an +**orthogonal projection**. + +If we choose an orthonormal basis for the target subspace $S$, it is +possible to write down a more specific expression for $P$. 
+ +:::{prf:proposition} +:label: prop-orthonormal-basis-projection +:nonumber: + +If $\mathbf{e}_1, \dots, \mathbf{e}_m$ is an orthonormal basis for $S$, +then + +$$P\mathbf{x} = \sum_{i=1}^m \langle \mathbf{x}, \mathbf{e}_i \rangle\mathbf{e}_i$$ +::: + + +:::{prf:proof} +Let $\mathbf{e}_1, \dots, \mathbf{e}_m$ be an orthonormal basis +for $S$, and suppose $\mathbf{x} \in V$. Then for all $j = 1, \dots, m$, + +$$\begin{aligned} +\left\langle \mathbf{x}-\sum_{i=1}^m \langle \mathbf{x}, \mathbf{e}_i \rangle\mathbf{e}_i, \mathbf{e}_j \right\rangle &= \langle \mathbf{x}, \mathbf{e}_j \rangle - \sum_{i=1}^m \langle \mathbf{x}, \mathbf{e}_i \rangle\underbrace{\langle \mathbf{e}_i, \mathbf{e}_j \rangle}_{\delta_{ij}} \\ +&= \langle \mathbf{x}, \mathbf{e}_j \rangle - \langle \mathbf{x}, \mathbf{e}_j \rangle \\ +&= 0 +\end{aligned}$$ + +We have shown that the claimed expression, call it +$\tilde{P}\mathbf{x}$, satisfies +$\mathbf{x} - \tilde{P}\mathbf{x} \perp \mathbf{e}_j$ for every element +$\mathbf{e}_j$ of the orthonormal basis for $S$. It follows (by +linearity of the inner product) that +$\mathbf{x} - \tilde{P}\mathbf{x} \perp S$, so the previous result +implies $P = \tilde{P}$. ◻ +::: + +The fact that $P$ is a linear operator (and thus a proper projection, as +earlier we showed $P^2 = P$) follows readily from this result. diff --git a/book/chapter_decompositions/pseudoinverse.md b/book/chapter_decompositions/pseudoinverse.md index 1b016db..fb20d77 100644 --- a/book/chapter_decompositions/pseudoinverse.md +++ b/book/chapter_decompositions/pseudoinverse.md @@ -11,27 +11,27 @@ kernelspec: name: python3 --- # Moore-Penrose Pseudoinverse -The Moore-Penrose pseudoinverse is a generalization of the matrix inverse that can be applied to non-square or singular matrices. It is denoted as $ A^+ $ for a matrix $ A $. The pseudoinverse satisfies the following properties: -1. **Existence**: The pseudoinverse exists for any matrix $ A $. +The Moore-Penrose pseudoinverse is a generalization of the matrix inverse that can be applied to non-square or singular matrices. It is denoted as $ \mathbf{A}^+ $ for a matrix $ \mathbf{A} $. The pseudoinverse satisfies the following properties: +1. **Existence**: The pseudoinverse exists for any matrix $ \mathbf{A} $. 2. **Uniqueness**: The pseudoinverse is unique. 3. **Properties**: - - $ A A^+ A = A $ - - $ A^+ A A^+ = A^+ $ - - $ (A A^+)^\top = A A^+ $ - - $ (A^+ A)^\top = A^+ A $ -4. **Rank**: The rank of $ A^+ $ is equal to the rank of $ A $. -5. **Singular Value Decomposition (SVD)**: The pseudoinverse can be computed using the singular value decomposition of $ A $. If $ A = U \Sigma V^\top $, where $ U $ and $ V $ are orthogonal matrices and $ \Sigma $ is a diagonal matrix with singular values, then: + - $ \mathbf{A} \mathbf{A}^+ \mathbf{A} = \mathbf{A} $ + - $ \mathbf{A}^+ \mathbf{A} \mathbf{A}^+ = \mathbf{A}^+ $ + - $ (\mathbf{A} \mathbf{A}^+)^\top = \mathbf{A} \mathbf{A}^+ $ + - $ (\mathbf{A}^+ \mathbf{A})^\top = \mathbf{A}^+ \mathbf{A} $ +4. **Rank**: The rank of $ \mathbf{A}^+ $ is equal to the rank of $ \mathbf{A} $. +5. **Singular Value Decomposition (SVD)**: The pseudoinverse can be computed using the singular value decomposition of $ \mathbf{A} $. 
If $ \mathbf{A} = \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}^\top $, where $ \mathbf{U} $ and $ \mathbf{V} $ are orthogonal matrices and $ \boldsymbol{\Sigma} $ is a diagonal matrix with singular values, then: $$ - A^+ = V \Sigma^+ U^\top + \mathbf{A}^+ = \mathbf{V} \boldsymbol{\Sigma}^+ \mathbf{U}^\top $$ - where $ \Sigma^+ $ is obtained by taking the reciprocal of the non-zero singular values in $ \Sigma $ and transposing the resulting matrix. + where $ \boldsymbol{\Sigma}^+ $ is obtained by taking the reciprocal of the non-zero singular values in $ \boldsymbol{\Sigma} $ and transposing the resulting matrix. 6. **Applications**: The pseudoinverse is used in various applications, including solving linear systems, least squares problems, and in machine learning algorithms such as linear regression. -7. **Least Squares Solution**: The pseudoinverse provides a least squares solution to the equation $ Ax = b $ when $ A $ is not square or has no unique solution. The least squares solution is given by: +7. **Least Squares Solution**: The pseudoinverse provides a least squares solution to the equation $ \mathbf{A}\mathbf{x} = \mathbf{b} $ when $ \mathbf{A} $ is not square or has no unique solution. The least squares solution is given by: $$ - x = A^+ b + \mathbf{x} = \mathbf{A}^+ \mathbf{b} $$ -8. **Geometric Interpretation**: The pseudoinverse can be interpreted geometrically as the projection of a vector onto the column space of $ A $. +8. **Geometric Interpretation**: The pseudoinverse can be interpreted geometrically as the projection of a vector onto the column space of $ \mathbf{A} $. 9. **Computational Considerations**: The computation of the pseudoinverse can be done efficiently using numerical methods, such as the SVD, especially for large matrices. 10. **Limitations**: The pseudoinverse may not be suitable for all applications, especially when the matrix is ill-conditioned or has a high condition number. From 366b36af9776a63ef9c3ee507e25606d8a3f586e Mon Sep 17 00:00:00 2001 From: clippert Date: Sun, 25 May 2025 23:10:28 +0200 Subject: [PATCH 37/43] added collaborative filtering example --- book/chapter_decompositions/matrix_norms.md | 201 +++++++++++++++++--- 1 file changed, 174 insertions(+), 27 deletions(-) diff --git a/book/chapter_decompositions/matrix_norms.md b/book/chapter_decompositions/matrix_norms.md index 36ebd8b..e2680c0 100644 --- a/book/chapter_decompositions/matrix_norms.md +++ b/book/chapter_decompositions/matrix_norms.md @@ -24,6 +24,7 @@ A **matrix norm** is a function $ \|\cdot\| : \mathbb{R}^{m \times n} \to \mathb 3. **Homogeneity**: $ \|\alpha \mathbf{A}\| = |\alpha| \cdot \|\mathbf{A}\| $ 4. **Triangle inequality**: $ \|\mathbf{A} + \mathbf{B}\| \leq \|\mathbf{A}\| + \|\mathbf{B}\| $ +These are the **minimal axioms** for a matrix norm — analogous to vector norms. ## Common Matrix Norms @@ -53,7 +54,13 @@ Examples: - **$ \ell_\infty $ norm**: Maximum absolute row sum. -## Properties +## Submultiplicativity + +The **submultiplicative property is an additional structure**, not a required axiom. Many useful matrix norms (especially induced norms) **do** satisfy it, but not all matrix norms do. + +When a matrix norm satisfies it, we say it is a: + +> **Submultiplicative matrix norm** - Induced norms satisfy the **submultiplicative property**: @@ -67,14 +74,15 @@ $$ \|\mathbf{A}\mathbf{B}\|_F \leq \|\mathbf{A}\|_F \cdot \|\mathbf{B}\|_F $$ -- All norms on a finite-dimensional vector space are equivalent (they define the same topology), but may differ in scaling. 
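As a small, concrete illustration of this equivalence (added here, not part of the surrounding text), the following cell checks one pair of two-sided bounds relating the Frobenius and spectral norms on random matrices.

```{code-cell} ipython3
# Norm equivalence in action (added for illustration): for random matrices the
# Frobenius and spectral norms bound each other up to a dimension-dependent factor,
#   ||A||_2 <= ||A||_F <= sqrt(min(m, n)) * ||A||_2.
import numpy as np

rng = np.random.default_rng(0)
ok = True
for _ in range(100):
    m, n = rng.integers(2, 8, size=2)
    A = rng.standard_normal((m, n))
    s2, fro = np.linalg.norm(A, 2), np.linalg.norm(A, "fro")
    ok &= bool(s2 <= fro + 1e-12) and bool(fro <= np.sqrt(min(m, n)) * s2 + 1e-9)
print(ok)
```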
+| Norm | Submultiplicative? | Notes | +| ------------------------------ | ------------------ | ---------------------------------- | +| Frobenius norm $\|\cdot\|_F$ | ✅ Yes | But not induced from a vector norm | +| Induced norms (e.g., spectral) | ✅ Yes | Always submultiplicative | +| Entrywise max norm | ❌ No | Not submultiplicative in general | -## Applications in Machine Learning -- In **optimization**, norms define constraints (e.g., Lasso uses $ \ell_1 $-norm penalty). -- In **regularization**, norms quantify complexity of parameter matrices (e.g., weight decay with $ \ell_2 $-norm). -- In **spectral methods**, matrix norms bound approximation error (e.g., spectral norm bounds for generalization). +- All norms on a finite-dimensional vector space are equivalent (they define the same topology), but may differ in scaling. ## Visual Comparison (2D case) @@ -86,6 +94,15 @@ In 2D, vector norms induce different geometries: This influences which directions are favored in optimization and which vectors are "small" under a given norm. +Here is a visual comparison of how different induced norms transform unit circles in 2D space under a linear transformation defined by a matrix $ A $: + +$$ +\mathbf{A} = \begin{bmatrix} +2 & 1 \\ +1 & 3 +\end{bmatrix} +$$ + ```{code-cell} ipython3 :tags: [hide-input] import numpy as np @@ -156,10 +173,6 @@ plt.show() Let’s give formal **definitions and proofs** for several commonly used **induced matrix norms**, also known as **operator norms**, derived from vector norms. ---- - -## 📚 Setting - Let $\|\cdot\|$ be a **vector norm** on $\mathbb{R}^n$, and define the **induced matrix norm** for $\mathbf{A} \in \mathbb{R}^{m \times n}$ as: $$ @@ -184,23 +197,6 @@ $$ \quad \text{(maximum absolute column sum)} $$ -You're absolutely right — the proof is missing a key justification: **how to construct a vector $\mathbf{x}$ that attains the bound**, i.e., that the triangle inequality becomes an equality. - -Here's the improved proof with that step made explicit. - ---- - -## 1. Induced $\ell_1$ Norm - -**Claim**: -If $\|\mathbf{x}\| = \|\mathbf{x}\|_1$, then: - -$$ -\|\mathbf{A}\|_1 = \max_{1 \leq j \leq n} \sum_{i=1}^m |A_{ij}| -\quad \text{(maximum absolute column sum)} -$$ - ---- :::{prf:proof} @@ -310,6 +306,10 @@ where $\sigma_{\max}(\mathbf{A})$ is the **largest singular value** of $\mathbf{ :::{prf:proof} +Let $\|\mathbf{x}\|_2 = 1$. + +Then: + $$ \|\mathbf{A}\|_2 = \sup_{\|\mathbf{x}\|_2 = 1} \|\mathbf{A} \mathbf{x}\|_2 = \sup_{\|\mathbf{x}\|_2 = 1} \sqrt{(\mathbf{A} \mathbf{x})^\top (\mathbf{A} \mathbf{x})} @@ -337,4 +337,151 @@ QED. | $\ell_2$ | Largest singular value: |$\sqrt{\lambda_{\max}(A^\top A)}$ | +## Applications in Machine Learning +- In **optimization**, norms define constraints (e.g., Lasso uses $ \ell_1 $-norm penalty). +- In **regularization**, norms quantify complexity of parameter matrices (e.g., weight decay with $ \ell_2 $-norm). +- In **spectral methods**, matrix norms bound approximation error (e.g., spectral norm bounds for generalization). + + +Certainly! Here's a concise and precise introduction paragraph for your textbook or lecture notes: + +--- + +## Collaborative Filtering and Matrix Factorization + +**Collaborative filtering** is a foundational technique in recommendation systems, where the goal is to predict a user's preference for items based on observed interactions (such as ratings, clicks, or purchases). 
The key assumption underlying collaborative filtering is that **user preferences and item characteristics lie in a shared low-dimensional latent space**. That is, although we observe only sparse user-item interactions, there exists a hidden structure — often of low rank — that explains these patterns. + +A common model formalizes this intuition by representing the user-item rating matrix $R \in \mathbb{R}^{m \times n}$ as the product of two low-rank matrices: + +$$ +R \approx UV^\top +$$ + +where $U \in \mathbb{R}^{m \times k}$ encodes latent user features and $V \in \mathbb{R}^{n \times k}$ encodes latent item features, for some small $k \ll \min(m, n)$. The model is typically fit by **minimizing the squared error** over observed entries, together with regularization to prevent overfitting: + +$$ +\min_{U, V} \sum_{(i,j) \in \Omega} (R_{ij} - U_i^\top V_j)^2 + \lambda (\|U\|_F^2 + \|V\|_F^2) +$$ + +where $\Omega \subset [m] \times [n]$ is the set of observed ratings, and $\| \cdot \|_F$ is the Frobenius norm. This formulation implicitly assumes that **missing ratings are missing at random** and that users with similar latent profiles tend to rate items similarly — an assumption that allows the model to generalize from sparse data. + +```{code-cell} ipython3 +class MatrixFactorization: + def __init__(self, k=2, steps=1000, lam=0.1): + """ + Initializes the matrix factorization model. + + Parameters: + - k (int): number of latent features + - steps (int): number of ALS iterations + - lam (float): regularization strength + """ + self.k = k + self.steps = steps + self.lam = lam + self.U = None + self.V = None + + def fit(self, R, mask): + """ + Fit the model to the observed rating matrix using ALS. + + Parameters: + - R (ndarray): observed rating matrix (with zeros for missing entries) + - mask (ndarray): boolean matrix where True indicates an observed entry + """ + num_users, num_items = R.shape + self.U = np.random.randn(num_users, self.k) + self.V = np.random.randn(num_items, self.k) + + for step in range(self.steps): + # Update U + for i in range(num_users): + V_masked = self.V[mask[i, :]] + R_i = R[i, mask[i, :]] + if len(R_i) > 0: + A = V_masked.T @ V_masked + self.lam * np.eye(self.k) + b = V_masked.T @ R_i + self.U[i] = np.linalg.solve(A, b) + # Update V + for j in range(num_items): + U_masked = self.U[mask[:, j]] + R_j = R[mask[:, j], j] + if len(R_j) > 0: + A = U_masked.T @ U_masked + self.lam * np.eye(self.k) + b = U_masked.T @ R_j + self.V[j] = np.linalg.solve(A, b) + + def predict(self): + """ + Returns the full reconstructed rating matrix. + """ + return self.U @ self.V.T + + def predict_single(self, user_idx, item_idx): + """ + Predict a single rating for a user-item pair. 
+ + Parameters: + - user_idx (int): index of the user + - item_idx (int): index of the item + + Returns: + - float: predicted rating + """ + return self.U[user_idx] @ self.V[item_idx] +``` + + +This example demonstrates **collaborative filtering via matrix factorization** using the **Frobenius norm** to minimize reconstruction error: + +```{code-cell} ipython3 +:tags: [hide-input] +# Re-import necessary packages after kernel reset +import numpy as np +import matplotlib.pyplot as plt + +# Set random seed for reproducibility +np.random.seed(42) + +# Generate a low-rank user-item matrix (simulating ratings) +num_users = 10 +num_items = 8 +rank = 2 # desired low-rank structure + +# Latent user and item factors +U_true = np.random.randn(num_users, rank) +V_true = np.random.randn(num_items, rank) + +# Generate full rating matrix (low-rank) +R_true = U_true @ V_true.T + +# Simulate missing entries by masking some values +mask = np.random.rand(num_users, num_items) < 0.5 +R_observed = R_true * mask + +model = MatrixFactorization(k=rank, steps=1000, lam=0.1) +model.fit(R_observed, mask) +R_pred = model.predict() + +# Plotting the true, observed, and predicted matrices +fig, axs = plt.subplots(1, 3, figsize=(15, 4)) +im0 = axs[0].imshow(R_true, cmap='coolwarm', vmin=-5, vmax=5) +axs[0].set_title("True Rating Matrix") +im1 = axs[1].imshow(np.where(mask, R_observed, np.nan), cmap='coolwarm', vmin=-5, vmax=5) +axs[1].set_title("Observed Ratings (with Missing)") +im2 = axs[2].imshow(R_pred, cmap='coolwarm', vmin=-5, vmax=5) +axs[2].set_title("Predicted Ratings via MF") + +for ax in axs: + ax.set_xlabel("Items") + ax.set_ylabel("Users") + +fig.colorbar(im2, ax=axs, orientation='vertical', fraction=0.02, pad=0.04) + +plt.show() +``` +* **Left panel**: The true user-item rating matrix (low-rank structure). +* **Middle panel**: The observed entries, with \~50% missing. +* **Right panel**: The matrix reconstructed via alternating least squares (ALS). 
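A natural follow-up to the three panels above is to quantify how well the factorization recovers the entries that were hidden during fitting. The short cell below is a sketch that reuses the `R_true`, `R_pred`, `mask`, and `model` variables from the previous cells (no new data or parameters are introduced); it reports the root-mean-squared error separately on observed and held-out positions, and calls the `predict_single` helper for one user-item pair.

```{code-cell} ipython3
import numpy as np

# Reconstruction error on observed vs. held-out (missing) entries
rmse_observed = np.sqrt(np.mean((R_true[mask] - R_pred[mask]) ** 2))
rmse_missing  = np.sqrt(np.mean((R_true[~mask] - R_pred[~mask]) ** 2))

print(f"RMSE on observed entries: {rmse_observed:.3f}")
print(f"RMSE on held-out entries: {rmse_missing:.3f}")

# A single predicted rating for one user-item pair via the helper method
print("Predicted rating (user 0, item 1):", model.predict_single(0, 1))
print("True rating      (user 0, item 1):", R_true[0, 1])
```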
From 35f51f4dee73fbb60bd492c7756c990089a5f7c2 Mon Sep 17 00:00:00 2001 From: clippert Date: Mon, 26 May 2025 08:53:52 +0200 Subject: [PATCH 38/43] bugfix yAy --- book/chapter_decompositions/Rayleigh_quotients.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/book/chapter_decompositions/Rayleigh_quotients.md b/book/chapter_decompositions/Rayleigh_quotients.md index 9dcec84..80312fe 100644 --- a/book/chapter_decompositions/Rayleigh_quotients.md +++ b/book/chapter_decompositions/Rayleigh_quotients.md @@ -135,7 +135,7 @@ Let $\mathbf{x}\neq \boldsymbol{0},$ then $R_\mathbf{A}(\mathbf{x}) = \frac{\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x}}{\mathbf{x}^{\!\top\!}\mathbf{x}} = \frac{\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x}}{\|\mathbf{x}\|^2} = (\frac{\mathbf{x}}{\|\mathbf{x}\|})^{\!\top\!}\mathbf{A}(\frac{\mathbf{x}}{\|\mathbf{x}\|})$ -Thus, minimimum and maximum of the Rayleigh quotient are identical to minimum and maximum of the squared form $\mathbf{y}\mathbf{A}\mathbf{y}$ for the unit-norm vector $\mathbf{y}=\mathbf{x}/\|\mathbf{x}\|$: +Thus, minimimum and maximum of the Rayleigh quotient are identical to minimum and maximum of the squared form $\mathbf{y}^\top\mathbf{A}\mathbf{y}$ for the unit-norm vector $\mathbf{y}=\mathbf{x}/\|\mathbf{x}\|$: $$\lambda_{\min}(\mathbf{A}) \leq R_\mathbf{A}(\mathbf{x}) \leq \lambda_{\max}(\mathbf{A})$$ From 26454d3577489015efd7b7f6c6964b8552848bb7 Mon Sep 17 00:00:00 2001 From: clippert Date: Tue, 27 May 2025 23:10:41 +0200 Subject: [PATCH 39/43] material wednesday --- book/chapter_decompositions/big_picture.md | 94 +++++- .../orthogonal_projections.md | 282 +++++++++++++++--- book/chapter_decompositions/pseudoinverse.md | 220 ++++++++++++-- book/chapter_decompositions/svd.md | 31 +- 4 files changed, 557 insertions(+), 70 deletions(-) diff --git a/book/chapter_decompositions/big_picture.md b/book/chapter_decompositions/big_picture.md index 4f6735e..f8b3f7e 100644 --- a/book/chapter_decompositions/big_picture.md +++ b/book/chapter_decompositions/big_picture.md @@ -12,12 +12,96 @@ kernelspec: --- # The fundamental subspaces of a matrix -The fundamental subspaces of a matrix $A$ are the four subspaces associated with the matrix and its transpose. These subspaces are important in linear algebra and numerical analysis, particularly in the context of solving linear systems and eigenvalue problems. +The fundamental subspaces of a matrix $\mathbf{A}$ are the four subspaces associated with the matrix and its transpose. +These subspaces are important in linear algebra and numerical analysis, particularly in the context of solving linear systems and eigenvalue problems. +We also provide the projections onto these subspaces, which are useful for various applications such as least squares problems and dimensionality reduction. The proof of these projection formulas relies on the properties of the Moore-Penrose pseudoinverse and the orthogonal projections onto subspaces. -1. **Column Space (Range) of A**: The column space of a matrix $A$ is the set of all possible linear combinations of its columns. It represents the span of the columns of $A$ and is denoted as $\text{Col}(A)$ or $\text{Range}(A)$. +We denote the matrix $\mathbf{A}$ as an $m \times n$ matrix, where $m$ is the number of rows and $n$ is the number of columns. -2. **Null Space (Kernel) of A**: The null space of a matrix $A$ is the set of all vectors $\mathbf{x}$ such that $A\mathbf{x} = \mathbf{0}$. 
It represents the solutions to the homogeneous equation associated with $A$ and is denoted as $\text{Null}(A)$ or $\text{Ker}(A)$. +The four fundamental subspaces are: -3. **Row Space of A**: The row space of a matrix $A$ is the set of all possible linear combinations of its rows. It is equivalent to the column space of its transpose, $A^\top$, and is denoted as $\text{Row}(A)$ or $\text{Col}(A^\top)$. +## 1. **Column Space (Range) of $\mathbf{A}$**: +The column space of a matrix $\mathbf{A}$ is the set of all possible linear combinations of its columns. It represents the span of the columns of $\mathbf{A}$ and is denoted as $\text{Col}(\mathbf{A})$ or $\text{Range}(\mathbf{A})$. -4. **Left Null Space (Kernel) of A**: The left null space of a matrix $A$ is the set of all vectors $\mathbf{y}$ such that $A^\top\mathbf{y} = \mathbf{0}$. It represents the solutions to the homogeneous equation associated with $A^\top$ and is denoted as $\text{Null}(A^\top)$ or $\text{Ker}(A^\top)$. +:::{prf:lemma} Projection onto the Column Space +:label: trm-projection-column-space +:nonumber: + +The projection of a vector $\mathbf{b}$ onto the column space of a matrix $\mathbf{A}$ is given by: + +$$ +\mathbf{P}_{\text{Col}(\mathbf{A})}(\mathbf{b}) = \mathbf{A}\mathbf{A}^+ \mathbf{b} +$$ +::: + +## 2. **Null Space (Kernel) of $\mathbf{A}$**: +The null space of a matrix $\mathbf{A}$ is the set of all vectors $\mathbf{x}$ such that $\mathbf{A}\mathbf{x} = \mathbf{0}$. It represents the solutions to the homogeneous equation associated with $\mathbf{A}$ and is denoted as $\text{Null}(\mathbf{A})$ or $\text{Ker}(\mathbf{A})$. + +:::{prf:lemma} Projection onto the Null Space +:label: trm-projection-null-space +:nonumber: + +The projection of a vector $\mathbf{b}$ onto the null space of a matrix $\mathbf{A}$ is given by: + +$$ +\mathbf{P}_{\text{Null}(\mathbf{A})}(\mathbf{b}) = \left(\mathbf{I} - \mathbf{P}_{\text{Col}(\mathbf{A})}\right)(\mathbf{b}) = \mathbf{b} - \mathbf{A}\mathbf{A}^+ \mathbf{b} +$$ +::: + +## 3. **Row Space of $\mathbf{A}$**: +The row space of a matrix $\mathbf{A}$ is the set of all possible linear combinations of its rows. It is equivalent to the column space of its transpose, $\mathbf{A}^\top$, and is denoted as $\text{Row}(\mathbf{A})$ or $\text{Col}(\mathbf{A}^\top)$. + +:::{prf:lemma} Projection onto the Row Space +:label: trm-projection-row-space +:nonumber: + +The projection of a vector $\mathbf{b}$ onto the row space of a matrix $\mathbf{A}$ is given by: + +$$ +\mathbf{P}_{\text{Row}(\mathbf{A})}(\mathbf{b}) = \mathbf{A}^+\mathbf{A}\mathbf{b} +$$ +::: + + +## 4. **Left Null Space (Kernel) of $\mathbf{A}$**: +The left null space of a matrix $\mathbf{A}$ is the set of all vectors $\mathbf{y}$ such that $\mathbf{A}^\top\mathbf{y} = \mathbf{0}$. It represents the solutions to the homogeneous equation associated with $\mathbf{A}^\top$ and is denoted as $\text{Null}(\mathbf{A}^\top)$ or $\text{Ker}(\mathbf{A}^\top)$. 
+ +:::{prf:lemma} Projection onto the Left Null Space +:label: trm-projection-left-null-space +:nonumber: + +The projection of a vector $\mathbf{b}$ onto the left null space of a matrix $\mathbf{A}$ is given by: + +$$ +\mathbf{P}_{\text{Null}(\mathbf{A}^\top)}(\mathbf{b}) = \left(\mathbf{I} - \mathbf{P}_{\text{Row}(\mathbf{A})}\right)(\mathbf{b}) = \mathbf{b} - \mathbf{A}^+\mathbf{A}\mathbf{b} +$$ +::: + + +## Singular Value Decomposition and the four fundamental subspaces +The SVD provides a powerful way to understand the four fundamental +subspaces of a matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$. + +The SVD of $\mathbf{A}$ is given by: + +$$ +\mathbf{A} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^{\!\top\!} +$$ + +where $\mathbf{U} \in \mathbb{R}^{m \times m}$ and $\mathbf{V} \in \mathbb{R}^{n \times n}$ are orthogonal matrices, and $\mathbf{\Sigma} \in \mathbb{R}^{m \times n}$ is a diagonal matrix with the singular values of $\mathbf{A}$ on its diagonal. + +:::{prf:lemma} SVD and the Four Fundamental Subspaces +:label: trm-svd-four-subspaces +:nonumber: + +The SVD of a matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ can be used to identify the four fundamental subspaces as follows: +1. **Column Space**: $\text{Col}(\mathbf{A}) = \text{span}(\mathbf{U}_r)$, where $\mathbf{U}_r$ consists of the first $r$ columns of $\mathbf{U}$ corresponding to non-zero singular values. +2. **Row Space**: $\text{Row}(\mathbf{A}) = \text{span}(\mathbf{V}_r)$, where $\mathbf{V}_r$ consists of the first $r$ columns of $\mathbf{V}$ corresponding to non-zero singular values. +3. **Null Space**: $\text{Null}(\mathbf{A}) = \text{span}(\mathbf{V}_{n-r})$, where $\mathbf{V}_{n-r}$ consists of the last $n-r$ columns of $\mathbf{V}$ corresponding to zero singular values. +4. **Left Null Space**: $\text{Null}(\mathbf{A}^\top) = \text{span}(\mathbf{U}_{m-r})$, where $\mathbf{U}_{m-r}$ consists of the last $m-r$ columns of $\mathbf{U}$ corresponding to zero singular values. +::: + +## Summary +The four fundamental subspaces of a matrix $\mathbf{A}$ are essential in understanding the structure of the matrix and its properties. +The projections onto these subspaces can be computed using the Moore-Penrose pseudoinverse, which provides a powerful tool for solving linear systems and performing dimensionality reduction. +The SVD further enhances our understanding by revealing the relationships between these subspaces through the orthogonal matrices and singular values. \ No newline at end of file diff --git a/book/chapter_decompositions/orthogonal_projections.md b/book/chapter_decompositions/orthogonal_projections.md index 52de739..f2ea2ae 100644 --- a/book/chapter_decompositions/orthogonal_projections.md +++ b/book/chapter_decompositions/orthogonal_projections.md @@ -12,11 +12,11 @@ kernelspec: --- # Orthogonal projections -We now consider a particular kind of optimization problem that is -particularly well-understood and can often be solved in closed form: -given some point $\mathbf{x}$ in an inner product space $V$, find the -closest point to $\mathbf{x}$ in a subspace $S$ of $V$. This process is -referred to as **projection onto a subspace**. +We now consider a particular kind of optimization problem referred to as **projection onto a subspace**: + +Given some point $\mathbf{x}$ in an inner product space $V$, find the +closest point to $\mathbf{x}$ in a subspace $S$ of $V$. 
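Before turning to orthogonal projections in detail, here is a small numerical aside, a minimal sketch assuming only NumPy (the matrix shape, rank, and seed are arbitrary): it builds the projectors onto the four fundamental subspaces of a random low-rank matrix from its pseudoinverse, with $\mathbf{A}\mathbf{A}^+$ and $\mathbf{A}^+\mathbf{A}$ projecting onto the column and row spaces, $\mathbf{I} - \mathbf{A}^+\mathbf{A}$ projecting onto $\text{Null}(\mathbf{A})$ in $\mathbb{R}^n$, and $\mathbf{I} - \mathbf{A}\mathbf{A}^+$ projecting onto $\text{Null}(\mathbf{A}^\top)$ in $\mathbb{R}^m$.

```{code-cell} ipython3
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 5, 4, 2
# Random rank-2 matrix so that all four subspaces are non-trivial
A = rng.normal(size=(m, r)) @ rng.normal(size=(r, n))
A_pinv = np.linalg.pinv(A)

P_col       = A @ A_pinv                # projector onto Col(A),      a subspace of R^m
P_row       = A_pinv @ A                # projector onto Row(A),      a subspace of R^n
P_null      = np.eye(n) - P_row         # projector onto Null(A),     a subspace of R^n
P_left_null = np.eye(m) - P_col         # projector onto Null(A^T),   a subspace of R^m

# Each projector is idempotent and symmetric
for P in (P_col, P_row, P_null, P_left_null):
    assert np.allclose(P @ P, P) and np.allclose(P, P.T)

# Vectors projected onto Null(A) are annihilated by A,
# vectors projected onto Null(A^T) are annihilated by A^T
x = rng.normal(size=n)
y = rng.normal(size=m)
assert np.allclose(A @ (P_null @ x), 0)
assert np.allclose(A.T @ (P_left_null @ y), 0)

# Ranks match the dimensions r, r, n - r, m - r of the subspaces
print([np.linalg.matrix_rank(P) for P in (P_col, P_row, P_null, P_left_null)])  # [2, 2, 2, 3]
```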
+ The following diagram should make it geometrically clear that, at least in Euclidean space, the solution is intimately related to orthogonality @@ -61,13 +61,13 @@ ax.text(*(x_proj + 0.2), r'$\mathbf{y}^*$', fontsize=12) ax.text(*(y + 0.2), r'$\mathbf{y}$', fontsize=12) # Draw subspace line -line_extent = np.linspace(-3, 3, 100) +line_extent = np.linspace(-10, 10, 100) s_line = np.outer(line_extent, e1) ax.plot(s_line[:, 0], s_line[:, 1], 'r-', lw=1, label=r'Subspace $S$') # Formatting -ax.set_xlim(-1, 4) -ax.set_ylim(-1, 4) +ax.set_xlim(-0.5, 3) +ax.set_ylim(-0.5, 3) ax.set_aspect('equal') ax.grid(True) ax.legend() @@ -77,21 +77,23 @@ plt.tight_layout() plt.show() ``` In this diagram, the blue vector $\mathbf{x}$ is an arbitrary point in the -inner product space $V$, the green vector $\mathbf{y}^* = P\mathbf{x}$ is +inner product space $V$, the green vector $\mathbf{y}^* = \mathbf{P}\mathbf{x}$ is the projection of $\mathbf{x}$ onto the subspace $S$, and the gray vector -$\mathbf{y}$ is an arbitrary point in $S$. The dashed lines form a right -triangle with $\mathbf{x}$, $\mathbf{y}^*$, and $\mathbf{y}$ as vertices. +$\mathbf{y}$ is an arbitrary point in $S$. + +The dashed lines form a right triangle with $\mathbf{x}$, $\mathbf{y}^*$, and $\mathbf{y}$ as vertices. The right triangle formed by these three points illustrates the relationship between the projection and orthogonality: the line segment from $\mathbf{x}$ to $\mathbf{y}^*$ is perpendicular to the subspace $S$, and the distance from $\mathbf{x}$ to $\mathbf{y}^*$ is the shortest -distance from $\mathbf{x}$ to any point in $S$. This is a direct +distance from $\mathbf{x}$ to any point in $S$. + +This is a direct consequence of the Pythagorean theorem, which states that in a right triangle, the square of the length of the hypotenuse (in this case, $\|\mathbf{x}-\mathbf{y}\|$) is equal to the sum of the squares of the lengths of the other two sides (in this case, $\|\mathbf{x}-\mathbf{y}^*\|$ and $\|\mathbf{y}^*-\mathbf{y}\|$). - Here $\mathbf{y}$ is an arbitrary element of the subspace $S$, and $\mathbf{y}^*$ is the point in $S$ such that $\mathbf{x}-\mathbf{y}^*$ is perpendicular to $S$. The hypotenuse of a right triangle (in this @@ -105,31 +107,38 @@ Our intuition from Euclidean space suggests that the closest point to $\mathbf{x}$ in $S$ has the perpendicularity property described above, and we now show that this is indeed the case. -:::{prf:proposition} +:::{prf:proposition} Ortogonal projection and unique minimizer :label: prop-unique-minimizer :nonumber: -Suppose $\mathbf{x} \in V$ and $\mathbf{y} \in S$. Then $\mathbf{y}^*$ +Let $S$ be a subspace of an inner product space $V$ and let $\mathbf{x} \in V$ and $\mathbf{y} \in S$. + +Then $\mathbf{y}^*$ is the unique minimizer of $\|\mathbf{x}-\mathbf{y}\|$ over $\mathbf{y} \in S$ if and only if $\mathbf{x}-\mathbf{y}^* \perp S$. ::: - :::{prf:proof} + $(\implies)$ Suppose $\mathbf{y}^*$ is the unique minimizer of -$\|\mathbf{x}-\mathbf{y}\|$ over $\mathbf{y} \in S$. That is, +$\|\mathbf{x}-\mathbf{y}\|$ over $\mathbf{y} \in S$. + +That is, $\|\mathbf{x}-\mathbf{y}^*\| \leq \|\mathbf{x}-\mathbf{y}\|$ for all $\mathbf{y} \in S$, with equality only if $\mathbf{y} = \mathbf{y}^*$. 
+ Fix $\mathbf{v} \in S$ and observe that $$\begin{aligned} -g(t) &:= \|\mathbf{x}-\mathbf{y}^*+t\mathbf{v}\|^2 \\ +g(t) :&= \|\mathbf{x}-\mathbf{y}^*+t\mathbf{v}\|^2 \\ &= \langle \mathbf{x}-\mathbf{y}^*+t\mathbf{v}, \mathbf{x}-\mathbf{y}^*+t\mathbf{v} \rangle \\ &= \langle \mathbf{x}-\mathbf{y}^*, \mathbf{x}-\mathbf{y}^* \rangle - 2t\langle \mathbf{x}-\mathbf{y}^*, \mathbf{v} \rangle + t^2\langle \mathbf{v}, \mathbf{v} \rangle \\ &= \|\mathbf{x}-\mathbf{y}^*\|^2 - 2t\langle \mathbf{x}-\mathbf{y}^*, \mathbf{v} \rangle + t^2\|\mathbf{v}\|^2 \end{aligned}$$ must have a minimum at $t = 0$ as a consequence of this -assumption. Thus +assumption. + +Thus $$0 = g'(0) = \left.-2\langle \mathbf{x}-\mathbf{y}^*, \mathbf{v} \rangle + 2t\|\mathbf{v}\|^2\right|_{t=0} = -2\langle \mathbf{x}-\mathbf{y}^*, \mathbf{v} \rangle$$ @@ -137,16 +146,22 @@ giving $\mathbf{x}-\mathbf{y}^* \perp \mathbf{v}$. Since $\mathbf{v}$ was arbitrary in $S$, we have $\mathbf{x}-\mathbf{y}^* \perp S$ as claimed. -$(\impliedby)$ Suppose $\mathbf{x}-\mathbf{y}^* \perp S$. Observe that +$(\impliedby)$ Suppose $\mathbf{x}-\mathbf{y}^* \perp S$. + +Observe that for any $\mathbf{y} \in S$, $\mathbf{y}^*-\mathbf{y} \in S$ because -$\mathbf{y}^* \in S$ and $S$ is closed under subtraction. Under the +$\mathbf{y}^* \in S$ and $S$ is closed under subtraction. + +Under the hypothesis, $\mathbf{x}-\mathbf{y}^* \perp \mathbf{y}^*-\mathbf{y}$, so by the Pythagorean theorem, $$\|\mathbf{x}-\mathbf{y}\| = \|\mathbf{x}-\mathbf{y}^*+\mathbf{y}^*-\mathbf{y}\| = \|\mathbf{x}-\mathbf{y}^*\| + \|\mathbf{y}^*-\mathbf{y}\| \geq \|\mathbf{x} - \mathbf{y}^*\|$$ and in fact the inequality is strict when $\mathbf{y} \neq \mathbf{y}^*$ -since this implies $\|\mathbf{y}^*-\mathbf{y}\| > 0$. Thus +since this implies $\|\mathbf{y}^*-\mathbf{y}\| > 0$. + +Thus $\mathbf{y}^*$ is the unique minimizer of $\|\mathbf{x}-\mathbf{y}\|$ over $\mathbf{y} \in S$. ◻ ::: @@ -154,23 +169,27 @@ over $\mathbf{y} \in S$. ◻ Since a unique minimizer in $S$ can be found for any $\mathbf{x} \in V$, we can define an operator -$$P\mathbf{x} = \operatorname{argmin}_{\mathbf{y} \in S} \|\mathbf{x}-\mathbf{y}\|$$ +$$\mathbf{P}\mathbf{x} = \operatorname{argmin}_{\mathbf{y} \in S} \|\mathbf{x}-\mathbf{y}\|$$ -Observe that $P\mathbf{y} = \mathbf{y}$ for any $\mathbf{y} \in S$, +Observe that $\mathbf{P}\mathbf{y} = \mathbf{y}$ for any $\mathbf{y} \in S$, since $\mathbf{y}$ has distance zero from itself and every other point -in $S$ has positive distance from $\mathbf{y}$. Thus -$P(P\mathbf{x}) = P\mathbf{x}$ for any $\mathbf{x}$ (i.e., $P^2 = P$) -because $P\mathbf{x} \in S$. The identity $P^2 = P$ is actually one of +in $S$ has positive distance from $\mathbf{y}$. + +Thus +$\mathbf{\mathbf{P}}(\mathbf{\mathbf{P}}\mathbf{x}) = \mathbf{P}\mathbf{x}$ for any $\mathbf{x}$ (i.e., $\mathbf{P}^2 = \mathbf{P}$) +because $\mathbf{P}\mathbf{x} \in S$. + +The identity $\mathbf{P}^2 = \mathbf{P}$ is actually one of the defining properties of a **projection**, the other being linearity. An immediate consequence of the previous result is that -$\mathbf{x} - P\mathbf{x} \perp S$ for any $\mathbf{x} \in V$, and -conversely that $P$ is the unique operator that satisfies this property -for all $\mathbf{x} \in V$. For this reason, $P$ is known as an +$\mathbf{x} - \mathbf{P}\mathbf{x} \perp S$ for any $\mathbf{x} \in V$, and +conversely that $\mathbf{P}$ is the unique operator that satisfies this property +for all $\mathbf{x} \in V$. For this reason, $\mathbf{P}$ is known as an **orthogonal projection**. 
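The minimization property just proved is easy to probe numerically. The cell below is a sketch (it assumes NumPy; the ambient dimension, subspace dimension, and seed are arbitrary, and the subspace is represented by an orthonormal basis obtained from a QR factorization, anticipating the basis formula derived next): the residual $\mathbf{x} - \mathbf{P}\mathbf{x}$ is orthogonal to the basis of $S$, and no randomly drawn element of $S$ comes closer to $\mathbf{x}$ than $\mathbf{P}\mathbf{x}$ does.

```{code-cell} ipython3
import numpy as np

rng = np.random.default_rng(1)
n, m = 6, 2                                     # ambient dimension and subspace dimension
E, _ = np.linalg.qr(rng.normal(size=(n, m)))    # orthonormal basis of S (columns of E)
x = rng.normal(size=n)

P_x = E @ (E.T @ x)                             # projection of x onto S

# Residual is orthogonal to every basis vector of S
assert np.allclose(E.T @ (x - P_x), 0)

# P_x is at least as close to x as any other point of S we try
for _ in range(1000):
    y = E @ rng.normal(size=m)                  # random element of S
    assert np.linalg.norm(x - P_x) <= np.linalg.norm(x - y) + 1e-12

print("distance from x to its projection:", np.linalg.norm(x - P_x))
```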
If we choose an orthonormal basis for the target subspace $S$, it is -possible to write down a more specific expression for $P$. +possible to write down a more specific expression for $\mathbf{P}$. :::{prf:proposition} :label: prop-orthonormal-basis-projection @@ -179,13 +198,15 @@ possible to write down a more specific expression for $P$. If $\mathbf{e}_1, \dots, \mathbf{e}_m$ is an orthonormal basis for $S$, then -$$P\mathbf{x} = \sum_{i=1}^m \langle \mathbf{x}, \mathbf{e}_i \rangle\mathbf{e}_i$$ +$$\mathbf{P}\mathbf{x} = \sum_{i=1}^m \langle \mathbf{x}, \mathbf{e}_i \rangle\mathbf{e}_i$$ ::: :::{prf:proof} Let $\mathbf{e}_1, \dots, \mathbf{e}_m$ be an orthonormal basis -for $S$, and suppose $\mathbf{x} \in V$. Then for all $j = 1, \dots, m$, +for $S$, and suppose $\mathbf{x} \in V$. + +Then for all $j = 1, \dots, m$, $$\begin{aligned} \left\langle \mathbf{x}-\sum_{i=1}^m \langle \mathbf{x}, \mathbf{e}_i \rangle\mathbf{e}_i, \mathbf{e}_j \right\rangle &= \langle \mathbf{x}, \mathbf{e}_j \rangle - \sum_{i=1}^m \langle \mathbf{x}, \mathbf{e}_i \rangle\underbrace{\langle \mathbf{e}_i, \mathbf{e}_j \rangle}_{\delta_{ij}} \\ @@ -194,13 +215,194 @@ $$\begin{aligned} \end{aligned}$$ We have shown that the claimed expression, call it -$\tilde{P}\mathbf{x}$, satisfies -$\mathbf{x} - \tilde{P}\mathbf{x} \perp \mathbf{e}_j$ for every element -$\mathbf{e}_j$ of the orthonormal basis for $S$. It follows (by +$\tilde{\mathbf{P}}\mathbf{x}$, satisfies +$\mathbf{x} - \tilde{\mathbf{P}}\mathbf{x} \perp \mathbf{e}_j$ for every element +$\mathbf{e}_j$ of the orthonormal basis for $S$. + +It follows (by linearity of the inner product) that -$\mathbf{x} - \tilde{P}\mathbf{x} \perp S$, so the previous result -implies $P = \tilde{P}$. ◻ +$\mathbf{x} - \tilde{\mathbf{P}}\mathbf{x} \perp S$. + +So the previous result +implies $\mathbf{P} = \tilde{\mathbf{P}}$. ◻ +::: + +The fact that $\mathbf{P}$ is a linear operator (and thus a proper projection, as +earlier we showed $\mathbf{P}^2 = \mathbf{P}$) follows readily from this result. + + +### **Matrix Representation of Projection Operators** + +Given a subspace $S \subset \mathbb{R}^n$, the **orthogonal projection** of a vector $\mathbf{x} \in \mathbb{R}^n$ onto $S$ is the unique vector $\mathbf{P}\mathbf{x} \in S$ such that: + +* $\mathbf{x} - \mathbf{P}\mathbf{x} \perp S$ (residual is orthogonal) +* $\mathbf{P}\mathbf{x} \in S$ (lies in the subspace) +* $\|\mathbf{x} - \mathbf{P}\mathbf{x}\|$ is minimized + +This leads us to define the projection operator $\mathbf{P} \in \mathbb{R}^{n \times n}$ as a **linear map** satisfying key properties — two of which are: + +* **idempotence** $(\mathbf{P}^2 = \mathbf{P})$ +* **symmetry** $(\mathbf{P}^\top = \mathbf{P})$ + +Let's now examine *why* they are essential. + +### Idempotence $\mathbf{P}^2 = \mathbf{P}$ is Required + +Idempotence ensures that once you've projected a vector onto the subspace, projecting it again **does nothing**: + +$$ +\mathbf{P}(\mathbf{P}\mathbf{x}) = \mathbf{P}\mathbf{x} +$$ + +### **Why it's required:** + +* Geometrically: The image $\mathbf{P}\mathbf{x}$ lies in the subspace. If projecting it again changed it, that would mean the subspace is not invariant under the projection — contradicting the notion of projection. +* Algebraically: If $\mathbf{P}^2 \neq \mathbf{P}$, then $\mathbf{P}$ is not consistent — it cannot define a *fixed* mapping to the subspace. 
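The following cell sketches the distinction that the next subsection makes precise (assuming NumPy; the line $S$ in $\mathbb{R}^3$ and the skew direction are arbitrary illustrative choices): both matrices below are idempotent and map everything onto the same line, but only the symmetric one returns the closest point.

```{code-cell} ipython3
import numpy as np

rng = np.random.default_rng(2)
E, _ = np.linalg.qr(rng.normal(size=(3, 1)))    # orthonormal basis of a line S in R^3
P_orth = E @ E.T                                # orthogonal projector onto S

# An idempotent but *non-symmetric* matrix with the same range S:
# it maps every vector onto span(E), but along a skewed direction.
w = rng.normal(size=3)
w = w / (w @ E[:, 0])                           # normalize so that P_obl maps E onto itself
P_obl = E[:, [0]] @ w[None, :]

for P in (P_orth, P_obl):
    assert np.allclose(P @ P, P)                # both are idempotent
assert not np.allclose(P_obl, P_obl.T)          # but only P_orth is symmetric

# The orthogonal projector yields the smaller residual for a generic x
x = rng.normal(size=3)
print("orthogonal residual:", np.linalg.norm(x - P_orth @ x))
print("oblique residual:   ", np.linalg.norm(x - P_obl @ x))
```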
+ + +## Why Symmetry $\mathbf{P}^\top = \mathbf{P}$ is Required + +Symmetry ensures that the projection is **orthogonal**: the difference between $\mathbf{x}$ and its projection is orthogonal to the subspace: + +$$ +\langle \mathbf{x} - \mathbf{P}\mathbf{x}, \mathbf{P}\mathbf{x} \rangle = 0 +\quad \Leftrightarrow \quad +\mathbf{P}^\top = \mathbf{P} +$$ + +### **Why it's required:** + +* Without symmetry, $\mathbf{P}$ could project onto the subspace in a skewed or oblique manner — not orthogonally. +* Orthogonal projections are characterized by **minimal distance**, and this only occurs when the residual is orthogonal to the subspace. +* If $\mathbf{P} \neq \mathbf{P}^\top$, the projection may preserve direction, but **not minimize distance**. + +### **Geometric Consequence**: + +A non-symmetric idempotent matrix defines an **oblique projection**, which is still a projection but not orthogonal. It does not minimize distance to the subspace. + + +### Summary Table + +| Property | Meaning | Why Required | +| ------------ | ------------------------ | ---------------------------------------------------- | +| $\mathbf{P}^2 = \mathbf{P}$ | Idempotence / Stability | Ensures projecting twice gives same result | +| $\mathbf{P}^\top = \mathbf{P}$ | Symmetry / Orthogonality | Ensures projection is shortest-distance (orthogonal) | + +--- + + +### Basis Representation of Orthogonal Projection Matrices +Orthogonal projections can be expressed using matrices when the subspace is defined by a basis: + +If $S = \operatorname{span}(\mathbf{e}_1, \dots, \mathbf{e}_m)$, where the $\mathbf{e}_i$ are **orthonormal**, then the projection matrix is: + +$$ +\mathbf{P} = \sum_{i=1}^m \mathbf{e}_i \mathbf{e}_i^\top +$$ + +In matrix form, if $ E \in \mathbb{R}^{n \times m} $ has columns $\mathbf{e}_i$, then + +$$ +\mathbf{P} = EE^\top \quad \text{and} \quad \mathbf{P}\mathbf{x} = EE^\top \mathbf{x} +$$ + +:::{prf:theorem} Basis Representation of the Orthogonal Projection Matrix +:label: thm-orthogonal-projection-matrix +:nonumber: + +Let $\mathbf{e}_1, \dots, \mathbf{e}_m \in \mathbb{R}^n$ be orthonormal vectors, and define the matrix: + +$$ +E = [\mathbf{e}_1 \,\, \mathbf{e}_2 \,\, \cdots \,\, \mathbf{e}_m] \in \mathbb{R}^{n \times m} +$$ + +Then the matrix: + +$$ +\mathbf{P} = EE^\top \in \mathbb{R}^{n \times n} +$$ + +is the **orthogonal projection** onto the subspace $S = \operatorname{Col}(E) = \operatorname{span}(\mathbf{e}_1, \dots, \mathbf{e}_m)$. + +That is, for any $\mathbf{x} \in \mathbb{R}^n$, $\mathbf{P}\mathbf{x} \in S$, and $\mathbf{x} - \mathbf{P}\mathbf{x} \perp S$. ::: -The fact that $P$ is a linear operator (and thus a proper projection, as -earlier we showed $P^2 = P$) follows readily from this result. +:::{prf:proof} + +Let’s verify the three key properties of orthogonal projections. + +--- + +### 1. **$\mathbf{P}$ is symmetric:** + +$$ +\mathbf{P}^\top = (EE^\top)^\top = EE^\top = \mathbf{P} \quad \text{✓} +$$ + +--- + +### 2. **$\mathbf{P}$ is idempotent:** + +$$ +\mathbf{P}^2 = (EE^\top)(EE^\top) = E(E^\top E)E^\top +$$ + +But since $\{\mathbf{e}_i\}$ are orthonormal, we have: + +$$ +E^\top E = I_m \Rightarrow \mathbf{P}^2 = E I E^\top = EE^\top = \mathbf{P} \quad \text{✓} +$$ + +--- + +### 3. **$\mathbf{P}\mathbf{x} \in S$ and $\mathbf{x} - \mathbf{P}\mathbf{x} \perp S$:** + +Let $\mathbf{x} \in \mathbb{R}^n$. Then: + +$$ +\mathbf{P}\mathbf{x} = EE^\top \mathbf{x} \in \operatorname{Col}(E) = S +$$ + +Let $\mathbf{v} \in S$, so $\mathbf{v} = E\mathbf{a}$ for some $\mathbf{a} \in \mathbb{R}^m$. 
Then: + +$$ +\langle \mathbf{x} - \mathbf{P}\mathbf{x}, \mathbf{v} \rangle += \langle \mathbf{x} - EE^\top \mathbf{x}, E\mathbf{a} \rangle += \langle \mathbf{x}, E\mathbf{a} \rangle - \langle EE^\top \mathbf{x}, E\mathbf{a} \rangle +$$ + +Use $\langle \mathbf{x}, E\mathbf{a} \rangle = \langle E^\top \mathbf{x}, \mathbf{a} \rangle$, and similarly for the second term: + +$$ += \langle E^\top \mathbf{x}, \mathbf{a} \rangle - \langle E^\top EE^\top \mathbf{x}, \mathbf{a} \rangle += \langle E^\top \mathbf{x}, \mathbf{a} \rangle - \langle E^\top \mathbf{x}, \mathbf{a} \rangle = 0 +$$ + +So: + +$$ +\mathbf{x} - \mathbf{P}\mathbf{x} \perp S \quad \text{✓} +$$ + +We conclude that $\mathbf{P} = EE^\top = \sum_{i=1}^m \mathbf{e}_i \mathbf{e}_i^\top$ is indeed the orthogonal projection onto the subspace spanned by $\{\mathbf{e}_1, \dots, \mathbf{e}_m\}$. + +::: + + +### 🎓 **Application Example: Least Squares Regression** +In least squares regression, we want to find the best-fitting line (or hyperplane) through a set of points. + +This can be framed as an orthogonal projection problem: + +Given a design matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$ and target vector $\mathbf{y} \in \mathbb{R}^n$, the goal is to find coefficients $\boldsymbol{\beta} \in \mathbb{R}^d$ such that: +$$ +\hat{\boldsymbol{\beta}} = \operatorname{argmin}_{\boldsymbol{\beta}} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 +$$ + +This is equivalent to projecting $\mathbf{y}$ onto the column space of $\mathbf{X}$, which can be expressed using the projection matrix: + +$$ +\mathbf{P} = \mathbf{X}(\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top = \mathbf{X}\mathbf{X}^+ +$$ + +This projection minimizes the distance between $\mathbf{y}$ and the subspace spanned by the columns of $\mathbf{X}$, yielding the least squares solution. diff --git a/book/chapter_decompositions/pseudoinverse.md b/book/chapter_decompositions/pseudoinverse.md index fb20d77..17bde17 100644 --- a/book/chapter_decompositions/pseudoinverse.md +++ b/book/chapter_decompositions/pseudoinverse.md @@ -11,27 +11,207 @@ kernelspec: name: python3 --- # Moore-Penrose Pseudoinverse -The Moore-Penrose pseudoinverse is a generalization of the matrix inverse that can be applied to non-square or singular matrices. It is denoted as $ \mathbf{A}^+ $ for a matrix $ \mathbf{A} $. The pseudoinverse satisfies the following properties: -1. **Existence**: The pseudoinverse exists for any matrix $ \mathbf{A} $. -2. **Uniqueness**: The pseudoinverse is unique. -3. **Properties**: - - $ \mathbf{A} \mathbf{A}^+ \mathbf{A} = \mathbf{A} $ - - $ \mathbf{A}^+ \mathbf{A} \mathbf{A}^+ = \mathbf{A}^+ $ - - $ (\mathbf{A} \mathbf{A}^+)^\top = \mathbf{A} \mathbf{A}^+ $ - - $ (\mathbf{A}^+ \mathbf{A})^\top = \mathbf{A}^+ \mathbf{A} $ -4. **Rank**: The rank of $ \mathbf{A}^+ $ is equal to the rank of $ \mathbf{A} $. -5. **Singular Value Decomposition (SVD)**: The pseudoinverse can be computed using the singular value decomposition of $ \mathbf{A} $. If $ \mathbf{A} = \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}^\top $, where $ \mathbf{U} $ and $ \mathbf{V} $ are orthogonal matrices and $ \boldsymbol{\Sigma} $ is a diagonal matrix with singular values, then: - - $$ - \mathbf{A}^+ = \mathbf{V} \boldsymbol{\Sigma}^+ \mathbf{U}^\top - $$ - where $ \boldsymbol{\Sigma}^+ $ is obtained by taking the reciprocal of the non-zero singular values in $ \boldsymbol{\Sigma} $ and transposing the resulting matrix. -6. 
**Applications**: The pseudoinverse is used in various applications, including solving linear systems, least squares problems, and in machine learning algorithms such as linear regression. -7. **Least Squares Solution**: The pseudoinverse provides a least squares solution to the equation $ \mathbf{A}\mathbf{x} = \mathbf{b} $ when $ \mathbf{A} $ is not square or has no unique solution. The least squares solution is given by: +The Moore-Penrose pseudoinverse is a generalization of the matrix inverse that can be applied to non-square or singular matrices. It is denoted as $ \mathbf{A}^+ $ for a matrix $ \mathbf{A} $. + +The pseudoinverse satisfies the following defining properties: +- $ \mathbf{A} \mathbf{A}^+ \mathbf{A} = \mathbf{A} $ +- $ \mathbf{A}^+ \mathbf{A} \mathbf{A}^+ = \mathbf{A}^+ $ + +From these properties, we can derive the following additional properties: +- $ (\mathbf{A} \mathbf{A}^+)^\top = \mathbf{A} \mathbf{A}^+ $ +- $ (\mathbf{A}^+ \mathbf{A})^\top = \mathbf{A}^+ \mathbf{A} $ +- **Existence**: The pseudoinverse exists for any matrix $ \mathbf{A} $. +- **Uniqueness**: The pseudoinverse is unique. +- **Rank**: The rank of $ \mathbf{A}^+ $ is equal to the rank of $ \mathbf{A} $. + +**Least Squares Solution**: The pseudoinverse provides a least squares solution to the equation $ \mathbf{A}\mathbf{x} = \mathbf{b} $ when $ \mathbf{A} $ is not square or has no unique solution. The least squares solution is given by: $$ \mathbf{x} = \mathbf{A}^+ \mathbf{b} $$ -8. **Geometric Interpretation**: The pseudoinverse can be interpreted geometrically as the projection of a vector onto the column space of $ \mathbf{A} $. -9. **Computational Considerations**: The computation of the pseudoinverse can be done efficiently using numerical methods, such as the SVD, especially for large matrices. -10. **Limitations**: The pseudoinverse may not be suitable for all applications, especially when the matrix is ill-conditioned or has a high condition number. +**Geometric Interpretation**: The pseudoinverse can be interpreted geometrically as the projection of a vector onto the column space of $ \mathbf{A} $. +**Computational Considerations**: The computation of the pseudoinverse can be done efficiently using numerical methods, such as the SVD, especially for large matrices. +**Limitations**: The pseudoinverse may not be suitable for all applications, especially when the matrix is ill-conditioned or has a high condition number. + + +## The Pseudoinverse in Linear Regression + +In linear regression, we often encounter the problem of finding the best-fitting line (or hyperplane) through a set of data points. The Moore-Penrose pseudoinverse provides a tool for solving this problem, especially when the design matrix is not square or is singular. + +### 1. **Ordinary Least Squares (OLS) Problem** + +Given data matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$, and target $\mathbf{y} \in \mathbb{R}^{n}$, the OLS problem is: + +**Objective**: Minimize the squared error between predictions and targets: + +$$ +\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} |\mathbf{X}\boldsymbol{\beta} - \mathbf{y}|^2 +$$ + +This is a quadratic problem with a closed-form solution if $ \mathbf{X}^\top \mathbf{X} $ is invertible: + +**OLS solution**: + +$$ +\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y} +$$ + +--- + +### 2. 
**Observe: This Has the Structure of a Pseudoinverse** + +We now define: + +$$ +\mathbf{X}^+ := (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top +$$ + +We claim that $ \mathbf{X}^+ $ is the **Moore–Penrose pseudoinverse of $ \mathbf{X} $**, and it satisfies the four defining properties — **if $ \mathbf{X} $ has full column rank**: + +1. $ \mathbf{X}\mathbf{X}^+\mathbf{X} = \mathbf{X} $ +2. $ \mathbf{X}^+\mathbf{X}\mathbf{X}^+ = \mathbf{X}^+ $ +3. $ (\mathbf{X}\mathbf{X}^+)^\top = \mathbf{X}\mathbf{X}^+ $ +4. $ (\mathbf{X}^+\mathbf{X})^\top = \mathbf{X}^+\mathbf{X} $ + +--- + +### 3. **State the General Case: Unique Pseudoinverse Always Exists** + +Regardless of whether $ \mathbf{X} $ has full column rank or not, there is a **unique** matrix $ \mathbf{X}^+ \in \mathbb{R}^{d \times n} $ satisfying all four Moore–Penrose conditions. + +This is the **Moore–Penrose pseudoinverse**, and: + +$$ +\hat{\boldsymbol{\beta}} = \mathbf{X}^+ \mathbf{y} +$$ + +still gives the **minimum-norm least squares solution**, even if $\mathbf{X}$ is not full rank. + +--- + +### 4. **Numerical Example Using NumPy’s `pinv`** + +```{code-cell} ipython3 +import numpy as np +import matplotlib.pyplot as plt + +# Simulate linear data with collinearity +np.random.seed(1) +n, d = 100, 5 +X = np.random.randn(n, d) +X[:, 3] = X[:, 1] + 0.01 * np.random.randn(n) # make column 3 nearly linearly dependent on column 1 +beta_true = np.array([2.0, -1.0, 0.0, 0.5, 3.0]) +y = X @ beta_true + np.random.randn(n) * 0.5 + +# Compute OLS via pseudoinverse +X_pinv = np.linalg.pinv(X) # Moore–Penrose pseudoinverse +beta_hat = X_pinv @ y + +# Compare with np.linalg.lstsq (which uses SVD internally) +beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None) + +print("True coefficients: ", beta_true) +print("Estimated (pinv): ", beta_hat) +print("Estimated (lstsq): ", beta_lstsq) + +# Prediction +y_hat = X @ beta_hat + +# Visualization +plt.scatter(y, y_hat, alpha=0.6) +plt.plot([y.min(), y.max()], [y.min(), y.max()], 'r--', label='Ideal') +plt.xlabel('True y') +plt.ylabel(r'Predicted $\hat{y}$') +plt.title('Linear Regression via Pseudoinverse') +plt.legend() +plt.grid(True) +plt.show() +``` +--- + +The **OLS formula** is a special case of the **pseudoinverse**, valid under full column rank. +The **pseudoinverse is unique** and always provides a **least-squares solution**. + +Note that `numpy.linalg.pinv` computes the **Moore–Penrose pseudoinverse** using the **Singular Value Decomposition (SVD)**. This method is both **general** and **numerically stable**, making it well-suited for pseudoinverse computation even when the matrix is not full rank. + +## How `np.linalg.pinv` Works Internally + +Suppose you have a matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$. 
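Before walking through that computation, the four Penrose conditions listed above are easy to check numerically. The cell below is a sketch assuming NumPy and a randomly drawn $\mathbf{X}$ that has full column rank (the shape and seed are arbitrary); it compares the normal-equations formula with `np.linalg.pinv` and verifies all four conditions.

```{code-cell} ipython3
import numpy as np

rng = np.random.default_rng(3)
n, d = 50, 4
X = rng.normal(size=(n, d))                  # full column rank with probability one
X_plus = np.linalg.inv(X.T @ X) @ X.T        # normal-equations formula
X_pinv = np.linalg.pinv(X)                   # SVD-based pseudoinverse

# The two constructions agree in the full-column-rank case
assert np.allclose(X_plus, X_pinv)

# The four Moore-Penrose conditions
assert np.allclose(X @ X_plus @ X, X)
assert np.allclose(X_plus @ X @ X_plus, X_plus)
assert np.allclose((X @ X_plus).T, X @ X_plus)
assert np.allclose((X_plus @ X).T, X_plus @ X)

print("normal-equations formula satisfies all four Penrose conditions")
```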
+ +The pseudoinverse $\mathbf{X}^+ \in \mathbb{R}^{d \times n}$ is computed as follows: + +### **Step 1: Perform reduced SVD** + +```python +U, S, Vt = np.linalg.svd(X, full_matrices=False) +``` + +This gives: + +$$ +\mathbf{X} = \mathbf{U}_r \boldsymbol{\Sigma}_r \mathbf{V}_r^\top +$$ + +Where: + +* $\mathbf{U}_r \in \mathbb{R}^{n \times r}$, with orthonormal columns +* $\boldsymbol{\Sigma}_r \in \mathbb{R}^{r \times r}$, diagonal matrix with singular values $\sigma_1, \dots, \sigma_r$ +* $\mathbf{V}_r \in \mathbb{R}^{d \times r}$, with orthonormal columns +* $r = \text{rank}(\mathbf{X})$ + +--- + +### **Step 2: Invert the Non-Zero Singular Values** + +You construct the diagonal matrix $\boldsymbol{\Sigma}_r^+ \in \mathbb{R}^{r \times r}$ as: + +$$ +\Sigma^+_{ii} = \begin{cases} +1/\sigma_i & \text{if } \sigma_i > \text{rcond} \cdot \sigma_{\max} \\ +0 & \text{otherwise} +\end{cases} +$$ + +This step **thresholds small singular values** using the `rcond` parameter (default: machine epsilon). + +--- + +### **Step 3: Recompose the Pseudoinverse** + +The pseudoinverse is then: + +$$ +\mathbf{X}^+ = \mathbf{V}_r \boldsymbol{\Sigma}_r^+ \mathbf{U}_r^\top +$$ + +In code: + +```{code-cell} ipython3 +def pinv_manual(X, rcond=1e-15): + U, S, Vt = np.linalg.svd(X, full_matrices=False) + S_inv = np.array([1/s if s > rcond * S[0] else 0 for s in S]) + return Vt.T @ np.diag(S_inv) @ U.T +``` + +--- + +### ✅ Advantages of SVD-Based Pseudoinverse + +* **Numerically stable**: even if $\mathbf{X}^\top \mathbf{X}$ is ill-conditioned +* **General**: works for rank-deficient or rectangular matrices +* **Gives minimum-norm solution** to $\mathbf{X}\boldsymbol{\beta} = \mathbf{y}$ + +--- + +### 🧪 Check with NumPy + +You can verify this approach: + +```{code-cell} ipython3 +X = np.random.randn(5, 3) +X_pinv_np = np.linalg.pinv(X) +X_pinv_manual = pinv_manual(X) + +np.allclose(X_pinv_np, X_pinv_manual) # Should be True +``` + diff --git a/book/chapter_decompositions/svd.md b/book/chapter_decompositions/svd.md index adaa756..424ffa9 100644 --- a/book/chapter_decompositions/svd.md +++ b/book/chapter_decompositions/svd.md @@ -2,10 +2,11 @@ Singular value decomposition (SVD) is a widely applicable tool in linear algebra. Its strength stems partially from the fact that *every matrix* -$\mathbf{A} \in \mathbb{R}^{m \times n}$ has an SVD (even non-square +$\mathbf{A}$ has an SVD (even non-square matrices)! -The decomposition goes as follows: + +The decomposition of $\mathbf{A}\in \mathbb{R}^{m \times n}$ goes as follows: $$\mathbf{A} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^{\!\top\!}$$ @@ -52,15 +53,22 @@ In the following, we present a number of important identities for the SVD. ### Matrix-vector product as linear combination of matrix columns -*Proposition.* +:::{prf:proposition} Matrix-vector product as linear combination of columns +:label: prop-matrix-vector-product +:nonumber: + Let $\mathbf{x} \in \mathbb{R}^n$ be a vector and $\mathbf{A} \in \mathbb{R}^{m \times n}$ a matrix with columns -$\mathbf{a}_1, \dots, \mathbf{a}_n$. Then +$\mathbf{a}_1, \dots, \mathbf{a}_n$. + +Then $$\mathbf{A}\mathbf{x} = \sum_{i=1}^n x_i\mathbf{a}_i$$ +::: This identity is extremely useful in understanding linear operators in -terms of their matrices' columns. The proof is very simple (consider +terms of their matrices' columns. +The proof is very simple (consider each element of $\mathbf{A}\mathbf{x}$ individually and expand by definitions) but it is a good exercise to convince yourself. 
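Since the proof is left as an exercise, a short numerical illustration may help; this is only a sketch (it assumes NumPy and an arbitrary random matrix and vector), checking that $\mathbf{A}\mathbf{x}$ equals the $x_i$-weighted sum of the columns of $\mathbf{A}$.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(3, 4))
x = rng.normal(size=4)

# A @ x equals the x_i-weighted sum of the columns a_i of A
column_combination = sum(x[i] * A[:, i] for i in range(A.shape[1]))
assert np.allclose(A @ x, column_combination)
print(A @ x)
print(column_combination)
```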
@@ -104,3 +112,16 @@ equivalently the $j$th column of $\mathbf{B}^{\!\top\!}$. Hence by the definition of matrix multiplication, it is equal to $[\mathbf{A}\mathbf{B}^{\!\top\!}]_{ij}$. ◻ ::: + +## Reduced SVD +The SVD we have presented is the **full SVD**. +However, in many +applications, we are only interested in the **reduced SVD**. This is +the SVD where we only keep the first $r$ columns of $\mathbf{U}$ and +the first $r$ columns of $\mathbf{V}$, where $r$ is the rank of +$\mathbf{A}$. The reduced SVD is given by: + +$$\mathbf{A} = \mathbf{U}_r\mathbf{\Sigma}_r\mathbf{V}_r^{\!\top\!}$$ + +where $\mathbf{U}_r \in \mathbb{R}^{m \times r}$, $\mathbf{\Sigma}_r \in \mathbb{R}^{r \times r}$, and $\mathbf{V}_r \in \mathbb{R}^{n \times r}$. + From 413fd3330c059362c627ca74a4a9560796de1cb5 Mon Sep 17 00:00:00 2001 From: clippert Date: Thu, 29 May 2025 00:10:46 +0200 Subject: [PATCH 40/43] bugfix --- book/chapter_decompositions/matrix_norms.md | 2 -- 1 file changed, 2 deletions(-) diff --git a/book/chapter_decompositions/matrix_norms.md b/book/chapter_decompositions/matrix_norms.md index e2680c0..9a354f9 100644 --- a/book/chapter_decompositions/matrix_norms.md +++ b/book/chapter_decompositions/matrix_norms.md @@ -344,8 +344,6 @@ QED. - In **spectral methods**, matrix norms bound approximation error (e.g., spectral norm bounds for generalization). -Certainly! Here's a concise and precise introduction paragraph for your textbook or lecture notes: - --- ## Collaborative Filtering and Matrix Factorization From cd702a38dc8eef8d06b8d55db3011b15880e3c85 Mon Sep 17 00:00:00 2001 From: clippert Date: Fri, 30 May 2025 12:51:08 +0200 Subject: [PATCH 41/43] mm --- .../RBF_kernel_Positive_Definite.md | 436 ------------------ book/chapter_decompositions/big_picture.md | 2 +- .../orthogonal_projections.md | 6 +- 3 files changed, 4 insertions(+), 440 deletions(-) delete mode 100644 book/chapter_decompositions/RBF_kernel_Positive_Definite.md diff --git a/book/chapter_decompositions/RBF_kernel_Positive_Definite.md b/book/chapter_decompositions/RBF_kernel_Positive_Definite.md deleted file mode 100644 index 6b5e410..0000000 --- a/book/chapter_decompositions/RBF_kernel_Positive_Definite.md +++ /dev/null @@ -1,436 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.16.7 -kernelspec: - display_name: Python 3 - language: python - name: python3 ---- -# RBF Kernel Positive Definite - -In this chapter, we will state and prove Mercer's theorem, showing that a set of kernel's, so called Mercer kernels exist that represent infinite dimensional reproducing kernel Hilbert spaces. Mercer kernels always produce positive definite kernel matrices. - -## Mercer's Theorem - -Mercer’s Theorem is a cornerstone in understanding **positive-definite kernels** and their representation in **reproducing kernel Hilbert spaces (RKHS)** — foundational for kernel methods like SVMs and kernel PCA. - -Below is a careful statement and proof outline of **Mercer’s Theorem**, suitable for a course that has covered eigenvalues, symmetric matrices, and function spaces. - ---- - -## 📜 Mercer’s Theorem (Simplified Version) - -Let $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a **symmetric, continuous, positive semi-definite kernel** function on a **compact domain** $\mathcal{X} \subset \mathbb{R}^d$. 
- -> Then there exists an **orthonormal basis** $\{\phi_i\}_{i=1}^\infty$ of $L^2(\mathcal{X})$, and **non-negative eigenvalues** $\{\lambda_i\}_{i=1}^\infty$, such that: -> -> $$ -> k(x, x') = \sum_{i=1}^{\infty} \lambda_i \phi_i(x) \phi_i(x') \quad \text{with convergence in } L^2(\mathcal{X} \times \mathcal{X}) -> $$ -> -> Furthermore, the integral operator: -> -> $$ -> (Tf)(x) := \int_{\mathcal{X}} k(x, x') f(x') dx' -> $$ -> -> is **compact, self-adjoint**, and **positive semi-definite** on $L^2(\mathcal{X})$. - ---- - -## 🧠 Intuition - -Mercer’s Theorem says: - -* A symmetric, continuous, PSD kernel defines a **nice integral operator** on functions. -* That operator has a **spectral decomposition**, just like symmetric matrices do. -* The kernel function $k(x, x')$ can be written as a **sum over eigenfunctions** of this operator, just like how a Gram matrix can be decomposed as $K = \sum \lambda_i u_i u_i^\top$. - -This justifies using **feature maps** $\phi_i(x) = \sqrt{\lambda_i} \psi_i(x)$ and writing: - -$$ -k(x, x') = \langle \phi(x), \phi(x') \rangle_{\ell^2} -$$ - ---- - -## ✍️ Sketch of the Proof - -### Step 1: Define the Integral Operator - -Given a kernel $k(x, x')$, define an operator $T$ on $L^2(\mathcal{X})$ by: - -$$ -(Tf)(x) = \int_{\mathcal{X}} k(x, x') f(x') dx' -$$ - -* $T$ is **linear** -* $T$ is **self-adjoint** since $k(x, x') = k(x', x)$ -* $T$ is **compact**, due to continuity of $k$ on a compact domain - -### Step 2: Apply the Spectral Theorem for Compact Self-Adjoint Operators - -From functional analysis: - -* $T$ has an orthonormal basis of eigenfunctions $\{\phi_i\}_{i=1}^\infty$ -* Corresponding eigenvalues $\lambda_i \geq 0$ (since $T$ is PSD) - -### Step 3: Represent the Kernel - -Show that: - -$$ -k(x, x') = \sum_{i=1}^\infty \lambda_i \phi_i(x) \phi_i(x') -$$ - -This expansion converges **absolutely and uniformly** on $\mathcal{X} \times \mathcal{X}$ if $k$ is continuous. - -### Step 4: Show PSD and Feature Map Representation - -From the expansion, define the map: - -$$ -\phi(x) := \left( \sqrt{\lambda_1} \phi_1(x), \sqrt{\lambda_2} \phi_2(x), \dots \right) -$$ - -Then: - -$$ -k(x, x') = \langle \phi(x), \phi(x') \rangle_{\ell^2} -$$ - -So the kernel is **an inner product in an infinite-dimensional Hilbert space** — justifying its use in kernel methods. - ---- - -## ✅ Summary Box - - -**Mercer's Theorem (simplified)** - -Let $ k(x, x') $ be a continuous, symmetric, positive semi-definite kernel on a compact domain $ \mathcal{X} \subset \mathbb{R}^d $. - -Then there exist orthonormal functions $ \phi_i \in L^2(\mathcal{X}) $ and eigenvalues $ \lambda_i \geq 0 $ such that: - -$$ -k(x, x') = \sum_{i=1}^{\infty} \lambda_i \phi_i(x) \phi_i(x') -$$ -with convergence in $ L^2(\mathcal{X} \times \mathcal{X}) $. - -Moreover, $ k $ defines a compact, self-adjoint, PSD operator on $ L^2(\mathcal{X}) $. - - - - -## 🧠 Setup: What is the RBF kernel? - -Let $\mathbf{x}_1, \dots, \mathbf{x}_n \in \mathbb{R}^d$ be a set of data points. 
- -The **RBF kernel** (also called Gaussian kernel) is defined as: - -$$ -k(\mathbf{x}, \mathbf{x}') = \exp\left(-\gamma \|\mathbf{x} - \mathbf{x}'\|^2\right) -\quad \text{for } \gamma > 0 -$$ - -The **RBF kernel matrix** $\mathbf{K} \in \mathbb{R}^{n \times n}$ has entries: - -$$ -\mathbf{K}_{ij} = k(\mathbf{x}_i, \mathbf{x}_j) -$$ - ---- - -## ✅ Claim - -> The RBF kernel matrix $\mathbf{K}$ is **positive semi-definite** for all $\gamma > 0$, i.e., for any $\mathbf{c} \in \mathbb{R}^n$, - -$$ -\mathbf{c}^\top \mathbf{K} \mathbf{c} \geq 0 -$$ - -Moreover, if all $\mathbf{x}_i$ are distinct, then $\mathbf{K}$ is **positive definite**. - ---- - -## ✍️ Proof (via Mercer's Theorem / Fourier Representation) - -The RBF kernel is a special case of a **positive-definite kernel** as characterized by Mercer's theorem, but here’s a more constructive argument: - -### Step 1: Express the kernel as an inner product in an infinite-dimensional feature space. - -Let’s define the feature map $\phi: \mathbb{R}^d \to \ell^2$ via: - -$$ -\phi(\mathbf{x}) = \left( \sqrt{a_k} \, \psi_k(\mathbf{x}) \right)_{k=1}^{\infty} -$$ - -such that: - -$$ -k(\mathbf{x}, \mathbf{x}') = \langle \phi(\mathbf{x}), \phi(\mathbf{x}') \rangle -$$ - -It is known (e.g., via Taylor expansion or Fourier basis) that the RBF kernel corresponds to an **inner product in an infinite-dimensional Hilbert space**, and hence: - -$$ -k(\mathbf{x}_i, \mathbf{x}_j) = \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle -\Rightarrow -\mathbf{K}_{ij} = \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle -$$ - -Then, for any $\mathbf{c} \in \mathbb{R}^n$: - -$$ -\mathbf{c}^\top \mathbf{K} \mathbf{c} -= \sum_{i,j=1}^n c_i c_j \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle -= \left\| \sum_{i=1}^n c_i \phi(\mathbf{x}_i) \right\|^2 \geq 0 -$$ - -✅ Hence, $\mathbf{K}$ is **positive semi-definite**. - ---- - -### 🚀 Positive definiteness - -If the $\mathbf{x}_i$ are **pairwise distinct**, then the feature vectors $\phi(\mathbf{x}_i)$ are **linearly independent** in the Hilbert space, and the only way for the sum to vanish is $\mathbf{c} = 0$. Hence: - -$$ -\mathbf{c}^\top \mathbf{K} \mathbf{c} > 0 \quad \text{for all } \mathbf{c} \ne 0 -$$ - -✅ So $\mathbf{K}$ is **positive definite** if all data points are distinct. - ---- - -### 📦 Summary - - -**Proposition**: The RBF kernel matrix $ \mathbf{K}_{ij} = \exp(-\gamma \|\mathbf{x}_i - \mathbf{x}_j\|^2) $ is positive semi-definite for all $\gamma > 0$, and positive definite if all $\mathbf{x}_i$ are distinct. - -**Proof sketch**: The kernel function is an inner product in a Hilbert space, so the Gram matrix $ \mathbf{K} $ has the form $ \mathbf{K} = \Phi \Phi^\top $, which is always PSD. - -## **proof by induction** that the **RBF kernel matrix is positive semi-definite**, based on verifying the PSD property for matrices of increasing size. This approach is constructive, concrete, and aligns well with students familiar with induction and Gram matrices. - - ---- - -## 🧩 Goal - -Let $K \in \mathbb{R}^{n \times n}$, with entries: - -$$ -K_{ij} = \exp(-\gamma \|\mathbf{x}_i - \mathbf{x}_j\|^2) -$$ - -We aim to prove: - -> For any $n \in \mathbb{N}$, $K$ is **positive semi-definite**, i.e., for all $\mathbf{c} \in \mathbb{R}^n$: - -$$ -\mathbf{c}^\top K \mathbf{c} \geq 0 -$$ - ---- - -## 🧠 Strategy: Induction on Matrix Size $n$ - -Let’s prove it by **induction on $n$**, the number of input points $\mathbf{x}_1, \dots, \mathbf{x}_n \in \mathbb{R}^d$. 
- ---- - -### 🧱 Base Case $n = 1$ - -We have: - -$$ -K = [1] \quad \text{since } \|\mathbf{x}_1 - \mathbf{x}_1\|^2 = 0 \Rightarrow K_{11} = \exp(0) = 1 -$$ - -Then for any $c \in \mathbb{R}$: - -$$ -c^\top K c = c^2 \cdot 1 = c^2 \geq 0 -$$ - -✅ Base case holds. - ---- - -### 🔁 Inductive Hypothesis - -Assume that for some $n$, the kernel matrix $K_n \in \mathbb{R}^{n \times n}$ formed from $\mathbf{x}_1, \dots, \mathbf{x}_n$ is **positive semi-definite**. - ---- - -### 🔄 Inductive Step: $n+1$ - -We add a new point $\mathbf{x}_{n+1}$ and form the $(n+1) \times (n+1)$ matrix $K_{n+1}$: - -$$ -K_{n+1} = -\begin{bmatrix} -K_n & \mathbf{k} \\ -\mathbf{k}^\top & 1 -\end{bmatrix} -$$ - -where: - -* $K_n \in \mathbb{R}^{n \times n}$ is the existing RBF matrix (assumed PSD) -* $\mathbf{k} \in \mathbb{R}^n$, with entries $k_i = \exp(-\gamma \|\mathbf{x}_i - \mathbf{x}_{n+1}\|^2)$ -* The bottom-right entry is $k(\mathbf{x}_{n+1}, \mathbf{x}_{n+1}) = 1$ - -Let $\mathbf{c} \in \mathbb{R}^{n+1}$, split as: - -$$ -\mathbf{c} = \begin{bmatrix} \mathbf{a} \\ b \end{bmatrix}, \quad \mathbf{a} \in \mathbb{R}^n, \ b \in \mathbb{R} -$$ - -Then: - -$$ -\mathbf{c}^\top K_{n+1} \mathbf{c} = -\begin{bmatrix} \mathbf{a}^\top & b \end{bmatrix} -\begin{bmatrix} -K_n & \mathbf{k} \\ -\mathbf{k}^\top & 1 -\end{bmatrix} -\begin{bmatrix} \mathbf{a} \\ b \end{bmatrix} -= \mathbf{a}^\top K_n \mathbf{a} + 2b \mathbf{k}^\top \mathbf{a} + b^2 -$$ - -Let’s define: - -$$ -f(b) = \mathbf{a}^\top K_n \mathbf{a} + 2b \mathbf{k}^\top \mathbf{a} + b^2 -= \left( b + \mathbf{k}^\top \mathbf{a} \right)^2 + \left( \mathbf{a}^\top K_n \mathbf{a} - (\mathbf{k}^\top \mathbf{a})^2 \right) -$$ - -Note: - -* The first term $\left(b + \mathbf{k}^\top \mathbf{a} \right)^2 \geq 0$ -* By the **Cauchy-Schwarz inequality**, if $K_n$ is a Gram matrix (as is the case here), then: - - $$ - (\mathbf{k}^\top \mathbf{a})^2 \leq \mathbf{a}^\top K_n \mathbf{a} - \Rightarrow \mathbf{a}^\top K_n \mathbf{a} - (\mathbf{k}^\top \mathbf{a})^2 \geq 0 - $$ - -✅ Therefore, $f(b) \geq 0$ for all $\mathbf{a}, b$, i.e., $K_{n+1}$ is PSD. - ---- - -### ✅ Conclusion - -By induction, all RBF kernel matrices $K_n \in \mathbb{R}^{n \times n}$ are **positive semi-definite** for all $n$. - ---- - -### 📦 Summary - -**Theorem**: RBF kernel matrices are positive semi-definite for all n and all γ > 0. - -**Proof**: By induction on the number of data points n, using the structure of the kernel matrix -and properties of quadratic forms and Cauchy-Schwarz inequality. - -## EXpanding the Cauchy-Schwarz step in the proof -### 🎯 The Step in Question - -In the inductive proof of PSD for the RBF kernel matrix, we reached this expression for any vector $\mathbf{c} = \begin{bmatrix} \mathbf{a} \\ b \end{bmatrix} \in \mathbb{R}^{n+1}$: - -$$ -\mathbf{c}^\top K_{n+1} \mathbf{c} = \left( b + \mathbf{k}^\top \mathbf{a} \right)^2 + \left( \mathbf{a}^\top K_n \mathbf{a} - (\mathbf{k}^\top \mathbf{a})^2 \right) -$$ - -We want to argue that: - -$$ -\mathbf{a}^\top K_n \mathbf{a} - (\mathbf{k}^\top \mathbf{a})^2 \geq 0 -$$ - -This is the **Cauchy-Schwarz step** — and here’s what it means. 
- ---- - -## 🧠 Setting - -* $K_n \in \mathbb{R}^{n \times n}$ is an **RBF kernel matrix**: - - $$ - K_n = \left[ \exp(-\gamma \|\mathbf{x}_i - \mathbf{x}_j\|^2) \right]_{i,j=1}^n - $$ -* The vector $\mathbf{k} \in \mathbb{R}^n$ has entries $k_i = \exp(-\gamma \|\mathbf{x}_i - \mathbf{x}_{n+1}\|^2)$ - -We assume (by the inductive hypothesis) that $K_n$ is **positive semi-definite**, which means it is a **Gram matrix**: it can be written as - -$$ -K_n = \Phi \Phi^\top -$$ - -for some (possibly infinite-dimensional) feature map $\phi(\mathbf{x})$, where: - -$$ -\Phi = -\begin{bmatrix} -\phi(\mathbf{x}_1)^\top \\ -\vdots \\ -\phi(\mathbf{x}_n)^\top -\end{bmatrix} -\in \mathbb{R}^{n \times d} -$$ - -and $\phi(\mathbf{x}_i) \in \mathbb{R}^d$ or a Hilbert space. - ---- - -## ✅ Step Explained Using Inner Products - -Let’s define: - -* $\mathbf{u} = \sum_{i=1}^n a_i \phi(\mathbf{x}_i)$ -* $\mathbf{v} = \phi(\mathbf{x}_{n+1})$ - -Then: - -* $\mathbf{a}^\top K_n \mathbf{a} = \|\mathbf{u}\|^2$ -* $\mathbf{k}^\top \mathbf{a} = \langle \mathbf{u}, \mathbf{v} \rangle$, since $k_i = \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_{n+1}) \rangle$ - -Now apply the **Cauchy–Schwarz inequality** in the inner product space: - -$$ -|\langle \mathbf{u}, \mathbf{v} \rangle|^2 \leq \|\mathbf{u}\|^2 \cdot \|\mathbf{v}\|^2 -$$ - -In our case: - -* $\mathbf{a}^\top K_n \mathbf{a} = \|\mathbf{u}\|^2$ -* $(\mathbf{k}^\top \mathbf{a})^2 = |\langle \mathbf{u}, \mathbf{v} \rangle|^2$ -* $\|\mathbf{v}\|^2 = k(\mathbf{x}_{n+1}, \mathbf{x}_{n+1}) = 1$ - -So: - -$$ -(\mathbf{k}^\top \mathbf{a})^2 \leq \mathbf{a}^\top K_n \mathbf{a} -\quad \Rightarrow \quad -\mathbf{a}^\top K_n \mathbf{a} - (\mathbf{k}^\top \mathbf{a})^2 \geq 0 -$$ - -✅ This guarantees that the second term in our decomposition is **non-negative**, which is what we needed to conclude PSD. - ---- - -### 📌 Summary - -* The RBF kernel matrix $K_n$ is a **Gram matrix**: $K_n = \Phi \Phi^\top$ -* So any quadratic form $\mathbf{a}^\top K_n \mathbf{a}$ is a **squared norm**: $\|\sum a_i \phi(\mathbf{x}_i)\|^2$ -* The dot product with $\phi(\mathbf{x}_{n+1})$ is **bounded** by Cauchy-Schwarz: - - $$ - (\mathbf{k}^\top \mathbf{a})^2 = |\langle \mathbf{u}, \mathbf{v} \rangle|^2 \leq \|\mathbf{u}\|^2 - $$ - diff --git a/book/chapter_decompositions/big_picture.md b/book/chapter_decompositions/big_picture.md index f8b3f7e..fe6bed7 100644 --- a/book/chapter_decompositions/big_picture.md +++ b/book/chapter_decompositions/big_picture.md @@ -104,4 +104,4 @@ The SVD of a matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ can be used to iden ## Summary The four fundamental subspaces of a matrix $\mathbf{A}$ are essential in understanding the structure of the matrix and its properties. The projections onto these subspaces can be computed using the Moore-Penrose pseudoinverse, which provides a powerful tool for solving linear systems and performing dimensionality reduction. -The SVD further enhances our understanding by revealing the relationships between these subspaces through the orthogonal matrices and singular values. \ No newline at end of file +The SVD further enhances our understanding by revealing the relationships between these subspaces through the orthogonal matrices and singular values. 
diff --git a/book/chapter_decompositions/orthogonal_projections.md b/book/chapter_decompositions/orthogonal_projections.md index f2ea2ae..3881026 100644 --- a/book/chapter_decompositions/orthogonal_projections.md +++ b/book/chapter_decompositions/orthogonal_projections.md @@ -231,7 +231,7 @@ The fact that $\mathbf{P}$ is a linear operator (and thus a proper projection, a earlier we showed $\mathbf{P}^2 = \mathbf{P}$) follows readily from this result. -### **Matrix Representation of Projection Operators** +## **Matrix Representation of Projection Operators** Given a subspace $S \subset \mathbb{R}^n$, the **orthogonal projection** of a vector $\mathbf{x} \in \mathbb{R}^n$ onto $S$ is the unique vector $\mathbf{P}\mathbf{x} \in S$ such that: @@ -291,7 +291,7 @@ A non-symmetric idempotent matrix defines an **oblique projection**, which is st --- -### Basis Representation of Orthogonal Projection Matrices +## Basis Representation of Orthogonal Projection Matrices Orthogonal projections can be expressed using matrices when the subspace is defined by a basis: If $S = \operatorname{span}(\mathbf{e}_1, \dots, \mathbf{e}_m)$, where the $\mathbf{e}_i$ are **orthonormal**, then the projection matrix is: @@ -389,7 +389,7 @@ We conclude that $\mathbf{P} = EE^\top = \sum_{i=1}^m \mathbf{e}_i \mathbf{e}_i^ ::: -### 🎓 **Application Example: Least Squares Regression** +### **Application Example: Least Squares Regression** In least squares regression, we want to find the best-fitting line (or hyperplane) through a set of points. This can be framed as an orthogonal projection problem: From 161f1a7adb7a4e522b83d0bca072d09c373a50a7 Mon Sep 17 00:00:00 2001 From: Christoph Lippert Date: Fri, 30 May 2025 13:10:02 +0200 Subject: [PATCH 42/43] fixed small bug in svd (unkown target name 5) --- book/chapter_decompositions/svd.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/book/chapter_decompositions/svd.md b/book/chapter_decompositions/svd.md index 424ffa9..f1a4e6c 100644 --- a/book/chapter_decompositions/svd.md +++ b/book/chapter_decompositions/svd.md @@ -39,13 +39,13 @@ $\mathbf{A}^{\!\top\!}\mathbf{A}$, and the columns of $\mathbf{U}$ (the **left-singular vectors** of $\mathbf{A}$) are eigenvectors of $\mathbf{A}\mathbf{A}^{\!\top\!}$. -The matrices $\mathbf{\Sigma}^{\!\top\!}\mathbf{\Sigma}$ and +The matrices $\mathbf{\Sigma}^{\top}\mathbf{\Sigma}$ and $\mathbf{\Sigma}\mathbf{\Sigma}^{\!\top\!}$ are not necessarily the same size, but both are diagonal with the squared singular values $\sigma_i^2$ on the diagonal (plus possibly some zeros). Thus the singular values of $\mathbf{A}$ are the square roots of the eigenvalues of $\mathbf{A}^{\!\top\!}\mathbf{A}$ (or equivalently, of -$\mathbf{A}\mathbf{A}^{\!\top\!}$)[^5]. +$\mathbf{A}\mathbf{A}^{\!\top\!}$). 
## Some useful matrix identities From 79b31976164a3dc6af30a12271304ef8cd95fbbb Mon Sep 17 00:00:00 2001 From: Arman Beykmohammadi Date: Fri, 30 May 2025 18:16:24 +0200 Subject: [PATCH 43/43] Exercise 4 solutions added to the book --- book/_toc.yml | 2 + book/appendix/Exercise Sheet 4 Solutions.md | 393 ++++++++++++++++++++ 2 files changed, 395 insertions(+) create mode 100644 book/appendix/Exercise Sheet 4 Solutions.md diff --git a/book/_toc.yml b/book/_toc.yml index 1867fe4..398bf9e 100644 --- a/book/_toc.yml +++ b/book/_toc.yml @@ -172,6 +172,8 @@ parts: title: Exercise Sheet 2 Solutions - file: appendix/Exercise Sheet 3 Solutions.md title: Exercise Sheet 3 Solutions + - file: appendix/Exercise Sheet 4 Solutions.md + title: Exercise Sheet 4 Solutions # sections: # - file: appendix/proof_vector_spaces # title: Vector Spaces diff --git a/book/appendix/Exercise Sheet 4 Solutions.md b/book/appendix/Exercise Sheet 4 Solutions.md new file mode 100644 index 0000000..eb77c26 --- /dev/null +++ b/book/appendix/Exercise Sheet 4 Solutions.md @@ -0,0 +1,393 @@ +# Exercise Sheet 4 Solutions + +### 1. + +\[ +A = \begin{pmatrix} 3 & 2 \\ -2 & -1 \end{pmatrix} +\] + +**Characteristic Polynomial:** + +We compute the characteristic polynomial: + +\[ +p(\lambda) = \det(A - \lambda I) += \det \begin{pmatrix} 3 - \lambda & 2 \\ -2 & -1 - \lambda \end{pmatrix} +\] + +\[ += (3 - \lambda)(-1 - \lambda) - (-2)(2) += \lambda^2 - 2\lambda + 1 = (\lambda - 1)^2 +\] + +**Eigenvalues:** + +Solving the characteristic polynomial: + +\[ +(\lambda - 1)^2 = 0 \Rightarrow \lambda_1 = \lambda_2 = 1 +\] + +**Eigenvectors:** + +We solve: +\[ +(A - \lambda I)\vec{x} = 0 \Rightarrow +\begin{pmatrix} 2 & 2 \\ -2 & -2 \end{pmatrix} \vec{x} = \vec{0} +\] + +This reduces to: +\[ +x_1 + x_2 = 0 \Rightarrow \vec{x} = \alpha \begin{pmatrix} 1 \\ -1 \end{pmatrix} +\] + +So, the eigenvector is any scalar multiple of: + +\[ +\vec{x} = \begin{pmatrix} 1 \\ -1 \end{pmatrix} +\] + + +**Now, we solve them for B:** +\[ +B = \begin{pmatrix} +0 & 1 & 0 \\ +1 & 0 & 1 \\ +1 & 1 & 0 +\end{pmatrix} +\] + +**Characteristic Polynomial:** + +We compute the characteristic polynomial: + +\[ +p(\lambda) = \det(B - \lambda I) = +\det \begin{pmatrix} +-\lambda & 1 & 0 \\ +1 & -\lambda & 1 \\ +1 & 1 & -\lambda +\end{pmatrix} +\] + +Expanding along the first row: + +```math += -\lambda \cdot \det \begin{pmatrix} +-\lambda & 1 \\ +1 & -\lambda +\end{pmatrix} +- 1 \cdot \det \begin{pmatrix} +1 & 1 \\ +1 & -\lambda +\end{pmatrix} ++ 0 +``` + +\[ += -\lambda(\lambda^2 - 1) - (-\lambda - 1) += -\lambda^3 + \lambda + \lambda + 1 += -\lambda^3 + 2\lambda + 1 +\] + +So the characteristic polynomial is: + +\[ +p(\lambda) = -\lambda^3 + 2\lambda + 1 +\] + +**Eigenvalues:** + +Solving the characteristic polynomial: + +\[ +-\lambda^3 + 2\lambda + 1 = 0 +\] + +```math +-\left( \lambda + 1 \right)\left( \lambda - \frac{1 + \sqrt{5}}{2} \right)\left( \lambda - \frac{1 - \sqrt{5}}{2} \right) = 0 +``` + +so: + +\[ +\lambda_1 \approx -0.62, \quad +\lambda_2 = -1, \quad +\lambda_3 \approx 1.62 +\] + +**Eigenvectors:** + +To find the eigenvectors, we solve: + +\[ +(B - \lambda I)\vec{x} = 0 +\] + +**For** \( \lambda_1 \approx -0.62 \): + +\[ +\vec{v}_1 = \begin{pmatrix} +1 \\ +-0.62 \\ +-0.62 +\end{pmatrix} +\] + +**For** \( \lambda_2 = -1 \): + +\[ +\vec{v}_2 = \begin{pmatrix} +-1 \\ +1 \\ +0 +\end{pmatrix} +\] + +**For** \( \lambda_3 \approx 1.62 \): + +\[ +\vec{v}_3 = \begin{pmatrix} +1 \\ +1.62 \\ +1.62 +\end{pmatrix} +\] + +\(\blacksquare\) + +### 2. 
+Let \( A \in \mathbb{R}^{n \times m} \), \( B \in \mathbb{R}^{m \times k} \), and \( C \in \mathbb{R}^{k \times n} \), so that both products \( ABC \) and \( BCA \) are square. We start by expressing the trace of \( ABC \):
+
+```math
+\mathrm{tr}(ABC) = \sum_{i=1}^{n} (ABC)_{ii}
+```
+
+**Computing \( (ABC)_{ii} \):**
+
+We want to compute the entry in the \( i \)-th row and \( i \)-th column of the matrix product \( ABC \).
+
+1. First, compute the product \( AB \), which gives:
+
+```math
+(AB)_{il} = \sum_{j=1}^{m} A_{ij} B_{jl}
+```
+
+2. Then compute \( ABC = (AB)C \). The \( (i, i) \)-th element of \( ABC \) is:
+
+```math
+(ABC)_{ii} = \sum_{l=1}^{k} (AB)_{il} C_{li}
+```
+
+3. Substitute the expression for \( (AB)_{il} \):
+
+```math
+(ABC)_{ii} = \sum_{l=1}^{k} \left( \sum_{j=1}^{m} A_{ij} B_{jl} \right) C_{li}
+```
+
+4. Rearranging the summation order:
+
+```math
+(ABC)_{ii} = \sum_{j=1}^{m} \sum_{l=1}^{k} A_{ij} B_{jl} C_{li}
+```
+
+So, the trace becomes:
+
+```math
+\mathrm{tr}(ABC) = \sum_{i=1}^{n} \sum_{j=1}^{m} \sum_{l=1}^{k} A_{ij} B_{jl} C_{li}
+```
+
+**Computing the trace of \( BCA \):**
+
+```math
+\mathrm{tr}(BCA) = \sum_{j=1}^{m} (BCA)_{jj}
+```
+
+Expand the product:
+
+1. First compute \( BC \):
+
+```math
+(BC)_{ji} = \sum_{l=1}^{k} B_{jl} C_{li}
+```
+
+2. Then compute:
+
+```math
+(BCA)_{jj} = \sum_{i=1}^{n} (BC)_{ji} A_{ij} = \sum_{i=1}^{n} \sum_{l=1}^{k} B_{jl} C_{li} A_{ij}
+```
+
+3. So the trace is:
+
+```math
+\mathrm{tr}(BCA) = \sum_{j=1}^{m} \sum_{l=1}^{k} \sum_{i=1}^{n} B_{jl} C_{li} A_{ij}
+```
+
+**Final Step:**
+
+Since scalar multiplication is commutative and the order of finite sums can be interchanged:
+
+```math
+\sum_{i=1}^{n} \sum_{j=1}^{m} \sum_{l=1}^{k} A_{ij} B_{jl} C_{li}
+= \sum_{j=1}^{m} \sum_{l=1}^{k} \sum_{i=1}^{n} B_{jl} C_{li} A_{ij}
+```
+
+Therefore:
+
+```math
+\mathrm{tr}(ABC) = \mathrm{tr}(BCA)
+```
+
+\(\blacksquare\)
+
+### 3.
+
+**Eigenvalues are the same**
+
+Let \( p(\lambda) \) be the characteristic polynomial of \( A \). By definition:
+
+```math
+p(\lambda) = \det(A - \lambda I)
+```
+
+Now consider the transpose:
+
+```math
+\det(A^\top - \lambda I)
+= \det((A - \lambda I)^\top)
+= \det(A - \lambda I)
+= p(\lambda)
+```
+
+Here we used \( A^\top - \lambda I = (A - \lambda I)^\top \) and the fact that a square matrix and its transpose have the same determinant.
+
+So both \( A \) and \( A^\top \) have the **same characteristic polynomial**, which means they have the **same eigenvalues**, including their algebraic multiplicities.
+
+**Eigenvectors may differ**
+
+Although \( A \) and \( A^\top \) have the same eigenvalues, their eigenvectors **need not** be the same.
+
+To see this, consider a concrete example:
+
+Let
+
+```math
+A = \begin{pmatrix}
+1 & 1 \\
+0 & 1
+\end{pmatrix}
+```
+
+Then:
+
+```math
+A^\top = \begin{pmatrix}
+1 & 0 \\
+1 & 1
+\end{pmatrix}
+```
+
+The characteristic polynomial of both is:
+
+```math
+\det(A - \lambda I) = (1 - \lambda)^2
+```
+
+So they both have a **repeated eigenvalue** \( \lambda = 1 \).
+
+Now compute eigenvectors:
+
+- For \( A \), we solve:
+
+```math
+(A - I)\vec{x} = \begin{pmatrix}
+0 & 1 \\
+0 & 0
+\end{pmatrix} \vec{x} = 0
+\Rightarrow x_2 = 0 \Rightarrow \vec{x} = \begin{pmatrix}
+1 \\
+0
+\end{pmatrix}
+```
+
+- For \( A^\top \), we solve:
+
+```math
+(A^\top - I)\vec{x} = \begin{pmatrix}
+0 & 0 \\
+1 & 0
+\end{pmatrix} \vec{x} = 0
+\Rightarrow x_1 = 0 \Rightarrow \vec{x} = \begin{pmatrix}
+0 \\
+1
+\end{pmatrix}
+```
+
+Thus, the **eigenvectors are different**, even though the eigenvalues are the same.
+
+**In summary:**
+
+- \( A \) and \( A^\top \) always have the **same eigenvalues**, including multiplicities.
+- However, they may have **different eigenvectors**; this can only happen when \( A \) is **not symmetric**, since a symmetric matrix satisfies \( A = A^\top \).
+
+\(\blacksquare\)
+
+### 4.
+#### (a) The rank of \( B \) + +We observe that each row (and column) of \( B \) is a linear combination of the others: + +- The second row is: + ```math + \text{Row}_2 = 2 \cdot \text{Row}_1 + ``` +- The third row is: + ```math + \text{Row}_3 = 3 \cdot \text{Row}_1 + ``` + +So all rows lie in the span of the first row. + +Let’s do row reduction to confirm: + +```math +\begin{pmatrix} +1 & 2 & 3 \\ +2 & 4 & 6 \\ +3 & 6 & 9 +\end{pmatrix} +\rightarrow +\begin{pmatrix} +1 & 2 & 3 \\ +0 & 0 & 0 \\ +0 & 0 & 0 +\end{pmatrix} +``` + +Only one non-zero row remains after Gaussian elimination. + +Therefore, the rank of \( B \) is: + +```math +\mathrm{rank}(B) = 1 +``` + +#### (b) Are the columns of \( B \) linearly independent? + +Recall that the number of linearly independent columns is equal to the rank of the matrix. + +Since: + +```math +\mathrm{rank}(B) = 1 < 3 +``` + +This means that the columns are **linearly dependent**. + +In fact: + +```math +\text{Col}_2 = 2 \cdot \text{Col}_1 \\ +\text{Col}_3 = 3 \cdot \text{Col}_1 +``` +\(\blacksquare\)
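+
+**Numerical check:**
+
+The results in Exercises 2–4 can be cross-checked with a short `numpy` sanity check. The matrix shapes in the Exercise 2 check are arbitrary, while \( A \) and \( B \) below are the example matrices from Exercises 3 and 4.
+
+```python
+import numpy as np
+
+rng = np.random.default_rng(0)
+
+# Exercise 2: tr(ABC) = tr(BCA) for conformable shapes (sizes chosen arbitrarily)
+A2 = rng.normal(size=(4, 3))   # n x m
+B2 = rng.normal(size=(3, 5))   # m x k
+C2 = rng.normal(size=(5, 4))   # k x n
+print(np.isclose(np.trace(A2 @ B2 @ C2), np.trace(B2 @ C2 @ A2)))  # True
+
+# Exercise 3: A and A^T have the same eigenvalues (example matrix from above)
+A = np.array([[1.0, 1.0],
+              [0.0, 1.0]])
+print(np.allclose(np.sort(np.linalg.eigvals(A)), np.sort(np.linalg.eigvals(A.T))))  # True
+
+# Exercise 4: B has rank 1, so its columns are linearly dependent
+B = np.array([[1, 2, 3],
+              [2, 4, 6],
+              [3, 6, 9]])
+print(np.linalg.matrix_rank(B))  # 1
+```
+
+`np.linalg.matrix_rank` determines the rank from the singular values with a built-in tolerance, so no explicit threshold is needed for the rank check.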