diff --git a/book/_toc.yml b/book/_toc.yml index d45cc1a..b0660b6 100644 --- a/book/_toc.yml +++ b/book/_toc.yml @@ -13,7 +13,7 @@ parts: chapters: # week 1 - file: chapter_ml_basics/intro - title: Machine Learning Basics + title: Machine Learning Problems sections: - file: chapter_ml_basics/classification title: Classification @@ -66,25 +66,45 @@ parts: - file: chapter_calculus/minima_first_order_condition title: First Order Condition - file: chapter_calculus/analytical_solution_ridge - title: Ridge Regression + title: Quadratic Optimization - file: chapter_calculus/line_search title: Line Search - file: chapter_calculus/hessian title: Hessian + - file: chapter_calculus/taylors_theorem +# - file: chapter_calculus/irls +# title: Iteratively Re-Weighted Least Squares # study the properties of matrices -# - file: chapter_decompositions/overview_decompositions # chapter_linear_algebra/linear_algebra -# sections: -# - file: chapter_decompositions/eigenvectors -# - file: chapter_decompositions/trace_determinant -# - file: chapter_decompositions/orthogonal_matrices -# - file: chapter_decompositions/symmetric_matrices -# - file: chapter_decompositions/psd_matrices -# - file: chapter_decompositions/svd -# - file: chapter_decompositions/big_picture -# - file: chapter_decompositions/pseudoinverse -# - file: chapter_decompositions/low_rank_approximation -# - file: chapter_decompositions/matrix_norms + - file: chapter_decompositions/overview_decompositions + title: Matrix Analysis + sections: + - file: chapter_decompositions/matrix_rank + - file: chapter_decompositions/determinant + - file: chapter_decompositions/row_equivalence + - file: chapter_decompositions/square_matrices + - file: chapter_decompositions/trace + - file: chapter_decompositions/eigenvectors # end week 05 + - file: chapter_decompositions/orthogonal_matrices + - file: chapter_decompositions/symmetric_matrices + - file: chapter_decompositions/Rayleigh_quotients + - file: chapter_decompositions/matrix_norms + - file: chapter_decompositions/psd_matrices + - file: chapter_decompositions/pca # PCA as example for the eigenvalue decomposition of a psd matrix + title: Principal Components Analysis + - file: chapter_decompositions/svd # +# - file: chapter_decompositions/RBF_kernel_Positive_Definite + - file: chapter_decompositions/pseudoinverse + - file: chapter_decompositions/orthogonal_projections + - file: chapter_decompositions/big_picture + title: Fundamental Subspaces +# - file: chapter_decompositions/representer_theorem +# - file: chapter_convexity/overview_convexity +# title: Convexity +# sections: +# - file: chapter_convexity/convex_sets +# - file: chapter_convexity/convex_functions # continue with second order optimization +# title: Second-Order Optimization # - file: chapter_calculus/newtons_method # title: Newton's Method # - file: chapter_taylor/minima_second_order_condition @@ -93,16 +113,13 @@ parts: # - file: chapter_calculus/orthogonal_projections # - file: chapter_taylor/overview_taylor # sections: -# - file: chapter_convexity/overview_convexity -# sections: -# - file: chapter_convexity/convexity # - file: chapter_optimization/overview_optimization # sections: # - file: chapter_optimization/optimization # - file: chapter_optimization/optimization_second_order # - file: chapter_optimization/bfgs -# - file: chapter_optimization/orthogonal_projection # - file: chapter_probability/overview_probability +# title: Probability and Random Variables # sections: # - file: chapter_probability/probability_basics # - file: 
chapter_probability/random_variables @@ -149,8 +166,21 @@ parts: title: First Fundamental Theorem of Calculus - file: appendix/second_fundamental_theorem_calculus title: Second Fundamental Theorem of Calculus + - file: appendix/Clairauts_theorem + title: Clairaut's Theorem - file: appendix/differentiation_rules title: Differentiation Rules + - file: appendix/Exercise Sheet Solutions.md + title: Exercise Sheet Solutions + sections: + - file: appendix/Exercise Sheet 1 Solutions.md + title: Exercise Sheet 1 Solutions + - file: appendix/Exercise Sheet 2 Solutions.md + title: Exercise Sheet 2 Solutions + - file: appendix/Exercise Sheet 3 Solutions.md + title: Exercise Sheet 3 Solutions + - file: appendix/Exercise Sheet 4 Solutions.md + title: Exercise Sheet 4 Solutions # sections: # - file: appendix/proof_vector_spaces # title: Vector Spaces diff --git a/book/appendix/Clairauts_theorem.md b/book/appendix/Clairauts_theorem.md new file mode 100644 index 0000000..16f3872 --- /dev/null +++ b/book/appendix/Clairauts_theorem.md @@ -0,0 +1,77 @@ +# Symmetry of Mixed Partial Derivatives (Clairaut’s Theorem) + +:::{prf:theorem} Clairaut Schwarz +:label: thm-Clairaut-appendix +:nonumber: + +Let $f: \mathbb{R}^2 \to \mathbb{R}$ be a function such that both mixed partial derivatives $\frac{\partial^2 f}{\partial x \partial y}$ and $\frac{\partial^2 f}{\partial y \partial x}$ exist and are **continuous** on an open set containing a point $(x_0, y_0)$ + +Then: + +$$ +\boxed{ +\frac{\partial^2 f}{\partial x \partial y}(x_0, y_0) = \frac{\partial^2 f}{\partial y \partial x}(x_0, y_0) +} +$$ + +That is, **the order of differentiation can be interchanged**. +::: + +## Intuition + +If a function is smooth enough (specifically, if the second-order partial derivatives exist and are continuous), then the "curvature" in the $x$ direction after differentiating in the $y$ direction is the same as the curvature in the $y$ direction after differentiating in the $x$ direction. + +--- + +## Proof Sketch + +We will sketch a proof using the **mean value theorem** and the definition of partial derivatives. Let’s assume that $f$ has continuous second partial derivatives in an open rectangle around the point $(x_0, y_0)$. + +Define: + +$$ +F(h,k) = \frac{f(x_0 + h, y_0 + k) - f(x_0 + h, y_0) - f(x_0, y_0 + k) + f(x_0, y_0)}{hk} +$$ + +Then, as $h, k \to 0$, $F(h,k) \to \frac{\partial^2 f}{\partial y \partial x}(x_0, y_0)$ and also $F(h,k) \to \frac{\partial^2 f}{\partial x \partial y}(x_0, y_0)$, provided the second partial derivatives are continuous. + +### Step-by-step: + +1. By the **Mean Value Theorem**, the numerator of $F(h,k)$ can be interpreted as a finite difference approximation to a mixed partial derivative. +2. Using Taylor’s Theorem with remainder, or via integral representations of derivatives, one can show that: + + $$ + \lim_{(h,k) \to (0,0)} F(h,k) = \frac{\partial^2 f}{\partial x \partial y}(x_0, y_0) + $$ + + and also + + $$ + \lim_{(h,k) \to (0,0)} F(h,k) = \frac{\partial^2 f}{\partial y \partial x}(x_0, y_0) + $$ + + due to continuity of the second derivatives. +3. Hence, the limits agree and the mixed partials are equal. + +Therefore: + +$$ +\frac{\partial^2 f}{\partial x \partial y}(x_0, y_0) = \frac{\partial^2 f}{\partial y \partial x}(x_0, y_0) +$$ + +--- + +## When Clairaut's Theorem **Does Not Apply** + +If the second-order mixed partial derivatives exist but are **not continuous**, the symmetry may fail. 
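For contrast with the counterexample that follows, here is a small symbolic check (a sketch using `sympy`, not part of the original text) that for a smooth function the two mixed partials agree everywhere, exactly as Clairaut's theorem guarantees:

```python
import sympy as sp

x, y = sp.symbols('x y')

# a smooth function: all second-order partials exist and are continuous everywhere
f = sp.exp(x) * sp.sin(y) + x**2 * y**3

f_xy = sp.diff(f, x, y)  # differentiate first in x, then in y
f_yx = sp.diff(f, y, x)  # differentiate first in y, then in x

print(sp.simplify(f_xy - f_yx))  # 0: the mixed partials coincide
```

In the counterexample below, the mixed partials exist at the origin but are not continuous there, and the two orders of differentiation disagree at $(0, 0)$.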
A classic counterexample is: + +$$ +f(x, y) = +\begin{cases} +\frac{xy(x^2 - y^2)}{x^2 + y^2}, & \text{if } (x, y) \neq (0, 0) \\ +0, & \text{if } (x, y) = (0, 0) +\end{cases} +$$ + +This function has both mixed partial derivatives at the origin, but they are not equal because they are not continuous there. + diff --git a/book/appendix/Exercise Sheet 1 Solutions.md b/book/appendix/Exercise Sheet 1 Solutions.md new file mode 100644 index 0000000..779cd0e --- /dev/null +++ b/book/appendix/Exercise Sheet 1 Solutions.md @@ -0,0 +1,60 @@ +# Exercise Sheet 1 Solutions + + +### 1. +#### (a) +Take any \(v_1=(a,b)\) and \(v_2=(c,d)\) in \(V\); then \(b=3a+1\) and \(d=3c+1\). +Their sum is +\[ +v_1+v_2=(a+c,\;b+d)=(a+c,\;3a+1+3c+1)=\bigl(a+c,\;3(a+c)+2\bigr), +\] +which **does not** satisfy \(b+d=3(a+c)+1\). Hence \(V\) is *not* closed under addition ⇒ **not a vector space**. +(Equivalently, the additive identity \((0,0)\notin V\), violating axiom V1.) + +#### (b) +All axioms except **distributivity over scalar addition** fail: + +Take \(v=(a,b)\) and scalars \(\alpha,\beta\in\mathbb R\). +\[ +(\alpha+\beta)\,v=((\alpha+\beta)a,\;b), +\quad +\alpha v+\beta v=(\alpha a,\;b)+(\beta a,\;b)=((\alpha+\beta)a,\;2b). +\] +Unless \(b=0\), the second component differs, so +\((\alpha+\beta)v\neq\alpha v+\beta v\). +Therefore \(V\) is **not** a vector space. + + +### 2. +#### (a) +*Zero vector:* \((0,0)\) satisfies \(0=2\cdot0\). +*Closure (addition):* if \(y_1=2x_1\) and \(y_2=2x_2\), then +\[ +y_1+y_2 = 2(x_1+x_2). +\] +*Closure (scalar mult.):* for \(\alpha\in\mathbb R\), +\[ +\alpha(x,y)=(\alpha x,\;2\alpha x). +\] +All three conditions hold ⇒ \(W\) **is a subspace**. + +#### (b) +Pick \((x,y)\in W\) with \(x>0\) and any negative scalar \(\alpha<0\). +Then +\[ +\alpha(x,y)=(\alpha x,\;\alpha y), +\] +and \(\alpha x<0\). Thus \(\alpha(x,y)\notin W\). +Not closed under scalar multiplication ⇒ **not a subspace**. + + +### 3. +For \(x=(a,b)\), \(y=(c,d)\) and scalars \(\alpha,\beta\): +\[ +T(\alpha x+\beta y)=\bigl((\alpha a+\beta c)^{2},\;\alpha b+\beta d\bigr), +\] +\[ +\alpha T(x)+\beta T(y)=\bigl(\alpha^{2}a^{2}+\beta^{2}c^{2},\;\alpha b+\beta d\bigr). +\] +The first components differ unless \(a c=0\) or \(\alpha\beta=0\). +Hence \(T\) **violates additivity/homogeneity ⇒ not linear**. diff --git a/book/appendix/Exercise Sheet 2 Solutions.md b/book/appendix/Exercise Sheet 2 Solutions.md new file mode 100644 index 0000000..89cf37b --- /dev/null +++ b/book/appendix/Exercise Sheet 2 Solutions.md @@ -0,0 +1,422 @@ +# Exercise Sheet 2 Solutions + + +### 1. +#### (a) +Let +\[ +f(x, y) = +\begin{cases} +\dfrac{x \sin y}{x^2 + y^2} & \text{if } (x, y) \neq (0, 0) \\ +0 & \text{if } (x, y) = (0, 0) +\end{cases} +\] + +We are asked to examine the **continuity of \( f \) in \( \mathbb{R}^2 \)**. + +*Remark* +We say that a single-variable function \( f : \mathbb{R} \rightarrow \mathbb{R} \) is **continuous at a point** \( a \in \mathbb{R} \) if + +\[ +\lim_{x \to a} f(x) = f(a) +\] + +*Extension to Two Variables* +Similarly, for a function of two variables \( f : \mathbb{R}^2 \rightarrow \mathbb{R} \), we say that \( f \) is continuous at the point \( (a, b) \in \mathbb{R}^2 \) if + +\[ +\lim_{(x, y) \to (a, b)} f(x, y) = f(a, b) +\] + +So, to study the continuity of \( f \), we need to check whether this limit exists and equals the value of the function at that point. 
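Before the directional analysis, a quick numerical sketch (not part of the original solution) already hints at the answer: evaluating \( f \) along two different paths into the origin gives different limiting values.

```python
import numpy as np

def f(x, y):
    # f away from the origin; f(0, 0) is defined to be 0
    return x * np.sin(y) / (x**2 + y**2)

t = np.array([0.1, 0.01, 0.001, 1e-4])
print(f(t, 0 * t))  # along the x-axis (y = 0): all values are 0
print(f(t, t))      # along the line y = x: values tend to 0.5
```

The analytical argument below makes this precise.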
+ +*Strategy* +To determine the **existence of** +\[ +\lim_{(x, y) \to (0, 0)} f(x, y), +\] +we must examine whether the limit exists and is the same **along all possible directions** towards \( (0, 0) \). + +*Direction 1: Along the x-axis* +We approach \( (0, 0) \) along the **x-axis**, i.e., set \( y = 0 \). +Then: + +\[ +f(x, 0) = \frac{x \sin(0)}{x^2 + 0^2} = \frac{0}{x^2} = 0 \quad \text{for all } x \neq 0 +\] + +\[ +\Rightarrow \lim_{(x, y) \to (0, 0)} f(x, y) = \lim_{x \to 0} f(x, 0) = 0 +\] + +*Direction 2: Along the y-axis* +Let \( x = 0 \), then: + +\[ +f(0, y) = \frac{0 \cdot \sin(y)}{0 + y^2} = 0 +\] + +\[ +\Rightarrow \lim_{(x, y) \to (0, 0)} f(x, y) = \lim_{y \to 0} f(0, y) = 0 +\] + +*Direction 3: Along \( y = x \)* +Now we approach the origin along a different line, say \( y = x \): + +\[ +f(x, x) = \frac{x \sin x}{x^2 + x^2} = \frac{x \sin x}{2x^2} = \frac{\sin x}{2x} +\] + +\[ +\Rightarrow \lim_{(x, y) \to (0, 0)} f(x, y) = \lim_{x \to 0} \frac{\sin x}{2x} = \frac{1}{2} +\] + +Since this limit \(\frac{1}{2} \neq 0\), the two-dimensional limit +\[ +\lim_{(x, y) \to (0, 0)} f(x, y) +\] +**does not exist**, and hence \( f(x, y) \) is **not continuous** at the point \( (0, 0) \). + +#### (b) +*Partial derivatives of \( f \) at point \( (0, 0) \)* +If we have a function of two variables +\[ +f : \mathbb{R}^2 \rightarrow \mathbb{R}, \quad (x, y) \mapsto f(x, y) +\] +then the partial derivative of \( f \) with respect to \( x \) at \( (a, b) \) is defined as: + +\[ +f_x(a, b) = \lim_{h \to 0} \frac{f(a + h, b) - f(a, b)}{h} +\] + +and similarly, the partial derivative of \( f \) with respect to \( y \) at \( (a, b) \) is: + +\[ +f_y(a, b) = \lim_{h \to 0} \frac{f(a, b + h) - f(a, b)}{h} +\] + +*Compute partial derivatives at \( (0, 0) \)* +-Partial derivative with respect to \( x \): +\[ +f_x(0, 0) = \lim_{h \to 0} \frac{f(h, 0) - f(0, 0)}{h} += \lim_{h \to 0} \frac{0 - 0}{h} = 0 +\] + +-Partial derivative with respect to \( y \): + +\[ +f_y(0, 0) = \lim_{h \to 0} \frac{f(0, h) - f(0, 0)}{h} += \lim_{h \to 0} \frac{0 - 0}{h} = 0 +\] + +#### (c) +*At which points is \( f \) differentiable?* +To determine where the function \( f \) is differentiable, we use the following **theorem**: + +*Remark (Theorem)* +If \( f \) is a continuous function in an open set \( U \), +and has **continuous partial derivatives** at \( U \), +then \( f \) is **continuously differentiable** at all points in \( U \). + +Let \( U = \mathbb{R}^2 \setminus \{(0, 0)\} \). +The function \( f(x, y) = \dfrac{x \sin y}{x^2 + y^2} \) is continuous at all points in \( U \). + +Now we examine the partial derivatives of \( f \): + +*Compute \( \dfrac{\partial f}{\partial x} \) and \( \dfrac{\partial f}{\partial y} \)* +\[ +\frac{\partial}{\partial x} \left( \frac{x \sin y}{x^2 + y^2} \right) += \frac{(x^2 + y^2)\sin y - 2x^2 \sin y}{(x^2 + y^2)^2} += \frac{(y^2 - x^2)\sin y}{(x^2 + y^2)^2} +\] + +\[ +\frac{\partial}{\partial y} \left( \frac{x \sin y}{x^2 + y^2} \right) += \frac{x \cos y (x^2 + y^2) - 2x y \sin y}{(x^2 + y^2)^2} +\] + +These are rational functions where the **numerator and denominator** are composed of continuous functions, and the **denominator only vanishes at the origin** \( (0, 0) \). +Thus, the partial derivatives are continuous **everywhere in \( U \)**. + +*Conclusion* +So, based on the theorem, function \( f \) is **differentiable at all points except** the origin, that is, point \( (0, 0) \). + + +### 2. 
+#### (a) +Let the function \( f(z) = \exp\left(-\dfrac{1}{2} z\right) \), +where \( z = g(y) = y^\top S^{-1} y \), +and \( y = h(x) = x - u \), +with: + +- \( x, u \in \mathbb{R}^D \) +- \( S \in \mathbb{R}^{D \times D} \) + +*Chain Rule* +Based on the chain rule, we have: + +\[ +\frac{df}{dx} = \frac{df}{dz} \cdot \frac{dz}{dy} \cdot \frac{dy}{dx} +\] + +*Step 1: Note the functions and their domains* +- \( y = h(x) = x - u \) → maps \( \mathbb{R}^D \to \mathbb{R}^D \) +- \( z = g(y) = y^\top S^{-1} y \) → maps \( \mathbb{R}^D \to \mathbb{R} \) +- \( f(z) = e^{- \frac{1}{2} z} \) → maps \( \mathbb{R} \to \mathbb{R} \) + +So the full composition is: +\[ +x \mapsto y = x - u \mapsto z = y^\top S^{-1} y \mapsto f(z) = e^{- \frac{1}{2} z} +\] + +*Step 2: Compute \( \dfrac{dy}{dx} \)* +Since \( y = x - u \), the Jacobian \( \dfrac{dy}{dx} \) is: + +\[ +\frac{dy}{dx} = I_{D \times D} +\quad \text{(identity matrix)} +\] + +*Step 3: Compute \( \dfrac{dz}{dy} \)* +We have \( z = y^\top S^{-1} y \). +Using gradient rules for quadratic forms: + +\[ +\frac{d}{dy} (y^\top A y) = y^\top (A + A^\top) +\] + +Apply this: + +\[ +\frac{dz}{dy} = y^\top (S^{-1} + (S^{-1})^\top) +\quad \in \mathbb{R}^{1 \times D} +\] + +*Step 4: Compute \( \dfrac{df}{dz} \)* +\[ +f(z) = e^{- \frac{1}{2} z} +\quad \Rightarrow \quad +\frac{df}{dz} = -\frac{1}{2} e^{- \frac{1}{2} z} +\quad \in \mathbb{R} +\] + +*Final Result* +\[ +\frac{df}{dx} = -\frac{1}{2} e^{- \frac{1}{2} z} \cdot y^\top (S^{-1} + (S^{-1})^\top) +\quad \in \mathbb{R}^{1 \times D} +\] + +#### (b) +Let + +\[ +f(z) = \tanh(z), \quad z = Ax + b +\] + +where: + +- \( x \in \mathbb{R}^N \) +- \( A \in \mathbb{R}^{M \times N} \) +- \( b \in \mathbb{R}^M \) + +*Apply Chain Rule* +\[ +\frac{df}{dx} = \frac{df}{dz} \cdot \frac{dz}{dx} +\] + +*Step 1: Understand \( z = Ax + b \)* +We note: + +\[ +z = Ax + b \in \mathbb{R}^M +\Rightarrow \frac{dz}{dx} = A \in \mathbb{R}^{M \times N} +\] + +*Step 2: Compute \( \dfrac{df}{dz} \)* +We have: + +\[ +f(z) = +\begin{bmatrix} +\tanh(z_1) \\ +\tanh(z_2) \\ +\vdots \\ +\tanh(z_M) +\end{bmatrix} +\] + +So the Jacobian of \( f \) is diagonal: + +\[ +\frac{df}{dz} = +\begin{bmatrix} +\text{sech}^2(z_1) & & \\ +& \ddots & \\ +& & \text{sech}^2(z_M) +\end{bmatrix} +\in \mathbb{R}^{M \times M} +\] + +*Final Result* +```math +\frac{df}{dx} += +\operatorname{diag}\left( + \text{sech}^2(z_1),\ + \text{sech}^2(z_2),\ + \dots,\ + \text{sech}^2(z_M) +\right) +\cdot A +\quad \in \mathbb{R}^{M \times N} +``` + +### 3. +#### (a) +Let +```math +x^{(0)} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \quad \eta = 1 +``` +and perform two steps of **gradient descent**. 
+ +The update rule for gradient descent is: + +```math +x^{(i+1)} = x^{(i)} - \eta \nabla f(x^{(i)}) +``` + +So two steps of the gradient descent algorithm are: + +```math +\text{Step 1:} \quad x^{(1)} = x^{(0)} - \eta \nabla f(x^{(0)}) +``` + +```math +\text{Step 2:} \quad x^{(2)} = x^{(1)} - \eta \nabla f(x^{(1)}) +``` + +Given the gradient: + +```math +\nabla f = \begin{bmatrix} +x_1 + 2 \\ +2x_2 + 1 +\end{bmatrix} +``` + +We compute: + +```math +x^{(0)} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} +\quad \Rightarrow \quad +\nabla f(x^{(0)}) = \begin{bmatrix} 2 \\ 1 \end{bmatrix} +``` + +*Step 1:* +```math +x^{(1)} = x^{(0)} - 1 \cdot \nabla f(x^{(0)}) = +\begin{bmatrix} 0 \\ 0 \end{bmatrix} +- \begin{bmatrix} 2 \\ 1 \end{bmatrix} += \begin{bmatrix} -2 \\ -1 \end{bmatrix} +``` + +```math +\nabla f(x^{(1)}) = +\begin{bmatrix} -2 + 2 \\ -2 + 1 \end{bmatrix} += \begin{bmatrix} 0 \\ -1 \end{bmatrix} +``` + +*Step 2:* +```math +x^{(2)} = x^{(1)} - 1 \cdot \nabla f(x^{(1)}) = +\begin{bmatrix} -2 \\ -1 \end{bmatrix} +- \begin{bmatrix} 0 \\ -1 \end{bmatrix} += \begin{bmatrix} -2 \\ 0 \end{bmatrix} +``` + +#### (b) +Will the gradient descent procedure from part (b) converge to the minimizer \( x^* \)? Why or why not? How can we fix it? + +Let’s look at the values over iterations: + +```math +x^{(0)} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \quad +x^{(1)} = \begin{bmatrix} -2 \\ -1 \end{bmatrix}, \quad +x^{(2)} = \begin{bmatrix} -2 \\ 0 \end{bmatrix}, \quad +x^* = \begin{bmatrix} -2 \\ -0.5 \end{bmatrix} +``` + +And: + +```math +\nabla f(x^{(0)}) = \begin{bmatrix} 2 \\ 1 \end{bmatrix}, \quad +\nabla f(x^{(1)}) = \begin{bmatrix} 0 \\ -1 \end{bmatrix}, \quad +\nabla f(x^{(2)}) = \begin{bmatrix} 0 \\ 1 \end{bmatrix}, \quad +\nabla f(x^*) = \begin{bmatrix} 0 \\ 0 \end{bmatrix} +``` + +We observe that **gradient descent does not converge** to \( x^* \). Why? + +Because the gradients do **not decrease constantly**. +Let’s examine the **partial derivatives**: + +```math +\frac{\partial f}{\partial x_1} \big|_{x^{(0)}} = 2, \quad +\frac{\partial f}{\partial x_1} \big|_{x^{(1)}} = 0, \quad +\frac{\partial f}{\partial x_1} \big|_{x^{(2)}} = 0 +``` + +```math +\frac{\partial f}{\partial x_2} \big|_{x^{(0)}} = 1, \quad +\frac{\partial f}{\partial x_2} \big|_{x^{(1)}} = -1, \quad +\frac{\partial f}{\partial x_2} \big|_{x^{(2)}} = 1 +``` + +Since \( x^* \) is a minimum and \( \nabla f(x^*) = 0 \), we expect the GD algorithm to converge to \( x^* \) **if the partial derivatives reduce toward zero**. + +But here, GD **jumps over the minimum** due to a **too high learning rate** \( \eta = 1 \). If we **decrease** the learning rate, convergence improves. 
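The effect of the step size can also be verified numerically. Below is a small sketch (not part of the original exercise) that iterates the update rule for the objective implied by the given gradient, \( f(x) = \tfrac{1}{2}x_1^2 + 2x_1 + x_2^2 + x_2 \), once with \( \eta = 1 \) and once with \( \eta = 0.5 \):

```python
import numpy as np

def grad(x):
    # gradient from the exercise: (x1 + 2, 2*x2 + 1)
    return np.array([x[0] + 2.0, 2.0 * x[1] + 1.0])

for eta in (1.0, 0.5):
    x = np.zeros(2)
    for _ in range(20):
        x = x - eta * grad(x)
    print(f"eta = {eta}: x after 20 steps = {x}")
```

With \( \eta = 1 \) the second coordinate keeps bouncing between \( 0 \) and \( -1 \), while with \( \eta = 0.5 \) the iterates settle at \( x^* = (-2, -0.5) \), in line with the calculation below.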
+ +*Trying smaller learning rates:* +Let’s try \( \eta = 0.5 \): + +```math +x^{(0)} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \quad +\nabla f(x^{(0)}) = \begin{bmatrix} 2 \\ 1 \end{bmatrix} +``` + +*Step 1:* +```math +x^{(1)} = x^{(0)} - 0.5 \cdot \nabla f(x^{(0)}) += \begin{bmatrix} 0 \\ 0 \end{bmatrix} +- 0.5 \cdot \begin{bmatrix} 2 \\ 1 \end{bmatrix} += \begin{bmatrix} -1 \\ -0.5 \end{bmatrix} +``` + +```math +\nabla f(x^{(1)}) = \begin{bmatrix} 1 \\ 0 \end{bmatrix} +``` + +*Step 2:* +```math +x^{(2)} = x^{(1)} - 0.5 \cdot \nabla f(x^{(1)}) += \begin{bmatrix} -1 \\ -0.5 \end{bmatrix} +- 0.5 \cdot \begin{bmatrix} 1 \\ 0 \end{bmatrix} += \begin{bmatrix} -1.5 \\ -0.5 \end{bmatrix} +``` + +Now we see that the GD algorithm converges towards: + +```math +x^* = \begin{bmatrix} -2 \\ -0.5 \end{bmatrix} +``` + +with gradients: + +```math +\nabla f(x^{(0)}) = \begin{bmatrix} 2 \\ 1 \end{bmatrix}, \quad +\nabla f(x^{(1)}) = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \quad +\nabla f(x^{(2)}) = \begin{bmatrix} 0.5 \\ 0 \end{bmatrix}, \quad +\nabla f(x^*) = \begin{bmatrix} 0 \\ 0 \end{bmatrix} +``` + +✔️ So a smaller \( \eta \) leads to proper convergence! diff --git a/book/appendix/Exercise Sheet 3 Solutions.md b/book/appendix/Exercise Sheet 3 Solutions.md new file mode 100644 index 0000000..a7780a3 --- /dev/null +++ b/book/appendix/Exercise Sheet 3 Solutions.md @@ -0,0 +1,225 @@ +# Exercise Sheet 3 Solutions + +### 1. +#### (a) +Let + +\[ +f : \mathbb{R}^2 \to \mathbb{R}, \quad f(x, y) = 9x^2 - y^3 + 9xy +\] + +We are asked to compute the **Hessian matrix** at the point \( (x, y) = (3, -3) \). + + +*Step 1: Compute second-order partial derivatives* + +To compute the Hessian matrix, we first compute the first-order partial derivatives: + +\[ +\frac{\partial f}{\partial x} = 18x + 9y, \quad +\frac{\partial f}{\partial y} = -3y^2 + 9x +\] + +Then we compute the second-order partial derivatives: + +\[ +\frac{\partial^2 f}{\partial x^2} = 18, \quad +\frac{\partial^2 f}{\partial y^2} = -6y, \quad +\frac{\partial^2 f}{\partial x \partial y} = \frac{\partial^2 f}{\partial y \partial x} = 9 +\] + +At the point \( (3, -3) \), we evaluate: + +\[ +\frac{\partial^2 f}{\partial x^2}(3, -3) = 18, \quad +\frac{\partial^2 f}{\partial y^2}(3, -3) = -6(-3) = 18, \quad +\frac{\partial^2 f}{\partial x \partial y}(3, -3) = 9 +\] + +*Step 2: Form the Hessian matrix* + +```math +H_{(3, -3)} = +\begin{bmatrix} +\frac{\partial^2 f}{\partial x^2} & \frac{\partial^2 f}{\partial x \partial y} \\ +\frac{\partial^2 f}{\partial y \partial x} & \frac{\partial^2 f}{\partial y^2} +\end{bmatrix} += +\begin{bmatrix} +18 & 9 \\ +9 & 18 +\end{bmatrix} +``` + +#### (b) +We recall the following definitions and propositions: + +- **Definition**: A symmetric matrix \( A \) is **positive definite** if for all non-zero vectors \( a \), we have \( a^T A a > 0 \). +- **Proposition**: A symmetric matrix is positive definite **if and only if** all its **eigenvalues** are positive. + + +To compute the eigenvalues, solve the characteristic equation: + +\[ +\det(H - \lambda I) = 0 +\Rightarrow +\begin{vmatrix} +18 - \lambda & 9 \\ +9 & 18 - \lambda +\end{vmatrix} += (18 - \lambda)^2 - 81 = 0 +\] + +Simplifying: + +\[ +(18 - \lambda)^2 = 81 \Rightarrow 18 - \lambda = \pm 9 +\Rightarrow \lambda = 9, \ 27 +\] + + +Since both eigenvalues are **positive**, the Hessian matrix at the point \( (3, -3) \) is **positive definite**. Therefore, \( f(x, y) \) has a **local minimum** at this point. + + +### 2. 
+ +Let \( f(x) = x \cdot \ln(x) \) defined on the interval \( [1, e^2] \). + +#### (a) +To apply the Mean Value Theorem (MVT), we must verify that: + +- \( f \) is **continuous** on \( [1, e^2] \) +- \( f \) is **differentiable** on \( (1, e^2) \) + +Since \( f(x) = x \ln(x) \) is a product of continuous and differentiable functions for \( x > 0 \), both conditions are satisfied. + + +#### (b) + +We compute: + +\[ +f(e^2) = e^2 \cdot \ln(e^2) = e^2 \cdot 2 = 2e^2 +\] +\[ +f(1) = 1 \cdot \ln(1) = 0 +\] + +Hence, the average rate of change is: + +\[ +\frac{f(e^2) - f(1)}{e^2 - 1} = \frac{2e^2}{e^2 - 1} +\] + +Next, compute the derivative: + +\[ +f'(x) = \frac{d}{dx}[x \ln(x)] = \ln(x) + 1 +\] + +We solve: + +\[ +f'(c) = \ln(c) + 1 = \frac{2e^2}{e^2 - 1} +\Rightarrow \ln(c) = \frac{2e^2}{e^2 - 1} - 1 = \frac{e^2 + 1}{e^2 - 1} +\Rightarrow c = \exp\left( \frac{e^2 + 1}{e^2 - 1} \right) +\] + +#### (c) +The Mean Value Theorem states that there exists a point \( c \in (1, e^2) \) where the **instantaneous rate of change** \( f'(c) \) equals the **average rate of change** over the interval: + +\[ +f'(c) = \frac{f(e^2) - f(1)}{e^2 - 1} +\] + +Geometrically, this means the **tangent line** to the curve at \( x = c \) is **parallel** to the **secant line** connecting the endpoints \( (1, f(1)) \) and \( (e^2, f(e^2)) \). + + +### 3. +#### (a) + +We compute the derivatives at \( x = 0 \): + +- \( f(x) = \arctan(x) \) +- \( f'(x) = \frac{1}{1+x^2} \Rightarrow f'(0) = 1 \) +- \( f''(x) = \frac{-2x}{(1+x^2)^2} \Rightarrow f''(0) = 0 \) +- \( f^{(3)}(x) = \frac{6x^2 - 2}{(1 + x^2)^3} \Rightarrow f^{(3)}(0) = -2 \) + +Hence, the third-degree Taylor polynomial is: + +\[ +P_3(x) = f(0) + f'(0)x + \frac{f''(0)}{2!}x^2 + \frac{f^{(3)}(0)}{3!}x^3 += 0 + x + 0 - \frac{2}{6}x^3 = x - \frac{1}{3}x^3 +\] + +#### (b) +The Lagrange remainder is: + +\[ +R_3(x) = \frac{f^{(4)}(c)}{4!} x^4 = \frac{f^{(4)}(c)}{24} x^4 \quad \text{for some } c \in (0, x) +\] + +We previously found: + +\[ +f^{(4)}(c) = \frac{24c(1 - c^2)}{(1 + c^2)^4} +\] + +So, + +\[ +R_3(x) = \frac{c(1 - c^2)}{(1 + c^2)^4} x^4 \quad \text{for some } c \in (0, x) +\] + +#### (c) + +Our goal is to show: + +```math +|R_3(x)| = \left| \frac{c(1 - c^2)}{(1 + c^2)^4} x^4 \right| \leq \frac{x^4}{4(1 + c^2)^2} +\quad \text{for } c \in (0, 1) +``` + +We simplify the absolute value of the remainder: + +```math +|R_3(x)| = \frac{c(1 - c^2)}{(1 + c^2)^4} x^4 +``` + +So we now want to **prove** that: + +```math +\frac{c(1 - c^2)}{(1 + c^2)^4} \leq \frac{1}{4(1 + c^2)^2} +``` + +Now we multiply both sides by \( (1 + c^2)^4 \) (which is strictly positive): + +```math +4c(1 - c^2) \leq (1 + c^2)^2 +``` + +Now expand both sides: + +**Left-hand side:** +```math +4c(1 - c^2) = 4c - 4c^3 +``` + +**Right-hand side:** +```math +(1 + c^2)^2 = 1 + 2c^2 + c^4 +``` + +So we want to show: +```math +4c - 4c^3 \leq 1 + 2c^2 + c^4 +\quad \Leftrightarrow \quad +c^4 + 4c^3 + 2c^2 - 4c + 1 \geq 0 +``` + +Now factor the left-hand side: +```math +c^4 + 4c^3 + 2c^2 - 4c + 1 = (c^2 + 2c - 1)^2 \geq 0 +``` + +This inequality clearly holds for all c including \( c \in (0, 1) \), so the bound is valid. diff --git a/book/appendix/Exercise Sheet 4 Solutions.md b/book/appendix/Exercise Sheet 4 Solutions.md new file mode 100644 index 0000000..eb77c26 --- /dev/null +++ b/book/appendix/Exercise Sheet 4 Solutions.md @@ -0,0 +1,393 @@ +# Exercise Sheet 4 Solutions + +### 1. 
+ +\[ +A = \begin{pmatrix} 3 & 2 \\ -2 & -1 \end{pmatrix} +\] + +**Characteristic Polynomial:** + +We compute the characteristic polynomial: + +\[ +p(\lambda) = \det(A - \lambda I) += \det \begin{pmatrix} 3 - \lambda & 2 \\ -2 & -1 - \lambda \end{pmatrix} +\] + +\[ += (3 - \lambda)(-1 - \lambda) - (-2)(2) += \lambda^2 - 2\lambda + 1 = (\lambda - 1)^2 +\] + +**Eigenvalues:** + +Solving the characteristic polynomial: + +\[ +(\lambda - 1)^2 = 0 \Rightarrow \lambda_1 = \lambda_2 = 1 +\] + +**Eigenvectors:** + +We solve: +\[ +(A - \lambda I)\vec{x} = 0 \Rightarrow +\begin{pmatrix} 2 & 2 \\ -2 & -2 \end{pmatrix} \vec{x} = \vec{0} +\] + +This reduces to: +\[ +x_1 + x_2 = 0 \Rightarrow \vec{x} = \alpha \begin{pmatrix} 1 \\ -1 \end{pmatrix} +\] + +So, the eigenvector is any scalar multiple of: + +\[ +\vec{x} = \begin{pmatrix} 1 \\ -1 \end{pmatrix} +\] + + +**Now, we solve them for B:** +\[ +B = \begin{pmatrix} +0 & 1 & 0 \\ +1 & 0 & 1 \\ +1 & 1 & 0 +\end{pmatrix} +\] + +**Characteristic Polynomial:** + +We compute the characteristic polynomial: + +\[ +p(\lambda) = \det(B - \lambda I) = +\det \begin{pmatrix} +-\lambda & 1 & 0 \\ +1 & -\lambda & 1 \\ +1 & 1 & -\lambda +\end{pmatrix} +\] + +Expanding along the first row: + +```math += -\lambda \cdot \det \begin{pmatrix} +-\lambda & 1 \\ +1 & -\lambda +\end{pmatrix} +- 1 \cdot \det \begin{pmatrix} +1 & 1 \\ +1 & -\lambda +\end{pmatrix} ++ 0 +``` + +\[ += -\lambda(\lambda^2 - 1) - (-\lambda - 1) += -\lambda^3 + \lambda + \lambda + 1 += -\lambda^3 + 2\lambda + 1 +\] + +So the characteristic polynomial is: + +\[ +p(\lambda) = -\lambda^3 + 2\lambda + 1 +\] + +**Eigenvalues:** + +Solving the characteristic polynomial: + +\[ +-\lambda^3 + 2\lambda + 1 = 0 +\] + +```math +-\left( \lambda + 1 \right)\left( \lambda - \frac{1 + \sqrt{5}}{2} \right)\left( \lambda - \frac{1 - \sqrt{5}}{2} \right) = 0 +``` + +so: + +\[ +\lambda_1 \approx -0.62, \quad +\lambda_2 = -1, \quad +\lambda_3 \approx 1.62 +\] + +**Eigenvectors:** + +To find the eigenvectors, we solve: + +\[ +(B - \lambda I)\vec{x} = 0 +\] + +**For** \( \lambda_1 \approx -0.62 \): + +\[ +\vec{v}_1 = \begin{pmatrix} +1 \\ +-0.62 \\ +-0.62 +\end{pmatrix} +\] + +**For** \( \lambda_2 = -1 \): + +\[ +\vec{v}_2 = \begin{pmatrix} +-1 \\ +1 \\ +0 +\end{pmatrix} +\] + +**For** \( \lambda_3 \approx 1.62 \): + +\[ +\vec{v}_3 = \begin{pmatrix} +1 \\ +1.62 \\ +1.62 +\end{pmatrix} +\] + +\(\blacksquare\) + +### 2. +We start by expressing the trace of \( ABC \): + +```math +\mathrm{tr}(ABC) = \sum_{i=1}^{n} (ABC)_{ii} +``` + +**Computing \( (ABC)_{ii} \):** + +We want to compute the entry in the \( i \)-th row and \( i \)-th column of the matrix product \( ABC \). + +1. First, compute the product \( AB \), which gives: + +```math +(AB)_{il} = \sum_{j=1}^{m} A_{ij} B_{jl} +``` + +2. Then compute \( ABC = (AB)C \). The \( (i, i) \)-th element of \( ABC \) is: + +```math +(ABC)_{ii} = \sum_{l=1}^{k} (AB)_{il} C_{li} +``` + +3. Substitute the expression for \( (AB)_{il} \): + +```math +(ABC)_{ii} = \sum_{l=1}^{k} \left( \sum_{j=1}^{m} A_{ij} B_{jl} \right) C_{li} +``` + +4. Rearranging the summation order: + +```math +(ABC)_{ii} = \sum_{j=1}^{m} \sum_{l=1}^{k} A_{ij} B_{jl} C_{li} +``` + +So, the trace becomes: + +```math +\mathrm{tr}(ABC) = \sum_{i=1}^{n} \sum_{j=1}^{m} \sum_{k=1}^{k} A_{ij} B_{jk} C_{ki} +``` + + +**Computing the trace of \( BCA \):** + +```math +\mathrm{tr}(BCA) = \sum_{j=1}^{m} (BCA)_{jj} +``` + +Expand the product: + +1. 
First compute \( BC \): + +```math +(BC)_{ji} = \sum_{k=1}^{k} B_{jk} C_{ki} +``` + +2. Then compute: + +```math +(BCA)_{jj} = \sum_{i=1}^{n} (BC)_{ji} A_{ij} = \sum_{i=1}^{n} \sum_{k=1}^{k} B_{jk} C_{ki} A_{ij} +``` + +3. So the trace is: + +```math +\mathrm{tr}(BCA) = \sum_{j=1}^{m} \sum_{k=1}^{k} \sum_{i=1}^{n} B_{jk} C_{ki} A_{ij} +``` + + +**Final Step:** + +Since scalar multiplication is commutative: + +```math +\sum_{i=1}^{n} \sum_{j=1}^{m} \sum_{k=1}^{k} A_{ij} B_{jk} C_{ki} += \sum_{j=1}^{m} \sum_{k=1}^{k} \sum_{i=1}^{n} B_{jk} C_{ki} A_{ij} +``` + +Therefore: + +```math +\mathrm{tr}(ABC) = \mathrm{tr}(BCA) +``` + +\(\blacksquare\) + +### 3. + +**Eigenvalues are the same** + +Let \( p(\lambda) \) be the characteristic polynomial of \( A \). By definition: + +```math +p(\lambda) = \det(A - \lambda I) +``` + +Now consider the transpose: + +```math +\det(A^\top - \lambda I) += \det((A - \lambda I)^\top) += \det(A - \lambda I) += p(\lambda) +``` + +So both \( A \) and \( A^\top \) have the **same characteristic polynomial**, which means they have the **same eigenvalues**, including their algebraic multiplicities. + + +**Eigenvectors may differ** +Although \( A \) and \( A^\top \) have the same eigenvalues, their eigenvectors **need not** be the same. + +To see this, consider a concrete example: + +Let + +```math +A = \begin{pmatrix} +1 & 1 \\ +0 & 1 +\end{pmatrix} +``` + +Then: + +```math +A^\top = \begin{pmatrix} +1 & 0 \\ +1 & 1 +\end{pmatrix} +``` + +The characteristic polynomial of both is: + +```math +\det(A - \lambda I) = (1 - \lambda)^2 +``` + +So they both have a **repeated eigenvalue** \( \lambda = 1 \). + +Now compute eigenvectors: + +- For \( A \), we solve: + +```math +(A - I)\vec{x} = \begin{pmatrix} +0 & 1 \\ +0 & 0 +\end{pmatrix} \vec{x} = 0 +\Rightarrow x_2 = 0 \Rightarrow \vec{x} = \begin{pmatrix} +1 \\ +0 +\end{pmatrix} +``` + +- For \( A^\top \), we solve: + +```math +(A^\top - I)\vec{x} = \begin{pmatrix} +0 & 0 \\ +1 & 0 +\end{pmatrix} \vec{x} = 0 +\Rightarrow x_1 = 0 \Rightarrow \vec{x} = \begin{pmatrix} +0 \\ +1 +\end{pmatrix} +``` + +Thus, the **eigenvectors are different**, even though the eigenvalues are the same. + +**So:** + +- \( A \) and \( A^\top \) always have the **same eigenvalues**, including multiplicities. +- However, they may have **different eigenvectors**, especially when the matrix is **not symmetric**. + +\(\blacksquare\) + +### 4. +#### (a) The rank of \( B \) + +We observe that each row (and column) of \( B \) is a linear combination of the others: + +- The second row is: + ```math + \text{Row}_2 = 2 \cdot \text{Row}_1 + ``` +- The third row is: + ```math + \text{Row}_3 = 3 \cdot \text{Row}_1 + ``` + +So all rows lie in the span of the first row. + +Let’s do row reduction to confirm: + +```math +\begin{pmatrix} +1 & 2 & 3 \\ +2 & 4 & 6 \\ +3 & 6 & 9 +\end{pmatrix} +\rightarrow +\begin{pmatrix} +1 & 2 & 3 \\ +0 & 0 & 0 \\ +0 & 0 & 0 +\end{pmatrix} +``` + +Only one non-zero row remains after Gaussian elimination. + +Therefore, the rank of \( B \) is: + +```math +\mathrm{rank}(B) = 1 +``` + +#### (b) Are the columns of \( B \) linearly independent? + +Recall that the number of linearly independent columns is equal to the rank of the matrix. + +Since: + +```math +\mathrm{rank}(B) = 1 < 3 +``` + +This means that the columns are **linearly dependent**. 
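These conclusions are easy to confirm numerically. The following sketch (not part of the original solution) checks the rank of \( B \) and, as a bonus, the cyclic trace identity from Exercise 2 on random matrices:

```python
import numpy as np

B = np.array([[1, 2, 3],
              [2, 4, 6],
              [3, 6, 9]])
print(np.linalg.matrix_rank(B))  # 1, so the columns cannot be linearly independent

# cyclic property of the trace, tr(ABC) = tr(BCA), on random rectangular matrices
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 5))
Bm = rng.normal(size=(5, 6))
C = rng.normal(size=(6, 4))
print(np.allclose(np.trace(A @ Bm @ C), np.trace(Bm @ C @ A)))  # True
```

The explicit column relations behind the rank deficiency are spelled out next.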
+ +In fact: + +```math +\text{Col}_2 = 2 \cdot \text{Col}_1 \\ +\text{Col}_3 = 3 \cdot \text{Col}_1 +``` +\(\blacksquare\) diff --git a/book/appendix/Exercise Sheet Solutions.md b/book/appendix/Exercise Sheet Solutions.md new file mode 100644 index 0000000..cadaa1e --- /dev/null +++ b/book/appendix/Exercise Sheet Solutions.md @@ -0,0 +1,6 @@ +# Exercise Solutions + +In this appendix, we provide worked-out solutions to the weekly exercise sheets accompanying the *Mathematics for Machine Learning* course. These solutions are designed to reinforce understanding of the theoretical material covered in the main chapters. Each solution sheet contains detailed step-by-step derivations and justifications. + +```{tableofcontents} +``` diff --git a/book/appendix/scalar-scalar_chain_rule.md b/book/appendix/scalar-scalar_chain_rule.md index a939512..7dd46f7 100644 --- a/book/appendix/scalar-scalar_chain_rule.md +++ b/book/appendix/scalar-scalar_chain_rule.md @@ -1,5 +1,8 @@ # The Chain Rule for Scalar-Scalar Functions +The **Chain Rule** is a fundamental theorem in calculus that describes how to differentiate composite functions. It states that if you have two functions, $f$ and $g$, and you want to differentiate their composition $f(g(x))$, you can do so by multiplying the derivative of $f$ evaluated at $g(x)$ by the derivative of $g$ evaluated at $x$. +This is particularly useful when dealing with functions that are composed of other functions, as it allows us to break down the differentiation process into manageable parts. + ::: {prf:theorem} scalar-scalar chain rule :label: thm-scalar-scalar-chain-rule-appendix :nonumber: @@ -33,7 +36,9 @@ Set $$ \Delta u = g(x_0+\Delta x)-g(x_0), $$ -so that $\Delta u\to0$ and $\tfrac{\Delta u}{\Delta x}\to g'(x_0)$ by differentiability of $g$. We now write +so that $\Delta u\to0$ and $\tfrac{\Delta u}{\Delta x}\to g'(x_0)$ by differentiability of $g$. + +We now write $$ \frac{f\bigl(g(x_0+\Delta x)\bigr)-f\bigl(g(x_0)\bigr)}{\Delta x} @@ -48,7 +53,9 @@ $$ \frac{f(u_0+\Delta u)-f(u_0)}{\Delta u} = f'(\xi). $$ -As $\Delta x\to0$, we have $\xi\to u_0$, and hence $f'(\xi)\to f'(u_0)$. Therefore +As $\Delta x\to0$, we have $\xi\to u_0$, and hence $f'(\xi)\to f'(u_0)$. + +Therefore $$ h'(x_0) diff --git a/book/appendix/squeeze_theorem.md b/book/appendix/squeeze_theorem.md index aadeb76..aab7f2b 100644 --- a/book/appendix/squeeze_theorem.md +++ b/book/appendix/squeeze_theorem.md @@ -14,6 +14,7 @@ kernelspec: :::{prf:theorem} Squeeze theorem :label: squeeze_theorem-appendix +:nonumber: Let $g(x), h(x), f(x)$ be functions defined near $c$. Suppose that there is an open interval around $c$, except possibly at $c$ itself, such that: diff --git a/book/chapter_calculus/analytical_solution_ridge.md b/book/chapter_calculus/analytical_solution_ridge.md index 351be76..8970305 100644 --- a/book/chapter_calculus/analytical_solution_ridge.md +++ b/book/chapter_calculus/analytical_solution_ridge.md @@ -10,9 +10,12 @@ kernelspec: language: python name: python3 --- -# Analytical Solution for Ridge Regression +# Ridge Regression as a Quadratic Optimization Problem + So far, we have optimized ridge regression using the gradient descent algorithm. -However, the first order condition tells us that at the minimum of the objective function, the gradient should vanish. We will use this knowledge to derive an analytical solution to the weights in ridge regression. +However, the first order condition tells us that at the minimum of the objective function, the gradient should vanish. 
We will use this knowledge to derive an analytical solution to the weights in ridge regression. We will show that Ridge Regression belongs to the set of quadratic Optimization Problems and will show how to solve quadratic optimization problems analytically. + +## Ride Regression The objective function for Ridge regression is given by: @@ -94,7 +97,7 @@ class RidgeRegression: ## Example usage -We will use the Ridge regression implementation to fit a model to the maximum temperature data from the year 1900. The data is available in the `data_train` and `data_test` variables, which contain the training and testing datasets, respectively. We will fit a model based on three tanh basis functions to the data and evaluate its performance using Mean Squared Error (MSE). +We will use the Ridge regression implementation to fit a model to the maximum temperature data from the year 1900. We will fit a model based on three tanh basis functions with the fixed parameters defined before, without optimizing over the basis functions. The model is given by @@ -115,27 +118,14 @@ $$ a_1 = 0.1, \quad a_2 = 0.2, \quad a_3 = 0.3 \quad \text{and} \quad b_1 = -10, \quad b_2 = -50, \quad b_3 = -100.0 $$ -To streamline the implementation, we will collect the hyperparameters for all basis functions $\phi_i$ in a single matrix $\mathbf{W}_\phi$: - -$$ - \mathbf{W}_\phi = \begin{pmatrix} - a_1 & a_2 & a_3 \\ - b_1 & b_2 & b_3 - \end{pmatrix} -$$ - -Using this notation, we can express the tanh basis functions as: -$$ - \boldsymbol{\phi}(x; \mathbf{W}_\phi) = - \begin{pmatrix} - \tanh(\mathbf{W}_\phi[0,i] x + \mathbf{W}_\phi[1,i]) - \end{pmatrix}_{i=1}^3 -$$ +```{code-cell} ipython3 +:tags: [hide-input] -We implement the tanh basis functions in a class called `TanhBasis`. The class has two methods: `XW` and `transform`. The `XW` method computes the product of the input data and the weights, while the `transform` method computes the tanh basis functions. +import numpy as np +import pandas as pd +import matplotlib.pyplot as plt -```{code-cell} ipython3 import numpy as np class TanhBasis: @@ -151,16 +141,7 @@ class TanhBasis: def transform(self, x): """Compute the tanh basis functions.""" return np.tanh(self.XW(x)) -``` -Let's use the `TanhBasis` class to fit a Ridge regression model to the maximum temperature data from the year 1900. We will use three tanh basis functions with the specified hyperparameters. - -```{code-cell} ipython3 -:tags: [hide-input] - -import numpy as np -import pandas as pd -import matplotlib.pyplot as plt YEAR = 1900 def load_weather_data(year = None): """ @@ -237,4 +218,62 @@ ax = plt.ylabel("Maximum Temperature - degree C") ax = plt.title("Year : %i N : %i" % (YEAR, N_train)) ``` We see that we obtain the nearly identical solution to the version using gradient descent. -However, in this version it would require some additional work to optimize over the basis function parameters. \ No newline at end of file +However, in this version it would require some additional work to optimize over the basis function parameters. + + +## Quadratic Optimization Problems + +Many problems in machine learning and statistics reduce to minimizing a **quadratic function** of the form + +$$ +f(\mathbf{w}) = \frac{1}{2} \mathbf{w}^\top \mathbf{A} \mathbf{w} - \mathbf{b}^\top \mathbf{w} +$$ + +where $\mathbf{A} \in \mathbb{R}^{d \times d}$ is a **symmetric positive definite** matrix, and $\mathbf{b} \in \mathbb{R}^d$. 
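As a numerical preview of the closed-form condition derived next, here is a minimal sketch (with made-up data) that minimizes such a quadratic by solving the corresponding linear system and checks that the gradient vanishes at the solution:

```{code-cell} ipython3
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(5, 3))
A = M.T @ M + 0.1 * np.eye(3)   # symmetric positive definite by construction
b = rng.normal(size=3)

w_star = np.linalg.solve(A, b)  # candidate minimizer of 1/2 w^T A w - b^T w
print(A @ w_star - b)           # gradient at w_star: numerically zero
```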
The minimum of this function can be found analytically by setting the gradient to zero: + +$$ +\nabla f(\mathbf{w}) = \mathbf{A} \mathbf{w} - \mathbf{b} = 0 \quad \Rightarrow \quad \boxed{\mathbf{w} = \mathbf{A}^{-1} \mathbf{b}} +$$ + +--- + +### Ridge Regression as a Special Case + +The Ridge Regression objective can be rewritten in this general form. Starting from: + +$$ +f(\mathbf{w}) = \frac{1}{2} \|\mathbf{y} - \mathbf{Xw}\|^2_2 + \frac{\lambda}{2} \|\mathbf{w}\|^2_2 +$$ + +we expand the squared norm: + +$$ +f(\mathbf{w}) = \frac{1}{2} (\mathbf{y} - \mathbf{Xw})^\top (\mathbf{y} - \mathbf{Xw}) + \frac{\lambda}{2} \mathbf{w}^\top \mathbf{w} +$$ + +$$ += \frac{1}{2} \left[ \mathbf{y}^\top \mathbf{y} - 2 \mathbf{y}^\top \mathbf{Xw} + \mathbf{w}^\top \mathbf{X}^\top \mathbf{X} \mathbf{w} \right] + \frac{\lambda}{2} \mathbf{w}^\top \mathbf{w} +$$ + +Dropping the constant term $\frac{1}{2} \mathbf{y}^\top \mathbf{y}$, the expression becomes: + +$$ +f(\mathbf{w}) = \frac{1}{2} \mathbf{w}^\top (\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I}) \mathbf{w} - \mathbf{w}^\top \mathbf{X}^\top \mathbf{y} +$$ + +This matches the generalized quadratic form with: + +* $\mathbf{A} = \mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I}$ +* $\mathbf{b} = \mathbf{X}^\top \mathbf{y}$ + +Since $\mathbf{A}$ is symmetric and positive definite for $\lambda > 0$, the minimum is achieved at: + +$$ +\boxed{ +\mathbf{w} = (\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{X}^\top \mathbf{y} +} +$$ + +--- + +This perspective makes it clear that Ridge Regression is simply a **quadratic optimization problem with a symmetric positive definite matrix**, and therefore has a unique analytical solution. This also connects to broader optimization theory and prepares us to explore other models — including Bayesian linear regression, kernel methods, and even Newton’s method — through the lens of **solving linear systems**. diff --git a/book/chapter_calculus/hessian.md b/book/chapter_calculus/hessian.md index 96326c8..85ced0c 100644 --- a/book/chapter_calculus/hessian.md +++ b/book/chapter_calculus/hessian.md @@ -12,8 +12,10 @@ kernelspec: --- # The Hessian -The **Hessian** matrix of a scalar-valued function $ f : \mathbb{R}^d \to \mathbb{R} $ is a square matrix -of second-order partial derivatives: +In one variable, the second derivative of a function is a number that tells us about the curvature of the function. +But in many variables, each partial derivative can change in many directions—so we need a **matrix** of second derivatives: + +The **Hessian** matrix of a scalar-valued function $ f : \mathbb{R}^d \to \mathbb{R} $ is a square matrix of second-order partial derivatives: $$\nabla^2 f(\mathbf{x}) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \dots & \frac{\partial^2 f}{\partial x_1 \partial x_d} \\ @@ -22,119 +24,338 @@ $$\nabla^2 f(\mathbf{x}) = \begin{bmatrix} \end{bmatrix}, \quad\text{i.e.,}\quad [\nabla^2 f]_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j} $$ -If the second partial derivatives are continuous (as they often are in optimization), then by **Clairaut's theorem**, -the order of differentiation can be interchanged: $ \frac{\partial^2 f}{\partial x_i \partial x_j} = \frac{\partial^2 f}{\partial x_j \partial x_i} $, -which implies that the Hessian matrix is symmetric. 
---- +:::{prf:theorem} Clairaut Schwarz +:label: thm-Clairaut +:nonumber: -## First-Order Taylor Expansion +Let $f: \mathbb{R}^d \to \mathbb{R}$ be a function such that both mixed partial derivatives $\frac{\partial^2 f}{\partial x_i \partial x_j}$ and $\frac{\partial^2 f}{\partial x_j \partial x_i}$ exist and are **continuous** on an open set containing a point $\mathbf{x}_0$ -Recall, that we can create a locally linear approximation to a function at a point $\mathbf{x}_0 \in \mathbb{R}^d $ using the gradient at $\nabla f(\mathbf{x}_0)$. +Then: -$$ f(\mathbf{x}) \approx f(\mathbf{x}_0) + \nabla f(\mathbf{x}_0)^\top (\mathbf{x} - \mathbf{x}_0) . $$ +$$ +\boxed{ +\frac{\partial^2 f}{\partial x_i \partial x_j}(\mathbf{x}_0) = \frac{\partial^2 f}{\partial x_j \partial x_i}(\mathbf{x}_0) +} +$$ -This affine approximation is also known as the **first-order Taylor approximation**. +That is, **the order of differentiation can be interchanged**. +::: -It agrees with the original function in value and gradient at the point $ \mathbf{x}_0 $. -## Second-Order Taylor Expansion +Clairut's Theorem implies that the Hessian matrix is symmetric. We provide a proof sketch in the appendix. -The Hessian appears naturally in the **second-order Taylor approximation** of a function around a point $ \mathbf{x}_0 \in \mathbb{R}^d $. -For a sufficiently smooth function $ f : \mathbb{R}^d \to \mathbb{R} $, we can approximate its values near $ \mathbf{x}_0 $ as: +## **Curvature in One Dimension** -$$ f(\mathbf{x}) \approx f(\mathbf{x}_0) + \nabla f(\mathbf{x}_0)^\top (\mathbf{x} - \mathbf{x}_0) + \frac{1}{2}(\mathbf{x} - \mathbf{x}_0)^\top \nabla^2 f(\mathbf{x}_0)(\mathbf{x} - \mathbf{x}_0). $$ +Recall the second derivative in one dimension: -This is a **local quadratic approximation** to the function. It agrees with the original function in value, gradient, and Hessian at the point $ \mathbf{x}_0 $. +* $f(x) = x^2$: curve is "smiling" ⇒ second derivative is positive ⇒ function is curving upward. +* $f(x) = -x^2$: curve is "frowning" ⇒ second derivative is negative ⇒ function is curving downward. +* Point: second derivative tells us **how the function curves**. -### Interpretation: -- The **gradient** term captures the linear behavior (slope) of the function near $ \mathbf{x}_0 $. -- The **Hessian** term captures the curvature — how the gradient changes in different directions. +```{code-cell} ipython3 +:tags: [hide-input] +import numpy as np +import matplotlib.pyplot as plt ---- +x = np.linspace(-2, 2, 400) +f1 = x**2 +f2 = -x**2 +f1_dd = np.full_like(x, 2) # Second derivative of x^2 +f2_dd = np.full_like(x, -2) # Second derivative of -x^2 + +fig, axes = plt.subplots(1, 2, figsize=(10, 4)) + +# Plot for f(x) = x^2 +axes[0].plot(x, f1, label='$f(x) = x^2$') +axes[0].plot(x, f1_dd, '--', label='$f\'\'(x) = 2$') +axes[0].set_title('Positive Curvature') +axes[0].legend() +axes[0].grid(True) + +# Plot for f(x) = -x^2 +axes[1].plot(x, f2, label='$f(x) = -x^2$') +axes[1].plot(x, f2_dd, '--', label='$f\'\'(x) = -2$') +axes[1].set_title('Negative Curvature') +axes[1].legend() +axes[1].grid(True) + +plt.suptitle("Second Derivative as Curvature in 1D", fontsize=14) +plt.tight_layout() +plt.show() +``` -We illustrate both the first-order and the second-order Taylor expansion using the following function. +The Hessian generalizes this intuition to multiple Dimensions. 
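Before moving on to two dimensions, here is a small symbolic check (a sketch using `sympy`, not part of the original text) that computes a Hessian and confirms the symmetry guaranteed by Clairaut's theorem:

```{code-cell} ipython3
import sympy as sp

x, y = sp.symbols('x y')
f = sp.exp(x) * sp.sin(y) + x**3 * y   # any smooth scalar function of x and y

H = sp.hessian(f, (x, y))
print(H)
print(sp.simplify(H - H.T))            # zero matrix: the Hessian is symmetric
```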
-$$ f(x, y) = \log(1 + x^2 + y^2) $$ +## **Curvature in Two Dimensions** -We compute the first-order and second-order Taylor approximations at the point $ (x_0, y_0) = (0.3, 0.3) $. +Now, let's look at a simple 2D surface like: -The true function and the linear approximation match in value and gradient at the point $ (x_0, y_0)$ but differ elsewhere. Similarly, the quadratic approximation match in value, gradient, and Hessian at this point but differ elsewhere. +* $f(x, y) = x^2 + y^2$: bowl shape +* $f(x, y) = x^2 - y^2$: saddle shape ```{code-cell} ipython3 :tags: [hide-input] -import numpy as np -import matplotlib.pyplot as plt from mpl_toolkits.mplot3d import Axes3D +from matplotlib import cm -# Define the function and its gradient and Hessian -f = lambda x, y: np.log(1 + x**2 + y**2) -x0, y0 = 0.3, 0.3 - -# Compute value, gradient, and Hessian at (x0, y0) -r2 = x0**2 + y0**2 -f0 = np.log(1 + r2) -grad = np.array([2*x0, 2*y0]) / (1 + r2) -H = (2 / (1 + r2)) * np.eye(2) - (4 * np.outer([x0, y0], [x0, y0])) / (1 + r2)**2 - -# Taylor expansion up to second order -def f_taylor_first_order(x, y): - dx = x - x0 - dy = y - y0 - delta = np.array([dx, dy]) - return f0 + (grad @ delta).sum() - -# Taylor expansion up to second order -def f_taylor_second_order(x, y): - dx = x - x0 - dy = y - y0 - delta = np.array([dx, dy]) - return f0 + (grad @ delta).sum() + 0.5 * (delta @ H @ delta).sum() - -# Create grid for plotting -x_vals = np.linspace(x0-1, x0+1, 100) -y_vals = np.linspace(y0-1, y0+1, 100) -X, Y = np.meshgrid(x_vals, y_vals) -Z_true = f(X,Y) -Z_first = np.zeros(X.shape) -Z_second = np.zeros(X.shape) - -for i in range(X.shape[0]): - for j in range(X.shape[1]): - Z_first[i,j] = f_taylor_first_order(X[i,j],Y[i,j]) - Z_second[i,j] = f_taylor_second_order(X[i,j],Y[i,j]) - -# Plot both Taylor approximations -fig = plt.figure(figsize=(14, 6)) -ax1 = fig.add_subplot(1, 2, 1, projection='3d') -ax2 = fig.add_subplot(1, 2, 2, projection='3d') +x = np.linspace(-2, 2, 100) +y = np.linspace(-2, 2, 100) +X, Y = np.meshgrid(x, y) + +# Bowl: f(x, y) = x^2 + y^2 +Z_bowl = X**2 + Y**2 + +# Saddle: f(x, y) = x^2 - y^2 +Z_saddle = X**2 - Y**2 + +fig = plt.figure(figsize=(12, 5)) -true_surface1 = ax1.plot_surface(X, Y, Z_true, cmap='viridis', alpha=0.6) -approx_surface1 = ax1.plot_surface(X, Y, Z_first, cmap='coolwarm', alpha=0.7) -ax1.scatter(x0, y0, f0, color='red', s=50, label=r'$\mathbf{x}_0$') -ax1.set_title("First-Order Taylor Approximation") +# Bowl surface +ax1 = fig.add_subplot(1, 2, 1, projection='3d') +ax1.plot_surface(X, Y, Z_bowl, cmap=cm.viridis, alpha=0.9) +ax1.set_title("Bowl: $f(x, y) = x^2 + y^2$") ax1.set_xlabel("x") ax1.set_ylabel("y") -ax1.legend() -ax1.set_zlim([-0.5,2]) +ax1.set_zlabel("f(x, y)") +# Add annotations for curvature +ax1.text(0, 0, 0, '∂²f/∂x² = 2\n∂²f/∂y² = 2', fontsize=10) -true_surface2 = ax2.plot_surface(X, Y, Z_true, cmap='viridis', alpha=0.6) -approx_surface2 = ax2.plot_surface(X, Y, Z_second, cmap='coolwarm', alpha=0.7) -ax2.scatter(x0, y0, f0, color='red', s=50, label=r'$\mathbf{x}_0$') -ax2.set_title("Second-Order Taylor Approximation") +# Saddle surface +ax2 = fig.add_subplot(1, 2, 2, projection='3d') +ax2.plot_surface(X, Y, Z_saddle, cmap=cm.coolwarm, alpha=0.9) +ax2.set_title("Saddle: $f(x, y) = x^2 - y^2$") ax2.set_xlabel("x") ax2.set_ylabel("y") -ax2.legend() -ax2.set_zlim([-0.5,2]) +ax2.set_zlabel("f(x, y)") +ax2.text(0, 0, 0, '∂²f/∂x² = 2\n∂²f/∂y² = -2', fontsize=10) + +plt.suptitle("Curvature in 2D: Bowl vs Saddle", fontsize=14) +plt.tight_layout() +plt.show() 
+``` + +At each point, the function curves more or less in certain directions. The Hessian is a matrix that captures all this curvature information—it tells us how the slope (the gradient) changes in every direction. + +--- + +### **A Simple Example** + + +$$ +f(x, y) = 3x^2 + 2xy + y^2 +$$ + +* $\frac{\partial f}{\partial x} = 6x + 2y$ +* $\frac{\partial f}{\partial y} = 2x + 2y$ +* Hessian: + + $$ + \nabla^2 f = \begin{bmatrix} + 6 & 2 \\ + 2 & 2 + \end{bmatrix} + $$ + +Each entry corresponds to a second derivative—either in the x-direction, y-direction, or mixed for the off-diagonals. + +## Gradient Vector Fields +The **Hessian matrix** describes how the **gradient vector** changes as you move through space. Let's visualize this in a grid with arrows pointing in the direction of the gradient — i.e., where the function increases most steeply. + +```{code-cell} ipython3 +:tags: [hide-input] +x = np.linspace(-3, 3, 30) +y = np.linspace(-3, 3, 30) +X, Y = np.meshgrid(x, y) + +# Gradients +U_bowl = 2 * X +V_bowl = 2 * Y + +U_saddle = 2 * X +V_saddle = -2 * Y + +fig, axes = plt.subplots(1, 2, figsize=(12, 5)) + +# Bowl gradient field +axes[0].quiver(X, Y, U_bowl, V_bowl, color='green') +axes[0].set_title('Gradient Field: $f(x, y) = x^2 + y^2$') +axes[0].set_xlabel('x') +axes[0].set_ylabel('y') +axes[0].axis('equal') +axes[0].grid(True) +axes[0].set_ylim([-2.3,2.3]) +axes[0].set_xlim([-2.3,2.3]) + +# Saddle gradient field +axes[1].quiver(X, Y, U_saddle, V_saddle, color='blue') +axes[1].set_title('Gradient Field: $f(x, y) = x^2 - y^2$') +axes[1].set_xlabel('x') +axes[1].set_ylabel('y') +axes[1].axis('equal') +axes[1].grid(True) +axes[1].set_ylim([-2.3,2.3]) +axes[1].set_xlim([-2.3,2.3]) + + +plt.suptitle("Gradient Vector Fields Show How ∇f Changes", fontsize=14) plt.tight_layout() plt.show() ``` -This visualization shows how the first-order (left) and second-order (right) Taylor expansions approximate the original function locally around the point $ (0.3,0.3) $, but deviates farther away. Both approximations are shown in blue to red and the original function in yellow to green colors. +* The **gradient vector field** shows how gradients vary over space. +* The **Hessian** is the *rate of change of the gradient*—it tells you how steep the slope is getting in every direction. +* The direction and length of arrows = the **gradient vector** at each point. +* The **rate of change** of those arrows = what the **Hessian** captures. + +--- + +## 🔍 How This Works in the Two Examples + +### 🟢 **Bowl: $f(x, y) = x^2 + y^2$** + +* **Gradient**: $\nabla f(x, y) = [2x,\ 2y]$ +* **Hessian**: + + $$ + \nabla^2 f = \begin{bmatrix} + 2 & 0 \\ + 0 & 2 + \end{bmatrix} + $$ + +This means: + +* In the **x-direction**, the gradient increases by 2 units per unit of x. +* In the **y-direction**, the gradient increases by 2 units per unit of y. +* The gradient field shows arrows pointing radially outward—getting longer linearly with distance from the origin. +* This **linear increase** in slope is exactly what the constant entries (2) in the Hessian mean. -## Summary and Outlook -An advantage of a local quadratic approximation is that we can find its minimum analytically. -This idea lies at the heart of **Newton's method**. -The Hessian matrix also allows us also to better understand the properties of stationary points of a function and derive **second-order conditions of minima**. 
+### 🔵 **Saddle: $f(x, y) = x^2 - y^2$** + +* **Gradient**: $\nabla f(x, y) = [2x,\ -2y]$ +* **Hessian**: + + $$ + \nabla^2 f = \begin{bmatrix} + 2 & 0 \\ + 0 & -2 + \end{bmatrix} + $$ + +This means: + +* In the **x-direction**, the gradient increases at the same rate as before: 2 per unit of x. +* In the **y-direction**, the gradient **decreases** (negative rate): -2 per unit of y. +* The gradient field shows **outward arrows** in the x-direction, but **inward arrows** in the y-direction. +* That flip in sign in the **Hessian entry $\partial^2 f/\partial y^2 = -2$** explains why the gradient pulls you *toward* the origin in y. + +## 🧩 Optional Extension: The Hessian as Jacobian of the Gradient + +We can think of the Hessian as the **Jacobian of the gradient** — it's the matrix of all partial derivatives of the components of the gradient vector field. + +That is: + +$$ +\nabla f(x, y) = +\begin{bmatrix} +\frac{\partial f}{\partial x} \\ +\frac{\partial f}{\partial y} +\end{bmatrix} +\quad\Rightarrow\quad +\nabla^2 f(x, y) = \text{Jacobian}\left( \nabla f(x, y) \right) +$$ + +## Gradient Descent and the Hessian: Why Off-Diagonal Terms Matter + +### 🧠 Key Idea + +Gradient descent minimizes functions by moving in the direction **opposite the gradient**. + +For quadratic functions: + +$$ +f(x) = \frac{1}{2} x^\top A x +\quad \text{with gradient} \quad \nabla f(x) = A x +$$ + +Here, $A$ is the **Hessian matrix**, and it determines the **shape of level sets** and how gradient descent behaves. + +* If $A$ is diagonal → level sets are **axis-aligned ellipses** (or circles). +* If $A$ has off-diagonal elements → ellipses are **rotated**, and gradient descent struggles (zig-zags). + + +```{code-cell} ipython3 +:tags: [hide-input] +import numpy as np +import matplotlib.pyplot as plt + +def gradient_descent(A, x0, lr=0.1, steps=30): + traj = [x0] + x = x0 + for _ in range(steps): + grad = A @ x + x = x - lr * grad + traj.append(x) + return np.array(traj) + +def plot_descent(A, title, lr=0.1): + x = np.linspace(-100, 100, 100) + y = np.linspace(-100, 100, 100) + X, Y = np.meshgrid(x, y) + Z = 0.5 * (A[0,0]*X**2 + 2*A[0,1]*X*Y + A[1,1]*Y**2) + + fig, ax = plt.subplots(figsize=(6, 6)) + ax.contour(X, Y, Z, levels=40, cmap='viridis') + + x0 = np.array([80, 90]) + traj = gradient_descent(A, x0, lr=lr, steps=30) + ax.plot(traj[:,0], traj[:,1], 'ro--', label='GD Path') + + ax.set_title(title) + ax.set_xlabel('x') + ax.set_ylabel('y') + ax.set_aspect('equal') + ax.grid(True) + ax.legend() + plt.show() +``` + +### Case 1: Spherical Hessian (Identity Matrix) + +```{code-cell} ipython3 +A_sphere = np.array([[1, 0], [0, 1]]) +plot_descent(A_sphere, "Spherical Hessian: $A = I$") +``` + +* Level sets are circles. +* Gradient descent takes straight, efficient steps toward the minimum. + + +### Case 2: Anisotropic Hessian (Different Curvatures) + +```{code-cell} ipython3 +:tags: [hide-input] +A_aniso = np.array([[15, 0], [0, 1]]) +plot_descent(A_aniso, "Anisotropic Hessian: $A = \\mathrm{diag}(10, 1)$", lr=0.1) +``` + +* Level sets are stretched ellipses. +* Gradient descent zig-zags, especially in the steep direction. + +--- + +### Case 3: Skewed Hessian (Off-Diagonal Elements) + +```{code-cell} ipython3 +:tags: [hide-input] +A_skew = np.array([[10, 6], [6, 8]]) +plot_descent(A_skew, "Skewed Hessian", lr=0.1) +``` +$A = \begin{bmatrix} 10 & 6 \\ 6 & 8 \end{bmatrix}$ + +* Level sets are rotated ellipses. +* Gradient descent strongly zig-zags and converges slowly. 
+* The skew comes directly from the **off-diagonal elements in the Hessian**.
-Before we will explore these two topics further, we first have to better understand the **properties of matrices** such as the Hessian. So let's turn to the topic of **matrix algebra**.
+Off-diagonal terms in the Hessian rotate the level curves. Since gradient descent moves perpendicular to level curves, it zig-zags when these are skewed. This is one of the motivations for using **second-order methods** that take the Hessian into account.
diff --git a/book/chapter_calculus/taylors_theorem.md b/book/chapter_calculus/taylors_theorem.md
new file mode 100644
index 0000000..c767c4a
--- /dev/null
+++ b/book/chapter_calculus/taylors_theorem.md
@@ -0,0 +1,599 @@
+---
+jupytext:
+  text_representation:
+    extension: .md
+    format_name: myst
+    format_version: 0.13
+    jupytext_version: 1.16.7
+kernelspec:
+  display_name: Python 3
+  language: python
+  name: python3
+---
+# Taylor’s Theorem
+
+Polynomials provide a framework for function approximation. It turns out that many functions can be approximated well by so-called Taylor polynomials, and that for a large class of infinitely differentiable functions this approximation can be exact. We call this class of functions *analytic*.
+
+We state and prove the first-order version of Taylor's Theorem with remainder in the multivariable case and state the second-order version, as it is typically encountered in machine learning and optimization contexts.
+
+We’ll first state the **single-variable version**, then the **multivariable** version (more relevant to gradient and Hessian-based methods), and give a **proof** for the single-variable case using the **mean value form** of the remainder.
+
+---
+:::{prf:theorem} Taylor’s Theorem with Remainder (Single Variable)
+:label: thm-taylor-single
+:nonumber:
+
+Let $f : \mathbb{R} \to \mathbb{R}$ be $(n+1)$-times continuously differentiable on an open interval containing $a$ and $x$.
+
+Then:
+
+$$
+f(x) = f(a) + f'(a)(x-a) + \frac{f''(a)}{2!}(x-a)^2 + \dots + \frac{f^{(n)}(a)}{n!}(x-a)^n + R_{n+1}(x)
+$$
+
+where the **remainder term** is given by the **Lagrange form**:
+
+$$
+R_{n+1}(x) = \frac{f^{(n+1)}(\xi)}{(n+1)!}(x-a)^{n+1}
+\quad \text{for some } \xi \in (a, x)
+$$
+:::
+
+Let's visualize a function $f : \mathbb{R} \to \mathbb{R}$ along with its **Taylor approximations** of increasing degree $n = 1, 2, \dots, N$ centered at a point $a$. We overlay each approximation on the graph of the true function to show how the Taylor series converges.
+ +```{code-cell} ipython3 +:tags: [hide-input] +import numpy as np +import matplotlib.pyplot as plt +from math import factorial +import sympy as sp + +# Define the function f symbolically +x = sp.symbols('x') +f_expr = sp.sin(x) # Change this to any (n+1)-times differentiable function +f = sp.lambdify(x, f_expr, modules='numpy') + +# Taylor expansion at point a +a = 1 +N = 16 # Highest degree of Taylor polynomial to visualize +x_vals = np.linspace(-2*np.pi+a, 2*np.pi+a, 400) + +# Generate the Taylor polynomial of degree n +def taylor_poly(expr, a, n): + return sum((expr.diff(x, k).subs(x, a) / factorial(k)) * (x - a)**k for k in range(n+1)) + +# Plotting +fig, ax = plt.subplots(figsize=(10, 6)) +plt.plot(x_vals, f(x_vals), label='True function', color='black', linewidth=6) + +colors = plt.cm.viridis(np.linspace(0, 1, N)) +for n in range(1, N+1): + taylor_expr = taylor_poly(f_expr, a, n) + taylor_func = sp.lambdify(x, taylor_expr, modules='numpy') + plt.plot(x_vals, taylor_func(x_vals), label=f'Taylor degree {n}', color=colors[n-1]) + +plt.axvline(a, color='gray', linestyle='--', alpha=0.5) +plt.title(r'Taylor Approximations of $f(x) = \sin(x)$ at $x = {a}$') +plt.xlabel('x') +plt.ylabel('f(x)') +plt.legend() +plt.grid(True) +ax.set_ylim([-2.7,2.7]) +ax.set_xlim([-2*np.pi+a, 2*np.pi+a]) +plt.tight_layout() +plt.show() +``` + +## Big-O Form of Taylor's Remainder (Single Variable) + +:::{prf:corollary} +:label: thm-taylor-single-BigO +:nonumber: + +Let $f: \mathbb{R} \to \mathbb{R}$ be $(n+1)$-times continuously differentiable in a neighborhood of $a$. + +Then: + +$$ +f(x) = \sum_{k=0}^n \frac{f^{(k)}(a)}{k!}(x - a)^k + \mathcal{O}((x - a)^{n+1}) +\quad \text{as } x \to a +$$ + +::: + +This means: + +> There exists a constant $C$ and a neighborhood around $a$ such that +> +> $$ |R_{n+1}(x)| \leq C |x - a|^{n+1} \quad \text{as } x \to a $$ + +The notation tells us that **the remainder vanishes at the same rate as $(x - a)^{n+1}$** as $x \to a$. + +Let's prove Taylor's Theorem with the exact expression for the remainder. + +:::{prf:proof} (Single Variable, Lagrange Form of the Remainder) + +We want to prove that: + +$$ +f(x) = T_n(x) + R_{n+1}(x) = \sum_{k=0}^{n} \frac{f^{(k)}(a)}{k!}(x-a)^k + \frac{f^{(n+1)}(\xi)}{(n+1)!}(x - a)^{n+1} +\quad \text{for some } \xi \in (a, x) +$$ + +--- + +### Step 1: Define the Taylor Polynomial and Remainder + +Let + +$$ +T_n(t) = \sum_{k=0}^{n} \frac{f^{(k)}(a)}{k!}(t-a)^k +\quad \text{and} \quad +R_{n+1}(x) = f(x) - T_n(x) +$$ + +We want to find an expression for $R_{n+1}(x)$. + +--- + +### Step 2: Define an Auxiliary Function + +Define a function $\phi(t)$ that measures the difference between $f(t)$ and the Taylor polynomial + an extra term we choose to vanish at $t = x$: + +$$ +\phi(t) = f(t) - T_n(t) - \frac{f^{(n+1)}(x)}{(n+1)!}(t-a)^{n+1} +$$ + +We designed this function such that: + +* $\phi(a) = f(a) - T_n(a) - 0 = 0$ (since $T_n(a) = f(a)$) +* $\phi(x) = f(x) - T_n(x) - \frac{f^{(n+1)}(x)}{(n+1)!}(x - a)^{n+1}$ + +So $\phi(x) = R_{n+1}(x) - \frac{f^{(n+1)}(x)}{(n+1)!}(x - a)^{n+1}$ + +Now, the goal is to compare this to a function that we can analyze using **Cauchy's Mean Value Theorem**. 
+ +--- + +### Step 3: Construct a Function with a Known Root + +Let: + +$$ +h(t) := (t - a)^{n+1} +$$ + +and define: + +$$ +F(t) := \phi(t) \quad \text{and} \quad G(t) := h(t) +$$ + +Both $F(t)$ and $G(t)$ are $C^1$ functions, and they vanish at $t = a$: $F(a) = G(a) = 0$ + +We now apply **Cauchy's Mean Value Theorem** to $F$ and $G$ on the interval $[a, x]$: + +> If $F$ and $G$ are differentiable and $G'(t) \neq 0$ on $(a, x)$, then there exists $\xi \in (a, x)$ such that: +> +> $$\frac{F(x) - F(a)}{G(x) - G(a)} = \frac{F'(\xi)}{G'(\xi)}$$ + +Apply it: + +* $F(x) - F(a) = \phi(x) - 0 = R_{n+1}(x) - \frac{f^{(n+1)}(x)}{(n+1)!}(x-a)^{n+1}$ +* $G(x) - G(a) = (x-a)^{n+1} - 0$ + +So: + +$$ +\frac{R_{n+1}(x) - \frac{f^{(n+1)}(x)}{(n+1)!}(x-a)^{n+1}}{(x-a)^{n+1}} = \frac{\phi'(\xi)}{(n+1)(\xi - a)^n} +$$ + +Compute $\phi'(t)$: + +* $\phi'(t) = f'(t) - T_n'(t) - \frac{f^{(n+1)}(x)}{(n+1)!} \cdot (n+1)(t - a)^n$ + +But recall that $T_n'(t) = \sum_{k=1}^n \frac{f^{(k)}(a)}{(k-1)!}(t - a)^{k-1}$, so $\phi'(t)$ behaves like a difference between $f'(t)$ and the Taylor expansion of $f'$. + +But instead of continuing with $\phi(t)$, there's a **simpler and cleaner proof** using a function designed for **Lagrange’s form**. + +--- + +### Using Cauchy's Mean Value Theorem + +Let’s define: + +$$ +h(t) := f(t) - T_n(t) +\quad \text{and} \quad +g(t) := (t - a)^{n+1} +$$ + +Note: + +* $h(a) = 0$, because $T_n(a) = f(a)$ +* $g(a) = 0$ + +Then apply Cauchy’s Mean Value Theorem to $h$ and $g$ on $[a, x]$: + +There exists $\xi \in (a, x)$ such that: + +$$ +\frac{h(x)}{g(x)} = \frac{h'(\xi)}{g'(\xi)} +$$ + +Let’s compute: + +* $g(x) = (x - a)^{n+1}$, and $g'(\xi) = (n+1)(\xi - a)^n$ +* $h(x) = f(x) - T_n(x) = R_{n+1}(x)$ +* $h'(\xi) = f^{(n+1)}(\xi) \cdot \frac{(\xi - a)^n}{n!}$ (this is a known identity) + +Then: + +$$ +\frac{R_{n+1}(x)}{(x - a)^{n+1}} = \frac{f^{(n+1)}(\xi)}{(n+1)!} +\quad \Rightarrow \quad +R_{n+1}(x) = \frac{f^{(n+1)}(\xi)}{(n+1)!}(x - a)^{n+1} +$$ + +Q.E.D. + +::: + +### Analytic Functions + +**Analytic functions** are intimately related to Taylor series and to the **remainder** behavior. + +### 🔍 What Is an Analytic Function? + +> A function $f : \mathbb{R} \to \mathbb{R}$ (or $f : \mathbb{R}^d \to \mathbb{R}$) is called **analytic at a point** $a$ if: +> +> The Taylor series of $f$ at $a$ **converges to** the function in a neighborhood of $a$: +> +> $$f(x) = \sum_{k=0}^{\infty} \frac{f^{(k)}(a)}{k!}(x - a)^k\quad \text{for all } x \text{ near } a$$ + +That is: + +* Not only does the Taylor series **exist** (i.e., $f$ is infinitely differentiable), +* But it **converges to the true function** (i.e., the remainder $R_n(x) \to 0$ as $n \to \infty$). + +### 🚫 Not All Smooth Functions Are Analytic + +An important subtlety: + +> There exist functions that are **infinitely differentiable** (smooth), but **not analytic**. + +For example, the function + +$$ +f(x) = \begin{cases} +e^{-1/x^2} & \text{if } x \neq 0 \\\\ +0 & \text{if } x = 0 +\end{cases} +$$ + +is **$C^\infty$** everywhere, but its **Taylor series at 0 is identically zero** (all derivatives vanish at 0) — even though the function is not identically zero. 
+ +```{code-cell} ipython3 +:tags: [hide-input] +import numpy as np +import matplotlib.pyplot as plt +from math import factorial + +# Define the smooth but non-analytic function +def f(x): + out = np.zeros_like(x) + nonzero = x != 0 + out[nonzero] = np.exp(-1 / x[nonzero]**2) + return out + +# Compute Taylor polynomials at x=0 (they are all zero) +def taylor_approx(x, n): + return np.zeros_like(x) # All derivatives at 0 are zero + +# Set up x-values +x_vals = np.linspace(-1, 1, 400) +f_vals = f(x_vals) + +# Plot the true function and several Taylor approximations +# Create the plot +fig, ax = plt.subplots(figsize=(10, 6)) +ax.plot(x_vals, f_vals, label='$f(x) = e^{-1/x^2}$ (extended by 0 at 0)', color='black') + +colors = plt.cm.viridis(np.linspace(0, 1, 5)) +for n, c in zip([1, 3, 5, 10, 20], colors): + plt.plot(x_vals, taylor_approx(x_vals, n), linestyle='--', color=c, label=f'Taylor degree {n}') + +plt.axvline(0, color='gray', linestyle='--', alpha=0.5) +plt.title('Smooth but Non-Analytic Function at $x = 0$') +plt.xlabel('x') +plt.ylabel('f(x)') +ax.set_ylim([-0.03, 0.2]) +ax.set_xlim([-0.75, 0.75]) +plt.legend() +plt.grid(True) +plt.tight_layout() +plt.show() +``` + +This function is **infinitely differentiable** (smooth) at $x = 0$, but **not analytic** there: all of its derivatives at 0 vanish, so every Taylor polynomial is the zero function. Yet the function is clearly nonzero for any $x \neq 0$. + +* The true function $f(x)$ (black curve), sharply rising near zero. +* All Taylor polynomials (dashed lines) are identically zero and fail to approximate the function anywhere except exactly at $x = 0$. + +So: +✅ smooth ≠ analytic +✅ analytic ⇒ smooth +❌ smooth ⇒ analytic + + +## 🔄 How This Relates to the Big-O Remainder + +* The **Big‑O bound** tells you that the remainder **goes to zero like $(x - a)^{n+1}$** near $a$, *for fixed $n$*. +* But to be **analytic**, you need: + + $$ + \lim_{n \to \infty} R_n(x) = 0 + \quad \text{for all } x \text{ in a neighborhood of } a + $$ + + i.e., convergence of the full infinite series, not just the rate of vanishing of each finite approximation. + +So, **Big-O bounds are necessary** (they control approximation error), but **not sufficient** for analyticity. You need the entire remainder sequence $R_n(x) \to 0$ for analytic behavior. + +--- + +## 🧠 Summary Table + +| Property | What It Implies | +| ----------------- | ------------------------------------------- | +| Smooth $C^\infty$ | All derivatives exist and are continuous | +| Analytic | Taylor series converges to function locally | +| Big-O remainder | Controls approximation error for fixed $n$ | +| $R_n(x) \to 0$ | Required for analyticity | + + + +## Taylor Expansion in Multiple Variables + +Recall, that we can create a locally linear approximation to a function at a point $\mathbf{x}_0 \in \mathbb{R}^d $ using the gradient at $\nabla f(\mathbf{x}_0)$. + +$$ f(\mathbf{x}) \approx f(\mathbf{x}_0) + \nabla f(\mathbf{x}_0)^\top (\mathbf{x} - \mathbf{x}_0) . $$ + +This affine approximation is also known as the **first-order Taylor approximation**. +It agrees with the original function in value and gradient at the point $ \mathbf{x}_0 $. 
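+We can check this agreement symbolically. The following short sketch (our own addition; it reuses the example function $f(x, y) = \log(1 + x^2 + y^2)$ and the expansion point $(0.3, 0.3)$ that appear below) builds the affine approximation with sympy and confirms that its value and gradient coincide with those of $f$ at $\mathbf{x}_0$:
+
+```{code-cell} ipython3
+:tags: [hide-input]
+import sympy as sp
+
+x, y = sp.symbols('x y')
+f = sp.log(1 + x**2 + y**2)                      # example function (same as below)
+x0, y0 = sp.Rational(3, 10), sp.Rational(3, 10)  # expansion point (0.3, 0.3)
+
+grad = sp.Matrix([sp.diff(f, x), sp.diff(f, y)])
+f0, g0 = f.subs({x: x0, y: y0}), grad.subs({x: x0, y: y0})
+
+# first-order (affine) Taylor approximation
+f_lin = f0 + g0.dot(sp.Matrix([x - x0, y - y0]))
+
+# value and gradient agree with f at (x0, y0): both prints give zero
+print(sp.simplify(f_lin.subs({x: x0, y: y0}) - f0))
+print(sp.simplify(sp.Matrix([sp.diff(f_lin, x), sp.diff(f_lin, y)]).subs({x: x0, y: y0}) - g0))
+```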
+ +If you explicitly include the second-order term evaluated at $\mathbf{x}_0$, then you’re writing the **second-order Taylor expansion**: + +$$ +f(\mathbf{x}) \approx f(\mathbf{x}_0) + \nabla f(\mathbf{x}_0)^T (\mathbf{x} - \mathbf{x}_0) + \frac{1}{2} (\mathbf{x} - \mathbf{x}_0)^T \nabla^2 f(\mathbf{x}_0) (\mathbf{x} - \mathbf{x}_0) +$$ + +This is a **local quadratic approximation** to the function. It agrees with the original function in value, gradient, and Hessian at the point $ \mathbf{x}_0 $. +The Hessian appears naturally in the **second-order Taylor approximation** of a function around a point $ \mathbf{x}_0 \in \mathbb{R}^d $. + + +- The **gradient** term captures the linear behavior (slope) of the function near $ \mathbf{x}_0 $. +- The **Hessian** term captures the curvature — how the gradient changes in different directions. +- In this case, the remainder (if stated) would involve third derivatives, and the approximation is called **second-order** because you're explicitly using second-order information in the main approximation. + +--- + +We illustrate both the first-order and the second-order Taylor expansion using the following function. + +$$ f(x, y) = \log(1 + x^2 + y^2) $$ + +We compute the first-order and second-order Taylor approximations at the point $ (x_0, y_0) = (0.3, 0.3) $. + +The true function and the linear approximation match in value and gradient at the point $ (x_0, y_0)$ but differ elsewhere. Similarly, the quadratic approximation match in value, gradient, and Hessian at this point but differ elsewhere. + +```{code-cell} ipython3 +:tags: [hide-input] +import numpy as np +import matplotlib.pyplot as plt +from mpl_toolkits.mplot3d import Axes3D + +# Define the function and its gradient and Hessian +f = lambda x, y: np.log(1 + x**2 + y**2) +x0, y0 = 0.3, 0.3 + +# Compute value, gradient, and Hessian at (x0, y0) +r2 = x0**2 + y0**2 +f0 = np.log(1 + r2) +grad = np.array([2*x0, 2*y0]) / (1 + r2) +H = (2 / (1 + r2)) * np.eye(2) - (4 * np.outer([x0, y0], [x0, y0])) / (1 + r2)**2 + +# Taylor expansion up to second order +def f_taylor_first_order(x, y): + dx = x - x0 + dy = y - y0 + delta = np.array([dx, dy]) + return f0 + (grad @ delta).sum() + +# Taylor expansion up to second order +def f_taylor_second_order(x, y): + dx = x - x0 + dy = y - y0 + delta = np.array([dx, dy]) + return f0 + (grad @ delta).sum() + 0.5 * (delta @ H @ delta).sum() + +# Create grid for plotting +x_vals = np.linspace(x0-1, x0+1, 100) +y_vals = np.linspace(y0-1, y0+1, 100) +X, Y = np.meshgrid(x_vals, y_vals) +Z_true = f(X,Y) +Z_first = np.zeros(X.shape) +Z_second = np.zeros(X.shape) + +for i in range(X.shape[0]): + for j in range(X.shape[1]): + Z_first[i,j] = f_taylor_first_order(X[i,j],Y[i,j]) + Z_second[i,j] = f_taylor_second_order(X[i,j],Y[i,j]) + +# Plot both Taylor approximations +fig = plt.figure(figsize=(14, 6)) +ax1 = fig.add_subplot(1, 2, 1, projection='3d') +ax2 = fig.add_subplot(1, 2, 2, projection='3d') + +true_surface1 = ax1.plot_surface(X, Y, Z_true, cmap='viridis', alpha=0.6) +approx_surface1 = ax1.plot_surface(X, Y, Z_first, cmap='coolwarm', alpha=0.7) +ax1.scatter(x0, y0, f0, color='red', s=50, label=r'$\mathbf{x}_0$') +ax1.set_title("First-Order Taylor Approximation") +ax1.set_xlabel("x") +ax1.set_ylabel("y") +ax1.legend() +ax1.set_zlim([-0.5,2]) + +true_surface2 = ax2.plot_surface(X, Y, Z_true, cmap='viridis', alpha=0.6) +approx_surface2 = ax2.plot_surface(X, Y, Z_second, cmap='coolwarm', alpha=0.7) +ax2.scatter(x0, y0, f0, color='red', s=50, label=r'$\mathbf{x}_0$') 
+ax2.set_title("Second-Order Taylor Approximation") +ax2.set_xlabel("x") +ax2.set_ylabel("y") +ax2.legend() +ax2.set_zlim([-0.5,2]) + +plt.tight_layout() +plt.show() +``` + +This visualization shows how the first-order (left) and second-order (right) Taylor expansions approximate the original function locally around the point $ (0.3,0.3) $, but deviates farther away. Both approximations are shown in blue to red and the original function in yellow to green colors. + + +## Taylor's Theorem in Multiple Variables + +:::{prf:theorem} Taylor’s Theorem in Multiple Variables (Second-Order Remainder) +:label: thm-taylor-multiple-first +:nonumber: + +Let $f: \mathbb{R}^d \to \mathbb{R}$ be a function that is **three times continuously differentiable** on an open set $U \subset \mathbb{R}^d$. Let $\mathbf{x}_0 \in U$, and let $\mathbf{x} \in U$ be such that the **line segment** between $\mathbf{x}_0$ and $\mathbf{x}$ lies entirely in $U$. Then: + +$$ +f(\mathbf{x}) = f(\mathbf{x}_0) + \nabla f(\mathbf{x}_0)^T (\mathbf{x} - \mathbf{x}_0) + \frac{1}{2} (\mathbf{x} - \mathbf{x}_0)^T \nabla^2 f(\boldsymbol{\xi}) (\mathbf{x} - \mathbf{x}_0) +$$ + +for some point $\boldsymbol{\xi}$ on the open segment between $\mathbf{x}_0$ and $\mathbf{x}$. +::: + +This is the **first-order Taylor approximation** with **remainder in integral form or mean value form**. + +We observe: + +* We’re approximating $f(\mathbf{x})$ using only the **first-order derivative**, but the **error** (or remainder) is controlled by the **second-order derivative**, specifically involving the Hessian at some intermediate point $\boldsymbol{\xi}$. +* Therefore, it's a **first-order approximation with a second-order remainder**. + +:::{prf:theorem} Taylor’s Theorem in Multiple Variables (Third-Order Integral Remainder) +:label: thm-taylor-multiple-second +:nonumber: + +Let $f: \mathbb{R}^d \to \mathbb{R}$ be **four times continuously differentiable** on an open set $U \subset \mathbb{R}^d$, and let $\mathbf{x}_0, \mathbf{x} \in U$ such that the line segment between them lies entirely in $U$. Then: + +$$ +f(\mathbf{x}) = f(\mathbf{x}_0) ++ \nabla f(\mathbf{x}_0)^T (\mathbf{x} - \mathbf{x}_0) ++ \frac{1}{2} (\mathbf{x} - \mathbf{x}_0)^T \nabla^2 f(\mathbf{x}_0) (\mathbf{x} - \mathbf{x}_0) ++ \frac{1}{6} \sum_{i,j,k=1}^d \frac{\partial^3 f}{\partial x_i \partial x_j \partial x_k}(\boldsymbol{\xi}) (x_i - x_{0,i})(x_j - x_{0,j})(x_k - x_{0,k}) +$$ + +for some $\boldsymbol{\xi}$ on the segment between $\mathbf{x}_0$ and $\mathbf{x}$. +::: + + +**Notes on Higher-Order Remainders** + +* The **third-order term** involves a **third-order tensor** (all third partial derivatives), and the remainder is often written using **multi-index notation** or **tensor contraction**. +* For applications in optimization and machine learning, most practical Taylor approximations stop at **second-order**, because third- and higher-order terms are expensive to compute and rarely needed unless using higher-order optimization methods. 
+ + + +:::{prf:theorem} Big-O Remainder in Multivariable Case +:label: thm-taylor-multiple-BigO +:nonumber: + +For $f: \mathbb{R}^d \to \mathbb{R}$, we can write: + +$$ +f(\mathbf{x}) = \sum_{|\alpha| \leq n} \frac{D^\alpha f(\mathbf{x}_0)}{\alpha!} (\mathbf{x} - \mathbf{x}_0)^\alpha + \mathcal{O}(\|\mathbf{x} - \mathbf{x}_0\|^{n+1}) +\quad \text{as } \mathbf{x} \to \mathbf{x}_0 +$$ + +Where: + +* $\alpha \in \mathbb{N}^d$ is a multi-index, +* $D^\alpha f$ is the partial derivative corresponding to $\alpha$, +* $(\mathbf{x} - \mathbf{x}_0)^\alpha = \prod_i (x_i - x_{0,i})^{\alpha_i}$, +* And $|\alpha| = \sum_i \alpha_i$. +::: + +* The **exact form** (Lagrange or integral remainder) gives precise values, but is often impractical. +* The **Big-O remainder** focuses on **how the error behaves**, not what it is exactly. +* This is especially useful in: + + * Error estimates + * Convergence analysis + * Algorithm design (e.g. gradient descent, Newton’s method) + +While we can state and prove Taylor's theorem for a remainder of arbitrary order, we prove only the version of the theorem for the first order Taylor expansion with second-order remainder. + +:::{prf:proof} Proof Sketch (Mean Value Form of the Remainder) + +Let’s define the path: + +$$ +\gamma(t) = \mathbf{x}_0 + t(\mathbf{x} - \mathbf{x}_0), \quad t \in [0,1] +$$ + +This is a straight-line path from $\mathbf{x}_0$ to $\mathbf{x}$. + +Define the composite function $g(t) = f(\gamma(t))$. Then $g : [0,1] \to \mathbb{R}$ is a function of one variable. + +Using the **chain rule**, we have: + +$$ +g'(t) = \nabla f(\gamma(t))^T (\mathbf{x} - \mathbf{x}_0) +$$ + +and + +$$ +g''(t) = (\mathbf{x} - \mathbf{x}_0)^T \nabla^2 f(\gamma(t)) (\mathbf{x} - \mathbf{x}_0) +$$ + +Now apply the **Taylor expansion of $g(t)$ around $t = 0$** with **Lagrange remainder** (from single-variable calculus): + +$$ +g(1) = g(0) + g'(0) + \frac{1}{2} g''(\tau) \quad \text{for some } \tau \in (0,1) +$$ + +Substitute back: + +* $g(0) = f(\mathbf{x}_0)$ +* $g'(0) = \nabla f(\mathbf{x}_0)^T (\mathbf{x} - \mathbf{x}_0)$ +* $g''(\tau) = (\mathbf{x} - \mathbf{x}_0)^T \nabla^2 f(\gamma(\tau)) (\mathbf{x} - \mathbf{x}_0)$ + +So: + +$$ +f(\mathbf{x}) = f(\mathbf{x}_0) + \nabla f(\mathbf{x}_0)^T (\mathbf{x} - \mathbf{x}_0) + \frac{1}{2} (\mathbf{x} - \mathbf{x}_0)^T \nabla^2 f(\boldsymbol{\xi}) (\mathbf{x} - \mathbf{x}_0) +$$ + +where $\boldsymbol{\xi} = \gamma(\tau)$ lies on the open segment between $\mathbf{x}_0$ and $\mathbf{x}$. + +Q.E.D. +::: + + + + + + +--- + +## 🔍 Summary + +| Expansion Type | Uses | Remainder Involves | +| -------------- | ------------------------------------------- | ---------------------------- | +| First-order | $f, \nabla f$ at $\mathbf{x}_0$ | Second derivatives (Hessian) | +| Second-order | $f, \nabla f, \nabla^2 f$ at $\mathbf{x}_0$ | Third derivatives | + +## Outlook +A nice property of second-order Taylor expansion is that the resulting function is a quadratic and that we know how to analytically solve quadratic optimization problems. This observation is the key idea behind Newton's method. Thus, similarly to how linear approximation using the gradient (a.k.a. first-order Taylor expansion) was the basis for first-order optimization, the second order Taylor expansion will be the basis for second-order optimization methods. However, before we delve into second-order optimization, we have to study the properties of matrices such as the Hessian at a deeper level. 
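+To preview this idea in one dimension (a sketch added for illustration; the test function and starting point are arbitrary, and the starting point must lie in a region of positive curvature for the iteration to behave well), each Newton step jumps to the minimizer of the local quadratic model:
+
+```{code-cell} ipython3
+:tags: [hide-input]
+import numpy as np
+
+f   = lambda x: np.log(1 + x**2)               # smooth test function with minimum at x = 0
+df  = lambda x: 2 * x / (1 + x**2)
+d2f = lambda x: (2 - 2 * x**2) / (1 + x**2)**2
+
+x = 0.5
+for _ in range(5):
+    # minimizing f(x) + f'(x) dx + 0.5 f''(x) dx^2 over dx gives dx = -f'(x) / f''(x)
+    x = x - df(x) / d2f(x)
+print(x)  # close to the minimizer x = 0
+```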
\ No newline at end of file diff --git a/drafts/chapter_convexity/convexity.md b/book/chapter_convexity/convex_functions.md similarity index 67% rename from drafts/chapter_convexity/convexity.md rename to book/chapter_convexity/convex_functions.md index 5f78c83..585f30e 100644 --- a/drafts/chapter_convexity/convexity.md +++ b/book/chapter_convexity/convex_functions.md @@ -1,44 +1,16 @@ -## Convexity - -**Convexity** is a term that pertains to both sets and functions. For -functions, there are different degrees of convexity, and how convex a -function is tells us a lot about its minima: do they exist, are they -unique, how quickly can we find them using optimization algorithms, etc. -In this section, we present basic results regarding convexity, strict -convexity, and strong convexity. - -### Convex sets - -::: center -![image](../figures/convex-set.png) -A convex set -::: - -::: center -![image](../figures/nonconvex-set.png) -A non-convex set -::: - -A set $\mathcal{X} \subseteq \mathbb{R}^d$ is **convex** if - -$$t\mathbf{x} + (1-t)\mathbf{y} \in \mathcal{X}$$ - -for all -$\mathbf{x}, \mathbf{y} \in \mathcal{X}$ and all $t \in [0,1]$. - -Geometrically, this means that all the points on the line segment -between any two points in $\mathcal{X}$ are also in $\mathcal{X}$. See -Figure [1](#fig:convexset){reference-type="ref" -reference="fig:convexset"} for a visual. - -Why do we care whether or not a set is convex? We will see later that -the nature of minima can depend greatly on whether or not the feasible -set is convex. Undesirable pathological results can occur when we allow -the feasible set to be arbitrary, so for proofs we will need to assume -that it is convex. Fortunately, we often want to minimize over all of -$\mathbb{R}^d$, which is easily seen to be a convex set. - -### Basics of convex functions +--- +jupytext: + text_representation: + extension: .md + format_name: myst + format_version: 0.13 + jupytext_version: 1.16.7 +kernelspec: + display_name: math4ml + language: python + name: python3 +--- +# Basics of convex functions In the remainder of this section, assume $f : \mathbb{R}^d \to \mathbb{R}$ unless otherwise noted. We'll start @@ -59,27 +31,96 @@ A function $f$ is **strongly convex with parameter $m$** (or $$\mathbf{x} \mapsto f(\mathbf{x}) - \frac{m}{2}\|\mathbf{x}\|_2^2$$ -is -convex. +is convex. These conditions are given in increasing order of strength; strong convexity implies strict convexity which implies convexity. - ::: center -![What convex functions look like](../figures/convex-function.png) -What convex functions look like -::: + + +## Geometric interpretation +The following figure illustrates the three types of convexity: Geometrically, convexity means that the line segment between two points -on the graph of $f$ lies on or above the graph itself. See Figure -[2](#fig:convexfunction){reference-type="ref" -reference="fig:convexfunction"} for a visual. +on the graph of $f$ lies on or above the graph itself. 
+ +```{code-cell} ipython3 +:tags: [hide-input] +import numpy as np +import matplotlib.pyplot as plt + +# Define a convex function +f = lambda x: x**2 + +# Define x values and compute y +x = np.linspace(-2, 2, 400) +y = f(x) + +# Choose two points on the graph +x1, x2 = -1.5, 1.0 +y1, y2 = f(x1), f(x2) + +# Compute the line segment between the two points +t = np.linspace(0, 1, 100) +xt = t * x1 + (1 - t) * x2 +yt_line = t * y1 + (1 - t) * y2 + +# Plot the function and the line segment +plt.figure(figsize=(8, 6)) +plt.plot(x, y, label=r'$f(x) = x^2$', color='blue') +plt.plot(xt, yt_line, 'r--', label='Line segment') +plt.plot([x1, x2], [y1, y2], 'ro') # endpoints +plt.title("Geometric Interpretation of Convexity") +plt.xlabel("x") +plt.ylabel("f(x)") +plt.legend() +plt.grid(True) +plt.tight_layout() +plt.show() +``` Strict convexity means that the graph of $f$ lies strictly above the -line segment, except at the segment endpoints. (So actually the function -in the figure appears to be strictly convex.) - -### Consequences of convexity +line segment, except at the segment endpoints. +(So actually the function in the figure appears to be strictly convex.) + + +```{code-cell} ipython3 +:tags: [hide-input] +import numpy as np +import matplotlib.pyplot as plt + +# Define x values +x = np.linspace(-2, 2, 400) + +# Define three functions: convex, strictly convex, and strongly convex +f1 = lambda x: np.abs(x) # convex but not strictly convex +f2 = lambda x: x**4 # strictly convex but not strongly convex +f3 = lambda x: x**2 + 1 # strongly convex + +# Evaluate functions +y1 = f1(x) +y2 = f2(x) +y3 = f3(x) + +# Plot the functions +plt.figure(figsize=(10, 6)) +plt.plot(x, y1, label=r'$f(x) = |x|$ (Convex)', linestyle='--') +plt.plot(x, y2, label=r'$f(x) = x^4$ (Strictly Convex)', linestyle='-.') +plt.plot(x, y3, label=r'$f(x) = x^2 + 1$ (Strongly Convex)', linestyle='-') +plt.title("Examples of Convex, Strictly Convex, and Strongly Convex Functions") +plt.xlabel("x") +plt.ylabel("f(x)") +plt.legend() +plt.grid(True) +plt.tight_layout() +plt.show() +``` +* A **convex but not strictly convex** function $f(x) = |x|$ +* A **strictly convex but not strongly convex** function $f(x) = x^4$ +* A **strongly convex** function $f(x) = x^2 + 1$ + + +## Consequences of convexity Why do we care if a function is (strictly/strongly) convex? @@ -87,16 +128,23 @@ Basically, our various notions of convexity have implications about the nature of minima. It should not be surprising that the stronger conditions tell us more about the minima. -*Proposition.* -Let $\mathcal{X}$ be a convex set. If $f$ is convex, then any local -minimum of $f$ in $\mathcal{X}$ is also a global minimum. +:::{prf:proposition} Minima of convex functions +:label: prop-convex-minima +:nonumber: +Let $\mathcal{X}$ be a convex set. + +If $f$ is convex, then any local minimum of $f$ in $\mathcal{X}$ is also a global minimum. +::: + +:::{prf:proof} +Suppose $f$ is convex, and let $\mathbf{x}^*$ be a local +minimum of $f$ in $\mathcal{X}$. +Then for some neighborhood $N \subseteq \mathcal{X}$ about $\mathbf{x}^*$, we have +$f(\mathbf{x}) \geq f(\mathbf{x}^*)$ for all $\mathbf{x} \in N$. -*Proof.* Suppose $f$ is convex, and let $\mathbf{x}^*$ be a local -minimum of $f$ in $\mathcal{X}$. Then for some neighborhood -$N \subseteq \mathcal{X}$ about $\mathbf{x}^*$, we have -$f(\mathbf{x}) \geq f(\mathbf{x}^*)$ for all $\mathbf{x} \in N$. 
Suppose +Suppose towards a contradiction that there exists $\tilde{\mathbf{x}} \in \mathcal{X}$ such that $f(\tilde{\mathbf{x}}) < f(\mathbf{x}^*)$. @@ -118,16 +166,22 @@ above inequality, a contradiction. It follows that $f(\mathbf{x}^*) \leq f(\mathbf{x})$ for all $\mathbf{x} \in \mathcal{X}$, so $\mathbf{x}^*$ is a global minimum of $f$ in $\mathcal{X}$. ◻ +::: + +:::{prf:proposition} Minima stricly convex functions +:label: prop-minima-striclty-convex +:nonumber: +Let $\mathcal{X}$ be a convex set. -*Proposition.* -Let $\mathcal{X}$ be a convex set. If $f$ is strictly convex, then there +If $f$ is strictly convex, then there exists at most one local minimum of $f$ in $\mathcal{X}$. Consequently, if it exists it is the unique global minimum of $f$ in $\mathcal{X}$. +::: +:::{prf:proof} - -*Proof.* The second sentence follows from the first, so all we must show +The second sentence follows from the first, so all we must show is that if a local minimum exists in $\mathcal{X}$ then it is unique. Suppose $\mathbf{x}^*$ is a local minimum of $f$ in $\mathcal{X}$, and @@ -145,12 +199,14 @@ of $f$, $$f(\mathbf{x}(t)) < tf(\mathbf{x}^*) + (1-t)f(\tilde{\mathbf{x}}) = tf(\mathbf{x}^*) + (1-t)f(\mathbf{x}^*) = f(\mathbf{x}^*)$$ -for all $t \in (0,1)$. But this contradicts the fact that $\mathbf{x}^*$ +for all $t \in (0,1)$. + +But this contradicts the fact that $\mathbf{x}^*$ is a global minimum. Therefore if $\tilde{\mathbf{x}}$ is a local minimum of $f$ in $\mathcal{X}$, then $\tilde{\mathbf{x}} = \mathbf{x}^*$, so $\mathbf{x}^*$ is the unique minimum in $\mathcal{X}$. ◻ - +::: It is worthwhile to examine how the feasible set affects the optimization problem. We will see why the assumption that $\mathcal{X}$ @@ -158,6 +214,7 @@ is convex is needed in the results above. Consider the function $f(x) = x^2$, which is a strictly convex function. The unique global minimum of this function in $\mathbb{R}$ is $x = 0$. + But let's see what happens when we change the feasible set $\mathcal{X}$. @@ -178,7 +235,7 @@ $\mathcal{X}$. non-convex, and we can see that there are two global minima ($x = \pm 1$). -### Showing that a function is convex +## Showing that a function is convex Hopefully the previous section has convinced the reader that convexity is an important property. Next we turn to the issue of showing that a @@ -186,12 +243,17 @@ function is (strictly/strongly) convex. It is of course possible (in principle) to directly show that the condition in the definition holds, but this is usually not the easiest way. -*Proposition.* +:::{prf:proposition} Norms +:label: prop-norms-convex +:nonumber: + Norms are convex. +::: +:::{prf:proof} -*Proof.* Let $\|\cdot\|$ be a norm on a vector space $V$. Then for all +Let $\|\cdot\|$ be a norm on a vector space $V$. Then for all $\mathbf{x}, \mathbf{y} \in V$ and $t \in [0,1]$, $$\|t\mathbf{x} + (1-t)\mathbf{y}\| \leq \|t\mathbf{x}\| + \|(1-t)\mathbf{y}\| = |t|\|\mathbf{x}\| + |1-t|\|\mathbf{y}\| = t\|\mathbf{x}\| + (1-t)\|\mathbf{y}\|$$ @@ -199,21 +261,31 @@ $$\|t\mathbf{x} + (1-t)\mathbf{y}\| \leq \|t\mathbf{x}\| + \|(1-t)\mathbf{y}\| = where we have used respectively the triangle inequality, the homogeneity of norms, and the fact that $t$ and $1-t$ are nonnegative. Hence $\|\cdot\|$ is convex. ◻ +::: + +:::{prf:proposition} Gradient of Convex Functions +:label: prop-convex-functions-graph +:nonumber: +Suppose $f$ is differentiable. -*Proposition.* -Suppose $f$ is differentiable. 
Then $f$ is convex if and only if +Then $f$ is convex if and only if $$f(\mathbf{y}) \geq f(\mathbf{x}) + \langle \nabla f(\mathbf{x}), \mathbf{y} - \mathbf{x} \rangle$$ for all $\mathbf{x}, \mathbf{y} \in \operatorname{dom} f$. +::: +:::{prf:proof} -*Proof.* To-do. ◻ +To-do. ◻ +::: +:::{prf:proposition} Hessian of Convex Functions +:label: prop-Hessian-convex +:nonumber: -*Proposition.* Suppose $f$ is twice differentiable. Then (i) $f$ is convex if and only if $\nabla^2 f(\mathbf{x}) \succeq 0$ for @@ -225,18 +297,24 @@ Suppose $f$ is twice differentiable. Then (iii) $f$ is $m$-strongly convex if and only if $\nabla^2 f(\mathbf{x}) \succeq mI$ for all $\mathbf{x} \in \operatorname{dom} f$. +::: +:::{prf:proof} +Omitted. ◻ +::: -*Proof.* Omitted. ◻ - +:::{prf:proposition} Scaling Convex Functions +:label: prop-scaling-convex-functions +:nonumber: -*Proposition.* If $f$ is convex and $\alpha \geq 0$, then $\alpha f$ is convex. +::: +:::{prf:proof} -*Proof.* Suppose $f$ is convex and $\alpha \geq 0$. Then for all +Suppose $f$ is convex and $\alpha \geq 0$. Then for all $\mathbf{x}, \mathbf{y} \in \operatorname{dom}(\alpha f) = \operatorname{dom} f$, $$\begin{aligned} @@ -247,15 +325,24 @@ $$\begin{aligned} \end{aligned}$$ so $\alpha f$ is convex. ◻ +::: + +:::{prf:proposition} Sum of Convex Functions +:label: prop-sum-convex-functions +:nonumber: -*Proposition.* -If $f$ and $g$ are convex, then $f+g$ is convex. Furthermore, if $g$ is +If $f$ and $g$ are convex, then $f+g$ is convex. + +Furthermore, if $g$ is strictly convex, then $f+g$ is strictly convex, and if $g$ is $m$-strongly convex, then $f+g$ is $m$-strongly convex. +::: +:::{prf:proof} -*Proof.* Suppose $f$ and $g$ are convex. Then for all +Suppose $f$ and $g$ are convex. +Then for all $\mathbf{x}, \mathbf{y} \in \operatorname{dom} (f+g) = \operatorname{dom} f \cap \operatorname{dom} g$, $$\begin{aligned} @@ -274,34 +361,44 @@ convex. If $g$ is $m$-strongly convex, then the function $h(\mathbf{x}) \equiv g(\mathbf{x}) - \frac{m}{2}\|\mathbf{x}\|_2^2$ is -convex, so $f+h$ is convex. But +convex, so $f+h$ is convex. + +But $$(f+h)(\mathbf{x}) \equiv f(\mathbf{x}) + h(\mathbf{x}) \equiv f(\mathbf{x}) + g(\mathbf{x}) - \frac{m}{2}\|\mathbf{x}\|_2^2 \equiv (f+g)(\mathbf{x}) - \frac{m}{2}\|\mathbf{x}\|_2^2$$ so $f+g$ is $m$-strongly convex. ◻ +::: +:::{prf:proposition} Weighted Sum of Convex Functions +:label: prop-convex-functions-weighted-sum +:nonumber: -*Proposition.* If $f_1, \dots, f_n$ are convex and $\alpha_1, \dots, \alpha_n \geq 0$, then $$\sum_{i=1}^n \alpha_i f_i$$ is convex. +::: +:::{prf:proof} +Follows from the previous two propositions by induction. ◻ +::: +:::{prf:proposition} Combination of Affine and Convex Functions +:label: prop-linear-convex +:nonumber: -*Proof.* Follows from the previous two propositions by induction. ◻ - - -*Proposition.* If $f$ is convex, then $g(\mathbf{x}) \equiv f(\mathbf{A}\mathbf{x} + \mathbf{b})$ is convex for any appropriately-sized $\mathbf{A}$ and $\mathbf{b}$. +::: +:::{prf:proof} -*Proof.* Suppose $f$ is convex and $g$ is defined like so. Then for all +Suppose $f$ is convex and $g$ is defined like so. Then for all $\mathbf{x}, \mathbf{y} \in \operatorname{dom} g$, $$\begin{aligned} @@ -314,15 +411,20 @@ g(t\mathbf{x} + (1-t)\mathbf{y}) &= f(\mathbf{A}(t\mathbf{x} + (1-t)\mathbf{y}) \end{aligned}$$ Thus $g$ is convex. 
◻ +::: +:::{prf:proposition} Maximum of Convex Functions +:label: prop-max-convex-functions +:nonumber: -*Proposition.* If $f$ and $g$ are convex, then $h(\mathbf{x}) \equiv \max\{f(\mathbf{x}), g(\mathbf{x})\}$ is convex. +::: +:::{prf:proof} -*Proof.* Suppose $f$ and $g$ are convex and $h$ is defined like so. Then +Suppose $f$ and $g$ are convex and $h$ is defined like so. Then for all $\mathbf{x}, \mathbf{y} \in \operatorname{dom} h$, $$\begin{aligned} @@ -339,7 +441,7 @@ $\max\{a,b\} \leq \max\{c,d\}$. In the second inequality we have used the fact that $\max\{a+b, c+d\} \leq \max\{a,c\} + \max\{b,d\}$. Thus $h$ is convex. ◻ - +::: ### Examples @@ -352,22 +454,26 @@ Functions that are convex but not strictly convex: (i) $f(\mathbf{x}) = \mathbf{w}^{\!\top\!}\mathbf{x} + \alpha$ for any $\mathbf{w} \in \mathbb{R}^d, \alpha \in \mathbb{R}$. Such a function is called an **affine function**, and it is both convex and - concave. (In fact, a function is affine if and only if it is both - convex and concave.) Note that linear functions and constant + concave. + (In fact, a function is affine if and only if it is both convex and concave.) + Note that linear functions and constant functions are special cases of affine functions. (ii) $f(\mathbf{x}) = \|\mathbf{x}\|_1$ Functions that are strictly but not strongly convex: -(i) $f(x) = x^4$. This example is interesting because it is strictly +(i) $f(x) = x^4$. +This example is interesting because it is strictly convex but you cannot show this fact via a second-order argument (since $f''(0) = 0$). -(ii) $f(x) = \exp(x)$. This example is interesting because it's bounded +(ii) $f(x) = \exp(x)$. +This example is interesting because it's bounded below but has no local minimum. -(iii) $f(x) = -\log x$. This example is interesting because it's +(iii) $f(x) = -\log x$. +This example is interesting because it's strictly convex but not bounded below. Functions that are strongly convex: diff --git a/book/chapter_convexity/convex_sets.md b/book/chapter_convexity/convex_sets.md new file mode 100644 index 0000000..69cecba --- /dev/null +++ b/book/chapter_convexity/convex_sets.md @@ -0,0 +1,108 @@ +--- +jupytext: + text_representation: + extension: .md + format_name: myst + format_version: 0.13 + jupytext_version: 1.16.7 +kernelspec: + display_name: math4ml + language: python + name: python3 +--- +# Convex sets + +```{code-cell} ipython3 +:tags: [hide-input] +import matplotlib.pyplot as plt +import numpy as np + +# Generate a convex polygon (e.g., a convex hull of some points) +points = np.array([ + [1, 1], [2, 3], [4, 4], [6, 3], [5, 1], [3, 0] +]) + +# Choose two points inside the polygon +A = np.array([2.5, 2]) +B = np.array([4.5, 2.5]) + +# Line segment between A and B +t = np.linspace(0, 1, 100) +segment = np.outer(1 - t, A) + np.outer(t, B) + +# Plot the convex set and the line segment +plt.figure(figsize=(8, 6)) +plt.fill(points[:, 0], points[:, 1], alpha=0.3, label="Convex Set", edgecolor='blue') +plt.plot(segment[:, 0], segment[:, 1], 'r--', label="Line segment AB") +plt.plot(*A, 'ro', label="Point A") +plt.plot(*B, 'go', label="Point B") + +plt.title("A Convex Set") +plt.xlabel("x") +plt.ylabel("y") +plt.axis("equal") +plt.grid(True) +plt.legend() +plt.tight_layout() +plt.show() + +``` +This figure visualizes a **convex set**: a polygon where the line segment connecting any two points within the set (e.g., points A and B) lies entirely inside the set. The red dashed line confirms this key geometric property of convex sets. 
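+As a small computational companion to the figure (a sketch we add here; it reuses the polygon and the points A and B from the code above, and it uses `matplotlib.path.Path` purely as a point-in-polygon test), we can sample many convex combinations of A and B and confirm that all of them lie inside the set:
+
+```{code-cell} ipython3
+:tags: [hide-input]
+import numpy as np
+from matplotlib.path import Path
+
+# same convex polygon and endpoints as in the figure above
+points = np.array([[1, 1], [2, 3], [4, 4], [6, 3], [5, 1], [3, 0]])
+A, B = np.array([2.5, 2]), np.array([4.5, 2.5])
+
+poly = Path(points)
+t = np.linspace(0, 1, 101)
+segment = np.outer(1 - t, A) + np.outer(t, B)  # all convex combinations (1 - t) * A + t * B
+
+print(poly.contains_points(segment).all())  # True: the whole segment stays in the set
+```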
+ + + +```{code-cell} ipython3 +:tags: [hide-input] +import matplotlib.pyplot as plt +import numpy as np + +# Define a non-convex polygon (e.g., a simple star shape or concave polygon) +points = np.array([ + [1, 1], [2, 3], [3, 1.5], [4, 3], [5, 1], [3, 0] +]) + +# Choose two points inside the set where the connecting line goes outside +A = np.array([2, 2]) +B = np.array([4, 2]) + +# Line segment between A and B +t = np.linspace(0, 1, 100) +segment = np.outer(1 - t, A) + np.outer(t, B) + +# Plot the non-convex set and the line segment +plt.figure(figsize=(8, 6)) +plt.fill(points[:, 0], points[:, 1], alpha=0.3, label="Non-Convex Set", edgecolor='blue') +plt.plot(segment[:, 0], segment[:, 1], 'r--', label="Line segment AB") +plt.plot(*A, 'ro', label="Point A") +plt.plot(*B, 'go', label="Point B") + +plt.title("A Non-Convex Set") +plt.xlabel("x") +plt.ylabel("y") +plt.axis("equal") +plt.grid(True) +plt.legend() +plt.tight_layout() +plt.show() + +``` +This figure illustrates a **non-convex set**: a shape where the line segment between two points inside the set (A and B) partially lies **outside** the set. This violation of the convexity condition distinguishes non-convex sets from convex ones. + +A set $\mathcal{X} \subseteq \mathbb{R}^d$ is **convex** if + +$$t\mathbf{x} + (1-t)\mathbf{y} \in \mathcal{X}$$ + +for all +$\mathbf{x}, \mathbf{y} \in \mathcal{X}$ and all $t \in [0,1]$. + +Geometrically, this means that all the points on the line segment +between any two points in $\mathcal{X}$ are also in $\mathcal{X}$. + + +Why do we care whether or not a set is convex? We will see later that the nature of minima can depend greatly on whether or not the feasible set is convex. +Undesirable pathological results can occur when we allow +the feasible set to be arbitrary, so for proofs we will need to assume that it is convex. + +Fortunately, we often want to minimize over all of +$\mathbb{R}^d$, which is easily seen to be a convex set. + diff --git a/book/chapter_convexity/overview_convexity.md b/book/chapter_convexity/overview_convexity.md new file mode 100644 index 0000000..547fb0c --- /dev/null +++ b/book/chapter_convexity/overview_convexity.md @@ -0,0 +1,12 @@ +# Convexity + +**Convexity** is a term that pertains to both sets and functions. For +functions, there are different degrees of convexity, and how convex a +function is tells us a lot about its minima: do they exist, are they +unique, how quickly can we find them using optimization algorithms, etc. + +In this section, we present basic results regarding convexity, strict +convexity, and strong convexity. + +```{tableofcontents} +``` \ No newline at end of file diff --git a/book/chapter_decompositions/Rayleigh_quotients.md b/book/chapter_decompositions/Rayleigh_quotients.md new file mode 100644 index 0000000..80312fe --- /dev/null +++ b/book/chapter_decompositions/Rayleigh_quotients.md @@ -0,0 +1,231 @@ +--- +jupytext: + text_representation: + extension: .md + format_name: myst + format_version: 0.13 + jupytext_version: 1.16.7 +kernelspec: + display_name: Python 3 + language: python + name: python3 +--- +# Rayleigh Quotients + +There turns out to be an interesting connection between the quadratic form of a symmetric matrix and its eigenvalues. 
+This connection is provided by the **Rayleigh quotient** + +> $$R_\mathbf{A}(\mathbf{x}) = \frac{\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x}}{\mathbf{x}^{\!\top\!}\mathbf{x}}$$ + +The Rayleigh quotient has a couple of important properties: + +:::{prf:lemma} Properties of the Rayleigh Quotient +:label: trm-Rayleigh-properties +:nonumber: + +(i) **Scale invariance**: for any vector $\mathbf{x} \neq \mathbf{0}$ + and any scalar $\alpha \neq 0$, + $R_\mathbf{A}(\mathbf{x}) = R_\mathbf{A}(\alpha\mathbf{x})$. + +(ii) If $\mathbf{x}$ is an eigenvector of $\mathbf{A}$ with eigenvalue + $\lambda$, then $R_\mathbf{A}(\mathbf{x}) = \lambda$. +::: + +:::{prf:proof} +(i) + + $$R_\mathbf{A}(\alpha\mathbf{x}) = \frac{(\alpha\mathbf{x})^{\!\top\!}\mathbf{A}(\alpha\mathbf{x})}{(\alpha\mathbf{x})^{\!\top\!}(\alpha\mathbf{x})} = \frac{\alpha^2}{\alpha^2}\frac{\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x}}{\mathbf{x}^{\!\top\!}\mathbf{x}}=R_\mathbf{A}(\mathbf{x}).$$ + +(ii) Let $\mathbf{x}$ be an eigenvector of $\mathbf{A}$ with eigenvalue + $\lambda$, then + + $$R_\mathbf{A}(\mathbf{x})= \frac{\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x}}{\mathbf{x}^{\!\top\!}\mathbf{x}} = \frac{\mathbf{x}^{\!\top\!}(\lambda\mathbf{x})}{\mathbf{x}^{\!\top\!}\mathbf{x}}=\lambda\frac{\mathbf{x}^{\!\top\!}\mathbf{x}}{\mathbf{x}^{\!\top\!}\mathbf{x}} = \lambda.$$ +::: + +We can further show that the Rayleigh quotient is bounded by the largest +and smallest eigenvalues of $\mathbf{A}$. + +But first we will show a useful special case of the final result. + +:::{prf:theorem} Bound Rayleigh Quotient +:label: trm-bound-Rayleigh-quotient +:nonumber: + +For any $\mathbf{x}$ such that $\|\mathbf{x}\|_2 = 1$, + +$$\lambda_{\min}(\mathbf{A}) \leq \mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} \leq \lambda_{\max}(\mathbf{A})$$ + +with equality if and only if $\mathbf{x}$ is a corresponding eigenvector. +::: + +:::{prf:proof} + +We show only the $\max$ case because the argument for the +$\min$ case is entirely analogous. + +Since $\mathbf{A}$ is symmetric, we can decompose it as +$\mathbf{A} = \mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^{\!\top\!}$. + +Then use +the change of variable $\mathbf{y} = \mathbf{Q}^{\!\top\!}\mathbf{x}$, +noting that the relationship between $\mathbf{x}$ and $\mathbf{y}$ is +one-to-one and that $\|\mathbf{y}\|_2 = 1$ since $\mathbf{Q}$ is +orthogonal. + +Hence + +$$\max_{\|\mathbf{x}\|_2 = 1} \mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} = \max_{\|\mathbf{y}\|_2 = 1} \mathbf{y}^{\!\top\!}\mathbf{\Lambda}\mathbf{y} = \max_{y_1^2+\dots+y_n^2=1} \sum_{i=1}^n \lambda_i y_i^2$$ + +Written this way, it is clear that $\mathbf{y}$ maximizes this +expression exactly if and only if it satisfies +$\sum_{i \in I} y_i^2 = 1$ where +$I = \{i : \lambda_i = \max_{j=1,\dots,n} \lambda_j = \lambda_{\max}(\mathbf{A})\}$ +and $y_j = 0$ for $j \not\in I$. + +That is, $I$ contains the index or +indices of the largest eigenvalue. + +In this case, the maximal value of +the expression is + +$$\sum_{i=1}^n \lambda_i y_i^2 = \sum_{i \in I} \lambda_i y_i^2 = \lambda_{\max}(\mathbf{A}) \sum_{i \in I} y_i^2 = \lambda_{\max}(\mathbf{A})$$ + +Then writing $\mathbf{q}_1, \dots, \mathbf{q}_n$ for the columns of +$\mathbf{Q}$, we have + +$$\mathbf{x} = \mathbf{Q}\mathbf{Q}^{\!\top\!}\mathbf{x} = \mathbf{Q}\mathbf{y} = \sum_{i=1}^n y_i\mathbf{q}_i = \sum_{i \in I} y_i\mathbf{q}_i$$ + +where we have used the matrix-vector product identity. + +Recall that $\mathbf{q}_1, \dots, \mathbf{q}_n$ are eigenvectors of +$\mathbf{A}$ and form an orthonormal basis for $\mathbb{R}^n$. 
+ +Therefore by construction, the set $\{\mathbf{q}_i : i \in I\}$ forms an +orthonormal basis for the eigenspace of $\lambda_{\max}(\mathbf{A})$. + +Hence $\mathbf{x}$, which is a linear combination of these, lies in that +eigenspace and thus is an eigenvector of $\mathbf{A}$ corresponding to +$\lambda_{\max}(\mathbf{A})$. + +We have shown that +$\max_{\|\mathbf{x}\|_2 = 1} \mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} = \lambda_{\max}(\mathbf{A})$, +from which we have the general inequality +$\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} \leq \lambda_{\max}(\mathbf{A})$ +for all unit-length $\mathbf{x}$. ◻ +::: + +By the scale invariance of the Rayleigh quotient, we immediately have as +a corollary + +:::{prf:theorem} Min-Max Theorem +:label: trm-min-max +:nonumber: + +For all $\mathbf{x} \neq \mathbf{0}$, + +$$\lambda_{\min}(\mathbf{A}) \leq R_\mathbf{A}(\mathbf{x}) \leq \lambda_{\max}(\mathbf{A})$$ + +with equality if and only if $\mathbf{x}$ is a corresponding +eigenvector. +::: + +:::{prf:proof} + +Let $\mathbf{x}\neq \boldsymbol{0},$ then + +$R_\mathbf{A}(\mathbf{x}) = \frac{\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x}}{\mathbf{x}^{\!\top\!}\mathbf{x}} = \frac{\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x}}{\|\mathbf{x}\|^2} = (\frac{\mathbf{x}}{\|\mathbf{x}\|})^{\!\top\!}\mathbf{A}(\frac{\mathbf{x}}{\|\mathbf{x}\|})$ + +Thus, minimimum and maximum of the Rayleigh quotient are identical to minimum and maximum of the squared form $\mathbf{y}^\top\mathbf{A}\mathbf{y}$ for the unit-norm vector $\mathbf{y}=\mathbf{x}/\|\mathbf{x}\|$: + +$$\lambda_{\min}(\mathbf{A}) \leq R_\mathbf{A}(\mathbf{x}) \leq \lambda_{\max}(\mathbf{A})$$ + +◻ +::: + +```{code-cell} ipython3 +:tags: [hide-input] +import numpy as np +import matplotlib.pyplot as plt + +# Define symmetric matrix +A = np.array([[2, 1], + [1, 3]]) + +# Eigenvalues and eigenvectors +eigvals, eigvecs = np.linalg.eigh(A) +λ_min, λ_max = eigvals + +# Generate unit circle points +theta = np.linspace(0, 2*np.pi, 300) +circle = np.stack((np.cos(theta), np.sin(theta))) + +# Rayleigh quotient computation +R = np.einsum('ij,ji->i', circle.T @ A, circle) # x^T A x +R /= np.einsum('ij,ji->i', circle.T, circle) # x^T x + +# Rayleigh extrema +idx_min = np.argmin(R) +idx_max = np.argmax(R) +x_min = circle[:, idx_min] +x_max = circle[:, idx_max] + +# Prepare grid for quadratic form level sets +x = np.linspace(-2, 2, 400) +y = np.linspace(-2, 2, 400) +X, Y = np.meshgrid(x, y) +XY = np.stack((X, Y), axis=-1) +Z = np.einsum('...i,ij,...j->...', XY, A, XY) +levels = np.linspace(np.min(Z), np.max(Z), 20) + +# Create combined figure +fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6)) + +# Left: Rayleigh quotient on unit circle +sc = ax1.scatter(circle[0], circle[1], c=R, cmap='viridis', s=10) +ax1.quiver(0, 0, x_min[0], x_min[1], color='red', scale=1, scale_units='xy', angles='xy', label='argmin R(x)') +ax1.quiver(0, 0, x_max[0], x_max[1], color='orange', scale=1, scale_units='xy', angles='xy', label='argmax R(x)') +for i in range(2): + eigvec = eigvecs[:, i] + ax1.quiver(0, 0, eigvec[0], eigvec[1], color='black', alpha=0.5, scale=1, scale_units='xy', angles='xy', width=0.008) +ax1.set_title("Rayleigh Quotient on the Unit Circle") +ax1.set_aspect('equal') +ax1.set_xlim(-1.1, 1.1) +ax1.set_ylim(-1.1, 1.1) +ax1.grid(True) +ax1.legend() +plt.colorbar(sc, ax=ax1, label="Rayleigh Quotient $R_A(\\mathbf{x})$") + +# Right: Level sets of quadratic form +contour = ax2.contour(X, Y, Z, levels=levels, cmap='viridis') +ax2.clabel(contour, inline=True, fontsize=8, fmt="%.1f") 
+ax2.set_title("Level Sets of $\\mathbf{x}^\\top \\mathbf{A} \\mathbf{x}$") +ax2.set_xlabel("$x_1$") +ax2.set_ylabel("$x_2$") +ax2.axhline(0, color='gray', lw=0.5) +ax2.axvline(0, color='gray', lw=0.5) +for i in range(2): + vec = eigvecs[:, i] * np.sqrt(eigvals[i]) + ax2.quiver(0, 0, vec[0], vec[1], color='red', scale=1, scale_units='xy', angles='xy', width=0.01, label=f"$\\mathbf{{q}}_{i+1}$") +ax2.set_aspect('equal') +ax2.legend() + +plt.suptitle("Rayleigh Quotient and Quadratic Form Level Sets", fontsize=16) +plt.tight_layout(rect=[0, 0, 1, 0.93]) +plt.show() +``` + +This combined visualization brings together the **Rayleigh quotient** and the **level sets of the quadratic form** $\mathbf{x}^\top \mathbf{A} \mathbf{x}$: + +* **Left panel**: Rayleigh quotient $R_\mathbf{A}(\mathbf{x})$ on the unit circle + + * Color shows how the value varies with direction. + * Extremes occur at eigenvector directions (marked with arrows). + +* **Right panel**: Level sets (contours) of the quadratic form + + * Elliptical shapes aligned with eigenvectors. + * Red vectors indicate principal axes (scaled eigenvectors). + +Together, these panels illustrate how the **direction of a vector determines how strongly it is scaled** by the symmetric matrix, and how this scaling relates to the matrix's **eigenstructure**. + +✅ As guaranteed by the **Min–Max Theorem**, the maximum and minimum of the Rayleigh quotient occur precisely at the **eigenvectors corresponding to the largest and smallest eigenvalues**. diff --git a/book/chapter_decompositions/big_picture.md b/book/chapter_decompositions/big_picture.md new file mode 100644 index 0000000..fe6bed7 --- /dev/null +++ b/book/chapter_decompositions/big_picture.md @@ -0,0 +1,107 @@ +--- +jupytext: + text_representation: + extension: .md + format_name: myst + format_version: 0.13 + jupytext_version: 1.16.7 +kernelspec: + display_name: Python 3 + language: python + name: python3 +--- +# The fundamental subspaces of a matrix + +The fundamental subspaces of a matrix $\mathbf{A}$ are the four subspaces associated with the matrix and its transpose. +These subspaces are important in linear algebra and numerical analysis, particularly in the context of solving linear systems and eigenvalue problems. +We also provide the projections onto these subspaces, which are useful for various applications such as least squares problems and dimensionality reduction. The proof of these projection formulas relies on the properties of the Moore-Penrose pseudoinverse and the orthogonal projections onto subspaces. + +We denote the matrix $\mathbf{A}$ as an $m \times n$ matrix, where $m$ is the number of rows and $n$ is the number of columns. + +The four fundamental subspaces are: + +## 1. **Column Space (Range) of $\mathbf{A}$**: +The column space of a matrix $\mathbf{A}$ is the set of all possible linear combinations of its columns. It represents the span of the columns of $\mathbf{A}$ and is denoted as $\text{Col}(\mathbf{A})$ or $\text{Range}(\mathbf{A})$. + +:::{prf:lemma} Projection onto the Column Space +:label: trm-projection-column-space +:nonumber: + +The projection of a vector $\mathbf{b}$ onto the column space of a matrix $\mathbf{A}$ is given by: + +$$ +\mathbf{P}_{\text{Col}(\mathbf{A})}(\mathbf{b}) = \mathbf{A}\mathbf{A}^+ \mathbf{b} +$$ +::: + +## 2. **Null Space (Kernel) of $\mathbf{A}$**: +The null space of a matrix $\mathbf{A}$ is the set of all vectors $\mathbf{x}$ such that $\mathbf{A}\mathbf{x} = \mathbf{0}$. 
It represents the solutions to the homogeneous equation associated with $\mathbf{A}$ and is denoted as $\text{Null}(\mathbf{A})$ or $\text{Ker}(\mathbf{A})$. + +:::{prf:lemma} Projection onto the Null Space +:label: trm-projection-null-space +:nonumber: + +The projection of a vector $\mathbf{b}$ onto the null space of a matrix $\mathbf{A}$ is given by: + +$$ +\mathbf{P}_{\text{Null}(\mathbf{A})}(\mathbf{b}) = \left(\mathbf{I} - \mathbf{P}_{\text{Col}(\mathbf{A})}\right)(\mathbf{b}) = \mathbf{b} - \mathbf{A}\mathbf{A}^+ \mathbf{b} +$$ +::: + +## 3. **Row Space of $\mathbf{A}$**: +The row space of a matrix $\mathbf{A}$ is the set of all possible linear combinations of its rows. It is equivalent to the column space of its transpose, $\mathbf{A}^\top$, and is denoted as $\text{Row}(\mathbf{A})$ or $\text{Col}(\mathbf{A}^\top)$. + +:::{prf:lemma} Projection onto the Row Space +:label: trm-projection-row-space +:nonumber: + +The projection of a vector $\mathbf{b}$ onto the row space of a matrix $\mathbf{A}$ is given by: + +$$ +\mathbf{P}_{\text{Row}(\mathbf{A})}(\mathbf{b}) = \mathbf{A}^+\mathbf{A}\mathbf{b} +$$ +::: + + +## 4. **Left Null Space (Kernel) of $\mathbf{A}$**: +The left null space of a matrix $\mathbf{A}$ is the set of all vectors $\mathbf{y}$ such that $\mathbf{A}^\top\mathbf{y} = \mathbf{0}$. It represents the solutions to the homogeneous equation associated with $\mathbf{A}^\top$ and is denoted as $\text{Null}(\mathbf{A}^\top)$ or $\text{Ker}(\mathbf{A}^\top)$. + +:::{prf:lemma} Projection onto the Left Null Space +:label: trm-projection-left-null-space +:nonumber: + +The projection of a vector $\mathbf{b}$ onto the left null space of a matrix $\mathbf{A}$ is given by: + +$$ +\mathbf{P}_{\text{Null}(\mathbf{A}^\top)}(\mathbf{b}) = \left(\mathbf{I} - \mathbf{P}_{\text{Row}(\mathbf{A})}\right)(\mathbf{b}) = \mathbf{b} - \mathbf{A}^+\mathbf{A}\mathbf{b} +$$ +::: + + +## Singular Value Decomposition and the four fundamental subspaces +The SVD provides a powerful way to understand the four fundamental +subspaces of a matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$. + +The SVD of $\mathbf{A}$ is given by: + +$$ +\mathbf{A} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^{\!\top\!} +$$ + +where $\mathbf{U} \in \mathbb{R}^{m \times m}$ and $\mathbf{V} \in \mathbb{R}^{n \times n}$ are orthogonal matrices, and $\mathbf{\Sigma} \in \mathbb{R}^{m \times n}$ is a diagonal matrix with the singular values of $\mathbf{A}$ on its diagonal. + +:::{prf:lemma} SVD and the Four Fundamental Subspaces +:label: trm-svd-four-subspaces +:nonumber: + +The SVD of a matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ can be used to identify the four fundamental subspaces as follows: +1. **Column Space**: $\text{Col}(\mathbf{A}) = \text{span}(\mathbf{U}_r)$, where $\mathbf{U}_r$ consists of the first $r$ columns of $\mathbf{U}$ corresponding to non-zero singular values. +2. **Row Space**: $\text{Row}(\mathbf{A}) = \text{span}(\mathbf{V}_r)$, where $\mathbf{V}_r$ consists of the first $r$ columns of $\mathbf{V}$ corresponding to non-zero singular values. +3. **Null Space**: $\text{Null}(\mathbf{A}) = \text{span}(\mathbf{V}_{n-r})$, where $\mathbf{V}_{n-r}$ consists of the last $n-r$ columns of $\mathbf{V}$ corresponding to zero singular values. +4. **Left Null Space**: $\text{Null}(\mathbf{A}^\top) = \text{span}(\mathbf{U}_{m-r})$, where $\mathbf{U}_{m-r}$ consists of the last $m-r$ columns of $\mathbf{U}$ corresponding to zero singular values. 
+::: + +## Summary +The four fundamental subspaces of a matrix $\mathbf{A}$ are essential in understanding the structure of the matrix and its properties. +The projections onto these subspaces can be computed using the Moore-Penrose pseudoinverse, which provides a powerful tool for solving linear systems and performing dimensionality reduction. +The SVD further enhances our understanding by revealing the relationships between these subspaces through the orthogonal matrices and singular values. diff --git a/book/chapter_decompositions/determinant.md b/book/chapter_decompositions/determinant.md new file mode 100644 index 0000000..c2b43fd --- /dev/null +++ b/book/chapter_decompositions/determinant.md @@ -0,0 +1,328 @@ +--- +jupytext: + text_representation: + extension: .md + format_name: myst + format_version: 0.13 + jupytext_version: 1.16.7 +kernelspec: + display_name: Python 3 + language: python + name: python3 +--- +# Determinant + +The **determinant** is a scalar quantity associated with any square matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$. It encodes important geometric and algebraic information about the transformation represented by $\mathbf{A}$. + +**Geometrically**, the determinant tells us: + +* How the matrix $\mathbf{A}$ **scales volume**: + The absolute value $|\det(\mathbf{A})|$ is the **volume-scaling factor** for the linear transformation $\mathbf{x} \mapsto \mathbf{A}\mathbf{x}$. +* Whether the transformation **preserves or flips orientation**: + If $\det(\mathbf{A}) > 0$, the transformation preserves orientation; if $\det(\mathbf{A}) < 0$, it reverses it (like a reflection). + +**Algebraically**, the determinant can be defined as: + +$$ +\det(\mathbf{A}) = \sum_{\sigma \in S_n} \operatorname{sgn}(\sigma) \cdot a_{1\sigma(1)} a_{2\sigma(2)} \cdots a_{n\sigma(n)} +$$ + +where: + +* The sum is over all permutations $\sigma$ of $\{1, 2, \dots, n\}$, +* $\operatorname{sgn}(\sigma)$ is $+1$ or $-1$ depending on the parity of the permutation. + +This formula is **computationally expensive** and confusing, but conceptually important: it captures how the determinant depends on all possible signed products of entries, each taken once from a distinct row and column. + +Let's illustrate the determinant geometrically. 
+ +```{code-cell} ipython3 +:tags: [hide-input] +import numpy as np +import matplotlib.pyplot as plt +from matplotlib.patches import Polygon + +# Define matrices to show area effects +matrices = { + "Area = 1 (Identity)": np.array([[1, 0], [0, 1]]), + "Area > 1": np.array([[2, 0.5], [0.5, 1.5]]), + "Area < 0 (Flip)": np.array([[0, 1], [1, 0]]), + "Rotation (Area = 1)": np.array([[np.cos(np.pi/4), -np.sin(np.pi/4)], + [np.sin(np.pi/4), np.cos(np.pi/4)]]) +} + +# Unit square +square = np.array([[0, 0], + [1, 0], + [1, 1], + [0, 1], + [0, 0]]).T + +fig, axes = plt.subplots(1, 4, figsize=(20, 5)) + +for ax, (title, M) in zip(axes, matrices.items()): + transformed_square = M @ square + area = np.abs(np.linalg.det(M)) + det = np.linalg.det(M) + + # Plot original unit square + ax.plot(square[0], square[1], 'k--', label='Unit square') + ax.fill(square[0], square[1], facecolor='lightgray', alpha=0.4) + + # Plot transformed shape + ax.plot(transformed_square[0], transformed_square[1], 'b-', label='Transformed') + ax.fill(transformed_square[0], transformed_square[1], facecolor='skyblue', alpha=0.6) + + # Add vector arrows for columns of M + origin = np.array([[0, 0]]).T + for i in range(2): + vec = M[:, i] + ax.quiver(*origin, vec[0], vec[1], angles='xy', scale_units='xy', scale=1, color='red') + + ax.set_title(f"{title}\nDet = {det:.2f}, Area = {area:.2f}") + ax.set_xlim(-2, 3) + ax.set_ylim(-2, 3) + ax.set_aspect('equal') + ax.grid(True) + ax.legend() + +plt.suptitle("Geometric Interpretation of the Determinant (Area Scaling and Orientation)", fontsize=16) +plt.tight_layout(rect=[0, 0, 1, 0.93]) +plt.show() +``` + +1. **Identity**: No change — area = 1. +2. **Stretch**: Expands area — determinant > 1. +3. **Flip**: Reflects across the diagonal — determinant < 0. +4. **Rotation**: Rotates without distortion — determinant = 1. + +## What is the Determinant? + +The **determinant** of a matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$ is a scalar that describes how $\mathbf{A}$ scales space. +Algebraically, it is defined by a signed sum over all permutations of the matrix’s entries. +Geometrically, it quantifies the change in **signed volume** of a unit shape under transformation by $\mathbf{A}$. If the determinant is zero, the transformation collapses the volume entirely, and $\mathbf{A}$ is singular (non-invertible). + +--- + +The determinant has several important properties: + +(i) $\det(\mathbf{I}) = 1$ + +(ii) $\det(\mathbf{A}^{\!\top\!}) = \det(\mathbf{A})$ + +(iii) $\det(\mathbf{A}\mathbf{B}) = \det(\mathbf{A})\det(\mathbf{B})$ + +(iv) $\det(\mathbf{A}^{-1}) = \det(\mathbf{A})^{-1}$ + +(v) $\det(\alpha\mathbf{A}) = \alpha^n \det(\mathbf{A})$ + +--- +## Practical Computation of the Determinant + +The algebraic definition of the determinant is computationally expensive. +In practice, we compute the determinant using property (iii) and a matrix factorization such as the **PLU decomposition**: + +$$ +\mathbf{A} = \mathbf{P} \mathbf{L} \mathbf{U} +$$ + +where $\mathbf{P}$ is a permutation matrix, $\mathbf{L}$ is a unit lower triangular matrix, and $\mathbf{U}$ is an upper triangular matrix. + +:::{prf:theorem} Triangular Matrix Determinant +:label: trm-triangular-determinant +:nonumber: + +Let $\mathbf{T} \in \mathbb{R}^{n \times n}$ be a **triangular matrix**, either upper or lower triangular. 
+ +Then: + +$$ +\boxed{ +\det(\mathbf{T}) = \prod_{i=1}^n T_{ii} +} +$$ +::: + + +Then, + +$$ +\boxed{ +\det(\mathbf{A}) = \det(\mathbf{P}) \cdot \det(\mathbf{L}) \cdot \det(\mathbf{U}) +} +$$ + +Since: + +* $\det(\mathbf{L}) = 1$ (if unit lower triangular), +* $\det(\mathbf{U}) = \prod_{i=1}^n u_{ii}$, +* $\det(\mathbf{P}) = (-1)^s$, where $s$ is the number of row swaps, + +this method reduces determinant computation to $\mathcal{O}(n)$ operations after LU decomposition. +As the cost for the LU decomposition is $\mathcal{O}(n^3),$ the total cost of computing the determinant is $\mathcal{O}(n^3).$ + +## Cofactor Expansion: Definition + +**Cofactor expansion** (also called **Laplace expansion**) gives a **recursive definition** of the determinant. + +Let $\mathbf{A} \in \mathbb{R}^{n \times n}$ be a square matrix. + +Then the **determinant** of $\mathbf{A}$ can be computed by expanding along **any row or column**. + +For simplicity, we’ll define it for the **first column**. + +$$ +\boxed{ +\det(\mathbf{A}) = \sum_{i=1}^{n} (-1)^{i+1} \cdot A_{i1} \cdot \det(\mathbf{A}^{(i,1)}) +} +$$ + +Where: + +* $A_{i1}$ is the entry in row $i$, column 1 +* $\mathbf{A}^{(i,1)}$ is the **minor** matrix obtained by deleting row $i$ and column 1 from $\mathbf{A}$ +* $(-1)^{i+1}$ is the **sign** factor for alternating signs (from the **checkerboard sign pattern**) +* $(-1)^{i+j} \cdot \det(\mathbf{A}^{(i,j)})$ is called the **cofactor** of $A_{ij}$ + +This formula recursively reduces the computation of a determinant to smaller and smaller submatrices. + +--- + +### Cofactor Expantion Example (3×3 Matrix) + +Let: + +$$ +\mathbf{A} = +\begin{bmatrix} +1 & 2 & 3 \\ +4 & 5 & 6 \\ +7 & 8 & 9 +\end{bmatrix} +$$ + +Expand along the **first column**: + +$$ +\det(\mathbf{A}) = +(+1) \cdot 1 \cdot +\begin{vmatrix} +5 & 6 \\ +8 & 9 +\end{vmatrix} +- 4 \cdot +\begin{vmatrix} +2 & 3 \\ +8 & 9 +\end{vmatrix} ++ 7 \cdot +\begin{vmatrix} +2 & 3 \\ +5 & 6 +\end{vmatrix} +$$ + +Now compute the 2×2 determinants: + +$$ +\det(\mathbf{A}) = +1 \cdot (5 \cdot 9 - 6 \cdot 8) +- 4 \cdot (2 \cdot 9 - 3 \cdot 8) ++ 7 \cdot (2 \cdot 6 - 3 \cdot 5) +$$ + +$$ += 1 \cdot (-3) - 4 \cdot (-6) + 7 \cdot (-3) += -3 + 24 - 21 = 0 +$$ + +So: + +$$ +\boxed{\det(\mathbf{A}) = 0} +$$ + + +:::{prf:proof} via Laplace Expansion / Cofactor Expansion + +We’ll prove this for **upper triangular** matrices by induction on the matrix size $n$. The same argument applies symmetrically for lower triangular matrices. + +--- + +### Base Case: $n = 1$ + +Let $\mathbf{T} = [t_{11}]$. Then clearly: + +$$ +\det(\mathbf{T}) = t_{11} += \prod_{i=1}^1 T_{ii} +$$ + +The base case holds. + +--- + +### Inductive Step + +Assume the result holds for $(n-1) \times (n-1)$ upper triangular matrices. + +Now let $\mathbf{T} \in \mathbb{R}^{n \times n}$ be upper triangular. That means all entries below the diagonal are zero: + +$$ +\mathbf{T} = +\begin{bmatrix} +t_{11} & t_{12} & \dots & t_{1n} \\ +0 & t_{22} & \dots & t_{2n} \\ +\vdots & \vdots & \ddots & \vdots \\ +0 & 0 & \dots & t_{nn} +\end{bmatrix} +$$ + +Use **cofactor expansion** along the first column. Since the only nonzero entry in the first column is $t_{11}$, we have: + +$$ +\det(\mathbf{T}) = t_{11} \cdot \det(\mathbf{T}^{(1,1)}) +$$ + +Where $\mathbf{T}^{(1,1)}$ is the $(n-1)\times(n-1)$ matrix obtained by deleting row 1 and column 1. But: + +* $\mathbf{T}^{(1,1)}$ is still upper triangular. 
+ +* By the inductive hypothesis: + + $$ + \det(\mathbf{T}^{(1,1)}) = \prod_{i=2}^{n} T_{ii} + $$ + +So: + +$$ +\det(\mathbf{T}) = t_{11} \cdot \prod_{i=2}^{n} T_{ii} += \prod_{i=1}^{n} T_{ii} +$$ + +The inductive step holds. + +--- + +### Conclusion + +By induction, for any upper (or lower) triangular matrix $\mathbf{T} \in \mathbb{R}^{n \times n}$, + +$$ +\boxed{ +\det(\mathbf{T}) = \prod_{i=1}^n T_{ii} +} +$$ + +::: + +* The determinant accumulates only the diagonal entries because **each pivot is isolated**, and all other paths in the expansion have zero entries. +* This result is frequently used in: + + * Computing determinants from **LU decomposition** + * Checking invertibility efficiently + * Proving properties of **eigenvalues** and **characteristic polynomials** + + + + diff --git a/book/chapter_decompositions/eigenvectors.md b/book/chapter_decompositions/eigenvectors.md new file mode 100644 index 0000000..a21104e --- /dev/null +++ b/book/chapter_decompositions/eigenvectors.md @@ -0,0 +1,478 @@ +--- +jupytext: + text_representation: + extension: .md + format_name: myst + format_version: 0.13 + jupytext_version: 1.16.7 +kernelspec: + display_name: Python 3 + language: python + name: python3 +--- +# Eigenvalues and Eigenvectors + +For a *square matrix* $\mathbf{A} \in \mathbb{R}^{n \times n}$, there may +be vectors which, when $\mathbf{A}$ is applied to them, are simply +scaled by some constant. + +A nonzero vector $\mathbf{x} \in \mathbb{C}^n$ is an **eigenvector** of $\mathbf{A}$ corresponding to **eigenvalue** $\lambda \in \mathbb{C}$ if + +$$\mathbf{A}\mathbf{x} = \lambda\mathbf{x}$$ + +The zero vector is excluded from this definition because +$\mathbf{A}\mathbf{0} = \mathbf{0} = \lambda\mathbf{0}$ +for every $\lambda$. + +Eigenvalues and eigenvectors can be complex numbers, even if $\mathbf{A}$ is real-valued. +We will discuss below under which conditions they are real. + +First, let's look at an example of how multiplication with a matrix $\mathbf{A}$ transforms vectors that lie on the unit circle and, in particular, how it acts on its eigenvectors.
+ +$$ +\mathbf{A} = \begin{pmatrix}1.5 & 0.5 \\ 0.1 & 1.2\end{pmatrix} +$$ + +```{code-cell} ipython3 +:tags: [hide-input] +import numpy as np +import matplotlib.pyplot as plt + +# --- Base matrix --- +A = np.array([[1.5, 0.5], + [0.1, 1.2]]) + +# Compute eigenvalues and eigenvectors +eigvals, eigvecs = np.linalg.eig(A) + +from IPython.display import display, Markdown + +λ1, λ2 = eigvals +display(Markdown(f"The matrix has Eigenvalues λ₁ = {eigvals[0]:.2f}, λ₂ = {eigvals[1]:.2f}.")) + +square = np.array([[0, 1, 1, 0, 0], + [0, 0, 1, 1, 0]]) +transformed_square = A @ square + +# Unit circle for reference +theta = np.linspace(0, 2*np.pi, 100) +circle = np.stack((np.cos(theta), np.sin(theta)), axis=0) + +# Transformed unit circle +circle_transformed = A @ circle + +# Plot settings +fig, ax = plt.subplots(figsize=(6,6)) +ax.plot(circle[0], circle[1], ':', color='gray', label='Unit circle') +ax.plot(circle_transformed[0], circle_transformed[1], color='gray', linestyle='--', label='A ∘ circle') + +# Plot eigenvectors +for i in range(2): + vec = eigvecs[:, i] + ax.quiver(0, 0, vec[0], vec[1], angles='xy', scale_units='xy', scale=1.0, color='blue', width=0.01) + ax.quiver(0, 0, *(eigvals[i] * vec), angles='xy', scale_units='xy', scale=1.0, color='red', width=0.01) + ax.text(*(1.1 * vec), f"v{i+1}", color='blue') + ax.text(*(1.05 * eigvals[i] * vec), f"λ{i+1}·v{i+1}", color='red') + +# Axes +ax.axhline(0, color='gray', lw=1) +ax.axvline(0, color='gray', lw=1) +ax.set_aspect('equal') +ax.set_xlim(-2.1, 2.1) +ax.set_ylim(-2.1, 2.1) +ax.set_title("Eigenvectors are Invariant Directions") +ax.plot(square[0], square[1], 'g:', label='Original square') +ax.plot(transformed_square[0], transformed_square[1], 'g--', label='A ∘ square') + +ax.legend() +plt.grid(True) +plt.show() +``` +The visualization shows: +- The **original unit circle** (dashed black) +- The **transformed unit circle** under $\mathbf{A}$ (solid red) +- The **eigenvectors** in blue and their **scaled images** in green + +Note how the eigenvectors are aligned with the directions that remain unchanged in orientation under transformation — they are only scaled by their respective eigenvalues. + +## Eigenvectors can be real-valued or complex. + +Here’s a breakdown of the geometric distinction between linear maps that have **only real eigenvectors** and those that have **complex eigenvectors**: + +### Real Eigenvectors → Maps That Stretch or Reflect Along Fixed Directions + +If a matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$ has **only real eigenvalues and eigenvectors**, it means: + +* There exist **real directions** in space that are preserved (up to scaling). +* The action of the matrix is intuitively: + + * **Scaling** (positive eigenvalues) + * **Reflection + scaling** (negative eigenvalues) +* You can visualize this as: + + * Pulling/stretching space along certain axes + * Possibly flipping directions + + +### Complex Eigenvectors → Maps That Rotate or Spiral + +If a matrix has **complex eigenvalues** and **no real eigenvectors**, it **cannot leave any real direction invariant**. + +This typically corresponds to: + +* **Rotation** or **spiral** motion +* Sometimes **rotation + scaling** (when complex eigenvalues have modulus $\ne 1$) +* The action in real space: + + * **No real eigenvector** + * Points are **rotated** or **rotated and scaled** + * Repeated application creates **circular** or **spiraling trajectories** + +#### Example: Stretching vs. Shearing vs. Rotation +* **Stretching**: scales space differently along the axes. 
The matrix has only real eigenvalues and eigenvectors. +* **Shearing**: shifts one axis direction while keeping the other fixed. The matrix has only real eigenvalues and eigenvectors. +* **Rotation**: turns everything around the origin. The matrix has only complex eigenvalues and eigenvectors. + +Each transformation is applied to a **unit square** and a **grid**, so you can clearly see how space is deformed under each linear map. +```{code-cell} ipython3 +:tags: [hide-input] +import numpy as np +import matplotlib.pyplot as plt + +def apply_transform(grid, matrix): + return np.tensordot(matrix, grid, axes=1) + +def draw_transform(ax, matrix, title, color='red'): + # Draw original grid + x = np.linspace(-1, 1, 11) + y = np.linspace(-1, 1, 11) + for xi in x: + ax.plot([xi]*len(y), y, color='lightgray', lw=0.5) + for yi in y: + ax.plot(x, [yi]*len(x), color='lightgray', lw=0.5) + + # Draw transformed grid + for xi in x: + line = np.stack(([xi]*len(y), y)) + transformed = apply_transform(line, matrix) + ax.plot(transformed[0], transformed[1], color=color, lw=1) + for yi in y: + line = np.stack((x, [yi]*len(x))) + transformed = apply_transform(line, matrix) + ax.plot(transformed[0], transformed[1], color=color, lw=1) + + # Draw unit square before and after + square = np.array([[0, 1, 1, 0, 0], + [0, 0, 1, 1, 0]]) + transformed_square = matrix @ square + ax.plot(square[0], square[1], 'k--', label='Original square') + ax.plot(transformed_square[0], transformed_square[1], 'k-', label='Transformed square') + ax.set_aspect('equal') + ax.set_xlim(-2, 2) + ax.set_ylim(-2, 2) + ax.set_title(title) + ax.legend() + +# Define transformation matrices +stretch = np.array([[1.5, 0], + [0, 0.5]]) + +shear = np.array([[1, 1], + [0, 1]]) + +theta = np.pi / 4 +rotation = np.array([[np.cos(theta), -np.sin(theta)], + [np.sin(theta), np.cos(theta)]]) + +# Plot all three +fig, axes = plt.subplots(1, 3, figsize=(15, 5)) +draw_transform(axes[0], stretch, "Stretching") +draw_transform(axes[1], shear, "Shearing") +draw_transform(axes[2], rotation, "Rotation") +plt.suptitle("Linear Transformations: Stretch vs Shear vs Rotation", fontsize=14) +plt.tight_layout(rect=[0, 0, 1, 0.95]) +plt.show() +``` + +--- + +We now give some useful results about how eigenvalues change after +various manipulations. + + +:::{prf:proposition} Eigenvalues and Eigenvectors +:label: eigenvalues_eigenvectors_properties +:nonumber: + +Let $\mathbf{x}$ be an eigenvector of $\mathbf{A}$ with corresponding +eigenvalue $\lambda$. + +Then + +(i) For any $\gamma \in \mathbb{R}$, $\mathbf{x}$ is an eigenvector of + $\mathbf{A} + \gamma\mathbf{I}$ with eigenvalue $\lambda + \gamma$. + +(ii) If $\mathbf{A}$ is invertible, then $\mathbf{x}$ is an eigenvector + of $\mathbf{A}^{-1}$ with eigenvalue $\lambda^{-1}$. + +(iii) $\mathbf{A}^k\mathbf{x} = \lambda^k\mathbf{x}$ for any + $k \in \mathbb{Z}$ (where $\mathbf{A}^0 = \mathbf{I}$ by + definition). +::: + +Below we illustrate the geometric meaning of Propositions (i)–(iii) using the same original matrix $\mathbf{A}$. 
+ +Each subplot shows: +- The **unit circle** (dashed black) +- The **circle transformed by the original matrix $\mathbf{A}$** (dotted gray) +- The **circle transformed by the modified matrix** (solid red) +- An **eigenvector of $\mathbf{A}$** (blue) +- The **eigenvector after transformation** by the modified matrix (red arrow) + +```{code-cell} ipython3 +:tags: [hide-input] +import numpy as np +import matplotlib.pyplot as plt + +def plot_eig_effect(ax, A_original, A_transformed, transformation_label, proposition_label, color_vec='red', color_circle='crimson',): + # Unit circle + theta = np.linspace(0, 2 * np.pi, 200) + circle = np.vstack((np.cos(theta), np.sin(theta))) + circle_A = A_original @ circle + circle_transformed = A_transformed @ circle + + # Eigenvectors and values of A_original + eigvals, eigvecs = np.linalg.eig(A_original) + + # Plot unit and transformed circles + ax.plot(circle[0], circle[1], 'k--', label='Unit Circle') + ax.plot(circle_A[0], circle_A[1], color='gray', linestyle=':', label='A ∘ Circle') + ax.plot(circle_transformed[0], circle_transformed[1], color=color_circle, label=transformation_label+' ∘ Circle') + + for i in range(2): + v = eigvecs[:, i] + v = v / np.linalg.norm(v) + Atrans_v = A_transformed @ v + + # Plot eigenvector and its transformed image + ax.quiver(0, 0, v[0], v[1], angles='xy', scale_units='xy', scale=1, color='blue', label=r'Eigenvector $\mathbf{v}$' if i == 0 else None) + ax.quiver(0, 0, Atrans_v[0], Atrans_v[1], angles='xy', scale_units='xy', scale=1, color=color_vec, label=transformation_label+r' ∘ $\mathbf{v}$' if i == 0 else None) + + # Formatting + ax.set_xlim(-3, 3) + ax.set_ylim(-3, 3) + ax.set_aspect('equal') + ax.axhline(0, color='gray', lw=0.5) + ax.axvline(0, color='gray', lw=0.5) + ax.set_title(proposition_label + transformation_label) + ax.grid(True) + +# --- Base matrix --- +A = np.array([[1.5, 0.5], + [0.1, 1.2]]) + +# --- Matrix variants --- +gamma = 0.5 +A_shifted = A + gamma * np.eye(2) +A_inv = np.linalg.inv(A) +A_sq = A @ A + +# --- Plotting --- +fig, axes = plt.subplots(1, 3, figsize=(18, 6)) + +plot_eig_effect(axes[0], A, A_shifted, "(A + γI)", proposition_label= "i): ") +plot_eig_effect(axes[1], A, A_inv, "A⁻¹", proposition_label= "ii): ") +plot_eig_effect(axes[2], A, A_sq, "A²", proposition_label= "iii): ") + +for ax in axes: + ax.legend() + +plt.suptitle("Eigenvector Transformations for Proposition (i)–(iii)", fontsize=16) +plt.tight_layout(rect=[0, 0, 1, 0.95]) +plt.show() + +``` +We observe that: +- The eigenvector direction is **invariant** (it doesn’t rotate) +- The **scaling changes** depending on the transformation: + - In (i), $\mathbf{A} + \gamma \mathbf{I}$ adds $\gamma$ to the eigenvalue. + - In (ii), $\mathbf{A}^{-1}$ inverts the eigenvalue. + - In (iii), $\mathbf{A}^2$ squares the eigenvalue. + +Note how the red-transformed circles deform differently in each panel, but the eigenvector stays aligned. + +:::{prf:proof} + +(i) follows readily: + +$$(\mathbf{A} + \gamma\mathbf{I})\mathbf{x} = \mathbf{A}\mathbf{x} + \gamma\mathbf{I}\mathbf{x} = \lambda\mathbf{x} + \gamma\mathbf{x} = (\lambda + \gamma)\mathbf{x}$$ + +(ii) Suppose $\mathbf{A}$ is invertible. Then + +$$\mathbf{x} = \mathbf{A}^{-1}\mathbf{A}\mathbf{x} = \mathbf{A}^{-1}(\lambda\mathbf{x}) = \lambda\mathbf{A}^{-1}\mathbf{x}$$ + +Dividing by $\lambda$, which is valid because the invertibility of +$\mathbf{A}$ implies $\lambda \neq 0$, gives +$\lambda^{-1}\mathbf{x} = \mathbf{A}^{-1}\mathbf{x}$. 
+ +(iii) The case $k \geq 0$ follows immediately by induction on $k$. +Then the general case $k \in \mathbb{Z}$ follows by combining the +$k \geq 0$ case with (ii). ◻ +::: + + + + +## Relationship between Eigenvalues and Determinant +Interestingly, the determinant of a matrix is equal to the product of +its eigenvalues (repeated according to multiplicity): + +$$\det(\mathbf{A}) = \prod_{i=1}^n \lambda_i(\mathbf{A})$$ + +This provides a means to find the eigenvalues by deriving the roots of the characteristic polynomial. + +:::{prf:corollary} Characteristic Polynomial +:label: trm-characteristic-polynomial +:nonumber: + +The eigenvalues of a matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$ are the **roots of its characteristic polynomial** defined as: + +$$ +p(\lambda) = \det(\mathbf{A} - \lambda \mathbf{I}) +$$ + +It is a degree-$n$ polynomial in $\lambda$, and its roots are precisely the eigenvalues of $\mathbf{A}$. + +::: + +:::{prf:proof} Characteristic Polynomial + +By definition, $\lambda$ is an **eigenvalue** of $\mathbf{A}$ if: + +$$ +\exists \, \mathbf{x} \neq \mathbf{0} \text{ such that } \mathbf{A} \mathbf{x} = \lambda \mathbf{x} +$$ + +Rewriting: + +$$ +(\mathbf{A} - \lambda \mathbf{I}) \mathbf{x} = \mathbf{0} +$$ + +This is a homogeneous linear system. +A **nontrivial solution** exists **if and only if** the matrix $\mathbf{A} - \lambda \mathbf{I}$ is **not invertible**, which according to the **fundamental equivalences for square matrices** is equivalent to: + +$$ +\det(\mathbf{A} - \lambda \mathbf{I}) = 0 +$$ + +Therefore, the **eigenvalues are the roots of the characteristic polynomial** $p(\lambda)$. +::: + +$$ +\mathbf{A} - \lambda \mathbf{I} = +\begin{bmatrix} +a_{11} - \lambda & a_{12} & \cdots & a_{1n} \\ +a_{21} & a_{22} - \lambda & \cdots & a_{2n} \\ +\vdots & \vdots & \ddots & \vdots \\ +a_{n1} & a_{n2} & \cdots & a_{nn} - \lambda +\end{bmatrix}. +$$ + +Taking the determinant of this matrix yields a **polynomial in $\lambda$**. + +Each term in the determinant expansion is a product of $n$ entries, and due to the linearity in $\lambda$ of each diagonal term, the highest degree term in $\lambda$ is $(-\lambda)^n$. + +Hence: + +$$ +p(\lambda) = \det(\mathbf{A} - \lambda \mathbf{I}) = (-1)^n \lambda^n + c_{n-1} \lambda^{n-1} + \cdots + c_1 \lambda + c_0, +$$ + +for some coefficients $c_i \in \mathbb{R}$. + +Thus, $p(\lambda)$ is a **monic polynomial** of degree $n$. + +### **Example: Characteristic Polynomial of a 2×2 Matrix** + +Here is the full derivation of the **characteristic polynomial** for a general $2 \times 2$ matrix, step by step: + + +Let + +$$ +\mathbf{A} = \begin{bmatrix} +a & b \\ +c & d +\end{bmatrix} \in \mathbb{R}^{2 \times 2}. +$$ + +We want to compute the characteristic polynomial: + +$$ +p(\lambda) = \det(\mathbf{A} - \lambda \mathbf{I}). +$$ + +#### Step 1: Subtract $\lambda \mathbf{I}$ + +$$ +\mathbf{A} - \lambda \mathbf{I} = +\begin{bmatrix} +a - \lambda & b \\ +c & d - \lambda +\end{bmatrix}. +$$ + +#### Step 2: Compute the determinant + +$$ +p(\lambda) = \det(\mathbf{A} - \lambda \mathbf{I}) += (a - \lambda)(d - \lambda) - bc. +$$ + +#### Step 3: Expand the polynomial + +$$ +p(\lambda) += ad - a\lambda - d\lambda + \lambda^2 - bc += \lambda^2 - (a + d)\lambda + (ad - bc). 
+$$ + +--- + +### **Interpretation** + +So the characteristic polynomial is: + +$$ +p(\lambda) = \lambda^2 - \mathrm{tr}(\mathbf{A})\lambda + \det(\mathbf{A}), +$$ + +where: + +* $\mathrm{tr}(\mathbf{A}) = a + d$ is the **trace**, +* $\det(\mathbf{A}) = ad - bc$ is the **determinant**. + +--- + +### **Eigenvalues** + +The eigenvalues are the roots of this quadratic polynomial: + +$$ +\lambda_{1,2} = \frac{1}{2} \left( \mathrm{tr}(\mathbf{A}) \pm \sqrt{ \mathrm{tr}(\mathbf{A})^2 - 4 \det(\mathbf{A}) } \right). +$$ + + + +## Relationship between the Trace of a Matrix and its Eigenvalues + +Interestingly, the trace of a matrix $\mathbf{A}\in\mathbb{R}^{n \times n}$ is equal to the sum of its eigenvalues (repeated according to multiplicity): + +$$\operatorname{tr}(\mathbf{A}) = \sum_{i=1}^n \lambda_i(\mathbf{A})$$ + +Note that this sum is real even if $\mathbf{A}$ has complex eigenvalues. + +The reason is that the complex eigenvalues of a real matrix always appear in conjugate pairs, so their imaginary parts cancel in the sum. + + diff --git a/book/chapter_decompositions/matrix_norms.md b/book/chapter_decompositions/matrix_norms.md new file mode 100644 index 0000000..9a354f9 --- /dev/null +++ b/book/chapter_decompositions/matrix_norms.md @@ -0,0 +1,485 @@ +--- +jupytext: + text_representation: + extension: .md + format_name: myst + format_version: 0.13 + jupytext_version: 1.16.7 +kernelspec: + display_name: Python 3 + language: python + name: python3 +--- +# Matrix Norms + +Matrix norms provide a way to measure the "size" or "magnitude" of a matrix. They are used throughout machine learning and numerical analysis—for example, to quantify approximation error, assess convergence in optimization algorithms, or bound the spectral properties of linear transformations. + +## Definition + +A **matrix norm** is a function $ \|\cdot\| : \mathbb{R}^{m \times n} \to \mathbb{R} $ satisfying the following properties for all matrices $ \mathbf{A}, +\mathbf{B} \in \mathbb{R}^{m \times n} $ and all scalars $ \alpha \in \mathbb{R} $: + +1. **Non-negativity**: $ \|\mathbf{A}\| \geq 0 $ +2. **Definiteness**: $ \|\mathbf{A}\| = 0 \iff \mathbf{A} = 0 $ +3. **Homogeneity**: $ \|\alpha \mathbf{A}\| = |\alpha| \cdot \|\mathbf{A}\| $ +4. **Triangle inequality**: $ \|\mathbf{A} + \mathbf{B}\| \leq \|\mathbf{A}\| + \|\mathbf{B}\| $ + +These are the **minimal axioms** for a matrix norm — analogous to vector norms. + +## Common Matrix Norms + +### 1. **Frobenius Norm** + +Defined by: + +$$ +\|\mathbf{A}\|_F = \sqrt{\sum_{i,j} A_{ij}^2} = \sqrt{\mathrm{tr}(\mathbf{A}^\top \mathbf{A})} +$$ + +It treats the matrix as a vector in $ \mathbb{R}^{mn} $. + +### 2. **Induced (Operator) Norms** + +Given a vector norm $ \|\cdot\| $, the **induced matrix norm** is: + +$$ +\|\mathbf{A}\| = \sup_{\mathbf{x} \neq 0} \frac{\|\mathbf{A} \mathbf{x}\|}{\|\mathbf{x}\|} = \sup_{\|\mathbf{x}\| = 1} \|\mathbf{A} \mathbf{x}\| +$$ + +Examples: +- **Spectral norm**: Induced by the Euclidean norm $ \|\cdot\|_2 $. + + Equal to the largest singular value of $ \mathbf{A} $. +- **$ \ell_1 $ norm**: Maximum absolute column sum. +- **$ \ell_\infty $ norm**: Maximum absolute row sum. + + +## Submultiplicativity + +The **submultiplicative property** is additional structure, not a required axiom. Many useful matrix norms (especially induced norms) **do** satisfy it, but not all matrix norms do.
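+
+As a quick numerical illustration, the sketch below checks the inequality for the spectral and Frobenius norms on random matrices and shows a classic counterexample for the entrywise max norm; the specific matrices are arbitrary choices made for illustration.
+
+```{code-cell} ipython3
+import numpy as np
+
+rng = np.random.default_rng(0)
+A = rng.standard_normal((3, 3))
+B = rng.standard_normal((3, 3))
+
+# Spectral (ord=2) and Frobenius (ord='fro') norms are submultiplicative
+for name, ord_ in [("spectral", 2), ("Frobenius", "fro")]:
+    lhs = np.linalg.norm(A @ B, ord=ord_)
+    rhs = np.linalg.norm(A, ord=ord_) * np.linalg.norm(B, ord=ord_)
+    print(f"{name}: ||AB|| = {lhs:.3f} <= ||A||*||B|| = {rhs:.3f}")
+
+# The entrywise max norm is not submultiplicative:
+C = np.ones((2, 2))
+entrywise_max = lambda M: np.abs(M).max()
+print(entrywise_max(C @ C), ">", entrywise_max(C) * entrywise_max(C))  # 2.0 > 1.0
+```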
+ +When a matrix norm satisfies it, we say it is a: + +> **Submultiplicative matrix norm** + +- Induced norms satisfy the **submultiplicative property**: + +$$ +\|\mathbf{A}\mathbf{B}\| \leq \|\mathbf{A}\| \cdot \|\mathbf{B}\| +$$ + +- For the Frobenius norm: + +$$ +\|\mathbf{A}\mathbf{B}\|_F \leq \|\mathbf{A}\|_F \cdot \|\mathbf{B}\|_F +$$ + +| Norm | Submultiplicative? | Notes | +| ------------------------------ | ------------------ | ---------------------------------- | +| Frobenius norm $\|\cdot\|_F$ | ✅ Yes | But not induced from a vector norm | +| Induced norms (e.g., spectral) | ✅ Yes | Always submultiplicative | +| Entrywise max norm | ❌ No | Not submultiplicative in general | + + + +- All norms on a finite-dimensional vector space are equivalent (they define the same topology), but may differ in scaling. + + +## Visual Comparison (2D case) + +In 2D, vector norms induce different geometries: +- $ \ell_2 $: circular level sets +- $ \ell_1 $: diamond-shaped level sets +- $ \ell_\infty $: square level sets + +This influences which directions are favored in optimization and which vectors are "small" under a given norm. + +Here is a visual comparison of how different induced norms transform unit circles in 2D space under a linear transformation defined by a matrix $ A $: + +$$ +\mathbf{A} = \begin{bmatrix} +2 & 1 \\ +1 & 3 +\end{bmatrix} +$$ + +```{code-cell} ipython3 +:tags: [hide-input] +import numpy as np +import matplotlib.pyplot as plt + +# Define example matrix A +A = np.array([[2, 1], + [1, 3]]) + +# Create unit circles in different norms +theta = np.linspace(0, 2 * np.pi, 400) +circle = np.stack([np.cos(theta), np.sin(theta)], axis=1) + +# l1 unit ball boundary (diamond) +l1_vectors = [] +for v in circle: + norm = np.sum(np.abs(v)) + l1_vectors.append(v / norm) +l1_vectors = np.array(l1_vectors) + +# l2 unit ball (circle) +l2_vectors = circle + +# linf unit ball boundary (square) +linf_vectors = [] +for v in circle: + norm = np.max(np.abs(v)) + linf_vectors.append(v / norm) +linf_vectors = np.array(linf_vectors) + +# Apply matrix A to each set +l1_transformed = l1_vectors @ A.T +l2_transformed = l2_vectors @ A.T +linf_transformed = linf_vectors @ A.T + +# Plot +fig, ax = plt.subplots(1, 3, figsize=(15, 5)) + +# l1 norm effect +ax[0].plot(l1_vectors[:, 0], l1_vectors[:, 1], label='Original') +ax[0].plot(l1_transformed[:, 0], l1_transformed[:, 1], label='Transformed') +ax[0].set_title(r'$\ell_1$ Norm Unit Ball $\rightarrow A$') +ax[0].axis('equal') +ax[0].grid(True) +ax[0].legend() + +# l2 norm effect +ax[1].plot(l2_vectors[:, 0], l2_vectors[:, 1], label='Original') +ax[1].plot(l2_transformed[:, 0], l2_transformed[:, 1], label='Transformed') +ax[1].set_title(r'$\ell_2$ Norm Unit Ball $\rightarrow A$') +ax[1].axis('equal') +ax[1].grid(True) +ax[1].legend() + +# linf norm effect +ax[2].plot(linf_vectors[:, 0], linf_vectors[:, 1], label='Original') +ax[2].plot(linf_transformed[:, 0], linf_transformed[:, 1], label='Transformed') +ax[2].set_title(r'$\ell_\infty$ Norm Unit Ball $\rightarrow A$') +ax[2].axis('equal') +ax[2].grid(True) +ax[2].legend() + +plt.tight_layout() +plt.show() + +``` + + +Let’s give formal **definitions and proofs** for several commonly used **induced matrix norms**, also known as **operator norms**, derived from vector norms. 
+ +Let $\|\cdot\|$ be a **vector norm** on $\mathbb{R}^n$, and define the **induced matrix norm** for $\mathbf{A} \in \mathbb{R}^{m \times n}$ as: + +$$ +\|\mathbf{A}\| = \sup_{\mathbf{x} \neq 0} \frac{\|\mathbf{A} \mathbf{x}\|}{\|\mathbf{x}\|} = \sup_{\|\mathbf{x}\| = 1} \|\mathbf{A} \mathbf{x}\| +$$ + +We’ll now state and prove specific formulas for induced norms when the underlying vector norm is: + +* $\ell_1$ +* $\ell_\infty$ +* $\ell_2$ (spectral norm) + +--- + +## 1. Induced $\ell_1$ Norm + +**Claim**: +If $\|\mathbf{x}\| = \|\mathbf{x}\|_1$, then: + +$$ +\|\mathbf{A}\|_1 = \max_{1 \leq j \leq n} \sum_{i=1}^m |A_{ij}| +\quad \text{(maximum absolute column sum)} +$$ + + +:::{prf:proof} + +Let $\mathbf{A} = [a_{ij}] \in \mathbb{R}^{m \times n}$. + +Then by the definition of the induced norm: + +$$ +\|\mathbf{A}\|_1 = \sup_{\|\mathbf{x}\|_1 = 1} \|\mathbf{A} \mathbf{x}\|_1 += \sup_{\|\mathbf{x}\|_1 = 1} \sum_{i=1}^m \left| \sum_{j=1}^n a_{ij} x_j \right| +$$ + +Apply the triangle inequality inside the absolute value: + +$$ +\leq \sup_{\|\mathbf{x}\|_1 = 1} \sum_{i=1}^m \sum_{j=1}^n |a_{ij}| \cdot |x_j| += \sup_{\|\mathbf{x}\|_1 = 1} \sum_{j=1}^n |x_j| \left( \sum_{i=1}^m |a_{ij}| \right) +$$ + +Let us define the **column sums**: + +$$ +c_j := \sum_{i=1}^m |a_{ij}| +$$ + +Then the expression becomes: + +$$ +\sum_{j=1}^n |x_j| c_j \leq \max_j c_j \cdot \sum_{j=1}^n |x_j| = \max_j c_j +$$ + +since $\sum_{j=1}^n |x_j| = \|\mathbf{x}\|_1 = 1$, and this is a convex combination of the $c_j$. + +### Attainment of the Maximum + +Let $j^* \in \{1, \dots, n\}$ be the index of the column with maximum sum: + +$$ +c_{j^*} = \max_j \sum_i |a_{ij}| +$$ + +Now choose the **standard basis vector** $\mathbf{e}_{j^*} \in \mathbb{R}^n$, where: + +$$ +(\mathbf{e}_{j^*})_j = \begin{cases} +1, & j = j^* \\\\ +0, & j \neq j^* +\end{cases} +$$ + +Then $\|\mathbf{e}_{j^*}\|_1 = 1$, and: + +$$ +\|\mathbf{A} \mathbf{e}_{j^*}\|_1 = \sum_{i=1}^m \left| a_{i j^*} \right| = c_{j^*} +$$ + +So the upper bound is **achieved**, and we conclude: + +$$ +\|\mathbf{A}\|_1 = \max_j \sum_i |a_{ij}| +$$ + +QED. +::: +--- + +## 2. Induced $\ell_\infty$ Norm + +**Claim**: +If $\|\mathbf{x}\| = \|\mathbf{x}\|_\infty$, then: + +$$ +\|\mathbf{A}\|_\infty = \max_{1 \leq i \leq m} \sum_{j=1}^n |A_{ij}| +\quad \text{(maximum absolute row sum)} +$$ + +:::{prf:proof} + +Let $\|\mathbf{x}\|_\infty = 1$. + +Then: + +$$ +\|A \mathbf{x}\|_\infty = \max_{i} \left| \sum_j a_{ij} x_j \right| +\leq \max_i \sum_j |a_{ij}||x_j| \leq \max_i \sum_j |a_{ij}| +$$ + +Equality is achieved by choosing $x_j = \operatorname{sign}(a_{ij^*})$ at the row $i^*$ with largest sum. So: + +$$ +\|\mathbf{A}\|_\infty = \max_i \sum_j |a_{ij}| +$$ + +QED. +::: + +## 3. Induced $\ell_2$ Norm (Spectral Norm) + +**Claim**: +If $\|\cdot\| = \|\cdot\|_2$, then: + +$$ +\|\mathbf{A}\|_2 = \sigma_{\max}(\mathbf{A}) = \sqrt{\lambda_{\max}(\mathbf{A}^\top \mathbf{A})} +$$ + +where $\sigma_{\max}(\mathbf{A})$ is the **largest singular value** of $\mathbf{A}$, and $\lambda_{\max}$ denotes the largest eigenvalue. + +:::{prf:proof} + +Let $\|\mathbf{x}\|_2 = 1$. + +Then: + +$$ +\|\mathbf{A}\|_2 = \sup_{\|\mathbf{x}\|_2 = 1} \|\mathbf{A} \mathbf{x}\|_2 += \sup_{\|\mathbf{x}\|_2 = 1} \sqrt{(\mathbf{A} \mathbf{x})^\top (\mathbf{A} \mathbf{x})} += \sup_{\|\mathbf{x}\|_2 = 1} \sqrt{\mathbf{x}^\top \mathbf{A}^\top \mathbf{A} \mathbf{x}} +$$ + +This is the **Rayleigh quotient** of $\mathbf{A}^\top \mathbf{A}$, a symmetric PSD matrix. 
+ +So: + +$$ +\|\mathbf{A}\|_2 = \sqrt{\lambda_{\max}(\mathbf{A}^\top \mathbf{A})} +$$ + +QED. +::: +--- + +## Summary Table + +| Vector Norm | Induced Matrix Norm | Formula | +| ------------- | ----------------------------|----------------------------- | +| $\ell_1$ | Max column sum: | $\max_j \sum_i \lvert A_{ij} \rvert$ | +| $\ell_\infty$ | Max row sum: | $\max_i \sum_j \lvert A_{ij} \rvert$ | +| $\ell_2$ | Largest singular value: |$\sqrt{\lambda_{\max}(\mathbf{A}^\top \mathbf{A})}$ | + + +## Applications in Machine Learning + +- In **optimization**, norms define constraints (e.g., Lasso uses $ \ell_1 $-norm penalty). +- In **regularization**, norms quantify complexity of parameter matrices (e.g., weight decay with $ \ell_2 $-norm). +- In **spectral methods**, matrix norms bound approximation error (e.g., spectral norm bounds for generalization). + + +--- + +## Collaborative Filtering and Matrix Factorization + +**Collaborative filtering** is a foundational technique in recommendation systems, where the goal is to predict a user's preference for items based on observed interactions (such as ratings, clicks, or purchases). The key assumption underlying collaborative filtering is that **user preferences and item characteristics lie in a shared low-dimensional latent space**. That is, although we observe only sparse user-item interactions, there exists a hidden structure — often of low rank — that explains these patterns. + +A common model formalizes this intuition by representing the user-item rating matrix $R \in \mathbb{R}^{m \times n}$ as the product of two low-rank matrices: + +$$ +R \approx UV^\top +$$ + +where $U \in \mathbb{R}^{m \times k}$ encodes latent user features and $V \in \mathbb{R}^{n \times k}$ encodes latent item features, for some small $k \ll \min(m, n)$. The model is typically fit by **minimizing the squared error** over observed entries, together with regularization to prevent overfitting: + +$$ +\min_{U, V} \sum_{(i,j) \in \Omega} (R_{ij} - U_i^\top V_j)^2 + \lambda (\|U\|_F^2 + \|V\|_F^2) +$$ + +where $\Omega \subset [m] \times [n]$ is the set of observed ratings, and $\| \cdot \|_F$ is the Frobenius norm. This formulation implicitly assumes that **missing ratings are missing at random** and that users with similar latent profiles tend to rate items similarly — an assumption that allows the model to generalize from sparse data. + +```{code-cell} ipython3 +class MatrixFactorization: + def __init__(self, k=2, steps=1000, lam=0.1): + """ + Initializes the matrix factorization model. + + Parameters: + - k (int): number of latent features + - steps (int): number of ALS iterations + - lam (float): regularization strength + """ + self.k = k + self.steps = steps + self.lam = lam + self.U = None + self.V = None + + def fit(self, R, mask): + """ + Fit the model to the observed rating matrix using ALS.
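+
+        Each ALS sweep alternates between the factors: with V fixed, each row
+        U[i] has the closed-form ridge-regression update
+            U[i] = (V_obs^T V_obs + lam * I_k)^{-1} V_obs^T R_i,
+        where V_obs stacks the factors of the items rated by user i; the
+        update for V[j] with U fixed is symmetric.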
+ + Parameters: + - R (ndarray): observed rating matrix (with zeros for missing entries) + - mask (ndarray): boolean matrix where True indicates an observed entry + """ + num_users, num_items = R.shape + self.U = np.random.randn(num_users, self.k) + self.V = np.random.randn(num_items, self.k) + + for step in range(self.steps): + # Update U + for i in range(num_users): + V_masked = self.V[mask[i, :]] + R_i = R[i, mask[i, :]] + if len(R_i) > 0: + A = V_masked.T @ V_masked + self.lam * np.eye(self.k) + b = V_masked.T @ R_i + self.U[i] = np.linalg.solve(A, b) + # Update V + for j in range(num_items): + U_masked = self.U[mask[:, j]] + R_j = R[mask[:, j], j] + if len(R_j) > 0: + A = U_masked.T @ U_masked + self.lam * np.eye(self.k) + b = U_masked.T @ R_j + self.V[j] = np.linalg.solve(A, b) + + def predict(self): + """ + Returns the full reconstructed rating matrix. + """ + return self.U @ self.V.T + + def predict_single(self, user_idx, item_idx): + """ + Predict a single rating for a user-item pair. + + Parameters: + - user_idx (int): index of the user + - item_idx (int): index of the item + + Returns: + - float: predicted rating + """ + return self.U[user_idx] @ self.V[item_idx] +``` + + +This example demonstrates **collaborative filtering via matrix factorization** using the **Frobenius norm** to minimize reconstruction error: + +```{code-cell} ipython3 +:tags: [hide-input] +# Re-import necessary packages after kernel reset +import numpy as np +import matplotlib.pyplot as plt + +# Set random seed for reproducibility +np.random.seed(42) + +# Generate a low-rank user-item matrix (simulating ratings) +num_users = 10 +num_items = 8 +rank = 2 # desired low-rank structure + +# Latent user and item factors +U_true = np.random.randn(num_users, rank) +V_true = np.random.randn(num_items, rank) + +# Generate full rating matrix (low-rank) +R_true = U_true @ V_true.T + +# Simulate missing entries by masking some values +mask = np.random.rand(num_users, num_items) < 0.5 +R_observed = R_true * mask + +model = MatrixFactorization(k=rank, steps=1000, lam=0.1) +model.fit(R_observed, mask) +R_pred = model.predict() + +# Plotting the true, observed, and predicted matrices +fig, axs = plt.subplots(1, 3, figsize=(15, 4)) +im0 = axs[0].imshow(R_true, cmap='coolwarm', vmin=-5, vmax=5) +axs[0].set_title("True Rating Matrix") +im1 = axs[1].imshow(np.where(mask, R_observed, np.nan), cmap='coolwarm', vmin=-5, vmax=5) +axs[1].set_title("Observed Ratings (with Missing)") +im2 = axs[2].imshow(R_pred, cmap='coolwarm', vmin=-5, vmax=5) +axs[2].set_title("Predicted Ratings via MF") + +for ax in axs: + ax.set_xlabel("Items") + ax.set_ylabel("Users") + +fig.colorbar(im2, ax=axs, orientation='vertical', fraction=0.02, pad=0.04) + +plt.show() +``` +* **Left panel**: The true user-item rating matrix (low-rank structure). +* **Middle panel**: The observed entries, with \~50% missing. +* **Right panel**: The matrix reconstructed via alternating least squares (ALS). diff --git a/book/chapter_decompositions/matrix_rank.md b/book/chapter_decompositions/matrix_rank.md new file mode 100644 index 0000000..5066210 --- /dev/null +++ b/book/chapter_decompositions/matrix_rank.md @@ -0,0 +1,115 @@ +--- +jupytext: + text_representation: + extension: .md + format_name: myst + format_version: 0.13 + jupytext_version: 1.16.7 +kernelspec: + display_name: Python 3 + language: python + name: python3 +--- + +# Rank of a Matrix + +Let $\mathbf{A} \in \mathbb{R}^{m \times n}$ be a real matrix. 
+ +The **rank** of $\mathbf{A}$, denoted $\operatorname{rank}(\mathbf{A})$, is defined as: + +$$ +\operatorname{rank}(\mathbf{A}) = \text{the maximum number of linearly independent rows or columns of } \mathbf{A} +$$ + +Equivalently, it's the **dimension of the image** (or column space) of $\mathbf{A}$: + +$$ +\operatorname{rank}(\mathbf{A}) = \dim(\operatorname{Im}(\mathbf{A})) = \dim(\text{Col}(\mathbf{A})) +$$ + +--- + +## ✅ Interpretations + +* **Column Rank**: The number of linearly independent **columns** +* **Row Rank**: The number of linearly independent **rows** + +> For all matrices, the **row rank equals the column rank**, even if $m \neq n$. This is a deep result in linear algebra. + +--- + +## ✅ Practical View + +To compute $\operatorname{rank}(\mathbf{A})$ in practice: + +* Reduce $\mathbf{A}$ to **row echelon form** (via Gaussian elimination) +* Count the number of **non-zero rows** + +--- + +## 🧠 Summary + +$$ +\boxed{ +\operatorname{rank}(\mathbf{A}) = \text{dimensionality of the space spanned by the columns (or rows) of } \mathbf{A} +} +$$ + + +```{code-cell} ipython3 +:tags: [hide-input] +import numpy as np +import matplotlib.pyplot as plt + +# Matrices of different ranks +A_full_rank = np.array([[3, 1], + [1, 2]]) + +A_rank_1 = np.array([[3, 6], + [1, 2]]) + +A_rank_0 = np.zeros((2, 2)) + +# Unit circle +theta = np.linspace(0, 2*np.pi, 100) +circle = np.stack((np.cos(theta), np.sin(theta))) + +# Unit square +square = np.array([[0, 1, 1, 0, 0], + [0, 0, 1, 1, 0]]) + +def plot_transformation(ax, A, title): + ax.set_title(title) + ax.axhline(0, color='gray', lw=0.5) + ax.axvline(0, color='gray', lw=0.5) + ax.set_xlim(-5, 5) + ax.set_ylim(-5, 5) + ax.set_aspect('equal') + ax.grid(True) + + # Plot transformed circle + ax.plot(circle[0], circle[1], "y:", label='Circle') + + + # Plot transformed circle + circ_trans = A @ circle + ax.plot(circ_trans[0], circ_trans[1], color='darkorange', label='A ∘ Circle') + + # Plot transformed square + sq_trans = A @ square + ax.plot(square[0], square[1], 'g:', label='Square') + ax.plot(sq_trans[0], sq_trans[1], color='green', label='A ∘ Square') + + ax.legend() + +# Plot +fig, axes = plt.subplots(1, 3, figsize=(18, 6)) + +plot_transformation(axes[0], A_full_rank, "Rank 2: Full Rank (ℝ² → ℝ²)") +plot_transformation(axes[1], A_rank_1, "Rank 1: Collapse to Line") +plot_transformation(axes[2], A_rank_0, "Rank 0: Collapse to Origin") + +plt.suptitle("Geometric Effect of Rank: Vectors, Circle, and Square Transformed", fontsize=16) +plt.tight_layout(rect=[0, 0, 1, 0.93]) +plt.show() +``` \ No newline at end of file diff --git a/book/chapter_decompositions/orthogonal_matrices.md b/book/chapter_decompositions/orthogonal_matrices.md new file mode 100644 index 0000000..262758a --- /dev/null +++ b/book/chapter_decompositions/orthogonal_matrices.md @@ -0,0 +1,169 @@ +--- +jupytext: + text_representation: + extension: .md + format_name: myst + format_version: 0.13 + jupytext_version: 1.16.7 +kernelspec: + display_name: Python 3 + language: python + name: python3 +--- +# Orthogonal matrices + +A matrix $\mathbf{Q} \in \mathbb{R}^{n \times n}$ is said to be +**orthogonal** if its columns are pairwise orthonormal. + + +This definition implies that + +$$\mathbf{Q}^{\!\top\!} \mathbf{Q} = \mathbf{Q}\mathbf{Q}^{\!\top\!} = \mathbf{I}$$ + +or equivalently, $\mathbf{Q}^{\!\top\!} = \mathbf{Q}^{-1}$. 
+ +A nice thing about orthogonal matrices is that they preserve inner products: + +$$(\mathbf{Q}\mathbf{x})^{\!\top\!}(\mathbf{Q}\mathbf{y}) = \mathbf{x}^{\!\top\!} \mathbf{Q}^{\!\top\!} \mathbf{Q}\mathbf{y} = \mathbf{x}^{\!\top\!} \mathbf{I}\mathbf{y} = \mathbf{x}^{\!\top\!}\mathbf{y}$$ + +A direct result of this fact is that they also preserve 2-norms: + +$$\|\mathbf{Q}\mathbf{x}\|_2 = \sqrt{(\mathbf{Q}\mathbf{x})^{\!\top\!}(\mathbf{Q}\mathbf{x})} = \sqrt{\mathbf{x}^{\!\top\!}\mathbf{x}} = \|\mathbf{x}\|_2$$ + +Therefore multiplication by an orthogonal matrix can be considered as a +transformation that preserves length, but may rotate or reflect the +vector about the origin. + +```{code-cell} ipython3 +:tags: [hide-input] +import numpy as np +import matplotlib.pyplot as plt + +# Asymmetrical vector set +vectors = np.array([[1, 0.5, -0.5], + [0, 1, 0.5]]) + +# Orthogonal matrices +theta = np.pi / 4 +Q_rot = np.array([[np.cos(theta), -np.sin(theta)], + [np.sin(theta), np.cos(theta)]]) +Q_reflect = np.array([[1, 0], + [0, -1]]) + +# Transform vectors +rotated_vectors = Q_rot @ vectors +reflected_vectors = Q_reflect @ vectors + +# Unit square +square = np.array([[0, 1, 1, 0, 0], + [0, 0, 1, 1, 0]]) +square_rotated = Q_rot @ square +square_reflected = Q_reflect @ square + +# Plotting +fig, axes = plt.subplots(1, 3, figsize=(15, 5)) + +# Function to plot a frame +def plot_frame(ax, vecs, square, title, color): + ax.quiver(np.zeros(vecs.shape[1]), np.zeros(vecs.shape[1]), + vecs[0], vecs[1], angles='xy', scale_units='xy', scale=1, color=color) + ax.plot(square[0], square[1], 'k--', lw=1.5, label='Transformed Unit Square') + ax.fill(square[0], square[1], facecolor='lightgray', alpha=0.3) + ax.set_title(title) + ax.set_xlim(-2, 2) + ax.set_ylim(-2, 2) + ax.set_aspect('equal') + ax.axhline(0, color='gray', lw=0.5) + ax.axvline(0, color='gray', lw=0.5) + ax.grid(True) + ax.legend() + +# Original +plot_frame(axes[0], vectors, square, "Original Vectors and Unit Square", 'blue') + +# Rotation +plot_frame(axes[1], rotated_vectors, square_rotated, "Rotation (Orthogonal Q)", 'green') + +# Reflection +plot_frame(axes[2], reflected_vectors, square_reflected, "Reflection (Orthogonal Q)", 'red') + +plt.suptitle("Orthogonal Transformations: Vectors and Unit Square", fontsize=16) +plt.tight_layout(rect=[0, 0, 1, 0.93]) +plt.show() +``` + +This enhanced visualization shows how **orthogonal transformations** affect both: + +* A set of **asymmetric vectors**, and + +* The **unit square**, which is preserved in shape and size but transformed in orientation: + +* **Left**: The original setup with vectors and the unit square. + +* **Middle**: A **rotation** — vectors and the square are rotated without distortion. + +* **Right**: A **reflection** — vectors and the square are flipped, but all lengths and angles remain unchanged. + +✅ This highlights that orthogonal matrices are **distance- and angle-preserving**, making them key to rigid transformations like rotations and reflections. 
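+
+A short numerical check of these claims, as a minimal sketch: the orthogonal matrix below is obtained from the QR factorization of an arbitrary random matrix, an assumption made purely for illustration.
+
+```{code-cell} ipython3
+import numpy as np
+
+rng = np.random.default_rng(0)
+Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))   # square Q with orthonormal columns
+
+x = rng.standard_normal(4)
+y = rng.standard_normal(4)
+
+print(np.allclose(Q.T @ Q, np.eye(4)))                        # Q^T Q = I, so Q^{-1} = Q^T
+print(np.allclose((Q @ x) @ (Q @ y), x @ y))                  # inner products are preserved
+print(np.allclose(np.linalg.norm(Q @ x), np.linalg.norm(x)))  # 2-norms are preserved
+```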
+ + +--- +:::{prf:theorem} Determinant of an Orthogonal Matrix +:label: thm-determinant-orthogonal-matrix +:nonumber: + +Let $\mathbf{Q} \in \mathbb{R}^{n \times n}$ be an **orthogonal matrix**, meaning: + +$$ +\mathbf{Q}^\top \mathbf{Q} = \mathbf{I} +$$ + +Then: + +$$ +\boxed{ +\det(\mathbf{Q}) = \pm 1 +} +$$ +::: + +:::{prf:proof} + +We start with the identity: + +$$ +\mathbf{Q}^\top \mathbf{Q} = \mathbf{I} +$$ + +Now take the determinant of both sides: + +$$ +\det(\mathbf{Q}^\top \mathbf{Q}) = \det(\mathbf{I}) = 1 +$$ + +Using the **multiplicativity of determinants** and the fact that $\det(\mathbf{Q}^\top) = \det(\mathbf{Q})$ (since $\det(\mathbf{A}^\top) = \det(\mathbf{A})$): + +$$ +\det(\mathbf{Q}^\top) \cdot \det(\mathbf{Q}) = (\det(\mathbf{Q}))^2 = 1 +$$ + +Taking square roots: + +$$ +\boxed{ +\det(\mathbf{Q}) = \pm 1 +} +$$ + +Thus, the determinant of any orthogonal matrix is either $+1$ (rotation) or $-1$ (reflection). + +$\quad \blacksquare$ +::: +--- + +## 🧠 Interpretation + +* **$\det(\mathbf{Q}) = 1$**: The transformation preserves orientation — e.g., **rotation**. +* **$\det(\mathbf{Q}) = -1$**: The transformation flips orientation — e.g., **reflection**. + +This theorem is foundational in rigid body transformations, 3D graphics, PCA, and more. diff --git a/book/chapter_decompositions/orthogonal_projections.md b/book/chapter_decompositions/orthogonal_projections.md new file mode 100644 index 0000000..3881026 --- /dev/null +++ b/book/chapter_decompositions/orthogonal_projections.md @@ -0,0 +1,408 @@ +--- +jupytext: + text_representation: + extension: .md + format_name: myst + format_version: 0.13 + jupytext_version: 1.16.7 +kernelspec: + display_name: Python 3 + language: python + name: python3 +--- +# Orthogonal projections + +We now consider a particular kind of optimization problem referred to as **projection onto a subspace**: + +Given some point $\mathbf{x}$ in an inner product space $V$, find the +closest point to $\mathbf{x}$ in a subspace $S$ of $V$. 
+ + +The following diagram should make it geometrically clear that, at least +in Euclidean space, the solution is intimately related to orthogonality +and the Pythagorean theorem: + +```{code-cell} ipython3 +:tags: [hide-input] +# Re-import required packages due to kernel reset +import numpy as np +import matplotlib.pyplot as plt + +# Define subspace S spanned by vector e1 +e1 = np.array([1, 2]) +e1 = e1 / np.linalg.norm(e1) # normalize to make it orthonormal + +# Define arbitrary point x not in the subspace +x = np.array([2, 1]) + +# Compute projection of x onto the subspace spanned by e1 +x_proj = np.dot(x, e1) * e1 + +# Define a second point y in the subspace (for triangle) +y = 3 * e1 + +# Set up plot +fig, ax = plt.subplots(figsize=(6, 6)) + +# Draw vectors +origin = np.array([0, 0]) +ax.quiver(*origin, *x, angles='xy', scale_units='xy', scale=1, color='blue', label=r'$\mathbf{x}$') +ax.quiver(*origin, *x_proj, angles='xy', scale_units='xy', scale=1, color='green', label=r'$\mathbf{y}^* = P\mathbf{x}$') +ax.quiver(*origin, *y, angles='xy', scale_units='xy', scale=1, color='gray', alpha=0.5, label=r'$\mathbf{y} \in S$') + +# Draw dashed lines to form triangle +ax.plot([x[0], x_proj[0]], [x[1], x_proj[1]], 'k--', lw=1) +ax.plot([y[0], x[0]], [y[1], x[1]], 'k--', lw=1) +ax.plot([y[0], x_proj[0]], [y[1], x_proj[1]], 'k--', lw=1) + +# Annotate +ax.text(*(x + 0.2), r'$\mathbf{x}$', fontsize=12) +ax.text(*(x_proj + 0.2), r'$\mathbf{y}^*$', fontsize=12) +ax.text(*(y + 0.2), r'$\mathbf{y}$', fontsize=12) + +# Draw subspace line +line_extent = np.linspace(-10, 10, 100) +s_line = np.outer(line_extent, e1) +ax.plot(s_line[:, 0], s_line[:, 1], 'r-', lw=1, label=r'Subspace $S$') + +# Formatting +ax.set_xlim(-0.5, 3) +ax.set_ylim(-0.5, 3) +ax.set_aspect('equal') +ax.grid(True) +ax.legend() +ax.set_title(r"Orthogonal Projection of $\mathbf{x}$ onto Subspace $S$") + +plt.tight_layout() +plt.show() +``` +In this diagram, the blue vector $\mathbf{x}$ is an arbitrary point in the +inner product space $V$, the green vector $\mathbf{y}^* = \mathbf{P}\mathbf{x}$ is +the projection of $\mathbf{x}$ onto the subspace $S$, and the gray vector +$\mathbf{y}$ is an arbitrary point in $S$. + +The dashed lines form a right triangle with $\mathbf{x}$, $\mathbf{y}^*$, and $\mathbf{y}$ as vertices. +The right triangle formed by these three points illustrates the +relationship between the projection and orthogonality: the line segment +from $\mathbf{x}$ to $\mathbf{y}^*$ is perpendicular to the subspace $S$, +and the distance from $\mathbf{x}$ to $\mathbf{y}^*$ is the shortest +distance from $\mathbf{x}$ to any point in $S$. + +This is a direct +consequence of the Pythagorean theorem, which states that in a right +triangle, the square of the length of the hypotenuse (in this case, +$\|\mathbf{x}-\mathbf{y}\|$) is equal to the sum of the squares of the +lengths of the other two sides (in this case, $\|\mathbf{x}-\mathbf{y}^*\|$ and $\|\mathbf{y}^*-\mathbf{y}\|$). + +Here $\mathbf{y}$ is an arbitrary element of the subspace $S$, and +$\mathbf{y}^*$ is the point in $S$ such that $\mathbf{x}-\mathbf{y}^*$ +is perpendicular to $S$. The hypotenuse of a right triangle (in this +case $\|\mathbf{x}-\mathbf{y}\|$) is always longer than either of the +legs (in this case $\|\mathbf{x}-\mathbf{y}^*\|$ and +$\|\mathbf{y}^*-\mathbf{y}\|$), and when $\mathbf{y} \neq \mathbf{y}^*$ +there always exists such a triangle between $\mathbf{x}$, $\mathbf{y}$, +and $\mathbf{y}^*$. 
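+
+The same relationships can be checked numerically. The short sketch below reuses the vectors from the figure above ($\mathbf{e}_1$ spanning $S$, the point $\mathbf{x}$, its projection $\mathbf{y}^*$, and one arbitrary $\mathbf{y} \in S$):
+
+```{code-cell} ipython3
+import numpy as np
+
+e1 = np.array([1.0, 2.0])
+e1 = e1 / np.linalg.norm(e1)          # orthonormal basis vector of the subspace S
+
+x = np.array([2.0, 1.0])
+y_star = (x @ e1) * e1                # projection of x onto S
+y = 3 * e1                            # some other point in S
+
+print(np.isclose((x - y_star) @ e1, 0.0))   # the residual x - y* is orthogonal to S
+
+# Pythagorean identity: ||x - y||^2 = ||x - y*||^2 + ||y* - y||^2
+lhs = np.linalg.norm(x - y) ** 2
+rhs = np.linalg.norm(x - y_star) ** 2 + np.linalg.norm(y_star - y) ** 2
+print(np.isclose(lhs, rhs))
+```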
+ +Our intuition from Euclidean space suggests that the closest point to +$\mathbf{x}$ in $S$ has the perpendicularity property described above, +and we now show that this is indeed the case. + +:::{prf:proposition} Orthogonal projection and unique minimizer +:label: prop-unique-minimizer +:nonumber: +Let $S$ be a subspace of an inner product space $V$, let $\mathbf{x} \in V$, and let $\mathbf{y}^* \in S$. + +Then $\mathbf{y}^*$ +is the unique minimizer of $\|\mathbf{x}-\mathbf{y}\|$ over +$\mathbf{y} \in S$ if and only if $\mathbf{x}-\mathbf{y}^* \perp S$. +::: + +:::{prf:proof} + +$(\implies)$ Suppose $\mathbf{y}^*$ is the unique minimizer of +$\|\mathbf{x}-\mathbf{y}\|$ over $\mathbf{y} \in S$. + +That is, +$\|\mathbf{x}-\mathbf{y}^*\| \leq \|\mathbf{x}-\mathbf{y}\|$ for all +$\mathbf{y} \in S$, with equality only if $\mathbf{y} = \mathbf{y}^*$. + +Fix $\mathbf{v} \in S$ and observe that + +$$\begin{aligned} +g(t) :&= \|\mathbf{x}-\mathbf{y}^*+t\mathbf{v}\|^2 \\ +&= \langle \mathbf{x}-\mathbf{y}^*+t\mathbf{v}, \mathbf{x}-\mathbf{y}^*+t\mathbf{v} \rangle \\ +&= \langle \mathbf{x}-\mathbf{y}^*, \mathbf{x}-\mathbf{y}^* \rangle - 2t\langle \mathbf{x}-\mathbf{y}^*, \mathbf{v} \rangle + t^2\langle \mathbf{v}, \mathbf{v} \rangle \\ +&= \|\mathbf{x}-\mathbf{y}^*\|^2 - 2t\langle \mathbf{x}-\mathbf{y}^*, \mathbf{v} \rangle + t^2\|\mathbf{v}\|^2 +\end{aligned}$$ + +must have a minimum at $t = 0$ as a consequence of this +assumption. + +Thus + +$$0 = g'(0) = \left.-2\langle \mathbf{x}-\mathbf{y}^*, \mathbf{v} \rangle + 2t\|\mathbf{v}\|^2\right|_{t=0} = -2\langle \mathbf{x}-\mathbf{y}^*, \mathbf{v} \rangle$$ + +giving $\mathbf{x}-\mathbf{y}^* \perp \mathbf{v}$. Since $\mathbf{v}$ +was arbitrary in $S$, we have $\mathbf{x}-\mathbf{y}^* \perp S$ as +claimed. + +$(\impliedby)$ Suppose $\mathbf{x}-\mathbf{y}^* \perp S$. + +Observe that +for any $\mathbf{y} \in S$, $\mathbf{y}^*-\mathbf{y} \in S$ because +$\mathbf{y}^* \in S$ and $S$ is closed under subtraction. + +Under the +hypothesis, $\mathbf{x}-\mathbf{y}^* \perp \mathbf{y}^*-\mathbf{y}$, so +by the Pythagorean theorem, + +$$\|\mathbf{x}-\mathbf{y}\|^2 = \|\mathbf{x}-\mathbf{y}^*+\mathbf{y}^*-\mathbf{y}\|^2 = \|\mathbf{x}-\mathbf{y}^*\|^2 + \|\mathbf{y}^*-\mathbf{y}\|^2 \geq \|\mathbf{x} - \mathbf{y}^*\|^2$$ + +and in fact the inequality is strict when $\mathbf{y} \neq \mathbf{y}^*$ +since this implies $\|\mathbf{y}^*-\mathbf{y}\| > 0$. + +Thus, taking square roots, +$\mathbf{y}^*$ is the unique minimizer of $\|\mathbf{x}-\mathbf{y}\|$ +over $\mathbf{y} \in S$. ◻ +::: + +Since a unique minimizer in $S$ can be found for any $\mathbf{x} \in V$, +we can define an operator + +$$\mathbf{P}\mathbf{x} = \operatorname{argmin}_{\mathbf{y} \in S} \|\mathbf{x}-\mathbf{y}\|$$ + +Observe that $\mathbf{P}\mathbf{y} = \mathbf{y}$ for any $\mathbf{y} \in S$, +since $\mathbf{y}$ has distance zero from itself and every other point +in $S$ has positive distance from $\mathbf{y}$. + +Thus +$\mathbf{P}(\mathbf{P}\mathbf{x}) = \mathbf{P}\mathbf{x}$ for any $\mathbf{x}$ (i.e., $\mathbf{P}^2 = \mathbf{P}$) +because $\mathbf{P}\mathbf{x} \in S$. + +The identity $\mathbf{P}^2 = \mathbf{P}$ is actually one of +the defining properties of a **projection**, the other being linearity. + +An immediate consequence of the previous result is that +$\mathbf{x} - \mathbf{P}\mathbf{x} \perp S$ for any $\mathbf{x} \in V$, and +conversely that $\mathbf{P}$ is the unique operator that satisfies this property +for all $\mathbf{x} \in V$. For this reason, $\mathbf{P}$ is known as an +**orthogonal projection**.
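+
+These two properties are easy to verify numerically for a concrete orthogonal projection. The sketch below assumes $S$ is the column space of an arbitrary random matrix $\mathbf{A}$ and uses the pseudoinverse-based projector $\mathbf{P} = \mathbf{A}\mathbf{A}^+$ onto $\text{Col}(\mathbf{A})$ seen earlier:
+
+```{code-cell} ipython3
+import numpy as np
+
+rng = np.random.default_rng(0)
+A = rng.standard_normal((5, 2))           # S = Col(A), a 2-dimensional subspace of R^5
+P = A @ np.linalg.pinv(A)                 # orthogonal projector onto Col(A)
+
+x = rng.standard_normal(5)
+
+print(np.allclose(P @ P, P))              # idempotent: projecting twice changes nothing
+print(np.allclose(P.T, P))                # symmetric
+print(np.allclose(A.T @ (x - P @ x), 0))  # residual is orthogonal to S
+```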
+ +If we choose an orthonormal basis for the target subspace $S$, it is +possible to write down a more specific expression for $\mathbf{P}$. + +:::{prf:proposition} +:label: prop-orthonormal-basis-projection +:nonumber: + +If $\mathbf{e}_1, \dots, \mathbf{e}_m$ is an orthonormal basis for $S$, +then + +$$\mathbf{P}\mathbf{x} = \sum_{i=1}^m \langle \mathbf{x}, \mathbf{e}_i \rangle\mathbf{e}_i$$ +::: + + +:::{prf:proof} +Let $\mathbf{e}_1, \dots, \mathbf{e}_m$ be an orthonormal basis +for $S$, and suppose $\mathbf{x} \in V$. + +Then for all $j = 1, \dots, m$, + +$$\begin{aligned} +\left\langle \mathbf{x}-\sum_{i=1}^m \langle \mathbf{x}, \mathbf{e}_i \rangle\mathbf{e}_i, \mathbf{e}_j \right\rangle &= \langle \mathbf{x}, \mathbf{e}_j \rangle - \sum_{i=1}^m \langle \mathbf{x}, \mathbf{e}_i \rangle\underbrace{\langle \mathbf{e}_i, \mathbf{e}_j \rangle}_{\delta_{ij}} \\ +&= \langle \mathbf{x}, \mathbf{e}_j \rangle - \langle \mathbf{x}, \mathbf{e}_j \rangle \\ +&= 0 +\end{aligned}$$ + +We have shown that the claimed expression, call it +$\tilde{\mathbf{P}}\mathbf{x}$, satisfies +$\mathbf{x} - \tilde{\mathbf{P}}\mathbf{x} \perp \mathbf{e}_j$ for every element +$\mathbf{e}_j$ of the orthonormal basis for $S$. + +It follows (by +linearity of the inner product) that +$\mathbf{x} - \tilde{\mathbf{P}}\mathbf{x} \perp S$. + +So the previous result +implies $\mathbf{P} = \tilde{\mathbf{P}}$. ◻ +::: + +The fact that $\mathbf{P}$ is a linear operator (and thus a proper projection, as +earlier we showed $\mathbf{P}^2 = \mathbf{P}$) follows readily from this result. + + +## **Matrix Representation of Projection Operators** + +Given a subspace $S \subset \mathbb{R}^n$, the **orthogonal projection** of a vector $\mathbf{x} \in \mathbb{R}^n$ onto $S$ is the unique vector $\mathbf{P}\mathbf{x} \in S$ such that: + +* $\mathbf{x} - \mathbf{P}\mathbf{x} \perp S$ (residual is orthogonal) +* $\mathbf{P}\mathbf{x} \in S$ (lies in the subspace) +* $\|\mathbf{x} - \mathbf{P}\mathbf{x}\|$ is minimized + +This leads us to define the projection operator $\mathbf{P} \in \mathbb{R}^{n \times n}$ as a **linear map** satisfying key properties — two of which are: + +* **idempotence** $(\mathbf{P}^2 = \mathbf{P})$ +* **symmetry** $(\mathbf{P}^\top = \mathbf{P})$ + +Let's now examine *why* they are essential. + +### Idempotence $\mathbf{P}^2 = \mathbf{P}$ is Required + +Idempotence ensures that once you've projected a vector onto the subspace, projecting it again **does nothing**: + +$$ +\mathbf{P}(\mathbf{P}\mathbf{x}) = \mathbf{P}\mathbf{x} +$$ + +### **Why it's required:** + +* Geometrically: The image $\mathbf{P}\mathbf{x}$ lies in the subspace. If projecting it again changed it, that would mean the subspace is not invariant under the projection — contradicting the notion of projection. +* Algebraically: If $\mathbf{P}^2 \neq \mathbf{P}$, then $\mathbf{P}$ is not consistent — it cannot define a *fixed* mapping to the subspace. + + +## Why Symmetry $\mathbf{P}^\top = \mathbf{P}$ is Required + +Symmetry ensures that the projection is **orthogonal**: the difference between $\mathbf{x}$ and its projection is orthogonal to the subspace: + +$$ +\langle \mathbf{x} - \mathbf{P}\mathbf{x}, \mathbf{P}\mathbf{x} \rangle = 0 +\quad \Leftrightarrow \quad +\mathbf{P}^\top = \mathbf{P} +$$ + +### **Why it's required:** + +* Without symmetry, $\mathbf{P}$ could project onto the subspace in a skewed or oblique manner — not orthogonally. 
+* Orthogonal projections are characterized by **minimal distance**, and this only occurs when the residual is orthogonal to the subspace. +* If $\mathbf{P} \neq \mathbf{P}^\top$, the projection may preserve direction, but **not minimize distance**. + +### **Geometric Consequence**: + +A non-symmetric idempotent matrix defines an **oblique projection**, which is still a projection but not orthogonal. It does not minimize distance to the subspace. + + +### Summary Table + +| Property | Meaning | Why Required | +| ------------ | ------------------------ | ---------------------------------------------------- | +| $\mathbf{P}^2 = \mathbf{P}$ | Idempotence / Stability | Ensures projecting twice gives same result | +| $\mathbf{P}^\top = \mathbf{P}$ | Symmetry / Orthogonality | Ensures projection is shortest-distance (orthogonal) | + +--- + + +## Basis Representation of Orthogonal Projection Matrices +Orthogonal projections can be expressed using matrices when the subspace is defined by a basis: + +If $S = \operatorname{span}(\mathbf{e}_1, \dots, \mathbf{e}_m)$, where the $\mathbf{e}_i$ are **orthonormal**, then the projection matrix is: + +$$ +\mathbf{P} = \sum_{i=1}^m \mathbf{e}_i \mathbf{e}_i^\top +$$ + +In matrix form, if $ E \in \mathbb{R}^{n \times m} $ has columns $\mathbf{e}_i$, then + +$$ +\mathbf{P} = EE^\top \quad \text{and} \quad \mathbf{P}\mathbf{x} = EE^\top \mathbf{x} +$$ + +:::{prf:theorem} Basis Representation of the Orthogonal Projection Matrix +:label: thm-orthogonal-projection-matrix +:nonumber: + +Let $\mathbf{e}_1, \dots, \mathbf{e}_m \in \mathbb{R}^n$ be orthonormal vectors, and define the matrix: + +$$ +E = [\mathbf{e}_1 \,\, \mathbf{e}_2 \,\, \cdots \,\, \mathbf{e}_m] \in \mathbb{R}^{n \times m} +$$ + +Then the matrix: + +$$ +\mathbf{P} = EE^\top \in \mathbb{R}^{n \times n} +$$ + +is the **orthogonal projection** onto the subspace $S = \operatorname{Col}(E) = \operatorname{span}(\mathbf{e}_1, \dots, \mathbf{e}_m)$. + +That is, for any $\mathbf{x} \in \mathbb{R}^n$, $\mathbf{P}\mathbf{x} \in S$, and $\mathbf{x} - \mathbf{P}\mathbf{x} \perp S$. +::: + +:::{prf:proof} + +Let’s verify the three key properties of orthogonal projections. + +--- + +### 1. **$\mathbf{P}$ is symmetric:** + +$$ +\mathbf{P}^\top = (EE^\top)^\top = EE^\top = \mathbf{P} \quad \text{✓} +$$ + +--- + +### 2. **$\mathbf{P}$ is idempotent:** + +$$ +\mathbf{P}^2 = (EE^\top)(EE^\top) = E(E^\top E)E^\top +$$ + +But since $\{\mathbf{e}_i\}$ are orthonormal, we have: + +$$ +E^\top E = I_m \Rightarrow \mathbf{P}^2 = E I E^\top = EE^\top = \mathbf{P} \quad \text{✓} +$$ + +--- + +### 3. **$\mathbf{P}\mathbf{x} \in S$ and $\mathbf{x} - \mathbf{P}\mathbf{x} \perp S$:** + +Let $\mathbf{x} \in \mathbb{R}^n$. Then: + +$$ +\mathbf{P}\mathbf{x} = EE^\top \mathbf{x} \in \operatorname{Col}(E) = S +$$ + +Let $\mathbf{v} \in S$, so $\mathbf{v} = E\mathbf{a}$ for some $\mathbf{a} \in \mathbb{R}^m$. 
Then:
+
+$$
+\langle \mathbf{x} - \mathbf{P}\mathbf{x}, \mathbf{v} \rangle
+= \langle \mathbf{x} - EE^\top \mathbf{x}, E\mathbf{a} \rangle
+= \langle \mathbf{x}, E\mathbf{a} \rangle - \langle EE^\top \mathbf{x}, E\mathbf{a} \rangle
+$$
+
+Use $\langle \mathbf{x}, E\mathbf{a} \rangle = \langle E^\top \mathbf{x}, \mathbf{a} \rangle$, and similarly for the second term:
+
+$$
+= \langle E^\top \mathbf{x}, \mathbf{a} \rangle - \langle E^\top EE^\top \mathbf{x}, \mathbf{a} \rangle
+= \langle E^\top \mathbf{x}, \mathbf{a} \rangle - \langle E^\top \mathbf{x}, \mathbf{a} \rangle = 0
+$$
+
+So:
+
+$$
+\mathbf{x} - \mathbf{P}\mathbf{x} \perp S \quad \text{✓}
+$$
+
+We conclude that $\mathbf{P} = EE^\top = \sum_{i=1}^m \mathbf{e}_i \mathbf{e}_i^\top$ is indeed the orthogonal projection onto the subspace spanned by $\{\mathbf{e}_1, \dots, \mathbf{e}_m\}$.
+
+:::
+
+
+### **Application Example: Least Squares Regression**
+In least squares regression, we want to find the best-fitting line (or hyperplane) through a set of points.
+
+This can be framed as an orthogonal projection problem:
+
+Given a design matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$ and target vector $\mathbf{y} \in \mathbb{R}^n$, the goal is to find coefficients $\boldsymbol{\beta} \in \mathbb{R}^d$ such that:
+$$
+\hat{\boldsymbol{\beta}} = \operatorname{argmin}_{\boldsymbol{\beta}} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2
+$$
+
+This is equivalent to projecting $\mathbf{y}$ onto the column space of $\mathbf{X}$, which (assuming $\mathbf{X}$ has full column rank) can be expressed using the projection matrix:
+
+$$
+\mathbf{P} = \mathbf{X}(\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top = \mathbf{X}\mathbf{X}^+
+$$
+
+This projection minimizes the distance between $\mathbf{y}$ and the subspace spanned by the columns of $\mathbf{X}$, yielding the least squares solution.
diff --git a/drafts/chapter_decompositions/overview_decompositions.md b/book/chapter_decompositions/overview_decompositions.md
similarity index 100%
rename from drafts/chapter_decompositions/overview_decompositions.md
rename to book/chapter_decompositions/overview_decompositions.md
diff --git a/book/chapter_decompositions/pca.md b/book/chapter_decompositions/pca.md
new file mode 100644
index 0000000..cc06f65
--- /dev/null
+++ b/book/chapter_decompositions/pca.md
@@ -0,0 +1,405 @@
+---
+jupytext:
+  text_representation:
+    extension: .md
+    format_name: myst
+    format_version: 0.13
+    jupytext_version: 1.16.7
+kernelspec:
+  display_name: math4ml
+  language: python
+  name: python3
+---
+# Principal Components Analysis
+
+Principal Components Analysis (PCA) performs an orthogonal projection of the data onto a lower-dimensional linear space. The goal is to find the directions (principal components) in which the variance of the data is maximized.
+An alternative definition of PCA is based on minimizing the sum-of-squares of the projection errors.
+
+## Formal definition
+
+Given a dataset $\mathbf{X} \in \mathbb{R}^{N \times D}$ (rows are samples, columns are features), we aim to find an orthonormal basis $\mathbf{U}_k \in \mathbb{R}^{D \times k}$, $k < D$, such that the projection of the data onto the subspace spanned by $\mathbf{U}_k$ captures **as much variance** (energy) as possible.
+
+In the following example, we visualize how PCA both minimizes reconstruction error in the original space and extracts a lower-dimensional, variance-preserving representation.
+ +```{code-cell} ipython3 +:tags: [hide-input] + +import numpy as np +import pandas as pd +import matplotlib.pyplot as plt +from pysnptools.snpreader import Bed +from mpl_toolkits.mplot3d import Axes3D + +# Generate synthetic 3D data +np.random.seed(42) +n_samples = 20 +covariance_3d = np.array([ + [5, 0.5, 0.7], + [0.5, 1, 0], + [0.7, 0, 10] +]) +rotation_3d = np.linalg.cholesky(covariance_3d) +data_3d = np.random.randn(n_samples, 3) @ rotation_3d.T + +# Center the data +mean_3d = np.mean(data_3d, axis=0) +data_centered_3d = data_3d - mean_3d + +# Compute SVD +U, S, Vt = np.linalg.svd(data_centered_3d, full_matrices=False) +V = Vt.T +S2 = S[:2] / np.sqrt(n_samples) +V2 = V[:, :2] + +# Project and reconstruct +proj_2d = data_centered_3d @ V2 +recon_3d = (proj_2d @ V2.T) + mean_3d[np.newaxis, :] + +# Create a mesh grid for the 2D PCA plane +grid_range = np.linspace(-2, 2, 100) +xx, yy = np.meshgrid(grid_range, grid_range) +plane_points = np.stack([xx.ravel(), yy.ravel()], axis=1) +plane_points *= S2[np.newaxis, :] +plane_3d = mean_3d[np.newaxis, :] + (plane_points @ V2.T) + +# Plot: 3D PCA + 2D Projection with principal components added in 2D view + +fig = plt.figure(figsize=(16, 6)) + +# 3D plot +ax1 = fig.add_subplot(121, projection='3d') +ax1.scatter(data_3d[:, 0], data_3d[:, 1], data_3d[:, 2], alpha=0.2, label='Original Data') +ax1.scatter(recon_3d[:, 0], recon_3d[:, 1], recon_3d[:, 2], alpha=0.6, label='Projected (Reconstructed) Points') +for i in range(n_samples): + ax1.plot( + [data_3d[i, 0], recon_3d[i, 0]], + [data_3d[i, 1], recon_3d[i, 1]], + [data_3d[i, 2], recon_3d[i, 2]], + 'gray', lw=0.5, alpha=0.5 + ) +ax1.plot_trisurf(plane_3d[:, 0], plane_3d[:, 1], plane_3d[:, 2], alpha=0.3, color='orange') +origin = mean_3d +ax1.quiver(*origin, *V[:, 0]*S2[0]*2, color='r', lw=2) +ax1.quiver(*origin, *V[:, 1]*S2[1]*2, color='blue', lw=2) + +ax1.set_title("PCA in 3D: Projection onto First Two PCs") +ax1.set_xlabel("X") +ax1.set_ylabel("Y") +ax1.set_zlabel("Z") +ax1.legend() + +# 2D projection plot +ax2 = fig.add_subplot(122) +ax2.scatter(proj_2d[:, 0], proj_2d[:, 1], alpha=0.8, c='orange', label='2D Projection') +# draw PC directions +ax2.plot([0, S2[0]*2], [0, 0], color='r', lw=2, label='1st PC') # x-axis +ax2.plot([0, 0], [0, S2[1]*2], color='blue', lw=2, label='2nd PC') # y-axis +ax2.axhline(0, color='gray', lw=0.5) +ax2.axvline(0, color='gray', lw=0.5) +ax2.set_title("Data Projected onto First Two Principal Components") +ax2.set_xlabel("1st PC") +ax2.set_ylabel("2nd PC") +ax2.axis('equal') +ax2.grid(True) +ax2.legend() + +plt.tight_layout() +plt.show() +``` +* **Left panel**: The original 3D data, its projection onto the best-fit 2D PCA plane (orange), and reconstruction lines showing projection error. +* **Right panel**: The same data projected onto the first two principal components, visualized in 2D. + +### Step 1: Center the Data + +We begin by centering the dataset so that the empirical mean is 0: + +$$ +\bar{\mathbf{x}} = \frac{1}{N} \sum_{i=1}^N \mathbf{x}_i, \quad \mathbf{X}_{\text{centered}} = \mathbf{X} - \mathbf{1}_N \bar{\mathbf{x}}^\top +$$ + +Define $\mathbf{X} \leftarrow \mathbf{X}_{\text{centered}}$ for the rest of the derivation. + +--- + +### Step 2: Define the Projection + +Let $\mathbf{U}_k \in \mathbb{R}^{D \times k}$ be an orthonormal matrix: $\mathbf{U}_k^\top \mathbf{U}_k = \mathbf{I}_k$. 
+
+Project each sample $\mathbf{x}_i \in \mathbb{R}^D$ onto the subspace:
+
+$$
+\mathbf{z}_i = \mathbf{U}_k^\top \mathbf{x}_i \quad \text{(coordinates in the new basis)}
+$$
+
+$$
+\hat{\mathbf{x}}_i = \mathbf{U}_k \mathbf{z}_i = \mathbf{U}_k \mathbf{U}_k^\top \mathbf{x}_i \quad \text{(projected vector)}
+$$
+
+The projection matrix is:
+
+$$
+\mathbf{P} = \mathbf{U}_k \mathbf{U}_k^\top
+$$
+
+---
+
+### Step 3: Define the Reconstruction Error
+
+We want to **minimize** the total squared reconstruction error (projection error):
+
+$$
+\sum_{i=1}^N \left\| \mathbf{x}_i - \hat{\mathbf{x}}_i \right\|^2
+= \sum_{i=1}^N \left\| \mathbf{x}_i - \mathbf{U}_k \mathbf{U}_k^\top \mathbf{x}_i \right\|^2
+$$
+
+In matrix form:
+
+$$
+\mathcal{L}(\mathbf{U}_k) = \left\| \mathbf{X} - \mathbf{X} \mathbf{U}_k \mathbf{U}_k^\top \right\|_F^2
+$$
+
+where $\|\cdot\|_F$ denotes the Frobenius norm.
+
+
+---
+
+### Step 4: Reformulate as a Maximization Problem
+
+Instead of minimizing reconstruction error, we **maximize the variance (energy) retained**:
+
+$$
+\text{maximize } \text{tr}\left( \mathbf{U}_k^\top \mathbf{X}^\top \mathbf{X} \mathbf{U}_k \right) \quad \text{subject to } \mathbf{U}_k^\top \mathbf{U}_k = \mathbf{I}
+$$
+
+This comes from noting:
+
+$$
+\|\mathbf{X} \mathbf{U}_k\|_F^2 = \sum_{i=1}^N \|\mathbf{U}_k^\top \mathbf{x}_i\|^2 = \text{tr}\left( \mathbf{U}_k^\top \mathbf{X}^\top \mathbf{X} \mathbf{U}_k \right)
+$$
+
+and that the two objectives are linked by $\left\| \mathbf{X} - \mathbf{X} \mathbf{U}_k \mathbf{U}_k^\top \right\|_F^2 = \|\mathbf{X}\|_F^2 - \|\mathbf{X} \mathbf{U}_k\|_F^2$ (a consequence of $\mathbf{U}_k^\top \mathbf{U}_k = \mathbf{I}$): since $\|\mathbf{X}\|_F^2$ does not depend on $\mathbf{U}_k$, minimizing the reconstruction error is the same as maximizing the retained term $\|\mathbf{X} \mathbf{U}_k\|_F^2$.
+
+---
+
+### Step 5: Solve Using the Spectral Theorem
+
+Let $\mathbf{X}^\top \mathbf{X} = \mathbf{M} \in \mathbb{R}^{D \times D}$. This matrix is symmetric and positive semidefinite.
+
+By the **spectral theorem**, there exists an orthonormal basis of eigenvectors $\mathbf{u}_1, \dots, \mathbf{u}_D$ with eigenvalues $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_D \ge 0$, such that:
+
+$$
+\mathbf{M} = \mathbf{X}^\top \mathbf{X} = \mathbf{U} \Lambda \mathbf{U}^\top
+$$
+
+Choose $\mathbf{U}_k = [\mathbf{u}_1, \dots, \mathbf{u}_k]$ to maximize $\text{tr}( \mathbf{U}_k^\top \mathbf{M} \mathbf{U}_k )$.
+
+This is optimal because the trace is maximized by choosing the eigenvectors with the **largest** eigenvalues (known from the Rayleigh-Ritz and Courant-Fischer principles).
+
+## PCA Derivation Summary
+
+- **Input**: Centered data matrix $\mathbf{X} \in \mathbb{R}^{N \times D}$
+- **Goal**: Find orthonormal matrix $\mathbf{U}_k \in \mathbb{R}^{D \times k}$ that captures most variance
+- **Solution**: Maximize $\text{tr}(\mathbf{U}_k^\top \mathbf{X}^\top \mathbf{X} \mathbf{U}_k)$, subject to $\mathbf{U}_k^\top \mathbf{U}_k = \mathbf{I}$
+- **Optimal**: Columns of $\mathbf{U}_k$ are the top $k$ eigenvectors of $\mathbf{X}^\top \mathbf{X}$
+- **Projection**: $\mathbf{Z} = \mathbf{X} \mathbf{U}_k$
+- **Reconstruction**: $\tilde{\mathbf{X}} = \mathbf{Z} \mathbf{U}_k^\top$
+
+## PCA algorithm step by step
+
+1. Calculate the mean of the data:
+
+$$ \mathbf{\bar{x}} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{x}_i $$
+
+2. Calculate the covariance matrix $\mathbf{S}$ of the data:
+
+$$ \mathbf{S} = \frac{1}{N-1} \sum_{i=1}^{N} (\mathbf{x}_i - \mathbf{\bar{x}})(\mathbf{x}_i - \mathbf{\bar{x}})^T $$
+
+Both the mean and the covariance matrix are calculated by the `empirical_covariance` function defined below.
+
+3. Calculate the eigenvalues $\lambda_i$ and eigenvectors $\mathbf{u}_i$ of the covariance matrix $\mathbf{S}$.
+
+4. Sort the eigenvalues in descending order and then sort the eigenvectors accordingly.
+Create a principal components matrix $\mathbf{U}$ by taking the first $k$ eigenvectors, where $k$ is the number of dimensions we want to keep.
+This step is implemented in the `fit` method of the `PCA` class.
+
+5. To project the data onto the new space, we can use the following formula:
+
+$$ \mathbf{Y} = (\mathbf{X} - \mathbf{1}_N \mathbf{\bar{x}}^\top) \cdot \mathbf{U} $$
+This step is implemented in the `transform` method of the `PCA` class, which first centers the data by subtracting the mean.
+
+6. To reconstruct the data, we can use the following formula:
+
+$$ \mathbf{\tilde{X}} = \mathbf{Y} \cdot \mathbf{U}^T + \mathbf{1}_N \mathbf{\bar{x}}^\top $$
+This step is implemented in the `reverse_transform` method of the `PCA` class.
+
+Note that for $k < D$ reconstructing the data will not give us the original data: $\mathbf{X} \neq \mathbf{\tilde{X}}$.
+
+## Implementation
+
+For the PCA algorithm we implement an `empirical_covariance` function that is used for calculating the mean and the covariance of the data.
+
+```{code-cell} ipython3
+def empirical_covariance(X):
+    """
+    Calculates the empirical covariance matrix for a given dataset.
+
+    Parameters:
+    X (numpy.ndarray): A 2D numpy array where rows represent samples and columns represent features.
+
+    Returns:
+    tuple: A tuple containing the mean of the dataset and the covariance matrix.
+    """
+    N = X.shape[0]  # Number of samples
+    mean = X.mean(axis=0)  # Calculate the mean of each feature
+    X_centered = X - mean[np.newaxis, :]  # Center the data by subtracting the mean
+    covariance = X_centered.T @ X_centered / (N - 1)  # Compute the covariance matrix
+    return mean, covariance
+```
+
+We also implement a `PCA` class with `fit`, `transform` and `reverse_transform` methods.
+
+```{code-cell} ipython3
+class PCA:
+    def __init__(self, k=None):
+        """
+        Initializes the PCA class without any components.
+
+        Parameters:
+        k (int, optional): Number of principal components to use.
+        """
+        self.pc_variances = None  # Eigenvalues of the covariance matrix
+        self.principal_components = None  # Eigenvectors of the covariance matrix
+        self.mean = None  # Mean of the dataset
+        self.k = k  # the number of dimensions to keep
+
+    def fit(self, X):
+        """
+        Fit the PCA model to the dataset by computing the covariance matrix and its eigen decomposition.
+
+        Parameters:
+        X (numpy.ndarray): The data to fit the model on.
+        """
+        self.mean, covariance = empirical_covariance(X=X)
+        eig_values, eig_vectors = np.linalg.eigh(covariance)  # Compute eigenvalues and eigenvectors
+        self.pc_variances = eig_values[::-1]  # eigh returns the eigenvalues in ascending order; we want them in descending order (largest first)
+        self.principal_components = eig_vectors[:, ::-1]  # reorder the eigenvectors to match the descending eigenvalues
+        if self.k is not None:
+            self.pc_variances = self.pc_variances[:self.k]
+            self.principal_components = self.principal_components[:, :self.k]

+    def transform(self, X):
+        """
+        Transform the data into the principal component space.
+
+        Parameters:
+        X (numpy.ndarray): Data to transform.
+
+        Returns:
+        numpy.ndarray: Transformed data.
+        """
+        X_centered = X - self.mean
+        return X_centered @ self.principal_components
+
+    def reverse_transform(self, Z):
+        """
+        Transform data back to its original space.
+
+        Parameters:
+        Z (numpy.ndarray): Transformed data to invert.
+
+        Returns:
+        numpy.ndarray: Data in its original space.
+        """
+        return Z @ self.principal_components.T + self.mean
+
+    def variance_explained(self):
+        """
+        Returns the amount of variance explained by the first k principal components.
+
+        Returns:
+        numpy.ndarray: Variances explained by the first k components.
+ """ + return self.pc_variances +``` + +In the example below, we will use the PCA algorithm to reduce the dimensionality of a genetic dataset from the 1000 genomes project [1,2]. + +[1] Auton, A. et al. A global reference for human genetic variation. Nature 526, 68–74 (2015) + +[2] Altshuler, D. M. et al. Integrating common and rare genetic variation in diverse human populations. Nature 467, 52–58 (2010) + +After reducing the dimensionality, we will plot the results and examine whether clusters of ancestries are visible. + +We consider five ancestries in the dataset: + +- **EUR** - European +- **AFR** - African +- **EAS** - East Asian +- **SAS** - South Asian +- **AMR** - Native American + +```{code-cell} ipython3 +:tags: [hide-input] +snpreader = Bed('../../datasets/genetic_data_1kg/example2.bed', count_A1=True) +data = snpreader.read() +print(data.shape) +# y includes our labels and x includes our features +labels = pd.read_csv("../../datasets/genetic_data_1kg/1kg_annotations_edit.txt", sep="\t", index_col="Sample") +list1 = data.iid[:,1].tolist() #list with the Sample numbers present in genetic dataset +labels = labels[labels.index.isin(list1)] #filter labels DataFrame so it only contains the sampleIDs present in genetic data +y = labels.SuperPopulation # EUR, AFR, AMR, EAS, SAS +X = data.val[:, ~np.isnan(data.val).any(axis=0)] #load genetic data to X, removing NaN values +pca = PCA() +pca.fit(X=X) + +X_pc = pca.transform(X) +X_reconstruction_full = pca.reverse_transform(X_pc) +print("L1 reconstruction error for full PCA : %.4E " % (np.absolute(X - X_reconstruction_full).sum())) + +for rank in range(5): #more correct: X_pc.shape[1]+1 + pca_lowrank = PCA(k=rank) + pca_lowrank.fit(X=X) + X_lowrank = pca_lowrank.transform(X) + X_reconstruction = pca_lowrank.reverse_transform(X_lowrank) + print("L1 reconstruction error for rank %i PCA : %.4E " % (rank, np.absolute(X - X_reconstruction).sum())) + +fig = plt.figure() +plt.plot(X_pc[y=="EUR"][:,0], X_pc[y=="EUR"][:,1],'.', alpha = 0.3) +plt.plot(X_pc[y=="AFR"][:,0], X_pc[y=="AFR"][:,1],'.', alpha = 0.3) +plt.plot(X_pc[y=="EAS"][:,0], X_pc[y=="EAS"][:,1],'.', alpha = 0.3) +plt.plot(X_pc[y=="AMR"][:,0], X_pc[y=="AMR"][:,1],'.', alpha = 0.3) +plt.plot(X_pc[y=="SAS"][:,0], X_pc[y=="SAS"][:,1],'.', alpha = 0.3) +plt.xlabel("PC 1") +plt.ylabel("PC 2") +plt.legend(["EUR", "AFR","EAS","AMR","SAS"]) + +fig2 = plt.figure() +plt.plot(X_pc[y=="EUR"][:,0], X_pc[y=="EUR"][:,2],'.', alpha = 0.3) +plt.plot(X_pc[y=="AFR"][:,0], X_pc[y=="AFR"][:,2],'.', alpha = 0.3) +plt.plot(X_pc[y=="EAS"][:,0], X_pc[y=="EAS"][:,2],'.', alpha = 0.3) +plt.plot(X_pc[y=="AMR"][:,0], X_pc[y=="AMR"][:,2],'.', alpha = 0.3) +plt.plot(X_pc[y=="SAS"][:,0], X_pc[y=="SAS"][:,2],'.', alpha = 0.3) +plt.xlabel("PC 1") +plt.ylabel("PC 3") +plt.legend(["EUR", "AFR","EAS","AMR","SAS"]) + + +fig3 = plt.figure() +plt.plot(X_pc[y=="EUR"][:,1], X_pc[y=="EUR"][:,2],'.', alpha = 0.3) +plt.plot(X_pc[y=="AFR"][:,1], X_pc[y=="AFR"][:,2],'.', alpha = 0.3) +plt.plot(X_pc[y=="EAS"][:,1], X_pc[y=="EAS"][:,2],'.', alpha = 0.3) +plt.plot(X_pc[y=="AMR"][:,1], X_pc[y=="AMR"][:,2],'.', alpha = 0.3) +plt.plot(X_pc[y=="SAS"][:,1], X_pc[y=="SAS"][:,2],'.', alpha = 0.3) +plt.xlabel("PC 2") +plt.ylabel("PC 3") +plt.legend(["EUR", "AFR","EAS","AMR","SAS"]) + +fig4 = plt.figure() +plt.plot(pca.variance_explained()) +plt.xlabel("PC dimension") +plt.ylabel("variance explained") + +fig4 = plt.figure() +plt.plot(pca.variance_explained().cumsum() / pca.variance_explained().sum()) +plt.xlabel("PC dimension") 
+plt.ylabel("cumulative fraction of variance explained") +plt.show() +``` \ No newline at end of file diff --git a/drafts/chapter_decompositions/psd_matrices.md b/book/chapter_decompositions/psd_matrices.md similarity index 63% rename from drafts/chapter_decompositions/psd_matrices.md rename to book/chapter_decompositions/psd_matrices.md index ab14ce8..90fdf61 100644 --- a/drafts/chapter_decompositions/psd_matrices.md +++ b/book/chapter_decompositions/psd_matrices.md @@ -1,62 +1,94 @@ -## Positive (semi-)definite matrices - -A symmetric matrix $\mathbf{A}$ is **positive semi-definite** if for all -$\mathbf{x} \in \mathbb{R}^n$, -$\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} \geq 0$. Sometimes people -write $\mathbf{A} \succeq 0$ to indicate that $\mathbf{A}$ is positive +--- +jupytext: + text_representation: + extension: .md + format_name: myst + format_version: 0.13 + jupytext_version: 1.16.7 +kernelspec: + display_name: Python 3 + language: python + name: python3 +--- +# Positive (semi-)definite matrices + +>A symmetric matrix $\mathbf{A}$ is **positive semi-definite** if for all $\mathbf{x} \in \mathbb{R}^n$, $\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} \geq 0$. +> +>Sometimes people write $\mathbf{A} \succeq 0$ to indicate that $\mathbf{A}$ is positive semi-definite. -A symmetric matrix $\mathbf{A}$ is **positive definite** if for all -nonzero $\mathbf{x} \in \mathbb{R}^n$, -$\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} > 0$. Sometimes people write -$\mathbf{A} \succ 0$ to indicate that $\mathbf{A}$ is positive definite. +> A symmetric matrix $\mathbf{A}$ is **positive definite** if for all nonzero $\mathbf{x} \in \mathbb{R}^n$, $\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} > 0$. +> +>Sometimes people write $\mathbf{A} \succ 0$ to indicate that $\mathbf{A}$ is positive definite. + Note that positive definiteness is a strictly stronger property than positive semi-definiteness, in the sense that every positive definite matrix is positive semi-definite but not vice-versa. These properties are related to eigenvalues in the following way. -*Proposition.* +:::{prf:proposition} Eigenvalues of Positive Definite Matrices +:label: trm-psd-eigenvalues +:nonumber: A symmetric matrix is positive semi-definite if and only if all of its eigenvalues are nonnegative, and positive definite if and only if all of its eigenvalues are positive. +::: - -*Proof.* Suppose $A$ is positive semi-definite, and let $\mathbf{x}$ be -an eigenvector of $\mathbf{A}$ with eigenvalue $\lambda$. Then +:::{prf:proof} +Suppose $A$ is positive semi-definite, and let $\mathbf{x}$ be +an eigenvector of $\mathbf{A}$ with eigenvalue $\lambda$. +Then $$0 \leq \mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} = \mathbf{x}^{\!\top\!}(\lambda\mathbf{x}) = \lambda\mathbf{x}^{\!\top\!}\mathbf{x} = \lambda\|\mathbf{x}\|_2^2$$ Since $\mathbf{x} \neq \mathbf{0}$ (by the assumption that it is an eigenvector), we have $\|\mathbf{x}\|_2^2 > 0$, so we can divide both -sides by $\|\mathbf{x}\|_2^2$ to arrive at $\lambda \geq 0$. If -$\mathbf{A}$ is positive definite, the inequality above holds strictly, -so $\lambda > 0$. This proves one direction. +sides by $\|\mathbf{x}\|_2^2$ to arrive at $\lambda \geq 0$. + +If $\mathbf{A}$ is positive definite, the inequality above holds strictly, +so $\lambda > 0$. + +This proves one direction. To simplify the proof of the other direction, we will use the machinery -of Rayleigh quotients. Suppose that $\mathbf{A}$ is symmetric and all -its eigenvalues are nonnegative. Then for all +of Rayleigh quotients. 
+ +Suppose that $\mathbf{A}$ is symmetric and all +its eigenvalues are nonnegative. + +Then for all $\mathbf{x} \neq \mathbf{0}$, $$0 \leq \lambda_{\min}(\mathbf{A}) \leq R_\mathbf{A}(\mathbf{x})$$ Since $\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x}$ matches $R_\mathbf{A}(\mathbf{x})$ in sign, we conclude that $\mathbf{A}$ is -positive semi-definite. If the eigenvalues of $\mathbf{A}$ are all +positive semi-definite. + +If the eigenvalues of $\mathbf{A}$ are all strictly positive, then $0 < \lambda_{\min}(\mathbf{A})$, whence it follows that $\mathbf{A}$ is positive definite. ◻ +::: + +## Gram matrices +In many machine learning algorithms, especially those involving regression, classification, or kernel methods, we frequently work with **data matrices** $\mathbf{A} \in \mathbb{R}^{m \times n}$, where each **row** represents a sample and each **column** a feature. From such matrices, we often compute **matrices of inner products** like $\mathbf{A}^\top \mathbf{A}$. These matrices — called **Gram matrices** — encode the pairwise **similarity between features** (or, in kernelized settings, between samples), and play a central role in optimization problems such as least squares, ridge regression, and principal component analysis. +:::{prf:proposition} Gram Matrices +:label: trm-gram-matrices +:nonumber: -As an example of how these matrices arise, consider +Suppose $\mathbf{A} \in \mathbb{R}^{m \times n}$. -*Proposition.* -Suppose $\mathbf{A} \in \mathbb{R}^{m \times n}$. Then -$\mathbf{A}^{\!\top\!}\mathbf{A}$ is positive semi-definite. If -$\operatorname{null}(\mathbf{A}) = \{\mathbf{0}\}$, then +Then $\mathbf{A}^{\!\top\!}\mathbf{A}$ is positive semi-definite. + +If $\operatorname{null}(\mathbf{A}) = \{\mathbf{0}\}$, then $\mathbf{A}^{\!\top\!}\mathbf{A}$ is positive definite. +::: +:::{prf:proof} -*Proof.* For any $\mathbf{x} \in \mathbb{R}^n$, +For any $\mathbf{x} \in \mathbb{R}^n$, $$\mathbf{x}^{\!\top\!} (\mathbf{A}^{\!\top\!}\mathbf{A})\mathbf{x} = (\mathbf{A}\mathbf{x})^{\!\top\!}(\mathbf{A}\mathbf{x}) = \|\mathbf{A}\mathbf{x}\|_2^2 \geq 0$$ @@ -65,28 +97,43 @@ so $\mathbf{A}^{\!\top\!}\mathbf{A}$ is positive semi-definite. Note that $\|\mathbf{A}\mathbf{x}\|_2^2 = 0$ implies $\|\mathbf{A}\mathbf{x}\|_2 = 0$, which in turn implies $\mathbf{A}\mathbf{x} = \mathbf{0}$ (recall that this is a property of -norms). If $\operatorname{null}(\mathbf{A}) = \{\mathbf{0}\}$, +norms). + +If $\operatorname{null}(\mathbf{A}) = \{\mathbf{0}\}$, $\mathbf{A}\mathbf{x} = \mathbf{0}$ implies $\mathbf{x} = \mathbf{0}$, so $\mathbf{x}^{\!\top\!} (\mathbf{A}^{\!\top\!}\mathbf{A})\mathbf{x} = 0$ if and only if $\mathbf{x} = \mathbf{0}$, and thus $\mathbf{A}^{\!\top\!}\mathbf{A}$ is positive definite. ◻ +::: + +We observe that kernel matrices computed for all pairs of instances in a data set are positive semi definite. In fact, many kernel functions, like for example the RBF kernel, guarantee positive definiteness of the kernel matrix as long as all data points are pairwise distinct. + + +## Invertibility Positive definite matrices are invertible (since their eigenvalues are -nonzero), whereas positive semi-definite matrices might not be. However, -if you already have a positive semi-definite matrix, it is possible to +nonzero), whereas positive semi-definite matrices might not be. + +However, if you already have a positive semi-definite matrix, it is possible to perturb its diagonal slightly to produce a positive definite matrix. 
-*Proposition.* +:::{prf:proposition} +:label: trm-A-plus-eps +:nonumber: + If $\mathbf{A}$ is positive semi-definite and $\epsilon > 0$, then $\mathbf{A} + \epsilon\mathbf{I}$ is positive definite. +::: -*Proof.* Assuming $\mathbf{A}$ is positive semi-definite and +:::{prf:proof} +Assuming $\mathbf{A}$ is positive semi-definite and $\epsilon > 0$, we have for any $\mathbf{x} \neq \mathbf{0}$ that $$\mathbf{x}^{\!\top\!}(\mathbf{A}+\epsilon\mathbf{I})\mathbf{x} = \mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} + \epsilon\mathbf{x}^{\!\top\!}\mathbf{I}\mathbf{x} = \underbrace{\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x}}_{\geq 0} + \underbrace{\epsilon\|\mathbf{x}\|_2^2}_{> 0} > 0$$ as claimed. ◻ +::: An obvious but frequently useful consequence of the two propositions we have just shown is that @@ -94,22 +141,28 @@ $\mathbf{A}^{\!\top\!}\mathbf{A} + \epsilon\mathbf{I}$ is positive definite (and in particular, invertible) for *any* matrix $\mathbf{A}$ and any $\epsilon > 0$. -### The geometry of positive definite quadratic forms +## The geometry of positive definite quadratic forms A useful way to understand quadratic forms is by the geometry of their -level sets. A **level set** or **isocontour** of a function is the set +level sets. +A **level set** or **isocontour** of a function is the set of all inputs such that the function applied to those inputs yields a -given output. Mathematically, the $c$-isocontour of $f$ is +given output. + +Mathematically, the $c$-isocontour of $f$ is $\{\mathbf{x} \in \operatorname{dom} f : f(\mathbf{x}) = c\}$. Let us consider the special case $f(\mathbf{x}) = \mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x}$ where -$\mathbf{A}$ is a positive definite matrix. Since $\mathbf{A}$ is +$\mathbf{A}$ is a positive definite matrix. + +Since $\mathbf{A}$ is positive definite, it has a unique matrix square root $\mathbf{A}^{\frac{1}{2}} = \mathbf{Q}\mathbf{\Lambda}^{\frac{1}{2}}\mathbf{Q}^{\!\top\!}$, where $\mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^{\!\top\!}$ is the eigendecomposition of $\mathbf{A}$ and $\mathbf{\Lambda}^{\frac{1}{2}} = \operatorname{diag}(\sqrt{\lambda_1}, \dots \sqrt{\lambda_n})$. + It is easy to see that this matrix $\mathbf{A}^{\frac{1}{2}}$ is positive definite (consider its eigenvalues) and satisfies $\mathbf{A}^{\frac{1}{2}}\mathbf{A}^{\frac{1}{2}} = \mathbf{A}$. Fixing @@ -121,10 +174,16 @@ $$c = \mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} = \mathbf{x}^{\!\top\!}\mathbf{A where we have used the symmetry of $\mathbf{A}^{\frac{1}{2}}$. Making the change of variable $\mathbf{z} = \mathbf{A}^{\frac{1}{2}}\mathbf{x}$, we have the condition -$\|\mathbf{z}\|_2 = \sqrt{c}$. That is, the values $\mathbf{z}$ lie on a -sphere of radius $\sqrt{c}$. These can be parameterized as +$\|\mathbf{z}\|_2 = \sqrt{c}$. + +That is, the values $\mathbf{z}$ lie on a +sphere of radius $\sqrt{c}$. + +These can be parameterized as $\mathbf{z} = \sqrt{c}\hat{\mathbf{z}}$ where $\hat{\mathbf{z}}$ has -$\|\hat{\mathbf{z}}\|_2 = 1$. Then since +$\|\hat{\mathbf{z}}\|_2 = 1$. + +Then since $\mathbf{A}^{-\frac{1}{2}} = \mathbf{Q}\mathbf{\Lambda}^{-\frac{1}{2}}\mathbf{Q}^{\!\top\!}$, we have @@ -132,7 +191,9 @@ $$\mathbf{x} = \mathbf{A}^{-\frac{1}{2}}\mathbf{z} = \mathbf{Q}\mathbf{\Lambda}^ where $\tilde{\mathbf{z}} = \mathbf{Q}^{\!\top\!}\hat{\mathbf{z}}$ also satisfies $\|\tilde{\mathbf{z}}\|_2 = 1$ since $\mathbf{Q}$ is -orthogonal. Using this parameterization, we see that the solution set +orthogonal. 
+
+Using this parameterization, we see that the solution set
 $\{\mathbf{x} \in \mathbb{R}^n : f(\mathbf{x}) = c\}$ is the image of
 the unit sphere
 $\{\tilde{\mathbf{z}} \in \mathbb{R}^n : \|\tilde{\mathbf{z}}\|_2 = 1\}$
@@ -164,7 +225,7 @@ $$\mathbf{Q}\mathbf{e}_i = \sum_{j=1}^n [\mathbf{e}_i]_j\mathbf{q}_j = \mathbf{q
 
 where we have used the matrix-vector product identity from earlier.
 
-In summary: the isocontours of
+**In summary:** the isocontours of
 $f(\mathbf{x}) = \mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x}$ are
 ellipsoids such that the axes point in the directions of the
 eigenvectors of $\mathbf{A}$, and the radii of these axes are
@@ -172,3 +233,5 @@ proportional to the inverse square roots of the corresponding
 eigenvalues.
 
 
+To demonstrate the eigenvalue decomposition of a positive semi-definite matrix, we will look at the Principal Component Analysis (PCA) algorithm in the next section. The algorithm is a technique used for applications like dimensionality reduction, lossy data compression, feature extraction and data visualization.
+
diff --git a/book/chapter_decompositions/pseudoinverse.md b/book/chapter_decompositions/pseudoinverse.md
new file mode 100644
index 0000000..17bde17
--- /dev/null
+++ b/book/chapter_decompositions/pseudoinverse.md
@@ -0,0 +1,217 @@
+---
+jupytext:
+  text_representation:
+    extension: .md
+    format_name: myst
+    format_version: 0.13
+    jupytext_version: 1.16.7
+kernelspec:
+  display_name: Python 3
+  language: python
+  name: python3
+---
+# Moore-Penrose Pseudoinverse
+The Moore-Penrose pseudoinverse is a generalization of the matrix inverse that can be applied to non-square or singular matrices. It is denoted as $ \mathbf{A}^+ $ for a matrix $ \mathbf{A} $.
+
+The pseudoinverse is the unique matrix $ \mathbf{A}^+ $ that satisfies the four defining (Moore-Penrose) properties:
+- $ \mathbf{A} \mathbf{A}^+ \mathbf{A} = \mathbf{A} $
+- $ \mathbf{A}^+ \mathbf{A} \mathbf{A}^+ = \mathbf{A}^+ $
+- $ (\mathbf{A} \mathbf{A}^+)^\top = \mathbf{A} \mathbf{A}^+ $
+- $ (\mathbf{A}^+ \mathbf{A})^\top = \mathbf{A}^+ \mathbf{A} $
+
+In addition, the following facts hold:
+- **Existence**: The pseudoinverse exists for any matrix $ \mathbf{A} $.
+- **Uniqueness**: The pseudoinverse is unique.
+- **Rank**: The rank of $ \mathbf{A}^+ $ is equal to the rank of $ \mathbf{A} $.
+
+**Least Squares Solution**: The pseudoinverse provides a least squares solution to the equation $ \mathbf{A}\mathbf{x} = \mathbf{b} $ when $ \mathbf{A} $ is not square or the system has no unique solution. The minimum-norm least squares solution is given by:
+
+ $$
+ \mathbf{x} = \mathbf{A}^+ \mathbf{b}
+ $$
+**Geometric Interpretation**: Geometrically, $ \mathbf{A}\mathbf{A}^+ $ is the orthogonal projection onto the column space of $ \mathbf{A} $, and $ \mathbf{A}^+\mathbf{A} $ is the orthogonal projection onto the row space of $ \mathbf{A} $.
+**Computational Considerations**: The computation of the pseudoinverse can be done efficiently using numerical methods, such as the SVD, especially for large matrices.
+**Limitations**: The pseudoinverse may not be suitable for all applications, especially when the matrix is ill-conditioned or has a high condition number.
+
+
+## The Pseudoinverse in Linear Regression
+
+In linear regression, we often encounter the problem of finding the best-fitting line (or hyperplane) through a set of data points. The Moore-Penrose pseudoinverse provides a tool for solving this problem, especially when the design matrix is not square or is singular.
+
+### 1.
**Ordinary Least Squares (OLS) Problem** + +Given data matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$, and target $\mathbf{y} \in \mathbb{R}^{n}$, the OLS problem is: + +**Objective**: Minimize the squared error between predictions and targets: + +$$ +\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} |\mathbf{X}\boldsymbol{\beta} - \mathbf{y}|^2 +$$ + +This is a quadratic problem with a closed-form solution if $ \mathbf{X}^\top \mathbf{X} $ is invertible: + +**OLS solution**: + +$$ +\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y} +$$ + +--- + +### 2. **Observe: This Has the Structure of a Pseudoinverse** + +We now define: + +$$ +\mathbf{X}^+ := (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top +$$ + +We claim that $ \mathbf{X}^+ $ is the **Moore–Penrose pseudoinverse of $ \mathbf{X} $**, and it satisfies the four defining properties — **if $ \mathbf{X} $ has full column rank**: + +1. $ \mathbf{X}\mathbf{X}^+\mathbf{X} = \mathbf{X} $ +2. $ \mathbf{X}^+\mathbf{X}\mathbf{X}^+ = \mathbf{X}^+ $ +3. $ (\mathbf{X}\mathbf{X}^+)^\top = \mathbf{X}\mathbf{X}^+ $ +4. $ (\mathbf{X}^+\mathbf{X})^\top = \mathbf{X}^+\mathbf{X} $ + +--- + +### 3. **State the General Case: Unique Pseudoinverse Always Exists** + +Regardless of whether $ \mathbf{X} $ has full column rank or not, there is a **unique** matrix $ \mathbf{X}^+ \in \mathbb{R}^{d \times n} $ satisfying all four Moore–Penrose conditions. + +This is the **Moore–Penrose pseudoinverse**, and: + +$$ +\hat{\boldsymbol{\beta}} = \mathbf{X}^+ \mathbf{y} +$$ + +still gives the **minimum-norm least squares solution**, even if $\mathbf{X}$ is not full rank. + +--- + +### 4. **Numerical Example Using NumPy’s `pinv`** + +```{code-cell} ipython3 +import numpy as np +import matplotlib.pyplot as plt + +# Simulate linear data with collinearity +np.random.seed(1) +n, d = 100, 5 +X = np.random.randn(n, d) +X[:, 3] = X[:, 1] + 0.01 * np.random.randn(n) # make column 3 nearly linearly dependent on column 1 +beta_true = np.array([2.0, -1.0, 0.0, 0.5, 3.0]) +y = X @ beta_true + np.random.randn(n) * 0.5 + +# Compute OLS via pseudoinverse +X_pinv = np.linalg.pinv(X) # Moore–Penrose pseudoinverse +beta_hat = X_pinv @ y + +# Compare with np.linalg.lstsq (which uses SVD internally) +beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None) + +print("True coefficients: ", beta_true) +print("Estimated (pinv): ", beta_hat) +print("Estimated (lstsq): ", beta_lstsq) + +# Prediction +y_hat = X @ beta_hat + +# Visualization +plt.scatter(y, y_hat, alpha=0.6) +plt.plot([y.min(), y.max()], [y.min(), y.max()], 'r--', label='Ideal') +plt.xlabel('True y') +plt.ylabel(r'Predicted $\hat{y}$') +plt.title('Linear Regression via Pseudoinverse') +plt.legend() +plt.grid(True) +plt.show() +``` +--- + +The **OLS formula** is a special case of the **pseudoinverse**, valid under full column rank. +The **pseudoinverse is unique** and always provides a **least-squares solution**. + +Note that `numpy.linalg.pinv` computes the **Moore–Penrose pseudoinverse** using the **Singular Value Decomposition (SVD)**. This method is both **general** and **numerically stable**, making it well-suited for pseudoinverse computation even when the matrix is not full rank. + +## How `np.linalg.pinv` Works Internally + +Suppose you have a matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$. 
+ +The pseudoinverse $\mathbf{X}^+ \in \mathbb{R}^{d \times n}$ is computed as follows: + +### **Step 1: Perform reduced SVD** + +```python +U, S, Vt = np.linalg.svd(X, full_matrices=False) +``` + +This gives: + +$$ +\mathbf{X} = \mathbf{U}_r \boldsymbol{\Sigma}_r \mathbf{V}_r^\top +$$ + +Where: + +* $\mathbf{U}_r \in \mathbb{R}^{n \times r}$, with orthonormal columns +* $\boldsymbol{\Sigma}_r \in \mathbb{R}^{r \times r}$, diagonal matrix with singular values $\sigma_1, \dots, \sigma_r$ +* $\mathbf{V}_r \in \mathbb{R}^{d \times r}$, with orthonormal columns +* $r = \text{rank}(\mathbf{X})$ + +--- + +### **Step 2: Invert the Non-Zero Singular Values** + +You construct the diagonal matrix $\boldsymbol{\Sigma}_r^+ \in \mathbb{R}^{r \times r}$ as: + +$$ +\Sigma^+_{ii} = \begin{cases} +1/\sigma_i & \text{if } \sigma_i > \text{rcond} \cdot \sigma_{\max} \\ +0 & \text{otherwise} +\end{cases} +$$ + +This step **thresholds small singular values** using the `rcond` parameter (default: machine epsilon). + +--- + +### **Step 3: Recompose the Pseudoinverse** + +The pseudoinverse is then: + +$$ +\mathbf{X}^+ = \mathbf{V}_r \boldsymbol{\Sigma}_r^+ \mathbf{U}_r^\top +$$ + +In code: + +```{code-cell} ipython3 +def pinv_manual(X, rcond=1e-15): + U, S, Vt = np.linalg.svd(X, full_matrices=False) + S_inv = np.array([1/s if s > rcond * S[0] else 0 for s in S]) + return Vt.T @ np.diag(S_inv) @ U.T +``` + +--- + +### ✅ Advantages of SVD-Based Pseudoinverse + +* **Numerically stable**: even if $\mathbf{X}^\top \mathbf{X}$ is ill-conditioned +* **General**: works for rank-deficient or rectangular matrices +* **Gives minimum-norm solution** to $\mathbf{X}\boldsymbol{\beta} = \mathbf{y}$ + +--- + +### 🧪 Check with NumPy + +You can verify this approach: + +```{code-cell} ipython3 +X = np.random.randn(5, 3) +X_pinv_np = np.linalg.pinv(X) +X_pinv_manual = pinv_manual(X) + +np.allclose(X_pinv_np, X_pinv_manual) # Should be True +``` + diff --git a/book/chapter_decompositions/row_equivalence.md b/book/chapter_decompositions/row_equivalence.md new file mode 100644 index 0000000..2823d90 --- /dev/null +++ b/book/chapter_decompositions/row_equivalence.md @@ -0,0 +1,1047 @@ +--- +jupytext: + text_representation: + extension: .md + format_name: myst + format_version: 0.13 + jupytext_version: 1.16.7 +kernelspec: + display_name: Python 3 + language: python + name: python3 +--- +# Gaussian Elimination and the PLU Decomposition + +Gaussian elimination is one of the most fundamental algorithms in linear algebra. +It provides a systematic procedure for simplifying matrices using elementary row operations and lies at the heart of solving linear systems, computing inverses, determining rank, and understanding matrix structure. + +This section introduces the core concepts and forms related to Gaussian elimination: + +* **Row Echelon Form (REF)**: A simplified form of a matrix that resembles an upper-triangular structure. REF is sufficient for solving linear systems using **back substitution**. +* **Reduced Row Echelon Form (RREF)**: A further simplified and canonical form where each pivot is 1 and the only nonzero entry in its column. RREF enables direct reading of solutions to linear systems. +* **Row Equivalence**: The idea that matrices related through row operations preserve important properties such as solvability and rank. +* **Gaussian Elimination**: The algorithm used to transform matrices into REF, using a sequence of elementary row operations. 
+ +Throughout this section, we will define these forms, illustrate them with examples, and demonstrate how they relate to one another and to solving equations of the form $\mathbf{A}\mathbf{x} = \mathbf{b}$. +We will also discuss when a matrix is invertible based on its row-reduced form, and how to use back substitution after performing Gaussian elimination. + +This foundation is essential for understanding many areas of applied linear algebra, from numerical methods to machine learning. + + +--- + +## Elementary Row Operations + +One of the most important facts underlying **Gaussian elimination** is that the following **elementary row operations** do not change the solution set of a linear system. +That is, if we apply these operations to both the matrix $\mathbf{A}$ and the right-hand side $\mathbf{b}$ in the system $\mathbf{A}\mathbf{x} = \mathbf{b}$, we obtain an **equivalent system** with the **same solutions**. + +1. **Swap** two rows + $R_i \leftrightarrow R_j$ + +2. **Scale** a row by a nonzero scalar + $R_i \to \alpha R_i$, $\alpha \neq 0$ + +3. **Add a multiple** of one row to another + $R_i \to R_i + \beta R_j$ + +--- + +:::{prf:theorem} Elementary Row Operations Preserve Solution Sets +:label: trm-elementary-rop-operations + +Let $\mathbf{A} \mathbf{x} = \mathbf{b}$ be a system of linear equations. + +If we apply a finite sequence of **elementary row operations** to both $\mathbf{A}$ and $\mathbf{b}$, the resulting system has the **same solution set** as the original. + +That is, $\mathbf{A} \sim \mathbf{A}'$ implies: + +$$ +\mathbf{A} \mathbf{x} = \mathbf{b} \iff \mathbf{A}' \mathbf{x} = \mathbf{b}' +$$ + +::: + +:::{prf:proof} + +Each elementary row operation corresponds to **left-multiplication** of both sides of the equation by an **invertible matrix** $\mathbf{C}$: + +$$ +\mathbf{A} \mathbf{x} = \mathbf{b} \iff \mathbf{C}\mathbf{A} \mathbf{x} = \mathbf{C}\mathbf{b} +$$ + +* Swapping rows ↔ permutation matrix +* Scaling a row ↔ diagonal matrix +* Adding a multiple of one row to another ↔ elementary row matrix + +Since invertible matrices preserve linear equivalence, applying these operations preserves the solution set. +::: + +## Row Echelon Form (REF) (🇩🇪 **Zeilen-Stufenform**) + +A matrix is said to be in **row echelon form (REF)** if it satisfies the following conditions: + +1. **All zero rows** (if any) appear **at the bottom** of the matrix. +2. The **leading entry** (or pivot) of each nonzero row is strictly to the **right** of the leading entry of the row above it. +3. All entries **below** a pivot are **zero**. + +--- + +The following is a matrix in row echelon form: + +$$ +\begin{bmatrix} +1 & 2 & 3 \\ +0 & 1 & 4 \\ +0 & 0 & 5 +\end{bmatrix} +$$ + +But this is **not** in row echelon form (pivot in row 3 is not to the right of the pivot in row 2): + +$$ +\begin{bmatrix} +1 & 2 & 3 \\ +0 & 0 & 1 \\ +0 & 1 & 0 +\end{bmatrix} +$$ + +A matrix in **REF** is a "staircase" matrix with each nonzero row starting further to the right, and all entries below each pivot are zero: + +$$ +\boxed{ +\text{REF = Upper triangular-like form from which back substitution is possible} +} +$$ + + +## Reduced Row Echelon Form (RREF) + +A matrix is in **reduced row echelon form (RREF)** if it satisfies **all the conditions of row echelon form (REF)**, *plus* two additional conditions: + +--- + +### Conditions for RREF + +1. **REF conditions**: + + * All nonzero rows are above any all-zero rows. 
+ * Each leading (nonzero) entry of a row is strictly to the right of the leading entry of the row above it. + * All entries below a pivot are zero. + +2. **Additional conditions**: + + * Each **pivot is equal to 1** (i.e. all leading entries are 1). + * Each pivot is the **only nonzero entry in its column**. + +--- + +$$ +\begin{bmatrix} +1 & 0 & 2 \\ +0 & 1 & -3 \\ +0 & 0 & 0 +\end{bmatrix} +$$ + +This is in **RREF** because: + +* Each pivot is 1. +* Each pivot column contains only one nonzero entry (the pivot itself). +* Pivots step to the right as you go down the rows. +* Zero row is at the bottom. + +--- + +$$ +\begin{bmatrix} +2 & 4 & 6 \\ +0 & 3 & 9 \\ +0 & 0 & 1 +\end{bmatrix} +$$ + +This is in **REF**, but not RREF: + +* It satisfies the "staircase" structure. +* But pivots are not 1. +* There are other nonzero entries in pivot columns. + +--- + +## REF vs. RREF: Key Differences + +| Feature | REF | RREF | +| -------------------------------- | ---------------- | ---- | +| Zero rows at bottom | ✅ | ✅ | +| Pivots step to the right | ✅ | ✅ | +| Zeros below pivots | ✅ | ✅ | +| Pivots are 1 | ❌ (not required) | ✅ | +| Zeros **above and below** pivots | ❌ | ✅ | + +--- + +## Gaussian Elimination + +**Gaussian elimination** is a method for solving systems of linear equations by systematically transforming the coefficient matrix into **row echelon form** (REF) using the **elementary row operations** defined above. + +It is one of the fundamental algorithms in linear algebra and underlies techniques such as solving $\mathbf{A}\mathbf{x} = \mathbf{b}$, computing the rank, and inverting matrices. +If we track the elementary operations of Gaussian Elimination, we obtain the PLU decomposition of a $\mathbf{A}$. + +--- + +### Goal + +Transform a matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ into an **upper triangular matrix** (or REF), such that all entries below the pivot positions (leading entries in each row) are zero. + +--- + +### Steps of Gaussian Elimination + +1. **Identify the leftmost column** that contains a nonzero entry (the pivot column). +2. **Swap rows** (if necessary) so that the pivot entry is at the top of the current submatrix. +3. **Normalize** the pivot row so the pivot equals 1 (optional — standard Gaussian elimination doesn’t require this). +4. **Eliminate below the pivot**: Subtract suitable multiples of the pivot row from rows below to make all entries in the pivot column below the pivot zero. +5. **Move to the submatrix** that lies below and to the right, and repeat until the entire matrix is in row echelon form. + + +--- + + +### Example: Gaussian Elimination by Hand + +Start with: + +$$ +\mathbf{A} = +\begin{bmatrix} +2 & 4 & -2 \\ +4 & 9 & -3 \\ +-2 & -1 & 7 +\end{bmatrix} +$$ + +--- + +**Step 1: Normalize the first pivot** + +Pivot is $A_{11} = 2$. We'll eliminate below it. + +Eliminate row 2: + +$$ +R_2 \leftarrow R_2 - 2 \cdot R_1 +$$ + +$$ +\begin{bmatrix} +2 & 4 & -2 \\ +0 & 1 & 1 \\ +-2 & -1 & 7 +\end{bmatrix} +$$ + +Eliminate row 3: + +$$ +R_3 \leftarrow R_3 + R_1 +$$ + +$$ +\begin{bmatrix} +2 & 4 & -2 \\ +0 & 1 & 1 \\ +0 & 3 & 5 +\end{bmatrix} +$$ + +--- + +**Step 2: Eliminate below second pivot** + +Pivot at $A_{22} = 1$. Eliminate below. 
+ +$$ +R_3 \leftarrow R_3 - 3 \cdot R_2 +$$ + +$$ +\begin{bmatrix} +2 & 4 & -2 \\ +0 & 1 & 1 \\ +0 & 0 & 2 +\end{bmatrix} +$$ + +--- + +**Step 3: Normalize all pivots (optional, for RREF)** + +We can divide each pivot row to make the pivots equal to 1: + +$$ +R_1 \leftarrow \frac{1}{2} R_1, \quad +R_3 \leftarrow \frac{1}{2} R_3 +$$ + +$$ +\begin{bmatrix} +1 & 2 & -1 \\ +0 & 1 & 1 \\ +0 & 0 & 1 +\end{bmatrix} +$$ + +--- + +Final Result: Row Echelon Form (REF) + +$$ +\boxed{ +\begin{bmatrix} +1 & 2 & -1 \\ +0 & 1 & 1 \\ +0 & 0 & 1 +\end{bmatrix} +} +$$ + +This is the **upper triangular matrix** resulting from Gaussian elimination. + +## Solving Linear Systems via Gaussian Elimination + +To solve a linear system $\mathbf{A}\mathbf{x} = \mathbf{b}$ using **Gaussian elimination followed by back substitution**, it's not enough to row-reduce $\mathbf{A}$ alone — we must also apply the **same row operations** to the right-hand side vector $\mathbf{b}$. This gives a consistent system that we can solve in the transformed space. + +Gaussian elimination turns the system: + +$$ +\mathbf{A} \mathbf{x} = \mathbf{b} +$$ + +into an equivalent upper-triangular system: + +$$ +\mathbf{U} \mathbf{x} = \mathbf{c} +$$ + +where: + +* $\mathbf{U}$ is the row-reduced form of $\mathbf{A}$ (usually REF) +* $\mathbf{c}$ is the result of applying the **same row operations** to $\mathbf{b}$ + +Only with this consistent transformation can back substitution be applied. + +### Example Linear System + +Let’s extend the example to **solve a system** $\mathbf{A} \mathbf{x} = \mathbf{b}$ using **Gaussian elimination + back substitution**. + +Given: + +$$ +\mathbf{A} = +\begin{bmatrix} +2 & 4 & -2 \\ +4 & 9 & -3 \\ +-2 & -1 & 7 +\end{bmatrix}, \quad +\mathbf{b} = +\begin{bmatrix} +2 \\ +8 \\ +10 +\end{bmatrix} +$$ + +We want to solve: + +$$ +\mathbf{A} \mathbf{x} = \mathbf{b} +$$ + +--- + +**Step 1: Augmented Matrix** + +Form the augmented matrix $[\mathbf{A} \mid \mathbf{b}]$: + +$$ +\begin{bmatrix} +2 & 4 & -2 & \big| & 2 \\ +4 & 9 & -3 & \big| & 8 \\ +-2 & -1 & 7 & \big| & 10 +\end{bmatrix} +$$ + +--- + +**Step 2: Apply Gaussian Elimination** + +**Eliminate below pivot (row 1)** + +* $R_2 \leftarrow R_2 - 2 \cdot R_1$ +* $R_3 \leftarrow R_3 + R_1$ + +$$ +\begin{bmatrix} +2 & 4 & -2 & \big| & 2 \\ +0 & 1 & 1 & \big| & 4 \\ +0 & 3 & 5 & \big| & 12 +\end{bmatrix} +$$ + +**Eliminate below pivot (row 2)** + +* $R_3 \leftarrow R_3 - 3 \cdot R_2$ + +$$ +\begin{bmatrix} +2 & 4 & -2 & \big| & 2 \\ +0 & 1 & 1 & \big| & 4 \\ +0 & 0 & 2 & \big| & 0 +\end{bmatrix} +$$ + +**Normalize pivots (optional)** + +* $R_1 \leftarrow \frac{1}{2} R_1$ +* $R_3 \leftarrow \frac{1}{2} R_3$ + +$$ +\boxed{ +\begin{bmatrix} +1 & 2 & -1 & \big| & 1 \\ +0 & 1 & 1 & \big| & 4 \\ +0 & 0 & 1 & \big| & 0 +\end{bmatrix} +} +$$ + +This is the system in **row echelon form**. + +--- + +**Step 3: Back Substitution** + +Let the system be: + +$$ +\begin{aligned} +x_1 + 2x_2 - x_3 &= 1 \quad \text{(Row 1)} \\ +\quad\;\;\;\; x_2 + x_3 &= 4 \quad \text{(Row 2)} \\ +\quad\quad\quad\quad\; x_3 &= 0 \quad \text{(Row 3)} +\end{aligned} +$$ + +Back-substitute from bottom to top: + +1. $x_3 = 0$ +2. $x_2 + x_3 = 4 \Rightarrow x_2 = 4$ +3. 
$x_1 + 2x_2 - x_3 = 1 \Rightarrow x_1 + 8 = 1 \Rightarrow x_1 = -7$ + +--- + +Final Solution + +$$ +\boxed{ +\mathbf{x} = +\begin{bmatrix} +-7 \\ +4 \\ +0 +\end{bmatrix} +} +$$ + + + +--- + +## Interpretation of Solving Systems by Gaussian Elimination + +Think of it as row-reducing the **augmented matrix**: + +$$ +[\mathbf{A} \mid \mathbf{b}] \quad \longrightarrow \quad [\mathbf{U} \mid \mathbf{c}] +$$ + +You solve the simplified system $\mathbf{U} \mathbf{x} = \mathbf{c}$, not the original one. + +--- + +$$ +\boxed{ +\text{Gaussian elimination modifies both } \mathbf{A} \text{ and } \mathbf{b} \text{ together.} +} +$$ + + +Then you can solve $\mathbf{A} \mathbf{x} = \mathbf{b}$ via **back substitution**. + +--- + +## Back Substitution + +Once a matrix has been transformed into **row echelon form (REF)** using Gaussian elimination, we can solve a system of equations $\mathbf{A}\mathbf{x} = \mathbf{b}$ using **back substitution**. + +This method proceeds from the bottom row of the triangular system upward, solving for each variable one at a time. + +--- + +### General Idea + +Suppose, after Gaussian elimination, we have the augmented system: + +$$ +\begin{aligned} +x_3 &= c_3 \\ +x_2 + a_{23}x_3 &= c_2 \\ +x_1 + a_{12}x_2 + a_{13}x_3 &= c_1 +\end{aligned} +$$ + +We can compute the solution in reverse order: + +1. Solve for $x_3$ from the last equation. +2. Plug $x_3$ into the second equation and solve for $x_2$. +3. Plug $x_2$ and $x_3$ into the first equation to solve for $x_1$. + +--- + +### Back Substitution Example + +Let’s solve: + +$$ +\begin{bmatrix} +1 & 2 & -1 \\ +0 & 1 & 3 \\ +0 & 0 & 2 +\end{bmatrix} +\begin{bmatrix} +x_1 \\ x_2 \\ x_3 +\end{bmatrix} += +\begin{bmatrix} +5 \\ 4 \\ 6 +\end{bmatrix} +$$ + +We solve from the bottom up: + +* $x_3 = \frac{6}{2} = 3$ +* $x_2 + 3 \cdot 3 = 4 \Rightarrow x_2 = 4 - 9 = -5$ +* $x_1 + 2(-5) - 3 = 5 \Rightarrow x_1 = 5 + 10 + 3 = 18$ + +So the solution is: + +$$ +\boxed{ +\mathbf{x} = +\begin{bmatrix} +18 \\ +-5 \\ +3 +\end{bmatrix} +} +$$ + + +--- + +## Pivot Columns and Free Variables + +When we reduce a matrix to **row echelon form (REF)** or **reduced row echelon form (RREF)**, the position of **pivots** in the matrix gives us direct insight into the structure of the solution set of the system $\mathbf{A}\mathbf{x} = \mathbf{b}$. + +--- + +### ✅ Pivot Columns and Basic Variables + +* A **pivot** is the first nonzero entry in a row of REF or RREF. +* The **columns** of the original matrix $\mathbf{A}$ that contain pivots are called **pivot columns**. +* The variables corresponding to pivot columns are called **basic variables**. + + * These are the variables you solve for directly using back substitution. + +--- + +### 🆓 Non-Pivot Columns and Free Variables + +* The **columns that do not contain a pivot** are called **free columns**. +* The variables corresponding to these columns are called **free variables**. + + * They can take on arbitrary values. + * The values of basic variables depend on the free variables. 
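+
+In code, the pivot/free split can be read off directly from the RREF. The following minimal sketch (an illustration using NumPy only; `rref_with_pivots` is a helper defined here, not a library routine) row-reduces a small augmented matrix with one redundant equation and reports the pivot columns; the remaining columns correspond to free variables. The resulting RREF is exactly the one discussed in the example below.
+
+```python
+import numpy as np
+
+def rref_with_pivots(A, tol=1e-12):
+    """Return the reduced row echelon form of A and the list of pivot columns."""
+    R = A.astype(float).copy()
+    m, n = R.shape
+    pivot_cols = []
+    row = 0
+    for col in range(n):
+        if row >= m:
+            break
+        # pick the largest entry in this column (partial pivoting) for numerical stability
+        p = row + np.argmax(np.abs(R[row:, col]))
+        if np.abs(R[p, col]) < tol:
+            continue  # no pivot in this column -> free variable
+        R[[row, p]] = R[[p, row]]      # swap rows
+        R[row] = R[row] / R[row, col]  # scale so the pivot equals 1
+        for r in range(m):             # eliminate above and below the pivot
+            if r != row:
+                R[r] = R[r] - R[r, col] * R[row]
+        pivot_cols.append(col)
+        row += 1
+    return R, pivot_cols
+
+# Augmented matrix [A | b]; the third equation is the sum of the first two
+Ab = np.array([[1., 0.,  2., 3.],
+               [0., 1., -1., 1.],
+               [1., 1.,  1., 4.]])
+
+R, pivots = rref_with_pivots(Ab)
+print(R)
+print("pivot columns:", pivots)  # [0, 1] -> x1, x2 are basic; x3 is free
+```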
+ +--- + +### 🧠 Solution Structure + +The presence or absence of pivot positions determines the nature of the solution: + +| Situation | Interpretation | +| --------------------------------------------------------------------------------- | ---------------------------------------------- | +| Pivot in every column of $\mathbf{A}$ | **Unique solution** (if consistent) | +| Some columns with no pivot | **Infinitely many solutions** (free variables) | +| Inconsistent system (e.g., row of zeros in $\mathbf{A}$, nonzero in $\mathbf{b}$) | **No solution** | + +--- + +### 🔢 Example + +Suppose we reduce the augmented matrix to RREF: + +$$ +\left[ +\begin{array}{ccc|c} +1 & 0 & 2 & 3 \\ +0 & 1 & -1 & 1 \\ +0 & 0 & 0 & 0 +\end{array} +\right] +$$ + +* Pivots in columns 1 and 2 → $x_1$ and $x_2$ are **basic** +* No pivot in column 3 → $x_3$ is a **free variable** +* The solution has the form: + + $$ + \begin{aligned} + x_1 &= 3 - 2x_3 \\ + x_2 &= 1 + x_3 \\ + x_3 &\text{ is free} + \end{aligned} + $$ + +This system has **infinitely many solutions**, parameterized by $x_3$. + +--- + +### 🧩 Summary + +$$ +\boxed{ +\text{Pivot columns } \leftrightarrow \text{ basic variables}, \quad +\text{Non-pivot columns } \leftrightarrow \text{ free variables} +} +$$ + +Understanding this structure helps us: + +* Determine how many solutions exist +* Express the solution set explicitly +* Identify the degrees of freedom in underdetermined systems + +--- + + +## Row-Equivalent Matrices + +Two matrices $\mathbf{A}, \mathbf{B} \in \mathbb{R}^{m \times n}$ are called **row-equivalent** if one can be transformed into the other using a finite sequence of the **elementary row operations** defined above. + +--- + +### Notation + +We write: + +$$ +\mathbf{A} \sim \mathbf{B} +$$ + +to denote that $\mathbf{A}$ is **row-equivalent** to $\mathbf{B}$. + +--- + +### Intuition + +* Row-equivalence preserves the **solution set** of the linear system $\mathbf{A} \mathbf{x} = \mathbf{b}$. +* It **does not** change the **row space**, and hence **preserves the rank**. +* A matrix is row-equivalent to the **identity matrix** $\mathbf{I}$ if and only if it is **invertible** (for square matrices). + +--- + +### Summary + +$$ +\boxed{ +\mathbf{A} \sim \mathbf{B} \iff \text{you can get from one to the other by row operations} +} +$$ + +### Example +Here are the step-by-step row operations showing that the matrix + +$$ +\mathbf{A} = \begin{bmatrix} +2 & 1 & -1 \\ +-3 & -1 & 2 \\ +-2 & 1 & 2 +\end{bmatrix} +$$ + +is **row-equivalent** to the identity matrix $\mathbf{I}$. +This confirms that $\mathbf{A}$ is **invertible**. 
+ +Here are the steps of the row reduction process rendered with corresponding **elementary row operations** and resulting matrices: + +--- + +**Step 0**: Start with matrix $\mathbf{A}$ + +$$ +\mathbf{A} = +\begin{bmatrix} +2 & 1 & -1 \\ +-3 & -1 & 2 \\ +-2 & 1 & 2 +\end{bmatrix} +$$ + +--- + +**Step 1**: Normalize row 1 + +$R_1 \leftarrow \frac{1}{2} R_1$ + +$$ +\begin{bmatrix} +1 & 0.5 & -0.5 \\ +-3 & -1 & 2 \\ +-2 & 1 & 2 +\end{bmatrix} +$$ + +--- + +**Step 2**: Eliminate entries below pivot in column 1 + +$R_2 \leftarrow R_2 + 3 R_1$ +$R_3 \leftarrow R_3 + 2 R_1$ + +$$ +\begin{bmatrix} +1 & 0.5 & -0.5 \\ +0 & 0.5 & 0.5 \\ +0 & 2 & 1 +\end{bmatrix} +$$ + +--- + +**Step 3**: Normalize row 2 + +$R_2 \leftarrow \frac{1}{0.5} R_2 = 2 R_2$ + +$$ +\begin{bmatrix} +1 & 0.5 & -0.5 \\ +0 & 1 & 1 \\ +0 & 2 & 1 +\end{bmatrix} +$$ + +--- + +**Step 4**: Eliminate below pivot in column 2 + +$R_3 \leftarrow R_3 - 2 R_2$ + +$$ +\begin{bmatrix} +1 & 0.5 & -0.5 \\ +0 & 1 & 1 \\ +0 & 0 & -1 +\end{bmatrix} +$$ + +--- + +**Step 5**: Normalize row 3 + +$R_3 \leftarrow -1 \cdot R_3$ + +$$ +\begin{bmatrix} +1 & 0.5 & -0.5 \\ +0 & 1 & 1 \\ +0 & 0 & 1 +\end{bmatrix} +$$ + +--- + +We have row-reduced $\mathbf{A}$ to **row echelon form**, which is the identity matrix after further elimination above the pivots (not shown). + +Hence: + +$$ +\mathbf{A} \sim \mathbf{I} \quad \Rightarrow \quad \textbf{A is invertible.} +$$ + +--- + +## Matrix Inversion via Gaussian Elimination + +Gaussian elimination can not only be used to solve systems $\mathbf{A}\mathbf{x} = \mathbf{b}$, but also to **compute the inverse** of a square matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$, **if it exists**. + +This is done by augmenting $\mathbf{A}$ with the identity matrix $\mathbf{I}$, and applying row operations to reduce $\mathbf{A}$ to $\mathbf{I}$. + +The operations that convert $\mathbf{A}$ into $\mathbf{I}$ will simultaneously convert $\mathbf{I}$ into $\mathbf{A}^{-1}$. +This approach is a **constructive proof** of invertibility. + +--- + +### Procedure + +1. Form the augmented matrix $[\mathbf{A} \mid \mathbf{I}]$. +2. Apply **Gaussian elimination** to row-reduce the left side to the identity matrix. +3. If successful, the right side will become $\mathbf{A}^{-1}$. +4. If the left side cannot be reduced to identity (e.g. a zero row appears), $\mathbf{A}$ is **not invertible**. + +--- + +$$ +\boxed{ +[\mathbf{A} \mid \mathbf{I}] \longrightarrow [\mathbf{I} \mid \mathbf{A}^{-1}] \quad \text{via Gaussian elimination}. +} +$$ + +--- + +## PLU Decomposition + +Every square matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$ has a **PLU decomposition** (or **LU decomposition with partial pivoting**): + +$$ +\boxed{ +\mathbf{A} = \mathbf{P} \mathbf{L} \mathbf{U} +} +$$ + +* $\mathbf{P}^\top$ is a **permutation matrix** (a rearrangement of the identity matrix) that tracks the row swaps +* $\mathbf{L}$ is **lower triangular** with unit diagonal (contains the elimination multipliers). +* $\mathbf{U}$ is **upper triangular** (result of Gaussian elimination). + +As $\mathbf{P}$ is a permuation matrix $\mathbf{P}^{-1} = \mathbf{P}^{\top}$, we can alternatively write + +$$ +\mathbf{P}^\top \mathbf{A} = \mathbf{L} \mathbf{U} +$$ + +The PLU decomposition +* always exists for any square matrix. +* is used in stable numerical solvers. +* is efficient for solving systems and computing inverses. 
+

---

### Example PLU Decomposition

Let

$$
\mathbf{A} =
\begin{bmatrix}
0 & 2 \\
1 & 4
\end{bmatrix}
$$

To eliminate below the pivot, we need to **swap rows**, since $A_{11} = 0$.
The permutation matrix is:

$$
\mathbf{P} =
\begin{bmatrix}
0 & 1 \\
1 & 0
\end{bmatrix}, \quad
\mathbf{P}^\top \mathbf{A} =
\begin{bmatrix}
1 & 4 \\
0 & 2
\end{bmatrix}
$$

Then:

$$
\mathbf{L} =
\begin{bmatrix}
1 & 0 \\
0 & 1
\end{bmatrix}, \quad
\mathbf{U} =
\begin{bmatrix}
1 & 4 \\
0 & 2
\end{bmatrix}
$$

So:

$$
\mathbf{P}^\top \mathbf{A} = \mathbf{L} \mathbf{U}
$$

---

```{code-cell} ipython3
import numpy as np
from scipy.linalg import lu

# Example matrix
A = np.array([[0, 2, 1],
              [1, 1, 0],
              [2, 1, 1]], dtype=float)

# Perform PLU decomposition
P, L, U = lu(A)

# Since A = P L U, the two products below should agree
print("P.T @ A:\n", P.T @ A)
print("L @ U:\n", L @ U)
```

This uses **SciPy's `lu` function**, returning:

* $\mathbf{P}$: permutation matrix
* $\mathbf{L}$: unit lower triangular matrix
* $\mathbf{U}$: upper triangular matrix

such that $\mathbf{A} = \mathbf{P}\mathbf{L}\mathbf{U}$, or equivalently $\mathbf{P}^\top \mathbf{A} = \mathbf{L}\mathbf{U}$.

---

To solve a linear system $\mathbf{A} \mathbf{x} = \mathbf{b}$ **given the PLU decomposition** of $\mathbf{A}$, that is,

$$
\boxed{
\mathbf{P}^\top \mathbf{A} = \mathbf{L} \mathbf{U}
}
$$

we solve the system in **three steps**:

**1. Permute the right-hand side**

Multiply both sides by $\mathbf{P}^\top$ to align with the decomposition:

$$
\mathbf{P}^\top \mathbf{A} \mathbf{x} = \mathbf{P}^\top \mathbf{b}
\Rightarrow \mathbf{L} \mathbf{U} \mathbf{x} = \mathbf{P}^\top \mathbf{b}
$$

Let:

$$
\mathbf{c} = \mathbf{P}^\top \mathbf{b}
$$

---

**2. Forward substitution**

Solve for the intermediate vector $\mathbf{y}$ in:

$$
\mathbf{L} \mathbf{y} = \mathbf{c}
$$

Since $\mathbf{L}$ is lower triangular, this can be done **top-down**.

---

**3. Backward substitution**

Solve for the final solution $\mathbf{x}$ in:

$$
\mathbf{U} \mathbf{x} = \mathbf{y}
$$

Since $\mathbf{U}$ is upper triangular, this can be done **bottom-up**.

---

```{code-cell} ipython3
from scipy.linalg import solve_triangular

# Right-hand side of the equation
b = np.array([4, 2, 6], dtype=float)

# Step 1: permute b (note P.T, not P, since SciPy returns A = P L U)
c = P.T @ b

# Step 2: forward substitution, solve L y = P^T b
y = solve_triangular(L, c, lower=True)

# Step 3: backward substitution, solve U x = y
x = solve_triangular(U, y, lower=False)

print("Solution x:", x)
print("Check A @ x:", A @ x)
```

---

$$
\boxed{
\mathbf{A} \mathbf{x} = \mathbf{b} \quad \Rightarrow \quad
\mathbf{P}^\top \mathbf{A} \mathbf{x} = \mathbf{L} \mathbf{U} \mathbf{x} = \mathbf{P}^\top \mathbf{b}
}
$$

Solve via:

* $\mathbf{L} \mathbf{y} = \mathbf{P}^\top \mathbf{b}$ (forward substitution)
* $\mathbf{U} \mathbf{x} = \mathbf{y}$ (backward substitution)

This is numerically efficient and stable, especially for **repeated solves** with the same $\mathbf{A}$ but different right-hand sides $\mathbf{b}$.

## Summary: Why Gaussian Elimination and Matrix Forms Matter in Machine Learning

Understanding Gaussian elimination and matrix forms like **REF** and **RREF** is more than a theoretical exercise — it builds essential intuition and computational tools for many areas of **machine learning**.

Here’s how these concepts directly relate:

### Solving Linear Systems

Many machine learning algorithms boil down to solving systems of linear equations.
For example:

* In **linear regression**, the optimal weights minimize a quadratic loss and satisfy the **normal equations**, which are linear:

  $$
  (\mathbf{X}^\top \mathbf{X}) \mathbf{w} = \mathbf{X}^\top \mathbf{y}
  $$

  Gaussian elimination provides an efficient way to solve these equations, especially for small- to medium-scale problems.

---

### Understanding Rank and Feature Spaces

* The **rank** of a data matrix $\mathbf{X}$ tells us the number of **linearly independent features**.
* Low-rank matrices appear naturally in **dimensionality reduction** (e.g. PCA), **collaborative filtering**, and **matrix completion**.
* Detecting whether features are redundant, or whether a system is under- or overdetermined, comes down to understanding row operations and rank.

---

### Interpreting Model Structure

* Matrices in **RREF** reveal directly which variables (features) are **pivotal** to a system — a perspective that underlies **feature selection**, **interpretability**, and **symbolic regression**.
* Understanding when systems have a **unique solution**, **infinitely many solutions**, or **no solution** helps us reason about **well-posedness** and **overfitting** in models.

---

### Numerical Stability and Preconditioning

* Even when not applied directly, Gaussian elimination underpins many **numerical algorithms** (e.g., LU decomposition, QR factorization).
* These are used in optimization, iterative solvers, and deep learning libraries for computing gradients, inverses, and solutions of linear systems in a stable way.

---

### Big Picture

> While machine learning often uses **high-level abstractions** and **automatic solvers**, understanding how these methods work at the matrix level helps build **intuition**, **debugging skills**, and **mathematical fluency** for real-world modeling.

diff --git a/book/chapter_decompositions/square_matrices.md b/book/chapter_decompositions/square_matrices.md
new file mode 100644
index 0000000..b1a74d7
--- /dev/null
+++ b/book/chapter_decompositions/square_matrices.md
@@ -0,0 +1,82 @@
+# Fundamental Equivalences for Square Matrices
+The following statements form a **core set of if-and-only-if conditions** that characterize the invertibility of a square matrix. We list them formally and then prove that they are all equivalent.

---
:::{prf:theorem} Fundamental Equivalences
:label: trm-fundamental-equivalences
:nonumber:

Let $\mathbf{A} \in \mathbb{R}^{n \times n}$. The following statements are **equivalent** — that is, they are all true or all false together:

1. $\mathbf{A}$ is **invertible**
2. $\det(\mathbf{A}) \neq 0$
3. $\mathbf{A}$ is **full-rank**, i.e., $\operatorname{rank}(\mathbf{A}) = n$
4. The **columns** of $\mathbf{A}$ are **linearly independent**
5. The **rows** of $\mathbf{A}$ are **linearly independent**
6. $\mathbf{A}$ is **row-equivalent** to the identity matrix
7. The system $\mathbf{A}\mathbf{x} = \mathbf{b}$ has a **unique solution for every $\mathbf{b} \in \mathbb{R}^n$**

:::


:::{prf:proof}

We prove the chain of implications in a **circular fashion**, which implies that all of the statements are equivalent.
+ +--- + +**(1) ⇒ (2): Invertible ⇒ Determinant nonzero** + +If $\mathbf{A}^{-1}$ exists, then + +$$ +\det(\mathbf{A} \mathbf{A}^{-1}) = \det(\mathbf{I}) = 1 = \det(\mathbf{A}) \det(\mathbf{A}^{-1}) \Rightarrow \det(\mathbf{A}) \neq 0 +$$ + +--- + +**(2) ⇒ (3): $\det(\mathbf{A}) \neq 0$ ⇒ Full-rank** + +A square matrix has full rank $\iff$ its rows/columns span $\mathbb{R}^n$, and this happens exactly when $\det(\mathbf{A}) \neq 0$. + +If $\operatorname{rank}(\mathbf{A}) < n$, then one row or column is linearly dependent, making $\det(\mathbf{A}) = 0$. + +--- + +**(3) ⇒ (4) and (5): Full-rank ⇒ Linear independence of rows and columns** + +A matrix with rank $n$ must have linearly independent rows and columns by the definition of rank. + +--- + +**(4) ⇒ (6): Independent columns ⇒ Row-equivalent to identity** + +If the columns are linearly independent, Gaussian elimination can reduce $\mathbf{A}$ to the identity matrix $\mathbf{I}$ using row operations. + +This means $\mathbf{A}$ is row-equivalent to $\mathbf{I}$. + +--- + +**(6) ⇒ (7): Row-equivalent to $\mathbf{I}$ ⇒ Unique solution for all $\mathbf{b}$** + +If $\mathbf{A} \sim \mathbf{I}$, then solving $\mathbf{A} \mathbf{x} = \mathbf{b}$ is equivalent to solving $\mathbf{I} \mathbf{x} = \mathbf{b}'$, which always has the unique solution $\mathbf{x} = \mathbf{b}'$. + +--- + +**(7) ⇒ (1): Unique solution for all $\mathbf{b}$ ⇒ $\mathbf{A}$ is invertible** + +If $\mathbf{A} \mathbf{x} = \mathbf{b}$ has a **unique solution for all** $\mathbf{b}$, then the inverse mapping $\mathbf{b} \mapsto \mathbf{x}$ is well-defined and linear, so $\mathbf{A}^{-1}$ exists. + +--- + +**Conclusion** + +All these statements are **equivalent**: + +$$ +\boxed{ +\mathbf{A} \text{ invertible } \iff \det(\mathbf{A}) \neq 0 \iff \operatorname{rank}(\mathbf{A}) = n \iff \text{cols/rows lin. independent} \iff \mathbf{A} \sim \mathbf{I} \iff \text{unique solution for all } \mathbf{b} +} +$$ + +::: \ No newline at end of file diff --git a/drafts/chapter_decompositions/svd.md b/book/chapter_decompositions/svd.md similarity index 76% rename from drafts/chapter_decompositions/svd.md rename to book/chapter_decompositions/svd.md index d212eda..f1a4e6c 100644 --- a/drafts/chapter_decompositions/svd.md +++ b/book/chapter_decompositions/svd.md @@ -1,9 +1,12 @@ -## Singular value decomposition +# Singular value decomposition Singular value decomposition (SVD) is a widely applicable tool in linear algebra. Its strength stems partially from the fact that *every matrix* -$\mathbf{A} \in \mathbb{R}^{m \times n}$ has an SVD (even non-square -matrices)! The decomposition goes as follows: +$\mathbf{A}$ has an SVD (even non-square +matrices)! + + +The decomposition of $\mathbf{A}\in \mathbb{R}^{m \times n}$ goes as follows: $$\mathbf{A} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^{\!\top\!}$$ @@ -36,27 +39,36 @@ $\mathbf{A}^{\!\top\!}\mathbf{A}$, and the columns of $\mathbf{U}$ (the **left-singular vectors** of $\mathbf{A}$) are eigenvectors of $\mathbf{A}\mathbf{A}^{\!\top\!}$. -The matrices $\mathbf{\Sigma}^{\!\top\!}\mathbf{\Sigma}$ and +The matrices $\mathbf{\Sigma}^{\top}\mathbf{\Sigma}$ and $\mathbf{\Sigma}\mathbf{\Sigma}^{\!\top\!}$ are not necessarily the same size, but both are diagonal with the squared singular values $\sigma_i^2$ on the diagonal (plus possibly some zeros). Thus the singular values of $\mathbf{A}$ are the square roots of the eigenvalues of $\mathbf{A}^{\!\top\!}\mathbf{A}$ (or equivalently, of -$\mathbf{A}\mathbf{A}^{\!\top\!}$)[^5]. 
+$\mathbf{A}\mathbf{A}^{\!\top\!}$). ## Some useful matrix identities +In the following, we present a number of important identities for the SVD. + ### Matrix-vector product as linear combination of matrix columns -*Proposition.* +:::{prf:proposition} Matrix-vector product as linear combination of columns +:label: prop-matrix-vector-product +:nonumber: + Let $\mathbf{x} \in \mathbb{R}^n$ be a vector and $\mathbf{A} \in \mathbb{R}^{m \times n}$ a matrix with columns -$\mathbf{a}_1, \dots, \mathbf{a}_n$. Then +$\mathbf{a}_1, \dots, \mathbf{a}_n$. + +Then $$\mathbf{A}\mathbf{x} = \sum_{i=1}^n x_i\mathbf{a}_i$$ +::: This identity is extremely useful in understanding linear operators in -terms of their matrices' columns. The proof is very simple (consider +terms of their matrices' columns. +The proof is very simple (consider each element of $\mathbf{A}\mathbf{x}$ individually and expand by definitions) but it is a good exercise to convince yourself. @@ -74,7 +86,10 @@ immediately obvious, but the sum of outer products is actually equivalent to an appropriate matrix-matrix product! We formalize this statement as -*Proposition.* +:::{prf:proposition} +:label: prop-sum-outer-products +:nonumber: + Let $\mathbf{a}_1, \dots, \mathbf{a}_k \in \mathbb{R}^m$ and $\mathbf{b}_1, \dots, \mathbf{b}_k \in \mathbb{R}^n$. Then @@ -83,8 +98,11 @@ $$\sum_{\ell=1}^k \mathbf{a}_\ell\mathbf{b}_\ell^{\!\top\!} = \mathbf{A}\mathbf{ where $$\mathbf{A} = \begin{bmatrix}\mathbf{a}_1 & \cdots & \mathbf{a}_k\end{bmatrix}, \hspace{0.5cm} \mathbf{B} = \begin{bmatrix}\mathbf{b}_1 & \cdots & \mathbf{b}_k\end{bmatrix}$$ +::: + +:::{prf:proof} -*Proof.* For each $(i,j)$, we have +For each $(i,j)$, we have $$\left[\sum_{\ell=1}^k \mathbf{a}_\ell\mathbf{b}_\ell^{\!\top\!}\right]_{ij} = \sum_{\ell=1}^k [\mathbf{a}_\ell\mathbf{b}_\ell^{\!\top\!}]_{ij} = \sum_{\ell=1}^k [\mathbf{a}_\ell]_i[\mathbf{b}_\ell]_j = \sum_{\ell=1}^k A_{i\ell}B_{j\ell}$$ @@ -93,17 +111,17 @@ the $i$th row of $\mathbf{A}$ and the $j$th row of $\mathbf{B}$, or equivalently the $j$th column of $\mathbf{B}^{\!\top\!}$. Hence by the definition of matrix multiplication, it is equal to $[\mathbf{A}\mathbf{B}^{\!\top\!}]_{ij}$. ◻ +::: -### Quadratic forms +## Reduced SVD +The SVD we have presented is the **full SVD**. +However, in many +applications, we are only interested in the **reduced SVD**. This is +the SVD where we only keep the first $r$ columns of $\mathbf{U}$ and +the first $r$ columns of $\mathbf{V}$, where $r$ is the rank of +$\mathbf{A}$. The reduced SVD is given by: -Let $\mathbf{A} \in \mathbb{R}^{n \times n}$ be a symmetric matrix, and -recall that the expression $\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x}$ -is called a quadratic form of $\mathbf{A}$. It is in some cases helpful -to rewrite the quadratic form in terms of the individual elements that -make up $\mathbf{A}$ and $\mathbf{x}$: +$$\mathbf{A} = \mathbf{U}_r\mathbf{\Sigma}_r\mathbf{V}_r^{\!\top\!}$$ -$$\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} = \sum_{i=1}^n\sum_{j=1}^n A_{ij}x_ix_j$$ +where $\mathbf{U}_r \in \mathbb{R}^{m \times r}$, $\mathbf{\Sigma}_r \in \mathbb{R}^{r \times r}$, and $\mathbf{V}_r \in \mathbb{R}^{n \times r}$. -This identity is valid for any square matrix (need not be symmetric), -although quadratic forms are usually only discussed in the context of -symmetric matrices. 
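As a quick illustration (a sketch assuming NumPy; `numpy.linalg.svd` returns the full SVD by default), the following code computes the SVD of a rank-deficient matrix and checks that keeping only the first $r$ singular values and vectors, i.e. $\mathbf{U}_r$, the top-left $r \times r$ block of $\mathbf{\Sigma}$, and $\mathbf{V}_r$, already reconstructs $\mathbf{A}$:

```python
import numpy as np

# A rank-2 matrix in R^{4 x 3}: the third column is the sum of the first two
A = np.array([[1., 0., 1.],
              [0., 1., 1.],
              [1., 1., 2.],
              [2., 0., 2.]])

U, s, Vt = np.linalg.svd(A)        # full SVD: U is 4x4, s has length 3, Vt is 3x3
r = int(np.sum(s > 1e-10))         # numerical rank (here r = 2)

U_r  = U[:, :r]                    # first r left-singular vectors
S_r  = np.diag(s[:r])              # top-left r x r block of Sigma
Vt_r = Vt[:r, :]                   # first r right-singular vectors (transposed)

print("rank:", r)
print(np.allclose(A, U_r @ S_r @ Vt_r))   # True: the reduced SVD reproduces A
```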
diff --git a/book/chapter_decompositions/symmetric_matrices.md b/book/chapter_decompositions/symmetric_matrices.md
new file mode 100644
index 0000000..5115bd2
--- /dev/null
+++ b/book/chapter_decompositions/symmetric_matrices.md
@@ -0,0 +1,75 @@
---
jupytext:
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
    jupytext_version: 1.16.7
kernelspec:
  display_name: Python 3
  language: python
  name: python3
---
# Symmetric matrices

A matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$ is said to be **symmetric** if it is equal to its own transpose ($\mathbf{A} = \mathbf{A}^{\!\top\!}$), meaning that $A_{ij} = A_{ji}$ for all $(i,j)$.

This definition seems harmless but turns out to have some strong implications.

## Spectral Decomposition

:::{prf:theorem} Spectral Theorem
:label: trm-spectral-decomposition
:nonumber:

If $\mathbf{A} \in \mathbb{R}^{n \times n}$ is symmetric, then there exists an orthonormal basis for $\mathbb{R}^n$ consisting of eigenvectors of $\mathbf{A}$.
:::

The practical application of this theorem is a particular factorization of symmetric matrices, referred to as the **eigendecomposition** or **spectral decomposition**.

Denote the orthonormal basis of eigenvectors $\mathbf{q}_1, \dots, \mathbf{q}_n$ and their eigenvalues $\lambda_1, \dots, \lambda_n$.

Let $\mathbf{Q}$ be an orthogonal matrix with $\mathbf{q}_1, \dots, \mathbf{q}_n$ as its columns, and

$$\mathbf{\Lambda} = \operatorname{diag}(\lambda_1, \dots, \lambda_n).$$

Since by definition $\mathbf{A}\mathbf{q}_i = \lambda_i\mathbf{q}_i$ for every $i$, the following relationship holds:

$$\mathbf{A}\mathbf{Q} = \mathbf{Q}\mathbf{\Lambda}$$

Right-multiplying by $\mathbf{Q}^{\!\top\!}$, we arrive at the decomposition

$$\mathbf{A} = \mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^{\!\top\!}$$

### Quadratic forms

Let $\mathbf{A} \in \mathbb{R}^{n \times n}$ be a symmetric matrix. The expression $\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x}$ is called a **quadratic form** of $\mathbf{A}$.

It is in some cases helpful to rewrite the quadratic form in terms of the individual elements that make up $\mathbf{A}$ and $\mathbf{x}$:

$$\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} = \sum_{i=1}^n\sum_{j=1}^n A_{ij}x_ix_j$$

This identity is valid for any square matrix (it need not be symmetric), although quadratic forms are usually only discussed in the context of symmetric matrices.

We have seen quadratic forms in the context of quadratic optimization problems, where the goal was to minimize a quadratic form.
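A quick numerical check of the spectral decomposition (a minimal sketch assuming NumPy; `numpy.linalg.eigh` is the standard routine for symmetric matrices):

```{code-cell} ipython3
import numpy as np

# A symmetric matrix
A = np.array([[2., 1., 0.],
              [1., 3., 1.],
              [0., 1., 2.]])

lam, Q = np.linalg.eigh(A)   # eigenvalues in ascending order, orthonormal eigenvectors as columns

print(np.allclose(Q @ np.diag(lam) @ Q.T, A))  # A = Q Lambda Q^T
print(np.allclose(Q.T @ Q, np.eye(3)))         # Q is orthogonal

# The quadratic form x^T A x equals the elementwise double sum
x = np.array([1., -2., 0.5])
double_sum = sum(A[i, j] * x[i] * x[j] for i in range(3) for j in range(3))
print(np.isclose(x @ A @ x, double_sum))       # True
```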
\ No newline at end of file diff --git a/book/chapter_decompositions/trace.md b/book/chapter_decompositions/trace.md new file mode 100644 index 0000000..a3cef18 --- /dev/null +++ b/book/chapter_decompositions/trace.md @@ -0,0 +1,40 @@ +--- +jupytext: + text_representation: + extension: .md + format_name: myst + format_version: 0.13 + jupytext_version: 1.16.7 +kernelspec: + display_name: Python 3 + language: python + name: python3 +--- +# Trace + +The **trace** of a square matrix is the sum of its diagonal entries: + +$$\operatorname{tr}(\mathbf{A}) = \sum_{i=1}^n A_{ii}$$ + +The trace has several nice algebraic properties: + +(i) $\operatorname{tr}(\mathbf{A}+\mathbf{B}) = \operatorname{tr}(\mathbf{A}) + \operatorname{tr}(\mathbf{B})$ + +(ii) $\operatorname{tr}(\alpha\mathbf{A}) = \alpha\operatorname{tr}(\mathbf{A})$ + +(iii) $\operatorname{tr}(\mathbf{A}^{\!\top\!}) = \operatorname{tr}(\mathbf{A})$ + +(iv) $\operatorname{tr}(\mathbf{A}\mathbf{B}\mathbf{C}\mathbf{D}) = \operatorname{tr}(\mathbf{B}\mathbf{C}\mathbf{D}\mathbf{A}) = \operatorname{tr}(\mathbf{C}\mathbf{D}\mathbf{A}\mathbf{B}) = \operatorname{tr}(\mathbf{D}\mathbf{A}\mathbf{B}\mathbf{C})$ + +The first three properties follow readily from the definition. + +The last is known as **invariance under cyclic permutations**. + +Note that the matrices cannot be reordered arbitrarily, for example +$\operatorname{tr}(\mathbf{A}\mathbf{B}\mathbf{C}\mathbf{D}) \neq \operatorname{tr}(\mathbf{B}\mathbf{A}\mathbf{C}\mathbf{D})$ +in general. + +Also, there is nothing special about the product of four matrices -- analogous rules hold for more or fewer matrices. + + + diff --git a/drafts/chapter_convexity/overview_convexity.md b/drafts/chapter_convexity/overview_convexity.md deleted file mode 100644 index 42f5a7c..0000000 --- a/drafts/chapter_convexity/overview_convexity.md +++ /dev/null @@ -1,5 +0,0 @@ -# Convexity -Convexity is a property of a function that describes the shape of its graph. - -```{tableofcontents} -``` \ No newline at end of file diff --git a/drafts/chapter_decompositions/big_picture.md b/drafts/chapter_decompositions/big_picture.md deleted file mode 100644 index 3f01694..0000000 --- a/drafts/chapter_decompositions/big_picture.md +++ /dev/null @@ -1,12 +0,0 @@ -## The fundamental subspaces of a matrix - -[]: # -[]: # The fundamental subspaces of a matrix $A$ are the four subspaces associated with the matrix and its transpose. These subspaces are important in linear algebra and numerical analysis, particularly in the context of solving linear systems and eigenvalue problems. -[]: # -[]: # 1. **Column Space (Range) of A**: The column space of a matrix $A$ is the set of all possible linear combinations of its columns. It represents the span of the columns of $A$ and is denoted as $\text{Col}(A)$ or $\text{Range}(A)$. -[]: # -[]: # 2. **Null Space (Kernel) of A**: The null space of a matrix $A$ is the set of all vectors $\mathbf{x}$ such that $A\mathbf{x} = \mathbf{0}$. It represents the solutions to the homogeneous equation associated with $A$ and is denoted as $\text{Null}(A)$ or $\text{Ker}(A)$. -[]: # -[]: # 3. **Row Space of A**: The row space of a matrix $A$ is the set of all possible linear combinations of its rows. It is equivalent to the column space of its transpose, $A^\top$, and is denoted as $\text{Row}(A)$ or $\text{Col}(A^\top)$. -[]: # -[]: # 4. **Left Null Space (Kernel) of A**: The left null space of a matrix $A$ is the set of all vectors $\mathbf{y}$ such that $A^\top\mathbf{y} = \mathbf{0}$. 
It represents the solutions to the homogeneous equation associated with $A^\top$ and is denoted as $\text{Null}(A^\top)$ or $\text{Ker}(A^\top)$. diff --git a/drafts/chapter_decompositions/eigenvectors.md b/drafts/chapter_decompositions/eigenvectors.md deleted file mode 100644 index 8486ba8..0000000 --- a/drafts/chapter_decompositions/eigenvectors.md +++ /dev/null @@ -1,537 +0,0 @@ -## Eigenthings - -For a square matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$, there may -be vectors which, when $\mathbf{A}$ is applied to them, are simply -scaled by some constant. We say that a nonzero vector -$\mathbf{x} \in \mathbb{R}^n$ is an **eigenvector** of $\mathbf{A}$ -corresponding to **eigenvalue** $\lambda$ if - -$$\mathbf{A}\mathbf{x} = \lambda\mathbf{x}$$ - -The zero vector is excluded -from this definition because -$\mathbf{A}\mathbf{0} = \mathbf{0} = \lambda\mathbf{0}$ for every -$\lambda$. - -We now give some useful results about how eigenvalues change after -various manipulations. - - - -{prf:proposition} Eigenvalues and eigenvectors -:label: eigenvalues_eigenvectors_properties -Let $\mathbf{x}$ be an eigenvector of $\mathbf{A}$ with corresponding -eigenvalue $\lambda$. Then - -(i) For any $\gamma \in \mathbb{R}$, $\mathbf{x}$ is an eigenvector of - $\mathbf{A} + \gamma\mathbf{I}$ with eigenvalue $\lambda + \gamma$. - -(ii) If $\mathbf{A}$ is invertible, then $\mathbf{x}$ is an eigenvector - of $\mathbf{A}^{-1}$ with eigenvalue $\lambda^{-1}$. - -(iii) $\mathbf{A}^k\mathbf{x} = \lambda^k\mathbf{x}$ for any - $k \in \mathbb{Z}$ (where $\mathbf{A}^0 = \mathbf{I}$ by - definition). - - - -:::{prf:proof} -(i) follows readily: - -$$(\mathbf{A} + \gamma\mathbf{I})\mathbf{x} = \mathbf{A}\mathbf{x} + \gamma\mathbf{I}\mathbf{x} = \lambda\mathbf{x} + \gamma\mathbf{x} = (\lambda + \gamma)\mathbf{x}$$ - -(ii) Suppose $\mathbf{A}$ is invertible. Then - -$$\mathbf{x} = \mathbf{A}^{-1}\mathbf{A}\mathbf{x} = \mathbf{A}^{-1}(\lambda\mathbf{x}) = \lambda\mathbf{A}^{-1}\mathbf{x}$$ - -Dividing by $\lambda$, which is valid because the invertibility of -$\mathbf{A}$ implies $\lambda \neq 0$, gives -$\lambda^{-1}\mathbf{x} = \mathbf{A}^{-1}\mathbf{x}$. - -(iii) The case $k \geq 0$ follows immediately by induction on $k$. -Then the general case $k \in \mathbb{Z}$ follows by combining the -$k \geq 0$ case with (ii). ◻ -::: - -## Trace - -The **trace** of a square matrix is the sum of its diagonal entries: - -$$\operatorname{tr}(\mathbf{A}) = \sum_{i=1}^n A_{ii}$$ - -The trace has several nice -algebraic properties: - -(i) $\operatorname{tr}(\mathbf{A}+\mathbf{B}) = \operatorname{tr}(\mathbf{A}) + \operatorname{tr}(\mathbf{B})$ - -(ii) $\operatorname{tr}(\alpha\mathbf{A}) = \alpha\operatorname{tr}(\mathbf{A})$ - -(iii) $\operatorname{tr}(\mathbf{A}^{\!\top\!}) = \operatorname{tr}(\mathbf{A})$ - -(iv) $\operatorname{tr}(\mathbf{A}\mathbf{B}\mathbf{C}\mathbf{D}) = \operatorname{tr}(\mathbf{B}\mathbf{C}\mathbf{D}\mathbf{A}) = \operatorname{tr}(\mathbf{C}\mathbf{D}\mathbf{A}\mathbf{B}) = \operatorname{tr}(\mathbf{D}\mathbf{A}\mathbf{B}\mathbf{C})$ - -The first three properties follow readily from the definition. The last -is known as **invariance under cyclic permutations**. Note that the -matrices cannot be reordered arbitrarily, for example -$\operatorname{tr}(\mathbf{A}\mathbf{B}\mathbf{C}\mathbf{D}) \neq \operatorname{tr}(\mathbf{B}\mathbf{A}\mathbf{C}\mathbf{D})$ -in general. Also, there is nothing special about the product of four -matrices -- analogous rules hold for more or fewer matrices. 
- -Interestingly, the trace of a matrix is equal to the sum of its -eigenvalues (repeated according to multiplicity): - -$$\operatorname{tr}(\mathbf{A}) = \sum_i \lambda_i(\mathbf{A})$$ - -## Determinant - -The **determinant** of a square matrix can be defined in several -different confusing ways, none of which are particularly important for -our purposes; go look at an introductory linear algebra text (or -Wikipedia) if you need a definition. But it's good to know the -properties: - -(i) $\det(\mathbf{I}) = 1$ - -(ii) $\det(\mathbf{A}^{\!\top\!}) = \det(\mathbf{A})$ - -(iii) $\det(\mathbf{A}\mathbf{B}) = \det(\mathbf{A})\det(\mathbf{B})$ - -(iv) $\det(\mathbf{A}^{-1}) = \det(\mathbf{A})^{-1}$ - -(v) $\det(\alpha\mathbf{A}) = \alpha^n \det(\mathbf{A})$ - -Interestingly, the determinant of a matrix is equal to the product of -its eigenvalues (repeated according to multiplicity): - -$$\det(\mathbf{A}) = \prod_i \lambda_i(\mathbf{A})$$ - -## Orthogonal matrices - -A matrix $\mathbf{Q} \in \mathbb{R}^{n \times n}$ is said to be -**orthogonal** if its columns are pairwise orthonormal. This definition -implies that - -$$\mathbf{Q}^{\!\top\!} \mathbf{Q} = \mathbf{Q}\mathbf{Q}^{\!\top\!} = \mathbf{I}$$ - -or equivalently, $\mathbf{Q}^{\!\top\!} = \mathbf{Q}^{-1}$. A nice thing -about orthogonal matrices is that they preserve inner products: - -$$(\mathbf{Q}\mathbf{x})^{\!\top\!}(\mathbf{Q}\mathbf{y}) = \mathbf{x}^{\!\top\!} \mathbf{Q}^{\!\top\!} \mathbf{Q}\mathbf{y} = \mathbf{x}^{\!\top\!} \mathbf{I}\mathbf{y} = \mathbf{x}^{\!\top\!}\mathbf{y}$$ - -A direct result of this fact is that they also preserve 2-norms: - -$$\|\mathbf{Q}\mathbf{x}\|_2 = \sqrt{(\mathbf{Q}\mathbf{x})^{\!\top\!}(\mathbf{Q}\mathbf{x})} = \sqrt{\mathbf{x}^{\!\top\!}\mathbf{x}} = \|\mathbf{x}\|_2$$ - -Therefore multiplication by an orthogonal matrix can be considered as a -transformation that preserves length, but may rotate or reflect the -vector about the origin. - -## Symmetric matrices - -A matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$ is said to be -**symmetric** if it is equal to its own transpose -($\mathbf{A} = \mathbf{A}^{\!\top\!}$), meaning that $A_{ij} = A_{ji}$ -for all $(i,j)$. This definition seems harmless enough but turns out to -have some strong implications. We summarize the most important of these -as - -*Theorem.* -(Spectral Theorem) If $\mathbf{A} \in \mathbb{R}^{n \times n}$ is -symmetric, then there exists an orthonormal basis for $\mathbb{R}^n$ -consisting of eigenvectors of $\mathbf{A}$. - -The practical application of this theorem is a particular factorization -of symmetric matrices, referred to as the **eigendecomposition** or -**spectral decomposition**. Denote the orthonormal basis of eigenvectors -$\mathbf{q}_1, \dots, \mathbf{q}_n$ and their eigenvalues -$\lambda_1, \dots, \lambda_n$. Let $\mathbf{Q}$ be an orthogonal matrix -with $\mathbf{q}_1, \dots, \mathbf{q}_n$ as its columns, and -$\mathbf{\Lambda} = \operatorname{diag}(\lambda_1, \dots, \lambda_n)$. Since by -definition $\mathbf{A}\mathbf{q}_i = \lambda_i\mathbf{q}_i$ for every -$i$, the following relationship holds: - -$$\mathbf{A}\mathbf{Q} = \mathbf{Q}\mathbf{\Lambda}$$ - -Right-multiplying -by $\mathbf{Q}^{\!\top\!}$, we arrive at the decomposition - -$$\mathbf{A} = \mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^{\!\top\!}$$ - -### Rayleigh quotients - -Let $\mathbf{A} \in \mathbb{R}^{n \times n}$ be a symmetric matrix. The -expression $\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x}$ is called a -**quadratic form**. 
- -There turns out to be an interesting connection between the quadratic -form of a symmetric matrix and its eigenvalues. This connection is -provided by the **Rayleigh quotient** - -$$R_\mathbf{A}(\mathbf{x}) = \frac{\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x}}{\mathbf{x}^{\!\top\!}\mathbf{x}}$$ - -The Rayleigh quotient has a couple of important properties which the -reader can (and should!) easily verify from the definition: - -(i) **Scale invariance**: for any vector $\mathbf{x} \neq \mathbf{0}$ - and any scalar $\alpha \neq 0$, - $R_\mathbf{A}(\mathbf{x}) = R_\mathbf{A}(\alpha\mathbf{x})$. - -(ii) If $\mathbf{x}$ is an eigenvector of $\mathbf{A}$ with eigenvalue - $\lambda$, then $R_\mathbf{A}(\mathbf{x}) = \lambda$. - -We can further show that the Rayleigh quotient is bounded by the largest -and smallest eigenvalues of $\mathbf{A}$. But first we will show a -useful special case of the final result. - -{prf:proposition} Rayleigh quotient bounds -:label: rayleigh_quotient_bounds -For any $\mathbf{x}$ such that $\|\mathbf{x}\|_2 = 1$, - -$$\lambda_{\min}(\mathbf{A}) \leq \mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} \leq \lambda_{\max}(\mathbf{A})$$ - -with equality if and only if $\mathbf{x}$ is a corresponding -eigenvector. - - -:::{prf:proof} -*Proof.* We show only the $\max$ case because the argument for the -$\min$ case is entirely analogous. - -Since $\mathbf{A}$ is symmetric, we can decompose it as -$\mathbf{A} = \mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^{\!\top\!}$. Then use -the change of variable $\mathbf{y} = \mathbf{Q}^{\!\top\!}\mathbf{x}$, -noting that the relationship between $\mathbf{x}$ and $\mathbf{y}$ is -one-to-one and that $\|\mathbf{y}\|_2 = 1$ since $\mathbf{Q}$ is -orthogonal. Hence - -$$\max_{\|\mathbf{x}\|_2 = 1} \mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} = \max_{\|\mathbf{y}\|_2 = 1} \mathbf{y}^{\!\top\!}\mathbf{\Lambda}\mathbf{y} = \max_{y_1^2+\dots+y_n^2=1} \sum_{i=1}^n \lambda_i y_i^2$$ - -Written this way, it is clear that $\mathbf{y}$ maximizes this -expression exactly if and only if it satisfies -$\sum_{i \in I} y_i^2 = 1$ where -$I = \{i : \lambda_i = \max_{j=1,\dots,n} \lambda_j = \lambda_{\max}(\mathbf{A})\}$ -and $y_j = 0$ for $j \not\in I$. That is, $I$ contains the index or -indices of the largest eigenvalue. In this case, the maximal value of -the expression is - -$$\sum_{i=1}^n \lambda_i y_i^2 = \sum_{i \in I} \lambda_i y_i^2 = \lambda_{\max}(\mathbf{A}) \sum_{i \in I} y_i^2 = \lambda_{\max}(\mathbf{A})$$ - -Then writing $\mathbf{q}_1, \dots, \mathbf{q}_n$ for the columns of -$\mathbf{Q}$, we have - -$$\mathbf{x} = \mathbf{Q}\mathbf{Q}^{\!\top\!}\mathbf{x} = \mathbf{Q}\mathbf{y} = \sum_{i=1}^n y_i\mathbf{q}_i = \sum_{i \in I} y_i\mathbf{q}_i$$ - -where we have used the matrix-vector product identity. - -Recall that $\mathbf{q}_1, \dots, \mathbf{q}_n$ are eigenvectors of -$\mathbf{A}$ and form an orthonormal basis for $\mathbb{R}^n$. Therefore -by construction, the set $\{\mathbf{q}_i : i \in I\}$ forms an -orthonormal basis for the eigenspace of $\lambda_{\max}(\mathbf{A})$. -Hence $\mathbf{x}$, which is a linear combination of these, lies in that -eigenspace and thus is an eigenvector of $\mathbf{A}$ corresponding to -$\lambda_{\max}(\mathbf{A})$. - -We have shown that -$\max_{\|\mathbf{x}\|_2 = 1} \mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} = \lambda_{\max}(\mathbf{A})$, -from which we have the general inequality -$\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} \leq \lambda_{\max}(\mathbf{A})$ -for all unit-length $\mathbf{x}$. 
◻ -::: - -By the scale invariance of the Rayleigh quotient, we immediately have as -a corollary (since -$\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} = R_{\mathbf{A}}(\mathbf{x})$ -for unit $\mathbf{x}$) - -{prf:theorem} Min-max theorem -*Theorem.* For all $\mathbf{x} \neq \mathbf{0}$, - -$$\lambda_{\min}(\mathbf{A}) \leq R_\mathbf{A}(\mathbf{x}) \leq \lambda_{\max}(\mathbf{A})$$ - -with equality if and only if $\mathbf{x}$ is a corresponding -eigenvector. - - -## Positive (semi-)definite matrices - -A symmetric matrix $\mathbf{A}$ is **positive semi-definite** if for all -$\mathbf{x} \in \mathbb{R}^n$, -$\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} \geq 0$. Sometimes people -write $\mathbf{A} \succeq 0$ to indicate that $\mathbf{A}$ is positive -semi-definite. - -A symmetric matrix $\mathbf{A}$ is **positive definite** if for all -nonzero $\mathbf{x} \in \mathbb{R}^n$, -$\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} > 0$. Sometimes people write -$\mathbf{A} \succ 0$ to indicate that $\mathbf{A}$ is positive definite. -Note that positive definiteness is a strictly stronger property than -positive semi-definiteness, in the sense that every positive definite -matrix is positive semi-definite but not vice-versa. - -These properties are related to eigenvalues in the following way. - -*Proposition.* -A symmetric matrix is positive semi-definite if and only if all of its -eigenvalues are nonnegative, and positive definite if and only if all of -its eigenvalues are positive. - - -*Proof.* Suppose $A$ is positive semi-definite, and let $\mathbf{x}$ be -an eigenvector of $\mathbf{A}$ with eigenvalue $\lambda$. Then - -$$0 \leq \mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} = \mathbf{x}^{\!\top\!}(\lambda\mathbf{x}) = \lambda\mathbf{x}^{\!\top\!}\mathbf{x} = \lambda\|\mathbf{x}\|_2^2$$ - -Since $\mathbf{x} \neq \mathbf{0}$ (by the assumption that it is an -eigenvector), we have $\|\mathbf{x}\|_2^2 > 0$, so we can divide both -sides by $\|\mathbf{x}\|_2^2$ to arrive at $\lambda \geq 0$. If -$\mathbf{A}$ is positive definite, the inequality above holds strictly, -so $\lambda > 0$. This proves one direction. - -To simplify the proof of the other direction, we will use the machinery -of Rayleigh quotients. Suppose that $\mathbf{A}$ is symmetric and all -its eigenvalues are nonnegative. Then for all -$\mathbf{x} \neq \mathbf{0}$, - -$$0 \leq \lambda_{\min}(\mathbf{A}) \leq R_\mathbf{A}(\mathbf{x})$$ - -Since $\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x}$ matches -$R_\mathbf{A}(\mathbf{x})$ in sign, we conclude that $\mathbf{A}$ is -positive semi-definite. If the eigenvalues of $\mathbf{A}$ are all -strictly positive, then $0 < \lambda_{\min}(\mathbf{A})$, whence it -follows that $\mathbf{A}$ is positive definite. ◻ - - -As an example of how these matrices arise, consider - -*Proposition.* -Suppose $\mathbf{A} \in \mathbb{R}^{m \times n}$. Then -$\mathbf{A}^{\!\top\!}\mathbf{A}$ is positive semi-definite. If -$\operatorname{null}(\mathbf{A}) = \{\mathbf{0}\}$, then -$\mathbf{A}^{\!\top\!}\mathbf{A}$ is positive definite. - - -*Proof.* For any $\mathbf{x} \in \mathbb{R}^n$, - -$$\mathbf{x}^{\!\top\!} (\mathbf{A}^{\!\top\!}\mathbf{A})\mathbf{x} = (\mathbf{A}\mathbf{x})^{\!\top\!}(\mathbf{A}\mathbf{x}) = \|\mathbf{A}\mathbf{x}\|_2^2 \geq 0$$ - -so $\mathbf{A}^{\!\top\!}\mathbf{A}$ is positive semi-definite. - -Note that $\|\mathbf{A}\mathbf{x}\|_2^2 = 0$ implies -$\|\mathbf{A}\mathbf{x}\|_2 = 0$, which in turn implies -$\mathbf{A}\mathbf{x} = \mathbf{0}$ (recall that this is a property of -norms). 
If $\operatorname{null}(\mathbf{A}) = \{\mathbf{0}\}$, -$\mathbf{A}\mathbf{x} = \mathbf{0}$ implies $\mathbf{x} = \mathbf{0}$, -so -$\mathbf{x}^{\!\top\!} (\mathbf{A}^{\!\top\!}\mathbf{A})\mathbf{x} = 0$ -if and only if $\mathbf{x} = \mathbf{0}$, and thus -$\mathbf{A}^{\!\top\!}\mathbf{A}$ is positive definite. ◻ - -Positive definite matrices are invertible (since their eigenvalues are -nonzero), whereas positive semi-definite matrices might not be. However, -if you already have a positive semi-definite matrix, it is possible to -perturb its diagonal slightly to produce a positive definite matrix. - -*Proposition.* -If $\mathbf{A}$ is positive semi-definite and $\epsilon > 0$, then -$\mathbf{A} + \epsilon\mathbf{I}$ is positive definite. - -*Proof.* Assuming $\mathbf{A}$ is positive semi-definite and -$\epsilon > 0$, we have for any $\mathbf{x} \neq \mathbf{0}$ that - -$$\mathbf{x}^{\!\top\!}(\mathbf{A}+\epsilon\mathbf{I})\mathbf{x} = \mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} + \epsilon\mathbf{x}^{\!\top\!}\mathbf{I}\mathbf{x} = \underbrace{\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x}}_{\geq 0} + \underbrace{\epsilon\|\mathbf{x}\|_2^2}_{> 0} > 0$$ - -as claimed. ◻ - -An obvious but frequently useful consequence of the two propositions we -have just shown is that -$\mathbf{A}^{\!\top\!}\mathbf{A} + \epsilon\mathbf{I}$ is positive -definite (and in particular, invertible) for *any* matrix $\mathbf{A}$ -and any $\epsilon > 0$. - -### The geometry of positive definite quadratic forms - -A useful way to understand quadratic forms is by the geometry of their -level sets. A **level set** or **isocontour** of a function is the set -of all inputs such that the function applied to those inputs yields a -given output. Mathematically, the $c$-isocontour of $f$ is -$\{\mathbf{x} \in \operatorname{dom} f : f(\mathbf{x}) = c\}$. - -Let us consider the special case -$f(\mathbf{x}) = \mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x}$ where -$\mathbf{A}$ is a positive definite matrix. Since $\mathbf{A}$ is -positive definite, it has a unique matrix square root -$\mathbf{A}^{\frac{1}{2}} = \mathbf{Q}\mathbf{\Lambda}^{\frac{1}{2}}\mathbf{Q}^{\!\top\!}$, -where $\mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^{\!\top\!}$ is the -eigendecomposition of $\mathbf{A}$ and -$\mathbf{\Lambda}^{\frac{1}{2}} = \operatorname{diag}(\sqrt{\lambda_1}, \dots \sqrt{\lambda_n})$. -It is easy to see that this matrix $\mathbf{A}^{\frac{1}{2}}$ is -positive definite (consider its eigenvalues) and satisfies -$\mathbf{A}^{\frac{1}{2}}\mathbf{A}^{\frac{1}{2}} = \mathbf{A}$. Fixing -a value $c \geq 0$, the $c$-isocontour of $f$ is the set of -$\mathbf{x} \in \mathbb{R}^n$ such that - -$$c = \mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} = \mathbf{x}^{\!\top\!}\mathbf{A}^{\frac{1}{2}}\mathbf{A}^{\frac{1}{2}}\mathbf{x} = \|\mathbf{A}^{\frac{1}{2}}\mathbf{x}\|_2^2$$ - -where we have used the symmetry of $\mathbf{A}^{\frac{1}{2}}$. Making -the change of variable -$\mathbf{z} = \mathbf{A}^{\frac{1}{2}}\mathbf{x}$, we have the condition -$\|\mathbf{z}\|_2 = \sqrt{c}$. That is, the values $\mathbf{z}$ lie on a -sphere of radius $\sqrt{c}$. These can be parameterized as -$\mathbf{z} = \sqrt{c}\hat{\mathbf{z}}$ where $\hat{\mathbf{z}}$ has -$\|\hat{\mathbf{z}}\|_2 = 1$. 
Then since -$\mathbf{A}^{-\frac{1}{2}} = \mathbf{Q}\mathbf{\Lambda}^{-\frac{1}{2}}\mathbf{Q}^{\!\top\!}$, -we have - -$$\mathbf{x} = \mathbf{A}^{-\frac{1}{2}}\mathbf{z} = \mathbf{Q}\mathbf{\Lambda}^{-\frac{1}{2}}\mathbf{Q}^{\!\top\!}\sqrt{c}\hat{\mathbf{z}} = \sqrt{c}\mathbf{Q}\mathbf{\Lambda}^{-\frac{1}{2}}\tilde{\mathbf{z}}$$ - -where $\tilde{\mathbf{z}} = \mathbf{Q}^{\!\top\!}\hat{\mathbf{z}}$ also -satisfies $\|\tilde{\mathbf{z}}\|_2 = 1$ since $\mathbf{Q}$ is -orthogonal. Using this parameterization, we see that the solution set -$\{\mathbf{x} \in \mathbb{R}^n : f(\mathbf{x}) = c\}$ is the image of -the unit sphere -$\{\tilde{\mathbf{z}} \in \mathbb{R}^n : \|\tilde{\mathbf{z}}\|_2 = 1\}$ -under the invertible linear map -$\mathbf{x} = \sqrt{c}\mathbf{Q}\mathbf{\Lambda}^{-\frac{1}{2}}\tilde{\mathbf{z}}$. - -What we have gained with all these manipulations is a clear algebraic -understanding of the $c$-isocontour of $f$ in terms of a sequence of -linear transformations applied to a well-understood set. We begin with -the unit sphere, then scale every axis $i$ by -$\lambda_i^{-\frac{1}{2}}$, resulting in an axis-aligned ellipsoid. -Observe that the axis lengths of the ellipsoid are proportional to the -inverse square roots of the eigenvalues of $\mathbf{A}$. Hence larger -eigenvalues correspond to shorter axis lengths, and vice-versa. - -Then this axis-aligned ellipsoid undergoes a rigid transformation (i.e. -one that preserves length and angles, such as a rotation/reflection) -given by $\mathbf{Q}$. The result of this transformation is that the -axes of the ellipse are no longer along the coordinate axes in general, -but rather along the directions given by the corresponding eigenvectors. -To see this, consider the unit vector $\mathbf{e}_i \in \mathbb{R}^n$ -that has $[\mathbf{e}_i]_j = \delta_{ij}$. In the pre-transformed space, -this vector points along the axis with length proportional to -$\lambda_i^{-\frac{1}{2}}$. But after applying the rigid transformation -$\mathbf{Q}$, the resulting vector points in the direction of the -corresponding eigenvector $\mathbf{q}_i$, since - -$$\mathbf{Q}\mathbf{e}_i = \sum_{j=1}^n [\mathbf{e}_i]_j\mathbf{q}_j = \mathbf{q}_i$$ - -where we have used the matrix-vector product identity from earlier. - -In summary: the isocontours of -$f(\mathbf{x}) = \mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x}$ are -ellipsoids such that the axes point in the directions of the -eigenvectors of $\mathbf{A}$, and the radii of these axes are -proportional to the inverse square roots of the corresponding -eigenvalues. - -## Singular value decomposition - -Singular value decomposition (SVD) is a widely applicable tool in linear -algebra. Its strength stems partially from the fact that *every matrix* -$\mathbf{A} \in \mathbb{R}^{m \times n}$ has an SVD (even non-square -matrices)! The decomposition goes as follows: - -$$\mathbf{A} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^{\!\top\!}$$ - -where -$\mathbf{U} \in \mathbb{R}^{m \times m}$ and -$\mathbf{V} \in \mathbb{R}^{n \times n}$ are orthogonal matrices and -$\mathbf{\Sigma} \in \mathbb{R}^{m \times n}$ is a diagonal matrix with -the **singular values** of $\mathbf{A}$ (denoted $\sigma_i$) on its -diagonal. - -By convention, the singular values are given in non-increasing order, -i.e. - -$$\sigma_1 \geq \sigma_2 \geq \dots \geq \sigma_{\min(m,n)} \geq 0$$ - -Only the first $r$ singular values are nonzero, where $r$ is the rank of -$\mathbf{A}$. 
- -Observe that the SVD factors provide eigendecompositions for -$\mathbf{A}^{\!\top\!}\mathbf{A}$ and $\mathbf{A}\mathbf{A}^{\!\top\!}$: - -$$\begin{aligned} -\mathbf{A}^{\!\top\!}\mathbf{A} &= (\mathbf{U}\mathbf{\Sigma}\mathbf{V}^{\!\top\!})^{\!\top\!}\mathbf{U}\mathbf{\Sigma}\mathbf{V}^{\!\top\!} = \mathbf{V}\mathbf{\Sigma}^{\!\top\!}\mathbf{U}^{\!\top\!}\mathbf{U}\mathbf{\Sigma}\mathbf{V}^{\!\top\!} = \mathbf{V}\mathbf{\Sigma}^{\!\top\!}\mathbf{\Sigma}\mathbf{V}^{\!\top\!} \\ -\mathbf{A}\mathbf{A}^{\!\top\!} &= \mathbf{U}\mathbf{\Sigma}\mathbf{V}^{\!\top\!}(\mathbf{U}\mathbf{\Sigma}\mathbf{V}^{\!\top\!})^{\!\top\!} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^{\!\top\!}\mathbf{V}\mathbf{\Sigma}^{\!\top\!}\mathbf{U}^{\!\top\!} = \mathbf{U}\mathbf{\Sigma}\mathbf{\Sigma}^{\!\top\!}\mathbf{U}^{\!\top\!} -\end{aligned}$$ - -It follows immediately that the columns of $\mathbf{V}$ -(the **right-singular vectors** of $\mathbf{A}$) are eigenvectors of -$\mathbf{A}^{\!\top\!}\mathbf{A}$, and the columns of $\mathbf{U}$ (the -**left-singular vectors** of $\mathbf{A}$) are eigenvectors of -$\mathbf{A}\mathbf{A}^{\!\top\!}$. - -The matrices $\mathbf{\Sigma}^{\!\top\!}\mathbf{\Sigma}$ and -$\mathbf{\Sigma}\mathbf{\Sigma}^{\!\top\!}$ are not necessarily the same -size, but both are diagonal with the squared singular values -$\sigma_i^2$ on the diagonal (plus possibly some zeros). Thus the -singular values of $\mathbf{A}$ are the square roots of the eigenvalues -of $\mathbf{A}^{\!\top\!}\mathbf{A}$ (or equivalently, of -$\mathbf{A}\mathbf{A}^{\!\top\!}$)[^5]. - -## Some useful matrix identities - -### Matrix-vector product as linear combination of matrix columns - -*Proposition.* -Let $\mathbf{x} \in \mathbb{R}^n$ be a vector and -$\mathbf{A} \in \mathbb{R}^{m \times n}$ a matrix with columns -$\mathbf{a}_1, \dots, \mathbf{a}_n$. Then - -$$\mathbf{A}\mathbf{x} = \sum_{i=1}^n x_i\mathbf{a}_i$$ - -This identity is extremely useful in understanding linear operators in -terms of their matrices' columns. The proof is very simple (consider -each element of $\mathbf{A}\mathbf{x}$ individually and expand by -definitions) but it is a good exercise to convince yourself. - -### Sum of outer products as matrix-matrix product - -An **outer product** is an expression of the form -$\mathbf{a}\mathbf{b}^{\!\top\!}$, where $\mathbf{a} \in \mathbb{R}^m$ -and $\mathbf{b} \in \mathbb{R}^n$. By inspection it is not hard to see -that such an expression yields an $m \times n$ matrix such that - -$$[\mathbf{a}\mathbf{b}^{\!\top\!}]_{ij} = a_ib_j$$ - -It is not -immediately obvious, but the sum of outer products is actually -equivalent to an appropriate matrix-matrix product! We formalize this -statement as - -*Proposition.* -Let $\mathbf{a}_1, \dots, \mathbf{a}_k \in \mathbb{R}^m$ and -$\mathbf{b}_1, \dots, \mathbf{b}_k \in \mathbb{R}^n$. 
Then - -$$\sum_{\ell=1}^k \mathbf{a}_\ell\mathbf{b}_\ell^{\!\top\!} = \mathbf{A}\mathbf{B}^{\!\top\!}$$ - -where - -$$\mathbf{A} = \begin{bmatrix}\mathbf{a}_1 & \cdots & \mathbf{a}_k\end{bmatrix}, \hspace{0.5cm} \mathbf{B} = \begin{bmatrix}\mathbf{b}_1 & \cdots & \mathbf{b}_k\end{bmatrix}$$ - -*Proof.* For each $(i,j)$, we have - -$$\left[\sum_{\ell=1}^k \mathbf{a}_\ell\mathbf{b}_\ell^{\!\top\!}\right]_{ij} = \sum_{\ell=1}^k [\mathbf{a}_\ell\mathbf{b}_\ell^{\!\top\!}]_{ij} = \sum_{\ell=1}^k [\mathbf{a}_\ell]_i[\mathbf{b}_\ell]_j = \sum_{\ell=1}^k A_{i\ell}B_{j\ell}$$ - -This last expression should be recognized as an inner product between -the $i$th row of $\mathbf{A}$ and the $j$th row of $\mathbf{B}$, or -equivalently the $j$th column of $\mathbf{B}^{\!\top\!}$. Hence by the -definition of matrix multiplication, it is equal to -$[\mathbf{A}\mathbf{B}^{\!\top\!}]_{ij}$. ◻ - -### Quadratic forms - -Let $\mathbf{A} \in \mathbb{R}^{n \times n}$ be a symmetric matrix, and -recall that the expression $\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x}$ -is called a quadratic form of $\mathbf{A}$. It is in some cases helpful -to rewrite the quadratic form in terms of the individual elements that -make up $\mathbf{A}$ and $\mathbf{x}$: - -$$\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} = \sum_{i=1}^n\sum_{j=1}^n A_{ij}x_ix_j$$ - -This identity is valid for any square matrix (need not be symmetric), -although quadratic forms are usually only discussed in the context of -symmetric matrices. - diff --git a/drafts/chapter_decompositions/matrix_norms.md b/drafts/chapter_decompositions/matrix_norms.md deleted file mode 100644 index cff10a6..0000000 --- a/drafts/chapter_decompositions/matrix_norms.md +++ /dev/null @@ -1,44 +0,0 @@ -## Matrix Norms -[]: # -[]: # for metric in metrics: -[]: # clf = FlexibleNearestCentroidClassifier(metric=metric) -[]: # clf.fit(X_train, y_train) -[]: # predictions = clf.predict(X_test) -[]: # print(f"Accuracy with {metric} metric: {accuracy_score(y_test, predictions)}") -[]: # ``` -[]: # -[]: # --- -[]: # -[]: # ### Conclusion -[]: # -[]: # The choice of distance metric can significantly affect the performance of the nearest centroid classifier. By experimenting with different metrics, you can gain insights into how they influence classification boundaries and model performance. -[]: # -[]: # --- -[]: # -[]: # ### Further Reading -[]: # -[]: # - [Understanding Distance Metrics](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.pairwise_distances.html) -[]: # - [Nearest Centroid Classifier in Scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestCentroid.html) -[]: # -[]: # - [Distance Metrics in Machine Learning](https://towardsdatascience.com/distance-metrics-in-machine-learning-1f3b2a0c4d7e) -[]: # -[]: # --- -[]: # -[]: # ### References -[]: # -[]: # - Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. -[]: # - Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning. Springer. -[]: # - Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press. -[]: # -[]: # --- -[]: # -[]: # ### License -[]: # -[]: # This notebook is licensed under the [MIT License](https://opensource.org/licenses/MIT). -[]: # -[]: # --- -[]: # -[]: # ### Acknowledgments -[]: # -[]: # - [Scikit-learn](https://scikit-learn.org/stable/) for providing the machine learning library used in this notebook. 
-[]: # - [Matplotlib](https://matplotlib.org/) for the visualization tools used in this notebook. \ No newline at end of file diff --git a/drafts/chapter_decompositions/orthogonal_matrices.md b/drafts/chapter_decompositions/orthogonal_matrices.md deleted file mode 100644 index 9835778..0000000 --- a/drafts/chapter_decompositions/orthogonal_matrices.md +++ /dev/null @@ -1,318 +0,0 @@ -## Orthogonal matrices - -A matrix $\mathbf{Q} \in \mathbb{R}^{n \times n}$ is said to be -**orthogonal** if its columns are pairwise orthonormal. This definition -implies that - -$$\mathbf{Q}^{\!\top\!} \mathbf{Q} = \mathbf{Q}\mathbf{Q}^{\!\top\!} = \mathbf{I}$$ - -or equivalently, $\mathbf{Q}^{\!\top\!} = \mathbf{Q}^{-1}$. A nice thing -about orthogonal matrices is that they preserve inner products: - -$$(\mathbf{Q}\mathbf{x})^{\!\top\!}(\mathbf{Q}\mathbf{y}) = \mathbf{x}^{\!\top\!} \mathbf{Q}^{\!\top\!} \mathbf{Q}\mathbf{y} = \mathbf{x}^{\!\top\!} \mathbf{I}\mathbf{y} = \mathbf{x}^{\!\top\!}\mathbf{y}$$ - -A direct result of this fact is that they also preserve 2-norms: - -$$\|\mathbf{Q}\mathbf{x}\|_2 = \sqrt{(\mathbf{Q}\mathbf{x})^{\!\top\!}(\mathbf{Q}\mathbf{x})} = \sqrt{\mathbf{x}^{\!\top\!}\mathbf{x}} = \|\mathbf{x}\|_2$$ - -Therefore multiplication by an orthogonal matrix can be considered as a -transformation that preserves length, but may rotate or reflect the -vector about the origin. - -## Symmetric matrices - -A matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$ is said to be -**symmetric** if it is equal to its own transpose -($\mathbf{A} = \mathbf{A}^{\!\top\!}$), meaning that $A_{ij} = A_{ji}$ -for all $(i,j)$. This definition seems harmless enough but turns out to -have some strong implications. We summarize the most important of these -as - -*Theorem.* -(Spectral Theorem) If $\mathbf{A} \in \mathbb{R}^{n \times n}$ is -symmetric, then there exists an orthonormal basis for $\mathbb{R}^n$ -consisting of eigenvectors of $\mathbf{A}$. - -The practical application of this theorem is a particular factorization -of symmetric matrices, referred to as the **eigendecomposition** or -**spectral decomposition**. Denote the orthonormal basis of eigenvectors -$\mathbf{q}_1, \dots, \mathbf{q}_n$ and their eigenvalues -$\lambda_1, \dots, \lambda_n$. Let $\mathbf{Q}$ be an orthogonal matrix -with $\mathbf{q}_1, \dots, \mathbf{q}_n$ as its columns, and -$\mathbf{\Lambda} = \operatorname{diag}(\lambda_1, \dots, \lambda_n)$. Since by -definition $\mathbf{A}\mathbf{q}_i = \lambda_i\mathbf{q}_i$ for every -$i$, the following relationship holds: - -$$\mathbf{A}\mathbf{Q} = \mathbf{Q}\mathbf{\Lambda}$$ - -Right-multiplying -by $\mathbf{Q}^{\!\top\!}$, we arrive at the decomposition - -$$\mathbf{A} = \mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^{\!\top\!}$$ - -### Rayleigh quotients - -Let $\mathbf{A} \in \mathbb{R}^{n \times n}$ be a symmetric matrix. The -expression $\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x}$ is called a -**quadratic form**. - -There turns out to be an interesting connection between the quadratic -form of a symmetric matrix and its eigenvalues. This connection is -provided by the **Rayleigh quotient** - -$$R_\mathbf{A}(\mathbf{x}) = \frac{\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x}}{\mathbf{x}^{\!\top\!}\mathbf{x}}$$ - -The Rayleigh quotient has a couple of important properties which the -reader can (and should!) 
easily verify from the definition: - -(i) **Scale invariance**: for any vector $\mathbf{x} \neq \mathbf{0}$ - and any scalar $\alpha \neq 0$, - $R_\mathbf{A}(\mathbf{x}) = R_\mathbf{A}(\alpha\mathbf{x})$. - -(ii) If $\mathbf{x}$ is an eigenvector of $\mathbf{A}$ with eigenvalue - $\lambda$, then $R_\mathbf{A}(\mathbf{x}) = \lambda$. - -We can further show that the Rayleigh quotient is bounded by the largest -and smallest eigenvalues of $\mathbf{A}$. But first we will show a -useful special case of the final result. - -*Proposition.* -For any $\mathbf{x}$ such that $\|\mathbf{x}\|_2 = 1$, - -$$\lambda_{\min}(\mathbf{A}) \leq \mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} \leq \lambda_{\max}(\mathbf{A})$$ - -with equality if and only if $\mathbf{x}$ is a corresponding -eigenvector. - -*Proof.* We show only the $\max$ case because the argument for the -$\min$ case is entirely analogous. - -Since $\mathbf{A}$ is symmetric, we can decompose it as -$\mathbf{A} = \mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^{\!\top\!}$. Then use -the change of variable $\mathbf{y} = \mathbf{Q}^{\!\top\!}\mathbf{x}$, -noting that the relationship between $\mathbf{x}$ and $\mathbf{y}$ is -one-to-one and that $\|\mathbf{y}\|_2 = 1$ since $\mathbf{Q}$ is -orthogonal. Hence - -$$\max_{\|\mathbf{x}\|_2 = 1} \mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} = \max_{\|\mathbf{y}\|_2 = 1} \mathbf{y}^{\!\top\!}\mathbf{\Lambda}\mathbf{y} = \max_{y_1^2+\dots+y_n^2=1} \sum_{i=1}^n \lambda_i y_i^2$$ - -Written this way, it is clear that $\mathbf{y}$ maximizes this -expression exactly if and only if it satisfies -$\sum_{i \in I} y_i^2 = 1$ where -$I = \{i : \lambda_i = \max_{j=1,\dots,n} \lambda_j = \lambda_{\max}(\mathbf{A})\}$ -and $y_j = 0$ for $j \not\in I$. That is, $I$ contains the index or -indices of the largest eigenvalue. In this case, the maximal value of -the expression is - -$$\sum_{i=1}^n \lambda_i y_i^2 = \sum_{i \in I} \lambda_i y_i^2 = \lambda_{\max}(\mathbf{A}) \sum_{i \in I} y_i^2 = \lambda_{\max}(\mathbf{A})$$ - -Then writing $\mathbf{q}_1, \dots, \mathbf{q}_n$ for the columns of -$\mathbf{Q}$, we have - -$$\mathbf{x} = \mathbf{Q}\mathbf{Q}^{\!\top\!}\mathbf{x} = \mathbf{Q}\mathbf{y} = \sum_{i=1}^n y_i\mathbf{q}_i = \sum_{i \in I} y_i\mathbf{q}_i$$ - -where we have used the matrix-vector product identity. - -Recall that $\mathbf{q}_1, \dots, \mathbf{q}_n$ are eigenvectors of -$\mathbf{A}$ and form an orthonormal basis for $\mathbb{R}^n$. Therefore -by construction, the set $\{\mathbf{q}_i : i \in I\}$ forms an -orthonormal basis for the eigenspace of $\lambda_{\max}(\mathbf{A})$. -Hence $\mathbf{x}$, which is a linear combination of these, lies in that -eigenspace and thus is an eigenvector of $\mathbf{A}$ corresponding to -$\lambda_{\max}(\mathbf{A})$. - -We have shown that -$\max_{\|\mathbf{x}\|_2 = 1} \mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} = \lambda_{\max}(\mathbf{A})$, -from which we have the general inequality -$\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} \leq \lambda_{\max}(\mathbf{A})$ -for all unit-length $\mathbf{x}$. ◻ - - -By the scale invariance of the Rayleigh quotient, we immediately have as -a corollary (since -$\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} = R_{\mathbf{A}}(\mathbf{x})$ -for unit $\mathbf{x}$) - -*Theorem.* -(Min-max theorem) For all $\mathbf{x} \neq \mathbf{0}$, - -$$\lambda_{\min}(\mathbf{A}) \leq R_\mathbf{A}(\mathbf{x}) \leq \lambda_{\max}(\mathbf{A})$$ - -with equality if and only if $\mathbf{x}$ is a corresponding -eigenvector. 
- - -## Positive (semi-)definite matrices - -A symmetric matrix $\mathbf{A}$ is **positive semi-definite** if for all -$\mathbf{x} \in \mathbb{R}^n$, -$\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} \geq 0$. Sometimes people -write $\mathbf{A} \succeq 0$ to indicate that $\mathbf{A}$ is positive -semi-definite. - -A symmetric matrix $\mathbf{A}$ is **positive definite** if for all -nonzero $\mathbf{x} \in \mathbb{R}^n$, -$\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} > 0$. Sometimes people write -$\mathbf{A} \succ 0$ to indicate that $\mathbf{A}$ is positive definite. -Note that positive definiteness is a strictly stronger property than -positive semi-definiteness, in the sense that every positive definite -matrix is positive semi-definite but not vice-versa. - -These properties are related to eigenvalues in the following way. - -*Proposition.* -A symmetric matrix is positive semi-definite if and only if all of its -eigenvalues are nonnegative, and positive definite if and only if all of -its eigenvalues are positive. - - -*Proof.* Suppose $A$ is positive semi-definite, and let $\mathbf{x}$ be -an eigenvector of $\mathbf{A}$ with eigenvalue $\lambda$. Then - -$$0 \leq \mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} = \mathbf{x}^{\!\top\!}(\lambda\mathbf{x}) = \lambda\mathbf{x}^{\!\top\!}\mathbf{x} = \lambda\|\mathbf{x}\|_2^2$$ - -Since $\mathbf{x} \neq \mathbf{0}$ (by the assumption that it is an -eigenvector), we have $\|\mathbf{x}\|_2^2 > 0$, so we can divide both -sides by $\|\mathbf{x}\|_2^2$ to arrive at $\lambda \geq 0$. If -$\mathbf{A}$ is positive definite, the inequality above holds strictly, -so $\lambda > 0$. This proves one direction. - -To simplify the proof of the other direction, we will use the machinery -of Rayleigh quotients. Suppose that $\mathbf{A}$ is symmetric and all -its eigenvalues are nonnegative. Then for all -$\mathbf{x} \neq \mathbf{0}$, - -$$0 \leq \lambda_{\min}(\mathbf{A}) \leq R_\mathbf{A}(\mathbf{x})$$ - -Since $\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x}$ matches -$R_\mathbf{A}(\mathbf{x})$ in sign, we conclude that $\mathbf{A}$ is -positive semi-definite. If the eigenvalues of $\mathbf{A}$ are all -strictly positive, then $0 < \lambda_{\min}(\mathbf{A})$, whence it -follows that $\mathbf{A}$ is positive definite. ◻ - - -As an example of how these matrices arise, consider - -*Proposition.* -Suppose $\mathbf{A} \in \mathbb{R}^{m \times n}$. Then -$\mathbf{A}^{\!\top\!}\mathbf{A}$ is positive semi-definite. If -$\operatorname{null}(\mathbf{A}) = \{\mathbf{0}\}$, then -$\mathbf{A}^{\!\top\!}\mathbf{A}$ is positive definite. - - -*Proof.* For any $\mathbf{x} \in \mathbb{R}^n$, - -$$\mathbf{x}^{\!\top\!} (\mathbf{A}^{\!\top\!}\mathbf{A})\mathbf{x} = (\mathbf{A}\mathbf{x})^{\!\top\!}(\mathbf{A}\mathbf{x}) = \|\mathbf{A}\mathbf{x}\|_2^2 \geq 0$$ - -so $\mathbf{A}^{\!\top\!}\mathbf{A}$ is positive semi-definite. - -Note that $\|\mathbf{A}\mathbf{x}\|_2^2 = 0$ implies -$\|\mathbf{A}\mathbf{x}\|_2 = 0$, which in turn implies -$\mathbf{A}\mathbf{x} = \mathbf{0}$ (recall that this is a property of -norms). If $\operatorname{null}(\mathbf{A}) = \{\mathbf{0}\}$, -$\mathbf{A}\mathbf{x} = \mathbf{0}$ implies $\mathbf{x} = \mathbf{0}$, -so -$\mathbf{x}^{\!\top\!} (\mathbf{A}^{\!\top\!}\mathbf{A})\mathbf{x} = 0$ -if and only if $\mathbf{x} = \mathbf{0}$, and thus -$\mathbf{A}^{\!\top\!}\mathbf{A}$ is positive definite. ◻ - -Positive definite matrices are invertible (since their eigenvalues are -nonzero), whereas positive semi-definite matrices might not be. 
However, -if you already have a positive semi-definite matrix, it is possible to -perturb its diagonal slightly to produce a positive definite matrix. - -*Proposition.* -If $\mathbf{A}$ is positive semi-definite and $\epsilon > 0$, then -$\mathbf{A} + \epsilon\mathbf{I}$ is positive definite. - -*Proof.* Assuming $\mathbf{A}$ is positive semi-definite and -$\epsilon > 0$, we have for any $\mathbf{x} \neq \mathbf{0}$ that - -$$\mathbf{x}^{\!\top\!}(\mathbf{A}+\epsilon\mathbf{I})\mathbf{x} = \mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} + \epsilon\mathbf{x}^{\!\top\!}\mathbf{I}\mathbf{x} = \underbrace{\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x}}_{\geq 0} + \underbrace{\epsilon\|\mathbf{x}\|_2^2}_{> 0} > 0$$ - -as claimed. ◻ - -An obvious but frequently useful consequence of the two propositions we -have just shown is that -$\mathbf{A}^{\!\top\!}\mathbf{A} + \epsilon\mathbf{I}$ is positive -definite (and in particular, invertible) for *any* matrix $\mathbf{A}$ -and any $\epsilon > 0$. - -### The geometry of positive definite quadratic forms - -A useful way to understand quadratic forms is by the geometry of their -level sets. A **level set** or **isocontour** of a function is the set -of all inputs such that the function applied to those inputs yields a -given output. Mathematically, the $c$-isocontour of $f$ is -$\{\mathbf{x} \in \operatorname{dom} f : f(\mathbf{x}) = c\}$. - -Let us consider the special case -$f(\mathbf{x}) = \mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x}$ where -$\mathbf{A}$ is a positive definite matrix. Since $\mathbf{A}$ is -positive definite, it has a unique matrix square root -$\mathbf{A}^{\frac{1}{2}} = \mathbf{Q}\mathbf{\Lambda}^{\frac{1}{2}}\mathbf{Q}^{\!\top\!}$, -where $\mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^{\!\top\!}$ is the -eigendecomposition of $\mathbf{A}$ and -$\mathbf{\Lambda}^{\frac{1}{2}} = \operatorname{diag}(\sqrt{\lambda_1}, \dots \sqrt{\lambda_n})$. -It is easy to see that this matrix $\mathbf{A}^{\frac{1}{2}}$ is -positive definite (consider its eigenvalues) and satisfies -$\mathbf{A}^{\frac{1}{2}}\mathbf{A}^{\frac{1}{2}} = \mathbf{A}$. Fixing -a value $c \geq 0$, the $c$-isocontour of $f$ is the set of -$\mathbf{x} \in \mathbb{R}^n$ such that - -$$c = \mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} = \mathbf{x}^{\!\top\!}\mathbf{A}^{\frac{1}{2}}\mathbf{A}^{\frac{1}{2}}\mathbf{x} = \|\mathbf{A}^{\frac{1}{2}}\mathbf{x}\|_2^2$$ - -where we have used the symmetry of $\mathbf{A}^{\frac{1}{2}}$. Making -the change of variable -$\mathbf{z} = \mathbf{A}^{\frac{1}{2}}\mathbf{x}$, we have the condition -$\|\mathbf{z}\|_2 = \sqrt{c}$. That is, the values $\mathbf{z}$ lie on a -sphere of radius $\sqrt{c}$. These can be parameterized as -$\mathbf{z} = \sqrt{c}\hat{\mathbf{z}}$ where $\hat{\mathbf{z}}$ has -$\|\hat{\mathbf{z}}\|_2 = 1$. Then since -$\mathbf{A}^{-\frac{1}{2}} = \mathbf{Q}\mathbf{\Lambda}^{-\frac{1}{2}}\mathbf{Q}^{\!\top\!}$, -we have - -$$\mathbf{x} = \mathbf{A}^{-\frac{1}{2}}\mathbf{z} = \mathbf{Q}\mathbf{\Lambda}^{-\frac{1}{2}}\mathbf{Q}^{\!\top\!}\sqrt{c}\hat{\mathbf{z}} = \sqrt{c}\mathbf{Q}\mathbf{\Lambda}^{-\frac{1}{2}}\tilde{\mathbf{z}}$$ - -where $\tilde{\mathbf{z}} = \mathbf{Q}^{\!\top\!}\hat{\mathbf{z}}$ also -satisfies $\|\tilde{\mathbf{z}}\|_2 = 1$ since $\mathbf{Q}$ is -orthogonal. 
Using this parameterization, we see that the solution set -$\{\mathbf{x} \in \mathbb{R}^n : f(\mathbf{x}) = c\}$ is the image of -the unit sphere -$\{\tilde{\mathbf{z}} \in \mathbb{R}^n : \|\tilde{\mathbf{z}}\|_2 = 1\}$ -under the invertible linear map -$\mathbf{x} = \sqrt{c}\mathbf{Q}\mathbf{\Lambda}^{-\frac{1}{2}}\tilde{\mathbf{z}}$. - -What we have gained with all these manipulations is a clear algebraic -understanding of the $c$-isocontour of $f$ in terms of a sequence of -linear transformations applied to a well-understood set. We begin with -the unit sphere, then scale every axis $i$ by -$\lambda_i^{-\frac{1}{2}}$, resulting in an axis-aligned ellipsoid. -Observe that the axis lengths of the ellipsoid are proportional to the -inverse square roots of the eigenvalues of $\mathbf{A}$. Hence larger -eigenvalues correspond to shorter axis lengths, and vice-versa. - -Then this axis-aligned ellipsoid undergoes a rigid transformation (i.e. -one that preserves length and angles, such as a rotation/reflection) -given by $\mathbf{Q}$. The result of this transformation is that the -axes of the ellipse are no longer along the coordinate axes in general, -but rather along the directions given by the corresponding eigenvectors. -To see this, consider the unit vector $\mathbf{e}_i \in \mathbb{R}^n$ -that has $[\mathbf{e}_i]_j = \delta_{ij}$. In the pre-transformed space, -this vector points along the axis with length proportional to -$\lambda_i^{-\frac{1}{2}}$. But after applying the rigid transformation -$\mathbf{Q}$, the resulting vector points in the direction of the -corresponding eigenvector $\mathbf{q}_i$, since - -$$\mathbf{Q}\mathbf{e}_i = \sum_{j=1}^n [\mathbf{e}_i]_j\mathbf{q}_j = \mathbf{q}_i$$ - -where we have used the matrix-vector product identity from earlier. - -In summary: the isocontours of -$f(\mathbf{x}) = \mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x}$ are -ellipsoids such that the axes point in the directions of the -eigenvectors of $\mathbf{A}$, and the radii of these axes are -proportional to the inverse square roots of the corresponding -eigenvalues. - - diff --git a/drafts/chapter_decompositions/pseudoinverse.md b/drafts/chapter_decompositions/pseudoinverse.md deleted file mode 100644 index af58d9b..0000000 --- a/drafts/chapter_decompositions/pseudoinverse.md +++ /dev/null @@ -1,23 +0,0 @@ -## Moore-Penrose Pseudoinverse -The Moore-Penrose pseudoinverse is a generalization of the matrix inverse that can be applied to non-square or singular matrices. It is denoted as \( A^+ \) for a matrix \( A \). The pseudoinverse satisfies the following properties: -1. **Existence**: The pseudoinverse exists for any matrix \( A \). -2. **Uniqueness**: The pseudoinverse is unique. -3. **Properties**: - - \( A A^+ A = A \) - - \( A^+ A A^+ = A^+ \) - - \( (A A^+)^\top = A A^+ \) - - \( (A^+ A)^\top = A^+ A \) -4. **Rank**: The rank of \( A^+ \) is equal to the rank of \( A \). -5. **Singular Value Decomposition (SVD)**: The pseudoinverse can be computed using the singular value decomposition of \( A \). If \( A = U \Sigma V^\top \), where \( U \) and \( V \) are orthogonal matrices and \( \Sigma \) is a diagonal matrix with singular values, then: - \[ - A^+ = V \Sigma^+ U^\top - \] - where \( \Sigma^+ \) is obtained by taking the reciprocal of the non-zero singular values in \( \Sigma \) and transposing the resulting matrix. -6. 
**Applications**: The pseudoinverse is used in various applications, including solving linear systems, least squares problems, and in machine learning algorithms such as linear regression. -7. **Least Squares Solution**: The pseudoinverse provides a least squares solution to the equation \( Ax = b \) when \( A \) is not square or has no unique solution. The least squares solution is given by: - \[ - x = A^+ b - \] -8. **Geometric Interpretation**: The pseudoinverse can be interpreted geometrically as the projection of a vector onto the column space of \( A \). -9. **Computational Considerations**: The computation of the pseudoinverse can be done efficiently using numerical methods, such as the SVD, especially for large matrices. -10. **Limitations**: The pseudoinverse may not be suitable for all applications, especially when the matrix is ill-conditioned or has a high condition number. diff --git a/drafts/chapter_decompositions/spectral_theorem_self-adjoint.md b/drafts/chapter_decompositions/spectral_theorem_self-adjoint.md new file mode 100644 index 0000000..6818c98 --- /dev/null +++ b/drafts/chapter_decompositions/spectral_theorem_self-adjoint.md @@ -0,0 +1,273 @@ +--- +jupytext: + text_representation: + extension: .md + format_name: myst + format_version: 0.13 + jupytext_version: 1.16.7 +kernelspec: + display_name: Python 3 + language: python + name: python3 +--- +# 📜 Spectral Theorem for Compact Self-Adjoint Operators + +:::{prf:theorem} Compact, Self-Adjoint Linear Operator +:label: def-compact-selfadjoint-operator +:nonumber: + +Let $ \mathcal{H} $ be a Hilbert space. + +A linear operator $ T : \mathcal{H} \to \mathcal{H} $ is called a **compact, self-adjoint linear operator** if it satisfies the following properties: + +1. **Linearity**: + + $$ + T(\alpha f + \beta g) = \alpha T(f) + \beta T(g) + \quad \text{for all } f, g \in \mathcal{H}, \ \alpha, \beta \in \mathbb{R} \text{ (or } \mathbb{C} \text{)} + $$ + +2. **Self-Adjointness**: + + $$ + \langle T f, g \rangle = \langle f, T g \rangle + \quad \text{for all } f, g \in \mathcal{H} + $$ + +3. **Compactness**: + For every bounded sequence $ \{f_n\} \subset \mathcal{H} $, the sequence $ \{T f_n\} $ has a **convergent subsequence** in $ \mathcal{H} $. +::: + +Here is a clear and formal example — written as a MyST proof block — of a **compact, self-adjoint linear operator** on the Hilbert space $L^2([0, 1])$, using an **integral operator with a continuous symmetric kernel**. + +--- + + +:::{prf:theorem} Integral Operator on $ L^2([0, 1]) $ +:label: ex-integral-operator-compact-selfadjoint +:nonumber: + +Let $ \mathcal{H} = L^2([0, 1]) $ and let $ k : [0, 1] \times [0, 1] \to \mathbb{R} $ be a **continuous**, **symmetric** function, i.e., + +$$ +k(x, y) = k(y, x) \quad \text{for all } x, y \in [0, 1] +$$ + +Define the operator $ T : \mathcal{H} \to \mathcal{H} $ by: + +$$ +(Tf)(x) = \int_0^1 k(x, y) f(y) \, dy +$$ + +Then $ T $ is a **compact, self-adjoint linear operator**: + +- **Linearity**: follows directly from the linearity of the integral. +- **Self-adjointness**: for all $ f, g \in L^2([0, 1]) $, + +$$ +\langle T f, g \rangle = \int_0^1 \left( \int_0^1 k(x, y) f(y) \, dy \right) g(x) \, dx += \int_0^1 f(y) \left( \int_0^1 k(x, y) g(x) \, dx \right) dy += \langle f, T g \rangle +$$ + +by symmetry of $ k(x, y) $. + +- **Compactness**: Since $ k $ is continuous on a compact domain $ [0, 1]^2 $, the operator $ T $ is compact (by the Arzelà–Ascoli theorem or the Hilbert–Schmidt theorem). 
+ +Thus, $ T $ satisfies all the conditions of a compact, self-adjoint linear operator. +::: + + +:::{prf:theorem} RBF Kernel Operator on $ L^2([0, 1]) $ +:label: ex-rbf-kernel-operator +:nonumber: + +Let $ \mathcal{H} = L^2([0, 1]) $, and let $ \gamma > 0 $. Define the kernel: + +$$ +k(x, y) = \exp(-\gamma (x - y)^2) +$$ + +This is the **Radial Basis Function (RBF) kernel**, which is: + +- **continuous** on $ [0, 1]^2 $, +- **symmetric**, i.e., $ k(x, y) = k(y, x) $, +- **positive definite**, meaning it induces a positive semi-definite kernel matrix for any finite sample. + +Then the integral operator + +$$ +(Tf)(x) = \int_0^1 \exp(-\gamma (x - y)^2) f(y) \, dy +$$ + +defines a **compact, self-adjoint linear operator** on $ L^2([0, 1]) $. + +::: + +:::{prf:theorem} Brownian Motion Kernel Operator on $ L^2([0, 1]) $ +:label: ex-min-kernel-operator +:nonumber: + +Let $ k(x, y) = \min(x, y) $, defined on $ [0, 1] \times [0, 1] $. This kernel is: + +- **continuous** and **symmetric**: $ \min(x, y) = \min(y, x) $ +- **positive semi-definite**: it corresponds to the covariance function of standard Brownian motion. + +The integral operator: + +$$ +(Tf)(x) = \int_0^1 \min(x, y) f(y) \, dy +$$ + +is known as the **Volterra operator** associated with Brownian motion. It is: + +- **linear** +- **self-adjoint** (via symmetry of $ \min(x, y) $) +- **compact**, since it is a Hilbert–Schmidt operator with square-integrable kernel. + +Thus, it is a **compact, self-adjoint linear operator** on $ L^2([0, 1]) $. + +::: + + + + +Let $\mathcal{H}$ be a real or complex **Hilbert space**, and let +$T : \mathcal{H} \to \mathcal{H}$ be a **compact, self-adjoint linear operator**. + +> Then: +> +> 1. There exists an **orthonormal basis** $\{\phi_i\}_{i \in \mathbb{N}}$ of $\overline{\operatorname{im}(T)} \subseteq \mathcal{H}$ consisting of **eigenvectors of $T$**. +> +> 2. The corresponding eigenvalues $\{\lambda_i\} \subset \mathbb{R}$ are real, with $\lambda_i \to 0$. +> +> 3. $T$ has at most countably many non-zero eigenvalues, and each non-zero eigenvalue has **finite multiplicity**. +> +> 4. For all $f \in \mathcal{H}$, we have: +> +> $$ +> T f = \sum_{i=1}^\infty \lambda_i \langle f, \phi_i \rangle \phi_i +> $$ +> +> where the sum converges in norm (i.e., in $\mathcal{H}$). + +--- + +## 🧠 Intuition + +* Compactness of $T$ is like “finite rank behavior” at infinity. +* Self-adjointness ensures that the eigenvalues are real, and eigenvectors for distinct eigenvalues are orthogonal. +* The spectrum of $T$ consists of **eigenvalues only**, accumulating at 0. +* We can **diagonalize** $T$ in an orthonormal eigenbasis — exactly like symmetric matrices. + +--- + +## ✍️ Sketch of the Proof + +We split the proof into a sequence of known results. + +--- + +### 1. **Existence of a Maximum Eigenvalue** + +Let $T$ be compact and self-adjoint. Define: + +$$ +\lambda_1 = \sup_{\|f\| = 1} \langle Tf, f \rangle +$$ + +This is the **Rayleigh quotient**, and it gives the largest eigenvalue in magnitude. The supremum is **attained** (due to compactness), and the maximizer $f_1$ satisfies: + +$$ +Tf_1 = \lambda_1 f_1 +$$ + +--- + +### 2. **Orthogonalization and Iteration (like Gram-Schmidt)** + +Define $\mathcal{H}_1 = \{f \in \mathcal{H} : \langle f, f_1 \rangle = 0\}$. Restrict $T$ to $\mathcal{H}_1$, where it remains compact and self-adjoint. Then find the next eigenpair $(\lambda_2, f_2)$, and repeat. 
+ +This gives an **orthonormal sequence** of eigenfunctions $\{f_i\}$ with real eigenvalues $\lambda_i \to 0$, due to compactness. + +--- + +### 3. **Convergence of Spectral Expansion** + +For any $f \in \mathcal{H}$, let: + +$$ +f = \sum_{i=1}^\infty \langle f, \phi_i \rangle \phi_i + f_\perp +$$ + +where $f_\perp \in \ker(T)$. Then: + +$$ +Tf = \sum_{i=1}^\infty \lambda_i \langle f, \phi_i \rangle \phi_i +$$ + +The convergence is in $\mathcal{H}$-norm, using Parseval's identity and the fact that $\lambda_i \to 0$. + +--- + +### ✅ Summary Box + +**Spectral Theorem (Compact Self-Adjoint Operators)** + +Let $ T : \mathcal{H} \to \mathcal{H} $ be compact and self-adjoint. + +Then there exists an orthonormal basis $ \{\phi_i\} \subset \mathcal{H} $ consisting of eigenvectors of $ T $, with corresponding real eigenvalues $ \lambda_i \to 0 $, such that: + +$$ +T f = \sum_{i=1}^\infty \lambda_i \langle f, \phi_i \rangle \phi_i +\quad \text{for all } f \in \mathcal{H} +$$ + + +--- + +This result is the infinite-dimensional generalization of the fact that a real symmetric matrix has an orthonormal eigenbasis and can be diagonalized. + +```{code-cell} ipython3 +:tags: [hide-input] +import numpy as np +import matplotlib.pyplot as plt + +# Define domain +n = 200 +x = np.linspace(0, 1, n) +X, Y = np.meshgrid(x, x) + +# Define RBF kernel +gamma = 50 +rbf_kernel = np.exp(-gamma * (X - Y) ** 2) + +# Define min kernel +min_kernel = np.minimum(X, Y) + +# Plotting both kernels side-by-side +fig, axs = plt.subplots(1, 2, figsize=(14, 5)) + +# Plot RBF kernel +im0 = axs[0].imshow(rbf_kernel, extent=[0,1,0,1], origin='lower', cmap='viridis') +axs[0].set_title('RBF Kernel $k(x,y) = \exp(-\\gamma (x-y)^2)$') +axs[0].set_xlabel('x') +axs[0].set_ylabel('y') +fig.colorbar(im0, ax=axs[0]) + +# Plot min kernel +im1 = axs[1].imshow(min_kernel, extent=[0,1,0,1], origin='lower', cmap='viridis') +axs[1].set_title('Min Kernel $k(x,y) = \min(x, y)$') +axs[1].set_xlabel('x') +axs[1].set_ylabel('y') +fig.colorbar(im1, ax=axs[1]) + +plt.tight_layout() +plt.show() +``` +This visualization shows two symmetric, continuous kernels defined on $[0, 1]^2$, each inducing a compact, self-adjoint integral operator on $L^2([0, 1])$: + +* **Left panel**: The RBF kernel $k(x, y) = \exp(-\gamma(x - y)^2)$, concentrated along the diagonal where $x \approx y$, modeling local similarity. +* **Right panel**: The Brownian motion kernel $k(x, y) = \min(x, y)$, forming a triangular structure that accumulates information from the origin. + +Both kernels generate PSD Gram matrices and operators with eigenfunction decompositions — perfect for illustrating Mercer's theorem in practice. diff --git a/drafts/chapter_decompositions/symmetric_matrices.md b/drafts/chapter_decompositions/symmetric_matrices.md deleted file mode 100644 index a0b5276..0000000 --- a/drafts/chapter_decompositions/symmetric_matrices.md +++ /dev/null @@ -1,297 +0,0 @@ -## Symmetric matrices - -A matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$ is said to be -**symmetric** if it is equal to its own transpose -($\mathbf{A} = \mathbf{A}^{\!\top\!}$), meaning that $A_{ij} = A_{ji}$ -for all $(i,j)$. This definition seems harmless enough but turns out to -have some strong implications. We summarize the most important of these -as - -*Theorem.* -(Spectral Theorem) If $\mathbf{A} \in \mathbb{R}^{n \times n}$ is -symmetric, then there exists an orthonormal basis for $\mathbb{R}^n$ -consisting of eigenvectors of $\mathbf{A}$. 
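A minimal numerical illustration of this statement (a sketch assuming NumPy; the random test matrix is an arbitrary choice, not from the original text):

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = (B + B.T) / 2                        # a symmetric matrix

lam, Q = np.linalg.eigh(A)               # eigenvalues and eigenvectors of a symmetric matrix
print(np.allclose(Q.T @ Q, np.eye(4)))   # the returned eigenvectors are orthonormal ...
print(np.allclose(A @ Q, Q * lam))       # ... and each column satisfies A q_i = lambda_i q_i
```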
- -The practical application of this theorem is a particular factorization -of symmetric matrices, referred to as the **eigendecomposition** or -**spectral decomposition**. Denote the orthonormal basis of eigenvectors -$\mathbf{q}_1, \dots, \mathbf{q}_n$ and their eigenvalues -$\lambda_1, \dots, \lambda_n$. Let $\mathbf{Q}$ be an orthogonal matrix -with $\mathbf{q}_1, \dots, \mathbf{q}_n$ as its columns, and -$\mathbf{\Lambda} = \operatorname{diag}(\lambda_1, \dots, \lambda_n)$. Since by -definition $\mathbf{A}\mathbf{q}_i = \lambda_i\mathbf{q}_i$ for every -$i$, the following relationship holds: - -$$\mathbf{A}\mathbf{Q} = \mathbf{Q}\mathbf{\Lambda}$$ - -Right-multiplying -by $\mathbf{Q}^{\!\top\!}$, we arrive at the decomposition - -$$\mathbf{A} = \mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^{\!\top\!}$$ - -### Rayleigh quotients - -Let $\mathbf{A} \in \mathbb{R}^{n \times n}$ be a symmetric matrix. The -expression $\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x}$ is called a -**quadratic form**. - -There turns out to be an interesting connection between the quadratic -form of a symmetric matrix and its eigenvalues. This connection is -provided by the **Rayleigh quotient** - -$$R_\mathbf{A}(\mathbf{x}) = \frac{\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x}}{\mathbf{x}^{\!\top\!}\mathbf{x}}$$ - -The Rayleigh quotient has a couple of important properties which the -reader can (and should!) easily verify from the definition: - -(i) **Scale invariance**: for any vector $\mathbf{x} \neq \mathbf{0}$ - and any scalar $\alpha \neq 0$, - $R_\mathbf{A}(\mathbf{x}) = R_\mathbf{A}(\alpha\mathbf{x})$. - -(ii) If $\mathbf{x}$ is an eigenvector of $\mathbf{A}$ with eigenvalue - $\lambda$, then $R_\mathbf{A}(\mathbf{x}) = \lambda$. - -We can further show that the Rayleigh quotient is bounded by the largest -and smallest eigenvalues of $\mathbf{A}$. But first we will show a -useful special case of the final result. - -*Proposition.* -For any $\mathbf{x}$ such that $\|\mathbf{x}\|_2 = 1$, - -$$\lambda_{\min}(\mathbf{A}) \leq \mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} \leq \lambda_{\max}(\mathbf{A})$$ - -with equality if and only if $\mathbf{x}$ is a corresponding -eigenvector. - -*Proof.* We show only the $\max$ case because the argument for the -$\min$ case is entirely analogous. - -Since $\mathbf{A}$ is symmetric, we can decompose it as -$\mathbf{A} = \mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^{\!\top\!}$. Then use -the change of variable $\mathbf{y} = \mathbf{Q}^{\!\top\!}\mathbf{x}$, -noting that the relationship between $\mathbf{x}$ and $\mathbf{y}$ is -one-to-one and that $\|\mathbf{y}\|_2 = 1$ since $\mathbf{Q}$ is -orthogonal. Hence - -$$\max_{\|\mathbf{x}\|_2 = 1} \mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} = \max_{\|\mathbf{y}\|_2 = 1} \mathbf{y}^{\!\top\!}\mathbf{\Lambda}\mathbf{y} = \max_{y_1^2+\dots+y_n^2=1} \sum_{i=1}^n \lambda_i y_i^2$$ - -Written this way, it is clear that $\mathbf{y}$ maximizes this -expression exactly if and only if it satisfies -$\sum_{i \in I} y_i^2 = 1$ where -$I = \{i : \lambda_i = \max_{j=1,\dots,n} \lambda_j = \lambda_{\max}(\mathbf{A})\}$ -and $y_j = 0$ for $j \not\in I$. That is, $I$ contains the index or -indices of the largest eigenvalue. 
In this case, the maximal value of -the expression is - -$$\sum_{i=1}^n \lambda_i y_i^2 = \sum_{i \in I} \lambda_i y_i^2 = \lambda_{\max}(\mathbf{A}) \sum_{i \in I} y_i^2 = \lambda_{\max}(\mathbf{A})$$ - -Then writing $\mathbf{q}_1, \dots, \mathbf{q}_n$ for the columns of -$\mathbf{Q}$, we have - -$$\mathbf{x} = \mathbf{Q}\mathbf{Q}^{\!\top\!}\mathbf{x} = \mathbf{Q}\mathbf{y} = \sum_{i=1}^n y_i\mathbf{q}_i = \sum_{i \in I} y_i\mathbf{q}_i$$ - -where we have used the matrix-vector product identity. - -Recall that $\mathbf{q}_1, \dots, \mathbf{q}_n$ are eigenvectors of -$\mathbf{A}$ and form an orthonormal basis for $\mathbb{R}^n$. Therefore -by construction, the set $\{\mathbf{q}_i : i \in I\}$ forms an -orthonormal basis for the eigenspace of $\lambda_{\max}(\mathbf{A})$. -Hence $\mathbf{x}$, which is a linear combination of these, lies in that -eigenspace and thus is an eigenvector of $\mathbf{A}$ corresponding to -$\lambda_{\max}(\mathbf{A})$. - -We have shown that -$\max_{\|\mathbf{x}\|_2 = 1} \mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} = \lambda_{\max}(\mathbf{A})$, -from which we have the general inequality -$\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} \leq \lambda_{\max}(\mathbf{A})$ -for all unit-length $\mathbf{x}$. ◻ - - -By the scale invariance of the Rayleigh quotient, we immediately have as -a corollary (since -$\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} = R_{\mathbf{A}}(\mathbf{x})$ -for unit $\mathbf{x}$) - -*Theorem.* -(Min-max theorem) For all $\mathbf{x} \neq \mathbf{0}$, - -$$\lambda_{\min}(\mathbf{A}) \leq R_\mathbf{A}(\mathbf{x}) \leq \lambda_{\max}(\mathbf{A})$$ - -with equality if and only if $\mathbf{x}$ is a corresponding -eigenvector. - - -## Positive (semi-)definite matrices - -A symmetric matrix $\mathbf{A}$ is **positive semi-definite** if for all -$\mathbf{x} \in \mathbb{R}^n$, -$\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} \geq 0$. Sometimes people -write $\mathbf{A} \succeq 0$ to indicate that $\mathbf{A}$ is positive -semi-definite. - -A symmetric matrix $\mathbf{A}$ is **positive definite** if for all -nonzero $\mathbf{x} \in \mathbb{R}^n$, -$\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} > 0$. Sometimes people write -$\mathbf{A} \succ 0$ to indicate that $\mathbf{A}$ is positive definite. -Note that positive definiteness is a strictly stronger property than -positive semi-definiteness, in the sense that every positive definite -matrix is positive semi-definite but not vice-versa. - -These properties are related to eigenvalues in the following way. - -*Proposition.* -A symmetric matrix is positive semi-definite if and only if all of its -eigenvalues are nonnegative, and positive definite if and only if all of -its eigenvalues are positive. - - -*Proof.* Suppose $A$ is positive semi-definite, and let $\mathbf{x}$ be -an eigenvector of $\mathbf{A}$ with eigenvalue $\lambda$. Then - -$$0 \leq \mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} = \mathbf{x}^{\!\top\!}(\lambda\mathbf{x}) = \lambda\mathbf{x}^{\!\top\!}\mathbf{x} = \lambda\|\mathbf{x}\|_2^2$$ - -Since $\mathbf{x} \neq \mathbf{0}$ (by the assumption that it is an -eigenvector), we have $\|\mathbf{x}\|_2^2 > 0$, so we can divide both -sides by $\|\mathbf{x}\|_2^2$ to arrive at $\lambda \geq 0$. If -$\mathbf{A}$ is positive definite, the inequality above holds strictly, -so $\lambda > 0$. This proves one direction. - -To simplify the proof of the other direction, we will use the machinery -of Rayleigh quotients. Suppose that $\mathbf{A}$ is symmetric and all -its eigenvalues are nonnegative. 
Then for all -$\mathbf{x} \neq \mathbf{0}$, - -$$0 \leq \lambda_{\min}(\mathbf{A}) \leq R_\mathbf{A}(\mathbf{x})$$ - -Since $\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x}$ matches -$R_\mathbf{A}(\mathbf{x})$ in sign, we conclude that $\mathbf{A}$ is -positive semi-definite. If the eigenvalues of $\mathbf{A}$ are all -strictly positive, then $0 < \lambda_{\min}(\mathbf{A})$, whence it -follows that $\mathbf{A}$ is positive definite. ◻ - - -As an example of how these matrices arise, consider - -*Proposition.* -Suppose $\mathbf{A} \in \mathbb{R}^{m \times n}$. Then -$\mathbf{A}^{\!\top\!}\mathbf{A}$ is positive semi-definite. If -$\operatorname{null}(\mathbf{A}) = \{\mathbf{0}\}$, then -$\mathbf{A}^{\!\top\!}\mathbf{A}$ is positive definite. - - -*Proof.* For any $\mathbf{x} \in \mathbb{R}^n$, - -$$\mathbf{x}^{\!\top\!} (\mathbf{A}^{\!\top\!}\mathbf{A})\mathbf{x} = (\mathbf{A}\mathbf{x})^{\!\top\!}(\mathbf{A}\mathbf{x}) = \|\mathbf{A}\mathbf{x}\|_2^2 \geq 0$$ - -so $\mathbf{A}^{\!\top\!}\mathbf{A}$ is positive semi-definite. - -Note that $\|\mathbf{A}\mathbf{x}\|_2^2 = 0$ implies -$\|\mathbf{A}\mathbf{x}\|_2 = 0$, which in turn implies -$\mathbf{A}\mathbf{x} = \mathbf{0}$ (recall that this is a property of -norms). If $\operatorname{null}(\mathbf{A}) = \{\mathbf{0}\}$, -$\mathbf{A}\mathbf{x} = \mathbf{0}$ implies $\mathbf{x} = \mathbf{0}$, -so -$\mathbf{x}^{\!\top\!} (\mathbf{A}^{\!\top\!}\mathbf{A})\mathbf{x} = 0$ -if and only if $\mathbf{x} = \mathbf{0}$, and thus -$\mathbf{A}^{\!\top\!}\mathbf{A}$ is positive definite. ◻ - -Positive definite matrices are invertible (since their eigenvalues are -nonzero), whereas positive semi-definite matrices might not be. However, -if you already have a positive semi-definite matrix, it is possible to -perturb its diagonal slightly to produce a positive definite matrix. - -*Proposition.* -If $\mathbf{A}$ is positive semi-definite and $\epsilon > 0$, then -$\mathbf{A} + \epsilon\mathbf{I}$ is positive definite. - -*Proof.* Assuming $\mathbf{A}$ is positive semi-definite and -$\epsilon > 0$, we have for any $\mathbf{x} \neq \mathbf{0}$ that - -$$\mathbf{x}^{\!\top\!}(\mathbf{A}+\epsilon\mathbf{I})\mathbf{x} = \mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} + \epsilon\mathbf{x}^{\!\top\!}\mathbf{I}\mathbf{x} = \underbrace{\mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x}}_{\geq 0} + \underbrace{\epsilon\|\mathbf{x}\|_2^2}_{> 0} > 0$$ - -as claimed. ◻ - -An obvious but frequently useful consequence of the two propositions we -have just shown is that -$\mathbf{A}^{\!\top\!}\mathbf{A} + \epsilon\mathbf{I}$ is positive -definite (and in particular, invertible) for *any* matrix $\mathbf{A}$ -and any $\epsilon > 0$. - -### The geometry of positive definite quadratic forms - -A useful way to understand quadratic forms is by the geometry of their -level sets. A **level set** or **isocontour** of a function is the set -of all inputs such that the function applied to those inputs yields a -given output. Mathematically, the $c$-isocontour of $f$ is -$\{\mathbf{x} \in \operatorname{dom} f : f(\mathbf{x}) = c\}$. - -Let us consider the special case -$f(\mathbf{x}) = \mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x}$ where -$\mathbf{A}$ is a positive definite matrix. 
Since $\mathbf{A}$ is -positive definite, it has a unique matrix square root -$\mathbf{A}^{\frac{1}{2}} = \mathbf{Q}\mathbf{\Lambda}^{\frac{1}{2}}\mathbf{Q}^{\!\top\!}$, -where $\mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^{\!\top\!}$ is the -eigendecomposition of $\mathbf{A}$ and -$\mathbf{\Lambda}^{\frac{1}{2}} = \operatorname{diag}(\sqrt{\lambda_1}, \dots \sqrt{\lambda_n})$. -It is easy to see that this matrix $\mathbf{A}^{\frac{1}{2}}$ is -positive definite (consider its eigenvalues) and satisfies -$\mathbf{A}^{\frac{1}{2}}\mathbf{A}^{\frac{1}{2}} = \mathbf{A}$. Fixing -a value $c \geq 0$, the $c$-isocontour of $f$ is the set of -$\mathbf{x} \in \mathbb{R}^n$ such that - -$$c = \mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x} = \mathbf{x}^{\!\top\!}\mathbf{A}^{\frac{1}{2}}\mathbf{A}^{\frac{1}{2}}\mathbf{x} = \|\mathbf{A}^{\frac{1}{2}}\mathbf{x}\|_2^2$$ - -where we have used the symmetry of $\mathbf{A}^{\frac{1}{2}}$. Making -the change of variable -$\mathbf{z} = \mathbf{A}^{\frac{1}{2}}\mathbf{x}$, we have the condition -$\|\mathbf{z}\|_2 = \sqrt{c}$. That is, the values $\mathbf{z}$ lie on a -sphere of radius $\sqrt{c}$. These can be parameterized as -$\mathbf{z} = \sqrt{c}\hat{\mathbf{z}}$ where $\hat{\mathbf{z}}$ has -$\|\hat{\mathbf{z}}\|_2 = 1$. Then since -$\mathbf{A}^{-\frac{1}{2}} = \mathbf{Q}\mathbf{\Lambda}^{-\frac{1}{2}}\mathbf{Q}^{\!\top\!}$, -we have - -$$\mathbf{x} = \mathbf{A}^{-\frac{1}{2}}\mathbf{z} = \mathbf{Q}\mathbf{\Lambda}^{-\frac{1}{2}}\mathbf{Q}^{\!\top\!}\sqrt{c}\hat{\mathbf{z}} = \sqrt{c}\mathbf{Q}\mathbf{\Lambda}^{-\frac{1}{2}}\tilde{\mathbf{z}}$$ - -where $\tilde{\mathbf{z}} = \mathbf{Q}^{\!\top\!}\hat{\mathbf{z}}$ also -satisfies $\|\tilde{\mathbf{z}}\|_2 = 1$ since $\mathbf{Q}$ is -orthogonal. Using this parameterization, we see that the solution set -$\{\mathbf{x} \in \mathbb{R}^n : f(\mathbf{x}) = c\}$ is the image of -the unit sphere -$\{\tilde{\mathbf{z}} \in \mathbb{R}^n : \|\tilde{\mathbf{z}}\|_2 = 1\}$ -under the invertible linear map -$\mathbf{x} = \sqrt{c}\mathbf{Q}\mathbf{\Lambda}^{-\frac{1}{2}}\tilde{\mathbf{z}}$. - -What we have gained with all these manipulations is a clear algebraic -understanding of the $c$-isocontour of $f$ in terms of a sequence of -linear transformations applied to a well-understood set. We begin with -the unit sphere, then scale every axis $i$ by -$\lambda_i^{-\frac{1}{2}}$, resulting in an axis-aligned ellipsoid. -Observe that the axis lengths of the ellipsoid are proportional to the -inverse square roots of the eigenvalues of $\mathbf{A}$. Hence larger -eigenvalues correspond to shorter axis lengths, and vice-versa. - -Then this axis-aligned ellipsoid undergoes a rigid transformation (i.e. -one that preserves length and angles, such as a rotation/reflection) -given by $\mathbf{Q}$. The result of this transformation is that the -axes of the ellipse are no longer along the coordinate axes in general, -but rather along the directions given by the corresponding eigenvectors. -To see this, consider the unit vector $\mathbf{e}_i \in \mathbb{R}^n$ -that has $[\mathbf{e}_i]_j = \delta_{ij}$. In the pre-transformed space, -this vector points along the axis with length proportional to -$\lambda_i^{-\frac{1}{2}}$. But after applying the rigid transformation -$\mathbf{Q}$, the resulting vector points in the direction of the -corresponding eigenvector $\mathbf{q}_i$, since - -$$\mathbf{Q}\mathbf{e}_i = \sum_{j=1}^n [\mathbf{e}_i]_j\mathbf{q}_j = \mathbf{q}_i$$ - -where we have used the matrix-vector product identity from earlier. 
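Before summarizing, here is a small numerical check of this parameterization (a sketch assuming NumPy; the 2x2 matrix and the level $c$ are illustrative choices, not from the original text):

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((2, 2))
A = B.T @ B + 0.5 * np.eye(2)                  # an illustrative positive definite matrix
lam, Q = np.linalg.eigh(A)

c = 3.0
theta = np.linspace(0.0, 2.0 * np.pi, 400)
Z = np.stack([np.cos(theta), np.sin(theta)])   # unit circle; each column is a z-tilde
X = np.sqrt(c) * Q @ np.diag(lam ** -0.5) @ Z  # x = sqrt(c) Q Lambda^{-1/2} z-tilde

f = np.einsum('ij,ik,kj->j', X, A, X)          # x^T A x for every column of X
print(np.allclose(f, c))                       # every mapped point lies on the c-isocontour
print(np.sqrt(c / lam))                        # semi-axis lengths, along the eigenvector directions
```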
- -In summary: the isocontours of -$f(\mathbf{x}) = \mathbf{x}^{\!\top\!}\mathbf{A}\mathbf{x}$ are -ellipsoids such that the axes point in the directions of the -eigenvectors of $\mathbf{A}$, and the radii of these axes are -proportional to the inverse square roots of the corresponding -eigenvalues. - - diff --git a/drafts/chapter_decompositions/trace_determinant.md b/drafts/chapter_decompositions/trace_determinant.md deleted file mode 100644 index 882c968..0000000 --- a/drafts/chapter_decompositions/trace_determinant.md +++ /dev/null @@ -1,51 +0,0 @@ -## Trace - -The **trace** of a square matrix is the sum of its diagonal entries: - -$$\operatorname{tr}(\mathbf{A}) = \sum_{i=1}^n A_{ii}$$ - -The trace has several nice -algebraic properties: - -(i) $\operatorname{tr}(\mathbf{A}+\mathbf{B}) = \operatorname{tr}(\mathbf{A}) + \operatorname{tr}(\mathbf{B})$ - -(ii) $\operatorname{tr}(\alpha\mathbf{A}) = \alpha\operatorname{tr}(\mathbf{A})$ - -(iii) $\operatorname{tr}(\mathbf{A}^{\!\top\!}) = \operatorname{tr}(\mathbf{A})$ - -(iv) $\operatorname{tr}(\mathbf{A}\mathbf{B}\mathbf{C}\mathbf{D}) = \operatorname{tr}(\mathbf{B}\mathbf{C}\mathbf{D}\mathbf{A}) = \operatorname{tr}(\mathbf{C}\mathbf{D}\mathbf{A}\mathbf{B}) = \operatorname{tr}(\mathbf{D}\mathbf{A}\mathbf{B}\mathbf{C})$ - -The first three properties follow readily from the definition. The last -is known as **invariance under cyclic permutations**. Note that the -matrices cannot be reordered arbitrarily, for example -$\operatorname{tr}(\mathbf{A}\mathbf{B}\mathbf{C}\mathbf{D}) \neq \operatorname{tr}(\mathbf{B}\mathbf{A}\mathbf{C}\mathbf{D})$ -in general. Also, there is nothing special about the product of four -matrices -- analogous rules hold for more or fewer matrices. - -Interestingly, the trace of a matrix is equal to the sum of its -eigenvalues (repeated according to multiplicity): - -$$\operatorname{tr}(\mathbf{A}) = \sum_i \lambda_i(\mathbf{A})$$ - -## Determinant - -The **determinant** of a square matrix can be defined in several -different confusing ways, none of which are particularly important for -our purposes; go look at an introductory linear algebra text (or -Wikipedia) if you need a definition. But it's good to know the -properties: - -(i) $\det(\mathbf{I}) = 1$ - -(ii) $\det(\mathbf{A}^{\!\top\!}) = \det(\mathbf{A})$ - -(iii) $\det(\mathbf{A}\mathbf{B}) = \det(\mathbf{A})\det(\mathbf{B})$ - -(iv) $\det(\mathbf{A}^{-1}) = \det(\mathbf{A})^{-1}$ - -(v) $\det(\alpha\mathbf{A}) = \alpha^n \det(\mathbf{A})$ - -Interestingly, the determinant of a matrix is equal to the product of -its eigenvalues (repeated according to multiplicity): - -$$\det(\mathbf{A}) = \prod_i \lambda_i(\mathbf{A})$$ \ No newline at end of file diff --git a/drafts/example_genetics/pca_genetics.md b/drafts/example_genetics/pca_genetics.md index 50763b2..4fadf6e 100644 --- a/drafts/example_genetics/pca_genetics.md +++ b/drafts/example_genetics/pca_genetics.md @@ -37,7 +37,6 @@ An alternative definition of PCA is based on minimizing the sum-of-squares of the For the PCA algorithm we implement the `empirical_covariance` method, which is used to calculate the covariance of the data. We also implement a `PCA` class with `fit`, `transform` and `reverse_transform` methods. ```{code-cell} ipython3 - def empirical_covariance(X): """ Calculates the empirical covariance matrix for a given dataset. 
@@ -56,11 +55,8 @@ def empirical_covariance(X): ``` -+++ {"slideshow": {"slide_type": "subslide"}} - ```{code-cell} ipython3 - class PCA: def __init__(self, k=None): """ @@ -83,9 +79,8 @@ class PCA: """ self.mean, covariance = empirical_covariance(X=X) eig_values, eig_vectors = np.linalg.eigh(covariance) # Compute eigenvalues and eigenvectors - order = np.argsort(eig_values)[::-1] # Get indices of eigenvalues in descending order - self.pc_variances = eig_values[order] # Sort the eigenvalues - self.principal_components = eig_vectors[:, order] # Sort the eigenvectors + self.pc_variances = eig_values[::-1] # the eigenvalues are returned by eigh in ascending order. We want them in descending order (largest first) + self.principal_components = eig_vectors[:, ::-1] # the eigenvectors in same order as eingevalues if self.k is not None: self.pc_variances = self.pc_variances[:self.k] self.principal_components = self.principal_components[:,:self.k] @@ -125,8 +120,6 @@ class PCA: return self.pc_variances ``` -+++ {"slideshow": {"slide_type": "slide"}} - In the example below, we will use the PCA algorithm to reduce the dimensionality of a genetic dataset from the 1000 genomes project [1,2]. [1] Auton, A. et al. A global reference for human genetic variation. Nature 526, 68–74 (2015) @@ -143,11 +136,8 @@ We consider five ancestries in the dataset: - **SAS** - South Asian - **AMR** - Native American - -+++ {"slideshow": {"slide_type": "subslide"}} - - ```{code-cell} ipython3 +:tags: [hide-input] snpreader = Bed('./genetic_data/example2.bed', count_A1=True) data = snpreader.read() print(data.shape) @@ -157,14 +147,6 @@ list1 = data.iid[:,1].tolist() #list with the Sample numbers present in genetic labels = labels[labels.index.isin(list1)] #filter labels DataFrame so it only contains the sampleIDs present in genetic data y = labels.SuperPopulation # EUR, AFR, AMR, EAS, SAS X = data.val[:, ~np.isnan(data.val).any(axis=0)] #load genetic data to X, removing NaN values -``` - - -+++ {"slideshow": {"slide_type": "subslide"}} - - -```{code-cell} ipython3 - pca = PCA() pca.fit(X=X) @@ -178,13 +160,6 @@ for rank in range(5): #more correct: X_pc.shape[1]+1 X_lowrank = pca_lowrank.transform(X) X_reconstruction = pca_lowrank.reverse_transform(X_lowrank) print("L1 reconstruction error for rank %i PCA : %.4E " % (rank, np.absolute(X - X_reconstruction).sum())) -``` - - -+++ {"slideshow": {"slide_type": "subslide"}} - - -```{code-cell} ipython3 fig = plt.figure() plt.plot(X_pc[y=="EUR"][:,0], X_pc[y=="EUR"][:,1],'.', alpha = 0.3) @@ -227,5 +202,4 @@ plt.plot(pca.variance_explained().cumsum() / pca.variance_explained().sum()) plt.xlabel("PC dimension") plt.ylabel("cumulative fraction of variance explained") plt.show() -``` - +``` \ No newline at end of file
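A side note on the `fit` change in the hunk above, which replaces the explicit `argsort` with reversed slices: `np.linalg.eigh` returns eigenvalues in ascending order, so reversing is equivalent to the old descending sort whenever the eigenvalues are distinct. The following sketch (with an arbitrary random dataset, not part of the patch) confirms this:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 6))
cov = np.cov(X, rowvar=False)                  # stand-in for the empirical covariance above

eig_values, eig_vectors = np.linalg.eigh(cov)  # ascending eigenvalues
order = np.argsort(eig_values)[::-1]           # the explicit sort removed by the patch

# For distinct eigenvalues, argsort of an ascending array reversed is just [n-1, ..., 0],
# so both orderings give identical results; with repeated eigenvalues they may differ
# only in how ties are broken.
print(np.allclose(eig_values[::-1], eig_values[order]))
print(np.allclose(eig_vectors[:, ::-1], eig_vectors[:, order]))
```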