The final math lesson of Module 1. Eigendecomposition and SVD are how we find the hidden structure inside a matrix: which directions it stretches, which it collapses, which combinations of columns carry the most signal. They underpin PCA, recommender systems, embeddings, spectral clustering, and image compression.
Lesson 1: a matrix is a table of numbers. Lesson 2: a matrix is a transformation (rotate, stretch, shear). Lesson 3: gradients tell us how to tune parameters. This lesson asks a single sharp question: given a matrix, what are its natural axes? The directions it stretches cleanly, without rotating. That question has an exact answer (eigenvectors/eigenvalues), a more general answer that works for any matrix (SVD), and a direct application (PCA).
Most vectors change direction when you apply a matrix to them. A typical matrix rotates, skews, and scales all at once; the output arrow points somewhere new. That's fine, but it makes matrices hard to reason about: "what does $A$ do?" has no short answer if every input vector has a different story.
But for almost every square matrix, there are a handful of special input vectors where $A$ does not rotate: it only stretches them along their own line. Same direction, scaled. Those directions are the eigenvectors. The stretch factor along each one is its eigenvalue, written $\lambda$ (lambda).
In words: "$A$ applied to $v$ equals $v$ scaled by $\lambda$." This is a restrictive condition: for almost every vector $v$ you pick, $Av$ will not be a scalar multiple of $v$. When you do find such a $v$, it identifies one of the directions $A$ treats as a pure axis.
Geometric: An eigenvector is a direction that $A$ doesn't rotate; it only stretches (or flips, or collapses) it. The eigenvalue is the stretch factor.
Algebraic: An eigenvector is a vector $v$ that satisfies $Av = \lambda v$ for some scalar $\lambda$.
Before any textbook machinery, find eigenvectors by hand. Take the matrix

$$A = \begin{bmatrix} 3 & 1 \\ 0 & 2 \end{bmatrix}$$
Try $v = [1, 0]$. Compute $Av$:

$$Av = \begin{bmatrix} 3 & 1 \\ 0 & 2 \end{bmatrix} \begin{bmatrix} 1 \\ 0 \end{bmatrix} = \begin{bmatrix} 3 \\ 0 \end{bmatrix} = 3 \begin{bmatrix} 1 \\ 0 \end{bmatrix}$$
The output is a scalar multiple of the input, specifically $3$ times it. So $[1, 0]$ is an eigenvector with eigenvalue $3$. $[1, 0]$ happens to line up with how this particular $A$ acts.
Try $v = [1, 1]$:

$$Av = \begin{bmatrix} 3 & 1 \\ 0 & 2 \end{bmatrix} \begin{bmatrix} 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 4 \\ 2 \end{bmatrix}$$
Is $[4, 2]$ a scalar multiple of $[1, 1]$? No: $[4, 2]$ points in a different direction. So $[1, 1]$ is not an eigenvector of this $A$. Eigenvectors are useful because they are rare; most vectors are not eigenvectors.
Try $v = [1, -1]$:

$$Av = \begin{bmatrix} 3 & 1 \\ 0 & 2 \end{bmatrix} \begin{bmatrix} 1 \\ -1 \end{bmatrix} = \begin{bmatrix} 2 \\ -2 \end{bmatrix} = 2 \begin{bmatrix} 1 \\ -1 \end{bmatrix}$$
$[1, -1]$ is an eigenvector with eigenvalue $2$. This $2 \times 2$ matrix has two eigenvector directions: $[1, 0]$ with eigenvalue $3$, and $[1, -1]$ with eigenvalue $2$ (any nonzero scalar multiple of an eigenvector is also an eigenvector). Those two directions are the natural axes of $A$. Every other vector gets handled as a combination of these.
Guessing works for toy matrices but doesn't scale. The standard way to find eigenvalues is to solve the characteristic polynomial, the equation $\det(A - \lambda I) = 0$. You do not need to memorize this derivation; you need to know it exists, what it produces, and that NumPy handles it for you. For the same $A$:

$$A - \lambda I = \begin{bmatrix} 3 - \lambda & 1 \\ 0 & 2 - \lambda \end{bmatrix}$$
The determinant of a $2 \times 2$ matrix $\begin{bmatrix} a & b \\ c & d \end{bmatrix}$ is $ad - bc$. So:

$$\det(A - \lambda I) = (3 - \lambda)(2 - \lambda) - (1)(0) = (3 - \lambda)(2 - \lambda)$$
Set it to zero: $(3 - \lambda)(2 - \lambda) = 0$, giving $\lambda = 3$ or $\lambda = 2$. These are the eigenvalues we found by inspection. This is what happens under the hood when a library returns eigenvalues: it solves a polynomial (iteratively, for bigger matrices).
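As a quick illustration (using NumPy's legacy polynomial helpers, not how production eigensolvers work), np.poly returns the characteristic polynomial's coefficients and np.roots solves it:

import numpy as np

A = np.array([[3, 1],
              [0, 2]])
np.poly(A)            # array([ 1., -5.,  6.]): lambda^2 - 5*lambda + 6
np.roots(np.poly(A))  # array([3., 2.]): the eigenvalues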
Given each eigenvalue, find its eigenvector by solving $(A - \lambda I) v = 0$ for a nonzero $v$. For $\lambda = 3$: $(A - 3I) v = 0$ becomes $\begin{bmatrix} 0 & 1 \\ 0 & -1 \end{bmatrix} v = 0$, which forces $v_2 = 0$, leaving any multiple of $[1, 0]$. Same answer, done mechanically.
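The same null-space step in code, a sketch assuming SciPy is available (scipy.linalg.null_space returns an orthonormal basis for a matrix's null space):

import numpy as np
from scipy.linalg import null_space

A = np.array([[3.0, 1.0],
              [0.0, 2.0]])
# Eigenvectors for lambda = 3 span the null space of (A - 3I)
null_space(A - 3.0 * np.eye(2))   # one column proportional to [1, 0] (sign may flip)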
Computing eigenvalues by solving characteristic polynomials is numerically unstable for anything larger than a toy example, and slow. Real libraries use iterative algorithms (QR iteration, Lanczos, etc.) that are both faster and more accurate. Your job is to know what eigenvalues are, read the output of np.linalg.eig, and interpret it.
import numpy as np
A = np.array([[3, 1],
              [0, 2]])
eigvals, eigvecs = np.linalg.eig(A)
eigvals
# array([3., 2.])
eigvecs # each COLUMN is an eigenvector (normalized to length 1)
# array([[ 1. , 0.70710678],
# [ 0. , -0.70710678]])
# Verify: A @ v should equal lambda * v for each pair
v0 = eigvecs[:, 0]; l0 = eigvals[0] # (1, 0), lambda = 3
A @ v0, l0 * v0 # both equal array([3., 0.])
v1 = eigvecs[:, 1]; l1 = eigvals[1] # (1/√2, -1/√2), lambda = 2
A @ v1, l1 * v1 # both equal array([1.414..., -1.414...])
Two easy-to-miss details about np.linalg.eig's output. First: eigenvectors come back as columns of the eigvecs array; that is, eigvecs[:, i] is the eigenvector for eigvals[i]. Not rows. Second: they are normalized to length 1, so the output shows $[0.707..., -0.707...]$ instead of $[1, -1]$: same direction, unit length. See the numpy.linalg.eig docs for the full contract.
You can get very far in ML without ever hand-computing an eigenvalue. Eigenvalues and eigenvectors are the mathematical engine underneath techniques you will use by name, repeatedly: PCA, spectral clustering, and the low-rank factorizations behind recommender systems.
The list of eigenvalues of a matrix (its spectrum) tells you how much signal lives along each natural axis. Big eigenvalue = strong signal. Tiny eigenvalue = probably noise. "Drop the small ones to compress and denoise" comes from this. Section 6 makes it concrete.
Quick check: spectrum (A) is [90, 8, 1, 0.5, 0.3, 0.1, ...] and spectrum (B) is [12, 11, 10, 10, 9, 9, ...]. Which one is a better candidate for a 2-component PCA reduction?

Queries like "give me the top 10 revenue-driving products" drop the long tail. The eigenvalue move generalizes this: rank the axes of your data by how much variance they carry, keep the top ones, discard the rest. Instead of rows in a ranked list, the "things" being ranked are directions in feature space. PCA is a SELECT TOP-k on the eigenvalues of your covariance matrix.
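To make the quick check concrete, here is the arithmetic on those two hypothetical spectra:

import numpy as np

spectrum_a = np.array([90, 8, 1, 0.5, 0.3, 0.1])
spectrum_b = np.array([12, 11, 10, 10, 9, 9])
spectrum_a[:2].sum() / spectrum_a.sum()   # ~0.98: two axes carry almost everything
spectrum_b[:2].sum() / spectrum_b.sum()   # ~0.38: two axes discard most of the signal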
Eigendecomposition has two limits that motivate SVD. First, it only applies to square matrices, and the data matrix $X$ in ML is almost always $n \times d$ ($n$ samples, $d$ features, $n \neq d$), so eigendecomposition of $X$ itself does not type-check. Second, even some square matrices are not diagonalizable. SVD is the generalization that works for rectangles, with no diagonalizability caveats.
SVD factors any matrix $A$, of any shape $m \times n$, into three matrices:

$$A = U \Sigma V^T$$

The contract: $U$ is $m \times m$ with orthonormal columns (the left singular vectors), $\Sigma$ is $m \times n$ with the non-negative singular values on its diagonal and zeros everywhere else, and $V$ is $n \times n$ with orthonormal columns (the right singular vectors).
Every linear transformation is: rotate, stretch along axes, rotate again. For an input vector $v$: $V^T$ spins it into a coordinate system where $A$ becomes pure axis-aligned stretching, $\Sigma$ does the stretching, and $U$ spins the result into the output coordinate system. Those three steps cover every matrix.
A generic matrix is an unstructured block of numbers. Its SVD is three interpretable pieces: two rotations and a scaling. The singular vectors (columns of $U$ and $V$) identify the input and output axes that $A$ treats cleanly. The singular values quantify the stretching along each. Every matrix has an SVD, with no shape or diagonalizability caveats.
Try the rotation-stretch-rotation picture visually. The matrix visualizer animates how any $2 \times 2$ matrix transforms space; set your own matrix and watch the unit circle become an ellipse whose axes are the singular directions.
Decompose the matrix

$$A = \begin{bmatrix} 3 & 0 \\ 4 & 0 \\ 0 & 5 \end{bmatrix}$$
This is $3 \times 2$, so $U$ will be $3 \times 3$, $\Sigma$ will be $3 \times 2$, and $V$ will be $2 \times 2$. Column 1 is $[3, 4, 0]$ with length $\sqrt{9 + 16} = 5$, and column 2 is $[0, 0, 5]$ with length $5$. The two columns are perpendicular. So $A$ maps the two input axes to two orthogonal output directions, each stretched by $5$. Two singular values, both equal to $5$.
import numpy as np
A = np.array([[3, 0],
              [4, 0],
              [0, 5]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)
s         # array([5., 5.]): two singular values, both 5
U.shape   # (3, 2)
Vt.shape  # (2, 2)
# Reconstruct A from the factors
A_reconstructed = U @ np.diag(s) @ Vt
np.allclose(A, A_reconstructed) # True
Two details in this code. First: full_matrices=False returns the "economy" SVD, skipping the padding columns of $U$ that would multiply against zero rows of $\Sigma$. In ML the economy version is almost always what you want: cheaper, same information. Second: s comes back as a 1D array of just the diagonal values, not a full $\Sigma$ matrix; rebuild the matrix with np.diag(s) when needed. See the numpy.linalg.svd docs.
A = np.random.randn(5, 3)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U.shape, s.shape, Vt.shape # ((5, 3), (3,), (3, 3))
# s is ALWAYS in non-increasing order
np.all(s[:-1] >= s[1:]) # True
# Columns of U are left singular vectors: orthonormal axes in output space
U.T @ U    # ≈ identity (U has orthonormal columns)
# Rows of Vt are right singular vectors: orthonormal axes in input space
Vt @ Vt.T  # ≈ identity
Singular values always come back sorted in descending order; libraries guarantee this. It is what makes "keep the top $k$" a one-liner.
SVD has a direct application: low-rank approximation. Given a big matrix $A$, the task is to find the closest possible matrix of a chosen low rank $k$, and the SVD hands you the answer.
Keep only the top-$k$ singular values and their corresponding columns of $U$ and rows of $V^T$:

$$A_k = U_k \Sigma_k V_k^T = \sum_{i=1}^{k} \sigma_i u_i v_i^T$$
This $A_k$ has rank at most $k$, and the Eckart-Young theorem proves it is the closest rank-$k$ matrix to $A$ in Frobenius norm. No other rank-$k$ matrix is a better approximation. Applications: image compression (store the top-$k$ factors instead of every pixel), recommender systems (a ratings matrix is approximately low-rank because a few latent tastes drive most preferences), and denoising (the smallest singular values are mostly noise).
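A minimal sketch of the truncation on a synthetic matrix with planted rank-3 structure (the data and shapes here are illustrative):

import numpy as np

rng = np.random.default_rng(0)
# Rank-3 signal plus a little noise
A = rng.standard_normal((100, 3)) @ rng.standard_normal((3, 80)) \
    + 0.01 * rng.standard_normal((100, 80))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 3
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # best rank-3 approximation of A

# Eckart-Young check: Frobenius error equals the norm of the dropped singular values
np.allclose(np.linalg.norm(A - A_k), np.linalg.norm(s[k:]))   # True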
When a paper says "we took the top-$k$ components" or "the rank-$k$ approximation" or "we projected onto the principal subspace," it means the authors did SVD or eigendecomposition and kept the $k$ biggest. Interpretable components (topics, genres, user tastes) show up when real data has genuine low-rank structure: a small number of latent factors explain most of what's going on. Many classical ML techniques are variations on this theme.
PCA in one line: PCA is SVD applied to your centered data matrix. Everything else (the "principal components," the "explained variance," the scikit-learn object) is machinery around that one idea.
Given a data matrix $X$ of shape $(n, d)$, with $n$ samples and $d$ features, PCA finds $d$ new axes, called principal components, with three properties:

- Each is a linear combination of the original features (a direction in feature space).
- They are mutually orthogonal, so no two of them carry redundant information.
- They are ordered by variance: the first captures as much variance as any single direction can, and each subsequent one captures as much of the remaining variance as possible.
PCA is a projection that collapses many correlated columns into a handful of orthogonal ones that, together, preserve most of the variance. In dbt terms: "collapse 100 correlated columns into 5 orthogonal summary columns that explain 90% of the variance," except the 5 columns are chosen optimally by the math, not by a human guessing.
Before PCA, subtract the mean of each feature from every row. This matters because PCA looks for directions of maximum variance, and variance is only well-defined around the mean. Uncentered features leave the first "principal component" pointing roughly at wherever the data cloud sits in space, which is useless. Centering forces PCA to find directions of spread.
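A small demonstration of that failure mode on synthetic data (the numbers are illustrative): without centering, the top singular direction tracks the cloud's offset from the origin instead of its spread.

import numpy as np

rng = np.random.default_rng(1)
# Cloud spread mostly along the x-axis, sitting far from the origin
X = rng.standard_normal((500, 2)) * np.array([3.0, 0.3]) + 100.0

_, _, Vt_raw = np.linalg.svd(X, full_matrices=False)
Xc = X - X.mean(axis=0)
_, _, Vt_centered = np.linalg.svd(Xc, full_matrices=False)

Vt_raw[0]        # ~[0.707, 0.707] (up to sign): points at the cloud, not along its spread
Vt_centered[0]   # ~[1, 0] (up to sign): the genuine direction of maximum variance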
Also standardize each feature to unit variance (divide by its standard deviation) when features have wildly different scales (dollars vs. kilometers vs. counts). Without this, the feature with the biggest raw numbers hijacks the first principal component regardless of its actual informational value. This is one of the most common PCA mistakes.
Quick check: your features are salary (values 30,000-200,000) and age (20-65). Before PCA, what preprocessing do you need?

Textbooks often describe PCA a second way: the principal components are the eigenvectors of the covariance matrix $C = \frac{1}{n-1} X_c^T X_c$ (where $X_c$ is the centered data), and the variances along each component are the eigenvalues. This framing and the SVD framing produce the same answer: SVD of $X_c$ gives the same directions, and the squared singular values equal the eigenvalues of $C$ (up to the $n-1$ scaling). SVD is preferred in practice because it is numerically more stable than forming $C$ explicitly.
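A quick numerical check of that equivalence (random data; np.linalg.eigvalsh is NumPy's eigensolver for symmetric matrices):

import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((200, 4))
Xc = X - X.mean(axis=0)

# Route 1: eigenvalues of the covariance matrix, sorted descending
C = Xc.T @ Xc / (len(Xc) - 1)
eig = np.sort(np.linalg.eigvalsh(C))[::-1]

# Route 2: squared singular values of the centered data, same scaling
s = np.linalg.svd(Xc, compute_uv=False)
np.allclose(eig, s**2 / (len(Xc) - 1))   # True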
Covariance is treated properly in Module 2's statistics lessons. For now: covariance measures how features vary together, and PCA finds the directions where that joint variation is largest.
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
iris = load_iris()
X, y = iris.data, iris.target # X.shape == (150, 4)
X_scaled = StandardScaler().fit_transform(X) # center + unit variance each column
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled) # shape (150, 2) β two principal components
pca.explained_variance_ratio_
# array([0.7296, 0.2285])
pca.explained_variance_ratio_.sum()
# 0.9581: first 2 PCs capture ~96% of the variance from 4 features
Four correlated flower measurements compressed to two numbers per flower while keeping 96% of the information, and the two components are orthogonal, so the new features are not redundant. See the sklearn PCA docs for the full API.
Quick check: explained_variance_ratio_ = [0.72, 0.19, 0.06, 0.02, 0.01]. Is reducing to 2 components a reasonable choice?

explained_variance_ratio_ answers "did PCA work?" It is a vector of fractions that sum to at most 1; each entry is the fraction of variance captured by that component. If the top 5 components capture 95%, the compression is useful. If the first 20 components together only capture 30%, the data does not have the low-rank structure PCA needs; reach for something else (kernel PCA, UMAP, t-SNE). Always inspect this attribute before trusting a PCA reduction.
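The check is a one-liner. Using the hypothetical ratios from the quick check above:

import numpy as np

evr = np.array([0.72, 0.19, 0.06, 0.02, 0.01])
np.cumsum(evr)   # array([0.72, 0.91, 0.97, 0.99, 1.  ])
# Two components keep 91% of the variance: a reasonable reduction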
The plot below shows X_2d for the iris dataset, colored by species. Drag the slider to see how PCA projections with different numbers of components capture cumulative variance.
| Concept | Lives on | What it tells you |
|---|---|---|
| Eigenvector $v$ | Square matrix $A$ | Direction that $A$ only stretches; no rotation. |
| Eigenvalue $\lambda$ | Paired with an eigenvector | How much stretch along that direction. Negative = flipped. Zero = collapsed. |
| Covariance matrix | Dataset $X$ | Symmetric $d \times d$ matrix; its eigenvectors are the principal components. |
| Singular value $\sigma_i$ | Any matrix (SVD) | "Generalized eigenvalue." Strength of the $i$-th axis. Always $\geq 0$, sorted descending. |
| Singular vector | Any matrix (SVD) | Columns of $U$ (left) and $V$ (right): orthonormal axes in output/input space. |
| Rank-$k$ approximation | Any matrix | Best summary of $A$ using only $k$ axes. Eckart-Young guarantees optimality. |
| Principal Component | Centered dataset | Right singular vector of $X_c$; direction of highest remaining variance. |
| PCA | Centered dataset | Project data onto top-$k$ principal components. Uses SVD under the hood. |
Because $Av = \lambda v$ requires $Av$ and $\lambda v$ to live in the same space, with the same dimensionality. That only happens when $A$ maps $\mathbb{R}^n \to \mathbb{R}^n$, i.e., when $A$ is $n \times n$. If $A$ were $m \times n$, then $Av$ has length $m$ and $v$ has length $n$; setting them equal is not a well-formed equation. SVD sidesteps this by allowing the input and output axes to live in different spaces.
Quick check: your data matrix X has shape (300, 50). Which function decomposes it?

SVD never compares $Av$ to $v$ directly. It finds the best axes in the input space ($V$) and the best axes in the output space ($U$) so that $A$, in those new coordinates, is pure diagonal stretching ($\Sigma$). There is no requirement that input and output have the same dimension.
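A sketch of the shape logic, reusing the (300, 50) shape from the quick check:

import numpy as np

X = np.random.default_rng(3).standard_normal((300, 50))
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U.shape, s.shape, Vt.shape   # ((300, 50), (50,), (50, 50))
# np.linalg.eig(X) would raise LinAlgError: eig needs a square matrix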
Quick check: when you report a PCA, should you quote the raw eigenvalues or explained_variance_ratio_?

explained_variance_ratio_, always. The eigenvalues (or squared singular values) are the raw variances; the ratio is what fraction of total variance each component captures. That fraction is scale-free and answers "am I keeping enough of the information?", the question that matters in practice.
Quick check: explained_variance_ratio_ = [0.60, 0.25, 0.07, 0.03, 0.02, 0.02, 0.01]. How much variance is captured by the first 4 components?

sklearn's PCA does center for you (no need to call StandardScaler just for centering; it happens inside fit_transform). Standardizing, giving every feature equal variance, is still your job. Without it, a feature measured in dollars will dominate one measured in years purely due to scale, and the principal components will be garbage.
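A quick check that the centering really happens inside fit (synthetic data; shapes illustrative):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
X = rng.standard_normal((100, 3)) + 50.0   # cloud far from the origin

pca_raw = PCA(n_components=2).fit(X)
pca_pre = PCA(n_components=2).fit(X - X.mean(axis=0))
# Same components either way: PCA subtracts the mean internally
np.allclose(np.abs(pca_raw.components_), np.abs(pca_pre.components_))   # True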
Yes, in exact arithmetic. In practice, floating-point noise means tiny-but-nonzero singular values appear where there "should" be zeros. The working definition of rank is therefore "the number of singular values above some small numerical threshold," which is exactly what np.linalg.matrix_rank implements.
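A sketch on a matrix built to be exactly rank 2 (the construction is illustrative):

import numpy as np

rng = np.random.default_rng(5)
# Sum of two outer products: rank 2 in exact arithmetic
A = np.outer(rng.standard_normal(6), rng.standard_normal(5)) \
  + np.outer(rng.standard_normal(6), rng.standard_normal(5))

np.linalg.svd(A, compute_uv=False)   # two sizable values, then ~1e-16 dust
np.linalg.matrix_rank(A)             # 2: counts singular values above the tolerance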
Quick check: np.linalg.svd returns singular values s = [5.0, 0.3, 1e-15, 2e-16]. What is the numerical rank of the matrix?

Answer these without scrolling back up. Each question is checkable; wrong answers reveal the full reasoning so you know what to re-read.

- np.linalg.eig(A) returns eigvals = [2, 5] and eigvecs = [[1, 0], [0, 1]]. Which eigenvector is paired with eigenvalue 5? Remember: the first eigenvector lives at eigvecs[:, 0] and the first eigenvalue at eigvals[0]. Output should be True.
- Build a rank-$k$ approximation from U[:, :k], np.diag(s[:k]), and Vt[:k, :], chained with @. The final line checks Eckart-Young: the Frobenius error equals the norm of the discarded singular values. Output should be True.
- Explain explained_variance_ratio_ in sklearn.

Module 1 gives you the vocabulary to read machine-learning papers and library docs without getting lost in notation.
The rest of this curriculum assumes that vocabulary. From here on, when a lesson says "we take the gradient of the loss" or "we project onto the first two PCs," those phrases carry real meaning.
Lock the vocabulary in with the timed Linear Algebra Drills, which cover all four lessons. Then tackle Gradient Descent From Scratch (a Jupyter notebook), the first end-to-end "make a thing learn" exercise. Complete both before starting Module 2.
Next: statistics and probability. Computing a model prediction is not the same as knowing when to trust one, how uncertain it is, or when the data is lying. Module 2 builds that toolkit.