The final math lesson of Module 1. Eigendecomposition and SVD are how we find the hidden structure inside a matrix: which directions it stretches, which it collapses, which combinations of columns carry the most signal. They underpin PCA, recommender systems, embeddings, spectral clustering, and image compression.
Lesson 1: a matrix is a table of numbers. Lesson 2: a matrix is a transformation (rotate, stretch, shear). Lesson 3: gradients tell us how to tune parameters. This lesson asks a single sharp question: given a matrix, what are its natural axes? The directions it stretches cleanly, without rotating. That question has an exact answer (eigenvectors/eigenvalues), a more general answer that works for any matrix (SVD), and a direct application (PCA).
Most vectors change direction when you apply a matrix to them. A typical matrix rotates, skews, and scales all at once; the output arrow points somewhere new. That's fine, but it makes matrices hard to reason about: "what does $A$ do?" has no short answer if every input vector has a different story.
But for almost every square matrix, there are a handful of special input vectors where $A$ does not rotate: it only stretches them along their own line. Same direction, scaled. Those directions are the eigenvectors. The stretch factor along each one is its eigenvalue, written $\lambda$ (lambda).
In words: "$A$ applied to $v$ equals $v$ scaled by $\lambda$." This is a restrictive condition: for almost every vector $v$ you pick, $Av$ will not be a scalar multiple of $v$. When you do find such a $v$, it identifies one of the directions $A$ treats as a pure axis.
Geometric: An eigenvector is a direction that $A$ doesn't rotate; it only stretches (or flips, or collapses) it. The eigenvalue is the stretch factor.
Algebraic: An eigenvector is a vector $v$ that satisfies $Av = \lambda v$ for some scalar $\lambda$.
Before any textbook machinery, find eigenvectors by hand. Take the matrix

$$A = \begin{bmatrix} 3 & 1 \\ 0 & 2 \end{bmatrix}$$
Try $v = [1, 0]$. Compute $Av$:

$$Av = \begin{bmatrix} 3 & 1 \\ 0 & 2 \end{bmatrix} \begin{bmatrix} 1 \\ 0 \end{bmatrix} = \begin{bmatrix} 3 \\ 0 \end{bmatrix} = 3 \begin{bmatrix} 1 \\ 0 \end{bmatrix}$$
The output is a scalar multiple of the input, specifically $3$ times it. So $[1, 0]$ is an eigenvector with eigenvalue $3$. $[1, 0]$ happens to line up with how this particular $A$ acts.
Try $v = [1, 1]$:

$$Av = \begin{bmatrix} 3 & 1 \\ 0 & 2 \end{bmatrix} \begin{bmatrix} 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 4 \\ 2 \end{bmatrix}$$
Is $[4, 2]$ a scalar multiple of $[1, 1]$? No: $[4, 2]$ points in a different direction. So $[1, 1]$ is not an eigenvector of this $A$. Eigenvectors are useful because they are rare; most vectors are not eigenvectors.
Try $v = [1, -1]$:

$$Av = \begin{bmatrix} 3 & 1 \\ 0 & 2 \end{bmatrix} \begin{bmatrix} 1 \\ -1 \end{bmatrix} = \begin{bmatrix} 2 \\ -2 \end{bmatrix} = 2 \begin{bmatrix} 1 \\ -1 \end{bmatrix}$$
$[1, -1]$ is an eigenvector with eigenvalue $2$. This $2 \times 2$ matrix has two eigenvector directions: $[1, 0]$ with eigenvalue $3$, and $[1, -1]$ with eigenvalue $2$ (any nonzero scalar multiple of an eigenvector is also an eigenvector). Those two directions are the natural axes of $A$. Every other vector gets handled as a combination of these.
Guessing works for toy matrices but doesn't scale. The standard way to find eigenvalues is to solve the characteristic polynomial, the equation $\det(A - \lambda I) = 0$. You do not need to memorize this derivation; you need to know it exists, what it produces, and that NumPy handles it for you. For the same $A$:

$$A - \lambda I = \begin{bmatrix} 3 - \lambda & 1 \\ 0 & 2 - \lambda \end{bmatrix}$$
The determinant of a $2 \times 2$ matrix $\begin{bmatrix} a & b \\ c & d \end{bmatrix}$ is $ad - bc$. So:

$$\det(A - \lambda I) = (3 - \lambda)(2 - \lambda) - (1)(0) = (3 - \lambda)(2 - \lambda)$$
Set it to zero: $(3 - \lambda)(2 - \lambda) = 0$, giving $\lambda = 3$ or $\lambda = 2$. These are the eigenvalues we found by inspection. This is what happens under the hood when a library returns eigenvalues: it solves a polynomial (iteratively, for bigger matrices).
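As a quick illustration (using NumPy's legacy polynomial helpers, not how production eigensolvers work), np.poly returns the characteristic polynomial's coefficients and np.roots solves it:

import numpy as np

A = np.array([[3, 1],
              [0, 2]])
np.poly(A)            # array([ 1., -5.,  6.]): lambda^2 - 5*lambda + 6
np.roots(np.poly(A))  # array([3., 2.]): the eigenvalues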
Given each eigenvalue, find its eigenvector by solving $(A - \lambda I) v = 0$ for a nonzero $v$. For $\lambda = 3$: $(A - 3I) v = 0$ becomes $\begin{bmatrix} 0 & 1 \\ 0 & -1 \end{bmatrix} v = 0$, which forces $v_2 = 0$, leaving any multiple of $[1, 0]$. Same answer, done mechanically.
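The same null-space step in code, a sketch assuming SciPy is available (scipy.linalg.null_space returns an orthonormal basis for a matrix's null space):

import numpy as np
from scipy.linalg import null_space

A = np.array([[3.0, 1.0],
              [0.0, 2.0]])
# Eigenvectors for lambda = 3 span the null space of (A - 3I)
null_space(A - 3.0 * np.eye(2))   # one column proportional to [1, 0] (sign may flip)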
Computing eigenvalues by solving characteristic polynomials is numerically unstable for anything larger than a toy example, and slow. Real libraries use iterative algorithms (QR iteration, Lanczos, etc.) that are both faster and more accurate. Your job is to know what eigenvalues are, read the output of np.linalg.eig, and interpret it.
import numpy as np
A = np.array([[3, 1],
              [0, 2]])
eigvals, eigvecs = np.linalg.eig(A)
eigvals
# array([3., 2.])
eigvecs # each COLUMN is an eigenvector (normalized to length 1)
# array([[ 1. , 0.70710678],
# [ 0. , -0.70710678]])
# Verify: A @ v should equal lambda * v for each pair
v0 = eigvecs[:, 0]; l0 = eigvals[0] # (1, 0), lambda = 3
A @ v0, l0 * v0 # both equal array([3., 0.])
v1 = eigvecs[:, 1]; l1 = eigvals[1] # (1/√2, -1/√2), lambda = 2
A @ v1, l1 * v1 # both equal array([1.414..., -1.414...])
Two easy-to-miss details about np.linalg.eig's output. First: eigenvectors come back as columns of the eigvecs array; that is, eigvecs[:, i] is the eigenvector for eigvals[i]. Not rows. Second: they are normalized to length 1, so the output shows $[0.707..., -0.707...]$ instead of $[1, -1]$: same direction, unit length. See the numpy.linalg.eig docs for the full contract.
You can get very far in ML without ever hand-computing an eigenvalue. Eigenvalues and eigenvectors are the mathematical engine underneath techniques you will use by name, repeatedly: PCA, spectral clustering, and the low-rank factorizations behind recommender systems.
The list of eigenvalues of a matrix (its spectrum) tells you how much signal lives along each natural axis. Big eigenvalue = strong signal. Tiny eigenvalue = probably noise. "Drop the small ones to compress and denoise" comes from this. Section 6 makes it concrete.
Quick check: spectrum (A) is [90, 8, 1, 0.5, 0.3, 0.1, ...] and spectrum (B) is [12, 11, 10, 10, 9, 9, ...]. Which one is a better candidate for a 2-component PCA reduction?

Queries like "give me the top 10 revenue-driving products" drop the long tail. The eigenvalue move generalizes this: rank the axes of your data by how much variance they carry, keep the top ones, discard the rest. Instead of rows in a ranked list, the "things" being ranked are directions in feature space. PCA is a SELECT TOP-k on the eigenvalues of your covariance matrix.
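To make the quick check concrete, here is the arithmetic on those two hypothetical spectra:

import numpy as np

spectrum_a = np.array([90, 8, 1, 0.5, 0.3, 0.1])
spectrum_b = np.array([12, 11, 10, 10, 9, 9])
spectrum_a[:2].sum() / spectrum_a.sum()   # ~0.98: two axes carry almost everything
spectrum_b[:2].sum() / spectrum_b.sum()   # ~0.38: two axes discard most of the signal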
Eigendecomposition has two limits that motivate SVD. First, it only applies to square matrices, and the data matrix $X$ in ML is almost always $n \times d$ ($n$ samples, $d$ features, $n \neq d$), so eigendecomposition of $X$ itself does not type-check. Second, even some square matrices are not diagonalizable. SVD is the generalization that works for rectangles, with no diagonalizability caveats.
SVD factors any matrix $A$, of any shape $m \times n$, into three matrices:

$$A = U \Sigma V^T$$

The contract: $U$ is $m \times m$ with orthonormal columns (the left singular vectors), $\Sigma$ is $m \times n$ with the non-negative singular values on its diagonal and zeros everywhere else, and $V$ is $n \times n$ with orthonormal columns (the right singular vectors).
Every linear transformation is: rotate, stretch along axes, rotate again. For an input vector $v$: $V^T$ spins it into a coordinate system where $A$ becomes pure axis-aligned stretching, $\Sigma$ does the stretching, and $U$ spins the result into the output coordinate system. Those three steps cover every matrix.
A generic matrix is an unstructured block of numbers. Its SVD is three interpretable pieces: two rotations and a scaling. The singular vectors (columns of $U$ and $V$) identify the input and output axes that $A$ treats cleanly. The singular values quantify the stretching along each. Every matrix has an SVD, with no shape or diagonalizability caveats.
Try the rotation-stretch-rotation picture visually. The matrix visualizer animates how any $2 \times 2$ matrix transforms space; set your own matrix and watch the unit circle become an ellipse whose axes are the singular directions.
Decompose the matrix

$$A = \begin{bmatrix} 3 & 0 \\ 4 & 0 \\ 0 & 5 \end{bmatrix}$$
This is $3 \times 2$, so $U$ will be $3 \times 3$, $\Sigma$ will be $3 \times 2$, and $V$ will be $2 \times 2$. Column 1 is $[3, 4, 0]$ with length $\sqrt{9 + 16} = 5$, and column 2 is $[0, 0, 5]$ with length $5$. The two columns are perpendicular. So $A$ maps the two input axes to two orthogonal output directions, each stretched by $5$. Two singular values, both equal to $5$.
import numpy as np
A = np.array([[3, 0],
              [4, 0],
              [0, 5]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)
s         # array([5., 5.]): two singular values, both 5
U.shape   # (3, 2)
Vt.shape  # (2, 2)
# Reconstruct A from the factors
A_reconstructed = U @ np.diag(s) @ Vt
np.allclose(A, A_reconstructed) # True
Two details in this code. First: full_matrices=False returns the "economy" SVD, skipping the padding columns of $U$ that would multiply against zero rows of $\Sigma$. In ML the economy version is almost always what you want: cheaper, same information. Second: s comes back as a 1D array of just the diagonal values, not a full $\Sigma$ matrix; rebuild the matrix with np.diag(s) when needed. See the numpy.linalg.svd docs.
A = np.random.randn(5, 3)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U.shape, s.shape, Vt.shape # ((5, 3), (3,), (3, 3))
# s is ALWAYS in non-increasing order
np.all(s[:-1] >= s[1:]) # True
# Columns of U are left singular vectors: orthonormal axes in output space
U.T @ U    # ≈ identity (U has orthonormal columns)
# Rows of Vt are right singular vectors: orthonormal axes in input space
Vt @ Vt.T  # ≈ identity
Singular values always come back sorted in descending order; libraries guarantee this. It is what makes "keep the top $k$" a one-liner.
SVD has a direct application: low-rank approximation. Given a big matrix $A$, the task is to find the closest possible matrix of a chosen low rank $k$, and the SVD hands you the answer.
Keep only the top-$k$ singular values and their corresponding columns of $U$ and rows of $V^T$:

$$A_k = U_k \Sigma_k V_k^T = \sum_{i=1}^{k} \sigma_i u_i v_i^T$$
This $A_k$ has rank at most $k$, and the Eckart-Young theorem proves it is the closest rank-$k$ matrix to $A$ in Frobenius norm. No other rank-$k$ matrix is a better approximation. Applications: image compression (store the top-$k$ factors instead of every pixel), recommender systems (a ratings matrix is approximately low-rank because a few latent tastes drive most preferences), and denoising (the smallest singular values are mostly noise).
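A minimal sketch of the truncation on a synthetic matrix with planted rank-3 structure (the data and shapes here are illustrative):

import numpy as np

rng = np.random.default_rng(0)
# Rank-3 signal plus a little noise
A = rng.standard_normal((100, 3)) @ rng.standard_normal((3, 80)) \
    + 0.01 * rng.standard_normal((100, 80))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 3
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # best rank-3 approximation of A

# Eckart-Young check: Frobenius error equals the norm of the dropped singular values
np.allclose(np.linalg.norm(A - A_k), np.linalg.norm(s[k:]))   # True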
When a paper says "we took the top-$k$ components" or "the rank-$k$ approximation" or "we projected onto the principal subspace," it means the authors did SVD or eigendecomposition and kept the $k$ biggest. Interpretable components (topics, genres, user tastes) show up when real data has genuine low-rank structure: a small number of latent factors explain most of what's going on. Many classical ML techniques are variations on this theme.
PCA in one line: PCA is SVD applied to your centered data matrix. Everything else (the "principal components," the "explained variance," the scikit-learn object) is machinery around that one idea.
Given a data matrix $X$ of shape $(n, d)$, with $n$ samples and $d$ features, PCA finds $d$ new axes, called principal components, with three properties:

- Each is a linear combination of the original features (a direction in feature space).
- They are mutually orthogonal, so no two of them carry redundant information.
- They are ordered by variance: the first captures as much variance as any single direction can, and each subsequent one captures as much of the remaining variance as possible.
PCA is a projection that collapses many correlated columns into a handful of orthogonal ones that, together, preserve most of the variance. In dbt terms: "collapse 100 correlated columns into 5 orthogonal summary columns that explain 90% of the variance," except the 5 columns are chosen optimally by the math, not by a human guessing.
Before PCA, subtract the mean of each feature from every row. This matters because PCA looks for directions of maximum variance, and variance is only well-defined around the mean. Uncentered features leave the first "principal component" pointing roughly at wherever the data cloud sits in space, which is useless. Centering forces PCA to find directions of spread.
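A small demonstration of that failure mode on synthetic data (the numbers are illustrative): without centering, the top singular direction tracks the cloud's offset from the origin instead of its spread.

import numpy as np

rng = np.random.default_rng(1)
# Cloud spread mostly along the x-axis, sitting far from the origin
X = rng.standard_normal((500, 2)) * np.array([3.0, 0.3]) + 100.0

_, _, Vt_raw = np.linalg.svd(X, full_matrices=False)
Xc = X - X.mean(axis=0)
_, _, Vt_centered = np.linalg.svd(Xc, full_matrices=False)

Vt_raw[0]        # ~[0.707, 0.707] (up to sign): points at the cloud, not along its spread
Vt_centered[0]   # ~[1, 0] (up to sign): the genuine direction of maximum variance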
Also standardize each feature to unit variance (divide by its standard deviation) when features have wildly different scales (dollars vs. kilometers vs. counts). Without this, the feature with the biggest raw numbers hijacks the first principal component regardless of its actual informational value. This is one of the most common PCA mistakes.
Quick check: your features are salary (values 30,000-200,000) and age (20-65). Before PCA, what preprocessing do you need?

Textbooks often describe PCA a second way: the principal components are the eigenvectors of the covariance matrix $C = \frac{1}{n-1} X_c^T X_c$ (where $X_c$ is the centered data), and the variances along each component are the eigenvalues. This framing and the SVD framing produce the same answer: SVD of $X_c$ gives the same directions, and the squared singular values equal the eigenvalues of $C$ (up to the $n-1$ scaling). SVD is preferred in practice because it is numerically more stable than forming $C$ explicitly.
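A quick numerical check of that equivalence (random data; np.linalg.eigvalsh is NumPy's eigensolver for symmetric matrices):

import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((200, 4))
Xc = X - X.mean(axis=0)

# Route 1: eigenvalues of the covariance matrix, sorted descending
C = Xc.T @ Xc / (len(Xc) - 1)
eig = np.sort(np.linalg.eigvalsh(C))[::-1]

# Route 2: squared singular values of the centered data, same scaling
s = np.linalg.svd(Xc, compute_uv=False)
np.allclose(eig, s**2 / (len(Xc) - 1))   # True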
Covariance is treated properly in Module 2's statistics lessons. For now: covariance measures how features vary together, and PCA finds the directions where that joint variation is largest.
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
iris = load_iris()
X, y = iris.data, iris.target # X.shape == (150, 4)
X_scaled = StandardScaler().fit_transform(X) # center + unit variance each column
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled) # shape (150, 2) β two principal components
pca.explained_variance_ratio_
# array([0.7296, 0.2285])
pca.explained_variance_ratio_.sum()
# 0.9581: first 2 PCs capture ~96% of the variance from 4 features
Four correlated flower measurements compressed to two numbers per flower while keeping 96% of the information, and the two components are orthogonal, so the new features are not redundant. See the sklearn PCA docs for the full API.
Quick check: explained_variance_ratio_ = [0.72, 0.19, 0.06, 0.02, 0.01]. Is reducing to 2 components a reasonable choice?

explained_variance_ratio_ answers "did PCA work?" It is a vector of fractions that sum to at most 1; each entry is the fraction of variance captured by that component. If the top 5 components capture 95%, the compression is useful. If the first 20 components together only capture 30%, the data does not have the low-rank structure PCA needs; reach for something else (kernel PCA, UMAP, t-SNE). Always inspect this attribute before trusting a PCA reduction.
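The check is a one-liner. Using the hypothetical ratios from the quick check above:

import numpy as np

evr = np.array([0.72, 0.19, 0.06, 0.02, 0.01])
np.cumsum(evr)   # array([0.72, 0.91, 0.97, 0.99, 1.  ])
# Two components keep 91% of the variance: a reasonable reduction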
The plot below shows X_2d for the iris dataset, colored by species. Drag the slider to see how PCA projections with different numbers of components capture cumulative variance.
| Concept | Lives on | What it tells you |
|---|---|---|
| Eigenvector $v$ | Square matrix $A$ | Direction that $A$ only stretches; no rotation. |
| Eigenvalue $\lambda$ | Paired with an eigenvector | How much stretch along that direction. Negative = flipped. Zero = collapsed. |
| Covariance matrix | Dataset $X$ | Symmetric $d \times d$ matrix; its eigenvectors are the principal components. |
| Singular value $\sigma_i$ | Any matrix (SVD) | "Generalized eigenvalue." Strength of the $i$-th axis. Always $\geq 0$, sorted descending. |
| Singular vector | Any matrix (SVD) | Columns of $U$ (left) and $V$ (right): orthonormal axes in output/input space. |
| Rank-$k$ approximation | Any matrix | Best summary of $A$ using only $k$ axes. Eckart-Young guarantees optimality. |
| Principal Component | Centered dataset | Right singular vector of $X_c$; direction of highest remaining variance. |
| PCA | Centered dataset | Project data onto top-$k$ principal components. Uses SVD under the hood. |
Because $Av = \lambda v$ requires $Av$ and $\lambda v$ to live in the same space, with the same dimensionality. That only happens when $A$ maps $\mathbb{R}^n \to \mathbb{R}^n$, i.e., when $A$ is $n \times n$. If $A$ were $m \times n$, then $Av$ has length $m$ and $v$ has length $n$; setting them equal is not a well-formed equation. SVD sidesteps this by allowing the input and output axes to live in different spaces.
Quick check: your data matrix X has shape (300, 50). Which function decomposes it?

SVD never compares $Av$ to $v$ directly. It finds the best axes in the input space ($V$) and the best axes in the output space ($U$) so that $A$, in those new coordinates, is pure diagonal stretching ($\Sigma$). There is no requirement that input and output have the same dimension.
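A sketch of the shape logic, reusing the (300, 50) shape from the quick check:

import numpy as np

X = np.random.default_rng(3).standard_normal((300, 50))
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U.shape, s.shape, Vt.shape   # ((300, 50), (50,), (50, 50))
# np.linalg.eig(X) would raise LinAlgError: eig needs a square matrix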
Quick check: when you report a PCA, should you quote the raw eigenvalues or explained_variance_ratio_?

explained_variance_ratio_, always. The eigenvalues (or squared singular values) are the raw variances; the ratio is what fraction of total variance each component captures. That fraction is scale-free and answers "am I keeping enough of the information?", the question that matters in practice.
Quick check: explained_variance_ratio_ = [0.60, 0.25, 0.07, 0.03, 0.02, 0.02, 0.01]. How much variance is captured by the first 4 components?

sklearn's PCA does center for you (no need to call StandardScaler just for centering; it happens inside fit_transform). Standardizing, giving every feature equal variance, is still your job. Without it, a feature measured in dollars will dominate one measured in years purely due to scale, and the principal components will be garbage.
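A quick check that the centering really happens inside fit (synthetic data; shapes illustrative):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
X = rng.standard_normal((100, 3)) + 50.0   # cloud far from the origin

pca_raw = PCA(n_components=2).fit(X)
pca_pre = PCA(n_components=2).fit(X - X.mean(axis=0))
# Same components either way: PCA subtracts the mean internally
np.allclose(np.abs(pca_raw.components_), np.abs(pca_pre.components_))   # True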
Yes, in exact arithmetic. In practice, floating-point noise means tiny-but-nonzero singular values appear where there "should" be zeros. The working definition of rank is therefore "the number of singular values above some small numerical threshold," which is exactly what np.linalg.matrix_rank implements.
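A sketch on a matrix built to be exactly rank 2 (the construction is illustrative):

import numpy as np

rng = np.random.default_rng(5)
# Sum of two outer products: rank 2 in exact arithmetic
A = np.outer(rng.standard_normal(6), rng.standard_normal(5)) \
  + np.outer(rng.standard_normal(6), rng.standard_normal(5))

np.linalg.svd(A, compute_uv=False)   # two sizable values, then ~1e-16 dust
np.linalg.matrix_rank(A)             # 2: counts singular values above the tolerance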
Quick check: np.linalg.svd returns singular values s = [5.0, 0.3, 1e-15, 2e-16]. What is the numerical rank of the matrix?

Answer these without scrolling back up. Each question is checkable; wrong answers reveal the full reasoning so you know what to re-read.

- np.linalg.eig(A) returns eigvals = [2, 5] and eigvecs = [[1, 0], [0, 1]]. Which eigenvector is paired with eigenvalue 5? Remember: the first eigenvector lives at eigvecs[:, 0] and the first eigenvalue at eigvals[0]. Output should be True.
- Build a rank-$k$ approximation from U[:, :k], np.diag(s[:k]), and Vt[:k, :], chained with @. The final line checks Eckart-Young: the Frobenius error equals the norm of the discarded singular values. Output should be True.
- Explain explained_variance_ratio_ in sklearn.

Module 1 gives you the vocabulary to read machine-learning papers and library docs without getting lost in notation.
The rest of this curriculum assumes that vocabulary. From here on, when a lesson says "we take the gradient of the loss" or "we project onto the first two PCs," those phrases carry real meaning.
Lock the vocabulary in with the timed Linear Algebra Drills, which cover all four lessons. Then tackle Gradient Descent From Scratch (a Jupyter notebook), the first end-to-end "make a thing learn" exercise. Complete both before starting Module 2.
Next: statistics and probability. Computing a model prediction is not the same as knowing when to trust one, how uncertain it is, or when the data is lying. Module 2 builds that toolkit.