A matrix is a function that reshapes space. Viewed this way, matrix multiplication has a geometric picture: chaining transformations. PCA, embeddings, neural-net layers, and "projecting into latent space" are all instances of the same operation.
Lesson 1 gave you the computational view: a matrix is an (m, n) grid, A @ v is a stack of dot products, shapes must line up. This lesson builds the geometric view: a matrix is a function that warps space. Both views matter — computational for writing code, geometric for interpreting what the code does. When an ML paper says "we project $x$ into a lower-dimensional space," the statement only makes sense with the geometric view.
Pick any matrix $A$ of shape (m, n). Multiply it by a vector $v$ of length $n$. You get back a vector of length $m$. That's a function — same interface as any Python function you've ever written: input goes in, output comes out, same shape contract every time.
In symbols, $A : \mathbb{R}^n \to \mathbb{R}^m$. Read that notation as "$A$ is a function that takes an $n$-dimensional input and returns an $m$-dimensional output." The symbol $\mathbb{R}^n$ just means "the set of all real-valued vectors with $n$ components." If you prefer code:
def A(v):
    # A is a stand-in for a fixed (m, n) matrix.
    # v comes in with shape (n,). result comes out with shape (m,).
    return A_matrix @ v
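To make the shape contract concrete, here's a quick check. The (2, 3) matrix below is just placeholder values; any matrix of that shape behaves the same way.

import numpy as np

A_matrix = np.array([[1.0, 2.0, 3.0],
                     [4.0, 5.0, 6.0]])   # shape (2, 3): m = 2, n = 3
v = np.array([1.0, 0.0, -1.0])           # shape (3,)

print(A(v).shape)                         # (2,): 3-dimensional in, 2-dimensional out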
Quick check: take a matrix $A$ of shape (5, 3) and a vector $v$ of shape (3,). What is the shape of A @ v? It has to be (5,): five rows, five dot products.

Every matrix is a function like this. But matrices aren't arbitrary functions — they're a very disciplined subset called linear transformations. Two rules define them: a linear transformation must respect vector addition, and it must respect scalar multiplication.
Formally, those two rules are two equations:

$$A(u + v) = Au + Av, \qquad A(c\,v) = c\,(Av).$$
In words: applying $A$ to a sum is the same as summing the applications, and applying $A$ to a scaled vector scales the result by the same factor. That property is called linearity.
The word "linear" here is stricter than the everyday meaning. A function like $f(x) = 3x + 7$ is usually called "linear" in an Excel/stats sense, but a true linear transformation has no constant offset — $f(0)$ must equal $0$. The $+7$ is called an affine piece, and it gets handled separately (in ML, this is the "bias" term). For now, pure linear = no shifts, only rotation/scale/shear/projection.
Column $j$ of $A$ is where the $j$-th basis vector lands after the transformation. In 2D, column 1 is the new location of $\hat{i} = \begin{bmatrix}1\\0\end{bmatrix}$, and column 2 is the new location of $\hat{j} = \begin{bmatrix}0\\1\end{bmatrix}$. Knowing where $\hat{i}$ and $\hat{j}$ land is enough to know where every vector lands.
The derivation follows directly from the two linearity rules. Start with an arbitrary 2D vector, written as a linear combination of $\hat{i}$ and $\hat{j}$:

$$v = \begin{bmatrix}x\\y\end{bmatrix} = x\,\hat{i} + y\,\hat{j}$$
This is just restating that the components of a vector are the coefficients in front of the basis vectors. Applying the two linearity rules from above:

$$Av = A(x\,\hat{i} + y\,\hat{j}) = A(x\,\hat{i}) + A(y\,\hat{j}) = x\,(A\hat{i}) + y\,(A\hat{j})$$
The first step uses the first linearity rule (distribute over addition). The second uses the second (pull the scalars out). The result $Av$ is determined by two things: the scalars $x$ and $y$ (which are $v$'s components), and the vectors $A\hat{i}$ and $A\hat{j}$ (which are where the basis vectors land).
Compute $A\hat{i}$ and $A\hat{j}$ directly with the matrix-vector rule from Lesson 1. For a 2×2 matrix $A = \begin{bmatrix}a & b \\ c & d\end{bmatrix}$:

$$A\hat{i} = \begin{bmatrix}a & b \\ c & d\end{bmatrix}\begin{bmatrix}1\\0\end{bmatrix} = \begin{bmatrix}a\\c\end{bmatrix}, \qquad A\hat{j} = \begin{bmatrix}a & b \\ c & d\end{bmatrix}\begin{bmatrix}0\\1\end{bmatrix} = \begin{bmatrix}b\\d\end{bmatrix}$$
$A\hat{i}$ is the first column of $A$. $A\hat{j}$ is the second column of $A$. This is what matrix-vector multiplication does: multiplying $A$ by $\hat{i}$ picks out the first column (index 0 in code); multiplying by $\hat{j}$ picks out the second (index 1).
Take the matrix

$$A = \begin{bmatrix}2 & -1 \\ 0 & 3\end{bmatrix}$$
The columns tell us: $\hat{i} = \begin{bmatrix}1\\0\end{bmatrix}$ lands at $\begin{bmatrix}2\\0\end{bmatrix}$, and $\hat{j} = \begin{bmatrix}0\\1\end{bmatrix}$ lands at $\begin{bmatrix}-1\\3\end{bmatrix}$. That's it — those two pieces of information fully determine what $A$ does to any input vector.
Let's verify by transforming $v = \begin{bmatrix}3\\2\end{bmatrix}$ two different ways and checking they agree.
Way 1 — the Lesson 1 computational view (stack of dot products):

$$Av = \begin{bmatrix}2 & -1 \\ 0 & 3\end{bmatrix}\begin{bmatrix}3\\2\end{bmatrix} = \begin{bmatrix}2\cdot 3 + (-1)\cdot 2 \\ 0\cdot 3 + 3\cdot 2\end{bmatrix} = \begin{bmatrix}4\\6\end{bmatrix}$$
Way 2 — the new geometric view (linear combination of columns):

$$Av = 3\,\begin{bmatrix}2\\0\end{bmatrix} + 2\,\begin{bmatrix}-1\\3\end{bmatrix} = \begin{bmatrix}6\\0\end{bmatrix} + \begin{bmatrix}-2\\6\end{bmatrix} = \begin{bmatrix}4\\6\end{bmatrix}$$
Same answer. The second view says: "$v$'s first component (3) tells me how far to walk along the new $\hat{i}$; $v$'s second component (2) tells me how far to walk along the new $\hat{j}$; the final location is the sum." Put another way, the transformed basis vectors become the new coordinate system, and the input vector's components are the coordinates in that system.
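The same check in NumPy, assuming the matrix and vector above:

import numpy as np

A = np.array([[2, -1],
              [0,  3]])
v = np.array([3, 2])

print(A @ v)                        # [4 6]  (row-by-row dot products)
print(3 * A[:, 0] + 2 * A[:, 1])    # [4 6]  (3 steps along the new i-hat, 2 along the new j-hat)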
Same [4, 6] the hand calculation gave. The two views agree because they are the same arithmetic re-described — one walks across rows, the other along columns.
If you're coming at this fresh, the dot-product view is more natural because it matches how you'd write the code. The column-as-basis-image view feels abstract. Push through the discomfort — the second view is the one you'll need for PCA, eigenvectors, and reading any modern ML paper. When the paper says "project into a latent space spanned by the learned basis," they're using the second view. Memorize it.
Drag the sliders to set where $\hat{i}$ (orange) and $\hat{j}$ (blue) land; the grid and a silhouette warp accordingly. The matrix readout updates live, connecting the columns to the visible transformation.
Three things to verify as you drag:
The full-featured visualizer embedded below lets you drive the matrix entries directly, watch the determinant react in real time, and apply a sample-data silhouette through the transformation. Open it, try each preset, and verify the claim that "columns are where the basis lands."
Almost every 2D transformation you'll meet in ML is one of — or a composition of — seven building blocks. Each has a geometric meaning, a matrix form, and a real ML application. Learn to recognize them on sight.
The identity matrix has 1s on the diagonal and 0s elsewhere. $Iv = v$ for every $v$ — columns say "$\hat{i}$ goes to $\hat{i}$, $\hat{j}$ goes to $\hat{j}$." In NumPy: np.eye(n). You'll use it as a sanity check ("does my function correctly return the identity for the zero-parameter case?") and as the answer to "what matrix times $A$ gives $A$?"
A scaling matrix $\begin{bmatrix}s_x & 0 \\ 0 & s_y\end{bmatrix}$ is a diagonal matrix — zeros off the diagonal. $\hat{i}$ lands at $(s_x, 0)$ (stretched by $s_x$ along $x$), $\hat{j}$ lands at $(0, s_y)$ (stretched by $s_y$ along $y$). When $s_x = s_y$, it's uniform scaling and the silhouette gets bigger or smaller without distortion. When they differ, you get stretching — picture a circle becoming an ellipse.
ML encounter: feature scaling. When you call StandardScaler in scikit-learn, it subtracts each feature's mean (that part is the affine shift from above) and then effectively multiplies your features by a diagonal matrix of 1 / std_dev values. The purpose is to make features comparable so optimizers don't get distracted by the feature with the biggest raw magnitude.
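A minimal sketch of the scale step as an explicit diagonal-matrix multiply (toy numbers, features as columns):

import numpy as np

X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

X_centered = X - X.mean(axis=0)
D = np.diag(1.0 / X.std(axis=0))   # diagonal matrix of 1/std per feature

print(X_centered @ D)              # matches StandardScaler().fit_transform(X)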
The rotation matrix $R = \begin{bmatrix}\cos\theta & -\sin\theta \\ \sin\theta & \cos\theta\end{bmatrix}$ rotates everything counterclockwise by angle $\theta$. Check the logic: $\hat{i}$ (which points right) should rotate to $(\cos\theta, \sin\theta)$ — that's the first column. $\hat{j}$ (which points up) should rotate to $(-\sin\theta, \cos\theta)$ — that's the second column. The minus sign is because rotating "up" counterclockwise moves you to the left (negative $x$) and up (positive $y$).
At $\theta = 90°$: $\cos(90°)=0$, $\sin(90°)=1$, giving $R = \begin{bmatrix}0 & -1 \\ 1 & 0\end{bmatrix}$. At $\theta = 45°$: $\cos(45°) = \sin(45°) \approx 0.707$ — the values in the "Rotate 45°" preset.
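A small helper to sanity-check those values (the function name here is just for illustration):

import numpy as np

def rotation(theta):
    # 2x2 counterclockwise rotation by theta (in radians)
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

print(np.round(rotation(np.pi / 2), 3))   # [[ 0. -1.] [ 1.  0.]]
print(np.round(rotation(np.pi / 4), 3))   # entries are ±0.707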
ML encounter: data augmentation for computer vision (rotating training images), coordinate frames in robotics, and the rotation matrices inside the SVD you'll meet in Lesson 4.
$F_x = \begin{bmatrix}-1 & 0 \\ 0 & 1\end{bmatrix}$ flips across the $y$-axis: $\hat{i}$ goes to $-\hat{i}$, $\hat{j}$ stays put. The silhouette mirrors left-to-right. Reflections always have $\det = -1$ — they invert orientation (clockwise becomes counterclockwise).
ML encounter: horizontal flip as a standard image-augmentation trick.
A shear $H = \begin{bmatrix}1 & k \\ 0 & 1\end{bmatrix}$ slides points horizontally by an amount proportional to their height. $\hat{i}$ stays at $(1, 0)$; $\hat{j}$ shifts over to $(k, 1)$. Squares become parallelograms. Area is preserved — $\det H = 1$ — because nothing is stretched, only skewed.
ML encounter: less direct than scaling/rotation, but shears are the building blocks of the LU decomposition behind np.linalg.solve, and they show up in certain geometric data-augmentation pipelines.
A projection matrix collapses an entire dimension. $P_x = \begin{bmatrix}1 & 0 \\ 0 & 0\end{bmatrix}$ kills the $y$-component: every point $(x, y)$ lands at $(x, 0)$, on the $x$-axis. The 2D plane becomes a 1D line. This is irreversible — once the $y$-information is gone, you can't get it back. The determinant is 0 (the transformation is singular; its matrix is rank-deficient).
Projection is what PCA does. PCA takes a high-dimensional dataset, picks the directions along which the data varies the most, and projects onto those — throwing away low-variance directions to get a compact representation. Embeddings (turning a word, image, or user into a 300-dim vector) are also projections from a much bigger implicit space. Anytime a model "compresses" something, a projection is lurking.
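A tiny sketch of that information loss, using the $P_x$ above (each row of points is one 2D point):

import numpy as np

P_x = np.array([[1, 0],
                [0, 0]])              # keep x, kill y

points = np.array([[ 2.0,  5.0],
                   [-1.0,  3.0],
                   [ 4.0, -2.0]])

print(points @ P_x.T)                 # every point lands on the x-axis; the y-column is gone
print(np.linalg.det(P_x))             # 0.0: singular, so there is no inverse to undo it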
Shifting every point by a fixed vector — $v \mapsto v + b$ — is not a linear transformation (it moves the origin), and so it can't be represented by a 2×2 matrix alone. This is why every real neural-net layer is $Wx + b$: a matrix multiply plus a bias vector. The $b$ is the translation, bolted on.
(There's a standard trick — homogeneous coordinates — that lets you stuff the translation into a bigger matrix. Computer graphics lives on this trick. In ML we usually just keep the bias as a separate vector.)
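A minimal sketch of that homogeneous-coordinates trick, with made-up numbers:

import numpy as np

b = np.array([3.0, -1.0])             # the translation we want to stuff into a matrix
T = np.array([[1.0, 0.0, b[0]],
              [0.0, 1.0, b[1]],
              [0.0, 0.0, 1.0]])       # 3x3 matrix acting on [x, y, 1]

v = np.array([2.0, 5.0])
v_h = np.append(v, 1.0)               # lift to homogeneous coordinates

print((T @ v_h)[:2])                  # [5. 4.], i.e. v + b, done with a pure matrix multiply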
The rule-based matrix multiplication from Lesson 1 has a direct geometric meaning:
If you first apply transformation $B$ to a vector, then apply $A$ to the result, the overall effect is a single transformation whose matrix is $AB$. That is: $A(Bv) = (AB)v$. The product $AB$ is defined to make this work. Matrix multiplication isn't an arbitrary rule — it's reverse-engineered from "I want chaining transformations to correspond to multiplying matrices."
Work through this carefully; the order convention is a common source of confusion. Pick two transformations, say a scale and a rotation:

$$A = \begin{bmatrix}2 & 0 \\ 0 & 3\end{bmatrix} \;(\text{scale}), \qquad B = \begin{bmatrix}0 & -1 \\ 1 & 0\end{bmatrix} \;(\text{rotate } 90°\text{ counterclockwise})$$
What happens to $v = \begin{bmatrix}1\\0\end{bmatrix}$ if we apply $B$ first, then $A$?

$$Bv = \begin{bmatrix}0\\1\end{bmatrix}, \qquad A(Bv) = \begin{bmatrix}2 & 0 \\ 0 & 3\end{bmatrix}\begin{bmatrix}0\\1\end{bmatrix} = \begin{bmatrix}0\\3\end{bmatrix}$$
Now compute $AB$ directly and apply it to $v$ in one shot:

$$AB = \begin{bmatrix}2 & 0 \\ 0 & 3\end{bmatrix}\begin{bmatrix}0 & -1 \\ 1 & 0\end{bmatrix} = \begin{bmatrix}0 & -2 \\ 3 & 0\end{bmatrix}, \qquad (AB)v = \begin{bmatrix}0\\3\end{bmatrix}$$
Same answer. Matrix multiplication is defined precisely so that $(AB)v$ matches $A(Bv)$.
In the expression $ABv$, the transformation applied first is the rightmost one — $B$ — because it's adjacent to the vector. Then $A$ acts on the result. Read innermost-parentheses first: $A(Bv)$. Reading the ordering left-to-right produces the wrong answer, because rotate-then-scale and scale-then-rotate are different transformations.
When you see a neural-net forward pass written as $h = W_3 \sigma(W_2 \sigma(W_1 x))$, read from the inside out: $x$ first, then $W_1$, then nonlinearity, then $W_2$, and so on. The math notation reverses the data-flow order, which is annoying — but it's the convention, and fighting it wastes time.
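A toy forward pass that makes the inside-out reading concrete (the shapes and the ReLU choice here are arbitrary):

import numpy as np

def sigma(z):
    return np.maximum(z, 0.0)   # ReLU standing in for "some nonlinearity"

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(4, 4))
W3 = rng.normal(size=(2, 4))
x = rng.normal(size=3)

h = W3 @ sigma(W2 @ sigma(W1 @ x))   # data flows right to left: x, then W1, then sigma, ...
print(h.shape)                        # (2,)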
Matrix multiplication is non-commutative: $AB$ and $BA$ are generally different matrices. Geometrically this is obvious — "rotate then scale along $x$" is a different transformation than "scale along $x$ then rotate." You can verify it with the interactive above by applying the same two transformations in both orders.
Different orders, different silhouettes. If matrix multiplication were commutative, chaining transformations would be order-independent — and the physical world would be much stranger.
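Or verify it numerically, reusing the scale and rotation matrices from the worked example above:

import numpy as np

A = np.array([[2, 0], [0, 3]])    # scale
B = np.array([[0, -1], [1, 0]])   # rotate 90° counterclockwise

print(A @ B)   # [[ 0 -2] [ 3  0]]  rotate first, then scale
print(B @ A)   # [[ 0 -3] [ 2  0]]  scale first, then rotate: a different matrix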
The determinant is a single number attached to a square matrix that tells you what the transformation does to area (in 2D) or volume (in 3D and higher). For a 2×2 matrix:

$$\det \begin{bmatrix}a & b \\ c & d\end{bmatrix} = ad - bc$$
Geometrically, the determinant is the area scaling factor. Consider the unit square at the origin with corners $(0,0), (1,0), (0,1), (1,1)$ — area 1. After applying $A$, that square becomes a parallelogram whose corners are the images of those four points, and its area is $|\det A|$. So: $|\det A| > 1$ means areas grow, $|\det A| < 1$ means they shrink, $\det A = 0$ means the plane collapses onto a line or point (area goes to zero), and a negative determinant means orientation flips because a reflection is involved.
Walk through the math for the scale-by-$(2, 3)$ example:

$$\det \begin{bmatrix}2 & 0 \\ 0 & 3\end{bmatrix} = 2\cdot 3 - 0\cdot 0 = 6$$
Unit square of area 1 becomes a 2×3 rectangle of area 6. Checks out. And the singular example:

$$\det \begin{bmatrix}1 & 2 \\ 2 & 4\end{bmatrix} = 1\cdot 4 - 2\cdot 2 = 0$$
Why is this zero? Look at the columns: $\begin{bmatrix}1\\2\end{bmatrix}$ and $\begin{bmatrix}2\\4\end{bmatrix}$. The second column is just 2× the first. $\hat{i}$ and $\hat{j}$ both land on the same line — the line $y = 2x$. The whole 2D plane collapses onto that line. Information is lost.
import numpy as np
A = np.array([[2, 0], [0, 3]])
np.linalg.det(A) # 6.0 — area scaled 6x, orientation preserved
B = np.array([[1, 2], [2, 4]])
np.linalg.det(B) # 0.0 — collapses to a line; columns are linearly dependent
R = np.array([[0, -1], [1, 0]]) # 90° rotation
np.linalg.det(R) # 1.0 — area preserved, no reflection
F = np.array([[-1, 0], [0, 1]]) # reflect across y-axis
np.linalg.det(F) # -1.0 — orientation flipped, area preserved
When a matrix has $\det = 0$ — or, in floating-point reality, a determinant that's uncomfortably close to zero — it's singular and cannot be inverted. In ML, this shows up as collinearity: two or more features in your design matrix are linearly dependent on each other. Classic example: you accidentally include age_in_years and age_in_months, or total_revenue and avg_revenue_per_day along with days_active. The columns become linearly dependent, $X^T X$ becomes singular, and the ordinary least squares closed-form solution explodes.
scikit-learn will either warn you, silently return garbage coefficients, or (for LinearRegression) use a pseudoinverse that gives some answer without telling you the feature set was degenerate. Always check for near-collinearity before you trust a fit. numpy.linalg.cond(X) or pandas.DataFrame.corr() are your first lines of defense.
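A quick collinearity check you can run before fitting, using the age example (synthetic data, hypothetical feature names):

import numpy as np

rng = np.random.default_rng(0)
age_in_years = rng.uniform(20, 60, size=100)
age_in_months = age_in_years * 12                  # exactly dependent on the first column
X = np.column_stack([age_in_years, age_in_months])

print(np.linalg.matrix_rank(X))   # 1, not 2: the columns are linearly dependent
print(np.linalg.cond(X))          # astronomically large; a well-behaved X is orders of magnitude smaller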
If $A$ takes $v$ to $Av$, the inverse $A^{-1}$ takes $Av$ back to $v$. By definition:

$$A^{-1}A = AA^{-1} = I$$
An inverse only exists when $\det A \neq 0$. If the determinant is zero, the transformation destroyed information, and no amount of cleverness can recover it — there is no inverse. Geometrically, you can't "un-project" from a line back to a plane because you don't know which of the infinitely many pre-image points in the plane you came from.
For a 2×2 matrix, there's a closed-form inverse:

$$\begin{bmatrix}a & b \\ c & d\end{bmatrix}^{-1} = \frac{1}{ad - bc}\begin{bmatrix}d & -b \\ -c & a\end{bmatrix}$$
Swap the diagonal entries, negate the off-diagonal entries, divide by the determinant. Useful to know exists; you'll rarely compute it by hand for anything bigger than 2×2.
A = np.array([[2, 1], [1, 3]])
A_inv = np.linalg.inv(A) # compute the inverse
A @ A_inv # ≈ identity (up to floating-point noise)
# array([[1.00000000e+00, 0.00000000e+00],
# [2.22044605e-16, 1.00000000e+00]])
The math textbook formula for OLS is $w = (X^T X)^{-1} X^T y$. In real code, you never write it that way. To solve the system $A w = b$, you use np.linalg.solve(A, b) — it uses LU decomposition under the hood, which is faster and much more numerically stable than forming the inverse explicitly. For least-squares specifically, np.linalg.lstsq(X, y) handles the rank-deficient case gracefully.
Rule of thumb: if you ever find yourself typing np.linalg.inv in production code, stop and ask "am I sure I don't want solve or lstsq here?" Ninety-nine times out of a hundred, you do want one of those.
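A side-by-side sketch on synthetic data (the weights and noise level below are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=100)

w_textbook = np.linalg.inv(X.T @ X) @ X.T @ y        # the formula, verbatim: avoid in practice
w_solve = np.linalg.solve(X.T @ X, X.T @ y)          # solve the normal equations instead
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)      # or go straight to least squares

print(w_textbook, w_solve, w_lstsq)                  # all ≈ [2, -1, 0.5] on this well-conditioned X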
Direct applications you'll meet later, each in one paragraph:
Each bullet is an application of one of the three ideas in this lesson: columns are where the basis lands, composition is multiplication, projection is information loss.
Each question is checkable and wrong answers reveal the full reasoning. Work through all of them without scrolling back up first.
If any were fuzzy, jump back to the relevant section — or open the matrix visualizer and build each transformation by hand.
The identity matrix is np.eye(n). Chaining transformations is A @ B in Python; geometrically: compose transformations.

Continue to Lesson 3 — Derivatives & Gradients, where we shift from "what does this matrix do to space?" to "how does the output change when I nudge the input?" That's the math ML optimizers run on.