A matrix is a function that reshapes space. Viewed this way, matrix multiplication has a geometric picture: chaining transformations. PCA, embeddings, neural-net layers, and "projecting into latent space" are all instances of the same operation.
Lesson 1 gave you the computational view: a matrix is an (m, n) grid, A @ v is a stack of dot products, shapes must line up. This lesson builds the geometric view: a matrix is a function that warps space. Both views matter — computational for writing code, geometric for interpreting what the code does. When an ML paper says "we project $x$ into a lower-dimensional space," the statement only makes sense with the geometric view.
Pick any matrix $A$ of shape (m, n). Multiply it by a vector $v$ of length $n$. You get back a vector of length $m$. That's a function — same interface as any Python function you've ever written: input goes in, output comes out, same shape contract every time.
In symbols, $A : \mathbb{R}^n \to \mathbb{R}^m$. Read that notation as "$A$ is a function that takes an $n$-dimensional input and returns an $m$-dimensional output." The symbol $\mathbb{R}^n$ just means "the set of all real-valued vectors with $n$ components." If you prefer code:
def A(v):
    # A is a stand-in for a fixed (m, n) matrix.
    # v comes in with shape (n,). result comes out with shape (m,).
    return A_matrix @ v
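To make the shape contract concrete, here's a quick check. The (2, 3) matrix below is just placeholder values; any matrix of that shape behaves the same way.

import numpy as np

A_matrix = np.array([[1.0, 2.0, 3.0],
                     [4.0, 5.0, 6.0]])   # shape (2, 3): m = 2, n = 3
v = np.array([1.0, 0.0, -1.0])           # shape (3,)

print(A(v).shape)                         # (2,): 3-dimensional in, 2-dimensional out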
Quick check: take a matrix $A$ of shape (5, 3) and a vector $v$ of shape (3,). What is the shape of A @ v? It has to be (5,): five rows, five dot products.

Every matrix is a function like this. But matrices aren't arbitrary functions — they're a very disciplined subset called linear transformations. Two rules define them: a linear transformation must respect vector addition, and it must respect scalar multiplication.
Formally, those two rules are two equations:

$$A(u + v) = Au + Av, \qquad A(c\,v) = c\,(Av).$$
In words: applying $A$ to a sum is the same as summing the applications, and applying $A$ to a scaled vector scales the result by the same factor. That property is called linearity.
The word "linear" here is stricter than the everyday meaning. A function like $f(x) = 3x + 7$ is usually called "linear" in an Excel/stats sense, but a true linear transformation has no constant offset — $f(0)$ must equal $0$. The $+7$ is called an affine piece, and it gets handled separately (in ML, this is the "bias" term). For now, pure linear = no shifts, only rotation/scale/shear/projection.
Column $j$ of $A$ is where the $j$-th basis vector lands after the transformation. In 2D, column 1 is the new location of $\hat{i} = \begin{bmatrix}1\\0\end{bmatrix}$, and column 2 is the new location of $\hat{j} = \begin{bmatrix}0\\1\end{bmatrix}$. Knowing where $\hat{i}$ and $\hat{j}$ land is enough to know where every vector lands.
The derivation follows directly from the two linearity rules. Start with an arbitrary 2D vector, written as a linear combination of $\hat{i}$ and $\hat{j}$:

$$v = \begin{bmatrix}x\\y\end{bmatrix} = x\,\hat{i} + y\,\hat{j}$$
This is just restating that the components of a vector are the coefficients in front of the basis vectors. Applying the two linearity rules from above:

$$Av = A(x\,\hat{i} + y\,\hat{j}) = A(x\,\hat{i}) + A(y\,\hat{j}) = x\,(A\hat{i}) + y\,(A\hat{j})$$
The first step uses the first linearity rule (distribute over addition). The second uses the second (pull the scalars out). The result $Av$ is determined by two things: the scalars $x$ and $y$ (which are $v$'s components), and the vectors $A\hat{i}$ and $A\hat{j}$ (which are where the basis vectors land).
Compute $A\hat{i}$ and $A\hat{j}$ directly with the matrix-vector rule from Lesson 1. For a 2×2 matrix $A = \begin{bmatrix}a & b \\ c & d\end{bmatrix}$:

$$A\hat{i} = \begin{bmatrix}a & b \\ c & d\end{bmatrix}\begin{bmatrix}1\\0\end{bmatrix} = \begin{bmatrix}a\\c\end{bmatrix}, \qquad A\hat{j} = \begin{bmatrix}a & b \\ c & d\end{bmatrix}\begin{bmatrix}0\\1\end{bmatrix} = \begin{bmatrix}b\\d\end{bmatrix}$$
$A\hat{i}$ is the first column of $A$. $A\hat{j}$ is the second column of $A$. This is what matrix-vector multiplication does: multiplying $A$ by $\hat{i}$ picks out the first column (index 0 in code); multiplying by $\hat{j}$ picks out the second (index 1).
Take the matrix

$$A = \begin{bmatrix}2 & -1 \\ 0 & 3\end{bmatrix}$$
The columns tell us: $\hat{i} = \begin{bmatrix}1\\0\end{bmatrix}$ lands at $\begin{bmatrix}2\\0\end{bmatrix}$, and $\hat{j} = \begin{bmatrix}0\\1\end{bmatrix}$ lands at $\begin{bmatrix}-1\\3\end{bmatrix}$. That's it — those two pieces of information fully determine what $A$ does to any input vector.
Let's verify by transforming $v = \begin{bmatrix}3\\2\end{bmatrix}$ two different ways and checking they agree.
Way 1 — the Lesson 1 computational view (stack of dot products):

$$Av = \begin{bmatrix}2 & -1 \\ 0 & 3\end{bmatrix}\begin{bmatrix}3\\2\end{bmatrix} = \begin{bmatrix}2\cdot 3 + (-1)\cdot 2 \\ 0\cdot 3 + 3\cdot 2\end{bmatrix} = \begin{bmatrix}4\\6\end{bmatrix}$$
Way 2 — the new geometric view (linear combination of columns):

$$Av = 3\,\begin{bmatrix}2\\0\end{bmatrix} + 2\,\begin{bmatrix}-1\\3\end{bmatrix} = \begin{bmatrix}6\\0\end{bmatrix} + \begin{bmatrix}-2\\6\end{bmatrix} = \begin{bmatrix}4\\6\end{bmatrix}$$
Same answer. The second view says: "$v$'s first component (3) tells me how far to walk along the new $\hat{i}$; $v$'s second component (2) tells me how far to walk along the new $\hat{j}$; the final location is the sum." Put another way, the transformed basis vectors become the new coordinate system, and the input vector's components are the coordinates in that system.
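The same check in NumPy, assuming the matrix and vector above:

import numpy as np

A = np.array([[2, -1],
              [0,  3]])
v = np.array([3, 2])

print(A @ v)                        # [4 6]  (row-by-row dot products)
print(3 * A[:, 0] + 2 * A[:, 1])    # [4 6]  (3 steps along the new i-hat, 2 along the new j-hat)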
Same [4, 6] the hand calculation gave. The two views agree because they are the same arithmetic re-described — one walks across rows, the other along columns.
If you're coming at this fresh, the dot-product view is more natural because it matches how you'd write the code. The column-as-basis-image view feels abstract. Push through the discomfort — the second view is the one you'll need for PCA, eigenvectors, and reading any modern ML paper. When the paper says "project into a latent space spanned by the learned basis," they're using the second view. Memorize it.
Drag the sliders to set where $\hat{i}$ (orange) and $\hat{j}$ (blue) land; the grid and a silhouette warp accordingly. The matrix readout updates live, connecting the columns to the visible transformation.
Three things to verify as you drag:
The full-featured visualizer embedded below lets you drive the matrix entries directly, watch the determinant react in real time, and apply a sample-data silhouette through the transformation. Open it, try each preset, and verify the claim that "columns are where the basis lands."
Almost every 2D transformation you'll meet in ML is one of — or a composition of — seven building blocks. Each has a geometric meaning, a matrix form, and a real ML application. Learn to recognize them on sight.
The identity matrix has 1s on the diagonal and 0s elsewhere. $Iv = v$ for every $v$ — columns say "$\hat{i}$ goes to $\hat{i}$, $\hat{j}$ goes to $\hat{j}$." In NumPy: np.eye(n). You'll use it as a sanity check ("does my function correctly return the identity for the zero-parameter case?") and as the answer to "what matrix times $A$ gives $A$?"
A scaling matrix $\begin{bmatrix}s_x & 0 \\ 0 & s_y\end{bmatrix}$ is a diagonal matrix — zeros off the diagonal. $\hat{i}$ lands at $(s_x, 0)$ (stretched by $s_x$ along $x$), $\hat{j}$ lands at $(0, s_y)$ (stretched by $s_y$ along $y$). When $s_x = s_y$, it's uniform scaling and the silhouette gets bigger or smaller without distortion. When they differ, you get stretching — picture a circle becoming an ellipse.
ML encounter: feature scaling. When you call StandardScaler in scikit-learn, it subtracts each feature's mean (that part is the affine shift from above) and then effectively multiplies your features by a diagonal matrix of 1 / std_dev values. The purpose is to make features comparable so optimizers don't get distracted by the feature with the biggest raw magnitude.
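A minimal sketch of the scale step as an explicit diagonal-matrix multiply (toy numbers, features as columns):

import numpy as np

X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

X_centered = X - X.mean(axis=0)
D = np.diag(1.0 / X.std(axis=0))   # diagonal matrix of 1/std per feature

print(X_centered @ D)              # matches StandardScaler().fit_transform(X)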
The rotation matrix $R = \begin{bmatrix}\cos\theta & -\sin\theta \\ \sin\theta & \cos\theta\end{bmatrix}$ rotates everything counterclockwise by angle $\theta$. Check the logic: $\hat{i}$ (which points right) should rotate to $(\cos\theta, \sin\theta)$ — that's the first column. $\hat{j}$ (which points up) should rotate to $(-\sin\theta, \cos\theta)$ — that's the second column. The minus sign is because rotating "up" counterclockwise moves you to the left (negative $x$) and up (positive $y$).
At $\theta = 90°$: $\cos(90°)=0$, $\sin(90°)=1$, giving $R = \begin{bmatrix}0 & -1 \\ 1 & 0\end{bmatrix}$. At $\theta = 45°$: $\cos(45°) = \sin(45°) \approx 0.707$ — the values in the "Rotate 45°" preset.
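A small helper to sanity-check those values (the function name here is just for illustration):

import numpy as np

def rotation(theta):
    # 2x2 counterclockwise rotation by theta (in radians)
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

print(np.round(rotation(np.pi / 2), 3))   # [[ 0. -1.] [ 1.  0.]]
print(np.round(rotation(np.pi / 4), 3))   # entries are ±0.707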
ML encounter: data augmentation for computer vision (rotating training images), coordinate frames in robotics, and the rotation matrices inside the SVD you'll meet in Lesson 4.
$F_x = \begin{bmatrix}-1 & 0 \\ 0 & 1\end{bmatrix}$ flips across the $y$-axis: $\hat{i}$ goes to $-\hat{i}$, $\hat{j}$ stays put. The silhouette mirrors left-to-right. Reflections always have $\det = -1$ — they invert orientation (clockwise becomes counterclockwise).
ML encounter: horizontal flip as a standard image-augmentation trick.
A shear $H = \begin{bmatrix}1 & k \\ 0 & 1\end{bmatrix}$ slides points horizontally by an amount proportional to their height. $\hat{i}$ stays at $(1, 0)$; $\hat{j}$ shifts over to $(k, 1)$. Squares become parallelograms. Area is preserved — $\det H = 1$ — because nothing is stretched, only skewed.
ML encounter: less direct than scaling/rotation, but shears are the building blocks of the LU decomposition behind np.linalg.solve, and they show up in certain geometric data-augmentation pipelines.
A projection matrix collapses an entire dimension. $P_x = \begin{bmatrix}1 & 0 \\ 0 & 0\end{bmatrix}$ kills the $y$-component: every point $(x, y)$ lands at $(x, 0)$, on the $x$-axis. The 2D plane becomes a 1D line. This is irreversible — once the $y$-information is gone, you can't get it back. The determinant is 0 (the transformation is singular; its matrix is rank-deficient).
Projection is what PCA does. PCA takes a high-dimensional dataset, picks the directions along which the data varies the most, and projects onto those — throwing away low-variance directions to get a compact representation. Embeddings (turning a word, image, or user into a 300-dim vector) are also projections from a much bigger implicit space. Anytime a model "compresses" something, a projection is lurking.
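A tiny sketch of that information loss, using the $P_x$ above (each row of points is one 2D point):

import numpy as np

P_x = np.array([[1, 0],
                [0, 0]])              # keep x, kill y

points = np.array([[ 2.0,  5.0],
                   [-1.0,  3.0],
                   [ 4.0, -2.0]])

print(points @ P_x.T)                 # every point lands on the x-axis; the y-column is gone
print(np.linalg.det(P_x))             # 0.0: singular, so there is no inverse to undo it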
Shifting every point by a fixed vector — $v \mapsto v + b$ — is not a linear transformation (it moves the origin), and so it can't be represented by a 2×2 matrix alone. This is why every real neural-net layer is $Wx + b$: a matrix multiply plus a bias vector. The $b$ is the translation, bolted on.
(There's a standard trick — homogeneous coordinates — that lets you stuff the translation into a bigger matrix. Computer graphics lives on this trick. In ML we usually just keep the bias as a separate vector.)
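A minimal sketch of that homogeneous-coordinates trick, with made-up numbers:

import numpy as np

b = np.array([3.0, -1.0])             # the translation we want to stuff into a matrix
T = np.array([[1.0, 0.0, b[0]],
              [0.0, 1.0, b[1]],
              [0.0, 0.0, 1.0]])       # 3x3 matrix acting on [x, y, 1]

v = np.array([2.0, 5.0])
v_h = np.append(v, 1.0)               # lift to homogeneous coordinates

print((T @ v_h)[:2])                  # [5. 4.], i.e. v + b, done with a pure matrix multiply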
The rule-based matrix multiplication from Lesson 1 has a direct geometric meaning:
If you first apply transformation $B$ to a vector, then apply $A$ to the result, the overall effect is a single transformation whose matrix is $AB$. That is: $A(Bv) = (AB)v$. The product $AB$ is defined to make this work. Matrix multiplication isn't an arbitrary rule — it's reverse-engineered from "I want chaining transformations to correspond to multiplying matrices."
Work through this carefully; the order convention is a common source of confusion. Pick two transformations, say a scale and a rotation:

$$A = \begin{bmatrix}2 & 0 \\ 0 & 3\end{bmatrix} \;(\text{scale}), \qquad B = \begin{bmatrix}0 & -1 \\ 1 & 0\end{bmatrix} \;(\text{rotate } 90°\text{ counterclockwise})$$
What happens to $v = \begin{bmatrix}1\\0\end{bmatrix}$ if we apply $B$ first, then $A$?

$$Bv = \begin{bmatrix}0\\1\end{bmatrix}, \qquad A(Bv) = \begin{bmatrix}2 & 0 \\ 0 & 3\end{bmatrix}\begin{bmatrix}0\\1\end{bmatrix} = \begin{bmatrix}0\\3\end{bmatrix}$$
Now compute $AB$ directly and apply it to $v$ in one shot:

$$AB = \begin{bmatrix}2 & 0 \\ 0 & 3\end{bmatrix}\begin{bmatrix}0 & -1 \\ 1 & 0\end{bmatrix} = \begin{bmatrix}0 & -2 \\ 3 & 0\end{bmatrix}, \qquad (AB)v = \begin{bmatrix}0\\3\end{bmatrix}$$
Same answer. Matrix multiplication is defined precisely so that $(AB)v$ matches $A(Bv)$.
In the expression $ABv$, the transformation applied first is the rightmost one — $B$ — because it's adjacent to the vector. Then $A$ acts on the result. Read innermost-parentheses first: $A(Bv)$. Reading the ordering left-to-right produces the wrong answer, because rotate-then-scale and scale-then-rotate are different transformations.
When you see a neural-net forward pass written as $h = W_3 \sigma(W_2 \sigma(W_1 x))$, read from the inside out: $x$ first, then $W_1$, then nonlinearity, then $W_2$, and so on. The math notation reverses the data-flow order, which is annoying — but it's the convention, and fighting it wastes time.
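A toy forward pass that makes the inside-out reading concrete (the shapes and the ReLU choice here are arbitrary):

import numpy as np

def sigma(z):
    return np.maximum(z, 0.0)   # ReLU standing in for "some nonlinearity"

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(4, 4))
W3 = rng.normal(size=(2, 4))
x = rng.normal(size=3)

h = W3 @ sigma(W2 @ sigma(W1 @ x))   # data flows right to left: x, then W1, then sigma, ...
print(h.shape)                        # (2,)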
Matrix multiplication is non-commutative: $AB$ and $BA$ are generally different matrices. Geometrically this is obvious — "rotate then scale along $x$" is a different transformation than "scale along $x$ then rotate." You can verify it with the interactive above by applying the same two transformations in both orders.
Different orders, different silhouettes. If matrix multiplication were commutative, chaining transformations would be order-independent — and the physical world would be much stranger.
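Or verify it numerically, reusing the scale and rotation matrices from the worked example above:

import numpy as np

A = np.array([[2, 0], [0, 3]])    # scale
B = np.array([[0, -1], [1, 0]])   # rotate 90° counterclockwise

print(A @ B)   # [[ 0 -2] [ 3  0]]  rotate first, then scale
print(B @ A)   # [[ 0 -3] [ 2  0]]  scale first, then rotate: a different matrix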
The determinant is a single number attached to a square matrix that tells you what the transformation does to area (in 2D) or volume (in 3D and higher). For a 2×2 matrix:

$$\det \begin{bmatrix}a & b \\ c & d\end{bmatrix} = ad - bc$$
Geometrically, the determinant is the area scaling factor. Consider the unit square at the origin with corners $(0,0), (1,0), (0,1), (1,1)$ — area 1. After applying $A$, that square becomes a parallelogram whose corners are the images of those four points, and its area is $|\det A|$. So: $|\det A| > 1$ means areas grow, $|\det A| < 1$ means they shrink, $\det A = 0$ means the plane collapses onto a line or point (area goes to zero), and a negative determinant means orientation flips because a reflection is involved.
Walk through the math for the scale-by-$(2, 3)$ example:

$$\det \begin{bmatrix}2 & 0 \\ 0 & 3\end{bmatrix} = 2\cdot 3 - 0\cdot 0 = 6$$
Unit square of area 1 becomes a 2×3 rectangle of area 6. Checks out. And the singular example:

$$\det \begin{bmatrix}1 & 2 \\ 2 & 4\end{bmatrix} = 1\cdot 4 - 2\cdot 2 = 0$$
Why is this zero? Look at the columns: $\begin{bmatrix}1\\2\end{bmatrix}$ and $\begin{bmatrix}2\\4\end{bmatrix}$. The second column is just 2× the first. $\hat{i}$ and $\hat{j}$ both land on the same line — the line $y = 2x$. The whole 2D plane collapses onto that line. Information is lost.
import numpy as np
A = np.array([[2, 0], [0, 3]])
np.linalg.det(A) # 6.0 — area scaled 6x, orientation preserved
B = np.array([[1, 2], [2, 4]])
np.linalg.det(B) # 0.0 — collapses to a line; columns are linearly dependent
R = np.array([[0, -1], [1, 0]]) # 90° rotation
np.linalg.det(R) # 1.0 — area preserved, no reflection
F = np.array([[-1, 0], [0, 1]]) # reflect across y-axis
np.linalg.det(F) # -1.0 — orientation flipped, area preserved
When a matrix has $\det = 0$ — or, in floating-point reality, a determinant that's uncomfortably close to zero — it's singular and cannot be inverted. In ML, this shows up as collinearity: two or more features in your design matrix are linearly dependent on each other. Classic example: you accidentally include age_in_years and age_in_months, or total_revenue and avg_revenue_per_day along with days_active. The columns become linearly dependent, $X^T X$ becomes singular, and the ordinary least squares closed-form solution explodes.
scikit-learn will either warn you, silently return garbage coefficients, or (for LinearRegression) use a pseudoinverse that gives some answer without telling you the feature set was degenerate. Always check for near-collinearity before you trust a fit. numpy.linalg.cond(X) or pandas.DataFrame.corr() are your first lines of defense.
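A quick collinearity check you can run before fitting, using the age example (synthetic data, hypothetical feature names):

import numpy as np

rng = np.random.default_rng(0)
age_in_years = rng.uniform(20, 60, size=100)
age_in_months = age_in_years * 12                  # exactly dependent on the first column
X = np.column_stack([age_in_years, age_in_months])

print(np.linalg.matrix_rank(X))   # 1, not 2: the columns are linearly dependent
print(np.linalg.cond(X))          # astronomically large; a well-behaved X is orders of magnitude smaller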
If $A$ takes $v$ to $Av$, the inverse $A^{-1}$ takes $Av$ back to $v$. By definition:

$$A^{-1}A = AA^{-1} = I$$
An inverse only exists when $\det A \neq 0$. If the determinant is zero, the transformation destroyed information, and no amount of cleverness can recover it — there is no inverse. Geometrically, you can't "un-project" from a line back to a plane because you don't know which of the infinitely many pre-image points in the plane you came from.
For a 2×2 matrix, there's a closed-form inverse:

$$\begin{bmatrix}a & b \\ c & d\end{bmatrix}^{-1} = \frac{1}{ad - bc}\begin{bmatrix}d & -b \\ -c & a\end{bmatrix}$$
Swap the diagonal entries, negate the off-diagonal entries, divide by the determinant. Useful to know exists; you'll rarely compute it by hand for anything bigger than 2×2.
A = np.array([[2, 1], [1, 3]])
A_inv = np.linalg.inv(A) # compute the inverse
A @ A_inv # ≈ identity (up to floating-point noise)
# array([[1.00000000e+00, 0.00000000e+00],
# [2.22044605e-16, 1.00000000e+00]])
The math textbook formula for OLS is $w = (X^T X)^{-1} X^T y$. In real code, you never write it that way. To solve the system $A w = b$, you use np.linalg.solve(A, b) — it uses LU decomposition under the hood, which is faster and much more numerically stable than forming the inverse explicitly. For least-squares specifically, np.linalg.lstsq(X, y) handles the rank-deficient case gracefully.
Rule of thumb: if you ever find yourself typing np.linalg.inv in production code, stop and ask "am I sure I don't want solve or lstsq here?" Ninety-nine times out of a hundred, you do want one of those.
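A side-by-side sketch on synthetic data (the weights and noise level below are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=100)

w_textbook = np.linalg.inv(X.T @ X) @ X.T @ y        # the formula, verbatim: avoid in practice
w_solve = np.linalg.solve(X.T @ X, X.T @ y)          # solve the normal equations instead
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)      # or go straight to least squares

print(w_textbook, w_solve, w_lstsq)                  # all ≈ [2, -1, 0.5] on this well-conditioned X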
Direct applications you'll meet later, each in one paragraph:
Each bullet is an application of one of the three ideas in this lesson: columns are where the basis lands, composition is multiplication, projection is information loss.
Each question is checkable and wrong answers reveal the full reasoning. Work through all of them without scrolling back up first.
If any were fuzzy, jump back to the relevant section — or open the matrix visualizer and build each transformation by hand.
The identity matrix is np.eye(n). Chaining transformations is A @ B in Python; geometrically: compose transformations.

Continue to Lesson 3 — Derivatives & Gradients, where we shift from "what does this matrix do to space?" to "how does the output change when I nudge the input?" That's the math ML optimizers run on.