Every ML model you'll ever use - linear regression, random forests, GPT-4 - is ultimately doing arithmetic on vectors and matrices. Learn to read that language fluently and the rest gets dramatically easier.
You already work with tables in SQL and dbt. A matrix is a table; a vector is a single row or column. The "math" part consists of a handful of named operations with consistent rules. By the end of this lesson you'll see your data the way a model sees it.
A vector is an ordered list of numbers. Vectors show up everywhere in ML because almost anything we want to reason about - a user, a document, a pixel, a word - can be described by a fixed list of measurements, and once you have a list of numbers, you can do math on it.
An ML model takes some input and produces some output. The input could be a user row, a photograph, an email, or a sentence - wildly different things. The model can't deal with "a user" in the abstract. It needs numbers - specifically, fixed-length, fixed-order lists of numbers. That's a vector. Converting real-world objects into vectors - and choosing which numbers to include - is half the job of an ML engineer. The other half is what you do to those vectors once you have them.
Take a single user in your database. Pick three things you know about them:
age = 34
lifetime_spend = 487.50
days_active = 120
Those are three Python variables. Pack them into a list in a fixed order - and agree the order never changes - and you have a vector:

$$u = \begin{bmatrix} 34 \\ 487.50 \\ 120 \end{bmatrix}$$
The vertical bracket notation writes "three numbers, stacked into a column." Some papers write it horizontally as $u = (34, 487.50, 120)$; they mean the same thing. In code:
import numpy as np
u = np.array([34, 487.50, 120])
u.shape # (3,) - "a 1-dimensional array of length 3"
That trailing comma in (3,) trips people up. The shape is a Python tuple with one entry - the entry being 3. NumPy always reports array shapes as tuples because shapes can have multiple numbers (you'll see (1000, 7) for a matrix soon). A one-dimensional array still gets a tuple, just with one entry. The comma is what tells Python "this is a tuple, not a parenthesized integer."
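A quick check that the shape really is an ordinary Python tuple - nothing below is specific to this example:

```python
import numpy as np

u = np.array([34, 487.50, 120])
type(u.shape)   # <class 'tuple'>
len(u.shape)    # 1 - one axis, i.e. a 1-D array
u.shape[0]      # 3 - that axis has length 3
```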
If I swap positions to [120, 34, 487.50], I've made a different vector. In ML terms, I've changed which feature lives in which dimension. A model that was trained expecting "age" in position 0 will happily take "days_active" there and quietly produce garbage predictions. No error. No warning. Just wrong answers.
This is the same class of bug as a silent column reorder in a Parquet file that a downstream dbt model reads - the pipeline still runs, but the numbers mean different things. In ML, the standard defense is to save the feature order (a list of column names) alongside the trained model, and assert the order matches at inference time. You'll write your first one of these in Module 3. A vector is only meaningful together with the fixed order of its features.
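A sketch of that defense, under the assumption that the feature order was saved as a plain JSON list next to the model; the file name and the `build_vector` helper are illustrative, not a standard API:

```python
import json
import numpy as np

# Training time: persist the feature order next to the model artifact
FEATURE_ORDER = ["age", "lifetime_spend", "days_active"]
with open("feature_order.json", "w") as f:
    json.dump(FEATURE_ORDER, f)

# Inference time: always assemble vectors in the saved order, fail loudly if anything is missing
def build_vector(row: dict, expected_order: list) -> np.ndarray:
    missing = [name for name in expected_order if name not in row]
    assert not missing, f"missing features: {missing}"
    return np.array([row[name] for name in expected_order])

u = build_vector({"age": 34, "lifetime_spend": 487.50, "days_active": 120}, FEATURE_ORDER)
```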
Each number inside a vector is a feature. The whole vector - all the features for one thing - is a feature vector, also called a sample or a data point. All three terms appear in docs and papers interchangeably.
There are two mental models for a vector. They describe the same object and emphasize different things. Code wants the first; geometric intuition lives in the second.
1. As a list. Numbers in a specific order. This is the computational picture - what's in memory, what NumPy stores, what you pass between functions. Most of the time this is the only view needed to write correct code.
2. As an arrow. In 2D, the vector [3, 1] is an arrow starting at the origin - the point (0, 0) - and ending at (3, 1). In 3D, same idea with three coordinates. Beyond 3D we can't draw the arrow, but we still talk about it: direction, length, angle. The math works identically; 2D pictures are a tool for building intuition.
The arrow view matters because operations on vectors have geometric meaning, and that meaning lets you reason about what a model is actually doing. Adding two vectors chains arrows tip-to-tail. Multiplying by 2 doubles an arrow's length. Multiplying by -1 flips it to point the other way. When the geometry looks simple, the algebra is simple too.
A word embedding in modern NLP is a vector with 300, 1024, or 4096 numbers. You cannot draw it. But every operation in this lesson - addition, dot product, norm, projection - extends to any number of dimensions with no changes to the formulas. The same rules work whether the vector has 2 components or 4096. If a concept only worked in 2D, it wouldn't be useful for ML; none of the concepts here are like that.
One edge case worth naming: [0, 0] is an arrow of zero length - the zero vector. It has no direction, and it's still a valid vector.
This section introduces four operations - addition, scalar multiplication, dot product, and norm - and explains what each one means, not just how to compute it. All four show up in every ML model, usually under the hood.
Add vectors element-wise: position 0 to position 0, position 1 to position 1, and so on. That's the computational rule.
Geometrically, place the second arrow's tail at the first arrow's tip. The sum is the arrow from the original origin to where you end up. If the first vector takes you "3 east and 1 north" and the second takes you "1 east and 2 north," the combination takes you "4 east and 3 north." Addition chains directions.
One reason to add feature vectors: in learned word embeddings, vector arithmetic roughly captures meaning.
king_vec - man_vec + woman_vec ≈ queen_vec
This doesn't hold perfectly - the pop-science version oversimplifies - but "+ woman - man" literally means "shift the meaning in the same direction that takes a male word to a female word." Addition of vectors is composition of whatever those vectors represent.
In NumPy:
a = np.array([3, 1])
b = np.array([1, 2])
a + b # array([4, 3])
A scalar - just a single number, not a vector or matrix - multiplied into a vector multiplies every component. Geometrically: it stretches the arrow. 2·v doubles the length. 0.5·v halves it. -1·v flips it to point the other way. 0·v collapses it to the origin.
Scalar multiplication is everywhere in ML. Re-weighting features (multiply a feature by 100 to convert meters to centimeters). Optimizer updates during training - the gradient descent rule is w = w - learning_rate * gradient, and the learning rate is a scalar that shrinks the update step. Every time you scale a vector, you're doing this.
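Both uses in a few lines of NumPy; the weights, gradient, and learning rate below are made-up stand-ins, not output from a real optimizer:

```python
import numpy as np

v = np.array([3.0, -2.0])
2 * v      # array([ 6., -4.]) - doubled length, same direction
-1 * v     # array([-3.,  2.]) - flipped
0.5 * v    # array([ 1.5, -1.]) - halved

# A gradient descent step: the learning rate is a scalar shrinking the update
w = np.array([0.2, 0.7])
gradient = np.array([1.0, -4.0])
learning_rate = 0.01
w = w - learning_rate * gradient   # array([0.19, 0.74])
```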
The dot product takes two vectors and returns a single number. It is the most important operation in this lesson, and arguably in the whole module.
The formula: multiply each corresponding pair of components, then sum the products.
Step through the calculation for a = [3, 1] and b = [2, 4]:
- Multiply position 0 of a by position 0 of b: 3 × 2 = 6
- Multiply position 1 of a by position 1 of b: 1 × 4 = 4
- Sum the products: 6 + 4 = 10

So a · b = 10. In NumPy, there are two equivalent spellings - np.dot(a, b) and the @ operator. The @ operator is matrix multiplication; for 1D vectors it equals the dot product. Run this and confirm the answer matches the hand calculation above:
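A minimal version of that check, reusing a and b from the hand calculation:

```python
import numpy as np

a = np.array([3, 1])
b = np.array([2, 4])

np.dot(a, b)   # 10
a @ b          # 10 - identical for 1-D arrays
```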
That's the computation. What it measures: the dot product is a number that tells you how much two vectors point in the same direction.
Why does the multiply-and-sum formula measure alignment? When two vectors point the same way, their components share signs, so every pairwise product is positive and they sum to a large positive. When they point opposite, signs disagree, products are negative, and the sum is negative. When perpendicular, the positive and negative products cancel. Run the three cases and compare:
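Here are three illustrative pairs, one per case; the specific vectors are arbitrary:

```python
import numpy as np

np.array([2, 1]) @ np.array([4, 2])    # 10  - aligned: large positive
np.array([2, 1]) @ np.array([-4, -2])  # -10 - anti-aligned: negative
np.array([2, 1]) @ np.array([-1, 2])   # 0   - perpendicular: the products cancel
```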
The perpendicular case is the one worth pausing on โ it's how ML decides two features (or two embedding directions) carry independent information. A zero dot product means neither vector contributes anything to the other's direction.
The dot product is the single most common operation in ML. A linear regression's prediction for one row is a dot product of that row's features with the learned weights. An attention score inside a transformer (the architecture behind GPT, BERT, and Claude) is a dot product between a "query" vector and a "key" vector. Semantic search uses cosine similarity - a normalized dot product - to find similar documents. Understanding dot products gives you the arithmetic of most models.
The norm of a vector - usually written ||v|| - is its length: ||v|| = √(v₁² + v₂² + … + vₙ²). This is specifically the L2 norm (also called the "Euclidean norm" because it's the one from ordinary Euclidean geometry). It's literally the Pythagorean theorem generalized to any number of dimensions. For [3, 4], the norm is √(9 + 16) = √25 = 5 - same as the hypotenuse of a 3-4-5 right triangle.
Three reasons vector length matters in ML:
- The distance between two data points a and b is ||a - b|| - the foundation of k-nearest neighbors, k-means clustering, and anomaly detection.
- Dividing a vector by its norm rescales it to unit length, keeping only its direction (the normalization line in the code below).
- Regularization penalties are norms of the weight vector - more on that at the end of this section and in Module 3.

v = np.array([3, 4])
np.linalg.norm(v) # 5.0
# Normalize to unit length
v / np.linalg.norm(v) # array([0.6, 0.8]) - same direction, length 1
# Distance between two data points
a = np.array([1, 1])
b = np.array([4, 5])
np.linalg.norm(a - b) # 5.0 - a is 5 units away from b
L2 is the default, but there's also the L1 norm (sum of absolute values: |v₁| + |v₂| + …) and the L∞ norm (the largest absolute value among components). Each captures a different notion of "length." L1 regularization (aka Lasso) uses the L1 norm and has a useful property - it drives some weights to exactly zero, giving you automatic feature selection. You'll meet L1 and L2 regularization in Module 3; for now, just know the family exists and they're all called "norms."
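All three come out of `np.linalg.norm` via its `ord` argument; a quick comparison on one vector:

```python
import numpy as np

v = np.array([3.0, -4.0])
np.linalg.norm(v)              # 5.0 - L2 (default): sqrt(9 + 16)
np.linalg.norm(v, ord=1)       # 7.0 - L1: |3| + |-4|
np.linalg.norm(v, ord=np.inf)  # 4.0 - L-infinity: max(|3|, |-4|)
```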
A matrix is a 2D grid of numbers โ rows and columns, like a spreadsheet or a SQL result set. The conventions and operations attached to it are what make it useful.
This is a 2×3 matrix: 2 rows, 3 columns. "m by n" (written m × n) always means m rows by n columns, in that order. The convention never varies.
A = np.array([[1, 2, 3],
[4, 5, 6]])
A.shape # (2, 3) - 2 rows, 3 columns
When you run SELECT age, spend, days FROM users LIMIT 1000, you get back a 1000-row, 3-column result. That's a matrix. In ML code the feature matrix is almost always named X, with shape (n_samples, n_features).
Every function in scikit-learn, every layer in PyTorch, every utility in pandas expects feature data in shape (n_samples, n_features). Rows = samples, columns = features. This is the single most important convention in the entire ML ecosystem.
When you see X.shape == (1000, 7), read it immediately as "1000 data points, each described by 7 features." If you accidentally transpose your data and pass (7, 1000), .fit() will happily treat each feature as a sample and each sample as a feature - silent garbage. This is a top-5 ML bug; sklearn will not save you from it.
Other names for this exact object: data matrix, feature matrix, design matrix (the stats-paper term), or just X. They all refer to the same (n_samples, n_features) table.
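A sketch of how that SQL result typically becomes X in Python, assuming the query has been loaded into a pandas DataFrame named `users`; the toy values below stand in for the real query result:

```python
import pandas as pd

# users = pd.read_sql("SELECT age, spend, days FROM users LIMIT 1000", conn)
users = pd.DataFrame({"age": [34, 29], "spend": [487.5, 120.0], "days": [120, 45]})

feature_cols = ["age", "spend", "days"]
X = users[feature_cols].to_numpy()  # rows = users, columns = features
X.shape                             # (2, 3) for this toy frame; (1000, 3) for the full query
```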
The transpose of a matrix flips rows and columns - what was row 1 becomes column 1, what was row 2 becomes column 2, and so on. Written $A^T$ in math, A.T in NumPy.
Element by element: the entry at row 1, column 1 of A (which is 1) stays at row 1, column 1 of Aᵀ. The entry at row 1, column 2 of A (which is 2) moves to row 2, column 1 of Aᵀ. In general, A[i][j] becomes Aᵀ[j][i] - indices swap. Shape-wise, a (3, 2) matrix transposes to a (2, 3) matrix. A (1000, 7) matrix transposes to (7, 1000).
A.T # transpose - rows become columns; shape swaps
A.T.T # transpose of a transpose is back to the original A
Transpose exists because many matrix operations have strict shape requirements (the next section), and transposing is the standard tool for reshaping things so operations line up. The ordinary least squares formula is w = (XᵀX)⁻¹ Xᵀ y - full of transposes because they're the glue that makes the shapes work out. Understanding that formula is a later-module concern; for now, note transpose's role as shape glue.
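A shape-only illustration of that glue role (the @ products themselves are the subject of the next section; only the shapes matter here):

```python
import numpy as np

X = np.random.randn(1000, 7)  # (n_samples, n_features)
y = np.random.randn(1000)     # one target per sample

(X.T @ X).shape   # (7, 7) - features x features, small enough to invert
(X.T @ y).shape   # (7,)   - one entry per feature, matching w's shape
```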
Matrix multiplication is where most people bounce off the math. The rule is mechanical but needs to be seen in slow motion the first time.
For A @ B where A is m × n and B is n × p, the result is an m × p matrix. The inner dimensions - A's columns and B's rows - must match. The outer dimensions become the shape of the result.
A visual for shape-matching:
   A       @       B       =    result
(m × n)         (n × p)         (m × p)
        └───────────────┘
            must match
           (inner dims)
Each entry in the result is itself a dot product. Specifically, the entry at row i, column j of the result is the dot product of row i of A with column j of B: $(AB)_{ij} = \sum_{k} A_{ik} B_{kj}$.
The formula says: for each pair (row of A, column of B), compute a dot product and place the result in that position. A worked example, step by step:
Take A = [[1, 2], [3, 4]] and B = [[5, 6], [7, 8]]. Both are 2×2. The inner dimensions match (both are 2). The result will be 2×2. Compute it entry by entry:
- Row 0 of A with column 0 of B: [1, 2] · [5, 7] = 1·5 + 2·7 = 5 + 14 = 19
- Row 0 of A with column 1 of B: [1, 2] · [6, 8] = 1·6 + 2·8 = 6 + 16 = 22
- Row 1 of A with column 0 of B: [3, 4] · [5, 7] = 3·5 + 4·7 = 15 + 28 = 43
- Row 1 of A with column 1 of B: [3, 4] · [6, 8] = 3·6 + 4·8 = 18 + 32 = 50

The same calculation in NumPy:
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
A @ B # matrix multiplication - this is the one you want
# array([[19, 22],
# [43, 50]])
A * B # element-wise multiplication - DIFFERENT operation, be careful
# array([[ 5, 12],
# [21, 32]])
In NumPy, * is element-wise and @ is matrix multiplication. They produce different shapes and different answers. In MATLAB and most math textbooks, * means matmul - so people coming from those worlds write A * B in Python and get silent garbage.
Every time you write a multiplication, ask: "element-wise or matmul?" If element-wise, A * B (shapes must match or broadcast). If matmul, A @ B (inner dims must match). Running that check mentally for the first month of ML work prevents hours of debugging.
A @ B is not the same as B @ A, even when both are defined. Sometimes only one ordering is defined at all - if A is (3, 4) and B is (4, 5), then A @ B works (result (3, 5)) but B @ A errors (inner dims 5 and 3 don't match). Matrix multiplication is non-commutative; order always matters.
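A quick demonstration with random matrices - only the shapes and the error matter, not the values:

```python
import numpy as np

A = np.random.randn(3, 4)
B = np.random.randn(4, 5)

(A @ B).shape   # (3, 5) - inner dims 4 and 4 match
B @ A           # raises ValueError - inner dims 5 and 3 don't match
```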
A special case of matrix multiplication deserves its own section: a matrix times a vector. If A is m × n and v has length n (shape (n,)), then A @ v is a vector of length m (shape (m,)). This operation is the core of almost every model you'll train, and there are two equivalent angles on it.
There are two equivalent views of this operation. Same answer; different perspectives.
View 1 - one dot product per row. The i-th output is the dot product of row i of A with v. If A has m rows, that's m dot products, stacked into a length-m vector.
For example, take A = [[2, 0], [0, 3]] and v = [1, 4]:
- Row 0 of A is [2, 0]. Dot with [1, 4]: 2·1 + 0·4 = 2. That's output[0].
- Row 1 of A is [0, 3]. Dot with [1, 4]: 0·1 + 3·4 = 12. That's output[1].
- Stack them: A @ v = [2, 12].

This is the computational view - it's what NumPy does under the hood. It's also the view that makes "linear regression is a matrix-vector product" fall into place.
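The same example in NumPy, as a sanity check on the hand computation:

```python
import numpy as np

A = np.array([[2, 0],
              [0, 3]])
v = np.array([1, 4])
A @ v   # array([ 2, 12]) - one dot product per row of A
```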
View 2 - a weighted combination of columns. Equivalently, A @ v equals "v[0] times column 0 of A, plus v[1] times column 1 of A, plus …". For our example: 1·[2, 0] + 4·[0, 3] = [2, 0] + [0, 12] = [2, 12].
Same answer, different framing. The vector v says "take 1 copy of A's first column and 4 copies of A's second column, and add them up." This is a linear combination of A's columns - a weighted sum.
The second view is what lets you understand what a matrix does geometrically: a matrix acts on an input vector by returning a specific combination of its own columns. That framing is the content of Lesson 2 (matrices as transformations) and underlies eigenvalues and PCA in Lesson 4. Both views are correct; advanced ML concepts lean heavily on View 2.
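View 2 written out explicitly for the same A and v; both lines produce the same array:

```python
import numpy as np

A = np.array([[2, 0],
              [0, 3]])
v = np.array([1, 4])

v[0] * A[:, 0] + v[1] * A[:, 1]   # array([ 2, 12]) - 1 copy of column 0 plus 4 copies of column 1
A @ v                             # array([ 2, 12]) - identical
```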
Suppose you have a data matrix X of shape (n_samples, n_features) and a weight vector w of shape (n_features,) that your optimizer has learned. The prediction for every sample is:
y_pred = X @ w # shape: (n_samples,) - one prediction per row
By View 1, each output entry is a dot product of one row of X (one sample's features) with w (the learned weights). That dot product is the prediction for that sample. Doing it for every row and stacking the results yields a vector of predictions โ one per sample.
Every linear model โ ordinary least squares, ridge, lasso, logistic regression โ is exactly this operation, differing only in how w is found. Neural networks are stacks of X @ W-style operations with a nonlinearity inserted between each one. A mental picture of X @ w covers most of ML.
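A minimal sketch of that stacking, with made-up layer sizes, random weights, and ReLU as the nonlinearity - an illustration of the structure, not a trained network:

```python
import numpy as np

n_samples, n_features, n_hidden = 1000, 7, 16
X = np.random.randn(n_samples, n_features)

W1 = np.random.randn(n_features, n_hidden)  # first layer weights
b1 = np.zeros(n_hidden)                     # first layer bias
w2 = np.random.randn(n_hidden)              # second layer weights

hidden = np.maximum(0, X @ W1 + b1)  # (1000, 16) - matmul, then ReLU nonlinearity
y_pred = hidden @ w2                 # (1000,)    - one prediction per sample
```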
The hardest part of deep learning is not the math - it's keeping track of tensor shapes as data flows through a model. A 3-line neural network function can have five implicit shape assumptions, and one wrong guess makes everything silently wrong. Engineers who develop shape fluency introduce fewer bugs; engineers who don't spend their afternoons in pdb.
Shape rules for everything in this lesson:
| Operation | Shape rule | Example |
|---|---|---|
| A + B or A * B (element-wise) | Shapes must match, or be broadcastable | (3,4) + (3,4) → (3,4) |
| A @ B (matmul, 2D × 2D) | Inner dims match: (m, n) @ (n, p) → (m, p) | (3,4) @ (4,5) → (3,5) |
| A @ v (matrix × 1D vector) | (m, n) @ (n,) → (m,) | (3,4) @ (4,) → (3,) |
| A.T (transpose) | Swap the two numbers: (m, n) → (n, m) | (5,3).T → (3,5) |
| v @ w (two 1D vectors) | Lengths must match; returns a scalar | (4,) @ (4,) → scalar |
Production ML code is full of assert statements checking expected shapes. They're cheap to write, they fail loudly the moment something's wrong, and they document the function's contract inline.
def predict(X, w):
    """
    Given features X with shape (n_samples, n_features)
    and weights w with shape (n_features,),
    return predictions with shape (n_samples,).
    """
    assert X.ndim == 2, f"X should be 2D, got shape {X.shape}"
    assert w.ndim == 1, f"w should be 1D, got shape {w.shape}"
    assert X.shape[1] == w.shape[0], (
        f"shape mismatch: X has {X.shape[1]} features, "
        f"w has {w.shape[0]} weights"
    )
    return X @ w
The alternative is debugging at 11pm because predictions were silently the wrong shape and downstream code took the first element of a broadcast result as "the answer." The assert pattern is a small investment for substantially fewer bugs. The NumPy ndarray docs cover every attribute - .shape, .ndim, .dtype, .T - and repay a skim.
Broadcasting is NumPy's rule for performing element-wise operations on arrays of different but compatible shapes, without a for-loop. It's how you subtract a mean vector from every row of a data matrix, add a bias vector to every sample in a batch, or multiply every column by a scaling factor. Almost every per-row or per-column transformation in NumPy/PyTorch code uses broadcasting.
Without broadcasting, you'd write explicit nested loops:
# The painful way
for i in range(X.shape[0]):
    for j in range(X.shape[1]):
        X[i, j] = X[i, j] - means[j]
That's slow (Python loops are interpreted), verbose, and easy to get wrong. With broadcasting, it's one line that runs in optimized C under the hood and reads almost like math:
X_centered = X - means
When shapes don't match, NumPy tries to "stretch" the smaller array along missing or length-1 dimensions so the shapes align. It only stretches when the result is unambiguous; otherwise it raises an error.
# Subtract the mean of each feature from every row
X = np.random.randn(1000, 5) # shape (1000, 5) - 1000 samples, 5 features
means = X.mean(axis=0) # shape (5,) - one mean per feature
X_centered = X - means # broadcast: (1000, 5) - (5,) -> (1000, 5)
# means is conceptually tiled to every row
Step by step, broadcasting did the following:
1. X has shape (1000, 5); means has shape (5,).
2. Shapes are compared from the right. X's last axis is 5; means's last (and only) axis is 5. Match.
3. X has an extra axis on the left (1000); means has nothing there. NumPy treats means as if it were tiled to (1000, 5) - the same (5,) vector, conceptually copied 1000 times.
4. The result has shape (1000, 5).

Another common case - adding a bias to every row of a batch:
batch = np.zeros((4, 3)) # shape (4, 3) - 4 samples, 3 features
bias = np.array([1.0, 2.0, 3.0]) # shape (3,)
batch + bias # shape (4, 3) - bias added to every row
# array([[1., 2., 3.],
# [1., 2., 3.],
# [1., 2., 3.],
# [1., 2., 3.]])
If the shapes are incompatible - axes that are neither the same length nor length-1 - you get a ValueError:
a = np.zeros((3, 4))
b = np.zeros((3,))
a + b # ERROR: shapes (3,4) and (3,) cannot broadcast
# the trailing axis of b (3) must match a's trailing axis (4), and it doesn't
The fix is usually a reshape. b[:, None] inserts a new axis and turns (3,) into (3, 1), which can broadcast against (3, 4) - NumPy stretches the length-1 axis to 4. Mentally performing this reshape is a reliable sign of broadcasting fluency.
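The failing example from above, fixed with that reshape:

```python
import numpy as np

a = np.zeros((3, 4))
b = np.array([1.0, 2.0, 3.0])

b[:, None].shape          # (3, 1)
(a - b[:, None]).shape    # (3, 4) - the length-1 axis stretches to 4
a - b[:, None]            # row i of the result is row i of a minus b[i]
```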
The NumPy broadcasting documentation has diagrams and worked examples that complement this section. Read it after finishing this lesson.
Answer these without scrolling back up. Each question is checkable - wrong answers reveal the full reasoning so you know what to re-read.
1. Given a = [1, 2, 3] and b = [4, 0, -1], compute a · b.
2. If X.shape == (1000, 7), how many features does each sample have?
3. What shape is A @ v where A.shape = (3, 4) and v.shape = (4,)?
4. If A.shape = (5, 3), what is A.T.shape?
5. Can you add a vector of shape (3,) directly to an array of shape (3, 4)?
6. Complete y_pred = X ___ w with the right operation so each row of X gets one prediction. The output shape should be (1000,).

For any that were fuzzy, jump back to the relevant section - or open the matrix visualizer to watch shapes transform live.
Key terms from this lesson:
- Feature: one measured attribute, like age or zip_code. Columns of your data matrix.
- Origin: the point (0, 0, …, 0) where all coordinates are zero. Vectors are drawn as arrows from the origin.
- Vector addition: element-wise, so [1, 2] + [3, 4] = [4, 6].
- Scalar: a single number, an int or float.
- L2 norm: √(v₁² + v₂² + …) - Pythagoras generalized.
- Matrix: m × n = m rows, n columns. Your ML dataset lives here.
- Transpose: A.T in NumPy. Shape (m, n) → (n, m).
- Matrix multiplication: each entry is a dot product of a row of A with a column of B. A @ B in Python. Non-commutative.
- Inner dimensions: in (m × n) @ (n × p), the two n's. Must match for matmul to be defined.
- Linear combination: c₁v₁ + c₂v₂ + …. Central to how matrices act on vectors.
- Linear model prediction: y_pred = X @ w for a learned weight vector w. Module 3.
- Tensor: an array with any number of axes (image batches, for example, have shape (batch, channels, H, W)).
- Bias: the +b in y = Wx + b.
- Batch: shape (B, n_features) is B stacked feature vectors.
- scikit-learn convention: the same .fit(X, y) / .predict(X) API across all models.

Continue to Lesson 2 - Matrices as Transformations. The timed Linear Algebra Drills are a module capstone covering all four lessons - save them for after Lesson 4.