Every ML model you'll ever use - linear regression, random forests, GPT-4 - is ultimately doing arithmetic on vectors and matrices. Learn to read that language fluently and the rest gets dramatically easier.
You already work with tables in SQL and dbt. A matrix is a table; a vector is a single row or column. The "math" part consists of a handful of named operations with consistent rules. By the end of this lesson you'll see your data the way a model sees it.
A vector is an ordered list of numbers. Vectors show up everywhere in ML because almost anything we want to reason about - a user, a document, a pixel, a word - can be described by a fixed list of measurements, and once you have a list of numbers, you can do math on it.
An ML model takes some input and produces some output. The input could be a user row, a photograph, an email, or a sentence - wildly different things. The model can't deal with "a user" in the abstract. It needs numbers - specifically, fixed-length, fixed-order lists of numbers. That's a vector. Converting real-world objects into vectors - and choosing which numbers to include - is half the job of an ML engineer. The other half is what you do to those vectors once you have them.
Take a single user in your database. Pick three things you know about them:
age = 34
lifetime_spend = 487.50
days_active = 120
Those are three Python variables. Pack them into a list in a fixed order - and agree the order never changes - and you have a vector:

$$u = \begin{bmatrix} 34 \\ 487.50 \\ 120 \end{bmatrix}$$
The vertical bracket notation writes "three numbers, stacked into a column." Some papers write it horizontally as $u = (34, 487.50, 120)$; they mean the same thing. In code:
import numpy as np
u = np.array([34, 487.50, 120])
u.shape # (3,) - "a 1-dimensional array of length 3"
That trailing comma in (3,) trips people up. The shape is a Python tuple with one entry - the entry being 3. NumPy always reports array shapes as tuples because shapes can have multiple numbers (you'll see (1000, 7) for a matrix soon). A one-dimensional array still gets a tuple, just with one entry. The comma is what tells Python "this is a tuple, not a parenthesized integer."
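A quick check that the shape really is an ordinary Python tuple - nothing below is specific to this example:

```python
import numpy as np

u = np.array([34, 487.50, 120])
type(u.shape)   # <class 'tuple'>
len(u.shape)    # 1 - one axis, i.e. a 1-D array
u.shape[0]      # 3 - that axis has length 3
```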
If I swap positions to [120, 34, 487.50], I've made a different vector. In ML terms, I've changed which feature lives in which dimension. A model that was trained expecting "age" in position 0 will happily take "days_active" there and quietly produce garbage predictions. No error. No warning. Just wrong answers.
This is the same class of bug as a silent column reorder in a Parquet file that a downstream dbt model reads - the pipeline still runs, but the numbers mean different things. In ML, the standard defense is to save the feature order (a list of column names) alongside the trained model, and assert the order matches at inference time. You'll write your first one of these in Module 3. A vector is only meaningful together with the fixed order of its features.
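A sketch of that defense, under the assumption that the feature order was saved as a plain JSON list next to the model; the file name and the `build_vector` helper are illustrative, not a standard API:

```python
import json
import numpy as np

# Training time: persist the feature order next to the model artifact
FEATURE_ORDER = ["age", "lifetime_spend", "days_active"]
with open("feature_order.json", "w") as f:
    json.dump(FEATURE_ORDER, f)

# Inference time: always assemble vectors in the saved order, fail loudly if anything is missing
def build_vector(row: dict, expected_order: list) -> np.ndarray:
    missing = [name for name in expected_order if name not in row]
    assert not missing, f"missing features: {missing}"
    return np.array([row[name] for name in expected_order])

u = build_vector({"age": 34, "lifetime_spend": 487.50, "days_active": 120}, FEATURE_ORDER)
```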
Each number inside a vector is a feature. The whole vector - all the features for one thing - is a feature vector, also called a sample or a data point. All three terms appear in docs and papers interchangeably.
There are two mental models for a vector. They describe the same object and emphasize different things. Code wants the first; geometric intuition lives in the second.
1. As a list. Numbers in a specific order. This is the computational picture - what's in memory, what NumPy stores, what you pass between functions. Most of the time this is the only view needed to write correct code.
2. As an arrow. In 2D, the vector [3, 1] is an arrow starting at the origin - the point (0, 0) - and ending at (3, 1). In 3D, same idea with three coordinates. Beyond 3D we can't draw the arrow, but we still talk about it: direction, length, angle. The math works identically; 2D pictures are a tool for building intuition.
The arrow view matters because operations on vectors have geometric meaning, and that meaning lets you reason about what a model is actually doing. Adding two vectors chains arrows tip-to-tail. Multiplying by 2 doubles an arrow's length. Multiplying by -1 flips it to point the other way. When the geometry looks simple, the algebra is simple too.
A word embedding in modern NLP is a vector with 300, 1024, or 4096 numbers. You cannot draw it. But every operation in this lesson - addition, dot product, norm, projection - extends to any number of dimensions with no changes to the formulas. The same rules work whether the vector has 2 components or 4096. If a concept only worked in 2D, it wouldn't be useful for ML; none of the concepts here are like that.
One edge case worth naming: [0, 0] is an arrow of zero length - the zero vector. It has no direction, and it's still a valid vector.
This section introduces four operations - addition, scalar multiplication, dot product, and norm - and explains what each one means, not just how to compute it. All four show up in every ML model, usually under the hood.
Add vectors element-wise: position 0 to position 0, position 1 to position 1, and so on. That's the computational rule.
Geometrically, place the second arrow's tail at the first arrow's tip. The sum is the arrow from the original origin to where you end up. If the first vector takes you "3 east and 1 north" and the second takes you "1 east and 2 north," the combination takes you "4 east and 3 north." Addition chains directions.
One reason to add feature vectors: in learned word embeddings, vector arithmetic roughly captures meaning.
king_vec - man_vec + woman_vec ≈ queen_vec
This doesn't hold perfectly - the pop-science version oversimplifies - but "+ woman - man" literally means "shift the meaning in the same direction that takes a male word to a female word." Addition of vectors is composition of whatever those vectors represent.
In NumPy:
a = np.array([3, 1])
b = np.array([1, 2])
a + b # array([4, 3])
A scalar - just a single number, not a vector or matrix - multiplied into a vector multiplies every component. Geometrically: it stretches the arrow. 2·v doubles the length. 0.5·v halves it. -1·v flips it to point the other way. 0·v collapses it to the origin.
Scalar multiplication is everywhere in ML. Re-weighting features (multiply a feature by 100 to convert meters to centimeters). Optimizer updates during training - the gradient descent rule is w = w - learning_rate * gradient, and the learning rate is a scalar that shrinks the update step. Every time you scale a vector, you're doing this.
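Both uses in a few lines of NumPy; the weights, gradient, and learning rate below are made-up stand-ins, not output from a real optimizer:

```python
import numpy as np

v = np.array([3.0, -2.0])
2 * v      # array([ 6., -4.]) - doubled length, same direction
-1 * v     # array([-3.,  2.]) - flipped
0.5 * v    # array([ 1.5, -1.]) - halved

# A gradient descent step: the learning rate is a scalar shrinking the update
w = np.array([0.2, 0.7])
gradient = np.array([1.0, -4.0])
learning_rate = 0.01
w = w - learning_rate * gradient   # array([0.19, 0.74])
```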
The dot product takes two vectors and returns a single number. It is the most important operation in this lesson, and arguably in the whole module.
The formula: multiply each corresponding pair of components, then sum the products.
Step through the calculation for a = [3, 1] and b = [2, 4]:
- Multiply position 0 of a by position 0 of b: 3 × 2 = 6
- Multiply position 1 of a by position 1 of b: 1 × 4 = 4
- Sum the products: 6 + 4 = 10

So a · b = 10. In NumPy, there are two equivalent spellings - np.dot(a, b) and the @ operator. The @ operator is matrix multiplication; for 1D vectors it equals the dot product. Run this and confirm the answer matches the hand calculation above:
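A minimal version of that check, reusing a and b from the hand calculation:

```python
import numpy as np

a = np.array([3, 1])
b = np.array([2, 4])

np.dot(a, b)   # 10
a @ b          # 10 - identical for 1-D arrays
```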
That's the computation. What it measures: the dot product is a number that tells you how much two vectors point in the same direction.
Why does the multiply-and-sum formula measure alignment? When two vectors point the same way, their components share signs, so every pairwise product is positive and they sum to a large positive. When they point opposite, signs disagree, products are negative, and the sum is negative. When perpendicular, the positive and negative products cancel. Run the three cases and compare:
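Here are three illustrative pairs, one per case; the specific vectors are arbitrary:

```python
import numpy as np

np.array([2, 1]) @ np.array([4, 2])    # 10  - aligned: large positive
np.array([2, 1]) @ np.array([-4, -2])  # -10 - anti-aligned: negative
np.array([2, 1]) @ np.array([-1, 2])   # 0   - perpendicular: the products cancel
```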
The perpendicular case is the one worth pausing on โ it's how ML decides two features (or two embedding directions) carry independent information. A zero dot product means neither vector contributes anything to the other's direction.
The dot product is the single most common operation in ML. A linear regression's prediction for one row is a dot product of that row's features with the learned weights. An attention score inside a transformer (the architecture behind GPT, BERT, and Claude) is a dot product between a "query" vector and a "key" vector. Semantic search uses cosine similarity - a normalized dot product - to find similar documents. Understanding dot products gives you the arithmetic of most models.
The norm of a vector - usually written ||v|| - is its length: ||v|| = √(v₁² + v₂² + … + vₙ²). This is specifically the L2 norm (also called the "Euclidean norm" because it's the one from ordinary Euclidean geometry). It's literally the Pythagorean theorem generalized to any number of dimensions. For [3, 4], the norm is √(9 + 16) = √25 = 5 - same as the hypotenuse of a 3-4-5 right triangle.
Three reasons vector length matters in ML:
- The distance between two data points a and b is ||a - b|| - the foundation of k-nearest neighbors, k-means clustering, and anomaly detection.
- Dividing a vector by its norm rescales it to unit length, keeping only its direction (the normalization line in the code below).
- Regularization penalties are norms of the weight vector - more on that at the end of this section and in Module 3.

v = np.array([3, 4])
np.linalg.norm(v) # 5.0
# Normalize to unit length
v / np.linalg.norm(v) # array([0.6, 0.8]) - same direction, length 1
# Distance between two data points
a = np.array([1, 1])
b = np.array([4, 5])
np.linalg.norm(a - b) # 5.0 - a is 5 units away from b
L2 is the default, but there's also the L1 norm (sum of absolute values: |v₁| + |v₂| + …) and the L∞ norm (the largest absolute value among components). Each captures a different notion of "length." L1 regularization (aka Lasso) uses the L1 norm and has a useful property - it drives some weights to exactly zero, giving you automatic feature selection. You'll meet L1 and L2 regularization in Module 3; for now, just know the family exists and they're all called "norms."
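All three come out of `np.linalg.norm` via its `ord` argument; a quick comparison on one vector:

```python
import numpy as np

v = np.array([3.0, -4.0])
np.linalg.norm(v)              # 5.0 - L2 (default): sqrt(9 + 16)
np.linalg.norm(v, ord=1)       # 7.0 - L1: |3| + |-4|
np.linalg.norm(v, ord=np.inf)  # 4.0 - L-infinity: max(|3|, |-4|)
```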
A matrix is a 2D grid of numbers โ rows and columns, like a spreadsheet or a SQL result set. The conventions and operations attached to it are what make it useful.
This is a 2×3 matrix: 2 rows, 3 columns. "m by n" (written m × n) always means m rows by n columns, in that order. The convention never varies.
A = np.array([[1, 2, 3],
[4, 5, 6]])
A.shape # (2, 3) - 2 rows, 3 columns
When you run SELECT age, spend, days FROM users LIMIT 1000, you get back a 1000-row, 3-column result. That's a matrix. In ML code the feature matrix is almost always named X, with shape (n_samples, n_features).
Every function in scikit-learn, every layer in PyTorch, every utility in pandas expects feature data in shape (n_samples, n_features). Rows = samples, columns = features. This is the single most important convention in the entire ML ecosystem.
When you see X.shape == (1000, 7), read it immediately as "1000 data points, each described by 7 features." If you accidentally transpose your data and pass (7, 1000), .fit() will happily treat each feature as a sample and each sample as a feature - silent garbage. This is a top-5 ML bug; sklearn will not save you from it.
Other names for this exact object: data matrix, feature matrix, design matrix (the stats-paper term), or just X. They all refer to the same (n_samples, n_features) table.
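A sketch of how that SQL result typically becomes X in Python, assuming the query has been loaded into a pandas DataFrame named `users`; the toy values below stand in for the real query result:

```python
import pandas as pd

# users = pd.read_sql("SELECT age, spend, days FROM users LIMIT 1000", conn)
users = pd.DataFrame({"age": [34, 29], "spend": [487.5, 120.0], "days": [120, 45]})

feature_cols = ["age", "spend", "days"]
X = users[feature_cols].to_numpy()  # rows = users, columns = features
X.shape                             # (2, 3) for this toy frame; (1000, 3) for the full query
```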
The transpose of a matrix flips rows and columns - what was row 1 becomes column 1, what was row 2 becomes column 2, and so on. Written $A^T$ in math, A.T in NumPy.
Element by element: the entry at row 1, column 1 of A (which is 1) stays at row 1, column 1 of Aᵀ. The entry at row 1, column 2 of A (which is 2) moves to row 2, column 1 of Aᵀ. In general, A[i][j] becomes Aᵀ[j][i] - indices swap. Shape-wise, a (3, 2) matrix transposes to a (2, 3) matrix. A (1000, 7) matrix transposes to (7, 1000).
A.T # transpose - rows become columns; shape swaps
A.T.T # transpose of a transpose is back to the original A
Transpose exists because many matrix operations have strict shape requirements (the next section), and transposing is the standard tool for reshaping things so operations line up. The ordinary least squares formula is w = (XᵀX)⁻¹ Xᵀ y - full of transposes because they're the glue that makes the shapes work out. Understanding that formula is a later-module concern; for now, note transpose's role as shape glue.
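A shape-only illustration of that glue role (the @ products themselves are the subject of the next section; only the shapes matter here):

```python
import numpy as np

X = np.random.randn(1000, 7)  # (n_samples, n_features)
y = np.random.randn(1000)     # one target per sample

(X.T @ X).shape   # (7, 7) - features x features, small enough to invert
(X.T @ y).shape   # (7,)   - one entry per feature, matching w's shape
```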
Matrix multiplication is where most people bounce off the math. The rule is mechanical but needs to be seen in slow motion the first time.
For A @ B where A is m × n and B is n × p, the result is an m × p matrix. The inner dimensions - A's columns and B's rows - must match. The outer dimensions become the shape of the result.
A visual for shape-matching:
   A       @       B       =    result
(m × n)         (n × p)         (m × p)
        └───────────────┘
            must match
           (inner dims)
Each entry in the result is itself a dot product. Specifically, the entry at row i, column j of the result is the dot product of row i of A with column j of B: $(AB)_{ij} = \sum_{k} A_{ik} B_{kj}$.
The formula says: for each pair (row of A, column of B), compute a dot product and place the result in that position. A worked example, step by step:
Take A = [[1, 2], [3, 4]] and B = [[5, 6], [7, 8]]. Both are 2×2. The inner dimensions match (both are 2). The result will be 2×2. Compute it entry by entry:
- Row 0 of A with column 0 of B: [1, 2] · [5, 7] = 1·5 + 2·7 = 5 + 14 = 19
- Row 0 of A with column 1 of B: [1, 2] · [6, 8] = 1·6 + 2·8 = 6 + 16 = 22
- Row 1 of A with column 0 of B: [3, 4] · [5, 7] = 3·5 + 4·7 = 15 + 28 = 43
- Row 1 of A with column 1 of B: [3, 4] · [6, 8] = 3·6 + 4·8 = 18 + 32 = 50

The same calculation in NumPy:
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
A @ B # matrix multiplication - this is the one you want
# array([[19, 22],
# [43, 50]])
A * B # element-wise multiplication - DIFFERENT operation, be careful
# array([[ 5, 12],
# [21, 32]])
In NumPy, * is element-wise and @ is matrix multiplication. They produce different shapes and different answers. In MATLAB and most math textbooks, * means matmul - so people coming from those worlds write A * B in Python and get silent garbage.
Every time you write a multiplication, ask: "element-wise or matmul?" If element-wise, A * B (shapes must match or broadcast). If matmul, A @ B (inner dims must match). Running that check mentally for the first month of ML work prevents hours of debugging.
A @ B is not the same as B @ A, even when both are defined. Sometimes only one ordering is defined at all - if A is (3, 4) and B is (4, 5), then A @ B works (result (3, 5)) but B @ A errors (inner dims 5 and 3 don't match). Matrix multiplication is non-commutative; order always matters.
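A quick demonstration with random matrices - only the shapes and the error matter, not the values:

```python
import numpy as np

A = np.random.randn(3, 4)
B = np.random.randn(4, 5)

(A @ B).shape   # (3, 5) - inner dims 4 and 4 match
B @ A           # raises ValueError - inner dims 5 and 3 don't match
```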
A special case of matrix multiplication deserves its own section: a matrix times a vector. If A is m × n and v has length n (shape (n,)), then A @ v is a vector of length m (shape (m,)). This operation is the core of almost every model you'll train, and there are two equivalent angles on it.
There are two equivalent views of this operation. Same answer; different perspectives.
View 1 - one dot product per row. The i-th output is the dot product of row i of A with v. If A has m rows, that's m dot products, stacked into a length-m vector.
For example, take A = [[2, 0], [0, 3]] and v = [1, 4]:
- Row 0 of A is [2, 0]. Dot with [1, 4]: 2·1 + 0·4 = 2. That's output[0].
- Row 1 of A is [0, 3]. Dot with [1, 4]: 0·1 + 3·4 = 12. That's output[1].
- Stack them: A @ v = [2, 12].

This is the computational view - it's what NumPy does under the hood. It's also the view that makes "linear regression is a matrix-vector product" fall into place.
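The same example in NumPy, as a sanity check on the hand computation:

```python
import numpy as np

A = np.array([[2, 0],
              [0, 3]])
v = np.array([1, 4])
A @ v   # array([ 2, 12]) - one dot product per row of A
```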
View 2 - a weighted combination of columns. Equivalently, A @ v equals "v[0] times column 0 of A, plus v[1] times column 1 of A, plus …". For our example: 1·[2, 0] + 4·[0, 3] = [2, 0] + [0, 12] = [2, 12].
Same answer, different framing. The vector v says "take 1 copy of A's first column and 4 copies of A's second column, and add them up." This is a linear combination of A's columns - a weighted sum.
The second view is what lets you understand what a matrix does geometrically: a matrix acts on an input vector by returning a specific combination of its own columns. That framing is the content of Lesson 2 (matrices as transformations) and underlies eigenvalues and PCA in Lesson 4. Both views are correct; advanced ML concepts lean heavily on View 2.
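View 2 written out explicitly for the same A and v; both lines produce the same array:

```python
import numpy as np

A = np.array([[2, 0],
              [0, 3]])
v = np.array([1, 4])

v[0] * A[:, 0] + v[1] * A[:, 1]   # array([ 2, 12]) - 1 copy of column 0 plus 4 copies of column 1
A @ v                             # array([ 2, 12]) - identical
```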
Suppose you have a data matrix X of shape (n_samples, n_features) and a weight vector w of shape (n_features,) that your optimizer has learned. The prediction for every sample is:
y_pred = X @ w # shape: (n_samples,) - one prediction per row
By View 1, each output entry is a dot product of one row of X (one sample's features) with w (the learned weights). That dot product is the prediction for that sample. Doing it for every row and stacking the results yields a vector of predictions โ one per sample.
Every linear model โ ordinary least squares, ridge, lasso, logistic regression โ is exactly this operation, differing only in how w is found. Neural networks are stacks of X @ W-style operations with a nonlinearity inserted between each one. A mental picture of X @ w covers most of ML.
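A minimal sketch of that stacking, with made-up layer sizes, random weights, and ReLU as the nonlinearity - an illustration of the structure, not a trained network:

```python
import numpy as np

n_samples, n_features, n_hidden = 1000, 7, 16
X = np.random.randn(n_samples, n_features)

W1 = np.random.randn(n_features, n_hidden)  # first layer weights
b1 = np.zeros(n_hidden)                     # first layer bias
w2 = np.random.randn(n_hidden)              # second layer weights

hidden = np.maximum(0, X @ W1 + b1)  # (1000, 16) - matmul, then ReLU nonlinearity
y_pred = hidden @ w2                 # (1000,)    - one prediction per sample
```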
The hardest part of deep learning is not the math - it's keeping track of tensor shapes as data flows through a model. A 3-line neural network function can have five implicit shape assumptions, and one wrong guess makes everything silently wrong. Engineers who develop shape fluency introduce fewer bugs; engineers who don't spend their afternoons in pdb.
Shape rules for everything in this lesson:
| Operation | Shape rule | Example |
|---|---|---|
| A + B or A * B (element-wise) | Shapes must match, or be broadcastable | (3,4) + (3,4) → (3,4) |
| A @ B (matmul, 2D × 2D) | Inner dims match: (m, n) @ (n, p) → (m, p) | (3,4) @ (4,5) → (3,5) |
| A @ v (matrix × 1D vector) | (m, n) @ (n,) → (m,) | (3,4) @ (4,) → (3,) |
| A.T (transpose) | Swap the two numbers: (m, n) → (n, m) | (5,3).T → (3,5) |
| v @ w (two 1D vectors) | Lengths must match; returns a scalar | (4,) @ (4,) → scalar |
Production ML code is full of assert statements checking expected shapes. They're cheap to write, they fail loudly the moment something's wrong, and they document the function's contract inline.
def predict(X, w):
    """
    Given features X with shape (n_samples, n_features)
    and weights w with shape (n_features,),
    return predictions with shape (n_samples,).
    """
    assert X.ndim == 2, f"X should be 2D, got shape {X.shape}"
    assert w.ndim == 1, f"w should be 1D, got shape {w.shape}"
    assert X.shape[1] == w.shape[0], (
        f"shape mismatch: X has {X.shape[1]} features, "
        f"w has {w.shape[0]} weights"
    )
    return X @ w
The alternative is debugging at 11pm because predictions were silently the wrong shape and downstream code took the first element of a broadcast result as "the answer." The assert pattern is a small investment for substantially fewer bugs. The NumPy ndarray docs cover every attribute - .shape, .ndim, .dtype, .T - and repay a skim.
Broadcasting is NumPy's rule for performing element-wise operations on arrays of different but compatible shapes, without a for-loop. It's how you subtract a mean vector from every row of a data matrix, add a bias vector to every sample in a batch, or multiply every column by a scaling factor. Almost every per-row or per-column transformation in NumPy/PyTorch code uses broadcasting.
Without broadcasting, you'd write explicit nested loops:
# The painful way
for i in range(X.shape[0]):
    for j in range(X.shape[1]):
        X[i, j] = X[i, j] - means[j]
That's slow (Python loops are interpreted), verbose, and easy to get wrong. With broadcasting, it's one line that runs in optimized C under the hood and reads almost like math:
X_centered = X - means
When shapes don't match, NumPy tries to "stretch" the smaller array along missing or length-1 dimensions so the shapes align. It only stretches when the result is unambiguous; otherwise it raises an error.
# Subtract the mean of each feature from every row
X = np.random.randn(1000, 5) # shape (1000, 5) - 1000 samples, 5 features
means = X.mean(axis=0) # shape (5,) - one mean per feature
X_centered = X - means # broadcast: (1000, 5) - (5,) -> (1000, 5)
# means is conceptually tiled to every row
Step by step, broadcasting did the following:
1. X has shape (1000, 5); means has shape (5,).
2. Shapes are compared from the right. X's last axis is 5; means's last (and only) axis is 5. Match.
3. X has an extra axis on the left (1000); means has nothing there. NumPy treats means as if it were tiled to (1000, 5) - the same (5,) vector, conceptually copied 1000 times.
4. The result has shape (1000, 5).

Another common case - adding a bias to every row of a batch:
batch = np.zeros((4, 3)) # shape (4, 3) - 4 samples, 3 features
bias = np.array([1.0, 2.0, 3.0]) # shape (3,)
batch + bias # shape (4, 3) - bias added to every row
# array([[1., 2., 3.],
# [1., 2., 3.],
# [1., 2., 3.],
# [1., 2., 3.]])
If the shapes are incompatible - axes that are neither the same length nor length-1 - you get a ValueError:
a = np.zeros((3, 4))
b = np.zeros((3,))
a + b # ERROR: shapes (3,4) and (3,) cannot broadcast
# the trailing axis of b (3) must match a's trailing axis (4), and it doesn't
The fix is usually a reshape. b[:, None] inserts a new axis and turns (3,) into (3, 1), which can broadcast against (3, 4) - NumPy stretches the length-1 axis to 4. Mentally performing this reshape is a reliable sign of broadcasting fluency.
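The failing example from above, fixed with that reshape:

```python
import numpy as np

a = np.zeros((3, 4))
b = np.array([1.0, 2.0, 3.0])

b[:, None].shape          # (3, 1)
(a - b[:, None]).shape    # (3, 4) - the length-1 axis stretches to 4
a - b[:, None]            # row i of the result is row i of a minus b[i]
```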
The NumPy broadcasting documentation has diagrams and worked examples that complement this section. Read it after finishing this lesson.
Answer these without scrolling back up. Each question is checkable - wrong answers reveal the full reasoning so you know what to re-read.
1. Given a = [1, 2, 3] and b = [4, 0, -1], compute a · b.
2. If X.shape == (1000, 7), how many features does each sample have?
3. What shape is A @ v where A.shape = (3, 4) and v.shape = (4,)?
4. If A.shape = (5, 3), what is A.T.shape?
5. Can you add a vector of shape (3,) directly to an array of shape (3, 4)?
6. Complete y_pred = X ___ w with the right operation so each row of X gets one prediction. The output shape should be (1000,).

For any that were fuzzy, jump back to the relevant section - or open the matrix visualizer to watch shapes transform live.
Key terms from this lesson:
- Feature: one measured attribute, like age or zip_code. Columns of your data matrix.
- Origin: the point (0, 0, …, 0) where all coordinates are zero. Vectors are drawn as arrows from the origin.
- Vector addition: element-wise, so [1, 2] + [3, 4] = [4, 6].
- Scalar: a single number, an int or float.
- L2 norm: √(v₁² + v₂² + …) - Pythagoras generalized.
- Matrix: m × n = m rows, n columns. Your ML dataset lives here.
- Transpose: A.T in NumPy. Shape (m, n) → (n, m).
- Matrix multiplication: each entry is a dot product of a row of A with a column of B. A @ B in Python. Non-commutative.
- Inner dimensions: in (m × n) @ (n × p), the two n's. Must match for matmul to be defined.
- Linear combination: c₁v₁ + c₂v₂ + …. Central to how matrices act on vectors.
- Linear model prediction: y_pred = X @ w for a learned weight vector w. Module 3.
- Tensor: an array with any number of axes (image batches, for example, have shape (batch, channels, H, W)).
- Bias: the +b in y = Wx + b.
- Batch: shape (B, n_features) is B stacked feature vectors.
- scikit-learn convention: the same .fit(X, y) / .predict(X) API across all models.

Continue to Lesson 2 - Matrices as Transformations. The timed Linear Algebra Drills are a module capstone covering all four lessons - save them for after Lesson 4.