Every time a model "learns" anything, it's running the same three-step loop: compute a loss, take a derivative, adjust the weights against the derivative. Master this loop and you've understood training, not for one particular model but for all of them.
You've written plenty of iterative algorithms: a retry loop with exponential backoff, a dbt incremental model that converges toward a steady state, a cron job that moves a watermark forward until it catches up to "now." Gradient descent is the same flavor: a loop with a tunable step size that terminates when some measure of progress stops shrinking. The difference is what measures progress (a loss function) and what direction to step (the negative gradient). By the end of this lesson you'll have built that loop by hand, the same loop that trains every neural network on earth.
Training an ML model means picking numbers (the weights) that make the model's predictions as close to the truth as possible on your training data. "As close as possible" is measured by a single number: the loss. Lower loss = better model. So training reduces to a pure optimization problem: find the weights that minimize the loss.
If there were only two or three weights, you could plot the loss as a surface and eyeball the lowest point. Real models have thousands to billions of weights. You can't plot a billion-dimensional surface, let alone eyeball it. You need an algorithm that, from any starting point, can tell you which way is downhill, without ever seeing the whole surface.
That algorithm is gradient descent. At each point, it needs a direction that locally decreases the loss. The derivative (for one variable) and the gradient (for many variables) provide exactly that: local information about which way the function is rising and how steeply. Follow the negative gradient, take a small step, repeat.
Derivatives and gradients are the tools that let an algorithm navigate a billion-dimensional loss landscape using only local information at its current point.
A derivative answers a single question: if I wiggle the input a tiny bit, how much does the output change? The ratio of output change to input change, taken in the limit as the wiggle shrinks to zero, is the derivative. Everything else in this lesson is a corollary of that one idea.
Take the function $f(x) = x^2$. At $x = 1$, $f(1) = 1$. At $x = 1.1$, $f(1.1) = 1.21$. So as $x$ moved from 1 to 1.1, $f$ moved from 1 to 1.21. The average rate of change over that interval is:

$$\frac{f(1.1) - f(1)}{1.1 - 1} = \frac{1.21 - 1}{0.1} = 2.1$$
Geometrically, that number is the slope of the straight line drawn through the two points $(1, 1)$ and $(1.1, 1.21)$ on the curve. That line is called a secant line; it cuts through the curve at two points.
Now shrink the wiggle. Try $x = 1$ and $x = 1.01$:

$$\frac{f(1.01) - f(1)}{0.01} = \frac{1.0201 - 1}{0.01} = 2.01$$
And again, $x = 1$ and $x = 1.001$:

$$\frac{f(1.001) - f(1)}{0.001} = \frac{1.002001 - 1}{0.001} = 2.001$$
The slopes are heading toward 2. As the wiggle shrinks, the secant slope converges to a specific number. That limiting number is the derivative at $x = 1$.
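You can reproduce the whole experiment in a few lines of plain Python; a quick sketch, using the same function and points as above:

```python
def f(x):
    return x ** 2

# secant slope (f(1 + h) - f(1)) / h for a shrinking wiggle h
for h in [0.1, 0.01, 0.001, 0.0001]:
    slope = (f(1 + h) - f(1)) / h
    print(f"h = {h:<6}: slope = {slope:.6f}")
# prints 2.1, 2.01, 2.001, 2.0001 -- converging toward 2
```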
The symbolic version uses a variable wiggle $h$ and lets it shrink to zero. This is the actual definition of a derivative:

$$f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$$
For $f(x) = x^2$, plug in and expand:

$$\frac{(x + h)^2 - x^2}{h} = \frac{2xh + h^2}{h} = 2x + h$$
Now let $h \to 0$. The $h$ term disappears and we're left with:

$$f'(x) = 2x$$
The power-rule shortcut "bring the exponent down, subtract one" is the compressed answer you get after running this limit argument in general. The rules throughout calculus are derived the same way.
Evaluated at $x = 1$: $f'(1) = 2$, matching the numerical experiment.
The derivative is one object, but it shows up in three framings: a limiting rate of change (the wiggle ratio), the slope of the tangent line (the limit of the secant slopes), and a local sensitivity (how strongly the output responds to a small change in the input).
Drag the slider to shrink the wiggle. Three things happen: the second point slides toward the first, the secant line rotates into the tangent line, and the displayed slope converges to the derivative.
You will almost never compute a derivative by hand in production ML code; autograd does it for you. But you need to read derivatives when they appear in papers, blog posts, and error messages. The minimum working set:
| Rule | Result | Why you see it |
|---|---|---|
| Constant | $\frac{d}{dx}(c) = 0$ | Bias terms, regularization constants |
| Power | $\frac{d}{dx}(x^n) = n x^{n-1}$ | Squared-error loss, L2 regularization |
| Sum | $(f + g)' = f' + g'$ | Total loss = sum of per-sample losses |
| Product | $(fg)' = f'g + fg'$ | Layer outputs × activations |
| Exponential | $\frac{d}{dx}(e^x) = e^x$ | Softmax, sigmoid, cross-entropy |
| Logarithm | $\frac{d}{dx}(\ln x) = \frac{1}{x}$ | Cross-entropy loss, log-likelihood |
Worked example: $\frac{d}{dx}(5x^3 - 2x + 7)$. Apply sum and constant rules, then the power rule to each term: $15x^2 - 2 + 0 = 15x^2 - 2$. The 7 disappears because the derivative of a constant is zero: constants don't change as $x$ changes. The $-2x$ becomes $-2$ because $\frac{d}{dx}(x) = 1$.
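A quick numerical spot-check of that result, using the finite-difference trick from Section 2 (the test point $x = 2$ is an arbitrary choice):

```python
def f(x):
    return 5 * x**3 - 2 * x + 7

def f_prime(x):
    return 15 * x**2 - 2          # the hand-derived derivative

x, h = 2.0, 1e-6
numeric = (f(x + h) - f(x)) / h   # secant slope with a tiny wiggle
print(numeric, f_prime(x))        # both close to 58.0
```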
Neural networks are not one function; they are compositions of many. A typical forward pass looks like:

$$L = \text{loss}\big(f_3(f_2(f_1(x; w_1); w_2); w_3)\big)$$
That's a function wrapped in a function wrapped in a function, layered down through every layer of the network. To train it, we need the derivative of the loss with respect to each weight, buried many levels deep inside that composition. The chain rule tells us how to do it.
If $h(x) = f(g(x))$, a function of a function, then:

$$h'(x) = f'(g(x)) \cdot g'(x)$$
In English: derivative of the outer (evaluated at the inner) times the derivative of the inner. A useful way to remember it: imagine you're tracking how a small wiggle in $x$ propagates. First $x$ wiggles $g$ by $g'(x) \cdot \Delta x$. Then $g$'s wiggle propagates through $f$, which turns it into $f'(g(x))$ times that. The two rates multiply.
Compute $h'(x)$ for $h(x) = (3x + 1)^2$. The outer function is $f(u) = u^2$ with $f'(u) = 2u$; the inner is $g(x) = 3x + 1$ with $g'(x) = 3$. So $h'(x) = 2(3x + 1) \cdot 3 = 6(3x + 1)$.
Sanity check at $x = 0$: $h(0) = 1$, $h(0.01) = (0.03 + 1)^2 = 1.0609$. Numerical slope $\approx (1.0609 - 1) / 0.01 = 6.09$. Formula gives $h'(0) = 6(0 + 1) = 6$. Close enough; the tiny discrepancy is the same $h \to 0$ limit effect we saw in Section 2.
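The same sanity check, automated over a few points (a small sketch; the test points are arbitrary, and the formula is the $6(3x + 1)$ just derived):

```python
def h_func(x):
    return (3 * x + 1) ** 2

def h_prime(x):
    return 6 * (3 * x + 1)        # chain-rule result from above

eps = 1e-6
for x in [0.0, 1.0, -2.0]:
    numeric = (h_func(x + eps) - h_func(x)) / eps
    print(f"x = {x:5}: numeric = {numeric:9.4f}, formula = {h_prime(x):9.4f}")
```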
Backpropagation, the algorithm that trains every deep network, is the chain rule applied repeatedly, layer by layer, starting from the loss and walking backward through the composition to each weight. The hard part isn't the math; it's the bookkeeping of intermediate values at scale. Autograd (which you'll see in Section 10) handles that bookkeeping automatically. When a reference says "the gradient flows backward through the network," the thing flowing is chain-rule products.
A common mistake: seeing $(f \circ g)(x)$ and writing the derivative as $f'(g(x))$, forgetting the $\cdot g'(x)$ multiplier. The multiplier is where the rule does its real work. Without it, a network with 50 layers would give you the wrong gradient at every layer. That $g'(x)$ factor is also why activation functions matter: if $g'$ is close to zero (a vanishing gradient), you multiply many small numbers together and the product goes to zero long before it reaches the early layers. Picking activations with well-behaved derivatives is half the art of deep learning.
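You can see the vanishing effect with plain arithmetic. Assume, purely for illustration, that each of 50 layers contributes a derivative factor of 0.25 (roughly the maximum slope of a sigmoid):

```python
grad_factor = 0.25   # illustrative per-layer derivative (max slope of a sigmoid)
product = 1.0
for layer in range(50):
    product *= grad_factor   # one chain-rule factor per layer
print(product)               # ~7.9e-31: the gradient signal has vanished
```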
A real ML loss function doesn't depend on one scalar weight; it depends on millions. For a function of several variables, we take a partial derivative: the derivative with respect to one variable, holding the others fixed. Notation:

$$\frac{\partial f}{\partial x}$$
The rounded $\partial$ (called "del" or "partial") distinguishes a partial derivative from an ordinary derivative $\frac{d}{dx}$. The computation is straightforward: pretend every other variable is a constant, and differentiate as usual.
Worked example: $f(x, y) = 3x^2 + 2xy + y^3$. Treating $y$ as a constant, $\frac{\partial f}{\partial x} = 6x + 2y$ (the $y^3$ term drops out). Treating $x$ as a constant, $\frac{\partial f}{\partial y} = 2x + 3y^2$ (the $3x^2$ term drops out).
Partial derivatives are ordinary derivatives taken with every other variable held fixed.
The gradient of a multi-variable function, written $\nabla f$ (pronounced "del f" or "nabla f"), is a vector containing every partial derivative:

$$\nabla f = \begin{bmatrix} \frac{\partial f}{\partial x} \\ \frac{\partial f}{\partial y} \end{bmatrix}$$
For our function $f(x, y) = 3x^2 + 2xy + y^3$:

$$\nabla f = \begin{bmatrix} 6x + 2y \\ 2x + 3y^2 \end{bmatrix}$$
A gradient is a vector (it has the same length as the number of variables), so it lives in the same shape-world as Lesson 1. If $w$ has 10,000 weights, $\nabla L(w)$ is a 10,000-dimensional vector, shaped the same as $w$, one entry per weight.
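Putting the two partials together in code, with a finite-difference cross-check (the evaluation point $(1, 2)$ is an arbitrary choice):

```python
import numpy as np

def f(v):
    x, y = v
    return 3 * x**2 + 2 * x * y + y**3

def grad_analytic(v):
    x, y = v
    return np.array([6 * x + 2 * y, 2 * x + 3 * y**2])

v = np.array([1.0, 2.0])
eps = 1e-6
grad_numeric = np.zeros_like(v)
for i in range(len(v)):                   # wiggle one coordinate at a time
    dv = np.zeros_like(v)
    dv[i] = eps
    grad_numeric[i] = (f(v + dv) - f(v - dv)) / (2 * eps)

print(grad_numeric)       # ~[10. 14.]
print(grad_analytic(v))   # [10. 14.], same shape as v
```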
At any point where $f$ is differentiable, $\nabla f$ points in the direction of steepest ascent, and its magnitude $\|\nabla f\|$ is how steep that ascent is. To go downhill fastest, move in the direction $-\nabla f$. This is the reason gradient descent uses a minus sign.
If you stepped along $\nabla f$, you'd be climbing as fast as possible: loss goes up, training diverges. Stepping against it descends as fast as possible. A flipped sign here is a common cause of first-time "my training isn't working" bugs. Autograd handles the sign for you in PyTorch (that's what `.backward()` + the optimizer's minus-sign combo does), but if you're writing the loop yourself, which you'll do in the exercise, the sign is your responsibility.
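A short demonstration of the sign, reusing $f(x, y) = 3x^2 + 2xy + y^3$ and its gradient $[10, 14]$ at $(1, 2)$ from above (the step size 0.01 is illustrative):

```python
import numpy as np

f = lambda v: 3 * v[0]**2 + 2 * v[0] * v[1] + v[1]**3
w = np.array([1.0, 2.0])
g = np.array([10.0, 14.0])   # gradient of f at (1, 2), computed above
step = 0.01

print(f(w))               # 15.0 at the starting point
print(f(w + step * g))    # larger: stepping WITH the gradient climbs
print(f(w - step * g))    # smaller: stepping AGAINST it descends
```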
The ingredients so far: a loss $L(w)$ that scores the current weights (lower is better), a gradient $\nabla L(w)$ that points in the direction of steepest ascent, and the fact that $-\nabla L(w)$ is the fastest way downhill.
The gradient descent update rule:

$$w \leftarrow w - \eta \, \nabla L(w)$$
where $\eta$ (Greek letter "eta") is the learning rate: a small positive number controlling how big each step is. In pseudocode:
```python
w = initial_guess
for step in range(num_steps):
    grad = compute_gradient(L, w)   # vector, same shape as w
    w = w - learning_rate * grad    # one step downhill
    # stop when grad is tiny, or loss stopped shrinking, or we hit max steps
```
That loop (compute gradient, step against it, repeat) is how every neural network learns. Adam, SGD, RMSprop, AdamW: all variants of this idea with cleverer step rules. The skeleton is always the same.
Too large a learning rate and you jump clean over the minimum: loss oscillates, then blows up to infinity or NaN. Too small and training crawls; you run out of budget before getting close. Typical starting values: 1e-3 for Adam, 1e-2 to 1e-1 for plain SGD. When a training run looks broken and you don't know why, the learning rate is always the first thing to check.
The analogue in ETL work is a retry-backoff tuning knob: too aggressive and you overshoot the target service, too timid and you never catch up. In gradient descent, "overshooting" doesn't mean rate-limiting; it means the loss diverges to infinity and you get a NaN in your training logs at 3 a.m.
Quick check: a run hits `NaN` on step 2. Which way is the learning rate almost certainly wrong: too large or too small?

Now a worked example. Minimize $f(x, y) = x^2 + 3y^2$. By inspection the minimum is at $(0, 0)$; assume we don't know that.
Gradient: $\nabla f = \begin{bmatrix} 2x \\ 6y \end{bmatrix}$. Starting point: $(4, 3)$. Learning rate: $\eta = 0.1$. First step: $\nabla f(4, 3) = (8, 18)$, so the update is $w \leftarrow (4, 3) - 0.1 \cdot (8, 18) = (3.2, 1.2)$.
The $y$ coordinate moved faster than $x$ (3 → 1.2 is a bigger relative move than 4 → 3.2) because the gradient in $y$ was larger: coefficient 6 vs. 2. The steeper a direction, the bigger the step taken there. Gradient descent takes big strides down steep slopes and gentler strides down gradual ones.
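Here is that run as a minimal NumPy loop (a sketch: the gradient is hard-coded from the formula above, and `eta` is the knob the next paragraph asks you to vary):

```python
import numpy as np

def grad(w):
    return np.array([2 * w[0], 6 * w[1]])   # gradient of x^2 + 3y^2

w = np.array([4.0, 3.0])   # starting point
eta = 0.1                  # learning rate

for step in range(20):
    w = w - eta * grad(w)                   # one step downhill
    loss = w[0]**2 + 3 * w[1]**2
    print(f"step {step:2d}: w = {w.round(4)}, loss = {loss:.6f}")
```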
Run this as-is, then re-run with `eta = 0.001` (watch training crawl) and `eta = 0.4` (watch it diverge to `inf` or `NaN`). That's the learning-rate sensitivity from the earlier warning, in numerical form.
The playground below embeds the same loop with a draggable starting point and a learning-rate slider: the loss surface rendered as contour lines, with the optimizer's path drawn on top as you step. Use it to build intuition for what an oscillating or diverging run looks like. Try learning rates of 0.05, 0.3, and 0.4 on the default function; identify the one where the path starts zig-zagging and then flies off the surface.
In real ML code, you don't write gradients one variable at a time. You write them as vector/matrix expressions, using the shape rules from Lesson 1. Example: the mean-squared-error loss for linear regression:

$$L(w) = \frac{1}{n} \|Xw - y\|^2$$
Here $X$ is the feature matrix of shape $(n, d)$, $y$ is the target vector of length $n$, and $w$ is the weight vector of length $d$. The loss is a single number. Its gradient with respect to $w$ is:

$$\nabla_w L = \frac{2}{n} X^\top (Xw - y)$$
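As a shape sanity check, here's that gradient on synthetic data (random numbers stand in for real features; the dimensions match the quick check below):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.normal(size=(n, d))        # feature matrix, shape (n, d)
y = rng.normal(size=n)             # targets, shape (n,)
w = np.zeros(d)                    # weights, shape (d,)

residual = X @ w - y               # predictions minus targets, shape (n,)
grad = (2 / n) * X.T @ residual    # the gradient formula above

print(grad.shape)                  # same shape as w
```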
You don't need to derive this right now; you'll do it carefully in Module 3. Three properties to register today: the gradient has the same shape as $w$ (length $d$, one entry per weight); it's built entirely from the matrix operations of Lesson 1, with no per-variable differentiation in sight; and in NumPy it's a single line, `grad = (2 / n) * X.T @ (X @ w - y)`. That's the line a linear regression optimizer loops over.
Quick check: if `X.shape = (100, 5)`, `w.shape = (5,)`, and `y.shape = (100,)`, what is the shape of `(2/n) * X.T @ (X @ w - y)`?

You just did several pages of calculus. In production you will do almost none of it. Modern frameworks provide autograd (automatic differentiation), which computes derivatives for you, exactly (not numerically), for any computation you write using their primitives.
The mental model: when you compute a forward pass with PyTorch tensors that have `requires_grad=True`, PyTorch builds a graph of every operation under the hood. When you call `.backward()` on the loss, it walks that graph backward applying the chain rule at each node, and leaves the gradient in `.grad` on every parameter tensor.
```python
import torch

# Same toy problem as Section 8
w = torch.tensor([4.0, 3.0], requires_grad=True)  # track grads for this tensor

loss = w[0]**2 + 3 * w[1]**2  # forward pass, builds the graph
loss.backward()               # walks the graph backward, fills w.grad
print(w.grad)                 # tensor([8., 18.]) == [2x, 6y] at (4, 3), matches Section 8

# One manual gradient-descent step:
with torch.no_grad():   # don't track this update
    w -= 0.1 * w.grad
w.grad.zero_()          # reset grads for next iteration
```
You will write this pattern hundreds of times. In practice you won't write the manual step; you'll use a PyTorch optimizer like `torch.optim.SGD` or `torch.optim.Adam` that encapsulates the update rule:
```python
import torch

w = torch.tensor([4.0, 3.0], requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

for step in range(20):
    optimizer.zero_grad()           # clear leftover gradients from last step
    loss = w[0]**2 + 3 * w[1]**2    # forward pass
    loss.backward()                 # autograd fills w.grad
    optimizer.step()                # applies: w = w - lr * w.grad
    print(f"step {step:2d}: w={w.data.numpy().round(3)}, loss={loss.item():.4f}")
```
This is the canonical training loop. Every PyTorch model you'll ever write is some version of it, with a more elaborate forward pass and a more elaborate loss. Module 5 builds from here.
Check yourself: which call fills `w.grad` with the gradient? And why call `optimizer.zero_grad()` at the top of each iteration?

The PyTorch autograd mechanics note is one of the most valuable pieces of documentation in the ML ecosystem, worth a full read once. For the classical side, `scipy.optimize` provides quasi-Newton methods (L-BFGS) that work for small/mid problems with exact gradients. And for the pure math background, the Wikipedia derivative article is more accessible than most textbooks.
Common training failures and their first-response moves:
- Loss blows up to `NaN`. First moves: lower the learning rate by 10×; add gradient clipping (clamp $\|\nabla\|$ to a max norm, as sketched below); check for numerical issues in the loss (e.g., `log(0)`).
- Loss rises instead of falling. First move: put `assert loss_new <= loss_old + tol` in your dev loop; it catches the bug on iteration 1.

Quick check: a previously healthy run hits `NaN` after 200 steps. What's the first thing to try?
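In PyTorch, the clipping move is one line between `backward()` and `step()`. A sketch on the toy problem from Section 8 (`max_norm=1.0` and the tolerance are illustrative values, not recommendations):

```python
import torch

w = torch.tensor([4.0, 3.0], requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)
prev_loss = float("inf")

for step in range(20):
    optimizer.zero_grad()
    loss = w[0]**2 + 3 * w[1]**2
    loss.backward()
    torch.nn.utils.clip_grad_norm_([w], max_norm=1.0)  # clamp ||grad|| before the step
    optimizer.step()
    assert loss.item() <= prev_loss + 1e-6, "loss went up -- check lr and sign"
    prev_loss = loss.item()
```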
Answer these without scrolling back up. Each question is checkable; wrong answers reveal the reasoning so you know what to re-read.

- If `w.shape = (10,)` and $L(w)$ is a scalar loss, what is the shape of $\nabla L(w)$?
- A run's loss goes to `inf`, then `NaN`, around step 300. First thing to try?
- Numerically check a gradient: wiggle `w` by `h`, measure the change in `f`, divide by `h` (a sketch follows this list). The last printed line should be `18.0`, matching the analytic answer.
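A minimal version of that last check, assuming the toy $f(x, y) = x^2 + 3y^2$ from Section 8 evaluated at $(4, 3)$, where the analytic gradient is $[8, 18]$:

```python
def f(x, y):
    return x**2 + 3 * y**2

x, y, h = 4.0, 3.0, 1e-6
df_dx = (f(x + h, y) - f(x, y)) / h   # wiggle x only
df_dy = (f(x, y + h) - f(x, y)) / h   # wiggle y only
print(round(df_dx, 1))                # 8.0  == 2x at x = 4
print(round(df_dy, 1))                # 18.0 == 6y at y = 3
```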
For any that were fuzzy, jump back to the relevant section, or scroll up and run the playground again until the mechanics feel mechanical.

Optimizer: SGD, Adam, RMSprop, etc. Stores state (momentum, running averages) and applies `w = w - lr * grad` (plus extras) when you call `.step()`.

Lesson 4 (Eigenvalues & SVD) completes the math foundation and unlocks PCA, rank, and a deeper view of what matrices actually do to space. After that, your first hands-on exercise implements gradient descent from scratch in `exercises/exercise-02-gradient-descent.ipynb`; this lesson is the prerequisite for every line of that notebook.