Module 1 Β· Lesson 3 Β· ~60 min read + play Β· How models learn

Derivatives & Gradients: The Engine of Learning

Every time a model "learns" anything, it's running the same three-step loop: compute a loss, take a derivative, adjust the weights against the derivative. Master this loop and you've understood training β€” not a particular model, all of them.

Bridge from what you know

You've written plenty of iterative algorithms β€” a retry loop with exponential backoff, a dbt incremental model that converges toward a steady state, a cron job that moves a watermark forward until it catches up to "now." Gradient descent is the same flavor: a loop with a tunable step size that terminates when some measure of progress stops shrinking. The difference is what measures progress (a loss function) and what direction to step (the negative gradient). By the end of this lesson you'll have built that loop by hand β€” the same loop that trains every neural network on earth.

1. Why derivatives at all?

Training an ML model means picking numbers (the weights) that make the model's predictions as close to the truth as possible on your training data. "As close as possible" is measured by a single number: the loss. Lower loss = better model. So training reduces to a pure optimization problem: find the weights that minimize the loss.

If there were only two or three weights, you could plot the loss as a surface and eyeball the lowest point. Real models have thousands to billions of weights. You can't plot a billion-dimensional surface, let alone eyeball it. You need an algorithm that, from any starting point, can tell you which way is downhill β€” without ever seeing the whole surface.

That algorithm is gradient descent. At each point, it needs a direction that locally decreases the loss. The derivative (for one variable) and the gradient (for many variables) provide exactly that: local information about which way the function is rising and how steeply. Follow the negative gradient, take a small step, repeat.

Why can't you just plot the loss surface and find the minimum by eye for a real model?
The one-sentence version

Derivatives and gradients are the tools that let an algorithm navigate a billion-dimensional loss landscape using only local information at its current point.

In a billion-dimensional weight space, what does computing a gradient at a point give you?
At its core, what is "training an ML model"?

2. The derivative, from scratch

A derivative answers a single question: if I wiggle the input a tiny bit, how much does the output change? The ratio β€” output change over input change, taken in the limit as the wiggle shrinks to zero β€” is the derivative. Everything else in this lesson is a corollary of that one idea.

The slope of a secant line

Take the function $f(x) = x^2$. At $x = 1$, $f(1) = 1$. At $x = 1.1$, $f(1.1) = 1.21$. So as $x$ moved from 1 to 1.1, $f$ moved from 1 to 1.21. The average rate of change over that interval is:

$$\frac{f(1.1) - f(1)}{1.1 - 1} = \frac{1.21 - 1}{0.1} = \frac{0.21}{0.1} = 2.1$$

Geometrically, that number is the slope of the straight line drawn through the two points $(1, 1)$ and $(1.1, 1.21)$ on the curve. That line is called a secant line β€” it cuts through the curve at two points.

Now shrink the wiggle. Try $x = 1$ and $x = 1.01$:

$$\frac{f(1.01) - f(1)}{1.01 - 1} = \frac{1.0201 - 1}{0.01} = 2.01$$

And again, $x = 1$ and $x = 1.001$:

$$\frac{f(1.001) - f(1)}{1.001 - 1} = \frac{1.002001 - 1}{0.001} = 2.001$$

The slopes are heading toward 2. As the wiggle shrinks, the secant slope converges to a specific number. That limiting number is the derivative at $x = 1$.
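That convergence is easy to reproduce yourself β€” a minimal sketch in plain Python:

```python
def f(x):
    return x**2

x = 1.0
for h in [0.1, 0.01, 0.001, 0.0001]:
    slope = (f(x + h) - f(x)) / h    # secant slope over a shrinking interval
    print(f"h={h}: slope={slope:.4f}")
```

The printed slopes march toward 2, matching the limit.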

For $f(x) = x^2$, compute the secant slope between $x = 2$ and $x = 2.1$.

The limit definition

The symbolic version uses a variable wiggle $h$ and lets it shrink to zero. This is the actual definition of a derivative:

$$f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$$

For $f(x) = x^2$, plug in and expand:

$$\frac{f(x + h) - f(x)}{h} = \frac{(x + h)^2 - x^2}{h} = \frac{x^2 + 2xh + h^2 - x^2}{h} = \frac{2xh + h^2}{h} = 2x + h$$

Now let $h \to 0$. The $h$ term disappears and we're left with:

$$f'(x) = 2x$$

The power-rule shortcut "bring the exponent down, subtract one" is the compressed answer you get after running this limit argument in general. The rules throughout calculus are derived the same way.

Evaluated at $x = 1$: $f'(1) = 2$, matching the numerical experiment.

For $f(x) = x^2$, what is $f'(3)$?

Three framings of the derivative

The derivative is one object, but it shows up in three framings:

πŸ“ˆ Slope of the tangent

[Interactive plot: $f(x) = x^2$ with a draggable slider for $x$; the readout shows the point $(x, f(x))$ and the tangent slope $f'(x)$ β€” e.g. at $x = 1.00$, $f(x) = 1.00$ and $f'(x) = 2.00$.]

Drag the slider and watch the tangent point, the slope readout, and the sign of the slope change as $x$ moves.

Move the slider to $x = -1.5$. Before looking: is the tangent's slope positive, zero, or negative?

3. Derivative rules you'll actually use

You will almost never compute a derivative by hand in production ML code β€” autograd does it for you. But you need to read derivatives when they appear in papers, blog posts, and error messages. The minimum working set:

| Rule | Result | Why you see it |
| --- | --- | --- |
| Constant | $\frac{d}{dx}(c) = 0$ | Bias terms, regularization constants |
| Power | $\frac{d}{dx}(x^n) = n x^{n-1}$ | Squared-error loss, L2 regularization |
| Sum | $(f + g)' = f' + g'$ | Total loss = sum of per-sample losses |
| Product | $(fg)' = f'g + fg'$ | Layer outputs Γ— activations |
| Exponential | $\frac{d}{dx}(e^x) = e^x$ | Softmax, sigmoid, cross-entropy |
| Logarithm | $\frac{d}{dx}(\ln x) = \frac{1}{x}$ | Cross-entropy loss, log-likelihood |

Worked example: $\frac{d}{dx}(5x^3 - 2x + 7)$. Apply sum and constant rules, then the power rule to each term: $15x^2 - 2 + 0 = 15x^2 - 2$. The 7 disappears because the derivative of a constant is zero β€” constants don't change as $x$ changes. The $-2x$ becomes $-2$ because $\frac{d}{dx}(x) = 1$.
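A quick finite-difference spot-check of that result β€” a sketch, not production code:

```python
def f(x):
    return 5 * x**3 - 2 * x + 7

def f_prime(x):
    return 15 * x**2 - 2       # the hand-derived answer

x, h = 2.0, 1e-6
numeric = (f(x + h) - f(x)) / h
print(numeric, f_prime(x))     # both close to 58
```

The tiny mismatch is the same $h \to 0$ limit effect from Section 2.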

What is $\frac{d}{dx}(4x^3)$?
What is $\frac{d}{dx}(3x^2 + 100)$?

4. The chain rule β€” the rule that makes deep learning possible

Neural networks are not one function; they are compositions of many. A typical forward pass looks like:

$$\text{loss}(y, \;\text{softmax}(W_3 \cdot \text{relu}(W_2 \cdot \text{relu}(W_1 \cdot x + b_1) + b_2) + b_3))$$

That's a function wrapped in a function wrapped in a function, layered down through every layer of the network. To train it, we need the derivative of the loss with respect to each weight β€” buried many levels deep inside that composition. The chain rule tells us how to do it.

The rule itself

If $h(x) = f(g(x))$ β€” a function of a function β€” then:

$$\frac{dh}{dx} = f'(g(x)) \cdot g'(x)$$

In English: derivative of the outer (evaluated at the inner) times the derivative of the inner. A useful way to remember it: imagine you're tracking how a small wiggle in $x$ propagates. First $x$ wiggles $g$ by $g'(x) \cdot \Delta x$. Then $g$'s wiggle propagates through $f$, which turns it into $f'(g(x))$ times that. The two rates multiply.

For $h(x) = (5x - 2)^2$, which is the inner function and which is the outer?

Worked example

Compute $h'(x)$ for $h(x) = (3x + 1)^2$.

  1. Identify the inner function: $g(x) = 3x + 1$. Its derivative: $g'(x) = 3$.
  2. Identify the outer function: $f(u) = u^2$. Its derivative: $f'(u) = 2u$.
  3. Plug into the chain rule: $h'(x) = f'(g(x)) \cdot g'(x) = 2(3x + 1) \cdot 3 = 6(3x + 1)$.

Sanity check at $x = 0$: $h(0) = 1$, $h(0.01) = (0.03 + 1)^2 = 1.0609$. Numerical slope $\approx (1.0609 - 1) / 0.01 = 6.09$. Formula gives $h'(0) = 6(0 + 1) = 6$. Close enough β€” the tiny discrepancy is the same $h \to 0$ limit effect we saw in Section 2.

Using $h'(x) = 6(3x + 1)$ from the worked example, compute $h'(1)$.
Backprop = the chain rule, run at scale

Backpropagation β€” the algorithm that trains every deep network β€” is the chain rule applied repeatedly, layer by layer, starting from the loss and walking backward through the composition to each weight. The hard part isn't the math; it's the bookkeeping of intermediate values at scale. Autograd (which you'll see in Section 10) handles that bookkeeping automatically. When a reference says "the gradient flows backward through the network," the thing flowing is chain-rule products.

The inner-derivative multiplier

A common mistake: seeing $(f \circ g)(x)$ and writing the derivative as $f'(g(x))$, forgetting the $\cdot g'(x)$ multiplier. The multiplier is where the rule does its real work. Without it, a network with 50 layers would give you the wrong gradient at every layer. That $g'(x)$ factor is also why activation functions matter β€” if $g'$ is close to zero (a vanishing gradient), you multiply many small numbers together and the product goes to zero long before it reaches the early layers. Picking activations with well-behaved derivatives is half the art of deep learning.
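The multiply-many-small-numbers effect is easy to see in isolation. A toy sketch, assuming each of 50 layers contributes an inner derivative of 0.25 (roughly a saturated sigmoid):

```python
grad = 1.0
for layer in range(50):
    grad *= 0.25    # one chain-rule factor per layer
print(grad)         # ~7.9e-31 β€” effectively zero by the early layers
```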

5. Partial derivatives β€” when inputs are plural

A real ML loss function doesn't depend on one scalar weight β€” it depends on millions. For a function of several variables, we take a partial derivative: the derivative with respect to one variable, holding the others fixed. Notation:

$$\frac{\partial f}{\partial x} = \text{how much } f \text{ changes if I wiggle only } x, \text{ keeping } y \text{ frozen}$$

The rounded $\partial$ (called "del" or "partial") distinguishes a partial derivative from an ordinary derivative $\frac{d}{dx}$. The computation is straightforward: pretend every other variable is a constant, and differentiate as usual.

Worked example: $f(x, y) = 3x^2 + 2xy + y^3$. Treating $y$ as a constant, $\frac{\partial f}{\partial x} = 6x + 2y$ (the $y^3$ term vanishes). Treating $x$ as a constant, $\frac{\partial f}{\partial y} = 2x + 3y^2$ (the $3x^2$ term vanishes).
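Both partials can be spot-checked numerically by wiggling one variable while freezing the other β€” a quick sketch:

```python
def f(x, y):
    return 3 * x**2 + 2 * x * y + y**3

x, y, h = 1.0, 1.0, 1e-6
df_dx = (f(x + h, y) - f(x, y)) / h    # wiggle x, freeze y
df_dy = (f(x, y + h) - f(x, y)) / h    # wiggle y, freeze x
print(df_dx, df_dy)    # ~8 and ~5, matching 6x + 2y and 2x + 3y^2 at (1, 1)
```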

Partial derivatives are ordinary derivatives taken with every other variable held fixed.

For $f(x, y) = x^2 y + 5y$, what is $\frac{\partial f}{\partial x}$?
For $f(x, y) = x^2 y + 5y$, what is $\frac{\partial f}{\partial y}$?

6. The gradient β€” all the partials, packed together

The gradient of a multi-variable function, written $\nabla f$ (pronounced "del f" or "nabla f"), is a vector containing every partial derivative:

$$\nabla f(x, y) = \begin{bmatrix} \partial f / \partial x \\ \partial f / \partial y \end{bmatrix}$$

For our function $f(x, y) = 3x^2 + 2xy + y^3$:

$$\nabla f(x, y) = \begin{bmatrix} 6x + 2y \\ 2x + 3y^2 \end{bmatrix}$$

A gradient is a vector β€” it has the same length as the number of variables β€” so it lives in the same shape-world as Lesson 1. If $w$ has 10,000 weights, $\nabla L(w)$ is a 10,000-dimensional vector, shaped the same as $w$, one entry per weight.
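In code, a gradient is literally an array with the same shape as its input. A sketch using the running example's partials:

```python
import numpy as np

def grad_f(w):
    x, y = w
    return np.array([6 * x + 2 * y, 2 * x + 3 * y**2])   # the two partials, stacked

g = grad_f(np.array([1.0, 1.0]))
print(g, g.shape)    # [8. 5.] with shape (2,) β€” one entry per variable
```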

For $\nabla f(x, y) = [6x + 2y,\; 2x + 3y^2]$, what is $\nabla f(1, 2)$?
If $w$ is a weight vector of shape $(10000,)$, what is the shape of $\nabla L(w)$?

Geometric meaning of the gradient

The gradient points uphill β€” steepest uphill

At any point where $f$ is differentiable, $\nabla f$ points in the direction of steepest ascent, and its magnitude $\|\nabla f\|$ is how steep that ascent is. To go downhill fastest, move in the direction $-\nabla f$. This is the reason gradient descent uses a minus sign.

If you stepped along $\nabla f$, you'd be climbing as fast as possible β€” loss goes up, training diverges. Stepping against it descends as fast as possible. A flipped sign here is a common cause of first-time "my training isn't working" bugs. Autograd handles the sign for you in PyTorch (that's what .backward() + the optimizer's minus-sign combo does), but if you're writing the loop yourself β€” which you'll do in the exercise β€” the sign is your responsibility.
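Here's the sign mistake in miniature, on $f(x) = x^2$ at $x = 3$ (a toy sketch):

```python
def f(x):
    return x**2

x, eta = 3.0, 0.1
grad = 2 * x                 # f'(3) = 6

x_down = x - eta * grad      # against the gradient: correct, loss drops
x_up   = x + eta * grad      # flipped sign: loss climbs instead

print(f(x_down), f(x_up))    # ~5.76 vs ~12.96 β€” the flipped sign made things worse
```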

To decrease a loss function as fast as possible from the current point, which direction do you step?

7. Gradient descent β€” the algorithm that trains everything

The ingredients so far: a loss function $L(w)$ that measures how bad the current weights $w$ are, and a gradient $\nabla L(w)$ that points in the direction of steepest ascent. Combine them with a minus sign and you get the gradient descent update rule:

$$w_{\text{new}} = w_{\text{old}} - \eta \cdot \nabla L(w_{\text{old}})$$

where $\eta$ (Greek letter "eta") is the learning rate β€” a small positive number controlling how big each step is. In pseudocode:

w = initial_guess
for step in range(num_steps):
    grad = compute_gradient(L, w)   # vector, same shape as w
    w    = w - learning_rate * grad # one step downhill
    # stop when grad is tiny, or loss stopped shrinking, or we hit max steps

That loop β€” compute gradient, step against it, repeat β€” is how every neural network learns. Adam, SGD, RMSprop, AdamW: all variants of this idea with cleverer step rules. The skeleton is always the same.

In a real training loop, what are the typical stopping conditions?
If $w = [2, 3]$, $\nabla L(w) = [2, -4]$, and $\eta = 0.1$, what is $w_{\text{new}}$ after one gradient-descent step?
Learning rate is the #1 knob

Too large and you jump clean over the minimum β€” loss oscillates, then blows up to infinity or NaN. Too small and training crawls; you run out of budget before getting close. Typical starting values: 1e-3 for Adam, 1e-2 to 1e-1 for plain SGD. When a training run looks broken and you don't know why, the learning rate is always the first thing to check.

The analogue in ETL work is a retry-backoff tuning knob: too aggressive and you overshoot the target service, too timid and you never catch up. In gradient descent, "overshooting" doesn't mean rate-limiting β€” it means the loss diverges to infinity and you get a NaN in your training logs at 3 a.m.

Three common questions

You're training a model and the loss goes to NaN on step 2. Which learning rate is almost certainly the problem?

8. A concrete gradient descent run, by hand

Minimize $f(x, y) = x^2 + 3y^2$. By inspection the minimum is at $(0, 0)$; assume we don't know that.

Gradient: $\nabla f = \begin{bmatrix} 2x \\ 6y \end{bmatrix}$. Starting point: $(4, 3)$. Learning rate: $\eta = 0.1$.

Step 0, longhand

  1. Current point: $w = [4, 3]$. Loss: $f(4, 3) = 16 + 27 = 43$.
  2. Gradient here: $\nabla f(4, 3) = [2 \cdot 4, 6 \cdot 3] = [8, 18]$.
  3. Step: $w_{\text{new}} = [4, 3] - 0.1 \cdot [8, 18] = [4 - 0.8, 3 - 1.8] = [3.2, 1.2]$.
  4. New loss: $f(3.2, 1.2) = 10.24 + 4.32 = 14.56$. Dropped from 43 to 14.56 in one step.

The $y$ coordinate moved faster than $x$ (3 β†’ 1.2 is a bigger relative move than 4 β†’ 3.2) because the gradient in $y$ was larger β€” coefficient 6 vs. 2. The steeper a direction, the bigger the step taken there. Gradient descent takes big strides down steep slopes and gentler strides down gradual ones.

Starting from $w = [3.2, 1.2]$ after step 0, compute step 1. What is the loss after step 1?

The full run in NumPy

Run this as-is, then re-run with eta = 0.001 (watch training crawl) and eta = 0.4 (watch it diverge to inf or NaN). The learning-rate sensitivity from the earlier warning, in numerical form.
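A minimal NumPy version of that run (Section 8's function, starting point, and learning rate; swap in the other eta values to see the crawl and the divergence):

```python
import numpy as np

def f(w):
    x, y = w
    return x**2 + 3 * y**2

def grad_f(w):
    x, y = w
    return np.array([2 * x, 6 * y])

w = np.array([4.0, 3.0])    # starting point from Section 8
eta = 0.1                   # try 0.001 (crawl) and 0.4 (divergence) too

for step in range(20):
    g = grad_f(w)
    w = w - eta * g         # step against the gradient
    print(f"step {step:2d}: w={np.round(w, 4)}, loss={f(w):.6f}")
```

The first printed line should match the longhand step 0 above: $w = [3.2, 1.2]$, loss 14.56.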

Of these learning rates, which is the first that causes the optimizer to diverge on $f(x, y) = x^2 + 3y^2$?

The playground below embeds the same loop with a draggable starting point and a learning-rate slider β€” the loss surface rendered as contour lines, with the optimizer's path drawn on top as you step. Use it to build intuition for what an oscillating or diverging run looks like. Try learning rates of 0.05, 0.3, and 0.4 on the default function; identify the one where the path starts zig-zagging and then flies off the surface.

9. Vectorized gradients β€” the form you'll actually write

In real ML code, you don't write gradients one variable at a time. You write them as vector/matrix expressions, using the shape rules from Lesson 1. Example: the mean-squared-error loss for linear regression.

$$L(w) = \frac{1}{n} \|Xw - y\|^2$$

Here $X$ is the feature matrix of shape $(n, d)$, $y$ is the target vector of length $n$, and $w$ is the weight vector of length $d$. The loss is a single number. Its gradient with respect to $w$ is:

$$\nabla_w L = \frac{2}{n} X^T (Xw - y)$$
Inside the gradient formula, what does the vector $(Xw - y)$ represent?

You don't need to derive this right now β€” you'll do it carefully in Module 3. Three properties to register today: the gradient $\nabla_w L$ has the same shape as $w$ (one entry per weight); the residual $(Xw - y)$ has one entry per sample; and the whole expression is computed with matrix operations, no per-weight loop required.

If X.shape = (100, 5), w.shape = (5,), and y.shape = (100,), what is the shape of (2/n) * X.T @ (X @ w - y)?
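A quick shape check of that expression with random data (shapes as in the text; the random draws are just placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.normal(size=(n, d))            # feature matrix, shape (100, 5)
y = rng.normal(size=n)                 # targets, shape (100,)
w = rng.normal(size=d)                 # weights, shape (5,)

grad = (2 / n) * X.T @ (X @ w - y)     # vectorized MSE gradient
print(grad.shape)                      # same shape as w: (5,)
```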

10. Automatic differentiation β€” why you rarely hand-derive

You just did several pages of calculus. In production you will do almost none of it. Modern frameworks provide autograd β€” automatic differentiation β€” which computes derivatives for you, exactly (not numerically), for any computation you write using their primitives.

The mental model: when you compute a forward pass with PyTorch tensors that have requires_grad=True, PyTorch builds a graph of every operation under the hood. When you call .backward() on the loss, it walks that graph backward applying the chain rule at each node, and leaves the gradient in .grad on every parameter tensor.

import torch

# Same toy problem as Section 8
w = torch.tensor([4.0, 3.0], requires_grad=True)   # track grads for this tensor
loss = w[0]**2 + 3 * w[1]**2                        # forward pass, builds the graph

loss.backward()                                      # walks the graph backward, fills w.grad
print(w.grad)       # tensor([8., 18.])  ==  [2x, 6y] at (4, 3) β€” matches Section 8

# One manual gradient-descent step:
with torch.no_grad():                                # don't track this update
    w -= 0.1 * w.grad
    w.grad.zero_()                                   # reset grads for next iteration

You will write this pattern hundreds of times. In practice you won't write the manual step β€” you'll use a PyTorch optimizer like torch.optim.SGD or torch.optim.Adam that encapsulates the update rule:

import torch

w = torch.tensor([4.0, 3.0], requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

for step in range(20):
    optimizer.zero_grad()                # clear leftover gradients from last step
    loss = w[0]**2 + 3 * w[1]**2         # forward pass
    loss.backward()                       # autograd fills w.grad
    optimizer.step()                      # applies: w = w - lr * w.grad
    print(f"step {step:2d}: w={w.data.numpy().round(3)}, loss={loss.item():.4f}")

This is the canonical training loop. Every PyTorch model you'll ever write is some version of it, with a more elaborate forward pass and a more elaborate loss. Module 5 builds from here.

In a PyTorch training loop, which call actually fills w.grad with the gradient?
Why do PyTorch training loops always call optimizer.zero_grad() at the top of each iteration?
Read the source docs

The PyTorch autograd mechanics note is one of the most valuable pieces of documentation in the ML ecosystem β€” worth a full read once. For the classical side, scipy.optimize provides quasi-Newton methods (L-BFGS) that work for small/mid problems with exact gradients. And for the pure math background, the Wikipedia derivative article is more accessible than most textbooks.

11. What goes wrong β€” and what you do about it

Common training failures and their first-response moves:

  1. Loss shoots to inf or NaN β€” learning rate too high. Cut it by 10Γ— and rerun.
  2. Loss climbs monotonically β€” likely a flipped sign: you're stepping along the gradient instead of against it.
  3. Loss flatlines from step 0 β€” learning rate too small, or gradients vanishing somewhere in the composition.

Your training loss goes to NaN after 200 steps. What's the first thing to try?
A custom training loop: loss climbs every step, monotonically. Most likely culprit?

12. Self-check

Answer these without scrolling back up. Each question is checkable β€” wrong answers reveal the reasoning so you know what to re-read.

What is the derivative of $5x^3 - 2x + 7$ evaluated at $x = 0$?
For $h(x) = (2x + 1)^2$, compute $h'(4.5)$.
For $f(x, y) = x^2 y + 5y$, what is $\frac{\partial f}{\partial x}$?
If $\nabla L(w) = [2, -3, 1]$ and $\eta = 0.1$, what is $\Delta w = w_{\text{new}} - w_{\text{old}}$?
If w.shape = (10,) and $L(w)$ is a scalar loss, what is the shape of $\nabla L(w)$?
Why does the gradient descent update rule subtract the gradient rather than adding it?
A training loss climbs to inf then NaN around step 300. First thing to try?
Wiggle each coordinate of w by h, measure the change in f, divide by h. The last printed line should be 18.0, matching the analytic answer.
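The check described there can be sketched in a few lines, assuming Section 8's function and the point $(4, 3)$:

```python
import numpy as np

def f(w):
    return w[0]**2 + 3 * w[1]**2       # Section 8's function

w = np.array([4.0, 3.0])
h = 1e-5

for i in range(len(w)):
    bumped = w.copy()
    bumped[i] += h                      # wiggle coordinate i only
    slope = (f(bumped) - f(w)) / h      # change in f, divided by h
    print(round(slope, 1))              # prints 8.0, then 18.0
```

The two printed values match the analytic gradient $[2x, 6y] = [8, 18]$ at $(4, 3)$.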

For any that were fuzzy, jump back to the relevant section β€” or scroll up and run the playground again until the mechanics feel mechanical.

Glossary β€” all terms from this lesson
Derivative
How fast a function's output changes when you wiggle its input, in the limit as the wiggle shrinks. Slope of the tangent line. Written $f'(x)$ or $\tfrac{df}{dx}$.
Slope
Rise over run. For a curve, the slope at a point is the slope of the tangent line there β€” the derivative.
Rate of change
Output change per unit of input change. The derivative is the instantaneous rate of change (taken in the limit).
Stationary point
A point where $f'(x) = 0$ (or $\nabla f = 0$ in many variables). Could be a minimum, maximum, or saddle.
Minimum / Local minimum
A point where $f$ is lower than everything nearby. Gradient is zero and the curvature bends upward.
Global minimum
The single lowest point anywhere in the domain. For convex functions, any local minimum is also the global minimum.
Convex function
Bowl-shaped: any line between two points on the graph lies above or on the graph. For these, gradient descent converges to the global minimum.
Chain rule
Derivative of a composition: $(f \circ g)'(x) = f'(g(x)) \cdot g'(x)$. The mathematical backbone of backpropagation.
Partial derivative
Derivative of a multi-variable function with respect to one variable, holding the others fixed. Written $\tfrac{\partial f}{\partial x}$.
Gradient
The vector of all partial derivatives, $\nabla f$. Same shape as the input. Points in the direction of steepest ascent.
Loss function
A scalar-valued function of the weights that measures how bad the model's predictions are. Training minimizes it.
Gradient descent
Iterative optimizer: $w \leftarrow w - \eta \nabla L(w)$. Repeatedly step against the gradient to reduce loss.
Learning rate
The step size $\eta$ in gradient descent. Too big diverges, too small crawls. The #1 hyperparameter.
Convergence
The optimizer has (approximately) stopped moving β€” gradient near zero, loss flatlining, or max steps hit.
Vanishing gradients
Gradient values shrink toward zero (often in early layers of deep nets), so those layers stop learning.
Backpropagation
Algorithm that computes gradients of a loss through a neural network by applying the chain rule layer by layer, from output backward to input.
Autograd
Automatic differentiation. The framework feature (PyTorch, TensorFlow, JAX) that computes exact gradients of any computation you build with its primitives.
Optimizer
The object that encapsulates the update rule in training loops β€” SGD, Adam, RMSprop, etc. Stores state (momentum, running averages) and applies w = w - lr * grad (plus extras) when you call .step().
Linear regression
The simplest ML model: $y_{\text{pred}} = Xw$. Its MSE loss has a clean closed-form gradient that you'll implement in Module 3.
What's next

Lesson 4 β€” Eigenvalues & SVD β€” completes the math foundation and unlocks PCA, rank, and a deeper view of what matrices actually do to space. After that, your first hands-on exercise implements gradient descent from scratch in exercises/exercise-02-gradient-descent.ipynb β€” this lesson is the prerequisite for every line of that notebook.