Every time a model "learns" anything, it's running the same three-step loop: compute a loss, take a derivative, adjust the weights against the derivative. Master this loop and you've understood training, not for one particular model but for all of them.
You've written plenty of iterative algorithms: a retry loop with exponential backoff, a dbt incremental model that converges toward a steady state, a cron job that moves a watermark forward until it catches up to "now." Gradient descent is the same flavor: a loop with a tunable step size that terminates when some measure of progress stops shrinking. The difference is what measures progress (a loss function) and what direction to step (the negative gradient). By the end of this lesson you'll have built that loop by hand, the same loop that trains every neural network on earth.
Training an ML model means picking numbers (the weights) that make the model's predictions as close to the truth as possible on your training data. "As close as possible" is measured by a single number: the loss. Lower loss = better model. So training reduces to a pure optimization problem: find the weights that minimize the loss.
If there were only two or three weights, you could plot the loss as a surface and eyeball the lowest point. Real models have thousands to billions of weights. You can't plot a billion-dimensional surface, let alone eyeball it. You need an algorithm that, from any starting point, can tell you which way is downhill, without ever seeing the whole surface.
That algorithm is gradient descent. At each point, it needs a direction that locally decreases the loss. The derivative (for one variable) and the gradient (for many variables) provide exactly that: local information about which way the function is rising and how steeply. Follow the negative gradient, take a small step, repeat.
Derivatives and gradients are the tools that let an algorithm navigate a billion-dimensional loss landscape using only local information at its current point.
A derivative answers a single question: if I wiggle the input a tiny bit, how much does the output change? The ratio of output change to input change, taken in the limit as the wiggle shrinks to zero, is the derivative. Everything else in this lesson is a corollary of that one idea.
Take the function $f(x) = x^2$. At $x = 1$, $f(1) = 1$. At $x = 1.1$, $f(1.1) = 1.21$. So as $x$ moved from 1 to 1.1, $f$ moved from 1 to 1.21. The average rate of change over that interval is:

$$\frac{f(1.1) - f(1)}{1.1 - 1} = \frac{1.21 - 1}{0.1} = 2.1$$
Geometrically, that number is the slope of the straight line drawn through the two points $(1, 1)$ and $(1.1, 1.21)$ on the curve. That line is called a secant line; it cuts through the curve at two points.
Now shrink the wiggle. Try $x = 1$ and $x = 1.01$:

$$\frac{f(1.01) - f(1)}{0.01} = \frac{1.0201 - 1}{0.01} = 2.01$$
And again, $x = 1$ and $x = 1.001$:

$$\frac{f(1.001) - f(1)}{0.001} = \frac{1.002001 - 1}{0.001} = 2.001$$
The slopes are heading toward 2. As the wiggle shrinks, the secant slope converges to a specific number. That limiting number is the derivative at $x = 1$.
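You can reproduce the whole experiment in a few lines of plain Python; a quick sketch, using the same function and points as above:

```python
def f(x):
    return x ** 2

# secant slope (f(1 + h) - f(1)) / h for a shrinking wiggle h
for h in [0.1, 0.01, 0.001, 0.0001]:
    slope = (f(1 + h) - f(1)) / h
    print(f"h = {h:<6}: slope = {slope:.6f}")
# prints 2.1, 2.01, 2.001, 2.0001 -- converging toward 2
```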
The symbolic version uses a variable wiggle $h$ and lets it shrink to zero. This is the actual definition of a derivative:

$$f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$$
For $f(x) = x^2$, plug in and expand:

$$\frac{(x + h)^2 - x^2}{h} = \frac{2xh + h^2}{h} = 2x + h$$
Now let $h \to 0$. The $h$ term disappears and we're left with:

$$f'(x) = 2x$$
The power-rule shortcut "bring the exponent down, subtract one" is the compressed answer you get after running this limit argument in general. The rules throughout calculus are derived the same way.
Evaluated at $x = 1$: $f'(1) = 2$, matching the numerical experiment.
The derivative is one object, but it shows up in three framings: a limiting rate of change (the wiggle ratio), the slope of the tangent line (the limit of the secant slopes), and a local sensitivity (how strongly the output responds to a small change in the input).
Drag the slider to shrink the wiggle. Three things happen: the second point slides toward the first, the secant line rotates into the tangent line, and the displayed slope converges to the derivative.
You will almost never compute a derivative by hand in production ML code; autograd does it for you. But you need to read derivatives when they appear in papers, blog posts, and error messages. The minimum working set:
| Rule | Result | Why you see it |
|---|---|---|
| Constant | $\frac{d}{dx}(c) = 0$ | Bias terms, regularization constants |
| Power | $\frac{d}{dx}(x^n) = n x^{n-1}$ | Squared-error loss, L2 regularization |
| Sum | $(f + g)' = f' + g'$ | Total loss = sum of per-sample losses |
| Product | $(fg)' = f'g + fg'$ | Layer outputs × activations |
| Exponential | $\frac{d}{dx}(e^x) = e^x$ | Softmax, sigmoid, cross-entropy |
| Logarithm | $\frac{d}{dx}(\ln x) = \frac{1}{x}$ | Cross-entropy loss, log-likelihood |
Worked example: $\frac{d}{dx}(5x^3 - 2x + 7)$. Apply sum and constant rules, then the power rule to each term: $15x^2 - 2 + 0 = 15x^2 - 2$. The 7 disappears because the derivative of a constant is zero: constants don't change as $x$ changes. The $-2x$ becomes $-2$ because $\frac{d}{dx}(x) = 1$.
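A quick numerical spot-check of that result, using the finite-difference trick from Section 2 (the test point $x = 2$ is an arbitrary choice):

```python
def f(x):
    return 5 * x**3 - 2 * x + 7

def f_prime(x):
    return 15 * x**2 - 2          # the hand-derived derivative

x, h = 2.0, 1e-6
numeric = (f(x + h) - f(x)) / h   # secant slope with a tiny wiggle
print(numeric, f_prime(x))        # both close to 58.0
```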
Neural networks are not one function; they are compositions of many. A typical forward pass looks like:

$$L = \text{loss}\big(f_3(f_2(f_1(x; w_1); w_2); w_3)\big)$$
That's a function wrapped in a function wrapped in a function, layered down through every layer of the network. To train it, we need the derivative of the loss with respect to each weight, buried many levels deep inside that composition. The chain rule tells us how to do it.
If $h(x) = f(g(x))$, a function of a function, then:

$$h'(x) = f'(g(x)) \cdot g'(x)$$
In English: derivative of the outer (evaluated at the inner) times the derivative of the inner. A useful way to remember it: imagine you're tracking how a small wiggle in $x$ propagates. First $x$ wiggles $g$ by $g'(x) \cdot \Delta x$. Then $g$'s wiggle propagates through $f$, which turns it into $f'(g(x))$ times that. The two rates multiply.
Compute $h'(x)$ for $h(x) = (3x + 1)^2$. The outer function is $f(u) = u^2$ with $f'(u) = 2u$; the inner is $g(x) = 3x + 1$ with $g'(x) = 3$. So $h'(x) = 2(3x + 1) \cdot 3 = 6(3x + 1)$.
Sanity check at $x = 0$: $h(0) = 1$, $h(0.01) = (0.03 + 1)^2 = 1.0609$. Numerical slope $\approx (1.0609 - 1) / 0.01 = 6.09$. Formula gives $h'(0) = 6(0 + 1) = 6$. Close enough; the tiny discrepancy is the same $h \to 0$ limit effect we saw in Section 2.
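The same sanity check, automated over a few points (a small sketch; the test points are arbitrary, and the formula is the $6(3x + 1)$ just derived):

```python
def h_func(x):
    return (3 * x + 1) ** 2

def h_prime(x):
    return 6 * (3 * x + 1)        # chain-rule result from above

eps = 1e-6
for x in [0.0, 1.0, -2.0]:
    numeric = (h_func(x + eps) - h_func(x)) / eps
    print(f"x = {x:5}: numeric = {numeric:9.4f}, formula = {h_prime(x):9.4f}")
```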
Backpropagation, the algorithm that trains every deep network, is the chain rule applied repeatedly, layer by layer, starting from the loss and walking backward through the composition to each weight. The hard part isn't the math; it's the bookkeeping of intermediate values at scale. Autograd (which you'll see in Section 10) handles that bookkeeping automatically. When a reference says "the gradient flows backward through the network," the thing flowing is chain-rule products.
A common mistake: seeing $(f \circ g)(x)$ and writing the derivative as $f'(g(x))$, forgetting the $\cdot g'(x)$ multiplier. The multiplier is where the rule does its real work. Without it, a network with 50 layers would give you the wrong gradient at every layer. That $g'(x)$ factor is also why activation functions matter: if $g'$ is close to zero (a vanishing gradient), you multiply many small numbers together and the product goes to zero long before it reaches the early layers. Picking activations with well-behaved derivatives is half the art of deep learning.
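You can see the vanishing effect with plain arithmetic. Assume, purely for illustration, that each of 50 layers contributes a derivative factor of 0.25 (roughly the maximum slope of a sigmoid):

```python
grad_factor = 0.25   # illustrative per-layer derivative (max slope of a sigmoid)
product = 1.0
for layer in range(50):
    product *= grad_factor   # one chain-rule factor per layer
print(product)               # ~7.9e-31: the gradient signal has vanished
```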
A real ML loss function doesn't depend on one scalar weight; it depends on millions. For a function of several variables, we take a partial derivative: the derivative with respect to one variable, holding the others fixed. Notation:

$$\frac{\partial f}{\partial x}$$
The rounded $\partial$ (called "del" or "partial") distinguishes a partial derivative from an ordinary derivative $\frac{d}{dx}$. The computation is straightforward: pretend every other variable is a constant, and differentiate as usual.
Worked example: $f(x, y) = 3x^2 + 2xy + y^3$. Treating $y$ as a constant, $\frac{\partial f}{\partial x} = 6x + 2y$ (the $y^3$ term drops out). Treating $x$ as a constant, $\frac{\partial f}{\partial y} = 2x + 3y^2$ (the $3x^2$ term drops out).
Partial derivatives are ordinary derivatives taken with every other variable held fixed.
The gradient of a multi-variable function, written $\nabla f$ (pronounced "del f" or "nabla f"), is a vector containing every partial derivative:

$$\nabla f = \begin{bmatrix} \frac{\partial f}{\partial x} \\ \frac{\partial f}{\partial y} \end{bmatrix}$$
For our function $f(x, y) = 3x^2 + 2xy + y^3$:

$$\nabla f = \begin{bmatrix} 6x + 2y \\ 2x + 3y^2 \end{bmatrix}$$
A gradient is a vector (it has the same length as the number of variables), so it lives in the same shape-world as Lesson 1. If $w$ has 10,000 weights, $\nabla L(w)$ is a 10,000-dimensional vector, shaped the same as $w$, one entry per weight.
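Putting the two partials together in code, with a finite-difference cross-check (the evaluation point $(1, 2)$ is an arbitrary choice):

```python
import numpy as np

def f(v):
    x, y = v
    return 3 * x**2 + 2 * x * y + y**3

def grad_analytic(v):
    x, y = v
    return np.array([6 * x + 2 * y, 2 * x + 3 * y**2])

v = np.array([1.0, 2.0])
eps = 1e-6
grad_numeric = np.zeros_like(v)
for i in range(len(v)):                   # wiggle one coordinate at a time
    dv = np.zeros_like(v)
    dv[i] = eps
    grad_numeric[i] = (f(v + dv) - f(v - dv)) / (2 * eps)

print(grad_numeric)       # ~[10. 14.]
print(grad_analytic(v))   # [10. 14.], same shape as v
```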
At any point where $f$ is differentiable, $\nabla f$ points in the direction of steepest ascent, and its magnitude $\|\nabla f\|$ is how steep that ascent is. To go downhill fastest, move in the direction $-\nabla f$. This is the reason gradient descent uses a minus sign.
If you stepped along $\nabla f$, you'd be climbing as fast as possible: loss goes up, training diverges. Stepping against it descends as fast as possible. A flipped sign here is a common cause of first-time "my training isn't working" bugs. Autograd handles the sign for you in PyTorch (that's what `.backward()` + the optimizer's minus-sign combo does), but if you're writing the loop yourself, which you'll do in the exercise, the sign is your responsibility.
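A short demonstration of the sign, reusing $f(x, y) = 3x^2 + 2xy + y^3$ and its gradient $[10, 14]$ at $(1, 2)$ from above (the step size 0.01 is illustrative):

```python
import numpy as np

f = lambda v: 3 * v[0]**2 + 2 * v[0] * v[1] + v[1]**3
w = np.array([1.0, 2.0])
g = np.array([10.0, 14.0])   # gradient of f at (1, 2), computed above
step = 0.01

print(f(w))               # 15.0 at the starting point
print(f(w + step * g))    # larger: stepping WITH the gradient climbs
print(f(w - step * g))    # smaller: stepping AGAINST it descends
```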
The ingredients so far: a loss $L(w)$ that scores the current weights (lower is better), a gradient $\nabla L(w)$ that points in the direction of steepest ascent, and the fact that $-\nabla L(w)$ is the fastest way downhill.
The gradient descent update rule:

$$w \leftarrow w - \eta \, \nabla L(w)$$
where $\eta$ (Greek letter "eta") is the learning rate: a small positive number controlling how big each step is. In pseudocode:
```python
w = initial_guess
for step in range(num_steps):
    grad = compute_gradient(L, w)   # vector, same shape as w
    w = w - learning_rate * grad    # one step downhill
    # stop when grad is tiny, or loss stopped shrinking, or we hit max steps
```
That loop (compute gradient, step against it, repeat) is how every neural network learns. Adam, SGD, RMSprop, AdamW: all variants of this idea with cleverer step rules. The skeleton is always the same.
Too large a learning rate and you jump clean over the minimum: loss oscillates, then blows up to infinity or NaN. Too small and training crawls; you run out of budget before getting close. Typical starting values: 1e-3 for Adam, 1e-2 to 1e-1 for plain SGD. When a training run looks broken and you don't know why, the learning rate is always the first thing to check.
The analogue in ETL work is a retry-backoff tuning knob: too aggressive and you overshoot the target service, too timid and you never catch up. In gradient descent, "overshooting" doesn't mean rate-limiting; it means the loss diverges to infinity and you get a NaN in your training logs at 3 a.m.
Quick check: a run hits `NaN` on step 2. Which way is the learning rate almost certainly wrong: too large or too small?

Now a worked example. Minimize $f(x, y) = x^2 + 3y^2$. By inspection the minimum is at $(0, 0)$; assume we don't know that.
Gradient: $\nabla f = \begin{bmatrix} 2x \\ 6y \end{bmatrix}$. Starting point: $(4, 3)$. Learning rate: $\eta = 0.1$. First step: $\nabla f(4, 3) = (8, 18)$, so the update is $w \leftarrow (4, 3) - 0.1 \cdot (8, 18) = (3.2, 1.2)$.
The $y$ coordinate moved faster than $x$ (3 → 1.2 is a bigger relative move than 4 → 3.2) because the gradient in $y$ was larger: coefficient 6 vs. 2. The steeper a direction, the bigger the step taken there. Gradient descent takes big strides down steep slopes and gentler strides down gradual ones.
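Here is that run as a minimal NumPy loop (a sketch: the gradient is hard-coded from the formula above, and `eta` is the knob the next paragraph asks you to vary):

```python
import numpy as np

def grad(w):
    return np.array([2 * w[0], 6 * w[1]])   # gradient of x^2 + 3y^2

w = np.array([4.0, 3.0])   # starting point
eta = 0.1                  # learning rate

for step in range(20):
    w = w - eta * grad(w)                   # one step downhill
    loss = w[0]**2 + 3 * w[1]**2
    print(f"step {step:2d}: w = {w.round(4)}, loss = {loss:.6f}")
```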
Run this as-is, then re-run with `eta = 0.001` (watch training crawl) and `eta = 0.4` (watch it diverge to `inf` or `NaN`). That's the learning-rate sensitivity from the earlier warning, in numerical form.
The playground below embeds the same loop with a draggable starting point and a learning-rate slider: the loss surface rendered as contour lines, with the optimizer's path drawn on top as you step. Use it to build intuition for what an oscillating or diverging run looks like. Try learning rates of 0.05, 0.3, and 0.4 on the default function; identify the one where the path starts zig-zagging and then flies off the surface.
In real ML code, you don't write gradients one variable at a time. You write them as vector/matrix expressions, using the shape rules from Lesson 1. Example: the mean-squared-error loss for linear regression:

$$L(w) = \frac{1}{n} \|Xw - y\|^2$$
Here $X$ is the feature matrix of shape $(n, d)$, $y$ is the target vector of length $n$, and $w$ is the weight vector of length $d$. The loss is a single number. Its gradient with respect to $w$ is:

$$\nabla_w L = \frac{2}{n} X^\top (Xw - y)$$
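As a shape sanity check, here's that gradient on synthetic data (random numbers stand in for real features; the dimensions match the quick check below):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.normal(size=(n, d))        # feature matrix, shape (n, d)
y = rng.normal(size=n)             # targets, shape (n,)
w = np.zeros(d)                    # weights, shape (d,)

residual = X @ w - y               # predictions minus targets, shape (n,)
grad = (2 / n) * X.T @ residual    # the gradient formula above

print(grad.shape)                  # same shape as w
```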
You don't need to derive this right now; you'll do it carefully in Module 3. Three properties to register today: the gradient has the same shape as $w$ (length $d$, one entry per weight); it's built entirely from the matrix operations of Lesson 1, with no per-variable differentiation in sight; and in NumPy it's a single line, `grad = (2 / n) * X.T @ (X @ w - y)`. That's the line a linear regression optimizer loops over.
Quick check: if `X.shape = (100, 5)`, `w.shape = (5,)`, and `y.shape = (100,)`, what is the shape of `(2/n) * X.T @ (X @ w - y)`?

You just did several pages of calculus. In production you will do almost none of it. Modern frameworks provide autograd (automatic differentiation), which computes derivatives for you, exactly (not numerically), for any computation you write using their primitives.
The mental model: when you compute a forward pass with PyTorch tensors that have `requires_grad=True`, PyTorch builds a graph of every operation under the hood. When you call `.backward()` on the loss, it walks that graph backward applying the chain rule at each node, and leaves the gradient in `.grad` on every parameter tensor.
```python
import torch

# Same toy problem as Section 8
w = torch.tensor([4.0, 3.0], requires_grad=True)  # track grads for this tensor

loss = w[0]**2 + 3 * w[1]**2  # forward pass, builds the graph
loss.backward()               # walks the graph backward, fills w.grad
print(w.grad)                 # tensor([8., 18.]) == [2x, 6y] at (4, 3), matches Section 8

# One manual gradient-descent step:
with torch.no_grad():   # don't track this update
    w -= 0.1 * w.grad
w.grad.zero_()          # reset grads for next iteration
```
You will write this pattern hundreds of times. In practice you won't write the manual step; you'll use a PyTorch optimizer like `torch.optim.SGD` or `torch.optim.Adam` that encapsulates the update rule:
```python
import torch

w = torch.tensor([4.0, 3.0], requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

for step in range(20):
    optimizer.zero_grad()           # clear leftover gradients from last step
    loss = w[0]**2 + 3 * w[1]**2    # forward pass
    loss.backward()                 # autograd fills w.grad
    optimizer.step()                # applies: w = w - lr * w.grad
    print(f"step {step:2d}: w={w.data.numpy().round(3)}, loss={loss.item():.4f}")
```
This is the canonical training loop. Every PyTorch model you'll ever write is some version of it, with a more elaborate forward pass and a more elaborate loss. Module 5 builds from here.
Check yourself: which call fills `w.grad` with the gradient? And why call `optimizer.zero_grad()` at the top of each iteration?

The PyTorch autograd mechanics note is one of the most valuable pieces of documentation in the ML ecosystem, worth a full read once. For the classical side, `scipy.optimize` provides quasi-Newton methods (L-BFGS) that work for small/mid problems with exact gradients. And for the pure math background, the Wikipedia derivative article is more accessible than most textbooks.
Common training failures and their first-response moves:
- Loss blows up to `NaN`. First moves: lower the learning rate by 10×; add gradient clipping (clamp $\|\nabla\|$ to a max norm, as sketched below); check for numerical issues in the loss (e.g., `log(0)`).
- Loss rises instead of falling. First move: put `assert loss_new <= loss_old + tol` in your dev loop; it catches the bug on iteration 1.

Quick check: a previously healthy run hits `NaN` after 200 steps. What's the first thing to try?
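In PyTorch, the clipping move is one line between `backward()` and `step()`. A sketch on the toy problem from Section 8 (`max_norm=1.0` and the tolerance are illustrative values, not recommendations):

```python
import torch

w = torch.tensor([4.0, 3.0], requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)
prev_loss = float("inf")

for step in range(20):
    optimizer.zero_grad()
    loss = w[0]**2 + 3 * w[1]**2
    loss.backward()
    torch.nn.utils.clip_grad_norm_([w], max_norm=1.0)  # clamp ||grad|| before the step
    optimizer.step()
    assert loss.item() <= prev_loss + 1e-6, "loss went up -- check lr and sign"
    prev_loss = loss.item()
```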
Answer these without scrolling back up. Each question is checkable; wrong answers reveal the reasoning so you know what to re-read.

- If `w.shape = (10,)` and $L(w)$ is a scalar loss, what is the shape of $\nabla L(w)$?
- A run's loss goes to `inf`, then `NaN`, around step 300. First thing to try?
- Numerically check a gradient: wiggle `w` by `h`, measure the change in `f`, divide by `h` (a sketch follows this list). The last printed line should be `18.0`, matching the analytic answer.
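A minimal version of that last check, assuming the toy $f(x, y) = x^2 + 3y^2$ from Section 8 evaluated at $(4, 3)$, where the analytic gradient is $[8, 18]$:

```python
def f(x, y):
    return x**2 + 3 * y**2

x, y, h = 4.0, 3.0, 1e-6
df_dx = (f(x + h, y) - f(x, y)) / h   # wiggle x only
df_dy = (f(x, y + h) - f(x, y)) / h   # wiggle y only
print(round(df_dx, 1))                # 8.0  == 2x at x = 4
print(round(df_dy, 1))                # 18.0 == 6y at y = 3
```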
For any that were fuzzy, jump back to the relevant section, or scroll up and run the playground again until the mechanics feel mechanical.

Optimizer: SGD, Adam, RMSprop, etc. Stores state (momentum, running averages) and applies `w = w - lr * grad` (plus extras) when you call `.step()`.

Lesson 4 (Eigenvalues & SVD) completes the math foundation and unlocks PCA, rank, and a deeper view of what matrices actually do to space. After that, your first hands-on exercise implements gradient descent from scratch in `exercises/exercise-02-gradient-descent.ipynb`; this lesson is the prerequisite for every line of that notebook.