The single most important conceptual framework for understanding why models fail. Every regularization technique, every cross-validation strategy, every "should I use a bigger model?" question routes through this.
For a model $\hat{f}$ trying to predict a true function $f$ (with observation noise of variance $\sigma^2$), the expected squared error on an unseen point $x$ can be written as:

$$\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2} \;+\; \underbrace{\mathrm{Var}\big[\hat{f}(x)\big]}_{\text{variance}} \;+\; \underbrace{\sigma^2}_{\text{irreducible error}}$$
Three components:

- **Bias²:** how far, on average, the model's predictions are from the truth, because the model is too simple to capture the real pattern.
- **Variance:** how much the model's predictions would change if you retrained on a different random sample, because the model overreacts to the specific data it saw.
- **Irreducible error:** noise in the data itself. No model can beat this; it's the ceiling of what's learnable.
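To make the decomposition concrete, here is a minimal sketch that estimates bias² and variance empirically: retrain the same model on many fresh training samples and look at its predictions at one fixed test point. The quadratic true function, the noise level, and the deliberately-too-simple linear model are illustrative assumptions, not anything from the text above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

def true_f(x):
    return 0.5 * x**2          # the "true" pattern (illustrative choice)

x_test = np.array([[1.5]])     # one fixed unseen point
noise_sd = 0.5
preds = []

# Retrain on many independent samples drawn from the same data-generating process
for _ in range(500):
    X = rng.uniform(-3, 3, (100, 1))
    y = true_f(X.ravel()) + rng.normal(0, noise_sd, 100)
    model = LinearRegression().fit(X, y)   # deliberately too simple (linear)
    preds.append(model.predict(x_test)[0])

preds = np.array(preds)
bias_sq = (preds.mean() - true_f(x_test.ravel()))**2   # (average prediction - truth)^2
variance = preds.var()                                  # spread across retrainings
print(f"bias^2 ≈ {bias_sq[0]:.3f}, variance ≈ {variance:.3f}, "
      f"irreducible ≈ {noise_sd**2:.3f}")
```

With a linear model on a quadratic pattern, the bias² term should dominate; swap in a high-degree polynomial and the variance term takes over instead.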
Imagine shooting arrows at a target over many sessions, where each training run is one session. Bias is how far the centre of your arrow cluster sits from the bullseye; variance is how widely the arrows scatter around that centre.
The model is too simple to capture the pattern. Symptoms: training error is high, validation error sits close to it, and adding more data barely helps.
Fix: more features, more complex model (deeper network, higher-degree polynomial, fewer constraints), less regularization.
The model memorized the training noise. Symptoms: training error is low but validation error is much higher, and the gap widens the longer you train.
Fix: more data, simpler model, stronger regularization, dropout, early stopping, data augmentation.
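To make one of those fixes concrete, here is a hedged sketch of early stopping with scikit-learn's MLPRegressor: `early_stopping=True` holds out a validation fraction and halts training once the validation score stops improving. The data, architecture, and hyperparameters below are illustrative assumptions, not prescriptions.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, (300, 1))
y = 0.5 * X.ravel()**2 + rng.normal(0, 0.5, 300)

# early_stopping=True carves out a validation split and stops training
# once the validation score hasn't improved for n_iter_no_change iterations
model = MLPRegressor(hidden_layer_sizes=(64, 64),
                     early_stopping=True,
                     validation_fraction=0.2,
                     n_iter_no_change=10,
                     max_iter=2000,
                     random_state=0)
model.fit(X, y)
print(f"stopped after {model.n_iter_} iterations "
      f"(max allowed was {model.max_iter})")
```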
Whenever a model underperforms, plot training loss and validation loss on the same axis across training (for iterative models) or across model complexity. The shape of that gap tells you which problem you have, and therefore which fix to try. This is the single most valuable debugging habit in ML.
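One way to produce that picture is scikit-learn's validation_curve, which sweeps a complexity knob and returns train and validation scores side by side. The estimator, degree range, and synthetic data below are illustrative choices.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import validation_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (200, 1))
y = 0.5 * X.ravel()**2 + 0.2 * X.ravel() + rng.normal(0, 0.5, 200)

degrees = np.arange(1, 13)
model = make_pipeline(PolynomialFeatures(), LinearRegression())
train_scores, val_scores = validation_curve(
    model, X, y,
    param_name="polynomialfeatures__degree",
    param_range=degrees,
    cv=5,
    scoring="neg_mean_squared_error",
)

# Plot mean train vs. validation error; the size of the gap is the diagnostic
plt.plot(degrees, -train_scores.mean(axis=1), label="train MSE")
plt.plot(degrees, -val_scores.mean(axis=1), label="validation MSE")
plt.xlabel("polynomial degree (model complexity)")
plt.ylabel("MSE")
plt.legend()
plt.show()
```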
As you make a model more complex:

- bias falls, because the model can represent more of the true pattern;
- variance rises, because the model can also fit the noise in the particular sample it saw.

Total error therefore traces a U shape. The sweet spot is in the middle.
In code, you can see this directly with polynomial regression:
```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
import numpy as np

# True relationship: quadratic + noise
X = np.random.uniform(-3, 3, 200).reshape(-1, 1)
y = 0.5 * X.ravel()**2 + 0.2 * X.ravel() + np.random.normal(0, 0.5, 200)

for degree in [1, 2, 5, 15]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')
    print(f"degree={degree}: CV MSE = {-scores.mean():.3f}")

# Typical output (exact values vary with the random sample):
# degree=1: CV MSE = 1.23 <- high bias (underfitting)
# degree=2: CV MSE = 0.27 <- sweet spot
# degree=5: CV MSE = 0.31 <- starting to overfit
# degree=15: CV MSE = 0.88 <- high variance
```
Regularization penalizes model complexity during training by adding a penalty term to the loss, pulling weights toward zero and preventing overfitting.

- **L2 (Ridge), penalty $\lambda \sum_j w_j^2$:** shrinks all weights and keeps them small. Smooth and differentiable. This is what weight_decay is in PyTorch optimizers.
- **L1 (Lasso), penalty $\lambda \sum_j |w_j|$:** shrinks weights AND drives many to exactly zero, giving automatic feature selection.
- **Elastic Net:** a combination of L1 and L2. A good default when you don't know which to prefer.
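Here is a minimal comparison of the three in scikit-learn; the alpha values and the synthetic data are illustrative assumptions. The point to notice is that the L1-based penalties zero out coefficients for irrelevant features, while L2 only shrinks them.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Only the first two features actually matter; the other eight are pure noise
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.5, 200)

for name, model in [("ridge (L2)", Ridge(alpha=1.0)),
                    ("lasso (L1)", Lasso(alpha=0.1)),
                    ("elastic net", ElasticNet(alpha=0.1, l1_ratio=0.5))]:
    model.fit(X, y)
    n_zero = np.sum(np.isclose(model.coef_, 0.0))
    print(f"{name:12s} zero coefficients: {n_zero}/10")

# Typically, Lasso and Elastic Net zero out most of the 8 irrelevant features,
# while Ridge keeps all 10 coefficients small but nonzero.
```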
The classical bias-variance U-curve isn't the whole story. In very over-parameterized regimes (think: modern LLMs with billions of parameters trained on limited data), test error can go down again as you add more capacity. This is called double descent, and it's an active research area.
The intuition: when a model has way more parameters than data points, gradient descent tends to find the "smoothest" solution that fits the data, which generalizes better than classical theory predicts. But don't let this make you skip regularization. For most practical work, the classical U-curve still dominates.
A 100-million-parameter neural network with strong regularization can be less effectively complex than a 1000-parameter polynomial. "Complexity" means "variance of the trained function", not parameter count. Modern deep learning relies on this fact.
When your model isn't good enough:
| Symptom | Likely cause | Try |
|---|---|---|
| Train error high, test close to train | High bias (underfitting) | Bigger model, more features, longer training, less regularization |
| Train error low, test error much higher | High variance (overfitting) | More data, smaller model, more regularization, early stopping, dropout |
| Train and validation error both low, but much worse in production | Distribution shift | Retrain on recent data, monitor drift |
| Train error plateaus at a high floor | Irreducible noise or missing signal | Better features, more data, or accept the ceiling |
You now have the statistical foundation to reason about data, uncertainty, and when to trust a model. Together with Module 1's linear algebra and calculus, you have the math language of ML.
Module 3 is where we finally build models. It's going to feel like a lot of what you just learned suddenly clicks into place.