Module 2 · Lesson 4 · ~30 min · The core ML tradeoff

Bias, Variance, and the Tradeoff That Governs ML

The single most important conceptual framework for understanding why models fail. Every regularization technique, every cross-validation strategy, every "should I use a bigger model?" question routes through this.

1. The decomposition

For a model $\hat{f}$ trying to predict a true function $f$, the expected squared error on an unseen point can be written as:

$$E[(y - \hat{f}(x))^2] = \underbrace{\text{Bias}[\hat{f}(x)]^2}_{\text{systematic miss}} + \underbrace{\text{Var}[\hat{f}(x)]}_{\text{sensitivity to data}} + \underbrace{\sigma^2}_{\text{irreducible}}$$

Three components:

Bias

How far, on average, the model's predictions are from the truth, because the model is too simple to capture the real pattern.

Variance

How much the model's predictions would change if you retrained on a different random sample, because the model overreacts to the specific data it saw.

Irreducible

Noise in the data itself. No model can beat this: it's the ceiling of what's learnable.
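
You can estimate each term empirically. Here's a sketch (NumPy only; the quadratic true function, noise level, and evaluation point are made up for illustration): fit a deliberately-too-simple linear model on many fresh training samples, then measure the systematic miss and the spread of predictions at one test point.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    return 0.5 * x**2          # the pattern we want to learn

sigma = 0.5                    # noise standard deviation (irreducible term)
x0 = 1.5                       # the unseen point we evaluate at

# Fit a degree-1 polynomial (too simple -> biased) on many fresh samples
preds = []
for _ in range(2000):
    X = rng.uniform(-3, 3, 50)
    y = true_f(X) + rng.normal(0, sigma, 50)
    coefs = np.polyfit(X, y, deg=1)        # linear least-squares fit
    preds.append(np.polyval(coefs, x0))
preds = np.array(preds)

bias_sq = (preds.mean() - true_f(x0))**2   # systematic miss, squared
variance = preds.var()                     # spread across retrainings
print(f"bias^2 ~ {bias_sq:.3f}, variance ~ {variance:.3f}, "
      f"irreducible = {sigma**2:.3f}")
```

Because the linear model can't represent the quadratic, the bias term stays large no matter how many times you retrain; swapping in `deg=2` collapses it.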

2. The archery target picture

Imagine shooting at a target multiple times. Each training run = one quiver of arrows. High bias: the arrows cluster tightly, but away from the bullseye. High variance: the arrows center on the bullseye on average, but scatter widely. Irreducible error: the wind that nudges every arrow no matter how well you shoot.

3. Underfitting and overfitting

Underfitting (high bias)

The model is too simple to capture the pattern. Symptoms:

- Training error is high
- Validation error is high and close to training error
- More data barely helps

Fix: more features, a more complex model (deeper network, higher-degree polynomial, fewer constraints), less regularization.

Overfitting (high variance)

The model memorized the training noise. Symptoms:

- Training error is very low
- Validation error is much higher than training error
- Predictions are erratic on anything the model hasn't seen

Fix: more data, a simpler model, stronger regularization, dropout, early stopping, data augmentation.

Your diagnostic routine

Whenever a model underperforms, plot training loss and validation loss on the same axes across training (for iterative models) or across model complexity. The shape of that gap tells you which problem you have, and therefore which fix to try. This is the single most valuable debugging habit in ML.
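
This routine can be sketched with scikit-learn's `validation_curve` (the data and degree grid here are illustrative, reusing the quadratic setup from the polynomial example below):

```python
import numpy as np
from sklearn.model_selection import validation_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, 200).reshape(-1, 1)
y = 0.5 * X.ravel()**2 + rng.normal(0, 0.5, 200)

degrees = [1, 2, 5, 10]
# Train and validation scores for each complexity level, 5-fold CV
train_scores, val_scores = validation_curve(
    make_pipeline(PolynomialFeatures(), LinearRegression()),
    X, y,
    param_name="polynomialfeatures__degree",
    param_range=degrees,
    scoring="neg_mean_squared_error",
    cv=5,
)

train_mse = -train_scores.mean(axis=1)
val_mse = -val_scores.mean(axis=1)
for d, tr, va in zip(degrees, train_mse, val_mse):
    print(f"degree={d:2d}  train MSE={tr:.3f}  val MSE={va:.3f}  gap={va - tr:.3f}")
```

Both errors high and close together: a bias problem. Train error low with a large gap to validation: a variance problem.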

4. The tradeoff, illustrated

As you make a model more complex:

- Bias falls: the model can represent more patterns
- Variance rises: the model is more sensitive to the particular sample it trained on

Total error has a U shape. The sweet spot is in the middle.

In code, you can see this directly with polynomial regression:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
import numpy as np

# True relationship: quadratic + noise
X = np.random.uniform(-3, 3, 200).reshape(-1, 1)
y = 0.5 * X.ravel()**2 + 0.2 * X.ravel() + np.random.normal(0, 0.5, 200)

for degree in [1, 2, 5, 15]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')
    print(f"degree={degree}: CV MSE = {-scores.mean():.3f}")

# Example output (exact values vary run to run, since there's no seed):
# degree=1:  CV MSE = 1.23  <- high bias (underfitting)
# degree=2:  CV MSE = 0.27  <- sweet spot
# degree=5:  CV MSE = 0.31  <- starting to overfit
# degree=15: CV MSE = 0.88  <- high variance

5. Regularization: the cure for variance

Regularization penalizes model complexity during training, pulling weights toward zero and preventing overfitting.

L2 (Ridge)

$$L = \text{MSE} + \alpha \sum_i w_i^2$$

Shrinks all weights, keeps them small. Smooth, differentiable. This is what weight_decay is in PyTorch optimizers.
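
The L2 penalty even has a closed-form solution. A quick sketch checking the textbook formula against scikit-learn (with `fit_intercept=False` so the two match exactly; note that scikit-learn's `Ridge` penalizes the sum of squared errors rather than the mean, so its `alpha` is scaled differently than in the MSE form above):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.1, 100)

alpha = 1.0
# Minimizing ||y - Xw||^2 + alpha * ||w||^2 gives
#   w = (X^T X + alpha * I)^(-1) X^T y
w_closed = np.linalg.solve(X.T @ X + alpha * np.eye(3), X.T @ y)

ridge = Ridge(alpha=alpha, fit_intercept=False).fit(X, y)
print(np.allclose(w_closed, ridge.coef_))
```

The `alpha * I` term is what keeps the matrix well-conditioned and the weights small; set `alpha = 0` and you recover ordinary least squares.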

L1 (Lasso)

$$L = \text{MSE} + \alpha \sum_i |w_i|$$

Shrinks weights AND drives many to exactly zero: automatic feature selection.

Elastic net

Combination of L1 and L2. Good default when you don't know which to prefer.
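
A quick sketch contrasting the two penalties (the data is synthetic and the `alpha` values are illustrative): on data where only a few features matter, Lasso zeroes out the irrelevant ones while Ridge merely shrinks them.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.normal(size=(n, p))
# Only the first two features carry signal; the other eight are pure noise
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.5, n)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("ridge exact zeros:", int(np.sum(ridge.coef_ == 0)))  # shrunk, but nonzero
print("lasso exact zeros:", int(np.sum(lasso.coef_ == 0)))  # many exactly zero
```

The L1 penalty's sharp corner at zero is what snaps small coefficients to exactly zero; the smooth L2 penalty never does.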

Other forms of regularization (deep learning)

- Dropout: randomly zero activations during training so no unit can over-rely on another
- Early stopping: halt training when validation loss stops improving
- Data augmentation: enlarge the effective dataset with label-preserving transforms
- Weight decay: L2, applied directly through the optimizer

6. The "double descent" wrinkle

The classical bias-variance U-curve isn't the whole story. In very over-parameterized regimes (think: modern LLMs with billions of parameters trained on limited data), test error can go down again as you add more capacity. This is called double descent, and it's an active research area.

The intuition: when a model has way more parameters than data points, gradient descent tends to find the "smoothest" solution that fits the data, which generalizes better than classical theory predicts. But don't let this make you skip regularization. For most practical work, the classical U-curve still dominates.

7. Model complexity ≠ model size

A 100-million-parameter neural network with strong regularization can be less effectively complex than a 1000-parameter polynomial. "Complexity" means "variance of the trained function", not parameter count. Modern deep learning relies on this fact.

8. Putting it together: the diagnostic flowchart

When your model isn't good enough:

| Symptom | Likely cause | Try |
| --- | --- | --- |
| Train error high, test close to train | High bias (underfitting) | Bigger model, more features, longer training, less regularization |
| Train error low, test error much higher | High variance (overfitting) | More data, smaller model, more regularization, early stopping, dropout |
| Train error low, test error low on val, much worse in prod | Distribution shift | Retrain on recent data, monitor drift |
| Train error plateaus at a high floor | Irreducible noise or missing signal | Better features, more data, or accept the ceiling |

9. Self-check

Module 2 wrap

You now have the statistical foundation to reason about data, uncertainty, and when to trust a model. Together with Module 1's linear algebra and calculus, you have the math language of ML.

Module 3 is where we finally build models. It's going to feel like a lot of what you just learned suddenly clicks into place.