Module 2 · Lesson 2 · ~35 min · Why we can trust averages

Sampling, Variance, and the CLT

Why does averaging a thousand random things give us a stable number? Why do error bars shrink with more data? The Central Limit Theorem is the answer, and it's the reason statistics works at all.

Why this matters for ML

Every ML evaluation you'll ever run (accuracy, ROC AUC, A/B test lift) is a sample statistic. Understanding the uncertainty of samples is the difference between "model A beat model B" and "model A looked better but we can't tell if it was noise."

1. Population vs. sample

Two different things, constantly confused:

| Quantity | Population (truth, unknown) | Sample (estimate, what we have) |
| --- | --- | --- |
| Mean | $\mu$ | $\bar{x} = \frac{1}{n}\sum x_i$ |
| Variance | $\sigma^2$ | $s^2 = \frac{1}{n-1}\sum (x_i - \bar{x})^2$ |
| Proportion | $p$ | $\hat{p}$ |

The goal of statistics is to use the sample to say something reliable about the population.
Why $n-1$ and not $n$ in the sample variance?

Using $\bar{x}$ instead of the unknown $\mu$ slightly underestimates the spread. Dividing by $n-1$ (Bessel's correction) fixes the bias. In ML this rarely matters: np.var(x, ddof=1) gives you the corrected version, while np.var(x) uses $n$. For large $n$ the difference is negligible.
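A quick check of the two estimators against the ddof flag (the array values here are arbitrary):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
xbar = x.mean()

biased = ((x - xbar) ** 2).sum() / len(x)          # divide by n
unbiased = ((x - xbar) ** 2).sum() / (len(x) - 1)  # divide by n-1 (Bessel)

print(np.var(x), biased)             # both 1.25
print(np.var(x, ddof=1), unbiased)   # both ~1.667
```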

2. The sample mean has its own distribution

If I draw a new sample of 100 users and compute the mean age, I'll get some number like 34.2. If I do it again with a different sample of 100 users, I'll get something slightly different, maybe 33.9 or 34.6. The sample mean itself is a random variable.

How does it behave? Two facts that might surprise you:

  1. The expected value of the sample mean is the true mean. $E[\bar{x}] = \mu$. On average, we're right.
  2. The variance of the sample mean is $\sigma^2 / n$. It shrinks as $n$ grows. Standard error: $\sigma / \sqrt{n}$.
$$\text{SE}(\bar{x}) = \frac{\sigma}{\sqrt{n}}$$

This is the most important formula in applied statistics. Read it as: uncertainty in an estimate decreases as the square root of the sample size. Want half the error? Four times the data. Want a tenth? A hundred times.
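A minimal simulation of this square-root scaling (an exponential population with $\sigma = 2$; the sample sizes and repeat count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0  # the std of an exponential distribution with scale=2

for n in [100, 400, 1600]:
    # empirical std of the sample mean across 2000 repeated samples of size n
    means = rng.exponential(scale=2.0, size=(2000, n)).mean(axis=1)
    print(f"n={n:5d}  empirical SE={means.std():.4f}  sigma/sqrt(n)={sigma / np.sqrt(n):.4f}")
```

Each 4x increase in $n$ roughly halves the empirical standard error, matching $\sigma/\sqrt{n}$.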

3. The Central Limit Theorem

Here's the magic. Take any distribution with finite mean $\mu$ and variance $\sigma^2$ (skewed, uniform, bimodal, whatever). Now draw $n$ independent samples from it and compute their mean. Do this many times. What's the distribution of those means?

It's approximately normal. Regardless of the shape of the original distribution. As $n \to \infty$:

$$\bar{x} \sim \mathcal{N}\left(\mu, \frac{\sigma^2}{n}\right) \quad \text{(approximately)}$$

See it in action

Start with a wildly non-normal distribution (pick it below). Draw $n$ samples, take their mean, repeat 5000 times. Watch the histogram of means become a normal curve as $n$ increases.

🎲 Central Limit Theorem demo

Top: the source distribution. Bottom: histogram of sample means across 5000 trials. At n=1 the bottom looks like the top. By n=30 it's shockingly close to a bell curve, no matter what the top looks like.
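The same experiment can be sketched in a few lines of NumPy. Here I assume an exponential source distribution (one of the skewed choices) and measure skewness of the sample means, which heads toward 0 (the normal value) as $n$ grows:

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_means(n, trials=5000):
    # draw `trials` samples of size n from a skewed source, return their means
    return rng.exponential(scale=2.0, size=(trials, n)).mean(axis=1)

for n in [1, 5, 30]:
    m = sample_means(n)
    # sample skewness: 0 for a perfect normal; exponential starts at 2
    skew = ((m - m.mean()) ** 3).mean() / m.std() ** 3
    print(f"n={n:3d}  mean={m.mean():.2f}  skewness={skew:.2f}")
```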

Rule of thumb

The CLT usually kicks in by $n \approx 30$ for moderately skewed distributions. For very skewed distributions you might need $n \approx 100$. For wild distributions (like Pareto with infinite variance), the CLT doesn't apply at all; that's a real edge case in web analytics.

4. Confidence intervals

Using the CLT, we can put error bars on our estimates. For a sample mean:

$$\bar{x} \pm 1.96 \cdot \frac{s}{\sqrt{n}}$$

This is an approximate 95% confidence interval. The interpretation is subtle: it does not mean "there's a 95% chance the true mean is in here." It means: if we repeated this whole procedure many times, 95% of the intervals we compute would contain the true mean.

Concretely:

import numpy as np
data = np.random.exponential(scale=2, size=500)  # sampled from true mean = 2
mean = data.mean()
se = data.std(ddof=1) / np.sqrt(len(data))
ci_low, ci_high = mean - 1.96*se, mean + 1.96*se
print(f"Estimate: {mean:.3f}  (95% CI: [{ci_low:.3f}, {ci_high:.3f}])")
# e.g. Estimate: 1.954  (95% CI: [1.776, 2.132]); the CI contains the truth ✓
# (exact numbers vary per run since no seed is set)
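To see the "repeated procedure" interpretation in action, we can rerun the whole experiment many times and count how often the interval covers the truth (a sketch; the 1000-repeat count is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(1)
true_mean = 2.0
repeats = 1000
covered = 0

for _ in range(repeats):
    data = rng.exponential(scale=2.0, size=500)
    se = data.std(ddof=1) / np.sqrt(len(data))
    lo, hi = data.mean() - 1.96 * se, data.mean() + 1.96 * se
    covered += (lo <= true_mean <= hi)

print(f"Coverage: {covered / repeats:.1%}")  # typically close to 95%
```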

5. Bootstrap: the Swiss Army knife of confidence intervals

The formula above only works for simple estimators (means, proportions). For anything more complex (a median, an R² score, a ROC AUC) use bootstrap resampling:

import numpy as np

def bootstrap_ci(data, statistic, n_boot=5000, alpha=0.05):
    estimates = []
    n = len(data)
    for _ in range(n_boot):
        # resample with replacement, same size as the original data
        sample = np.random.choice(data, size=n, replace=True)
        estimates.append(statistic(sample))
    # percentile interval: middle (1 - alpha) of the bootstrap distribution
    return np.percentile(estimates, [100*alpha/2, 100*(1-alpha/2)])

# Confidence interval for the median
data = np.random.normal(50, 10, size=200)
print(bootstrap_ci(data, np.median))

Bootstrap is honestly one of the coolest ideas in statistics: "resample with replacement, compute the statistic, look at the spread." It works for almost anything. Use it whenever you want uncertainty around an ML metric on your test set.

Real ML usage

When comparing two models, don't just report "Model A: 0.87 AUC, Model B: 0.85 AUC, so A wins." Bootstrap your test set 1000 times, compute the AUC difference in each resample, and report the 95% CI. If it crosses zero, you don't actually have evidence that A is better. This one habit will level up your ML work immediately.
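A sketch of that habit, assuming a rank-based (Mann-Whitney) AUC and fully synthetic labels and scores; every number and name below is made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)

def auc(y_true, scores):
    # rank-based AUC (Mann-Whitney statistic); assumes pos/neg scores don't tie
    ranks = scores.argsort().argsort() + 1
    n_pos = y_true.sum()
    n_neg = len(y_true) - n_pos
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Synthetic test set: model A's scores separate the classes a bit better than B's
y = rng.integers(0, 2, size=1000)
scores_a = y + rng.normal(0, 1.0, size=1000)
scores_b = y + rng.normal(0, 1.2, size=1000)

# Paired bootstrap: resample test rows, recompute the AUC difference each time
idx_all = np.arange(len(y))
diffs = []
for _ in range(1000):
    idx = rng.choice(idx_all, size=len(y), replace=True)
    diffs.append(auc(y[idx], scores_a[idx]) - auc(y[idx], scores_b[idx]))

lo, hi = np.percentile(diffs, [2.5, 97.5])
print(f"AUC(A) - AUC(B) 95% CI: [{lo:.3f}, {hi:.3f}]")
```

The resampling is paired (the same rows are drawn for both models), which is what makes the difference's CI meaningful.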

6. Law of Large Numbers (the CLT's cousin)

A simpler, older result: the sample mean converges to the true mean as $n$ grows. $\bar{x}_n \to \mu$. This is why ML works at all: with more data, we converge toward the true underlying function.

The LLN guarantees convergence. The CLT tells us the rate ($1/\sqrt{n}$) and the shape of the errors around the true value (normal).
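Both effects show up in one small simulation: the running mean of fair coin flips converges to 0.5 (LLN), and its error tracks the $0.5/\sqrt{n}$ scale the CLT predicts (the flip count is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(3)
flips = rng.integers(0, 2, size=100_000)  # fair coin, true mean 0.5

# running mean after each flip
running_mean = flips.cumsum() / np.arange(1, len(flips) + 1)

for n in [100, 10_000, 100_000]:
    err = abs(running_mean[n - 1] - 0.5)
    # CLT predicts errors on the order of sigma/sqrt(n) = 0.5/sqrt(n)
    print(f"n={n:>7}  |error|={err:.4f}  0.5/sqrt(n)={0.5 / np.sqrt(n):.4f}")
```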

7. Self-check