How to update your beliefs when data arrives, and how to decide whether a difference you observed is real. These two skills separate engineers who ship models from engineers who ship trusted models.
1. Bayes' Theorem: the one equation
$$P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}$$
In words: to get the probability of $A$ given $B$, take how likely $B$ is under $A$, weight it by how likely $A$ was to begin with, then normalize by how likely $B$ is overall.
In ML-shaped notation, where $\theta$ represents parameters and $D$ represents data:
$$P(\theta \mid D) = \frac{P(D \mid \theta) \cdot P(\theta)}{P(D)}$$
2. The classic example: a rare disease test
A disease affects about 1 in 500 people. A test for it is 99% accurate: it flags 99% of sick people and gives a false positive for only 1% of healthy people. You test positive. How likely is it that you're actually sick?
About 17%. Even with a 99%-accurate test, a positive result only means a 17% chance you're sick, because the disease itself is rare. The prior dominates.
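A quick sanity check of that arithmetic in code, a minimal sketch using the same assumed numbers (1-in-500 prevalence, 99% sensitivity and specificity):

# P(sick | positive test) for a rare disease
prior = 1 / 500        # P(sick): assumed prevalence
sensitivity = 0.99     # P(positive | sick)
false_positive = 0.01  # P(positive | healthy) = 1 - specificity

# Denominator of Bayes' theorem: total probability of a positive test
p_positive = sensitivity * prior + false_positive * (1 - prior)
posterior = sensitivity * prior / p_positive
print(f"P(sick | positive) = {posterior:.3f}")  # ~0.17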
This is why calibration matters
ML models often output overconfident probabilities. A fraud classifier that says "this transaction is 95% fraud" in a world where 0.1% of transactions are actually fraudulent is wildly miscalibrated. Always check: when your model says "90% probability of X," out of those cases how many actually are X? That's reliability/calibration, and it matters as much as accuracy.
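A minimal reliability-check sketch, assuming you already have y_prob (the model's predicted probabilities) and y_true (the 0/1 outcomes) as NumPy arrays; these names are placeholders, not a specific library API:

import numpy as np

def reliability_table(y_prob, y_true, n_bins=10):
    """Bin predictions and compare mean predicted probability to observed frequency."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (y_prob >= lo) & (y_prob < hi)
        if mask.sum() == 0:
            continue
        print(f"[{lo:.1f}, {hi:.1f}): predicted {y_prob[mask].mean():.2f}, "
              f"observed {y_true[mask].mean():.2f}, n = {mask.sum()}")

A well-calibrated model shows predicted ≈ observed in every bin; a bin where the model says 0.95 but only 0.40 of cases are actually positive is the overconfidence described above.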
3. Prior, likelihood, posterior in ML practice
Prior $P(\theta)$: what you believe about parameters before seeing data. Can be deliberately vague ("uninformative prior") or strongly opinionated ("we know weights should be near zero"). L2 regularization in ML is mathematically equivalent to a Gaussian prior centered at 0.
Likelihood $P(D \mid \theta)$: how probable the observed data is under a given $\theta$. This is what MLE (from Lesson 1) maximizes.
Posterior $P(\theta \mid D)$: your updated belief after seeing data. If you use a prior and maximize the posterior instead of just the likelihood, you're doing MAP (maximum a posteriori) estimation, which is equivalent to regularized MLE.
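A one-line sketch of that equivalence, assuming a Gaussian prior $\theta \sim \mathcal{N}(0, \sigma^2 I)$:
$$\hat{\theta}_{\text{MAP}} = \arg\max_\theta \left[ \log P(D \mid \theta) + \log P(\theta) \right] = \arg\min_\theta \left[ -\log P(D \mid \theta) + \frac{1}{2\sigma^2} \lVert \theta \rVert_2^2 \right]$$
The second term is exactly an L2 penalty with strength $\lambda = 1/\sigma^2$ (the prior's normalizing constant drops out of the argmin), which is why MAP with this prior behaves like regularized MLE.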
4. Hypothesis testing: is this difference real?
The frequentist framework for deciding "is the thing I observed likely due to chance?"
State a null hypothesis $H_0$: usually "there's no effect" (no difference between groups, no improvement).
State an alternative $H_1$: "there is an effect."
Compute a test statistic from your data (e.g., difference in means).
Compute the p-value: if $H_0$ were true, how likely is a test statistic this extreme or more?
If $p < \alpha$ (traditionally 0.05), reject $H_0$.
What the p-value is NOT
The p-value is not the probability the null is true. It's not the probability you're wrong. It's the probability of seeing data this extreme assuming the null is true. Mixing these up powers a depressing amount of bad science.
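One way to make this concrete is a simulation sketch: run many A/B tests where the null is true by construction and count how often $p < 0.05$. The setup below (1000 users per arm, Normal(4.2, 2.1) engagement) simply mirrors the example that follows; the exact numbers are illustrative.

from scipy import stats
import numpy as np

rng = np.random.default_rng(0)
p_values = []
for _ in range(2000):
    # Both groups come from the same distribution, so H0 is true by construction
    a = rng.normal(4.2, 2.1, size=1000)
    b = rng.normal(4.2, 2.1, size=1000)
    p_values.append(stats.ttest_ind(a, b).pvalue)

frac = np.mean(np.array(p_values) < 0.05)
print(f"Fraction of null tests with p < 0.05: {frac:.3f}")  # ~0.05

Even with no real effect, roughly 5% of tests come back "significant": that is exactly what $\alpha = 0.05$ means, and it is the seed of the multiple-comparisons problem in section 6.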
Example: two-sample t-test (A/B test)
You run an A/B test on a recommendation change. 1000 users in each group. Group A has mean engagement 4.2 min, stddev 2.1. Group B has 4.5 min, stddev 2.0. Did the change help, or is this noise?
from scipy import stats
import numpy as np

# Simulate data matching the observed group statistics
rng = np.random.default_rng(42)
a = rng.normal(4.2, 2.1, size=1000)
b = rng.normal(4.5, 2.0, size=1000)

t_stat, p_value = stats.ttest_ind(a, b)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
# Typically t is around -3 and p is well below 0.05: a highly significant difference
With 1000 in each group, a 0.3 minute difference is detectable. With only 50, it wouldn't be.
5. Type I and Type II errors
|  | $H_0$ is true (no effect) | $H_0$ is false (real effect) |
|---|---|---|
| Reject $H_0$ | Type I error (false positive) | Correct ✓ |
| Fail to reject | Correct ✓ | Type II error (false negative) |
$\alpha$ = significance level = P(Type I error). You pick this; 0.05 is traditional.
$\beta$ = P(Type II error). Depends on effect size, sample size, and variance.
Power = $1 - \beta$ = P(detect a real effect when there is one). You want high power, at least 0.8, which requires enough sample size.
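A simulation sketch of the sample-size claim from the A/B example above: assume the true effect is 0.3 minutes with stddev around 2.0, simulate the experiment many times at each sample size, and count how often the t-test rejects. The helper below is illustrative, not a standard library function.

from scipy import stats
import numpy as np

def estimated_power(n_per_group, effect=0.3, sd=2.0, alpha=0.05, n_sims=2000, seed=0):
    """Fraction of simulated experiments in which the t-test rejects H0."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sims):
        a = rng.normal(4.2, sd, size=n_per_group)
        b = rng.normal(4.2 + effect, sd, size=n_per_group)
        if stats.ttest_ind(a, b).pvalue < alpha:
            rejections += 1
    return rejections / n_sims

print(estimated_power(1000))  # roughly 0.9: well-powered
print(estimated_power(50))    # roughly 0.1: the same real effect is usually missed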
6. Multiple comparisons: the silent killer
If you test 20 independent hypotheses at $\alpha = 0.05$, you'd expect 1 false positive on average, even if nothing is real. Running 20 A/B tests and celebrating the one with $p < 0.05$ is how garbage products get shipped.
Corrections:
Bonferroni: divide $\alpha$ by the number of tests. Simple, conservative.
Benjamini-Hochberg (FDR): controls the false discovery rate. Better for large-scale testing (e.g., feature selection).
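Both corrections are available in statsmodels (listed in the tools below); a short sketch, where p_values is a hypothetical array of raw p-values from 20 tests:

import numpy as np
from statsmodels.stats.multitest import multipletests

# Hypothetical raw p-values from 20 independent tests
p_values = np.array([0.001, 0.002, 0.003, 0.004, 0.04] + [0.2] * 15)

for method in ["bonferroni", "fdr_bh"]:
    reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method=method)
    print(f"{method}: {reject.sum()} of {len(p_values)} tests still significant")
# Bonferroni keeps 2 here; fdr_bh keeps 4 of the 5 small p-values

Bonferroni is stricter because it guards against even one false positive across the whole batch; Benjamini-Hochberg only controls the expected fraction of false discoveries, so it rejects more when many tests show an effect.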
7. When to be Bayesian vs. frequentist in ML
Pragmatically:
Most production ML is frequentist-flavored: point estimates, confidence intervals, MLE training.
Use Bayesian thinking when: you have strong priors (small data, domain knowledge), you need calibrated uncertainty (medical, finance), or you're doing hyperparameter tuning (Bayesian optimization).
Tools: scipy.stats for classical testing, statsmodels for regression inference, pymc or numpyro for full Bayesian modeling.
8. Self-check
In Bayes' theorem, what do the prior, likelihood, and posterior represent?
A test has $p = 0.03$. What does this tell you? (Careful: this is the question people get wrong.)
When is a Type II error a big deal vs. a Type I error a big deal? (Hint: depends on the domain.)
Why does Bonferroni correction make tests more conservative?