Module 2 · Lesson 3 · ~40 min · Making decisions

Bayes, Hypothesis Testing & Inference

How to update your beliefs when data arrives, and how to decide whether a difference you observed is real. These two skills separate engineers who ship models from engineers who ship trusted models.

1. Bayes' Theorem: the one equation

$$P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}$$

In words: to get the probability of $A$ given $B$, take how likely $B$ is under $A$, weight it by how likely $A$ was to begin with, then normalize by how likely $B$ is overall.

In ML-shaped notation, where $\theta$ represents parameters and $D$ represents data:

$$\underbrace{P(\theta \mid D)}_{\text{posterior}} = \frac{\overbrace{P(D \mid \theta)}^{\text{likelihood}} \cdot \overbrace{P(\theta)}^{\text{prior}}}{\underbrace{P(D)}_{\text{evidence}}}$$

Memorize this. These four words (posterior, likelihood, prior, evidence) show up everywhere from linear regression to LLM fine-tuning.

2. The classic Bayes example: the surprisingly rare disease

Suppose a disease affects 1% of the population. You take a test that is:

  - 99% sensitive: if you're sick, it comes back positive 99% of the time, $P(+ \mid \text{sick}) = 0.99$.
  - 95% specific: if you're healthy, it still comes back positive 5% of the time, $P(+ \mid \text{healthy}) = 0.05$.

You test positive. What's the probability you actually have the disease?

Most people's gut says "like 95%." Let's see.

$$P(\text{sick} \mid +) = \frac{P(+ \mid \text{sick}) \cdot P(\text{sick})}{P(+)}$$

$$P(+) = P(+ \mid \text{sick})\, P(\text{sick}) + P(+ \mid \text{healthy})\, P(\text{healthy}) = 0.99 \cdot 0.01 + 0.05 \cdot 0.99 = 0.0594$$

$$P(\text{sick} \mid +) = \frac{0.99 \cdot 0.01}{0.0594} \approx 0.167$$

About 17%. Even with a 99%-accurate test, a positive result only means a 17% chance you're sick, because the disease itself is rare. The prior dominates.
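
The same arithmetic in a few lines of Python, matching the numbers above:

p_sick = 0.01                 # prior: 1% of the population is sick
p_pos_given_sick = 0.99       # sensitivity
p_pos_given_healthy = 0.05    # false positive rate

# Evidence: total probability of testing positive
p_pos = p_pos_given_sick * p_sick + p_pos_given_healthy * (1 - p_sick)

# Bayes' theorem
p_sick_given_pos = p_pos_given_sick * p_sick / p_pos
print(f"P(sick | +) = {p_sick_given_pos:.3f}")  # 0.167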

This is why calibration matters

ML models often output overconfident probabilities. A fraud classifier that says "this transaction is 95% fraud" in a world where 0.1% of transactions are actually fraudulent is wildly miscalibrated. Always check: when your model says "90% probability of X," out of those cases how many actually are X? That's reliability/calibration, and it matters as much as accuracy.
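
A minimal sketch of that reliability check, with synthetic scores and labels standing in for a real model's outputs (everything here is made up for illustration):

import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for a real model: 0/1 outcomes and predicted probabilities.
# The scores are deliberately optimistic relative to the 10% base rate.
y_true = rng.binomial(1, 0.1, size=10_000)
y_prob = np.clip(0.4 * y_true + rng.beta(2, 4, size=10_000), 0.0, 1.0)

# Reliability check: within each predicted-probability bin,
# how often did the event actually happen?
edges = np.linspace(0.0, 1.0, 11)
for lo, hi in zip(edges[:-1], edges[1:]):
    mask = (y_prob >= lo) & (y_prob < hi)
    if mask.any():
        print(f"predicted {lo:.1f}-{hi:.1f}: "
              f"actual rate {y_true[mask].mean():.2f} (n={mask.sum()})")

For a calibrated model the two columns track each other; big gaps in the high-probability bins are exactly the fraud-classifier failure described above.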

3. Prior, likelihood, posterior in ML practice
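
These three objects show up concretely whenever you estimate a rate. A minimal sketch with made-up numbers: a click-through-rate estimate with a conjugate Beta prior, the cleanest case, where the posterior has a closed form.

from scipy import stats

# Prior: Beta(2, 50) encodes a belief that CTR is around 2/(2+50), roughly 4%.
prior_a, prior_b = 2, 50

# Data: 100 impressions, 9 clicks (the likelihood is Binomial).
clicks, impressions = 9, 100

# Posterior: Beta(a + clicks, b + misses). Conjugacy makes this exact.
post = stats.beta(prior_a + clicks, prior_b + (impressions - clicks))
print(f"posterior mean CTR: {post.mean():.3f}")
lo, hi = post.interval(0.95)
print(f"95% credible interval: ({lo:.3f}, {hi:.3f})")

The same structure hides inside everyday training: an L2 penalty on weights corresponds to a Gaussian prior, and minimizing the regularized loss is MAP estimation, maximizing the posterior rather than the likelihood.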

4. Hypothesis testing: is this difference real?

The frequentist framework for deciding whether the difference you observed is plausibly just chance:

  1. State a null hypothesis $H_0$ โ€” usually "there's no effect" (no difference between groups, no improvement).
  2. State an alternative $H_1$ โ€” "there is an effect."
  3. Compute a test statistic from your data (e.g., difference in means).
  4. Compute the p-value: if $H_0$ were true, how likely is a test statistic this extreme or more?
  5. If $p < \alpha$ (traditionally 0.05), reject $H_0$.

What the p-value is NOT

The p-value is not the probability the null is true. It's not the probability you're wrong. It's the probability of seeing data this extreme assuming the null is true. Mixing these up powers a depressing amount of bad science.

Example: two-sample t-test (A/B test)

You run an A/B test on a recommendation change. 1000 users in each group. Group A has mean engagement 4.2 min, stddev 2.1. Group B has 4.5 min, stddev 2.0. Did the change help, or is this noise?

from scipy import stats
import numpy as np

# Simulated data (seeded so the run is reproducible)
rng = np.random.default_rng(0)
a = rng.normal(4.2, 2.1, size=1000)
b = rng.normal(4.5, 2.0, size=1000)

t_stat, p_value = stats.ttest_ind(a, b)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
# typically t near -3 and p well below 0.05: a clearly significant difference

With 1000 users in each group, a 0.3-minute difference is detectable. With only 50, the test would usually miss it; the simulation below makes this concrete.
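
That claim about sample size is a statement about statistical power. A rough Monte Carlo sketch, reusing the numbers from the example above (exact outputs vary run to run):

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def estimated_power(n, n_sims=2000, alpha=0.05):
    """Fraction of simulated experiments where the t-test detects the effect."""
    detections = 0
    for _ in range(n_sims):
        a = rng.normal(4.2, 2.1, size=n)  # group A: true mean 4.2
        b = rng.normal(4.5, 2.0, size=n)  # group B: true mean 4.5
        if stats.ttest_ind(a, b).pvalue < alpha:
            detections += 1
    return detections / n_sims

print(estimated_power(1000))  # roughly 0.9: almost always detected
print(estimated_power(50))    # roughly 0.1: usually missed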

5. Type I and Type II errors

|                 | $H_0$ is true (no effect)      | $H_0$ is false (real effect)    |
|-----------------|--------------------------------|---------------------------------|
| Reject $H_0$    | Type I error (false positive)  | Correct ✓                       |
| Fail to reject  | Correct ✓                      | Type II error (false negative)  |

6. Multiple comparisons: the silent killer

If you test 20 independent hypotheses at $\alpha = 0.05$, you'd expect 1 false positive on average, even if nothing is real. Running 20 A/B tests and celebrating the one with $p < 0.05$ is how garbage products get shipped.

Corrections:

  - Bonferroni: require $p < \alpha / m$ when running $m$ tests (simple and conservative).
  - Benjamini-Hochberg: controls the false discovery rate rather than the family-wise error rate, and is less punishing when $m$ is large.
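
A quick simulation of the trap, reusing the engagement numbers from the A/B example purely as a stand-in: every pair below is drawn from the same distribution, so any "significant" result is a false positive.

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha, m = 0.05, 20

# 20 A/B tests with NO real effect: both groups share one distribution.
p_values = np.array([
    stats.ttest_ind(rng.normal(4.2, 2.1, size=1000),
                    rng.normal(4.2, 2.1, size=1000)).pvalue
    for _ in range(m)
])

print((p_values < alpha).sum())      # often 1 or more false positives
print((p_values < alpha / m).sum())  # Bonferroni-corrected: almost always 0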

7. When to be Bayesian vs. frequentist in ML

Pragmatically:

  - Go frequentist when data is plentiful and you want a cheap, standard decision procedure: A/B tests, regression diagnostics, monitoring.
  - Go Bayesian when data is scarce, you have genuine prior knowledge, or you need full uncertainty over parameters: cold-start estimates, small-sample decisions, hierarchical models.

8. Self-check