Module 2 · Lesson 1 · ~40 min · Foundations

Probability & Distributions

Probability is the language for reasoning about uncertainty. Every ML model outputs, implicitly or explicitly, a probability distribution. Learn to think in distributions and ML models stop being magic.

Bridge from SQL

When you GROUP BY country and compute COUNT(*), you're estimating the empirical distribution of countries in your data. When you compute AVG(purchase_amount), you're estimating a mean. When you ask "is this difference significant?", that's inference. Statistics is the rigorous version of the questions you already ask.
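If it helps to see the same moves in Python, below is a minimal pandas sketch of those queries. The events table and its columns are made up for illustration.

import pandas as pd

# A hypothetical events table, standing in for the SQL examples above
events = pd.DataFrame({
    "country": ["US", "US", "DE", "FR", "DE", "US"],
    "purchase_amount": [12.0, 30.5, 8.0, 22.0, 15.5, 9.0],
})

# GROUP BY country + COUNT(*) -> the empirical distribution over countries
print(events["country"].value_counts(normalize=True))

# AVG(purchase_amount) -> an estimate of the mean
print(events["purchase_amount"].mean())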

1. Probability โ€” four mental models

A probability is a number between 0 and 1 that expresses how likely something is. There are several equally valid ways to think about what that number means.

In ML you'll use all of these at different moments. For training, we mostly act like Bayesians. For evaluation, mostly frequentist. For serving, decision-theoretic.

2. The three axioms, mathematically

For a set of possible outcomes $\Omega$ and a probability function $P$:

  1. Non-negativity: $P(A) \geq 0$ for every event $A$.
  2. Normalization: $P(\Omega) = 1$; something in $\Omega$ must happen.
  3. Additivity: if $A$ and $B$ are disjoint (they can't both happen), then $P(A \cup B) = P(A) + P(B)$.

Almost everything else in probability follows from these three rules.
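A quick sanity check of those rules on a small made-up PMF over four outcomes:

import numpy as np

pmf = np.array([0.1, 0.2, 0.3, 0.4])     # probabilities of four disjoint outcomes
assert (pmf >= 0).all()                  # rule 1: no negative probabilities
assert np.isclose(pmf.sum(), 1.0)        # rule 2: something must happen

# Rule 3: for disjoint outcomes, probabilities of unions add
print(pmf[0] + pmf[1])                   # P(outcome 0 or outcome 1) = 0.3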

3. Joint, marginal, and conditional probability

These three concepts are the bread and butter of ML. Master them:

Joint

$P(A, B)$ is the probability of both $A$ and $B$ happening. Example: $P(\text{rain}, \text{cold}) = 0.3$.

Marginal

$P(A)$ is the probability of $A$ regardless of $B$. You get it by summing the joint over all values of $B$:

$$P(A) = \sum_b P(A, B=b)$$

This "summing over a variable to eliminate it" is called marginalization. You'll see it constantly.

Conditional

$P(A \mid B)$ is the probability of $A$ given that $B$ has happened. Definition:

$$P(A \mid B) = \frac{P(A, B)}{P(B)}$$

In words: restrict the world to just "cases where $B$ happened," then ask how often $A$ also happened.
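Here is a tiny worked example of all three, joint, marginal, and conditional, using an invented joint table over rain and cold:

import numpy as np

# Invented joint distribution P(rain, cold): rows = rain / no rain, cols = cold / not cold
joint = np.array([
    [0.30, 0.10],    # P(rain, cold),    P(rain, not cold)
    [0.15, 0.45],    # P(no rain, cold), P(no rain, not cold)
])
assert np.isclose(joint.sum(), 1.0)

# Marginal: sum out "cold" to get P(rain)
p_rain = joint.sum(axis=1)[0]                 # 0.30 + 0.10 = 0.40

# Conditional: P(rain | cold) = P(rain, cold) / P(cold)
p_cold = joint.sum(axis=0)[0]                 # 0.30 + 0.15 = 0.45
p_rain_given_cold = joint[0, 0] / p_cold      # 0.30 / 0.45 = 0.67
print(p_rain, p_rain_given_cold)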

Why ML is about conditional distributions

Every supervised ML model estimates a conditional, $P(y \mid x)$: "given features $x$, what's the distribution over the label $y$?" A classifier that outputs "90% cat, 8% dog, 2% other" is literally spelling out a conditional distribution. A regression model outputs a point estimate of $E[y \mid x]$.
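To make that concrete, here is a minimal softmax sketch; the logits are invented so the output roughly matches the cat/dog/other numbers above:

import numpy as np

def softmax(z):
    z = z - z.max()                  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([4.0, 1.6, 0.2])   # raw model scores for cat, dog, other
p_y_given_x = softmax(logits)        # the categorical conditional P(y | x)
print(p_y_given_x)                   # roughly [0.90, 0.08, 0.02], sums to 1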

4. Independence

Two events are independent if knowing one tells you nothing about the other:

$$P(A, B) = P(A) \cdot P(B)$$

Equivalent form: $P(A \mid B) = P(A)$.

Most things in real life are not independent; that's why we can learn from data. But conditional independence shows up constantly: two siblings' umbrellas are correlated, yet given the weather, one umbrella tells you nothing extra about the other. Naive Bayes classifiers lean on this heavily.
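You can test independence straight from a joint table by comparing $P(A, B)$ with $P(A) \cdot P(B)$. Reusing the invented rain/cold table from earlier:

import numpy as np

joint = np.array([[0.30, 0.10],
                  [0.15, 0.45]])     # invented P(rain, cold) from the earlier example

p_rain = joint.sum(axis=1)[0]        # 0.40
p_cold = joint.sum(axis=0)[0]        # 0.45
print(joint[0, 0], p_rain * p_cold)  # 0.30 vs 0.18: rain and cold are not independent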

5. Random variables and distributions

A random variable is a variable whose value is uncertain. Coin flip, user's age, tomorrow's temperature.

A distribution describes all the possible values and their probabilities.

Discrete: probability mass function (PMF)

For countable outcomes: integer counts, categories. The PMF gives the probability of each exact value:

$$P(X = k) = p(k), \quad \sum_k p(k) = 1$$

Continuous: probability density function (PDF)

For uncountable outcomes: any real number. The probability of any single exact value is zero; we talk about probability densities and probabilities over ranges:

$$P(a \leq X \leq b) = \int_a^b f(x) \, dx$$

A density is not a probability

A PDF can take values greater than 1. That's not a bug: the PDF is a density, not a probability. Probability is the area under the curve, and the total area is 1. Confusing these is a classic gotcha.
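A quick scipy demonstration: a narrow normal has a density well above 1 at its peak, yet the area under its curve is still 1.

from scipy import stats
from scipy.integrate import quad

narrow = stats.norm(loc=0, scale=0.1)    # normal with a small standard deviation
print(narrow.pdf(0))                     # ~3.99, a density, perfectly fine to exceed 1

area, _ = quad(narrow.pdf, -1, 1)        # probability = area under the density
print(area)                              # ~1.0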

6. Distributions you'll actually use

Memorize these six. You'll meet them constantly:

Bernoulli – single yes/no

One trial with probability $p$ of success. $X \in \{0, 1\}$. A single click, a single positive label.

Binomial – count of successes in $n$ trials

$X$ = number of 1s in $n$ independent Bernoulli trials. "Out of 1000 shown ads, how many were clicked?"

Categorical (multinomial with n=1)

Generalization of Bernoulli to $k$ outcomes. Every classification model outputs a categorical distribution over classes. Softmax produces one.

Normal (Gaussian)

The default continuous distribution. Shape: bell curve. Parameters: mean $\mu$ and variance $\sigma^2$:

$$f(x) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$

The normal dominates ML because of the Central Limit Theorem (next lesson): sums of independent things tend to be approximately normal. Regression residuals, measurement errors, and weight initializations all default to normal.
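As a small preview of that claim (the full story is next lesson), sums of independent uniform draws already look bell-shaped:

import numpy as np

rng = np.random.default_rng(0)
sums = rng.uniform(0, 1, size=(10_000, 30)).sum(axis=1)  # each entry: sum of 30 uniforms
print(sums.mean(), sums.std())     # ~15 and ~1.58; a histogram of sums is close to normal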

Poisson – event counts in a window

Integer-valued. "Emails per hour," "website crashes per day." Governed by a rate $\lambda$. Mean and variance are both $\lambda$.

Exponential – time between events

Continuous, positive. Waiting time between Poisson events. Memoryless property.
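The two are flip sides of one process. A small simulation (the rate here is arbitrary) shows counts per window behaving like a Poisson and gaps between events like an exponential:

import numpy as np

rng = np.random.default_rng(1)
lam = 4.0                                             # arbitrary rate: 4 events per hour

counts = rng.poisson(lam, size=100_000)               # events per hour
print(counts.mean(), counts.var())                    # both ~4: mean = variance = lambda

waits = rng.exponential(scale=1 / lam, size=100_000)  # hours between events
print(waits.mean())                                   # ~0.25: average gap is 1 / lambda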

Head to the Distribution Explorer and spend 10 minutes playing with them. Muscle memory matters.

7. Expectation and variance

Expectation $E[X]$ = the long-run average value of a random variable:

$$E[X] = \sum_x x \cdot P(X=x) \quad \text{(discrete)} \qquad E[X] = \int x \cdot f(x) \, dx \quad \text{(continuous)}$$

Variance $\text{Var}(X) = E[(X - E[X])^2]$ = the average squared distance from the mean. It measures spread.

Standard deviation $\sigma = \sqrt{\text{Var}(X)}$ has the same units as $X$, which makes it more interpretable.

Useful identities:

  - Linearity: $E[aX + b] = aE[X] + b$, and $E[X + Y] = E[X] + E[Y]$ even when $X$ and $Y$ are dependent.
  - Scaling: $\text{Var}(aX + b) = a^2 \, \text{Var}(X)$; shifting doesn't change spread.
  - Shortcut: $\text{Var}(X) = E[X^2] - (E[X])^2$.
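A numeric check of the definitions and the scaling identity on a small made-up PMF:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])       # possible values
p = np.array([0.1, 0.2, 0.3, 0.4])       # their probabilities (sum to 1)

ex  = (x * p).sum()                      # E[X] = 3.0
var = ((x - ex) ** 2 * p).sum()          # Var(X) = 1.0
print(ex, var, np.sqrt(var))

# Var(aX + b) = a^2 Var(X): try a = 2, b = 5
y = 2 * x + 5
ey = (y * p).sum()                       # 11.0 = 2 * E[X] + 5
print(((y - ey) ** 2 * p).sum())         # 4.0 = 4 * Var(X)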

8. Why ML cares: likelihood and log-likelihood

When you train a probabilistic model, you're typically doing maximum likelihood estimation (MLE):

  1. Assume your data came from a specific distribution family (e.g., Gaussian, Bernoulli).
  2. Find the parameters that make the observed data most probable.

For $n$ independent observations $x_1, \ldots, x_n$ under a distribution with parameters $\theta$:

$$L(\theta) = \prod_{i=1}^n p(x_i \mid \theta)$$

In practice we maximize the log-likelihood instead (products become sums, no underflow):

$$\log L(\theta) = \sum_{i=1}^n \log p(x_i \mid \theta)$$

Cross-entropy loss = negative log-likelihood

When you train a classifier with cross-entropy, you're doing MLE on a categorical distribution. When you train a regression model with MSE, you're doing MLE assuming normally-distributed residuals. These aren't arbitrary choices; they follow from probabilistic modeling. Almost every loss function in ML is a negative log-likelihood under some assumption.
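One way to see the MSE connection numerically: with a fixed $\sigma$, the Gaussian negative log-likelihood as a function of the mean is minimized at the same place as the mean squared error, namely the sample mean. A small sketch on synthetic data:

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
data = rng.normal(loc=3.0, scale=1.0, size=500)    # synthetic observations

mus = np.linspace(0, 6, 601)                       # candidate means
nll = [-stats.norm.logpdf(data, loc=m, scale=1.0).sum() for m in mus]
mse = [((data - m) ** 2).mean() for m in mus]

# Both curves bottom out at (approximately) the sample mean
print(mus[np.argmin(nll)], mus[np.argmin(mse)], data.mean())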

9. The code you'll write

import numpy as np
from scipy import stats

# Sample from distributions
np.random.normal(loc=0, scale=1, size=1000)          # standard normal
np.random.binomial(n=10, p=0.3, size=1000)           # 1000 draws from Binomial(10, 0.3)
np.random.uniform(low=0, high=1, size=1000)          # 1000 draws from Uniform(0, 1)

# Compute densities and probabilities
stats.norm.pdf(0, loc=0, scale=1)                    # 0.3989 = density at x=0
stats.norm.cdf(1.96, loc=0, scale=1)                 # 0.975 = P(X <= 1.96)
stats.binom.pmf(k=3, n=10, p=0.3)                    # probability of exactly 3 successes

# Fit a distribution to data
data = np.random.normal(5, 2, size=1000)
mu_hat, sigma_hat = stats.norm.fit(data)             # MLE fit

10. Self-check