Probability is the language for reasoning about uncertainty. Every ML model outputs, implicitly or explicitly, a probability distribution. Learn to think in distributions and ML models stop being magic.
When you GROUP BY country and compute COUNT(*), you're estimating the empirical distribution of countries in your data. When you compute AVG(purchase_amount), you're estimating a mean. When you ask "is this difference significant?", that's inference. Statistics is the rigorous version of the questions you already ask.
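That GROUP-BY-and-normalize step is a one-liner in Python. A minimal sketch (the country list is made-up toy data standing in for a table column):

```python
from collections import Counter

# Toy data, one row per user -- the equivalent of a country column
countries = ["US", "US", "DE", "FR", "US", "DE"]

counts = Counter(countries)                        # like GROUP BY country, COUNT(*)
n = len(countries)
empirical = {c: k / n for c, k in counts.items()}  # normalize counts to probabilities
print(empirical)  # probabilities sum to 1: US 1/2, DE 1/3, FR 1/6
```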
A probability is a number between 0 and 1 that expresses how likely something is. There are several equally valid ways to think about what that number means: as a long-run frequency over repeated trials (the frequentist view), as a degree of belief updated by evidence (the Bayesian view), or as a weight for making optimal decisions under uncertainty (the decision-theoretic view).
In ML you'll use all of these at different moments. For training, we mostly act like Bayesians. For evaluation, mostly frequentist. For serving, decision-theoretic.
For a set of possible outcomes $\Omega$ and a probability function $P$:

1. $P(A) \geq 0$ for every event $A$.
2. $P(\Omega) = 1$: something in $\Omega$ must happen.
3. If $A$ and $B$ are mutually exclusive, then $P(A \cup B) = P(A) + P(B)$.
Almost everything else in probability follows from these three rules.
These three concepts are the bread and butter of ML. Master them:
$P(A, B)$: the probability of both $A$ and $B$ happening. Example: $P(\text{rain}, \text{cold}) = 0.3$.
$P(A)$: the probability of $A$ regardless of $B$. You get it by summing the joint over all values of $B$:

$$P(A) = \sum_{b} P(A, B = b)$$
This "summing over a variable to eliminate it" is called marginalization. You'll see it constantly.
$P(A \mid B)$: probability of $A$ given that $B$ has happened. Definition:

$$P(A \mid B) = \frac{P(A, B)}{P(B)}$$
In words: restrict the world to just "cases where $B$ happened," then ask how often $A$ also happened.
Every supervised ML model estimates a conditional: $P(y \mid x)$, "given features $x$, what's the distribution over the label $y$?" A classifier that outputs "90% cat, 8% dog, 2% other" is literally spelling out a conditional distribution. A regression model outputs a point estimate of $E[y \mid x]$.
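Conditioning is "restrict, then renormalize." A sketch using a made-up $2 \times 2$ joint over rain and temperature: keep only the "cold" column, then divide by its total:

```python
import numpy as np

# Hypothetical joint P(rain, temp): rows = {rain, no rain}, cols = {cold, warm}
joint = np.array([[0.30, 0.10],
                  [0.20, 0.40]])

p_cold = joint[:, 0].sum()                # P(cold) = 0.3 + 0.2 = 0.5
p_rain_given_cold = joint[0, 0] / p_cold  # P(rain | cold) = 0.3 / 0.5
print(p_rain_given_cold)  # 0.6
```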
Two events are independent if knowing one tells you nothing about the other:

$$P(A, B) = P(A)\,P(B)$$
Equivalent form: $P(A \mid B) = P(A)$.
Most things in real life are not independent; that's why we can learn from data. But conditional independence shows up constantly: two kids' umbrellas are correlated with each other, yet once you know the weather, one umbrella tells you nothing extra about the other. Naive Bayes classifiers lean on this heavily.
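Independence is easy to check empirically: for independent events, the joint frequency should match the product of the marginals. A simulation sketch with two synthetic coin flips (probabilities chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
a = rng.random(n) < 0.5  # event A: a fair coin comes up heads
b = rng.random(n) < 0.3  # event B: an independent biased coin comes up heads

p_a, p_b = a.mean(), b.mean()
p_ab = (a & b).mean()
print(p_ab, p_a * p_b)  # both close to 0.5 * 0.3 = 0.15
```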
A random variable is a variable whose value is uncertain. Coin flip, user's age, tomorrow's temperature.
A distribution describes all the possible values and their probabilities.
For countable outcomes: integer counts, categories. The PMF gives the probability of each exact value:

$$p(x) = P(X = x), \qquad \sum_x p(x) = 1$$
For uncountable outcomes: any real number. The probability of exactly some value is zero; we talk about probability densities and probabilities over ranges:

$$P(a \leq X \leq b) = \int_a^b f(x)\,dx$$
A PDF can have values greater than 1. That's not a bug: the PDF is a density, not a probability. The area under the curve over a range is the probability, and the total area integrates to 1. Confusing density with probability is a classic gotcha.
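A concrete example, sketched with scipy: a uniform distribution on $[0, 0.5]$ has density 2 everywhere on its support, yet the total area is still 1.

```python
from scipy import stats

u = stats.uniform(loc=0, scale=0.5)  # uniform on [0, 0.5]
print(u.pdf(0.25))                   # 2.0 -- a density can exceed 1
print(u.cdf(0.5) - u.cdf(0.0))       # 1.0 -- but total probability is still 1
```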
Memorize these six. You'll meet them constantly:
Bernoulli: one trial with probability $p$ of success. $X \in \{0, 1\}$. A single click, a single positive label.
Binomial: $X$ = number of 1s in $n$ independent Bernoulli trials. "Out of 1000 shown ads, how many were clicked?"
Categorical: generalization of Bernoulli to $k$ outcomes. Every classification model outputs a categorical distribution over classes. Softmax produces one.
Normal (Gaussian): the default continuous distribution. Shape: bell curve. Parameters: mean $\mu$ and variance $\sigma^2$:

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$
The normal dominates ML because of the Central Limit Theorem (next lesson): sums of independent things tend to be approximately normal. Regression residuals, measurement errors, weight initializations all default to normal.
Poisson: integer-valued counts. "Emails per hour," "website crashes per day." Governed by a rate $\lambda$. Mean and variance are both $\lambda$.
Exponential: continuous, positive. Waiting time between Poisson events. Memoryless: how long you've already waited tells you nothing about how much longer you'll wait.
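The Poisson's mean-equals-variance property is easy to sanity-check by simulation; a sketch with an arbitrary rate of 4:

```python
import numpy as np

rng = np.random.default_rng(42)
draws = rng.poisson(lam=4.0, size=100_000)  # e.g. "emails per hour" at rate 4

print(draws.mean(), draws.var())  # both land close to lambda = 4
```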
Head to the Distribution Explorer and spend 10 minutes playing with them. Muscle memory matters.
Expectation $E[X]$ = the long-run average value of a random variable:

$$E[X] = \sum_x x\,P(X = x) \ \text{(discrete)}, \qquad E[X] = \int x\,f(x)\,dx \ \text{(continuous)}$$
Variance $\text{Var}(X) = E[(X - E[X])^2]$ = the average squared distance from the mean. It measures spread.
Standard deviation $\sigma = \sqrt{\text{Var}(X)}$ has the same units as $X$, which makes it more interpretable.
Useful identities:

- $E[aX + b] = aE[X] + b$ (expectation is linear)
- $\text{Var}(aX + b) = a^2\,\text{Var}(X)$ (shifting doesn't change spread; scaling scales it quadratically)
- $\text{Var}(X) = E[X^2] - (E[X])^2$ (a handy computational shortcut)
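The standard expectation/variance identities, linearity of expectation, $\text{Var}(aX + b) = a^2\,\text{Var}(X)$, and $\text{Var}(X) = E[X^2] - (E[X])^2$, can all be verified numerically on a sample; a sketch with arbitrary constants:

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(loc=2.0, scale=3.0, size=200_000)
a, b = 5.0, -1.0

# E[aX + b] = a E[X] + b  (linearity of expectation)
print((a * x + b).mean(), a * x.mean() + b)
# Var(aX + b) = a^2 Var(X)  (shifts don't change spread)
print((a * x + b).var(), a**2 * x.var())
# Var(X) = E[X^2] - E[X]^2  (computational shortcut)
print(x.var(), (x**2).mean() - x.mean()**2)
```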
When you train a probabilistic model, you're typically doing maximum likelihood estimation (MLE):
For $n$ independent observations $x_1, \ldots, x_n$ under a distribution with parameters $\theta$:

$$\hat{\theta}_{\text{MLE}} = \arg\max_\theta \prod_{i=1}^{n} p(x_i \mid \theta)$$
In practice we maximize the log-likelihood instead (products become sums, no underflow):

$$\hat{\theta}_{\text{MLE}} = \arg\max_\theta \sum_{i=1}^{n} \log p(x_i \mid \theta)$$
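The underflow problem is real, not theoretical. A sketch: multiplying a couple of thousand densities together hits zero in float64, while summing their logs stays finite.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(0, 1, size=2_000)

densities = stats.norm.pdf(data)      # each density is a smallish number
likelihood = np.prod(densities)       # underflows to 0.0 in float64
log_likelihood = np.log(densities).sum()

print(likelihood)      # 0.0
print(log_likelihood)  # a finite negative number -- no underflow
```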
When you train a classifier with cross-entropy, you're doing MLE on a categorical distribution. When you train a regression model with MSE, you're doing MLE assuming normally-distributed residuals. These aren't arbitrary choices; they follow from probabilistic modeling. Almost every loss function in ML is a negative log-likelihood under some assumption.
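A sketch of the regression case on synthetic data: with a Gaussian likelihood ($\sigma$ fixed at 1), the negative log-likelihood of a constant prediction $\mu$ is, up to a constant, the squared error, so the $\mu$ that minimizes the NLL coincides with the sample mean, the MSE minimizer.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
y = rng.normal(loc=10.0, scale=2.0, size=5_000)  # synthetic targets

# NLL(mu) = -sum_i log N(y_i; mu, 1) = 0.5 * sum_i (y_i - mu)^2 + const
mus = np.linspace(8, 12, 401)
nll = np.array([-stats.norm.logpdf(y, loc=m, scale=1.0).sum() for m in mus])

mu_best = mus[nll.argmin()]
print(mu_best, y.mean())  # NLL minimizer sits at the sample mean, up to grid spacing
```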
import numpy as np
from scipy import stats
# Sample from distributions
np.random.normal(loc=0, scale=1, size=1000) # standard normal
np.random.binomial(n=10, p=0.3, size=1000) # 1000 draws from Binomial(10, 0.3)
np.random.uniform(low=0, high=1, size=1000)
# Compute densities and probabilities
stats.norm.pdf(0, loc=0, scale=1)    # 0.3989: density at x=0
stats.norm.cdf(1.96, loc=0, scale=1) # 0.975: P(X ≤ 1.96)
stats.binom.pmf(k=3, n=10, p=0.3) # probability of exactly 3 successes
# Fit a distribution to data
data = np.random.normal(5, 2, size=1000)
mu_hat, sigma_hat = stats.norm.fit(data) # MLE fit