Before we touch a specific model, we need the framework everything else hangs on. This lesson is the mental skeleton: you'll reach for it every single time you build something real.
You already know the dbt/Airflow shape: sources → staging → marts → tests → deploy. ML has a similar spine: data → features → split → train → evaluate → deploy → monitor. The vocabulary is different and a few steps are sneaky (especially the split), but the engineering instinct transfers directly.
On paper, the workflow looks deceptively linear. In practice, you loop. You discover a data problem during evaluation and go back to step 2. You realize your framing was wrong and start over. Rapid looping is the job, not a sign you messed up.
The single biggest mistake engineers make is jumping straight to modeling. Before you load a single CSV, answer these:
| Question | Why it matters |
|---|---|
| What decision does this model drive? | If no human or system will act differently on the prediction, you're building a demo, not a product. |
| Is this supervised, unsupervised, or something else? | Do you have labels? (supervised) Just patterns to find? (unsupervised) An agent taking actions? (RL) |
| Regression or classification? | Predicting a number (revenue, latency) vs. a category (spam/not spam, churn/retain). |
| What's the cost of being wrong? | A false-positive spam filter wastes a second. A false-negative fraud check costs thousands. The cost shape picks your metric. |
| What's the baseline? | "Always predict the majority class" or "use last week's value." If you can't beat that, the ML isn't earning its keep. |
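The baseline question is easy to make concrete. A minimal sketch using scikit-learn's `DummyClassifier`, assuming `X` and `y` are an already-loaded feature matrix and binary label vector (hypothetical names):

```python
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# X and y are assumed to exist already (features and labels).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# "Always predict the majority class" as an estimator you can score like any model.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
print(accuracy_score(y_test, baseline.predict(X_test)))
```

Any real model has to clear that number before it's worth discussing.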
Treat the framing as a spec doc. Write it down before you code. Six weeks in, when the model isn't working, this doc is what tells you whether the target was even reasonable.
The three main learning paradigms differ in what signal the model gets:
| Type | What you give the model | Examples |
|---|---|---|
| Supervised | Inputs $X$ and correct answers $y$. Model learns $X \to y$. | Churn prediction, image classification, forecasting, fraud detection |
| Unsupervised | Just $X$. Model finds structure. | Clustering users, anomaly detection, dimensionality reduction (PCA) |
| Reinforcement | An environment with rewards. Model learns by acting. | Game playing, robotics, recommendation optimization, LLM RLHF |
| Self-supervised | Invent labels from the data itself (predict the next token, fill the blank). | How LLMs are pretrained |
Most real production ML is still supervised. That's where this module lives.
Regression and classification are both supervised; the only difference is the shape of the target, a number versus a category.
Before you train anything, split your data into three buckets: a training set the model learns from, a validation set you use to tune and compare models, and a test set you touch exactly once for the final honest score.
If any information from your test set sneaks into training, even indirectly, your reported accuracy is a lie. Common leaks: fitting a scaler on the whole dataset before splitting, using future data to predict the past, including a feature that's only known after the target event, or tuning against the test set until it looks good.
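One way to guard against the scaler leak specifically is to wrap preprocessing and the model in a single scikit-learn pipeline, so the scaler is only ever fit on training data. A sketch, assuming `X_train`, `y_train`, `X_test`, and `y_test` from an earlier split:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# The scaler is fit inside .fit(), on the training data only; the test set is
# transformed with statistics learned from training, never from itself.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))
```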
When data has a time dimension (which is almost always, in your world), never shuffle and split randomly. A model that trained on next month's data and predicts last month's will look magical and fail in production.
Instead, split by time: train on Jan–Sep, validate on Oct, test on Nov–Dec. This simulates the real job: predicting the future from the past.
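A sketch of that split in pandas, assuming a DataFrame `df` with an `event_date` datetime column (names and cutoff dates are hypothetical):

```python
import pandas as pd

# df is assumed to be a DataFrame with an `event_date` datetime column.
df = df.sort_values("event_date")

cutoff_val = pd.Timestamp("2024-10-01")   # illustrative cutoffs; mirror your real
cutoff_test = pd.Timestamp("2024-11-01")  # prediction horizon in production

train = df[df["event_date"] < cutoff_val]
val = df[(df["event_date"] >= cutoff_val) & (df["event_date"] < cutoff_test)]
test = df[df["event_date"] >= cutoff_test]
```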
For classification with rare classes (1% fraud), a random split might put most fraud cases in training and almost none in test. Stratified splits preserve the class ratio across all three buckets. scikit-learn's `train_test_split(..., stratify=y)` does this.
With small datasets, one validation set is noisy. K-fold cross-validation splits training data into $k$ folds, trains on $k-1$, validates on the held-out one, rotates. You get $k$ scores and average them. Standard default: $k=5$ or $10$.
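A minimal cross-validation sketch with scikit-learn, assuming `X_train` and `y_train` from an earlier split:

```python
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

# 5-fold cross-validation: five fit/score rounds, each holding out a different fold.
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_train, y_train, cv=5, scoring="roc_auc"
)
print(scores.mean(), scores.std())  # the spread tells you how noisy the estimate is
```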
Once you've split, you build features from the training set (we'll do this in depth in Module 4). Then you fit a model. Fitting is just: pick model parameters that minimize a loss on the training data.
You already know how this optimization works β it's the gradient descent from Module 1. What's new is that we're now doing it in a structured pipeline with scikit-learn or PyTorch, not by hand.
The typical scikit-learn shape looks identical across almost every classical model:
```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Hold out a stratified test set once, up front.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)                 # fit parameters on training data only
y_pred = model.predict_proba(X_test)[:, 1]  # predicted probability of the positive class
print(roc_auc_score(y_test, y_pred))        # honest score on held-out data
```
That `fit` / `predict` interface is the whole API surface of classical ML. Every model (random forests, gradient boosting, SVMs) uses the same two methods. Learn it once, use it everywhere.
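To see how little changes when you swap models, here's the same shape with a different estimator; everything except the model line is identical (assuming the `X_train` / `X_test` split from above):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Different model family, same two methods.
model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict_proba(X_test)[:, 1]
print(roc_auc_score(y_test, y_pred))
```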
We'll spend a whole lesson on metrics later (Lesson 4). For now, the key idea: the metric you optimize is what you actually get. Pick one that reflects the decision you're driving.
| Problem | Common metric | Why |
|---|---|---|
| Regression | RMSE, MAE, R² | Measures numeric distance from truth |
| Balanced classification | Accuracy, F1 | Simple and interpretable when classes are even |
| Imbalanced classification | ROC AUC, PR AUC, recall@k | Accuracy is useless when 99% of labels are one class |
| Ranking / recommendation | NDCG, MAP, hit@k | Care about ordering, not absolute score |
| Forecasting | MAPE, sMAPE, pinball loss | Relative errors, or quantile-aware |
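Most of these are one import away in scikit-learn. A quick sketch for the regression row, assuming `y_true` and `y_hat` are arrays of actuals and predictions (hypothetical names):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

mae = mean_absolute_error(y_true, y_hat)
rmse = np.sqrt(mean_squared_error(y_true, y_hat))  # RMSE penalizes large misses harder than MAE
r2 = r2_score(y_true, y_hat)
print(mae, rmse, r2)
```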
In a tutorial, the cells run top to bottom. In practice, you spend most of your time in this loop: change one thing (a feature, a hyperparameter, the problem framing), retrain, score on the validation set, compare against your best run so far, repeat.
Keep a lightweight log of every experiment: what changed, what score, what you learned. MLflow or just a markdown file both work. If you've built a runs/jobs table before, it's the same mental model.
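If you go the MLflow route, logging a run is a few lines; a sketch with made-up parameter and metric values:

```python
import mlflow

# One run = one experiment iteration; the values here are illustrative.
with mlflow.start_run(run_name="logreg_baseline"):
    mlflow.log_param("model", "LogisticRegression")
    mlflow.log_param("max_iter", 1000)
    mlflow.log_metric("val_roc_auc", 0.87)  # made-up score
```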
Deployment is Module 7, but the seed goes in here. The moment you ship, two things become true: the data your model sees starts drifting away from what it trained on, and its predictions start driving real decisions with real costs.
Every production ML system needs monitoring, scheduled retraining, and a rollback plan. It's engineering you already know how to do; the ML-specific part is knowing what to measure: input distributions, prediction distributions, real-world outcomes when you have them.
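Monitoring input distributions can start very simple. One sketch, using a two-sample Kolmogorov–Smirnov test to compare a feature's training distribution against recent production values; `train_feature` and `live_feature` are hypothetical 1-D arrays:

```python
from scipy.stats import ks_2samp

# Flag the feature if its live distribution looks significantly different from
# what the model was trained on. The threshold is a judgment call.
stat, p_value = ks_2samp(train_feature, live_feature)
if p_value < 0.01:
    print("Input distribution shifted; investigate before trusting predictions.")
```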
Think of a trained model as a materialized artifact, like a dbt mart. It has inputs (features), an artifact (weights), tests (metrics), and a schedule (retraining). Most MLOps is just applied data engineering with the added constraint that the artifact changes behavior when you rebuild it.
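The analogy maps directly onto how you'd persist a model. A sketch using joblib (the filename and `X_new` are hypothetical):

```python
import joblib

# Build step: materialize the trained model to disk, like rebuilding a mart.
joblib.dump(model, "churn_model_v1.joblib")

# Serving/scoring step: load the artifact instead of retraining.
model = joblib.load("churn_model_v1.joblib")
predictions = model.predict_proba(X_new)[:, 1]  # X_new: new rows to score
```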
By the end of this module, "I need to predict X from Y" becomes a problem you can scaffold in 20 minutes, and the `fit` / `predict` pattern feels like a familiar API shape.