Module 3 · Lesson 1 · ~40 min · The operating system of ML

The ML Workflow: From Problem to Production

Before we touch a specific model, we need the framework everything else hangs on. This lesson is the mental skeleton: you'll reach for it every single time you build something real.

Bridge from your world

You already know the dbt/Airflow shape: sources → staging → marts → tests → deploy. ML has a similar spine: data → features → split → train → evaluate → deploy → monitor. The vocabulary is different and a few steps are sneaky (especially the split), but the engineering instinct transfers directly.

1. The workflow, in one picture

  1. Frame: define the problem
  2. Data: collect and clean
  3. Split: train / validation / test
  4. Features: engineer inputs
  5. Model: pick and fit
  6. Evaluate: score on validation/test
  7. Tune: iterate
  8. Ship: deploy
  9. Monitor: watch and retrain

It's deceptively linear in the picture. In practice, you loop. You discover a data problem during evaluation and go back to step 2. You realize your framing was wrong and start over. Rapid looping is the job, not a sign you messed up.

2. Frame the problem (don't skip this)

The single biggest mistake engineers make is jumping straight to modeling. Before you load a single CSV, answer these:

| Question | Why it matters |
| --- | --- |
| What decision does this model drive? | If no human or system will act differently on the prediction, you're building a demo, not a product. |
| Is this supervised, unsupervised, or something else? | Do you have labels? (supervised) Just patterns to find? (unsupervised) An agent taking actions? (RL) |
| Regression or classification? | Predicting a number (revenue, latency) vs. a category (spam/not spam, churn/retain). |
| What's the cost of being wrong? | A false-positive spam filter wastes a second. A false-negative fraud check costs thousands. The cost shape picks your metric. |
| What's the baseline? | "Always predict the majority class" or "use last week's value": if you can't beat that, the ML isn't earning its keep. |
Engineering habit

Treat the framing as a spec doc. Write it down before you code. Six weeks in, when the model isn't working, this doc is what tells you whether the target was even reasonable.
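
To make the baseline concrete: scikit-learn's DummyClassifier (or DummyRegressor) is the "always predict the majority class" strategy in one line. A minimal sketch on synthetic data; the 90/10 class mix and the dataset are stand-ins for your real problem.

from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: roughly 90% negative, 10% positive
X, y = make_classification(n_samples=1_000, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# "Always predict the majority class" -- the score any real model has to beat
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
print(baseline.score(X_test, y_test))   # accuracy of doing nothing clever, ~0.9 here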

3. Supervised vs. unsupervised vs. RL

The learning paradigms differ in what signal the model gets:

| Type | What you give the model | Examples |
| --- | --- | --- |
| Supervised | Inputs $X$ and correct answers $y$. Model learns $X \to y$. | Churn prediction, image classification, forecasting, fraud detection |
| Unsupervised | Just $X$. Model finds structure. | Clustering users, anomaly detection, dimensionality reduction (PCA) |
| Reinforcement | An environment with rewards. Model learns by acting. | Game playing, robotics, recommendation optimization, LLM RLHF |
| Self-supervised | Labels invented from the data itself (predict the next token, fill in the blank). | How LLMs are pretrained |

Most real production ML is still supervised. That's where this module lives.
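
The difference is visible directly in the scikit-learn API: a supervised estimator is fit on inputs and labels, an unsupervised one on inputs alone. A minimal sketch on synthetic data; the two estimators are just examples.

from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=0)

# Supervised: the model sees the answers y and learns the mapping X -> y
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Unsupervised: no y at all -- the model only looks for structure in X
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)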

4. Regression vs. classification

Both are supervised. What differs is the shape of the target: regression predicts a continuous number, classification predicts a discrete class (usually with a probability attached).
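
In code, very little changes between the two. A minimal sketch on synthetic data; the estimators are just examples.

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))

# Regression: the target is a continuous number (revenue, latency, etc.)
y_reg = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)
reg = LinearRegression().fit(X, y_reg)
print(reg.predict(X[:2]))            # real-valued outputs

# Classification: the target is a category (churn / retain, spam / not spam)
y_clf = (y_reg > 0).astype(int)
clf = LogisticRegression(max_iter=1000).fit(X, y_clf)
print(clf.predict(X[:2]))            # class labels
print(clf.predict_proba(X[:2]))      # class probabilities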

5. The split β€” where most engineers sabotage themselves

Before you train anything, split your data into three buckets:

Training (70%): fit parameters.
Validation (15%): tune hyperparameters.
Test (15%): final honest estimate.
The cardinal sin: data leakage

If any information from your test set sneaks into training, even indirectly, your reported accuracy is a lie. Common leaks: fitting a scaler on the whole dataset before splitting, using future data to predict the past, including a feature that's only known after the target event, or tuning against the test set until it looks good.
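
The scaler leak is worth seeing once in code. The fix is to fit all preprocessing on the training split only, which a Pipeline enforces for you. A minimal sketch; the scaler is just an example preprocessing step.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1_000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Leaky: the scaler learns test-set statistics before the model ever trains
# scaler = StandardScaler().fit(X)   # don't do this

# Safe: everything inside the pipeline is fit on X_train only
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))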

Time-aware splits

When data has a time dimension (which is almost always, in your world), never shuffle and split randomly. A model that trained on next month's data and predicts last month's will look magical and fail in production.

Instead: split by time. Train on Jan–Sep, validate on Oct, test on Nov–Dec. This simulates the real job: predicting the future from the past.
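
In pandas this is usually a couple of date masks. A sketch on a toy frame; the column names and cutoff dates are hypothetical, and scikit-learn's TimeSeriesSplit covers the rolling cross-validation variant of the same idea.

import numpy as np
import pandas as pd

# Toy frame standing in for your real table; "event_date" and "target" are made-up names
rng = np.random.default_rng(0)
df = pd.DataFrame({"event_date": pd.date_range("2024-01-01", "2024-12-31", freq="D")})
df["feature"] = rng.normal(size=len(df))
df["target"] = rng.integers(0, 2, size=len(df))

df = df.sort_values("event_date")
train = df[df["event_date"] < "2024-10-01"]                                       # Jan-Sep
val = df[(df["event_date"] >= "2024-10-01") & (df["event_date"] < "2024-11-01")]  # Oct
test = df[df["event_date"] >= "2024-11-01"]                                       # Nov-Dec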

Stratified splits

For classification with rare classes (1% fraud), a random split might put most fraud cases in training and almost none in test. Stratified splits preserve the class ratio across all three buckets. scikit-learn's train_test_split(..., stratify=y) does this.
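
It's worth checking that the ratio actually survives the split. A minimal sketch with a synthetic 1% positive class:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Roughly the fraud shape: ~1% positive class
X, y = make_classification(n_samples=10_000, weights=[0.99, 0.01], random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(y.mean(), y_train.mean(), y_test.mean())   # all three stay ~0.01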

Cross-validation

With small datasets, one validation set is noisy. K-fold cross-validation splits training data into $k$ folds, trains on $k-1$, validates on the held-out one, rotates. You get $k$ scores and average them. Standard default: $k=5$ or $10$.
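
In scikit-learn the whole rotation is one call. A minimal sketch on synthetic data:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=42)

# 5-fold CV: fit on 4 folds, score on the held-out fold, rotate, average
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5, scoring="roc_auc")
print(scores.mean(), scores.std())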

6. Features, training, and what "fitting" actually means

Once you've split, you build features from the training set (we'll do this in depth in Module 4). Then you fit a model. Fitting is just: pick model parameters that minimize a loss on the training data.

$$\theta^* = \arg\min_\theta \; \mathcal{L}(f_\theta(X_{\text{train}}), y_{\text{train}})$$

You already know how this optimization works β€” it's the gradient descent from Module 1. What's new is that we're now doing it in a structured pipeline with scikit-learn or PyTorch, not by hand.

The typical scikit-learn shape looks identical across almost every classical model:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)              # fit parameters

y_pred = model.predict_proba(X_test)[:, 1]
print(roc_auc_score(y_test, y_pred))     # honest score

That fit / predict interface is the whole API surface of classical ML. Every model (random forests, gradient boosting, SVMs) uses the same two methods. Learn it once, use it everywhere.
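
To make that concrete, here is the same script with a different estimator dropped in; the model choice and hyperparameters are arbitrary examples, and everything else (the split, the scoring) stays as above.

from sklearn.ensemble import RandomForestClassifier

# Reuses X_train, X_test, y_train, y_test and roc_auc_score from the snippet above
model = RandomForestClassifier(n_estimators=300, random_state=42)
model.fit(X_train, y_train)                    # same .fit

y_pred = model.predict_proba(X_test)[:, 1]     # same .predict_proba
print(roc_auc_score(y_test, y_pred))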

7. Evaluate β€” and choose the right metric

We'll spend a whole lesson on metrics later (Lesson 4). For now, the key idea: the metric you optimize is what you actually get. Pick one that reflects the decision you're driving.

| Problem | Common metric | Why |
| --- | --- | --- |
| Regression | RMSE, MAE, R² | Measures numeric distance from truth |
| Balanced classification | Accuracy, F1 | Simple and interpretable when classes are even |
| Imbalanced classification | ROC AUC, PR AUC, recall@k | Accuracy is useless when 99% of labels are one class |
| Ranking / recommendation | NDCG, MAP, hit@k | Care about ordering, not absolute score |
| Forecasting | MAPE, sMAPE, pinball loss | Relative errors, or quantile-aware |
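
A quick illustration of the imbalanced case: on a 99/1 label split, a model that always predicts the majority class scores about 99% accuracy but only chance-level ROC AUC. A minimal sketch:

import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.01).astype(int)   # ~1% positive labels

# A useless "model": always predicts the negative class, with zero confidence scores
y_pred_label = np.zeros_like(y_true)
y_pred_score = np.zeros(len(y_true))

print(accuracy_score(y_true, y_pred_label))   # ~0.99, looks great
print(roc_auc_score(y_true, y_pred_score))    # 0.5, reveals it learned nothing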

8. Iterate: the real shape of the work

In a tutorial, the cells run top to bottom. In practice, you spend most of your time in this loop:

  1. Look at errors: which rows did the model miss?
  2. Form a hypothesis: "it seems to miss when X is high" or "the recent data is underrepresented."
  3. Fix the thing: new feature, more data, different model, different target encoding.
  4. Re-train, re-evaluate, compare to previous version.
  5. Repeat until marginal gains get boring.

Keep a lightweight log of every experiment: what changed, what score, what you learned. MLflow or just a markdown file both work. If you've built a runs/jobs table before, it's the same mental model.
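
If you go the plain-file route, the log can literally be a dict appended to a CSV after each run; the fields and values below are hypothetical placeholders.

import csv
import datetime

# Hypothetical run record -- keep whatever you actually compare between experiments
run = {
    "when": datetime.datetime.now().isoformat(timespec="seconds"),
    "change": "added 30-day rolling spend feature",
    "val_auc": 0.0,   # fill in the real validation score
    "notes": "what you learned from this run",
}

with open("experiments.csv", "a", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=run.keys())
    if f.tell() == 0:          # write the header only when the file is new
        writer.writeheader()
    writer.writerow(run)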

9. Ship and monitor (a preview)

Deployment is Module 7, but the seed goes in here. The moment you ship, the data keeps changing while your model stays frozen, and you become responsible for noticing the gap.

Every production ML system needs monitoring, scheduled retraining, and a rollback plan. It's engineering you already know how to do; the ML-specific part is knowing what to measure: input distributions, prediction distributions, and real-world outcomes when you have them.
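
"Input distributions" can start very simply: compare live feature summaries against a training-time snapshot and alert when they move. Production setups use proper tests (PSI, Kolmogorov-Smirnov), but a crude sketch of the idea:

import numpy as np

def drift_flag(train_values, live_values, threshold=3.0):
    """Crude check: flag when the live mean sits more than `threshold`
    standard errors away from the training mean."""
    train_values = np.asarray(train_values, dtype=float)
    live_values = np.asarray(live_values, dtype=float)
    se = train_values.std() / np.sqrt(len(live_values)) + 1e-12
    return abs(live_values.mean() - train_values.mean()) / se > threshold

# Example: the live feature has shifted upward relative to training
rng = np.random.default_rng(0)
print(drift_flag(rng.normal(0, 1, 10_000), rng.normal(0.5, 1, 1_000)))   # True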

Your dbt intuition transfers

Think of a trained model as a materialized artifact, like a dbt mart. It has inputs (features), an artifact (weights), tests (metrics), and a schedule (retraining). Most MLOps is just applied data engineering with the added constraint that the artifact changes behavior when you rebuild it.

10. What this module covers

By the end of this module, "I need to predict X from Y" becomes a problem you can scaffold in 20 minutes.

11. Before you move on