Module 3 · Lesson 1 · ~40 min · The operating system of ML

The ML Workflow: From Problem to Production

Before we touch a specific model, we need the framework everything else hangs on. This lesson is the mental skeleton: you'll reach for it every single time you build something real.

Bridge from your world

You already know the dbt/Airflow shape: sources → staging → marts → tests → deploy. ML has a similar spine: data → features → split → train → evaluate → deploy → monitor. The vocabulary is different and a few steps are sneaky (especially the split), but the engineering instinct transfers directly.

1. The workflow, in one picture

  1. Frame: define the problem
  2. Data: collect and clean
  3. Split: train / validation / test
  4. Features: engineer inputs
  5. Model: pick and fit
  6. Evaluate: score on validation/test
  7. Tune: iterate
  8. Ship: deploy
  9. Monitor: watch and retrain

It's deceptively linear in the picture. In practice, you loop. You discover a data problem during evaluation and go back to step 2. You realize your framing was wrong and start over. Rapid looping is the job, not a sign you messed up.

2. Frame the problem (don't skip this)

The single biggest mistake engineers make is jumping straight to modeling. Before you load a single CSV, answer these:

| Question | Why it matters |
| --- | --- |
| What decision does this model drive? | If no human or system will act differently on the prediction, you're building a demo, not a product. |
| Is this supervised, unsupervised, or something else? | Do you have labels? (supervised) Just patterns to find? (unsupervised) An agent taking actions? (RL) |
| Regression or classification? | Predicting a number (revenue, latency) vs. a category (spam/not spam, churn/retain). |
| What's the cost of being wrong? | A false-positive spam filter wastes a second. A false-negative fraud check costs thousands. The cost shape picks your metric. |
| What's the baseline? | "Always predict the majority class" or "use last week's value": if you can't beat that, the ML isn't earning its keep. |
Engineering habit

Treat the framing as a spec doc. Write it down before you code. Six weeks in, when the model isn't working, this doc is what tells you whether the target was even reasonable.
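
To make the baseline concrete: scikit-learn's DummyClassifier (or DummyRegressor) is the "always predict the majority class" strategy in one line. A minimal sketch on synthetic data; the 90/10 class mix and the dataset are stand-ins for your real problem.

from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: roughly 90% negative, 10% positive
X, y = make_classification(n_samples=1_000, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# "Always predict the majority class" -- the score any real model has to beat
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
print(baseline.score(X_test, y_test))   # accuracy of doing nothing clever, ~0.9 here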

3. Supervised vs. unsupervised vs. RL

The learning paradigms differ in what signal the model gets:

| Type | What you give the model | Examples |
| --- | --- | --- |
| Supervised | Inputs $X$ and correct answers $y$. Model learns $X \to y$. | Churn prediction, image classification, forecasting, fraud detection |
| Unsupervised | Just $X$. Model finds structure. | Clustering users, anomaly detection, dimensionality reduction (PCA) |
| Reinforcement | An environment with rewards. Model learns by acting. | Game playing, robotics, recommendation optimization, LLM RLHF |
| Self-supervised | Labels invented from the data itself (predict the next token, fill in the blank). | How LLMs are pretrained |

Most real production ML is still supervised. That's where this module lives.
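
The difference is visible directly in the scikit-learn API: a supervised estimator is fit on inputs and labels, an unsupervised one on inputs alone. A minimal sketch on synthetic data; the two estimators are just examples.

from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=0)

# Supervised: the model sees the answers y and learns the mapping X -> y
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Unsupervised: no y at all -- the model only looks for structure in X
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)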

4. Regression vs. classification

Both are supervised. What differs is the shape of the target: regression predicts a continuous number, classification predicts a discrete class (usually with a probability attached).
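
In code, very little changes between the two. A minimal sketch on synthetic data; the estimators are just examples.

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))

# Regression: the target is a continuous number (revenue, latency, etc.)
y_reg = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)
reg = LinearRegression().fit(X, y_reg)
print(reg.predict(X[:2]))            # real-valued outputs

# Classification: the target is a category (churn / retain, spam / not spam)
y_clf = (y_reg > 0).astype(int)
clf = LogisticRegression(max_iter=1000).fit(X, y_clf)
print(clf.predict(X[:2]))            # class labels
print(clf.predict_proba(X[:2]))      # class probabilities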

5. The split β€” where most engineers sabotage themselves

Before you train anything, split your data into three buckets:

Training (70%): fit parameters.
Validation (15%): tune hyperparameters.
Test (15%): final honest estimate.
The cardinal sin: data leakage

If any information from your test set sneaks into training, even indirectly, your reported accuracy is a lie. Common leaks: fitting a scaler on the whole dataset before splitting, using future data to predict the past, including a feature that's only known after the target event, or tuning against the test set until it looks good.
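
The scaler leak is worth seeing once in code. The fix is to fit all preprocessing on the training split only, which a Pipeline enforces for you. A minimal sketch; the scaler is just an example preprocessing step.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1_000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Leaky: the scaler learns test-set statistics before the model ever trains
# scaler = StandardScaler().fit(X)   # don't do this

# Safe: everything inside the pipeline is fit on X_train only
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))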

Time-aware splits

When data has a time dimension (which is almost always, in your world), never shuffle and split randomly. A model that trained on next month's data and predicts last month's will look magical and fail in production.

Instead: split by time. Train on Jan–Sep, validate on Oct, test on Nov–Dec. This simulates the real job: predicting the future from the past.
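
In pandas this is usually a couple of date masks. A sketch on a toy frame; the column names and cutoff dates are hypothetical, and scikit-learn's TimeSeriesSplit covers the rolling cross-validation variant of the same idea.

import numpy as np
import pandas as pd

# Toy frame standing in for your real table; "event_date" and "target" are made-up names
rng = np.random.default_rng(0)
df = pd.DataFrame({"event_date": pd.date_range("2024-01-01", "2024-12-31", freq="D")})
df["feature"] = rng.normal(size=len(df))
df["target"] = rng.integers(0, 2, size=len(df))

df = df.sort_values("event_date")
train = df[df["event_date"] < "2024-10-01"]                                       # Jan-Sep
val = df[(df["event_date"] >= "2024-10-01") & (df["event_date"] < "2024-11-01")]  # Oct
test = df[df["event_date"] >= "2024-11-01"]                                       # Nov-Dec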

Stratified splits

For classification with rare classes (1% fraud), a random split might put most fraud cases in training and almost none in test. Stratified splits preserve the class ratio across all three buckets. scikit-learn's train_test_split(..., stratify=y) does this.
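
It's worth checking that the ratio actually survives the split. A minimal sketch with a synthetic 1% positive class:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Roughly the fraud shape: ~1% positive class
X, y = make_classification(n_samples=10_000, weights=[0.99, 0.01], random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(y.mean(), y_train.mean(), y_test.mean())   # all three stay ~0.01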

Cross-validation

With small datasets, one validation set is noisy. K-fold cross-validation splits training data into $k$ folds, trains on $k-1$, validates on the held-out one, rotates. You get $k$ scores and average them. Standard default: $k=5$ or $10$.
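
In scikit-learn the whole rotation is one call. A minimal sketch on synthetic data:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=42)

# 5-fold CV: fit on 4 folds, score on the held-out fold, rotate, average
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5, scoring="roc_auc")
print(scores.mean(), scores.std())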

6. Features, training, and what "fitting" actually means

Once you've split, you build features from the training set (we'll do this in depth in Module 4). Then you fit a model. Fitting is just: pick model parameters that minimize a loss on the training data.

$$\theta^* = \arg\min_\theta \; \mathcal{L}(f_\theta(X_{\text{train}}), y_{\text{train}})$$

You already know how this optimization works β€” it's the gradient descent from Module 1. What's new is that we're now doing it in a structured pipeline with scikit-learn or PyTorch, not by hand.

The typical scikit-learn shape looks identical across almost every classical model:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)              # fit parameters

y_pred = model.predict_proba(X_test)[:, 1]
print(roc_auc_score(y_test, y_pred))     # honest score

That fit / predict interface is the whole API surface of classical ML. Every model (random forests, gradient boosting, SVMs) uses the same two methods. Learn it once, use it everywhere.
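
To make that concrete, here is the same script with a different estimator dropped in; the model choice and hyperparameters are arbitrary examples, and everything else (the split, the scoring) stays as above.

from sklearn.ensemble import RandomForestClassifier

# Reuses X_train, X_test, y_train, y_test and roc_auc_score from the snippet above
model = RandomForestClassifier(n_estimators=300, random_state=42)
model.fit(X_train, y_train)                    # same .fit

y_pred = model.predict_proba(X_test)[:, 1]     # same .predict_proba
print(roc_auc_score(y_test, y_pred))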

7. Evaluate β€” and choose the right metric

We'll spend a whole lesson on metrics later (Lesson 4). For now, the key idea: the metric you optimize is what you actually get. Pick one that reflects the decision you're driving.

| Problem | Common metric | Why |
| --- | --- | --- |
| Regression | RMSE, MAE, R² | Measures numeric distance from truth |
| Balanced classification | Accuracy, F1 | Simple and interpretable when classes are even |
| Imbalanced classification | ROC AUC, PR AUC, recall@k | Accuracy is useless when 99% of labels are one class |
| Ranking / recommendation | NDCG, MAP, hit@k | Care about ordering, not absolute score |
| Forecasting | MAPE, sMAPE, pinball loss | Relative errors, or quantile-aware |
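
A quick illustration of the imbalanced case: on a 99/1 label split, a model that always predicts the majority class scores about 99% accuracy but only chance-level ROC AUC. A minimal sketch:

import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.01).astype(int)   # ~1% positive labels

# A useless "model": always predicts the negative class, with zero confidence scores
y_pred_label = np.zeros_like(y_true)
y_pred_score = np.zeros(len(y_true))

print(accuracy_score(y_true, y_pred_label))   # ~0.99, looks great
print(roc_auc_score(y_true, y_pred_score))    # 0.5, reveals it learned nothing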

8. Iterate: the real shape of the work

In a tutorial, the cells run top to bottom. In practice, you spend most of your time in this loop:

  1. Look at errors: which rows did the model miss?
  2. Form a hypothesis: "it seems to miss when X is high" or "the recent data is underrepresented."
  3. Fix the thing: new feature, more data, different model, different target encoding.
  4. Re-train, re-evaluate, compare to previous version.
  5. Repeat until marginal gains get boring.

Keep a lightweight log of every experiment: what changed, what score, what you learned. MLflow or just a markdown file both work. If you've built a runs/jobs table before, it's the same mental model.
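
If you go the plain-file route, the log can literally be a dict appended to a CSV after each run; the fields and values below are hypothetical placeholders.

import csv
import datetime

# Hypothetical run record -- keep whatever you actually compare between experiments
run = {
    "when": datetime.datetime.now().isoformat(timespec="seconds"),
    "change": "added 30-day rolling spend feature",
    "val_auc": 0.0,   # fill in the real validation score
    "notes": "what you learned from this run",
}

with open("experiments.csv", "a", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=run.keys())
    if f.tell() == 0:          # write the header only when the file is new
        writer.writeheader()
    writer.writerow(run)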

9. Ship and monitor (a preview)

Deployment is Module 7, but the seed goes in here. The moment you ship, the data keeps changing while your model stays frozen, and you become responsible for noticing the gap.

Every production ML system needs monitoring, scheduled retraining, and a rollback plan. It's engineering you already know how to do; the ML-specific part is knowing what to measure: input distributions, prediction distributions, and real-world outcomes when you have them.
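
"Input distributions" can start very simply: compare live feature summaries against a training-time snapshot and alert when they move. Production setups use proper tests (PSI, Kolmogorov-Smirnov), but a crude sketch of the idea:

import numpy as np

def drift_flag(train_values, live_values, threshold=3.0):
    """Crude check: flag when the live mean sits more than `threshold`
    standard errors away from the training mean."""
    train_values = np.asarray(train_values, dtype=float)
    live_values = np.asarray(live_values, dtype=float)
    se = train_values.std() / np.sqrt(len(live_values)) + 1e-12
    return abs(live_values.mean() - train_values.mean()) / se > threshold

# Example: the live feature has shifted upward relative to training
rng = np.random.default_rng(0)
print(drift_flag(rng.normal(0, 1, 10_000), rng.normal(0.5, 1, 1_000)))   # True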

Your dbt intuition transfers

Think of a trained model as a materialized artifact, like a dbt mart. It has inputs (features), an artifact (weights), tests (metrics), and a schedule (retraining). Most MLOps is just applied data engineering with the added constraint that the artifact changes behavior when you rebuild it.

10. What this module covers

By the end of this module, "I need to predict X from Y" becomes a problem you can scaffold in 20 minutes.

11. Before you move on