Lesson 1.1: OLTP vs OLAP, Columnar Warehouses, and Why DE Exists
You've written some form of ETL earlier in your career. You know what a database is. This lesson re-frames that knowledge in the vocabulary the target role expects, so you can walk into an interview and use the right words without needing to translate in your head.
The only diagram you need
```
┌───────────────────┐      ┌───────────────────┐      ┌───────────────────┐
│  OLTP systems     │      │  Ingestion        │      │  OLAP warehouse   │
│  (row-oriented)   │ ───▶ │  (E + L)          │ ───▶ │  (column-oriented,│
│                   │      │                   │      │   Snowflake)      │
│  • CIS (billing)  │      │ boto3 / Matillion │      │                   │
│  • OMS (outages)  │      │ Snowpipe / Glue   │      │  bronze → silver  │
│  • Meter reads    │      │                   │      │     → gold        │
└───────────────────┘      └───────────────────┘      └───────────────────┘
          ▲                                                     │
          │ writes fast, reads narrow                           │ scans wide, aggregates
          │ transactions                                        │ no transactions
          │ row-oriented                                        │ columnar
          │                                                     ▼
          │                                              ┌──────────────┐
          │                                              │  Consumers   │
          │                                              │  • BI tools  │
          │                                              │  • ML models │
          │                                              │  • APIs      │
          │                                              └──────────────┘
```
OLTP – the application database
OLTP stands for On-Line Transaction Processing. This is Postgres, MySQL, SQL Server – the database the app reads and writes. Think credit-card authorizations in a payments system, or meter-reading writes from a utility's AMI head-end.
Characteristics (a short SQL sketch follows the list):
- Row-oriented storage. A customer's row is stored contiguously. Reading that one customer is one disk seek. Updating one cell is cheap.
- Heavy indexing. B-tree indexes on the columns you filter and join by. Fast point lookups.
- ACID transactions. If you transfer $100 from A to B, both writes happen or neither does.
- Narrow reads. Typical query: "give me this one customer's latest bill" – returns 1 row, a few columns.
- Throughput target: thousands of tiny transactions/sec.
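Here is a minimal sketch of both patterns as the SQL you'd run against such a system (table and column names are hypothetical):

```sql
-- Narrow point read: hits a B-tree index, touches one row.
SELECT bill_id, amount_due, due_date
FROM bills
WHERE customer_id = 42
ORDER BY billed_at DESC
LIMIT 1;

-- ACID transfer: both updates commit, or neither does.
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE account_id = 'A';
UPDATE accounts SET balance = balance + 100 WHERE account_id = 'B';
COMMIT;
```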
What you already know: Microsoft SQL Server, PostgreSQL, MySQL. These are all OLTP. You've tuned them. Skip the basics.
OLAP – the warehouse
OLAP is On-Line Analytical Processing. This is Snowflake, BigQuery, Redshift, Databricks. Designed for answering questions like "what was average consumption in zip code 97124 in Q4 2025 by premise type?"
Characteristics (a SQL sketch of the wide-read pattern follows the list):
- Columnar storage. All values for one column are stored together (and compressed). When you SELECT AVG(consumption_kwh), you scan only that column, not the other 40.
- No indexes (mostly). Snowflake uses micro-partition pruning instead: the engine keeps min/max values per column per micro-partition and skips partitions that can't match.
- No transactions in the OLTP sense. You bulk-load; you don't UPDATE ... WHERE customer_id = 42 from an app.
- Wide reads. Typical query: "scan 2 billion meter reads, group by zip, compute averages." Returns hundreds or thousands of rows.
- Throughput target: a handful of very large queries/sec.
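Here's the consumption question from above as SQL – a sketch, with hypothetical column names (zip_code, premise_type, read_date):

```sql
-- Wide read: touches billions of rows but only the columns it needs.
-- The date predicate lets Snowflake prune micro-partitions whose
-- min/max ranges for read_date can't contain Q4 2025.
SELECT premise_type,
       AVG(consumption_kwh) AS avg_kwh
FROM meter_reads
WHERE zip_code = '97124'
  AND read_date BETWEEN '2025-10-01' AND '2025-12-31'
GROUP BY premise_type;
```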
If you've used an on-prem columnar appliance: those MPP boxes are the same category as Snowflake – different ops model (hardware you buy vs. a cloud service), same columnar-MPP principle. You already understand why MPP + columnar wins for analytics. The leap now is separation of compute and storage, which most on-prem appliances did not offer; the sketch below shows what that looks like in practice.
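In Snowflake that separation is literal: compute is a warehouse object you resize or multiply with DDL, and the stored data never moves. A sketch (the warehouse name is hypothetical; multi-cluster scaling requires an edition that supports it):

```sql
-- Scale the cluster up for a heavy job; storage is untouched.
ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'XLARGE';

-- Or scale out: let Snowflake add clusters under concurrent load.
ALTER WAREHOUSE analytics_wh SET MIN_CLUSTER_COUNT = 1 MAX_CLUSTER_COUNT = 4;
```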
Why columnar wins for analytics – in one concrete example
Suppose a meter_reads table has 40 columns and 20 billion rows. An analyst asks: "What was the average consumption_kwh last Tuesday?"
Row-oriented path (OLTP)
To compute the average, the engine must read every row, and each row carries all 40 columns – so it reads roughly 20× more bytes than the two columns the query actually uses (the measure plus the timestamp filter). Compression is also poor, because each row mixes ints, strings, and timestamps.
Column-oriented path (OLAP)
The engine reads only the consumption_kwh column and read_timestamp (to filter to Tuesday). That's 2 of 40 columns. Compression is excellent because each column contains homogeneous values (all floats, or all timestamps). Net: ~10–50× less I/O, same answer.
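Back-of-envelope, assuming ~8 bytes per value before compression (the byte size and the date are assumptions, not from the lesson):

```sql
-- Row path:    20e9 rows × 40 cols × 8 B ≈ 6.4 TB touched
-- Column path: 20e9 rows ×  2 cols × 8 B ≈ 320 GB touched (20× less),
--              before columnar compression widens the gap further.
SELECT AVG(consumption_kwh)
FROM meter_reads
WHERE read_timestamp >= '2025-10-28'     -- "last Tuesday" (hypothetical date)
  AND read_timestamp <  '2025-10-29';
```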
So what is "Data Engineering"?
DE is the discipline of moving data from OLTP systems (and APIs, files, event streams) into a warehouse, transforming it along the way so analysts, ML models, and dashboards can ask the wide-read questions efficiently and reliably.
A pipeline from an earlier era (download monthly vendor data → transform → load into an on-prem columnar warehouse) is the same shape as a modern DE pipeline, just with very different tools; a sketch of the modern load step follows the table:
| Then (classic on-prem ETL) | Now (cloud-native DE) |
|---|---|
| FTP/SFTP download | S3 landing (or Kinesis/Kafka for streams) |
| Bash + Python scripts | Python + boto3, Glue, or Lambda |
| Appliance-specific COPY | Snowflake COPY INTO / Snowpipe |
| Custom SQL for transforms | Matillion / dbt / stored procs – all compile to SQL |
| Monthly cron | Event-driven or scheduled; Airflow, Step Functions, or Jenkins |
| Scripts versioned in SCM | Same – Git, Jenkins, same CI/CD rigor |
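The modern load step from the table above, sketched as Snowflake SQL (the stage, path, and table names are hypothetical):

```sql
-- Bulk-load files landed in S3 (via an external stage) into a bronze table.
COPY INTO bronze.meter_reads_raw
FROM @landing_stage/meter_reads/2025-11/
FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
ON_ERROR = 'ABORT_STATEMENT';
```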
Nothing here is foreign. The mental shift is from treating data movement as ops automation to treating it as a product that analysts and ML consume downstream.
Key terms from this lesson
Links go to the glossary: OLTP, OLAP, columnar storage, micro-partition pruning, ACID, bronze/silver/gold, separation of compute and storage.
Quick self-check
Q1. You're asked to compute the average bill amount for 50 million customers for a year-end report. Is that OLTP or OLAP?
Q2. Why does a columnar store compress better than a row store?
Q3. A Snowflake warehouse (compute) can be resized or multiplied without touching the stored data. True or false?