Lesson 1.1 – OLTP vs OLAP, Columnar Warehouses, and Why DE Exists

You've written some form of ETL earlier in your career. You know what a database is. This lesson re-frames that knowledge in the vocabulary the target role expects, so you can walk into an interview and use the right words without needing to translate in your head.

The only diagram you need

 ┌──────────────────┐      ┌──────────────────┐      ┌──────────────────┐
 │  OLTP systems    │      │  Ingestion       │      │  OLAP warehouse  │
 │  (row-oriented)  │ ───▶ │  (E + L)         │ ───▶ │  (column-oriented│
 │                  │      │                  │      │   Snowflake)     │
 │  • CIS (billing) │      │  boto3/Matillion │      │                  │
 │  • OMS (outages) │      │  Snowpipe / Glue │      │  bronze → silver │
 │  • Meter reads   │      │                  │      │         → gold   │
 └──────────────────┘      └──────────────────┘      └──────────────────┘
          ▲                                                    │
          │ writes fast, reads narrow                          │ scans wide, aggregates
          │ transactions                                       │ no transactions
          │ row-oriented                                       │ columnar
          │                                                    ▼
          │                                             ┌──────────────┐
          │                                             │ Consumers    │
          │                                             │ • BI tools   │
          │                                             │ • ML models  │
          │                                             │ • APIs       │
          │                                             └──────────────┘

OLTP – the application database

OLTP stands for On-Line Transaction Processing. This is Postgres, MySQL, SQL Server – the database the app reads and writes. Think credit-card authorizations in a payments system, or meter-reading writes from a utility's AMI head-end.

Characteristics:

• Row-oriented storage: a whole row lives together on disk, so point lookups and single-row writes are cheap.
• Many small, concurrent transactions with ACID guarantees.
• Normalized schemas tuned for fast writes and narrow reads.

What you already know: Microsoft SQL Server, PostgreSQL, MySQL. These are all OLTP. You've tuned them. Skip the basics.
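That access pattern is easy to show with Python's built-in sqlite3: a short ACID transaction that touches one row by primary key. The table, IDs, and amounts below are invented for illustration, not from any real billing system:

```python
import sqlite3

# Hypothetical OLTP-style table; names and values are made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (account_id INTEGER PRIMARY KEY, balance REAL)"
)
conn.execute("INSERT INTO accounts VALUES (42, 100.0)")

# Typical OLTP work: a short transaction touching one row by key.
with conn:  # commits on success, rolls back on exception
    conn.execute(
        "UPDATE accounts SET balance = balance - 25.0 WHERE account_id = ?",
        (42,),
    )

# Narrow read: one row, fetched by primary key.
(balance,) = conn.execute(
    "SELECT balance FROM accounts WHERE account_id = 42"
).fetchone()
print(balance)  # 75.0
```

Note how little data each statement touches – one row in, one row out. That is the workload row-oriented engines are built for.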

OLAP – the warehouse

OLAP is On-Line Analytical Processing. This is Snowflake, BigQuery, Redshift, Databricks. Designed for answering questions like "what was average consumption in zip code 97124 in Q4 2025 by premise type?"

Characteristics:

• Column-oriented storage: each column lives together on disk, so a query touches only the columns it references.
• Optimized for wide scans and aggregations over billions of rows, not single-row transactions.
• Bulk, batch-style loads instead of many small concurrent writes.
• Excellent compression, because each column holds homogeneous values.

If you've used any on-prem columnar appliance: those MPP boxes are in the same category as Snowflake – different ops model (hardware you buy vs. a cloud service), same columnar-MPP principle. You already understand why MPP + columnar wins for analytics. The leap now is separation of compute and storage, which most on-prem appliances did not offer.
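To make the OLAP query shape concrete, here is a sketch using stdlib sqlite3. sqlite3 is itself row-oriented, so this only illustrates the shape of the question – filter, aggregate, group – not columnar execution; the schema and numbers are invented:

```python
import sqlite3

# Invented schema and data; only the query shape matters here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE meter_reads (premise_type TEXT, zip TEXT, consumption_kwh REAL)"
)
conn.executemany(
    "INSERT INTO meter_reads VALUES (?, ?, ?)",
    [
        ("residential", "97124", 30.0),
        ("residential", "97124", 50.0),
        ("commercial",  "97124", 200.0),
        ("residential", "90210", 999.0),  # filtered out by zip
    ],
)

# The OLAP question: average consumption in one zip code, by premise type.
rows = conn.execute(
    """
    SELECT premise_type, AVG(consumption_kwh) AS avg_kwh
    FROM meter_reads
    WHERE zip = '97124'
    GROUP BY premise_type
    ORDER BY premise_type
    """
).fetchall()
print(rows)  # [('commercial', 200.0), ('residential', 40.0)]
```

On a real warehouse this query would scan only the three referenced columns, however many columns the table has – which is the entire point of the next section.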

Why columnar wins for analytics – in one concrete example

Suppose a meter_reads table has 40 columns and 20 billion rows. An analyst asks: "What was the average consumption_kwh last Tuesday?"

Row-oriented path (OLTP)

To compute the average, the engine must read every row, and each row carries all 40 columns, so it reads roughly 40× more bytes than the query needs. Compression is poor because each row interleaves ints, strings, and timestamps.

Column-oriented path (OLAP)

The engine reads only the consumption_kwh column and read_timestamp (to filter to Tuesday). That's 2 of 40 columns. Compression is excellent because each column contains homogeneous values (all floats, or all timestamps). Net: ~10–50× less I/O, same answer.
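The arithmetic behind that claim fits in a few lines. The per-value byte size is an assumed round number, not a measured one:

```python
# Back-of-envelope I/O for the meter_reads example above.
n_rows = 20_000_000_000   # 20 billion meter reads
n_cols = 40
avg_col_bytes = 8         # assumed ~8 bytes per column value

# Row store: every row is read whole, all 40 columns.
row_store_bytes = n_rows * n_cols * avg_col_bytes

# Column store: only consumption_kwh and read_timestamp are read.
col_store_bytes = n_rows * 2 * avg_col_bytes

print(row_store_bytes / col_store_bytes)  # 20.0 -- before compression
# Columnar compression on homogeneous values typically widens this gap
# further, which is where the upper end of the 10-50x range comes from.
```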

So what is "Data Engineering"?

DE is the discipline of moving data from OLTP systems (and APIs, files, event streams) into a warehouse, transforming it along the way so analysts, ML models, and dashboards can ask the wide-read questions efficiently and reliably.

A pipeline from an earlier era – download monthly vendor data → transform → load into an on-prem columnar warehouse – is the same shape as a modern DE pipeline, just with very different tools:

Then (classic on-prem ETL)     Now (cloud-native DE)
--------------------------     ---------------------
FTP/SFTP download              S3 landing (or Kinesis/Kafka for streams)
Bash + Python scripts          Python + boto3, Glue, or Lambda
Appliance-specific COPY        Snowflake COPY INTO / Snowpipe
Custom SQL for transforms      Matillion / dbt / stored procs – all compile to SQL
Monthly cron                   Event-driven or scheduled; Airflow, Step Functions, or Jenkins
Scripts versioned in SCM       Same – Git, Jenkins, same CI/CD rigor
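The "same shape" claim can be sketched with nothing but the stdlib: extract a file, transform and validate its records, load them into a queryable store. Everything here is a toy stand-in – the inline CSV, column names, and target table are invented, and a production pipeline would substitute S3/boto3 for the extract and Snowflake COPY INTO for the load:

```python
import csv
import io
import sqlite3

# Extract: in real life, an SFTP download or an S3 landing bucket.
raw = "meter_id,kwh\nA1,10.5\nA2,bad\nA3,7.0\n"

# Transform: parse, validate, and shape the records.
rows = []
for rec in csv.DictReader(io.StringIO(raw)):
    try:
        rows.append((rec["meter_id"], float(rec["kwh"])))
    except ValueError:
        pass  # a real pipeline would quarantine bad records, not drop them

# Load: bulk insert into the warehouse (here, an in-memory stand-in).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE meter_reads (meter_id TEXT, kwh REAL)")
conn.executemany("INSERT INTO meter_reads VALUES (?, ?)", rows)

(count,) = conn.execute("SELECT COUNT(*) FROM meter_reads").fetchone()
print(count)  # 2 -- the 'bad' record was rejected in the transform step
```

Swap each step's implementation and you have the Now column of the table; the three-step shape does not change.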

Nothing here is foreign. The mental shift is from treating data movement as ops automation to treating it as a product that analysts and ML models consume downstream.

Key terms from this lesson

Links go to the glossary: OLTP, OLAP, row-oriented vs. column-oriented (columnar) storage, MPP, separation of compute and storage, bronze/silver/gold, Snowpipe, COPY INTO.

Quick self-check

Q1. You're asked to compute the average bill amount for 50 million customers for a year-end report. Is that OLTP or OLAP?


Q2. Why does a columnar store compress better than a row store?


Q3. A Snowflake warehouse (compute) can be resized or multiplied without touching the stored data. True or false?