Lesson 3.1 — pandas Fluency for Data Engineers
Analysts use pandas to explore data. DEs use it to move data — fast, with correct types, without blowing up memory. This lesson covers the patterns that matter in pipelines, skips the patterns that don't.
What we assume
You know Python. You've likely used pandas casually. What we're adding: the DE-specific conventions (dtype enforcement, chunked I/O, efficient bulk ops) that matter when the file is 2GB and you can't afford to guess.
The DE pandas playbook
1 · Read with explicit dtypes
Never trust pandas to infer. Explicit dtypes save memory, catch schema drift, and avoid silent int→float coercion bugs (an integer ID column with a single missing value quietly becomes float64 under inference).
import pandas as pd

dtypes = {
    "meter_id": "string",
    "premise_zip": "string",
    "consumption_kwh": "float32",  # float32 saves 50% vs float64; plenty for kWh
    "quality_flag": "category",    # enum-like; tiny memory footprint
}

df = pd.read_csv(
    "meter_reads.csv",
    dtype=dtypes,
    parse_dates=["read_ts"],
    low_memory=False,
)
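Explicit dtypes also make schema drift loud instead of silent. A minimal sketch, reusing the dtypes dict above (the RuntimeError wrapper is our own convention, not a pandas API):

try:
    df = pd.read_csv("meter_reads.csv", dtype=dtypes, parse_dates=["read_ts"])
except ValueError as exc:
    # a non-numeric value landing in consumption_kwh raises here
    # instead of silently loading the column as object dtype
    raise RuntimeError(f"schema drift in meter_reads.csv: {exc}") from exc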
2 · Chunk for anything bigger than ~1GB
A 10GB CSV won't fit in memory on most laptops. Use chunksize:
for chunk in pd.read_csv("huge.csv", dtype=dtypes, chunksize=100_000):
    transformed = transform(chunk)
    write_to_warehouse(transformed)
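transform() and write_to_warehouse() are placeholders. A sketch of what a chunked run can look like when you also need a cross-chunk aggregate (the staging paths and the per-chunk cleanup rule are assumptions):

total_kwh = 0.0
rows_seen = 0

for i, chunk in enumerate(pd.read_csv("huge.csv", dtype=dtypes, chunksize=100_000)):
    chunk = chunk.dropna(subset=["meter_id"])       # per-chunk cleanup
    total_kwh += chunk["consumption_kwh"].sum()     # aggregate survives across chunks
    rows_seen += len(chunk)
    chunk.to_parquet(f"staged/part_{i:05d}.parquet", index=False)

print(f"{rows_seen:,} rows staged, {total_kwh:,.0f} kWh total")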
3 · Use vectorized ops, not .iterrows()
Classic pitfall. .iterrows() is ~1000× slower than vectorized operations. If you're tempted to loop over rows in pandas, write it as a column expression instead.
# SLOW — don't do this
for i, row in df.iterrows():
    df.at[i, "kwh_flagged"] = (
        row["consumption_kwh"] * 1.1
        if row["quality_flag"] == "E"
        else row["consumption_kwh"]
    )

# FAST — 1000× quicker on a big frame
df["kwh_flagged"] = df["consumption_kwh"].where(
    df["quality_flag"] != "E",
    df["consumption_kwh"] * 1.1,
)
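The same pattern extends past a single condition. A sketch with numpy.select; the second flag code and its rule are assumptions, not part of the dataset above:

import numpy as np

conditions = [
    df["quality_flag"] == "E",   # estimated read: uplift by 10%
    df["quality_flag"] == "M",   # assumed flag for a missing read: zero it
]
choices = [
    df["consumption_kwh"] * 1.1,
    0.0,
]
df["kwh_flagged"] = np.select(conditions, choices, default=df["consumption_kwh"])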
4 · Know when to drop pandas for Polars / PyArrow
Pandas copies aggressively under the hood. For very large transforms (>10GB) or columnar I/O (Parquet), PyArrow or Polars are often 10× faster. The JD doesn't name them, but they're worth knowing. At minimum, use PyArrow-backed pandas (dtype_backend="pyarrow") when reading Parquet.
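A minimal sketch of the PyArrow-backed read (pandas ≥ 2.0 with pyarrow installed; the file name is a placeholder):

df = pd.read_parquet("meter_reads.parquet", dtype_backend="pyarrow")
print(df.dtypes)   # columns come back as Arrow-backed dtypes, e.g. string[pyarrow]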
5 · Write to Snowflake with write_pandas
Don't loop INSERTs. Use the connector's bulk write — it stages Parquet files, then runs COPY INTO:
from snowflake.connector.pandas_tools import write_pandas
from snowflake.connector import connect
conn = connect(
    user=...,
    password=...,
    account=...,
    warehouse="LEARN_WH",
    database="LEARN_DB",
    schema="UTILITY",
)
success, chunks, rows, _ = write_pandas(conn, df, "METER_READS", chunk_size=100_000)
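The return values are worth checking rather than discarding. A short follow-up using the results of the call above:

if not success:
    raise RuntimeError("write_pandas reported failure loading METER_READS")
print(f"Loaded {rows:,} rows in {chunks} chunk(s)")
conn.close()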
Memory math
A quick mental model that will serve you for a decade:
int64 = 8 bytes / value. 10M rows × 20 int columns = 1.6 GB.
float64 = 8 bytes / value. Same math.
float32 = 4 bytes / value. Cuts memory in half when precision doesn't require it (and for kWh, consumption, temperature, it doesn't).
string (modern pandas) = pointer + interned where possible. category is much smaller for low-cardinality strings.
datetime64[ns] = 8 bytes / value. Unavoidable but cheap.
Quick check: df.memory_usage(deep=True).sum() / 1e9 → GB.
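A sketch of checking the math and downcasting a frame that arrived with default dtypes (the column names reuse the meter-read example):

print(df.memory_usage(deep=True).sum() / 1e9, "GB")   # before

df["consumption_kwh"] = df["consumption_kwh"].astype("float32")
df["quality_flag"] = df["quality_flag"].astype("category")

print(df.memory_usage(deep=True).sum() / 1e9, "GB")   # after: noticeably smaller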
Key terms from this lesson
explicit dtypes · schema drift · chunksize · vectorized operations · .iterrows() · category dtype · dtype_backend="pyarrow" · write_pandas · memory_usage(deep=True)
Interview tell
If you can say "I use write_pandas for bulk inserts rather than looping INSERTs, and I enforce explicit dtypes on read to catch schema drift" — you've signaled you've done this for real. Two sentences, both demonstrably true after this lesson.