Lesson 3.1 — pandas Fluency for Data Engineers
Analysts use pandas to explore data. DEs use it to move data — fast, with correct types, without blowing up memory. This lesson covers the patterns that matter in pipelines, skips the patterns that don't.
What we assume
You know Python. You've likely used pandas casually. What we're adding: the DE-specific conventions (dtype enforcement, chunked I/O, efficient bulk ops) that matter when the file is 2GB and you can't afford to guess.
The DE pandas playbook
1 · Read with explicit dtypes
Never trust pandas to infer. Explicit dtypes save memory, catch schema drift, and avoid silent int→float coercion bugs (an integer ID column with a single missing value quietly becomes float64 under inference).
import pandas as pd

dtypes = {
    "meter_id": "string",
    "premise_zip": "string",
    "consumption_kwh": "float32",  # float32 saves 50% vs float64; plenty for kWh
    "quality_flag": "category",    # enum-like; tiny memory footprint
}

df = pd.read_csv(
    "meter_reads.csv",
    dtype=dtypes,
    parse_dates=["read_ts"],
    low_memory=False,
)
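Explicit dtypes also make schema drift loud instead of silent. A minimal sketch, reusing the dtypes dict above (the RuntimeError wrapper is our own convention, not a pandas API):

try:
    df = pd.read_csv("meter_reads.csv", dtype=dtypes, parse_dates=["read_ts"])
except ValueError as exc:
    # a non-numeric value landing in consumption_kwh raises here
    # instead of silently loading the column as object dtype
    raise RuntimeError(f"schema drift in meter_reads.csv: {exc}") from exc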
2 · Chunk for anything bigger than ~1GB
A 10GB CSV won't fit in memory on most laptops. Use chunksize:
for chunk in pd.read_csv("huge.csv", dtype=dtypes, chunksize=100_000):
    transformed = transform(chunk)
    write_to_warehouse(transformed)
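transform() and write_to_warehouse() are placeholders. A sketch of what a chunked run can look like when you also need a cross-chunk aggregate (the staging paths and the per-chunk cleanup rule are assumptions):

total_kwh = 0.0
rows_seen = 0

for i, chunk in enumerate(pd.read_csv("huge.csv", dtype=dtypes, chunksize=100_000)):
    chunk = chunk.dropna(subset=["meter_id"])       # per-chunk cleanup
    total_kwh += chunk["consumption_kwh"].sum()     # aggregate survives across chunks
    rows_seen += len(chunk)
    chunk.to_parquet(f"staged/part_{i:05d}.parquet", index=False)

print(f"{rows_seen:,} rows staged, {total_kwh:,.0f} kWh total")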
3 · Use vectorized ops, not .iterrows()
Classic pitfall. .iterrows() is ~1000× slower than vectorized operations. If you're tempted to loop over rows in pandas, write it as a column expression instead.
# SLOW — don't do this
for i, row in df.iterrows():
    df.at[i, "kwh_flagged"] = (
        row["consumption_kwh"] * 1.1
        if row["quality_flag"] == "E"
        else row["consumption_kwh"]
    )

# FAST — 1000× quicker on a big frame
df["kwh_flagged"] = df["consumption_kwh"].where(
    df["quality_flag"] != "E",
    df["consumption_kwh"] * 1.1,
)
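The same pattern extends past a single condition. A sketch with numpy.select; the second flag code and its rule are assumptions, not part of the dataset above:

import numpy as np

conditions = [
    df["quality_flag"] == "E",   # estimated read: uplift by 10%
    df["quality_flag"] == "M",   # assumed flag for a missing read: zero it
]
choices = [
    df["consumption_kwh"] * 1.1,
    0.0,
]
df["kwh_flagged"] = np.select(conditions, choices, default=df["consumption_kwh"])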
4 · Know when to drop pandas for Polars / PyArrow
Pandas copies aggressively under the hood. For very large transforms (>10GB) or columnar I/O (Parquet), PyArrow or Polars are often 10× faster. The JD doesn't name them, but they're worth knowing. At minimum, use PyArrow-backed pandas (dtype_backend="pyarrow") when reading Parquet.
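A minimal sketch of the PyArrow-backed read (pandas ≥ 2.0 with pyarrow installed; the file name is a placeholder):

df = pd.read_parquet("meter_reads.parquet", dtype_backend="pyarrow")
print(df.dtypes)   # columns come back as Arrow-backed dtypes, e.g. string[pyarrow]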
5 · Write to Snowflake with write_pandas
Don't loop INSERTs. Use the connector's bulk write — it stages Parquet files, then runs COPY INTO:
from snowflake.connector.pandas_tools import write_pandas
from snowflake.connector import connect
conn = connect(
    user=...,
    password=...,
    account=...,
    warehouse="LEARN_WH",
    database="LEARN_DB",
    schema="UTILITY",
)
success, chunks, rows, _ = write_pandas(conn, df, "METER_READS", chunk_size=100_000)
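The return values are worth checking rather than discarding. A short follow-up using the results of the call above:

if not success:
    raise RuntimeError("write_pandas reported failure loading METER_READS")
print(f"Loaded {rows:,} rows in {chunks} chunk(s)")
conn.close()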
Memory math
A quick mental model that will serve you for a decade:
int64 = 8 bytes / value. 10M rows × 20 int columns = 1.6 GB.
float64 = 8 bytes / value. Same math.
float32 = 4 bytes / value. Cuts memory in half when precision doesn't require it (and for kWh, consumption, temperature, it doesn't).
string (modern pandas) = pointer + interned where possible. category is much smaller for low-cardinality strings.
datetime64[ns] = 8 bytes / value. Unavoidable but cheap.
Quick check: df.memory_usage(deep=True).sum() / 1e9 → GB.
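A sketch of checking the math and downcasting a frame that arrived with default dtypes (the column names reuse the meter-read example):

print(df.memory_usage(deep=True).sum() / 1e9, "GB")   # before

df["consumption_kwh"] = df["consumption_kwh"].astype("float32")
df["quality_flag"] = df["quality_flag"].astype("category")

print(df.memory_usage(deep=True).sum() / 1e9, "GB")   # after: noticeably smaller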
Key terms from this lesson
explicit dtypes · schema drift · chunksize · vectorized operations · .iterrows() · category dtype · dtype_backend="pyarrow" · write_pandas · memory_usage(deep=True)
Interview tell
If you can say "I use write_pandas for bulk inserts rather than looping INSERTs, and I enforce explicit dtypes on read to catch schema drift" — you've signaled you've done this for real. Two sentences, both demonstrably true after this lesson.