Lesson 4.1 — S3 as a Data Lake

You've used S3. This lesson is the short version of what changes when S3 is the data lake rather than just a bucket: the data-engineering (DE) patterns, not the basics.

The three jobs S3 does in a data platform

Landing zone — raw data arrives here first. Compressed, partitioned by date, immutable.
Staging for warehouse ingest — Snowflake's external stages point here; COPY INTO reads from it.
Glue Catalog-backed tables — Athena/Glue/Spark query S3 directly via the Catalog.

The folder conventions that matter

A good data-lake layout for a utility:

s3://utility-prod-lake/
 ├─ raw/
 │   ├─ ami_reads/       # Bronze — immutable, as-received
 │   │   └─ read_date=2026-04-22/
 │   │       └─ part-0000.parquet
 │   ├─ cis_accounts/
 │   └─ oms_outages/
 ├─ curated/              # Silver — cleaned, conformed schema
 │   └─ meter_reads/
 └─ marts/                # Gold — business-ready aggregates
     └─ monthly_consumption/

Hive-style partitions (`key=value`)

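Use read_date=2026-04-22/, not 20260422/. The key=value convention is called Hive-style partitioning. Glue crawlers and Athena (via MSCK REPAIR TABLE) recognize it and register each directory as a partition; Snowflake external tables can derive partition columns from the file path. Once partitions are registered, a query that filters on read_date reads only the matching prefixes instead of scanning every file.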
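To make the convention concrete, here is a minimal Python sketch that builds and parses a Hive-style key. The helper names `partition_key` and `parse_partitions` are hypothetical, chosen for illustration; the table and file names come from the layout above.

```python
from datetime import date


def partition_key(prefix: str, table: str, read_date: date, filename: str) -> str:
    """Build a Hive-style object key with a read_date=YYYY-MM-DD partition."""
    return f"{prefix}/{table}/read_date={read_date.isoformat()}/{filename}"


def parse_partitions(key: str) -> dict[str, str]:
    """Extract key=value path segments from an object key."""
    parts = {}
    for segment in key.split("/")[:-1]:  # the last segment is the file name
        if "=" in segment:
            name, value = segment.split("=", 1)
            parts[name] = value
    return parts


key = partition_key("raw", "ami_reads", date(2026, 4, 22), "part-0000.parquet")
print(key)
print(parse_partitions(key))
```

The ISO date format keeps partitions in chronological order when listed lexicographically, which is one reason to prefer it over ad hoc date strings.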

File formats — ranking

Format	Use for	Why
Parquet	Default for DE	Columnar, compressed, splittable. Works with every tool in this stack.
CSV (gzip)	Legacy, interop	Human-readable, but row-oriented, so a query cannot read or compress individual columns, and a gzip file is not splittable. Fine at small scale.
JSON / JSONL	Semi-structured logs, API dumps	Not columnar; Snowflake handles as VARIANT. Land as JSONL, transform to Parquet in curated.
Iceberg / Delta	Advanced table formats	Not on JD. Worth knowing; see credits.

Lifecycle policies — set them once, save thousands

Every bucket touching DE should have lifecycle rules:

raw/: transition to S3 Standard-IA (Infrequent Access) after 30 days, then to S3 Glacier Deep Archive after 180 days.
curated/, marts/: Standard-IA after 90 days, no archive tier (the data is still queried).
Expire noncurrent object versions after 30 days (if versioning is on).

Without this, a utility with 5 years of meter reads is paying S3 Standard rates for data nobody queries anymore.
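The rules above could be expressed as a lifecycle configuration along these lines. This is a sketch, not a drop-in config: the rule IDs and the `transition_rule` helper are made up for illustration, and the bucket name is taken from the example layout. The dictionary follows the shape that boto3's `put_bucket_lifecycle_configuration` accepts; the call itself is left commented out because it needs boto3 and AWS credentials.

```python
import json


def transition_rule(rule_id, prefix, transitions):
    """Build one lifecycle rule that tiers objects under a prefix."""
    return {
        "ID": rule_id,
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            {"Days": days, "StorageClass": storage_class}
            for days, storage_class in transitions
        ],
    }


lifecycle = {
    "Rules": [
        transition_rule("raw-tiering", "raw/", [(30, "STANDARD_IA"), (180, "DEEP_ARCHIVE")]),
        transition_rule("curated-tiering", "curated/", [(90, "STANDARD_IA")]),
        transition_rule("marts-tiering", "marts/", [(90, "STANDARD_IA")]),
        {
            # Applies bucket-wide; only has an effect when versioning is on.
            "ID": "expire-noncurrent-versions",
            "Filter": {"Prefix": ""},
            "Status": "Enabled",
            "NoncurrentVersionExpiration": {"NoncurrentDays": 30},
        },
    ]
}

print(json.dumps(lifecycle, indent=2))

# Applying it (not run here; requires boto3 and credentials):
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="utility-prod-lake", LifecycleConfiguration=lifecycle
# )
```

In practice the same rules would more likely live in infrastructure-as-code than in a script, but the structure is the same.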

Access patterns & IAM — preview

Separate paths → separate IAM. The ingest Lambda writes raw/*. The Glue ETL job reads raw/* and writes curated/*. Analysts read marts/*. Three IAM roles, three scoped policies. More in Lesson 4.4.
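As a rough illustration of "three roles, three scoped policies", the sketch below generates one IAM policy document per role. The role names (`ingest-lambda`, `glue-etl`, `analyst`) and helper functions are hypothetical, and the policies are deliberately minimal: a real setup would typically also grant `s3:ListBucket` on the bucket with a prefix condition, which is omitted here.

```python
import json

BUCKET = "utility-prod-lake"  # bucket name from the example layout


def prefix_statement(actions, prefixes):
    """Allow the given S3 actions on objects under the given prefixes only."""
    return {
        "Effect": "Allow",
        "Action": actions,
        "Resource": [f"arn:aws:s3:::{BUCKET}/{prefix}/*" for prefix in prefixes],
    }


def policy(*statements):
    """Wrap statements in a standard IAM policy document."""
    return {"Version": "2012-10-17", "Statement": list(statements)}


policies = {
    # Ingest Lambda: write-only on raw/
    "ingest-lambda": policy(prefix_statement(["s3:PutObject"], ["raw"])),
    # Glue ETL job: read raw/, write curated/
    "glue-etl": policy(
        prefix_statement(["s3:GetObject"], ["raw"]),
        prefix_statement(["s3:PutObject"], ["curated"]),
    ),
    # Analysts: read-only on marts/
    "analyst": policy(prefix_statement(["s3:GetObject"], ["marts"])),
}

for role, document in policies.items():
    print(role)
    print(json.dumps(document, indent=2))
```

Because each policy names only its own prefixes, a bug or compromised credential in one stage cannot overwrite the immutable raw/ data or read data it has no business seeing.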

Interview tell

If asked "how would you organize a data lake?" — lead with partitioning scheme, file format, lifecycle policy, and IAM scope per prefix. Those four answers in order demonstrate mature DE thinking.
