Lesson 4.1: S3 as a Data Lake

You've used S3. This lesson is the short version of what changes when S3 is the data lake rather than just a bucket: the DE-specific patterns, not the basics.

The three jobs S3 does in a data platform

  1. Landing zone: raw data arrives here first. Compressed, partitioned by date, immutable.
  2. Staging for warehouse ingest: Snowflake's external stages point here; COPY INTO reads from it (sketched after this list).
  3. Glue Catalog-backed tables: Athena/Glue/Spark query S3 directly via the Catalog.
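
Here is job 2 end to end, as a minimal sketch using the Snowflake Python connector. The connection parameters, stage name, storage integration, and target table below are placeholders, not objects defined in this course:

  import snowflake.connector  # pip install snowflake-connector-python

  conn = snowflake.connector.connect(
      account="myorg-myaccount",  # placeholder
      user="LOADER",              # placeholder
      password="...",             # use a secrets manager in practice
      warehouse="LOAD_WH",
      database="UTILITY",
      schema="RAW",
  )
  cur = conn.cursor()

  # External stage pointing at the raw landing prefix.
  cur.execute("""
      CREATE STAGE IF NOT EXISTS ami_reads_stage
        URL = 's3://utility-prod-lake/raw/ami_reads/'
        STORAGE_INTEGRATION = UTILITY_S3_INT  -- assumed to exist
        FILE_FORMAT = (TYPE = PARQUET)
  """)

  # COPY INTO reads straight from the stage. Snowflake remembers which
  # files it has already loaded, so re-running this is idempotent.
  cur.execute("""
      COPY INTO ami_reads_raw
      FROM @ami_reads_stage
      MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
  """)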

The folder conventions that matter

A good data-lake layout for a utility:

s3://utility-prod-lake/
 ├─ raw/
 │   ├─ ami_reads/         # Bronze: immutable, as-received
 │   │   └─ read_date=2026-04-22/
 │   │       └─ part-0000.parquet
 │   ├─ cis_accounts/
 │   └─ oms_outages/
 ├─ curated/               # Silver: cleaned, conformed schema
 │   └─ meter_reads/
 └─ marts/                 # Gold: business-ready aggregates
     └─ monthly_consumption/

Hive-style partitions (key=value)

Use read_date=2026-04-22/, not 20260422/. The key=value convention is called Hive-style partitioning; Athena, Glue, and Snowflake external tables all recognize it automatically and can prune by partition instead of scanning every file.
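
A minimal sketch of producing that layout with pandas and pyarrow (s3fs must be installed for the s3:// path to work; the column names and values are illustrative):

  import pandas as pd

  df = pd.DataFrame({
      "meter_id": ["M-1001", "M-1002"],
      "kwh": [12.4, 9.8],
      "read_date": ["2026-04-22", "2026-04-22"],
  })

  # partition_cols turns each distinct read_date value into a
  # read_date=YYYY-MM-DD/ prefix under the target path.
  df.to_parquet(
      "s3://utility-prod-lake/raw/ami_reads/",
      partition_cols=["read_date"],
      engine="pyarrow",
  )

Any engine that understands Hive-style partitions can now answer WHERE read_date = '2026-04-22' by reading only that one prefix.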

File formats, ranked

Format           Use for                          Why
Parquet          Default for DE                   Columnar, compressed, splittable. Works with every tool in this stack.
CSV (gzip)       Legacy, interop                  Readable, but row-oriented with no column-wise compression. Fine at small scale.
JSON / JSONL     Semi-structured logs, API dumps  Not columnar; Snowflake handles it as VARIANT. Land as JSONL, transform to Parquet in curated.
Iceberg / Delta  Advanced table formats           Not on the JD. Worth knowing; see credits.
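
The JSONL row above prescribes a land-then-convert hop. A sketch of that hop, again with pandas; the file name and the event_date partition column are hypothetical:

  import pandas as pd

  # JSONL lands as-received in raw/ (lines=True: one JSON object per line).
  events = pd.read_json(
      "s3://utility-prod-lake/raw/oms_outages/events.jsonl",
      lines=True,
  )

  # Rewrite as partitioned Parquet in curated/.
  events.to_parquet(
      "s3://utility-prod-lake/curated/oms_outages/",
      partition_cols=["event_date"],
      engine="pyarrow",
  )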

Lifecycle policies: set them once, save thousands

Every bucket touching DE should have lifecycle rules that step aging data down to cheaper storage classes; a sketch follows below.

Without this, a utility with 5 years of meter reads is paying S3 Standard rates for data nobody queries anymore.
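
A sketch of such rules via boto3. The day counts and storage classes are illustrative defaults, not numbers this course prescribes:

  import boto3

  s3 = boto3.client("s3")
  s3.put_bucket_lifecycle_configuration(
      Bucket="utility-prod-lake",
      LifecycleConfiguration={
          "Rules": [
              {   # cool the raw zone as it ages
                  "ID": "raw-to-cold",
                  "Status": "Enabled",
                  "Filter": {"Prefix": "raw/"},
                  "Transitions": [
                      {"Days": 90, "StorageClass": "STANDARD_IA"},
                      {"Days": 365, "StorageClass": "GLACIER"},
                  ],
              },
              {   # stop paying for abandoned multipart uploads
                  "ID": "expire-incomplete-uploads",
                  "Status": "Enabled",
                  "Filter": {"Prefix": ""},
                  "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
              },
          ]
      },
  )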

Access patterns & IAM: preview

Separate paths → separate IAM. The ingest Lambda writes raw/*. The Glue ETL job reads raw/* and writes curated/*. Analysts read marts/*. Three IAM roles, three scoped policies. More in Lesson 4.4.
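
A sketch of one of those three policies (the Glue ETL role's); the role, policy, and bucket names are assumed, and the other two policies follow the same shape with different prefixes:

  import json
  import boto3

  etl_policy = {
      "Version": "2012-10-17",
      "Statement": [
          {   # read the immutable landing zone
              "Effect": "Allow",
              "Action": ["s3:GetObject"],
              "Resource": "arn:aws:s3:::utility-prod-lake/raw/*",
          },
          {   # write only into curated/
              "Effect": "Allow",
              "Action": ["s3:PutObject"],
              "Resource": "arn:aws:s3:::utility-prod-lake/curated/*",
          },
          {   # list, but only under the prefixes this role touches
              "Effect": "Allow",
              "Action": ["s3:ListBucket"],
              "Resource": "arn:aws:s3:::utility-prod-lake",
              "Condition": {
                  "StringLike": {"s3:prefix": ["raw/*", "curated/*"]}
              },
          },
      ],
  }

  boto3.client("iam").put_role_policy(
      RoleName="glue-etl-role",        # placeholder
      PolicyName="lake-raw-to-curated",
      PolicyDocument=json.dumps(etl_policy),
  )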

Interview tell

If asked "how would you organize a data lake?", lead with partitioning scheme, file format, lifecycle policy, and IAM scope per prefix. Those four answers, in that order, demonstrate mature DE thinking.