Lesson 4.1: S3 as a Data Lake
You've used S3. This lesson is the short version of what changes when S3 is the data lake rather than just a bucket. The DE-specific patterns, not the basics.
The three jobs S3 does in a data platform
- Landing zone: raw data arrives here first. Compressed, partitioned by date, immutable.
- Staging for warehouse ingest: Snowflake's external stages point here; COPY INTO reads from it.
- Glue Catalog-backed tables: Athena/Glue/Spark query S3 directly via the Catalog.
The folder conventions that matter
A good data-lake layout for a utility:
s3://utility-prod-lake/
├── raw/
│   ├── ami_reads/                  # Bronze: immutable, as-received
│   │   └── read_date=2026-04-22/
│   │       └── part-0000.parquet
│   ├── cis_accounts/
│   └── oms_outages/
├── curated/                        # Silver: cleaned, conformed schema
│   └── meter_reads/
└── marts/                          # Gold: business-ready aggregates
    └── monthly_consumption/
Hive-style partitions (key=value)
Use read_date=2026-04-22/, not 20260422/. The key=value convention is called Hive-style partitioning. Glue crawlers and Athena map the key to a partition column, Snowflake external tables can derive one from the path, and all of them can then prune by partition instead of scanning every file.
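To make the convention concrete, here is a minimal sketch of building and reading Hive-style keys in plain Python. The helper names (`partition_key`, `parse_partitions`, `prune`) are hypothetical and not part of any AWS library; the point is only that the partition value is recoverable from the path, which is what lets an engine skip files without opening them.

```python
from datetime import date
from typing import Dict, Iterable, List


def partition_key(layer: str, dataset: str, read_date: date, filename: str) -> str:
    """Build a key like raw/ami_reads/read_date=2026-04-22/part-0000.parquet."""
    return f"{layer}/{dataset}/read_date={read_date.isoformat()}/{filename}"


def parse_partitions(key: str) -> Dict[str, str]:
    """Collect every key=value path segment of an object key."""
    partitions: Dict[str, str] = {}
    for segment in key.split("/"):
        name, sep, value = segment.partition("=")
        if sep and name and value:
            partitions[name] = value
    return partitions


def prune(keys: Iterable[str], read_date: date) -> List[str]:
    """Keep only keys in the requested read_date partition, using the path alone."""
    wanted = read_date.isoformat()
    return [k for k in keys if parse_partitions(k).get("read_date") == wanted]
```

With a bare 20260422/ folder the same pruning would need out-of-band knowledge of what the folder name means; with key=value the column name travels with the path.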
File formats: ranking
| Format | Use for | Why |
|---|---|---|
| Parquet | Default for DE | Columnar, compressed, splittable. Works with every tool in this stack. |
| CSV (gzip) | Legacy, interop | Human-readable, but row-oriented, so it cannot be compressed or scanned column by column. Fine at small scale. |
| JSON / JSONL | Semi-structured logs, API dumps | Not columnar; Snowflake loads it as VARIANT. Land as JSONL, transform to Parquet in curated. |
| Iceberg / Delta | Advanced table formats | Not on JD. Worth knowing; see credits. |
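
As an illustration of the "land as JSONL" row, the following sketch serializes records to gzip-compressed JSONL bytes and reads them back, using only the standard library. The function names and the sample record fields are assumptions made for the example; the later JSONL-to-Parquet step normally relies on a third-party library and is not shown here.

```python
import gzip
import json
from typing import Iterable, List


def to_jsonl_gz(records: Iterable[dict]) -> bytes:
    """Serialize records as one JSON object per line, then gzip the result."""
    body = "".join(json.dumps(r, separators=(",", ":")) + "\n" for r in records)
    return gzip.compress(body.encode("utf-8"))


def from_jsonl_gz(payload: bytes) -> List[dict]:
    """Decompress a JSONL payload and parse each non-empty line."""
    text = gzip.decompress(payload).decode("utf-8")
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Hypothetical API dump: field names are illustrative only.
sample = [
    {"meter_id": "M-001", "read_ts": "2026-04-22T00:15:00Z", "kwh": 0.42},
    {"meter_id": "M-002", "read_ts": "2026-04-22T00:15:00Z", "kwh": 1.07},
]
payload = to_jsonl_gz(sample)
```

The resulting bytes would be written unchanged under raw/, which keeps the landing zone immutable and as-received.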
Lifecycle policies: set them once, save thousands
Every bucket touching DE should have lifecycle rules:
- raw/: transition to S3 Infrequent Access after 30 days, Glacier Deep Archive after 180.
- curated/, marts/: IA after 90, no archive (it's accessed).
- Delete previous object versions after 30 days (if versioning is on).
Without this, a utility with 5 years of meter reads is paying S3 Standard rates for data nobody queries anymore.
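The rules above could be expressed roughly as follows, in the dictionary shape that boto3's `put_bucket_lifecycle_configuration` accepts. This is a sketch under assumptions: the rule IDs are made up, the empty prefix on the last rule is one way to cover the whole bucket, and `apply_lifecycle` is a hypothetical helper that needs boto3 and AWS credentials to run.

```python
from typing import Dict, List


def lifecycle_config() -> Dict[str, List[dict]]:
    """Build lifecycle rules for the raw/, curated/ and marts/ prefixes."""
    rules: List[dict] = [
        {
            "ID": "raw-tiering",  # illustrative rule name
            "Filter": {"Prefix": "raw/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 180, "StorageClass": "DEEP_ARCHIVE"},
            ],
        }
    ]
    # curated/ and marts/ move to Infrequent Access but are never archived.
    for prefix in ("curated/", "marts/"):
        rules.append(
            {
                "ID": f"{prefix.rstrip('/')}-tiering",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [{"Days": 90, "StorageClass": "STANDARD_IA"}],
            }
        )
    # Only has an effect when bucket versioning is enabled.
    rules.append(
        {
            "ID": "expire-noncurrent-versions",
            "Filter": {"Prefix": ""},
            "Status": "Enabled",
            "NoncurrentVersionExpiration": {"NoncurrentDays": 30},
        }
    )
    return {"Rules": rules}


def apply_lifecycle(bucket: str) -> None:
    """Push the rules to a bucket; imports boto3 lazily so the dict builds without it."""
    import boto3

    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=lifecycle_config()
    )
```

Keeping the rules in code rather than clicking them into the console makes it easier to apply the same policy to every bucket that touches DE.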
Access patterns & IAM: preview
Separate paths, separate IAM. The ingest Lambda writes raw/*. The Glue ETL job reads raw/* and writes curated/*. Analysts read marts/*. Three IAM roles, three scoped policies. More in Lesson 4.4.
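A minimal sketch of those three scoped policies, generated as IAM policy documents in Python. The role names and helper functions are hypothetical, and the bucket name is taken from the layout example. Real reader roles usually also need `s3:ListBucket` on the bucket itself (typically restricted with an `s3:prefix` condition), which is left out here to keep the example short.

```python
import json
from typing import Dict, List

BUCKET = "utility-prod-lake"  # bucket name from the layout example


def object_statement(actions: List[str], prefix: str) -> dict:
    """Allow the given object-level actions under a single prefix."""
    return {
        "Effect": "Allow",
        "Action": list(actions),
        "Resource": f"arn:aws:s3:::{BUCKET}/{prefix}*",
    }


def policy(*statements: dict) -> dict:
    """Wrap statements in a standard IAM policy document."""
    return {"Version": "2012-10-17", "Statement": list(statements)}


# One policy per role, each limited to the prefixes that role needs.
POLICIES: Dict[str, dict] = {
    "ingest-lambda": policy(object_statement(["s3:PutObject"], "raw/")),
    "glue-etl": policy(
        object_statement(["s3:GetObject"], "raw/"),
        object_statement(["s3:PutObject"], "curated/"),
    ),
    "analyst": policy(object_statement(["s3:GetObject"], "marts/")),
}

if __name__ == "__main__":
    print(json.dumps(POLICIES["glue-etl"], indent=2))
```

Because the prefixes line up with the bronze/silver/gold layers, a role's blast radius is visible from its Resource lines alone.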
Interview tell
If asked "how would you organize a data lake?", lead with partitioning scheme, file format, lifecycle policy, and IAM scope per prefix. Those four answers in order demonstrate mature DE thinking.