Lesson 8.1: The Streaming Mental Model

One lesson that covers the concepts shared by Kafka, Kinesis, and every other log-based system you'll touch. Nail this and the next two lessons become "here's the API" rather than "here's the concept."

The log abstraction

A streaming system is an append-only log. Producers append records. Consumers read records in order and track their position (offset). That's it; everything else is plumbing.

 Producer ──▶ ┌──────────────────────────────────┐
              │ 0 · 1 · 2 · 3 · 4 · 5 · 6 · 7 · 8│  Log (append-only)
              └──────────────────────────────────┘
                            ▲               ▲
                            │               │
                       consumer A      consumer B
                       at offset 3     at offset 7

Why this wins for DE

- Replay: the log retains records for its retention window, so you can rewind an offset and reprocess; backfills and bug fixes don't require re-extracting from the source.
- Decoupling: producers and consumers scale, deploy, and fail independently; the log buffers bursts between them.
- Fan-out: any number of consumers can read the same log at their own pace without coordinating with each other.

Partitions (Kafka) / Shards (Kinesis)

A single log scales to the IO of one machine. To go bigger, split it into parts. Kafka calls them partitions; Kinesis calls them shards. Same concept.

 Topic: meter_reads (3 partitions)
 ┌──────────────────────────────────┐
 │ partition 0 │ 0 · 1 · 2 · 3 · ...│  ← records for meter_id hash=0
 ├──────────────────────────────────┤
 │ partition 1 │ 0 · 1 · 2 · 3 · ...│  ← records for meter_id hash=1
 ├──────────────────────────────────┤
 │ partition 2 │ 0 · 1 · 2 · 3 · ...│  ← records for meter_id hash=2
 └──────────────────────────────────┘
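Partition choice is a stable hash of the record key, which is what keeps all of one meter's reads in order. A sketch of the principle; Kafka's default partitioner actually uses murmur2, and `hash_partition` with md5 is purely illustrative:

```python
import hashlib

def hash_partition(key: str, num_partitions: int) -> int:
    # Stable hash of the key, modulo the partition count.
    # (Kafka's default partitioner uses murmur2; the principle is identical.)
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Every read from the same meter hashes to the same partition,
# so per-meter ordering is preserved.
p1 = hash_partition("meter-001", 3)
p2 = hash_partition("meter-001", 3)
assert p1 == p2
```

The flip side: change `num_partitions` and the same key can map to a different partition, which is exactly why growing a Kafka topic later disturbs ordering.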

Key consequences:

- Ordering is guaranteed only within a partition, never across the whole topic.
- The partition key (here, meter_id) decides placement: all records for one key land in one partition, so per-key order holds.
- Parallelism is capped by partition count: within a consumer group, at most one consumer reads a given partition.

The single biggest streaming design decision

Partition count. Too few and you cap your scaling; too many and you invite rebalance storms, broker memory pressure, and slow startup. Target: peak ingest rate ÷ per-partition throughput (Kafka ~10 MB/s, Kinesis 1 MB/s per shard), rounded up. Kafka partitions are painful to increase later (existing keys rehash to different partitions, shifting ordering guarantees); Kinesis is easier thanks to shard splits and merges.
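That target is just a ceiling division. A sketch with a headroom factor for growth and uneven keys; `partition_count` and the 1.5x default are a judgment call, not a standard formula:

```python
import math

def partition_count(peak_mb_per_s: float,
                    per_partition_mb_per_s: float,
                    headroom: float = 1.5) -> int:
    """Round up peak ingest / per-partition throughput,
    with headroom for growth and hot keys (1.5x is a judgment call)."""
    return math.ceil(peak_mb_per_s * headroom / per_partition_mb_per_s)

# 40 MB/s peak on Kafka (~10 MB/s per partition) -> 6 partitions
print(partition_count(40, 10))   # 6
# The same load on Kinesis (1 MB/s per shard write) -> 60 shards
print(partition_count(40, 1))    # 60
```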

Consumer groups

Multiple instances of your consumer application share a group.id (Kafka) or application name (Kinesis). The broker/service assigns each partition/shard to exactly one member of the group. Scale by adding members; the assignment rebalances.
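The assignment is easy to picture: each partition goes to exactly one member, spread as evenly as possible. A round-robin sketch; real rebalance protocols add heartbeats, generations, and sticky strategies, and `assign` is purely illustrative:

```python
def assign(partitions: list[int], members: list[str]) -> dict[str, list[int]]:
    """Round-robin assignment: each partition to exactly one group member."""
    assignment = {m: [] for m in members}
    for i, p in enumerate(partitions):
        assignment[members[i % len(members)]].append(p)
    return assignment

# 6 partitions, 2 consumers -> 3 partitions each.
print(assign(list(range(6)), ["c1", "c2"]))
# Add a third member and the same partitions rebalance to 2 each.
print(assign(list(range(6)), ["c1", "c2", "c3"]))
```

Note the ceiling: with more members than partitions, the extras sit idle; partition count is the parallelism limit.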

Delivery guarantees: the conversation the interviewer will start

- At-most-once: might drop records, never duplicates. Simplest; fire-and-forget producers. Rare for DE.
- At-least-once: might duplicate, never loses. The default; you de-dup downstream.
- Exactly-once: records appear exactly once end-to-end. Extra protocol overhead. Kafka supports it via the idempotent producer plus transactions; Kinesis has no native support, so you rebuild the effect via consumer idempotency.

Practical DE answer: "We run at-least-once and design the consumer to be idempotent: it de-dupes on a natural key (meter_id, read_ts) when writing to Snowflake." That's the industry-standard answer; don't bother with exactly-once unless the use case demands it.
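What "idempotent on a natural key" looks like in consumer code, sketched with an in-memory seen-set; in production the de-dup is typically a MERGE/upsert keyed on (meter_id, read_ts) in the warehouse, and `write_if_new` is an illustrative name:

```python
seen = set()  # in production: a MERGE/upsert keyed on (meter_id, read_ts)

def write_if_new(record: dict, sink: list) -> bool:
    """At-least-once delivery may replay records; keying the write on the
    natural key makes it idempotent, so replays are harmless."""
    key = (record["meter_id"], record["read_ts"])
    if key in seen:
        return False  # duplicate from a redelivery: skip
    seen.add(key)
    sink.append(record)
    return True

sink = []
r = {"meter_id": "m1", "read_ts": "2024-01-01T00:00Z", "kwh": 1.2}
write_if_new(r, sink)
write_if_new(r, sink)   # redelivered duplicate
print(len(sink))        # 1, the duplicate was dropped
```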

Kafka vs Kinesis: the operational delta

                Kafka (self-hosted or Confluent)            Kinesis
 Ops            You run brokers, ZK/KRaft, Connect          AWS-managed; no broker in your picture
                (or pay Confluent)
 Partitions     Topics with partitions; config-heavy        Streams with shards; simpler surface
 Ecosystem      Kafka Connect has hundreds of connectors    Native AWS integrations (Firehose, Lambda)
 Throughput     ~10 MB/s per partition                      1 MB/s write / 2 MB/s read per shard
                                                            (enhanced fan-out: 2 MB/s per consumer)
 Retention      Any length (days to forever)                24h default, 365d max
 Cost model     Infra (fixed $$) + throughput               Per-shard-hour + per-record

Which one for a utility's AMI?

For a utility already on AWS, Kinesis is usually the pragmatic pick: AMI ingest rates are predictable (so the shard math is easy), the ops burden is near zero, and Firehose can land reads in S3 without a custom consumer. Kafka earns its keep when you need long retention for replay, the Connect ecosystem, or a multi-cloud footprint.

Key terms from this lesson

- Kafka topic
- Partition (Kafka) / shard (Kinesis)
- Consumer group
- Kinesis Data Stream
- Kinesis Firehose