Lesson 8.1 – The Streaming Mental Model
One lesson that covers the concepts shared by Kafka, Kinesis, and every other log-based system you'll touch. Nail this and the next two lessons become "here's the API" rather than "here's the concept."
The log abstraction
A streaming system is an append-only log. Producers append records. Consumers read records in order and track their position (offset). That's it – everything else is plumbing.
```
Producer ──▶ ┌───────────────────────────────────┐
             │ 0 · 1 · 2 · 3 · 4 · 5 · 6 · 7 · 8 │  Log (append-only)
             └───────────────────────────────────┘
                     ▲               ▲
                     │               │
                consumer A       consumer B
                at offset 3      at offset 7
```
- Records are immutable – you never edit position 4. You append position 9.
- Each consumer independently tracks its offset. Consumer A reading old records doesn't affect Consumer B.
- Retention is time or size based, not per-consumer. Once data ages out, all consumers lose it.
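The whole abstraction fits in a few lines. A minimal in-memory sketch (illustrative only, not a real broker; the `Log` and `Consumer` names are made up for this example) shows the two invariants above: records are only ever appended, and each consumer owns its own offset.

```python
class Log:
    """Append-only sequence of records."""
    def __init__(self):
        self._records = []              # never edited in place, only appended to

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1   # offset of the new record

    def read(self, offset):
        return self._records[offset]

    def end_offset(self):
        return len(self._records)


class Consumer:
    """Each consumer tracks its own offset; consumers never affect each other."""
    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self):
        if self.offset >= self.log.end_offset():
            return None                 # caught up, nothing new to read
        record = self.log.read(self.offset)
        self.offset += 1
        return record


log = Log()
for reading in ["m1:5.2", "m2:7.1", "m1:5.3"]:
    log.append(reading)

a, b = Consumer(log), Consumer(log)
a.poll(); a.poll()                      # A reads two records...
assert b.poll() == "m1:5.2"             # ...B still starts from offset 0
```

Note what is absent: no delete, no update, no per-consumer state on the log itself. That asymmetry is the whole model.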
Why this wins for DE
- Replay. A downstream consumer (say, a new analytics pipeline) can rewind and re-read history without re-running the producer.
- Decoupling. Producer and consumer scale independently. New consumer added? Zero impact on producer.
- Audit. The log is the source of truth. Downstream state can always be rebuilt from it.
Partitions (Kafka) / Shards (Kinesis)
A single log scales to the IO of one machine. To go bigger, split it into parts. Kafka calls them partitions; Kinesis calls them shards. Same concept.
```
Topic: meter_reads (3 partitions)
┌─────────────┬─────────────────────┐
│ partition 0 │ 0 · 1 · 2 · 3 · ... │  ← records for meter_id hash=0
├─────────────┼─────────────────────┤
│ partition 1 │ 0 · 1 · 2 · 3 · ... │  ← records for meter_id hash=1
├─────────────┼─────────────────────┤
│ partition 2 │ 0 · 1 · 2 · 3 · ... │  ← records for meter_id hash=2
└─────────────┴─────────────────────┘
```
Key consequences:
- Order is per-partition, not topic-wide. Within partition 0, reads are strictly ordered. Across partitions, there's no global order.
- Partition count = max consumer parallelism within a consumer group. If you have 3 partitions, at most 3 consumers in a group can read concurrently. The 4th sits idle.
- Choose a partition key that co-locates related records. For meter reads: `meter_id`, so all reads from one meter go to one partition and stay ordered.
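The key-to-partition mapping is just a stable hash modulo the partition count. A sketch (Kafka's default partitioner actually uses murmur2 on the key bytes; `md5` here is a stand-in chosen because Python's built-in `hash()` is salted per process and would not be stable):

```python
import hashlib

def partition_for(key, num_partitions):
    """Map a record key to a partition index via a stable hash."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Every record keyed by the same meter_id lands on the same partition,
# so reads from one meter stay in order.
p = partition_for("meter-001", 3)
assert all(partition_for("meter-001", 3) == p for _ in range(100))
```

The corollary: if you change the partition count, the modulo changes, and keys remap to different partitions. That is why growing a Kafka topic later disturbs ordering guarantees (see below).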
The single biggest streaming design decision
Partition count. Too few: scaling cap. Too many: rebalance storms, broker memory pressure, slow startup. Target: peak ingest rate ÷ per-partition throughput (Kafka ~10 MB/s, Kinesis 1 MB/s). Round up. Kafka is painful to increase later (order guarantees shift); Kinesis is easier with shard splits/merges.
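The sizing arithmetic is one line. A sketch with an assumed 1.5x headroom factor (the factor is a judgment call, not a standard; pick what matches your growth forecast):

```python
import math

def partition_count(peak_ingest_mb_s, per_partition_mb_s, headroom=1.5):
    """Peak ingest rate / per-partition throughput, with headroom, rounded up."""
    return math.ceil(peak_ingest_mb_s * headroom / per_partition_mb_s)

# 25 MB/s peak ingest:
assert partition_count(25, 10) == 4    # Kafka at ~10 MB/s per partition
assert partition_count(25, 1) == 38    # Kinesis at 1 MB/s per shard
```

Run the numbers at peak, not average: a smart-meter fleet reporting on the hour will hit the log in bursts.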
Consumer groups
Multiple instances of your consumer application share a group.id (Kafka) or application name (Kinesis). The broker/service assigns each partition/shard to exactly one member of the group. Scale by adding members – the assignment rebalances.
- A different group.id means an independent read position. Use different groups when two applications need to read the same log separately.
- Rebalances are free during normal scaling. Frequent rebalances (consumer deaths, slow processing) are a symptom to fix.
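The assignment invariant is easy to simulate. A round-robin sketch (Kafka's real assignors, range and cooperative-sticky, differ in the details but preserve the same invariant: each partition goes to exactly one group member):

```python
def assign(partitions, members):
    """Round-robin partition assignment within one consumer group."""
    out = {m: [] for m in members}
    for i, p in enumerate(partitions):
        out[members[i % len(members)]].append(p)
    return out

# 3 partitions, 4 consumers in the group: the 4th gets nothing to do.
groups = assign([0, 1, 2], ["c1", "c2", "c3", "c4"])
assert groups == {"c1": [0], "c2": [1], "c3": [2], "c4": []}
```

This is the concrete form of "partition count = max parallelism" from the previous section.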
Delivery guarantees – the conversation the interviewer will start
| Guarantee | Meaning | Cost |
|---|---|---|
| At-most-once | Might drop records; never duplicate | Simplest, fire-and-forget producers. Rare for DE. |
| At-least-once | Might duplicate; never lose | Default. You de-dup downstream. |
| Exactly-once | Records appear exactly once end-to-end | Extra protocol overhead. Kafka supports it via idempotent producer + transactions. Kinesis: not native, rebuild via consumer idempotency. |
Practical DE answer: "We run at-least-once and design the consumer to be idempotent – it de-dupes on a natural key (meter_id, read_ts) when writing to Snowflake." That's the industry-standard answer; don't bother with exactly-once unless the use case demands it.
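That idempotent-consumer pattern fits in a few lines. A sketch with an in-memory dict standing in for the warehouse table (in practice this would be a MERGE into Snowflake keyed on the same columns; the record shape here is illustrative):

```python
# Under at-least-once delivery the same record may arrive twice.
# De-dupe on a natural key so re-processing is a no-op.
sink = {}   # (meter_id, read_ts) -> record; stand-in for the warehouse table

def handle(record):
    key = (record["meter_id"], record["read_ts"])
    sink[key] = record   # a replayed record overwrites identical data: no dupes

batch = [
    {"meter_id": "m1", "read_ts": "2024-01-01T00:00Z", "kwh": 5.2},
    {"meter_id": "m1", "read_ts": "2024-01-01T00:00Z", "kwh": 5.2},  # replayed
]
for r in batch:
    handle(r)
assert len(sink) == 1    # the duplicate collapsed into one row
```

The key property: `handle` can be called any number of times with the same record and the sink ends up in the same state.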
Kafka vs Kinesis – the operational delta
| | Kafka (self-hosted or Confluent) | Kinesis |
|---|---|---|
| Ops | You run brokers, ZK/KRaft, Connect (or pay Confluent) | AWS-managed. No broker in your picture. |
| Partitions | Topics with partitions. Config-heavy. | Streams with shards. Simpler surface. |
| Ecosystem | Kafka Connect has hundreds of connectors | Native AWS integrations (Firehose, Lambda) |
| Throughput | ~10 MB/s per partition | 1 MB/s per shard write, 2 MB/s read (enhanced fan-out: per-consumer 2 MB/s) |
| Retention | Any length (days to forever) | 24h default, 365d max |
| Cost model | Infra ($$ fixed) + throughput | Per-shard-hour + per-record |
Which one for a utility's AMI?
- If the rest of the stack is AWS-heavy (which this JD implies) and you want "just make it work" → Kinesis Firehose to S3 or Snowpipe Streaming.
- If you want Kafka Connect's ecosystem (CDC from CIS, Debezium) or multi-cloud → Kafka (Confluent Cloud or MSK).
- Many utilities run both: Kafka for internal CDC + integrations, Kinesis as the AWS-native streaming lane.