Lesson 8.1 – The Streaming Mental Model
One lesson that covers the concepts shared by Kafka, Kinesis, and every other log-based system you'll touch. Nail this and the next two lessons become "here's the API" rather than "here's the concept."
The log abstraction
A streaming system is an append-only log. Producers append records. Consumers read records in order and track their position (offset). That's it – everything else is plumbing.
```
Producer ──▶ ┌───────────────────────────────────┐
             │ 0 · 1 · 2 · 3 · 4 · 5 · 6 · 7 · 8 │  Log (append-only)
             └───────────────────────────────────┘
                     ▲               ▲
                     │               │
                consumer A       consumer B
                at offset 3      at offset 7
```
- Records are immutable – you never edit position 4. You append position 9.
- Each consumer independently tracks its offset. Consumer A reading old records doesn't affect Consumer B.
- Retention is time or size based, not per-consumer. Once data ages out, all consumers lose it.
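The whole abstraction fits in a few lines. A minimal in-memory sketch (illustrative only, not a real broker; the `Log` and `Consumer` names are made up for this example) shows the two invariants above: records are only ever appended, and each consumer owns its own offset.

```python
class Log:
    """Append-only sequence of records."""
    def __init__(self):
        self._records = []              # never edited in place, only appended to

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1   # offset of the new record

    def read(self, offset):
        return self._records[offset]

    def end_offset(self):
        return len(self._records)


class Consumer:
    """Each consumer tracks its own offset; consumers never affect each other."""
    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self):
        if self.offset >= self.log.end_offset():
            return None                 # caught up, nothing new to read
        record = self.log.read(self.offset)
        self.offset += 1
        return record


log = Log()
for reading in ["m1:5.2", "m2:7.1", "m1:5.3"]:
    log.append(reading)

a, b = Consumer(log), Consumer(log)
a.poll(); a.poll()                      # A reads two records...
assert b.poll() == "m1:5.2"             # ...B still starts from offset 0
```

Note what is absent: no delete, no update, no per-consumer state on the log itself. That asymmetry is the whole model.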
Why this wins for DE
- Replay. A downstream consumer (say, a new analytics pipeline) can rewind and re-read history without re-running the producer.
- Decoupling. Producer and consumer scale independently. New consumer added? Zero impact on producer.
- Audit. The log is the source of truth. Downstream state can always be rebuilt from it.
Partitions (Kafka) / Shards (Kinesis)
A single log scales to the IO of one machine. To go bigger, split it into parts. Kafka calls them partitions; Kinesis calls them shards. Same concept.
```
Topic: meter_reads (3 partitions)
┌─────────────┬─────────────────────┐
│ partition 0 │ 0 · 1 · 2 · 3 · ... │  ← records for meter_id hash=0
├─────────────┼─────────────────────┤
│ partition 1 │ 0 · 1 · 2 · 3 · ... │  ← records for meter_id hash=1
├─────────────┼─────────────────────┤
│ partition 2 │ 0 · 1 · 2 · 3 · ... │  ← records for meter_id hash=2
└─────────────┴─────────────────────┘
```
Key consequences:
- Order is per-partition, not topic-wide. Within partition 0, reads are strictly ordered. Across partitions, there's no global order.
- Partition count = max consumer parallelism within a consumer group. If you have 3 partitions, at most 3 consumers in a group can read concurrently. The 4th sits idle.
- Choose a partition key that co-locates related records. For meter reads: `meter_id`, so all reads from one meter go to one partition and stay ordered.
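The key-to-partition mapping is just a stable hash modulo the partition count. A sketch (Kafka's default partitioner actually uses murmur2 on the key bytes; `md5` here is a stand-in chosen because Python's built-in `hash()` is salted per process and would not be stable):

```python
import hashlib

def partition_for(key, num_partitions):
    """Map a record key to a partition index via a stable hash."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Every record keyed by the same meter_id lands on the same partition,
# so reads from one meter stay in order.
p = partition_for("meter-001", 3)
assert all(partition_for("meter-001", 3) == p for _ in range(100))
```

The corollary: if you change the partition count, the modulo changes, and keys remap to different partitions. That is why growing a Kafka topic later disturbs ordering guarantees (see below).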
The single biggest streaming design decision
Partition count. Too few: scaling cap. Too many: rebalance storms, broker memory pressure, slow startup. Target: peak ingest rate ÷ per-partition throughput (Kafka ~10 MB/s, Kinesis 1 MB/s). Round up. Kafka is painful to increase later (order guarantees shift); Kinesis is easier with shard splits/merges.
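The sizing arithmetic is one line. A sketch with an assumed 1.5x headroom factor (the factor is a judgment call, not a standard; pick what matches your growth forecast):

```python
import math

def partition_count(peak_ingest_mb_s, per_partition_mb_s, headroom=1.5):
    """Peak ingest rate / per-partition throughput, with headroom, rounded up."""
    return math.ceil(peak_ingest_mb_s * headroom / per_partition_mb_s)

# 25 MB/s peak ingest:
assert partition_count(25, 10) == 4    # Kafka at ~10 MB/s per partition
assert partition_count(25, 1) == 38    # Kinesis at 1 MB/s per shard
```

Run the numbers at peak, not average: a smart-meter fleet reporting on the hour will hit the log in bursts.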
Consumer groups
Multiple instances of your consumer application share a group.id (Kafka) or application name (Kinesis). The broker/service assigns each partition/shard to exactly one member of the group. Scale by adding members – the assignment rebalances.
- A different group.id means an independent read position. Use different groups when two applications need to read the same log separately.
- Rebalances are free during normal scaling. Frequent rebalances (consumer deaths, slow processing) are a symptom to fix.
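The assignment invariant is easy to simulate. A round-robin sketch (Kafka's real assignors, range and cooperative-sticky, differ in the details but preserve the same invariant: each partition goes to exactly one group member):

```python
def assign(partitions, members):
    """Round-robin partition assignment within one consumer group."""
    out = {m: [] for m in members}
    for i, p in enumerate(partitions):
        out[members[i % len(members)]].append(p)
    return out

# 3 partitions, 4 consumers in the group: the 4th gets nothing to do.
groups = assign([0, 1, 2], ["c1", "c2", "c3", "c4"])
assert groups == {"c1": [0], "c2": [1], "c3": [2], "c4": []}
```

This is the concrete form of "partition count = max parallelism" from the previous section.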
Delivery guarantees – the conversation the interviewer will start
| Guarantee | Meaning | Cost |
|---|---|---|
| At-most-once | Might drop records; never duplicate | Simplest, fire-and-forget producers. Rare for DE. |
| At-least-once | Might duplicate; never lose | Default. You de-dup downstream. |
| Exactly-once | Records appear exactly once end-to-end | Extra protocol overhead. Kafka supports it via idempotent producer + transactions. Kinesis: not native, rebuild via consumer idempotency. |
Practical DE answer: "We run at-least-once and design the consumer to be idempotent – it de-dupes on a natural key (meter_id, read_ts) when writing to Snowflake." That's the industry-standard answer; don't bother with exactly-once unless the use case demands it.
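That idempotent-consumer pattern fits in a few lines. A sketch with an in-memory dict standing in for the warehouse table (in practice this would be a MERGE into Snowflake keyed on the same columns; the record shape here is illustrative):

```python
# Under at-least-once delivery the same record may arrive twice.
# De-dupe on a natural key so re-processing is a no-op.
sink = {}   # (meter_id, read_ts) -> record; stand-in for the warehouse table

def handle(record):
    key = (record["meter_id"], record["read_ts"])
    sink[key] = record   # a replayed record overwrites identical data: no dupes

batch = [
    {"meter_id": "m1", "read_ts": "2024-01-01T00:00Z", "kwh": 5.2},
    {"meter_id": "m1", "read_ts": "2024-01-01T00:00Z", "kwh": 5.2},  # replayed
]
for r in batch:
    handle(r)
assert len(sink) == 1    # the duplicate collapsed into one row
```

The key property: `handle` can be called any number of times with the same record and the sink ends up in the same state.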
Kafka vs Kinesis – the operational delta
| | Kafka (self-hosted or Confluent) | Kinesis |
|---|---|---|
| Ops | You run brokers, ZK/KRaft, Connect (or pay Confluent) | AWS-managed. No broker in your picture. |
| Partitions | Topics with partitions. Config-heavy. | Streams with shards. Simpler surface. |
| Ecosystem | Kafka Connect has hundreds of connectors | Native AWS integrations (Firehose, Lambda) |
| Throughput | ~10 MB/s per partition | 1 MB/s per shard write, 2 MB/s read (enhanced fan-out: per-consumer 2 MB/s) |
| Retention | Any length (days to forever) | 24h default, 365d max |
| Cost model | Infra ($$ fixed) + throughput | Per-shard-hour + per-record |
Which one for a utility's AMI?
- If the rest of the stack is AWS-heavy (which this JD implies) and you want "just make it work" → Kinesis Firehose to S3 or Snowpipe Streaming.
- If you want Kafka Connect's ecosystem (CDC from CIS, Debezium) or multi-cloud → Kafka (Confluent Cloud or MSK).
- Many utilities run both: Kafka for internal CDC + integrations, Kinesis as the AWS-native streaming lane.