Day 1 - Introduction to Apache Kafka
I started reading the book Kafka: The Definitive Guide. Here are my notes from today's reading.
Chapter 1: Meet Kafka
So, it’s all about data. Every application is data-driven, and every piece of data has some story to tell.
The less effort we spend on moving the data around, the more we can focus on the core business at hand. This is why the pipeline is a critical component in the data-driven enterprise.
How we move the data becomes nearly as important as the data itself.
“Any time scientists disagree, it’s because we have insufficient data. Then we can agree on what kind of data to get; we get the data; and the data solves the problem. Either I’m right, or you’re right, or we’re both wrong. And we move on.” — Neil deGrasse Tyson
What is Pub/Sub Messaging?
- Publisher sends messages.
- Subscriber subscribes to certain classes of messages.
- Publisher is not directly targeting the subscriber.
- In large organizations, multiple pub/sub systems are used for queuing data.
- A centralized system to publish generic types of data is often lacking — Kafka solves this.
Kafka: Distributed Streaming Platform
- Kafka stores data durably and in order.
- Data can be read deterministically.
- Kafka acts as a central data pipeline.
Messages and Batches
- Unit of data = message.
- An array of bytes (Kafka doesn’t interpret format).
- Messages can have a key (optional metadata).
- Used for controlled message distribution across partitions.
- Key-based partitioning:
- Use a consistent hash of the key.
- Partition = hash(key) % total_partitions.
- Ensures all messages with the same key go to the same partition.
- Works only if the partition count remains fixed (see the sketch after this list).
- Messages are written in batches.
- A batch = collection of messages for the same topic and partition.
- Batching improves throughput but increases latency.
- Trade-off: larger batches move more messages per unit of time, but each individual message takes longer to propagate.
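To make the key-based routing above concrete, here is a minimal, self-contained Java sketch of the hash(key) % total_partitions idea. One hedge: Kafka’s actual default partitioner hashes the serialized key bytes with murmur2; the generic array hash, key, and partition counts below are just for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class KeyPartitioningSketch {
    // Toy stand-in for hash(key) % total_partitions. Kafka's real default
    // partitioner uses murmur2 over the serialized key bytes; a plain
    // array hash is used here only to show the idea.
    static int partitionFor(String key, int totalPartitions) {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        int hash = Arrays.hashCode(keyBytes);
        // Mask off the sign bit so the modulo result is never negative.
        return (hash & 0x7fffffff) % totalPartitions;
    }

    public static void main(String[] args) {
        // The same key always maps to the same partition...
        System.out.println(partitionFor("customer-42", 6));
        System.out.println(partitionFor("customer-42", 6));
        // ...but only while the partition count stays fixed:
        System.out.println(partitionFor("customer-42", 8));
    }
}
```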
Schemas
- Schema = data structure definition (e.g., JSON, XML, Avro).
- Apache Avro:
- Serialization framework.
- Supports schema evolution (forward + backward compatibility).
- Ideal for Kafka use cases.
- A consistent data format allows producers and consumers to be decoupled.
- Schema Registry is often used to manage shared schemas.
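As a small illustration of what an Avro schema looks like, here is a sketch that parses a hypothetical Customer schema with the standard Avro Java library (org.apache.avro:avro on the classpath is assumed, and the record and field names are made up). The default on the email field is what allows older records, written before the field existed, to still be read with this newer schema.

```java
import org.apache.avro.Schema;

public class AvroSchemaSketch {
    public static void main(String[] args) {
        // A hypothetical "Customer" schema (Java 15+ text block).
        // The "email" field has a default, which enables backward-compatible
        // schema evolution: old records without it can still be read.
        String schemaJson = """
            {
              "type": "record",
              "name": "Customer",
              "fields": [
                {"name": "id",    "type": "long"},
                {"name": "name",  "type": "string"},
                {"name": "email", "type": ["null", "string"], "default": null}
              ]
            }
            """;
        Schema schema = new Schema.Parser().parse(schemaJson);
        System.out.println(schema.getFullName()); // Customer
        System.out.println(schema.getFields());
    }
}
```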
Topics
- Messages are grouped into topics.
- Think of a topic like a database table.
- Topics are split into partitions.
- Each partition = append-only log.
- Messages within a partition are ordered.
- A topic can have many partitions:
- Ensures horizontal scalability and high throughput.
- No global ordering across partitions.
- Partitions can be replicated:
- For fault tolerance.
- One partition = multiple replicas on different brokers.
- A stream = the full sequence of records in a topic, viewed as a whole regardless of partition count.
- This is the logical view used by stream-processing frameworks such as Kafka Streams.
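The partition and replication settings above are chosen when a topic is created. Here is a minimal sketch using the Java AdminClient from org.apache.kafka:kafka-clients; the broker address, topic name, and counts are placeholders (a replication factor of 3 assumes at least 3 brokers).

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import java.util.List;
import java.util.Properties;

public class CreateTopicSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Assumed broker address; point this at your own cluster.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Hypothetical "page-views" topic: 6 partitions for horizontal
            // scalability, replication factor 3 so each partition has
            // replicas on three different brokers for fault tolerance.
            NewTopic topic = new NewTopic("page-views", 6, (short) 3);
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```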
Kafka Clients
Producers
- Producers write messages to Kafka topics.
- Kafka client APIs:
- Basic producers (write code manually).
- Kafka Connect (ETL-style connectors).
- Default behavior: evenly distribute messages across partitions.
- Key-based targeting:
- Producer uses the key to determine partition via hashing.
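A minimal producer sketch with the standard Java client, showing both behaviors from the list above: keyed sends for per-key ordering and null-key sends for even distribution. The broker address and the page-views topic are assumed placeholders.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;

public class ProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // With a key, the producer hashes it to pick the partition, so
            // all events for "customer-42" land on one partition, in order.
            producer.send(new ProducerRecord<>("page-views", "customer-42", "/pricing"));
            // With a null key, messages are spread evenly across partitions.
            producer.send(new ProducerRecord<>("page-views", null, "/home"));
        } // close() flushes any batched messages before returning
    }
}
```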
Consumers
- Consumers read messages from Kafka topics.
- Also called subscribers or readers.
- Read order is preserved within a partition.
- Consumers track message offsets.
- Offset = a monotonically increasing integer that uniquely identifies each message within a partition.
- Each consumer commits the offset of the last message it has processed.
- This lets it resume from where it left off after a restart or failure.
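Here is a consumer sketch that reads from the same hypothetical page-views topic, prints each record’s partition and offset, and commits offsets manually. The broker address and group id are placeholders.

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;
import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class ConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker
        props.put("group.id", "page-view-readers");       // consumer group
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        props.put("enable.auto.commit", "false"); // commit offsets ourselves

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("page-views"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Offsets are per-partition and strictly increasing.
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
                // Committing the last processed offsets is what lets a
                // restarted consumer resume from where it left off.
                consumer.commitSync();
            }
        }
    }
}
```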
Consumer Groups
- Group of consumers collaboratively consuming a topic.
- Each partition is consumed by only one member in a group.
- This mapping is called ownership of the partition.
- If a consumer fails:
- Kafka reassigns its partitions to remaining consumers in the group.
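Partition ownership changing hands during such a rebalance can be observed from the client side with a ConsumerRebalanceListener. A small sketch, reusing the hypothetical page-views topic:

```java
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import java.util.Collection;
import java.util.List;

public class RebalanceListenerSketch {
    // Attach a listener to watch partition ownership change hands.
    static void subscribeWithListener(KafkaConsumer<String, String> consumer) {
        consumer.subscribe(List.of("page-views"), new ConsumerRebalanceListener() {
            @Override
            public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                // Called when this consumer loses partitions, e.g. because
                // a new member joined the group or this one is leaving.
                System.out.println("Lost ownership of: " + partitions);
            }

            @Override
            public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                // Called when the group coordinator hands this consumer
                // its share of the topic's partitions.
                System.out.println("Now own: " + partitions);
            }
        });
    }
}
```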
Attribution: Kafka: The Definitive Guide by Gwen Shapira, Todd Palino, Rajini Sivaram, and Krit Petty (O’Reilly). Copyright 2021 Gwen Shapira, Todd Palino, Rajini Sivaram, and Krit Petty, 978-1-491-93616-0.