what’s kafka
Kafka is a distributed streaming platform. Think of it as a high-throughput, fault-tolerant message queue on steroids. It’s designed for handling real-time data feeds.
Concepts
Topic:
A category or feed name to which records are published.
Partition:
A topic is divided into partitions, which are ordered, immutable sequences of records. Partitions enable parallelism and scalability.
Producer:
An application that publishes records to a Kafka topic.
Consumer:
An application that subscribes to one or more topics and processes the records.
Broker:
A Kafka server. Brokers store the data.
Cluster:
A group of brokers working together.
Replica:
Each partition can be replicated across multiple brokers for fault tolerance.
Leader:
One replica of a partition is designated as the leader, handling all read and write requests.
Follower:
Other replicas of a partition are followers, replicating data from the leader.
Offset:
A unique, sequential ID assigned to each record within a partition. Consumers track their position in a partition using offsets. (A short producer sketch after this list shows topics, partitions, keys, and offsets together.)
Consumer Group:
A group of consumers that work together to consume records from a topic. Each partition is assigned to one consumer within a group.
Retention Policy:
Defines how long Kafka retains records before deleting them.
ZooKeeper:
Used for managing and coordinating the Kafka cluster (though newer versions are moving away from ZooKeeper).
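To make these concepts concrete, here is a minimal sketch using the official Java client. The broker address and the topic name `events` are placeholder assumptions, not something prescribed above. Sending a keyed record returns the partition it landed in and its offset within that partition.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class HelloProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Records with the same key hash to the same partition,
            // so they stay ordered relative to each other.
            ProducerRecord<String, String> record =
                new ProducerRecord<>("events", "user-42", "page_view");
            RecordMetadata meta = producer.send(record).get(); // block for the ack
            System.out.printf("topic=%s partition=%d offset=%d%n",
                              meta.topic(), meta.partition(), meta.offset());
        }
    }
}
```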
Common Use Cases
Real-time data pipelines:
Ingesting and processing streams of data from various sources.
Log aggregation:
Collecting logs from multiple servers into a central location.
Stream processing:
Building real-time applications that analyze and react to data streams.
Event sourcing:
Storing a sequence of events that represent changes to an application’s state (see the keyed-producer sketch after this list).
Messaging:
Reliable, high-throughput messaging between applications.
Activity tracking:
Tracking user activity on a website or app in real time.
Commit log:
Used as a commit log for distributed databases.
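For event sourcing and commit-log use cases, the usual trick is to use the entity’s ID as the record key: all events for one entity then land in the same partition and replay in order. A hedged sketch, with a hypothetical `account-events` topic and account ID:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AccountEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String accountId = "acct-123"; // hypothetical entity ID used as the key
            // Same key => same partition => events for this account stay ordered.
            producer.send(new ProducerRecord<>("account-events", accountId,
                                               "{\"type\":\"Opened\"}"));
            producer.send(new ProducerRecord<>("account-events", accountId,
                                               "{\"type\":\"Deposited\",\"amount\":100}"));
            producer.flush();
        }
    }
}
```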
Kafka’s Role in System Design
Decoupling:
Kafka decouples producers and consumers, allowing them to evolve independently.
Scalability:
Kafka can handle massive amounts of data and scale horizontally by adding more brokers.
Reliability:
Replication and fault tolerance ensure that data is not lost.
Buffering:
Kafka acts as a buffer between producers and consumers, smoothing out spikes in traffic (see the batching sketch after this list).
Data integration:
Kafka can integrate data from various sources into a single platform.
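As one illustration of the buffering role, the producer batches records client-side before sending. A hedged sketch of the relevant settings; the values are illustrative, not recommendations:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;

public class BufferedProducerConfig {
    public static KafkaProducer<String, String> create() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        // Wait up to 20 ms to accumulate a batch instead of sending each
        // record immediately: trades a little latency for throughput.
        props.put("linger.ms", 20);
        // Target batch size per partition, in bytes.
        props.put("batch.size", 64 * 1024);
        // Total memory the producer may use to buffer unsent records;
        // this is what absorbs short traffic spikes on the client side.
        props.put("buffer.memory", 64L * 1024 * 1024);
        props.put("compression.type", "lz4"); // compress whole batches
        return new KafkaProducer<>(props);
    }
}
```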
about zookeeper
The key development is the move away from ZooKeeper with the introduction of KRaft.
What is KRaft (Kafka Raft) and the shift away from ZooKeeper?
KRaft is a consensus protocol that allows Kafka to manage its metadata internally, eliminating the need for an external ZooKeeper cluster.
It essentially integrates metadata management directly into Kafka itself.
Why the shift?
Simplified Operations:
Managing ZooKeeper adds complexity to Kafka deployments. Removing it streamlines operations.
Improved Scalability:
ZooKeeper can become a bottleneck in very large Kafka clusters. KRaft aims to improve scalability.
Unified Architecture:
A self-contained Kafka system is easier to understand and manage.
The timeline
The Kafka community has been progressively working towards making KRaft production-ready.
Kafka versions 3.x have seen increasing KRaft maturity, and future major releases, such as Kafka 4.0, are expected to fully remove the dependency on ZooKeeper.
Key benefits
- Simplified deployments.
- Enhanced scalability.
- Improved resilience.
In essence:
Kafka’s future is focused on becoming a more self-sufficient and easier-to-manage distributed system. KRaft is a major step in that direction.
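One practical consequence: client applications talk to brokers the same way whether metadata lives in ZooKeeper or in a KRaft quorum. A hedged sketch using the Java AdminClient to inspect a running cluster (the broker address is a placeholder):

```java
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.DescribeClusterResult;
import org.apache.kafka.common.Node;

public class ClusterInfo {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            DescribeClusterResult cluster = admin.describeCluster();
            System.out.println("cluster id: " + cluster.clusterId().get());
            // Works identically on ZooKeeper-based and KRaft-based clusters.
            for (Node node : cluster.nodes().get()) {
                System.out.printf("broker %d at %s:%d%n",
                                  node.id(), node.host(), node.port());
            }
        }
    }
}
```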
diagrams
architecture diagram
```mermaid
flowchart TD
    classDef producer fill:#92D050,color:#000,stroke:#92D050
    classDef broker fill:#0072C6,color:#fff,stroke:#0072C6
    classDef consumer fill:#B4A0FF,color:#000,stroke:#B4A0FF
    classDef zk fill:#FFC000,color:#000,stroke:#FFC000

    subgraph Producers["Producers"]
        P1[Producer 1]:::producer
        P2[Producer 2]:::producer
    end

    subgraph Brokers["Kafka Cluster"]
        B1["Broker 1<br/>Leader"]:::broker
        B2["Broker 2<br/>Follower"]:::broker
        B3["Broker 3<br/>Follower"]:::broker
        subgraph Partitions["Topic Partitions"]
            TP1[P0]:::broker
            TP2[P1]:::broker
            TP3[P2]:::broker
        end
    end

    subgraph Consumers["Consumer Groups"]
        CG1[Group 1]:::consumer
        CG2[Group 2]:::consumer
    end

    ZK[ZooKeeper]:::zk

    P1 & P2 --> B1 & B2 & B3
    B1 & B2 & B3 --> CG1 & CG2
    ZK -.-> B1 & B2 & B3
    ZK -.-> CG1 & CG2

    %% Legend
    subgraph Legend["Legend"]
        L1[Producer]:::producer
        L2[Broker]:::broker
        L3[Consumer]:::consumer
        L4[ZooKeeper]:::zk
    end
```
Explanation of the architecture diagram
- Solid lines represent direct data flow between producers, brokers, and consumers
- Dotted lines show ZooKeeper’s coordination role (managing cluster state and consumer groups)
- Each broker can host multiple partitions (shown as P0, P1, P2)
- Consumer Groups allow multiple applications to consume the same topics independently (a topic-creation sketch follows this list)
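To tie the diagram to practice, here is a hedged sketch that creates a topic with three partitions, each replicated to three brokers; the topic name and broker address are placeholders:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions (P0, P1, P2), replication factor 3: each partition
            // gets one leader and two followers spread across the brokers.
            NewTopic topic = new NewTopic("events", 3, (short) 3);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```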
dataflow diagram
```mermaid
graph LR
    A[Producer] -->|Publish| B(Topic)
    B --> C{Partition}
    C --> D[Partition 1]
    C --> E[Partition 2]
    C --> F[Partition N]
    D --> G(Broker 1)
    E --> H(Broker 2)
    F --> I(Broker N)
    G --> J[Leader Replica]
    H --> K[Follower Replica]
    I --> L[Leader Replica]
    J --> M{Offset}
    K --> M
    L --> M
    N[Consumer Group] --> O[Consumer 1]
    N --> P[Consumer 2]
    N --> Q[Consumer N]
    O --> D
    P --> E
    Q --> F
    R[ZooKeeper] -- Manage --> G
    R -- Manage --> H
    R -- Manage --> I
    S[Retention Policy] --> B
    T[Data Source] --> A
    M --> O
    M --> P
    M --> Q

    subgraph Kafka Cluster
        G
        H
        I
        J
        K
        L
    end
    subgraph Topic and Partitions
        B
        C
        D
        E
        F
    end
    subgraph Consumer Group
        N
        O
        P
        Q
    end
```
Explanation of the dataflow diagram
Producer:
Publishes messages to a specific Topic.
Topic:
Is divided into multiple Partitions.
Partitions:
Are distributed across multiple Brokers.
Brokers:
Each Broker can have multiple Partitions, and each Partition has a Leader Replica and Follower Replicas.
Offset:
Each message within a Partition is assigned a unique Offset.
Consumer Group:
Consists of multiple Consumers.
Consumers:
Subscribe to a Topic and read messages from Partitions, tracking their position using Offsets.
ZooKeeper:
Manages the Kafka Cluster, coordinating Brokers and Leader election.
Retention Policy:
Determines how long messages are stored in the Topic.
Data Source:
Provides the data that the Producer sends to Kafka.
Connections
- The diagram illustrates the flow of data from Producers to Consumers through Topics and Partitions.
- It shows how Brokers and Replicas ensure fault tolerance.
- It highlights the role of Offsets in tracking message consumption (a consumer sketch follows this list).
- It shows the relationship between consumer groups and consumers.
- It shows that ZooKeeper manages the brokers.
- It shows that the retention policy applies to the topic.
- It shows that the data source connects to the producer.
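Here is a hedged consumer-side sketch showing how offsets track consumption. The group ID and topic name are placeholders, and offsets are committed manually to make the bookkeeping visible:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class OffsetTrackingConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("group.id", "analytics"); // placeholder consumer group
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("enable.auto.commit", "false"); // commit offsets ourselves
        props.put("auto.offset.reset", "earliest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("events"));
            while (true) {
                ConsumerRecords<String, String> records =
                    consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                                      record.partition(), record.offset(),
                                      record.value());
                }
                // Persist our position so a restart resumes here, not at 0.
                consumer.commitSync();
            }
        }
    }
}
```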
replication mechanism and leader election diagram
```mermaid
sequenceDiagram
    participant P as Producer
    participant L as Leader Broker
    participant F1 as Follower Broker 1
    participant F2 as Follower Broker 2
    participant ZK as ZooKeeper
    participant B3 as Broker 3

    Note over P,ZK: Normal Operation
    P->>+L: Send Message
    L->>L: Write to Log
    L->>-P: Acknowledge
    par Replication
        L->>F1: Replicate Message
        F1->>F1: Write to Log
        L->>F2: Replicate Message
        F2->>F2: Write to Log
    end

    Note over L,ZK: Leader Failure Detected
    ZK->>F1: Elect as New Leader
    P->>+F1: Send New Message
    F1->>F1: Write to Log
    F1->>-P: Acknowledge
    par Recovery
        F1->>B3: Replicate Message
        B3->>B3: Write to Log
    end
```
Explanation of the replication and leader election diagram
- Parallel lines show simultaneous replication to multiple followers
- When the leader fails, ZooKeeper elects a new leader from available followers
- The system maintains consistency even during leader transitions (a producer configuration sketch follows)
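The acknowledgement step in the diagram is configurable on the producer. A hedged sketch that waits for all in-sync replicas before treating a write as successful; the settings and names are illustrative:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class DurableProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        // "all": the leader waits until every in-sync follower has the record
        // before acknowledging, matching the replication step in the diagram.
        props.put("acks", "all");
        // Retries cover transient errors such as an in-flight leader election.
        props.put("retries", 5);
        props.put("enable.idempotence", "true"); // retries won't duplicate records

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("events", "key", "value")).get();
        }
    }
}
```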