Analyzing Apache Kafka Stretch Clusters: WAN Disruptions, Failure Scenarios, and DR Strategies

Key Takeaways

  • Apache Kafka Stretch cluster setup in an on-premise environment poses a high risk of service unavailability during WAN disruptions, often leading to split-brain or brain-dead scenarios and violating Service Level Agreements (SLAs) due to degraded Recovery Time Objective (RTO) and Recovery Point Objective (RPO).
  • Proactive monitoring of your Kafka environment to understand data skews across brokers is critical; an uneven data load on a single node can cause a stretch cluster failure.
  • An ungraceful broker shutdown during upgrades or due to poor security posture can make the Kafka service unavailable, resulting in a loss of data.
  • There are three popular and widely used Kafka disaster recovery strategies; each has challenges, complexities, and pitfalls.
  • Kafka MirrorMaker 2 replication lag (Δ) in a Disaster Recovery setup can cause message loss and inconsistent data.

 

Apache Kafka is a well-known distributed publish-subscribe system widely used across industries for use cases such as log analytics, observability, event-driven architectures, and real-time streaming. Its distributed nature allows it to serve as the backbone of modern streaming architectures, offering many integrations and supporting near-real-time and real-time latency requirements.

Today, Kafka is offered as a service by major vendors, including Confluent Platform and cloud offerings such as Confluent Cloud, Amazon Managed Streaming for Apache Kafka (MSK), Google Cloud Managed Service for Apache Kafka, and Azure Event Hubs for Apache Kafka. Kafka has been deployed at over one hundred thousand companies, including Fortune 100 enterprises, across industry segments such as finance, healthcare, automotive, retail, startups, and independent software vendors.

The value of data diminishes over time. Many businesses now compete on time and demand real-time data processing with millisecond responses to power mission-critical applications. Financial services, e-commerce, IoT, and other latency-sensitive industries require highly available and resilient architectures that can withstand infrastructure failures without impacting business continuity.

Companies operating across multiple geographic locations need solutions that provide seamless failover, data consistency, and disaster recovery (DR) capabilities while minimizing operational complexity and cost. One common architectural pattern is a single Kafka cluster spanning multiple data center locations over a WAN (Wide Area Network), known as a stretch cluster. Companies often consider this pattern for disaster recovery (DR) and high availability (HA).

A stretch cluster can theoretically provide data redundancy across regions and minimize data loss in the event of a region-wide failure. However, this approach comes with several challenges and trade-offs due to Kafka's dependency on network latency, consistency, and partition tolerance in a multi-region setup. In this article, we focus on the implications and considerations of the stretch cluster architectural pattern for DR and HA.

We conducted the following tests and documented the cluster behavior.

Environment Details

Regions: London, Frankfurt
Kafka distribution: Apache Kafka
Kafka version: 3.6.2
Zookeeper version: 3.8.2
Operating system: Linux

High-Level Environment Setup View

Below is the high-level view of a single stretch cluster setup that spans the London and Frankfurt regions, with:

  • Four Kafka brokers, where actual data resides.
  • Four Zookeeper nodes (a three-node setup would work as well, but to mimic the exact environment in both regions, we set up four Zookeeper nodes).
  • Zookeeper follows [N/2] + 1 rule for quorum establishment. In this case, the Zookeeper quorum will be [4/2] + 1 = 2 + 1 = 3.
  • Network latency between these regions is ~ 15ms.
  • Multiple producers (1 to n) and consumers (1 to n) are producing and consuming the data from this cluster from different regions of the world.

Kafka Broker Hardware Configuration

Hard disk size: 10 TB
Memory: 120 GB
CPU: 16 cores
Disk setup: RAID 0

Kafka Broker Configuration

  • Replication factor set to 4
  • Data retention (aka log retention) period is 7 days
  • Auto-creation of topics allowed
  • Minimum in-sync replicas (min.insync.replicas) set to 3
  • The number of partitions per topic is 10
  • Other configuration details (a producer-side sketch of these settings follows this list)
    • Acks = all
    • max.in.flight.requests.per.connection=1
    • compression.type=lz4
    • batch.size=16384
    • buffer.memory=67108864
    • auto.offset.reset=earliest
    • send.buffer.bytes=524288
    • receive.buffer.bytes=131072
    • num.replica.fetchers=4
    • num.network.threads=12
    • num.io.threads=32
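
As a rough illustration of the producer-side settings above (acks, in-flight requests, compression, batch size, buffers), the sketch below wires them into the Kafka Java client. This is a minimal sketch rather than the test harness used in this article; the bootstrap address, topic name, and class name are placeholders.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class StretchClusterProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "x.x.x.x:9092"); // placeholder broker address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Producer-side settings from the configuration list above
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, "1");
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, "16384");
        props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, "67108864");
        props.put(ProducerConfig.SEND_BUFFER_CONFIG, "524288");
        props.put(ProducerConfig.RECEIVE_BUFFER_CONFIG, "131072");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // "topic1" is a placeholder topic name
            producer.send(new ProducerRecord<>("topic1", "key", "value"));
            producer.flush();
        }
    }
}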

Producers and Consumers:

Producers:

In this use case, we have used a producer written in Java that consistently produces a defined number of transactions per second (TPS), which can be controlled through a properties file. The sample snippet below wraps the Kafka performance test tool (kafka-producer-perf-test).

******** Start of Producer code snippet *******

// Requires: java.io.*, java.text.SimpleDateFormat, java.util.Date, java.util.Properties, java.util.TimeZone
void performAction(int topicN) {
    try {
        // Load the producer test parameters (record count, size, throughput, index name)
        Properties prop = new Properties();
        try (InputStream input = new FileInputStream("config.properties")) {
            prop.load(input);
        }

        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS");
        sdf.setTimeZone(TimeZone.getTimeZone("America/New_York"));
        String startTime = sdf.format(new Date());
        System.out.println("Start time ---> " + startTime + " topic " + topicN);

        // Build the kafka-producer-perf-test command for this topic
        String command = "/home/opc/kafka/bin/kafka-producer-perf-test --producer-props bootstrap.servers=x.x.x.x:9092"
                + " --topic " + topicN
                + " --num-records " + prop.getProperty("numOfRecords")
                + " --record-size " + prop.getProperty("recordSize")
                + " --throughput " + prop.getProperty("throughputRate")
                + " --producer.config /home/opc/kafka/bin/prod.config";

        ProcessBuilder processBuilder = new ProcessBuilder();
        processBuilder.command("bash", "-c", command);
        Process process = processBuilder.start();

        // Read the perf-test output until the final summary line (the one containing "99.9th.")
        BufferedReader reader = new BufferedReader(new InputStreamReader(process.getInputStream()));
        String line = null;
        while ((line = reader.readLine()) != null) {
            if (line.contains("99.9th.")) {
                break;
            }
        }

        int exitVal = process.waitFor();
        if (exitVal == 0) {
            System.out.println("Success!");
            String endTime = sdf.format(new Date());
            // Send the summary line to OpenSearch for tracking purposes - optional step
            pushToIndex(line, startTime, endTime, prop.getProperty("indexname"));
        } else {
            System.out.println("Something unexpected happened!");
        }
    } catch (IOException e) {
        e.printStackTrace();
    } catch (Exception e) {
        e.printStackTrace();
    }
}

********** End of producer code snippet **************

****** Start of Producer properties file *****
numOfRecords=70000
recordSize=4000
throughputRate=20
topicstart=1
topicend=100
****** End of Producer properties file *****

Consumers:

In this use case, we have used consumers written in Java that consistently consume a defined number of transactions per second (TPS), which can be controlled through a properties file. The sample snippet below wraps the Kafka consumer performance test tool (kafka-consumer-perf-test).

***** Start of consumer code snippet *******
// Requires: java.io.*, java.text.SimpleDateFormat, java.util.Date, java.util.Properties, java.util.TimeZone
void performAction(int topicN) {
    try {
        // Load the consumer test parameters (message count, timeout, index name, offset mode)
        Properties prop = new Properties();
        try (InputStream input = new FileInputStream("consumer.properties")) {
            prop.load(input);
        }

        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS");
        sdf.setTimeZone(TimeZone.getTimeZone("America/New_York"));
        String startTime = sdf.format(new Date());
        System.out.println("Start time ---> " + startTime + " topic " + topicN);

        // Build the kafka-consumer-perf-test command; optionally read from the latest offsets
        String command = "/home/opc/kafka/bin/kafka-consumer-perf-test --broker-list=x.x.x.x:9092"
                + " --topic " + topicN
                + " --messages " + prop.getProperty("numOfRecords")
                + " --timeout " + prop.getProperty("timeout")
                + " --consumer.config /home/opc/kafka/bin/cons.config";
        if (!prop.getProperty("latest").equalsIgnoreCase("no")) {
            command += " --from-latest";
        }
        System.out.println("command is ---> " + command);

        ProcessBuilder processBuilder = new ProcessBuilder();
        processBuilder.command("bash", "-c", command);
        Process process = processBuilder.start();

        // Read the perf-test output until the statistics line (contains ':' and '-' but is not a warning)
        BufferedReader reader = new BufferedReader(new InputStreamReader(process.getInputStream()));
        String line = null;
        while ((line = reader.readLine()) != null) {
            if (line.contains(":") && line.contains("-") && !line.contains("WARNING:")) {
                break;
            }
        }

        int exitVal = process.waitFor();
        if (exitVal == 0) {
            System.out.println("Success!");
            // Send the statistics line to OpenSearch - optional step
            pushToIndex(line, prop.getProperty("indexname"), topicN);
        } else {
            System.out.println("Something unexpected happened!");
        }
    } catch (IOException e) {
        e.printStackTrace();
    } catch (Exception e) {
        e.printStackTrace();
    }
}

***** End of consumer code snippet *******

***** Start of Consumer properties snippet ****

numOfRecords=70000
topicstart=1
topicend=100
groupid=cgg1
reportingIntervalms=10000
fetchThreads=1
latest=no
processingthreads=1
timeout=60000

***** End of consumer properties snippet ****

CAP Theorem

Before diving into further discussions and scenario executions, it's crucial to understand the fundamental principles of distributed systems. This foundational knowledge will help you better correlate WAN disruptions with their impact on system behavior.

The CAP Theorem (i.e., Consistency, Availability, and Partition tolerance) applies to distributed systems such as Kafka, Hadoop, and Elasticsearch, where data is distributed across nodes to:

  • Enable parallel data processing
  • Maintain high availability for data and service
  • Support horizontal or vertical scaling requirements

Consistency (C)

Every read gets the latest write or an error.

Kafka follows eventual consistency; data ingested into one node eventually syncs across others. Reads happen from the leader, so replicas might lag behind.

Availability (A)

Every request gets a response, even if some nodes fail.

Kafka ensures high availability using replication factor, min.insync.replicas, and acks. As long as a leader is available, Kafka keeps running.
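
As an illustration of these settings, a topic matching the test environment (10 partitions, replication factor 4, min.insync.replicas=3) could be created with the Java AdminClient as in the minimal sketch below; the bootstrap address and topic name are placeholders.

import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "x.x.x.x:9092"); // placeholder broker address

        try (Admin admin = Admin.create(props)) {
            // 10 partitions and replication factor 4, as in the broker configuration above
            NewTopic topic = new NewTopic("topic1", 10, (short) 4)
                    .configs(Map.of("min.insync.replicas", "3"));
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}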

Partition Tolerance (P)

The system keeps working despite network issues.

Even if some nodes disconnect, Kafka elects a new leader and continues processing.

Kafka is an AP system; it prioritizes Availability and Partition Tolerance, while consistency is eventual and tunable via settings.

Now, let's get into the scenarios.

Stretch Cluster failed scenarios

Before going in-depth into scenarios, let's understand the key components and terminology of Kafka internals.

  • Brokers represent data nodes where all the data is distributed with replicas for data high availability.
  • The Controller is a critical component of Kafka that maintains the cluster's metadata, including information about topics, partitions, replicas, and brokers, and also performs administrative tasks. Think of it as the brain of a Kafka cluster.
  • A Zookeeper quorum is a group of nodes (usually deployed on dedicated nodes) that act as a coordination service helping to manage the Controller election, Broker Registration, Broker Discovery, Cluster membership, Partition leader election, Topic configuration, Access Control List (ACLs), consumer group management, and offset management.
  • Topics are logical constructs of data to which all related data is published. Examples include all the network logs published to the "Network_logs" topic and application logs published to the "Application_logs" topic.
  • Partitions are the fundamental units or building blocks supporting scalability and parallelism within a topic. When a topic is created, it can have one or more partitions where all the data is published in order.
  • Offsets are unique identifiers assigned to each message within a partition of a topic, essentially a sequential number that indicates the position of a message within the partition.
  • A producer is a client application that publishes records (messages) to Kafka topics.
  • A consumer is a client application that consumes records (messages) from Kafka topics.

Scenario 1: WAN Disruption – Active Controllers as '0' – AKA Brain-Dead Scenario

Scenario: WAN Disruption – Active controllers as '0'.
Scenario Description

As discussed in earlier sections, in a distributed network setup across geographic locations, network service interruptions can occur because no service guarantees 100% availability.

This scenario focuses on WAN disruption between two regions, causing Kafka service unavailability, leaving the cluster with zero active controllers and no self-recovery, even after hours of waiting.

Execution Steps
  1. Kafka cluster (all brokers and Zookeeper quorum) should be up and running with the configuration stated in the environment section
  2. Produce and consume events with the brokers in an equilibrium state (this step is optional, but it simulates a real use case)
  3. Disconnect/disrupt WAN service between regions
  4. Observe the cluster behavior through the broker host CLI and the ZK CLI. If you are using Confluent, you can check Confluent Control Center; for AWS's managed Kafka service, you can check CloudWatch; other offerings provide similar tooling. Check:
    • a. ZK quorum availability
    • b. Brokers availability as cluster members
    • c. Number of active controllers
    • d. Producers and consumers progress
    • e. Under replicated partitions, etc.
  5. Reestablish WAN service between regions and observe the cluster behavior.
Expected Behavior
  1. In a steady state, the Kafka cluster should be available with
    1. 4 brokers
    2. ZK quorum as 3
    3. Active controllers as 1
    4. Producers and consumers should produce and fetch events from topics
  2. Upon WAN service disruption,
    1. The Kafka cluster should not be available
    2. ZK quorum 3 will break
    3. Active controllers as 1 or zero
    4. Producers and consumers should be interrupted, creating back pressure or timeouts.
  3. Reestablish WAN service
    1. ZK quorum 3 should be reestablished
    2. Kafka cluster should be available with 4 brokers
    3. The active controller should be 1
    4. Producers and consumers should continue/resume
Actual/Observed Behavior

Kafka is unavailable, cannot recover on its own after the WAN service is disrupted and reestablished, and shows active controllers as '0':

  • Active brokers are reported as zero
  • The active controller count is zero
  • The ZK quorum of 3 broke during the disruption and re-formed when the WAN service resumed
  • All brokers showed up in ZK cluster membership once the WAN service resumed; only two brokers showed during the disruption
  • Producers and consumers were completely interrupted and timed out
Root Cause Analysis (RCA)

It's a "brain-dead" scenario. Upon WAN disruption, a single 4-node cluster becomes 2 independent clusters, each with 2 nodes.

  • The existing single Zookeeper quorum splits into two groups, one per site. Neither group elects a new controller, because each assumes a controller already exists elsewhere, resulting in zero controllers upon WAN reestablishment.
  • Imagine two cooks in a kitchen preparing a dish. Each cook assumes the other has added the salt, but in the end, no salt is added to the dish.
  • From the logs, we observed that the controller moved from one broker to another (ControllerMovedException), but no exit or hang was reported in the logs
Controller exceptions:
controller.log:org.apache.kafka.common.errors.ControllerMovedException: Controller moved to another broker. Aborting
controller.log:java.io.IOException: Client was shutdown before response was read
controller.log:java.lang.InterruptedException
  • From the ZooKeeper (ZK) CLI, it was also observed that no broker claims to be the controller; as a result, the cluster failed to respond properly to state changes, including topic or partition creation and broker failures.
ERROR Unexpected exception causing shutdown while sock still open (org.apache.zookeeper.server.quorum.LearnerHandler)
java.lang.Exception: shutdown Leader! reason: Not sufficient followers synced, only synced with sids: [ 2 ]

ERROR Exception while listening (org.apache.zookeeper.server.quorum.QuorumCnxManager)
References
  • WAN disruption can be simulated by modifying or deleting route tables.
  • ZK commands: refer to the ZK commands list below.
Solution/Remedy: Because no broker claims the controller role, all the brokers need to be restarted.

WAN Disruption:

WAN disruption can be done by deleting or modifying route tables.
Warning: Deleting a route table may disrupt all traffic in the associated subnets. Instead, consider modifying or detaching the route table.

Firewall Stop

You may also need to stop firewalls:
Firewall status: systemctl status firewalld
Firewall stop: service firewalld stop

ZK Commands list:

ZK Server start: bin/zookeeper-server-start etc/kafka/zookeeper.properties
ZK Server stop: zookeeper-server-stop
ZK Shell commands: bin/zookeeper-shell.sh <zookeeper host>:2181
Active controllers check: get /controller or ls /controller
Brokers list: /home/opc/kafka/bin/zookeeper-shell kafkabrokeraddress:2181 <<<"ls /brokers/ids"

Alternative to check controllers and brokers in a cluster. Available from kafka 3.7.x
bin/kafka-metadata-shell.sh --bootstrap-server <KAFKA_BROKER>:9092 --describe-cluster
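
As an alternative to the ZK shell, the active controller and broker membership can also be checked programmatically with the Java AdminClient. The following is a minimal sketch, assuming a reachable placeholder bootstrap address.

import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.DescribeClusterResult;
import org.apache.kafka.common.Node;

public class ClusterStateCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "x.x.x.x:9092"); // placeholder broker address

        try (Admin admin = Admin.create(props)) {
            DescribeClusterResult cluster = admin.describeCluster();
            // controller() may be empty or fail if no broker currently claims the controller role
            Node controller = cluster.controller().get();
            System.out.println("Active controller: " + controller);
            System.out.println("Registered brokers: " + cluster.nodes().get());
        }
    }
}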

Scenario 2: WAN Disruption - Active Controllers as '2' – AKA Split-Brain Scenario

Scenario: WAN Disruption - Active controllers as '2'
Scenario Description: As discussed in earlier sections, in a distributed network setup across geographic locations, network service interruptions can occur because no service guarantees 100% availability. This scenario focuses on WAN disruption between two regions, causing Kafka service unavailability with active controllers as '2' and no self-recovery, even after hours of waiting.
Execution Steps
  1. Kafka Cluster (all brokers and Zookeeper quorum) should be up and running with the configuration stated in the environment section
  2. Produce and consume events with brokers' equilibrium state (this step is optional)
  3. Disconnect/Disrupt WAN service between regions
  4. Observe the cluster behavior through the broker host CLI and the ZK CLI. If you are using Confluent, you can check Confluent Control Center; for AWS's managed Kafka service, you can check CloudWatch. Check:
    1. ZK quorum availability
    2. Brokers' availability as cluster members
    3. Number of active controllers
    4. Producers and consumers progress
    5. Under replicated partitions etc.
  5. Reestablish WAN service between regions and observe the cluster behavior.
Expected Behavior
  1. In a steady state, the Kafka cluster should be available with
    1. 4 brokers
    2. ZK quorum as 3
    3. Active controllers as 1
    4. Producers and consumers should produce and fetch events from topics
  2. Upon WAN service disruption,
    1. The Kafka cluster should not be available
    2. ZK quorum 3 will break
    3. Active controllers as 1 or zero
    4. Producers and consumers should be interrupted, creating back pressure or timeouts.
  3. Re-establish WAN service
    1. ZK quorum 3 should be reestablished
    2. Kafka cluster should be available with 4 brokers
    3. Active controller should be 1
    4. Producers and consumers should continue/resume
Actual/Observed Behavior

Kafka is not available, cannot recover on its own after the WAN service is disrupted and reestablished, and shows active controllers as '2':

  • Active brokers are reported as zero
  • The active controller count is two
  • The ZK quorum of 3 broke during the disruption and re-formed when the WAN service resumed
  • All brokers showed up in ZK cluster membership once the WAN service resumed; only two brokers showed during the disruption
  • Producers and consumers were completely interrupted and timed out with NETWORK_EXCEPTION errors
  • Producers got failed requests with NOT_LEADER_FOR_PARTITION exceptions
Root Cause Analysis (RCA)
  • This is a split-brain scenario. Upon WAN disruption, a single 4-node cluster becomes 2 independent clusters, each consisting of 2 nodes.
  • The existing single Zookeeper quorum also becomes two quorums, i.e., one for each site. As a result, each quorum elects its own controller, resulting in two controllers.
  • Upon WAN re-establishment, we see 2 controllers for the 4-node cluster. Think of it like two co-pilots flying one plane from different cockpits after radio loss: both think the other is gone and take over, and the conflicting actions crash the system.
  • From the ZooKeeper (ZK) CLI, it was also observed that two brokers claim to be the controller; as a result, the cluster failed to respond properly to state changes, including topic or partition creation and broker failures.
References
  • WAN disruption can be simulated by deleting route tables
  • ZK commands
Solution/Remedy: Restart both brokers that claim the controller role.

Scenario 3: Broker Disk Full, Resulting in a Broker Process Crash and No Active Controllers

Scenario: Broker disk full, resulting in the respective broker process crashing and '0' active controllers.
Scenario Description: Start from a steady Kafka cluster, meaning all brokers are up and running and both producers and consumers are producing and consuming events. One of the 4 brokers fills its disk due to data skew, i.e., an uneven or imbalanced distribution of data across brokers, leading to an abrupt broker process crash that cascades to the remaining brokers. This leaves the Kafka cluster unavailable with '0' active controllers.
Execution Steps
  1. Kafka Cluster (all brokers and Zookeeper quorum) should be up and running with the configuration stated in the environment section
  2. Produce and consume events with brokers' equilibrium state
  3. Fill one of the 4 brokers' disks, causing the respective broker process to crash
  4. Observe the cluster behavior through the broker host CLI or the ZK CLI. If you are using Confluent, you can check Confluent Control Center; for AWS's managed Kafka service, you can check CloudWatch. Check:
    1. ZK quorum availability
    2. Brokers availability as cluster members
    3. Number of active controllers
    4. Producers and consumers progress
    5. Under replicated partitions, etc.
Expected Behavior
  1. In a steady state, Kafka cluster should be available with
    1. 4 brokers
    2. ZK quorum as 3
    3. Active controllers as 1
    4. Producers and consumers should produce and fetch events from topics
  2. Upon one broker's disk full,
    1. Disk errors should be logged.
    2. Kafka cluster should not be available; only one broker should be shown as not available or out of cluster membership
    3. ZK quorum shouldn't fail
    4. Active controllers may briefly drop to zero (if the controller is on the failed broker), and the controller role should move to another available broker.
    5. The cluster should report under replicated partitions due to one broker unavailability.
    6. Producers and consumers should be interrupted, creating back pressure or timeouts.
Actual/Observed Behavior

Kafka was not available, could not recover on its own, and showed active controllers as '0':

  • Active brokers are reported as zero
  • The active controller number is zero
  • No impact on Zookeeper quorum
  • Producers and consumers were completely interrupted and timed out
Root Cause Analysis (RCA)

From the logs, we observed disk-related errors and exceptions:

logs/state-change.log:102362: ERROR [Broker id=1] Skipped the become-follower state change with correlation id 331 from controller 1 epoch 1 for partition topic1-9 (last update controller epoch 1) with leader 5 since the replica for the partition is offline due to disk error org.apache.kafka.common.errors.KafkaStorageException: Error while creating log for topic1-9 in dir /kafka/kafka-logs (state.change.logger)

Logs show that the controller moved from one broker to another (ControllerMovedException), but no exit or hang was reported in the logs.

Controller exceptions:
controller.log:org.apache.kafka.common.errors.ControllerMovedException: Controller moved to another broker. Aborting
controller.log:java.io.IOException: Client was shutdown before response was read
controller.log:java.lang.InterruptedException

The '0' active controllers behavior is equivalent to the brain-dead scenario explained in Scenario 1.

References
  • ZK active controller commands
  • ZK broker membership commands
Solution/Remedy
  • It is essential to monitor your Kafka cluster with proper metrics such as CPU, Disk, Memory, Java Heap, etc. Once the disk reaches 80% or more, you should scale up your cluster.
  • Because there are no controller claims by brokers, all the brokers need to be restarted.

Scenario 4: Corrupted indices upon Kafka process kill

Scenario: Corrupted indices upon Kafka process kill
Scenario Description

Killing the broker and ZK processes (an uncontrolled shutdown, comparable to an abrupt machine reboot) and then restarting the brokers leads to corrupted indices, and rebuilding the indices does not complete even after hours of waiting.

This scenario is comparable to an ungraceful shutdown during an environment upgrade or to a poor security posture, where an unauthorized person could terminate the service.

Execution Steps
  1. Kafka Cluster (all brokers and Zookeeper quorum) should be up and running with the configuration stated in the environment section
  2. Produce and consume events with brokers' equilibrium state
  3. Kill the broker and ZK processes, and restart the processes
  4. Observe the cluster behavior through the broker host CLI or the ZK CLI. If you are using Confluent, you can check Confluent Control Center; for AWS's managed Kafka service, you can check CloudWatch. Check:
    1. ZK quorum availability
    2. Brokers' availability as cluster members
    3. Number of active controllers
    4. Producers and consumers progress
    5. Under replicated partitions etc.
Expected Behavior
  1. In a steady state, the Kafka cluster should be available with
    1. 4 brokers
    2. ZK quorum is 3
    3. Active controllers as 1
    4. Producers and consumers should produce and fetch events from topics
  2. Kill Kafka processes (kill -9), i.e., both broker and ZK processes on all 4 nodes and restart processes
    1. ZK quorum should be formed
    2. Active controllers as '1'
    3. In-flight messages may be lost if not flushed to the disk
    4. May result in corrupted indices, but the broker rebuilds corrupted indices on its own without any manual work
    5. Upon rebuild of corrupted indices, the Kafka cluster should be available
    6. The cluster might show under replicated partitions due to corrupted indices
    7. Producers and consumers will get exceptions or errors and can't continue to perform their jobs
Actual/Observed Behavior

Kafka is not available and couldn't be recovered on its own.

  • Active brokers are '4'
  • Active controller number is '1'
  • No impact on Zookeeper quorum
  • Producers and consumers completely interrupted and timed out
Root Cause Analysis (RCA)

The logs show 'corrupted index' warnings for partitions, followed by 'rebuilding index files' messages. The rebuild did not complete and showed no progress in the logs, and the cluster remained unavailable even after an hour.

WARN [Log partition=topic1-13, dir=/kafka/new/kafka-logs] Found a corrupted index file corresponding to log file /kafka/new/kafka-logs/topic1-13/00000000000000000000.log due to Corrupt time index found, time index file (/kafka/new/kafka-logs/topic1-13/00000000000000000000.timeindex) has non-zero size but the last timestamp is 0 which is less than the first timestamp 1553657637686}, recovering segment and rebuilding index files... (kafka.log.Log)

Also, it is recommended to perform graceful shutdowns instead of killing processes or shutting down abruptly, so that the data in Kafka's local cache is flushed to disk, metadata is maintained, and offsets are committed.

References
  • Not applicable
Solution/Remedy: If Kafka fails to rebuild all the indices on its own, manually deleting all index files and restarting all brokers will fix the issue.

Conclusion on WAN disruptions failure scenarios

Deploying on-premise stretch clusters requires a deep understanding of failure scenarios, particularly during network disruptions. Unlike cloud environments, on-premises setups may lack highly available network infrastructure, fault tolerance mechanisms, or dedicated fiber lines, making them more susceptible to outages. In such cases, a Kafka cluster may experience downtime if network connectivity is disrupted.

Cloud providers, on the other hand, offer high-bandwidth, low-latency networking over fully redundant, dedicated metro fiber connections between regions and data centers. Their managed Kafka services are designed for high availability, typically recommending deployment across multiple zones or regions to withstand zonal failures. Additionally, these services incorporate zone-awareness features, ensuring Kafka data replication occurs across multiple zones, rather than being confined to a single one.

It is also important to understand and emphasize the latency impact on a stretch cluster deployed in two different geographic locations over a WAN connection. We cannot escape physics; in our case, as per WonderNetwork, the average latency between London and Frankfurt is 14.88 milliseconds for a distance of 636 km. So, for mission-critical applications, latency-driven testing of the stretch cluster setup is needed to determine how data is replicated and consumed across regions.

By understanding these differences, organizations can make informed decisions about whether an on-premise stretch cluster aligns with their availability and resilience requirements or if a cloud-native Kafka deployment is a more suitable alternative.

Disaster Recovery Strategies:

A Business Continuity Plan (BCP) is essential for any business to maintain its brand and reputation. Almost all companies define High Availability (HA) and Disaster Recovery (DR) strategies to meet their BCP. Below are the popular DR strategies for Kafka.

  • Active – Standby (aka Active-passive)
  • Active – Active
  • Backup and restore

Before we dive deeper into the DR strategies, note that we are using Kafka MirrorMaker 2.0 in the following architectural patterns. There are other components, such as Kafka Connect and Confluent Replicator, offered by Confluent and other cloud services, but they are out of scope for this article.

MirrorMaker 2 in Apache Kafka is a cross-cluster replication tool that uses the Kafka Connect framework to replicate topics between clusters, supporting active-active and active-passive setups. It provides automatic consumer group synchronization, offset translation, and failover support for disaster recovery and multi-region architectures.

Active Standby

This is a popular DR pattern in which two Kafka clusters of identical size and configuration are deployed in two different geographic regions or data centers, in this case the London and Frankfurt regions. The choice of geographic locations usually depends on factors such as separation across Earth's tectonic plates, company or country compliance and regulatory requirements, and more. Only one cluster is active at a given point in time, and it receives and serves the clients.

As shown in the following picture, Kafka cluster one (KC1) is located in the London region and Kafka cluster two (KC2) is located in the Frankfurt region with the same size and configuration.

All the producers are producing data to KC1 (London region cluster), and all consumers are consuming the data from the same cluster. KC2 (Frankfurt region cluster) receives the data from the KC1 cluster through Mirror Maker as shown in Figure 1.

Figure 1: Kafka Active Standby (aka Active Passive) cluster setup with MirrorMaker

In simple terms, we can treat the mirror maker as a combination of producer and consumer, where it consumes the topics from the source cluster KC1 (London region) and produces the data to the destination cluster KC2 (Frankfurt region).

During a disaster affecting the primary/active cluster KC1 (London region), the standby cluster (Frankfurt region) becomes active, and all producer and consumer traffic is redirected to the standby cluster. This can be achieved through DNS redirection via a load balancer in front of the Kafka clusters. Figure 2 shows this scenario.

Figure 2: Kafka Active Standby setup - Standby cluster becomes active

When the London region KC1 cluster becomes available again, we need to set up a MirrorMaker instance to consume the data from the Frankfurt region cluster and produce it back to the London region KC1 cluster. This scenario is shown in Figure 3.

It's up to the business to keep the Frankfurt region (KC2) active or promote the London region cluster (KC1) back to primary once the data is fully synced/hydrated to the current state. In some cases, the standby cluster, i.e., the Frankfurt region cluster (KC2), continues to function as the active cluster and the London region cluster (KC1) becomes the standby instead.

It is essential to note that RTO (Recovery Time Objective) and RPO (Recovery Point Objective) values are higher than in the Active-Active setup. The Active-Standby setup cost is medium to high, as we need to maintain identical Kafka clusters in both regions.

Figure 3: Kafka Active Standby setup – Standby cluster data syncing back to KC1

Concerns with the Active-Standby setup

Kafka Mirror Maker Replication Lag (Δ):

Consider a scenario, which might happen very rarely, where we lose the active cluster completely, including its nodes and disks. The active Kafka cluster (London region, KC1) has m messages, and MirrorMaker is consuming from it and producing to the standby cluster. Due to network latency and the consume-produce-acknowledge cycle, the standby cluster (Frankfurt region, KC2) lags behind and holds only n messages. The delta (\(\Delta\)) is therefore m minus n:

\(\Delta = m - n\)

where:

  • m = Messages in the Active Kafka Cluster (London)
  • n = Messages in the Standby Kafka Cluster (Frankfurt)

This replication lag (\(\Delta\)) has the following impact on mission-critical applications:

Potential Message Loss:
If a failover occurs before the Standby cluster catches up, some messages (\(\Delta\)) may be missing.

Inconsistent Data:
Consumers relying on the Standby cluster may process incomplete or outdated data.

Delayed Processing:
Applications depending on real-time replication may experience latency in data availability.
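
One way to keep an eye on \(\Delta\) is to compare the log-end offsets of a mirrored topic on both clusters. The sketch below is illustrative only: the bootstrap addresses are placeholders, and it assumes the mirrored topic carries the same name on both clusters (with default MirrorMaker 2 settings the replicated topic would carry a source-cluster prefix instead).

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult.ListOffsetsResultInfo;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.common.TopicPartition;

public class ReplicationLagSketch {

    // Sum of log-end offsets across all partitions of a topic on one cluster
    static long endOffsetSum(String bootstrap, String topic) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap);
        try (Admin admin = Admin.create(props)) {
            int partitions = admin.describeTopics(List.of(topic))
                    .allTopicNames().get().get(topic).partitions().size();
            Map<TopicPartition, OffsetSpec> request = new HashMap<>();
            for (int p = 0; p < partitions; p++) {
                request.put(new TopicPartition(topic, p), OffsetSpec.latest());
            }
            return admin.listOffsets(request).all().get().values().stream()
                    .mapToLong(ListOffsetsResultInfo::offset).sum();
        }
    }

    public static void main(String[] args) throws Exception {
        long m = endOffsetSum("london-bootstrap:9092", "topic1");    // active cluster (placeholder)
        long n = endOffsetSum("frankfurt-bootstrap:9092", "topic1"); // standby cluster (placeholder)
        System.out.println("Approximate replication lag (delta) = " + (m - n));
    }
}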

Active-Active Setup:

The replication lag concern can be overcome with an Active-Active setup. Just like the Active-Standby setup, we have identical Kafka clusters deployed in the London and Frankfurt regions.

The producers of the ingestion pipeline write the data to both clusters at any given point in time, which is also known as dual writes. As a result, both clusters' data stays in sync, as shown in Figure 4.

Please note that the ingestion pipeline or producers must support dual writes. It is also important that these write pipelines are isolated, so that an issue with one cluster does not impact the other pipeline or create back pressure and producer latencies. For example, if your use case is log analytics or observability, using Logstash in your ingestion layer and creating two separate pipelines within one or more Logstash instances provides this isolation. Note that only one cluster is active for the consumers at any given point in time.

Figure 4: Active-Active setup or Dual writes.

If there are any issues with one cluster, as shown in Figure 5, the consumers need to be redirected with a DNS redirect, and RTO and RPO (and thus SLA impact) are near zero.

Figure 5: Active-Active setup, one cluster is unavailable

Once the London cluster is available and becomes active, data can be synced back to the London cluster from the Frankfurt cluster using MirrorMaker. As shown in Figure 6, consumers can then be switched back to the London cluster using a DNS redirect. It is also important to configure consumer groups to start where they left off.

In this setup, during failover, it is important either to configure the consumers' offsets to read messages from a particular point in time, for example replaying messages from the current time minus 15 minutes, or to set up a MirrorMaker flow between the active clusters that replicates only the consumer offsets topics from the London region cluster to the Frankfurt region cluster. During the failure, you can use source.auto.offset.reset=latest to read the data from the latest offsets in the Frankfurt region cluster.
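
As an illustration of the "replay from the current time minus 15 minutes" approach, a consumer can seek each assigned partition to the offset corresponding to that timestamp. This is a minimal sketch; the bootstrap address, topic, and group id are placeholders.

import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ReplayFromTimestampSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "frankfurt-bootstrap:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "cgg1"); // placeholder group id
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("topic1")); // placeholder topic
            consumer.poll(Duration.ofSeconds(5));  // join the group and receive a partition assignment

            long replayFrom = System.currentTimeMillis() - Duration.ofMinutes(15).toMillis();
            Map<TopicPartition, Long> query = new HashMap<>();
            for (TopicPartition tp : consumer.assignment()) {
                query.put(tp, replayFrom);
            }
            // Find the earliest offset at or after the timestamp and seek there
            for (Map.Entry<TopicPartition, OffsetAndTimestamp> e : consumer.offsetsForTimes(query).entrySet()) {
                if (e.getValue() != null) {
                    consumer.seek(e.getKey(), e.getValue().offset());
                }
            }
            // Subsequent poll() calls now start from the replay point
        }
    }
}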

If the consumer groups are not properly configured, you may see duplicate messages in your downstream systems, so it is essential to design the downstream to handle duplicate records. For example, in an observability use case where the downstream is OpenSearch, it is good practice to build a unique ID from the message timestamp and other parameters such as session IDs, so that a duplicate message overwrites the existing record and only one copy is kept (see the sketch after Figure 6).

Figure 6: Active-Active setup cluster syncing data with MirrorMaker
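
A minimal sketch of such a deterministic document ID, assuming each message carries a timestamp and a session ID (the field values shown are hypothetical):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Base64;

public class DedupIdSketch {
    // Build a deterministic document ID so that a replayed duplicate overwrites the same record
    static String documentId(long eventTimestampMillis, String sessionId) throws Exception {
        String key = eventTimestampMillis + "|" + sessionId;
        byte[] hash = MessageDigest.getInstance("SHA-256").digest(key.getBytes(StandardCharsets.UTF_8));
        return Base64.getUrlEncoder().withoutPadding().encodeToString(hash);
    }

    public static void main(String[] args) throws Exception {
        // The same (timestamp, sessionId) pair always yields the same ID
        System.out.println(documentId(1553657637686L, "session-42"));
    }
}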

Challenges and Complexities in Active-Active Kafka Setups

Message ordering challenges occur when writing to two clusters simultaneously: messages may arrive in different orders due to network delays between regions. This can be particularly problematic for applications that require strict message ordering, so it needs to be handled at data processing time or in downstream systems. For example, keying on a timestamp field within the message body allows downstream systems to order the messages by timestamp.

Data consistency issues can occur when establishing dual writes to two different clusters. Robust monitoring and reconciliation processes are required to ensure data integrity across clusters, because a write might succeed on one cluster but fail on the other, resulting in data inconsistency.

Producer complexity is an issue in the active-active setup. Producers need to handle writing to two clusters and manage failures independently for each cluster, so isolating each pipeline is important. This setup increases the complexity of the producer applications and requires more resources to manage dual writes effectively. There are also agents and open-source tools on the market, such as Elastic Logstash, that provide dual-pipeline options.
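
For illustration, an application-level dual write with independent failure handling per cluster could look like the sketch below; the bootstrap addresses and topic are placeholders, and in practice an isolated pipeline per cluster, as discussed above, is usually preferable.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class DualWriteSketch {

    static KafkaProducer<String, String> producerFor(String bootstrap) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap);
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        return new KafkaProducer<>(props);
    }

    public static void main(String[] args) {
        try (KafkaProducer<String, String> london = producerFor("london-bootstrap:9092");         // placeholder
             KafkaProducer<String, String> frankfurt = producerFor("frankfurt-bootstrap:9092")) {  // placeholder

            ProducerRecord<String, String> record = new ProducerRecord<>("topic1", "key", "value");

            // Each send has its own callback, so a failure on one cluster does not block the other
            london.send(record, (metadata, exception) -> {
                if (exception != null) System.err.println("London write failed: " + exception.getMessage());
            });
            frankfurt.send(record, (metadata, exception) -> {
                if (exception != null) System.err.println("Frankfurt write failed: " + exception.getMessage());
            });

            london.flush();
            frankfurt.flush();
        }
    }
}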

Consumer group management is complex. Managing consumer offsets across multiple clusters could be challenging, especially during failover scenarios, and can lead to message duplication. Consumer applications or downstream systems need to be designed to handle duplicate messages or implement deduplication logic.

Operational Overhead - Running active-active clusters requires sophisticated monitoring tools and increases operational costs significantly. Teams need to maintain detailed runbooks and conduct regular disaster recovery drills to ensure smooth failover processes.

Network Latency Considerations- Cross-region network latency and bandwidth costs become critical factors in active-active setups. Reliable network connectivity between regions is essential for maintaining data consistency and managing replication effectively.

Backup and Restore

Backup and restore is a cost-effective architectural pattern compared to the Active-Standby (AP) or Active-Active (AA) DR patterns. It is suitable for non-critical applications that can tolerate higher RTO and RPO values than the AA or AP patterns allow.

Backup and restore is cost-effective because we need only one Kafka cluster. In this architectural pattern, shown in Figure 7, all data is saved to a persistent location such as Amazon S3 or another data lake. If the Kafka cluster becomes unavailable, the data from this centralized repository can be replayed/restored to the cluster once it is available again. It is also common to build a new Kafka cluster environment using Terraform templates and replay the data into it (a simple replay sketch follows Figure 7).

Figure 7: Backup and restore architectural pattern
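
For illustration only, replaying a backed-up file into a topic could be as simple as the sketch below; the file name, bootstrap address, and topic are placeholders, and production setups usually rely on purpose-built backup and restore tooling rather than hand-rolled replay code.

import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;
import java.util.stream.Stream;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class RestoreReplaySketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "x.x.x.x:9092"); // placeholder broker address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        // backup.jsonl is a hypothetical local copy of a backup object, one message per line
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props);
             Stream<String> lines = Files.lines(Path.of("backup.jsonl"))) {
            lines.forEach(line -> producer.send(new ProducerRecord<>("topic1", line))); // placeholder topic
            producer.flush();
        }
    }
}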

Challenges and Complexities in Kafka Backup and Restore Setup

Recovery time impact is a major consideration; restoring (aka replaying) large volumes of data to Kafka can take significant time, resulting in a higher RTO. This makes the pattern suitable only for applications that can tolerate longer downtimes.

Data consistency challenges are also an issue. When restoring data from backup storage like Amazon S3, maintaining the exact message ordering and partitioning as the original cluster can be challenging. Special attention is needed to ensure data consistency during restoration.

Backup performance overhead is a further consideration. Regular backups to external storage can impact the performance of the production cluster. The backup process needs careful scheduling and resource management to minimize impact on production workloads.

Storage management is key. Managing large volumes of backup data requires effective storage lifecycle policies and cost controls. Regular cleanup of old backups and efficient storage utilization strategies are essential.

Coordination during recovery is mandatory. The recovery process requires careful coordination between backing up offset information and actual messages. Missing either can lead to message loss or duplication during restoration.

Conclusion on Kafka DR Strategies

As we explored earlier, choosing a DR strategy isn't one-size-fits-all; it comes with trade-offs. It often boils down to finding the right balance between what the business needs and what it can afford. The decision usually depends on several factors.

SLAs (e.g., 99.99%+ uptime) are pre-defined commitments to uptime and availability. An SLA requiring 99.99% uptime allows only 52.56 minutes of downtime per year.
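
The downtime budget follows directly from the uptime target: \(365 \times 24 \times 60 \times (1 - 0.9999) = 525{,}600 \times 0.0001 = 52.56\) minutes per year.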

The RTO metric defines the maximum acceptable downtime for a system or application after a failure before it causes unacceptable business harm. Knowledge, content, or training systems, for example, can tolerate an hour or more of downtime because they are not mission critical. For banking applications, on the other hand, acceptable downtime could be zero or a couple of minutes.

The RPO metric defines the maximum acceptable amount of data loss during a disruptive event, measured in time from the most recent backup. For example, an observability/monitoring application may accept losing the past hour of data, whereas a financial application cannot tolerate losing a single transaction record.

Associated costs vary. Higher availability and faster recovery typically come at a higher price. Organizations must weigh their need for uptime and data protection against the financial investment required.

DR Strategy | RTO (downtime tolerance) | RPO (data loss tolerance) | Use case | Cost
Active-Active | Near-zero (seconds) | Near-zero (seconds) | Mission-critical apps (e.g., banking, e-commerce, stock trading) | High
Active-Standby | Minutes to hours | Near-zero to minutes | Enterprise apps that need high availability | Medium to high (no dual-write ingestion pipeline cost)
Backup and Restore | Hours to days | Hours to days | Non-critical apps with occasional DR needs or cost-sensitive apps | Low

