Analyzing Apache Kafka Stretch Clusters: WAN Disruptions, Failure Scenarios, and DR Strategies

Key Takeaways

  • Apache Kafka Stretch cluster setup in an on-premise environment poses a high risk of service unavailability during WAN disruptions, often leading to split-brain or brain-dead scenarios and violating Service Level Agreements (SLAs) due to degraded Recovery Time Objective (RTO) and Recovery Point Objective (RPO).
  • Proactive monitoring of your Kafka environment to understand data skews across brokers is critical; an uneven data load on a single node can cause a stretch cluster failure.
  • An ungraceful broker shutdown during upgrades or due to poor security posture can make the Kafka service unavailable, resulting in a loss of data.
  • There are three popular and widely used Kafka disaster recovery strategies; each has challenges, complexities, and pitfalls.
  • Kafka MirrorMaker 2 replication lag (Δ) in a Disaster Recovery setup can cause message loss and inconsistent data.

 

Apache Kafka is a well-known distributed publish-subscribe system widely used across industries for use cases such as log analytics, observability, event-driven architectures, and real-time streaming. Its distributed nature allows it to serve as the backbone of modern streaming architectures, offering many integrations and supporting near-real-time and real-time latency requirements.

Today, Kafka is offered as a service by major vendors, including Confluent Platform and cloud offerings such as Confluent Cloud, Amazon Managed Streaming for Apache Kafka (MSK), Google Cloud Managed Service for Apache Kafka, and Azure Event Hubs for Apache Kafka. Kafka has been deployed at over one hundred thousand companies, including Fortune 100 enterprises, across industry segments such as finance, healthcare, automotive, retail, startups, and independent software vendors.

The value of data diminishes over time. Many businesses now compete on time and demand real-time data processing with millisecond responses to power mission-critical applications. Financial services, e-commerce, IoT, and other latency-sensitive industries require highly available and resilient architectures that can withstand infrastructure failures without impacting business continuity.

Companies operating across multiple geographic locations need solutions that provide seamless failover, data consistency, and disaster recovery (DR) capabilities while minimizing operational complexity and cost. One common architectural pattern is a single Kafka cluster spanning multiple data center locations over a WAN (Wide Area Network), known as a stretch cluster. Companies often consider this pattern for disaster recovery (DR) and high availability (HA).

A stretch cluster can theoretically provide data redundancy across regions and minimize data loss in the event of a region-wide failure. However, this approach comes with several challenges and trade-offs due to Kafka's dependency on network latency, consistency, and partition tolerance in a multi-region setup. In this article, we focus on the implications and considerations of the stretch cluster architectural pattern for DR and HA.

We conducted the following tests and documented the cluster behavior.

Environment Details

Regions: London, Frankfurt
Kafka distribution: Apache Kafka
Kafka version: 3.6.2
Zookeeper version: 3.8.2
Operating system: Linux

High-Level Environment Setup View

Below is the high-level view of a single stretch cluster setup that spans the London and Frankfurt regions, with:

  • Four Kafka brokers, where actual data resides.
  • Four Zookeeper nodes (a three-node setup would work as well, but to mimic the exact environment in both regions, we set up four Zookeeper nodes).
  • Zookeeper follows [N/2] + 1 rule for quorum establishment. In this case, the Zookeeper quorum will be [4/2] + 1 = 2 + 1 = 3.
  • Network latency between these regions is ~ 15ms.
  • Multiple producers (1 to n) and consumers (1 to n) are producing and consuming the data from this cluster from different regions of the world.

Kafka Broker Hardware Configuration

Hard disk size: 10 TB
Memory: 120 GB
CPU: 16 cores
Disk setup: RAID 0

Kafka Broker Configuration

  • Replication factor set to 4
  • Data retention (aka log retention) period is 7 days
  • Auto-creation of topics allowed
  • Minimum in-sync replicas (min.insync.replicas) set to 3
  • The number of partitions per topic is 10
  • Other configuration details (a producer-side sketch of these settings follows this list)
    • Acks = all
    • max.in.flight.requests.per.connection=1
    • compression.type=lz4
    • batch.size=16384
    • buffer.memory=67108864
    • auto.offset.reset=earliest
    • send.buffer.bytes=524288
    • receive.buffer.bytes=131072
    • num.replica.fetchers=4
    • num.network.threads=12
    • num.io.threads=32
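
As a rough illustration of the producer-side settings above (acks, in-flight requests, compression, batch size, buffers), the sketch below wires them into the Kafka Java client. This is a minimal sketch rather than the test harness used in this article; the bootstrap address, topic name, and class name are placeholders.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class StretchClusterProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "x.x.x.x:9092"); // placeholder broker address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Producer-side settings from the configuration list above
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, "1");
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, "16384");
        props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, "67108864");
        props.put(ProducerConfig.SEND_BUFFER_CONFIG, "524288");
        props.put(ProducerConfig.RECEIVE_BUFFER_CONFIG, "131072");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // "topic1" is a placeholder topic name
            producer.send(new ProducerRecord<>("topic1", "key", "value"));
            producer.flush();
        }
    }
}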

Producers and Consumers:

Producers:

In this use case, we have used a producer written in Java that consistently produces a defined number of transactions per second (TPS), which can be controlled through a properties file. The sample snippet below wraps the Kafka performance test tool (kafka-producer-perf-test).

******** Start of Producer code snippet *******

// Requires: java.io.*, java.text.SimpleDateFormat, java.util.Date, java.util.Properties, java.util.TimeZone
void performAction(int topicN) {
    try {
        // Load the producer test parameters (record count, size, throughput, index name)
        Properties prop = new Properties();
        try (InputStream input = new FileInputStream("config.properties")) {
            prop.load(input);
        }

        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS");
        sdf.setTimeZone(TimeZone.getTimeZone("America/New_York"));
        String startTime = sdf.format(new Date());
        System.out.println("Start time ---> " + startTime + " topic " + topicN);

        // Build the kafka-producer-perf-test command for this topic
        String command = "/home/opc/kafka/bin/kafka-producer-perf-test --producer-props bootstrap.servers=x.x.x.x:9092"
                + " --topic " + topicN
                + " --num-records " + prop.getProperty("numOfRecords")
                + " --record-size " + prop.getProperty("recordSize")
                + " --throughput " + prop.getProperty("throughputRate")
                + " --producer.config /home/opc/kafka/bin/prod.config";

        ProcessBuilder processBuilder = new ProcessBuilder();
        processBuilder.command("bash", "-c", command);
        Process process = processBuilder.start();

        // Read the perf-test output until the final summary line (the one containing "99.9th.")
        BufferedReader reader = new BufferedReader(new InputStreamReader(process.getInputStream()));
        String line = null;
        while ((line = reader.readLine()) != null) {
            if (line.contains("99.9th.")) {
                break;
            }
        }

        int exitVal = process.waitFor();
        if (exitVal == 0) {
            System.out.println("Success!");
            String endTime = sdf.format(new Date());
            // Send the summary line to OpenSearch for tracking purposes - optional step
            pushToIndex(line, startTime, endTime, prop.getProperty("indexname"));
        } else {
            System.out.println("Something unexpected happened!");
        }
    } catch (IOException e) {
        e.printStackTrace();
    } catch (Exception e) {
        e.printStackTrace();
    }
}

********** End of producer code snippet **************

****** Start of Producer properties file *****
numOfRecords=70000
recordSize=4000
throughputRate=20
topicstart=1
topicend=100
****** End of Producer properties file *****

Consumers:

In this use case, we have used consumers written in Java that consistently consume a defined number of transactions per second (TPS), which can be controlled through a properties file. The sample snippet below wraps the Kafka consumer performance test tool (kafka-consumer-perf-test).

***** Start of consumer code snippet *******
// Requires: java.io.*, java.text.SimpleDateFormat, java.util.Date, java.util.Properties, java.util.TimeZone
void performAction(int topicN) {
    try {
        // Load the consumer test parameters (message count, timeout, index name, offset mode)
        Properties prop = new Properties();
        try (InputStream input = new FileInputStream("consumer.properties")) {
            prop.load(input);
        }

        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS");
        sdf.setTimeZone(TimeZone.getTimeZone("America/New_York"));
        String startTime = sdf.format(new Date());
        System.out.println("Start time ---> " + startTime + " topic " + topicN);

        // Build the kafka-consumer-perf-test command; optionally read from the latest offsets
        String command = "/home/opc/kafka/bin/kafka-consumer-perf-test --broker-list=x.x.x.x:9092"
                + " --topic " + topicN
                + " --messages " + prop.getProperty("numOfRecords")
                + " --timeout " + prop.getProperty("timeout")
                + " --consumer.config /home/opc/kafka/bin/cons.config";
        if (!prop.getProperty("latest").equalsIgnoreCase("no")) {
            command += " --from-latest";
        }
        System.out.println("command is ---> " + command);

        ProcessBuilder processBuilder = new ProcessBuilder();
        processBuilder.command("bash", "-c", command);
        Process process = processBuilder.start();

        // Read the perf-test output until the statistics line (contains ':' and '-' but is not a warning)
        BufferedReader reader = new BufferedReader(new InputStreamReader(process.getInputStream()));
        String line = null;
        while ((line = reader.readLine()) != null) {
            if (line.contains(":") && line.contains("-") && !line.contains("WARNING:")) {
                break;
            }
        }

        int exitVal = process.waitFor();
        if (exitVal == 0) {
            System.out.println("Success!");
            // Send the statistics line to OpenSearch - optional step
            pushToIndex(line, prop.getProperty("indexname"), topicN);
        } else {
            System.out.println("Something unexpected happened!");
        }
    } catch (IOException e) {
        e.printStackTrace();
    } catch (Exception e) {
        e.printStackTrace();
    }
}

***** End of consumer code snippet *******

***** Start of Consumer properties snippet ****

numOfRecords=70000
topicstart=1
topicend=100
groupid=cgg1
reportingIntervalms=10000
fetchThreads=1
latest=no
processingthreads=1
timeout=60000

***** End of consumer properties snippet ****

CAP Theorem

Before diving into further discussions and scenario executions, it's crucial to understand the fundamental principles of distributed systems. This foundational knowledge will help you better correlate WAN disruptions with their impact on system behavior.

The CAP Theorem (i.e., Consistency, Availability, and Partition tolerance) applies to distributed systems such as Kafka, Hadoop, and Elasticsearch, where data is distributed across nodes to:

  • Enable parallel data processing
  • Maintain high availability for data and service
  • Support horizontal or vertical scaling requirements

Consistency (C)

Every read gets the latest write or an error.

Kafka follows eventual consistency; data ingested into one node eventually syncs across others. Reads happen from the leader, so replicas might lag behind.

Availability (A)

Every request gets a response, even if some nodes fail.

Kafka ensures high availability using replication factor, min.insync.replicas, and acks. As long as a leader is available, Kafka keeps running.
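
As an illustration of these settings, a topic matching the test environment (10 partitions, replication factor 4, min.insync.replicas=3) could be created with the Java AdminClient as in the minimal sketch below; the bootstrap address and topic name are placeholders.

import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "x.x.x.x:9092"); // placeholder broker address

        try (Admin admin = Admin.create(props)) {
            // 10 partitions and replication factor 4, as in the broker configuration above
            NewTopic topic = new NewTopic("topic1", 10, (short) 4)
                    .configs(Map.of("min.insync.replicas", "3"));
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}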

Partition Tolerance (P)

The system keeps working despite network issues.

Even if some nodes disconnect, Kafka elects a new leader and continues processing.

Kafka is an AP system; it prioritizes Availability and Partition Tolerance, while consistency is eventual and tunable via settings.

Now, let's get into the scenarios.

Stretch Cluster failed scenarios

Before going in-depth into scenarios, let's understand the key components and terminology of Kafka internals.

  • Brokers represent data nodes where all the data is distributed with replicas for data high availability.
  • The Controller is a critical component of Kafka that maintains the cluster's metadata, including information about topics, partitions, replicas, and brokers, and also performs administrative tasks. Think of it as the brain of a Kafka cluster.
  • A Zookeeper quorum is a group of nodes (usually deployed on dedicated nodes) that act as a coordination service helping to manage the Controller election, Broker Registration, Broker Discovery, Cluster membership, Partition leader election, Topic configuration, Access Control List (ACLs), consumer group management, and offset management.
  • Topics are logical constructs of data to which all related data is published. Examples include all the network logs published to the "Network_logs" topic and application logs published to the "Application_logs" topic.
  • Partitions are the fundamental units or building blocks supporting scalability and parallelism within a topic. When a topic is created, it can have one or more partitions where all the data is published in order.
  • Offsets are unique identifiers assigned to each message within a partition of a topic, essentially a sequential number that indicates the position of a message within the partition.
  • A producer is a client application that publishes records (messages) to Kafka topics.
  • A consumer is a client application that consumes records (messages) from Kafka topics.

Scenario 1: WAN Disruption – Active Controllers as '0' – AKA Brain-Dead Scenario

Scenario: WAN Disruption – Active controllers as '0'.
Scenario Description

As discussed in earlier sections, in a distributed network setup across geographic locations, network service interruptions can occur because no service guarantees 100% availability.

This scenario focuses on WAN disruption between two regions, causing Kafka service unavailability, leaving the cluster with zero active controllers and no self-recovery, even after hours of waiting.

Execution Steps
  1. Kafka cluster (all brokers and Zookeeper quorum) should be up and running with the configuration stated in the environment section
  2. Produce and consume events with the brokers in an equilibrium state (this step is optional, but it simulates a real use case)
  3. Disconnect/disrupt WAN service between regions
  4. Observe the cluster behavior through the broker host CLI and the ZK CLI. If you are using Confluent, you can check Confluent Control Center; for AWS's managed Kafka service, you can check CloudWatch; other offerings provide similar tooling. Check:
    • a. ZK quorum availability
    • b. Brokers availability as cluster members
    • c. Number of active controllers
    • d. Producers and consumers progress
    • e. Under replicated partitions, etc.
  5. Reestablish WAN service between regions and observe the cluster behavior.
Expected Behavior
  1. In a steady state, the Kafka cluster should be available with
    1. 4 brokers
    2. ZK quorum as 3
    3. Active controllers as 1
    4. Producers and consumers should produce and fetch events from topics
  2. Upon WAN service disruption,
    1. The Kafka cluster should not be available
    2. ZK quorum 3 will break
    3. Active controllers as 1 or zero
    4. Producers and consumers should be interrupted, creating back pressure or timeouts.
  3. Reestablish WAN service
    1. ZK quorum 3 should be reestablished
    2. Kafka cluster should be available with 4 brokers
    3. The active controller should be 1
    4. Producers and consumers should continue/resume
Actual/Observed Behavior

Kafka is unavailable, cannot recover on its own after the WAN service is disrupted and reestablished, and shows active controllers as '0':

  • Active brokers are reported as zero
  • The active controller count is zero
  • The ZK quorum of 3 broke during the disruption and re-formed when the WAN service resumed
  • All brokers showed up in ZK cluster membership once the WAN service resumed; only two brokers showed during the disruption
  • Producers and consumers were completely interrupted and timed out
Root Cause Analysis (RCA)

It's a "brain-dead" scenario. Upon WAN disruption, a single 4-node cluster becomes 2 independent clusters, each with 2 nodes.

  • The existing single Zookeeper quorum splits into two groups, one per site. Neither group elects a new controller, because each assumes a controller already exists elsewhere, resulting in zero controllers upon WAN reestablishment.
  • Imagine two cooks in a kitchen preparing a dish. Each cook assumes the other has added the salt, but in the end, no salt is added to the dish.
  • From the logs, we observed that the controller moved from one broker to another (ControllerMovedException), but no exit or hang was reported in the logs
Controller exceptions:
controller.log:org.apache.kafka.common.errors.ControllerMovedException: Controller moved to another broker. Aborting
controller.log:java.io.IOException: Client was shutdown before response was read
controller.log:java.lang.InterruptedException
  • From the ZooKeeper (ZK) CLI, it was also observed that no broker claims to be the controller; as a result, the cluster failed to respond properly to state changes, including topic or partition creation and broker failures.
ERROR Unexpected exception causing shutdown while sock still open (org.apache.zookeeper.server.quorum.LearnerHandler)
java.lang.Exception: shutdown Leader! reason: Not sufficient followers synced, only synced with sids: [ 2 ]

ERROR Exception while listening (org.apache.zookeeper.server.quorum.QuorumCnxManager)
References
  • WAN disruption can be simulated by modifying or deleting route tables.
  • ZK commands: refer to the ZK commands list below.
Solution/Remedy: Because no broker claims the controller role, all the brokers need to be restarted.

WAN Disruption:

WAN disruption can be done by deleting or modifying route tables.
Warning: Deleting a route table may disrupt all traffic in the associated subnets. Instead, consider modifying or detaching the route table.

Firewall Stop

You may also need to stop firewalls:
Firewall status: systemctl status firewalld
Firewall stop: service firewalld stop

ZK Commands list:

ZK Server start: bin/zookeeper-server-start etc/kafka/zookeeper.properties
ZK Server stop: zookeeper-server-stop
ZK Shell commands: bin/zookeeper-shell.sh <zookeeper host>:2181
Active controllers check: get /controller or ls /controller
Brokers list: /home/opc/kafka/bin/zookeeper-shell kafkabrokeraddress:2181 <<<"ls /brokers/ids"

Alternative to check controllers and brokers in a cluster. Available from kafka 3.7.x
bin/kafka-metadata-shell.sh --bootstrap-server <KAFKA_BROKER>:9092 --describe-cluster
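
As an alternative to the ZK shell, the active controller and broker membership can also be checked programmatically with the Java AdminClient. The following is a minimal sketch, assuming a reachable placeholder bootstrap address.

import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.DescribeClusterResult;
import org.apache.kafka.common.Node;

public class ClusterStateCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "x.x.x.x:9092"); // placeholder broker address

        try (Admin admin = Admin.create(props)) {
            DescribeClusterResult cluster = admin.describeCluster();
            // controller() may be empty or fail if no broker currently claims the controller role
            Node controller = cluster.controller().get();
            System.out.println("Active controller: " + controller);
            System.out.println("Registered brokers: " + cluster.nodes().get());
        }
    }
}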

Scenario 2: WAN Disruption - Active Controllers as '2' – AKA Split-Brain Scenario

Scenario: WAN Disruption - Active controllers as '2'
Scenario Description: As discussed in earlier sections, in a distributed network setup across geographic locations, network service interruptions can occur because no service guarantees 100% availability. This scenario focuses on WAN disruption between two regions, causing Kafka service unavailability with active controllers as '2' and no self-recovery, even after hours of waiting.
Execution Steps
  1. Kafka Cluster (all brokers and Zookeeper quorum) should be up and running with the configuration stated in the environment section
  2. Produce and consume events with brokers' equilibrium state (this step is optional)
  3. Disconnect/Disrupt WAN service between regions
  4. Observe the cluster behavior through the broker host CLI and the ZK CLI. If you are using Confluent, you can check Confluent Control Center; for AWS's managed Kafka service, you can check CloudWatch. Check:
    1. ZK quorum availability
    2. Brokers' availability as cluster members
    3. Number of active controllers
    4. Producers and consumers progress
    5. Under replicated partitions etc.
  5. Reestablish WAN service between regions and observe the cluster behavior.
Expected Behavior
  1. In a steady state, the Kafka cluster should be available with
    1. 4 brokers
    2. ZK quorum as 3
    3. Active controllers as 1
    4. Producers and consumers should produce and fetch events from topics
  2. Upon WAN service disruption,
    1. The Kafka cluster should not be available
    2. ZK quorum 3 will break
    3. Active controllers as 1 or zero
    4. Producers and consumers should be interrupted, creating back pressure or timeouts.
  3. Re-establish WAN service
    1. ZK quorum 3 should be reestablished
    2. Kafka cluster should be available with 4 brokers
    3. Active controller should be 1
    4. Producers and consumers should continue/resume
Actual/Observed Behavior

Kafka is not available, cannot recover on its own after the WAN service is disrupted and reestablished, and shows active controllers as '2':

  • Active brokers are reported as zero
  • The active controller count is two
  • The ZK quorum of 3 broke during the disruption and re-formed when the WAN service resumed
  • All brokers showed up in ZK cluster membership once the WAN service resumed; only two brokers showed during the disruption
  • Producers and consumers were completely interrupted and timed out with NETWORK_EXCEPTION errors
  • Producers got failed requests with NOT_LEADER_FOR_PARTITION exceptions
Root Cause Analysis (RCA)
  • This is a split-brain scenario. Upon WAN disruption, a single 4-node cluster becomes 2 independent clusters, each consisting of 2 nodes.
  • The existing single Zookeeper quorum also becomes two quorums, i.e., one for each site. As a result, each quorum elects its own controller, resulting in two controllers.
  • Upon WAN re-establishment, we see 2 controllers for the 4-node cluster. Think of it like two co-pilots flying one plane from different cockpits after radio loss: both think the other is gone and take over, and the conflicting actions crash the system.
  • From the ZooKeeper (ZK) CLI, it was also observed that two brokers claim to be the controller; as a result, the cluster failed to respond properly to state changes, including topic or partition creation and broker failures.
References
  • WAN disruption can be simulated by deleting route tables
  • ZK commands
Solution/Remedy: Restart both brokers that claim the controller role.

Scenario 3: Broker Disk Full, Resulting in a Broker Process Crash and No Active Controllers

Scenario: Broker disk full, resulting in the respective broker process crashing and '0' active controllers.
Scenario Description: Start from a steady Kafka cluster, meaning all brokers are up and running and both producers and consumers are producing and consuming events. One of the 4 brokers fills its disk due to data skew, i.e., an uneven or imbalanced distribution of data across brokers, leading to an abrupt broker process crash that cascades to the remaining brokers. This leaves the Kafka cluster unavailable with '0' active controllers.
Execution Steps
  1. Kafka Cluster (all brokers and Zookeeper quorum) should be up and running with the configuration stated in the environment section
  2. Produce and consume events with brokers' equilibrium state
  3. Fill one of the 4 brokers' disks, causing the respective broker process to crash
  4. Observe the cluster behavior through the broker host CLI or the ZK CLI. If you are using Confluent, you can check Confluent Control Center; for AWS's managed Kafka service, you can check CloudWatch. Check:
    1. ZK quorum availability
    2. Brokers availability as cluster members
    3. Number of active controllers
    4. Producers and consumers progress
    5. Under replicated partitions, etc.
Expected Behavior
  1. In a steady state, Kafka cluster should be available with
    1. 4 brokers
    2. ZK quorum as 3
    3. Active controllers as 1
    4. Producers and consumers should produce and fetch events from topics
  2. Upon one broker's disk full,
    1. Disk errors should be logged.
    2. Kafka cluster should not be available; only one broker should be shown as not available or out of cluster membership
    3. ZK quorum shouldn't fail
    4. Active controllers may briefly drop to zero (if the controller is on the failed broker), and the controller role should move to another available broker.
    5. The cluster should report under replicated partitions due to one broker unavailability.
    6. Producers and consumers should be interrupted, creating back pressure or timeouts.
Actual/Observed Behavior

Kafka was not available, could not recover on its own, and showed active controllers as '0':

  • Active brokers are reported as zero
  • The active controller number is zero
  • No impact on Zookeeper quorum
  • Producers and consumers were completely interrupted and timed out
Root Cause Analysis (RCA)

From the logs, we observed disk-related errors and exceptions:

logs/state-change.log:102362: ERROR [Broker id=1] Skipped the become-follower state change with correlation id 331 from controller 1 epoch 1 for partition topic1-9 (last update controller epoch 1) with leader 5 since the replica for the partition is offline due to disk error org.apache.kafka.common.errors.KafkaStorageException: Error while creating log for topic1-9 in dir /kafka/kafka-logs (state.change.logger)

Logs show that the controller moved from one broker to another (ControllerMovedException), but no exit or hang was reported in the logs.

Controller exceptions:
controller.log:org.apache.kafka.common.errors.ControllerMovedException: Controller moved to another broker. Aborting
controller.log:java.io.IOException: Client was shutdown before response was read
controller.log:java.lang.InterruptedException

The '0' active controllers behavior is equivalent to the brain-dead scenario explained in Scenario 1.

References
  • ZK active controller commands
  • ZK broker membership commands
Solution/Remedy
  • It is essential to monitor your Kafka cluster with proper metrics such as CPU, Disk, Memory, Java Heap, etc. Once the disk reaches 80% or more, you should scale up your cluster.
  • Because there are no controller claims by brokers, all the brokers need to be restarted.

Scenario 4: Corrupted indices upon Kafka process kill

Scenario: Corrupted indices upon Kafka process kill
Scenario Description

Killing the broker and ZK processes (an uncontrolled shutdown, comparable to an abrupt machine reboot) and then restarting the brokers leads to corrupted indices, and rebuilding the indices does not complete even after hours of waiting.

This scenario is comparable to an ungraceful shutdown during an environment upgrade or to a poor security posture, where an unauthorized person could terminate the service.

Execution Steps
  1. Kafka Cluster (all brokers and Zookeeper quorum) should be up and running with the configuration stated in the environment section
  2. Produce and consume events with brokers' equilibrium state
  3. Kill the broker and ZK processes, and restart the processes
  4. Observe the cluster behavior through the broker host CLI or the ZK CLI. If you are using Confluent, you can check Confluent Control Center; for AWS's managed Kafka service, you can check CloudWatch. Check:
    1. ZK quorum availability
    2. Brokers' availability as cluster members
    3. Number of active controllers
    4. Producers and consumers progress
    5. Under replicated partitions etc.
Expected Behavior
  1. In a steady state, the Kafka cluster should be available with
    1. 4 brokers
    2. ZK quorum is 3
    3. Active controllers as 1
    4. Producers and consumers should produce and fetch events from topics
  2. Kill Kafka processes (kill -9), i.e., both broker and ZK processes on all 4 nodes and restart processes
    1. ZK quorum should be formed
    2. Active controllers as '1'
    3. In-flight messages may be lost if not flushed to the disk
    4. May result in corrupted indices, but the broker rebuilds corrupted indices on its own without any manual work
    5. Upon rebuild of corrupted indices, the Kafka cluster should be available
    6. The cluster might show under replicated partitions due to corrupted indices
    7. Producers and consumers will get exceptions or errors and can't continue to perform their jobs
Actual/Observed Behavior

Kafka is not available and couldn't be recovered on its own.

  • Active brokers are '4'
  • Active controller number is '1'
  • No impact on Zookeeper quorum
  • Producers and consumers completely interrupted and timed out
Root Cause Analysis (RCA)

The logs show 'corrupted index' warnings for partitions, followed by 'rebuilding index files' messages. The rebuild did not complete and showed no progress in the logs, and the cluster remained unavailable even after an hour.

WARN [Log partition=topic1-13, dir=/kafka/new/kafka-logs] Found a corrupted index file corresponding to log file /kafka/new/kafka-logs/topic1-13/00000000000000000000.log due to Corrupt time index found, time index file (/kafka/new/kafka-logs/topic1-13/00000000000000000000.timeindex) has non-zero size but the last timestamp is 0 which is less than the first timestamp 1553657637686}, recovering segment and rebuilding index files... (kafka.log.Log)

Also, it is recommended to perform graceful shutdowns instead of killing processes or shutting down abruptly, so that the data in Kafka's local cache is flushed to disk, metadata is maintained, and offsets are committed.

References
  • Not applicable
Solution/Remedy: If Kafka fails to rebuild all the indices on its own, manually deleting all index files and restarting all brokers will fix the issue.

Conclusion on WAN disruptions failure scenarios

Deploying on-premise stretch clusters requires a deep understanding of failure scenarios, particularly during network disruptions. Unlike cloud environments, on-premises setups may lack highly available network infrastructure, fault tolerance mechanisms, or dedicated fiber lines, making them more susceptible to outages. In such cases, a Kafka cluster may experience downtime if network connectivity is disrupted.

Cloud providers, on the other hand, offer high-bandwidth, low-latency networking over fully redundant, dedicated metro fiber connections between regions and data centers. Their managed Kafka services are designed for high availability, typically recommending deployment across multiple zones or regions to withstand zonal failures. Additionally, these services incorporate zone-awareness features, ensuring Kafka data replication occurs across multiple zones, rather than being confined to a single one.

It is also important to understand and emphasize the latency impact on a stretch cluster deployed in two different geographic locations over a WAN connection. We cannot escape physics; in our case, as per WonderNetwork, the average latency between London and Frankfurt is 14.88 milliseconds for a distance of 636 km. So, for mission-critical applications, latency-driven testing of the stretch cluster setup is needed to determine how data is replicated and consumed across regions.

By understanding these differences, organizations can make informed decisions about whether an on-premise stretch cluster aligns with their availability and resilience requirements or if a cloud-native Kafka deployment is a more suitable alternative.

Disaster Recovery Strategies:

A Business Continuity Plan (BCP) is essential for any business to maintain its brand and reputation. Almost all companies define High Availability (HA) and Disaster Recovery (DR) strategies to meet their BCP. Below are the popular DR strategies for Kafka.

  • Active – Standby (aka Active-passive)
  • Active – Active
  • Backup and restore

Before we dive deeper into the DR strategies, note that we are using Kafka MirrorMaker 2.0 in the following architectural patterns. There are other components, such as Kafka Connect and Confluent Replicator, offered by Confluent and other cloud services, but they are out of scope for this article.

MirrorMaker 2 in Apache Kafka is a cross-cluster replication tool that uses the Kafka Connect framework to replicate topics between clusters, supporting active-active and active-passive setups. It provides automatic consumer group synchronization, offset translation, and failover support for disaster recovery and multi-region architectures.

Active Standby

This is a popular DR pattern in which two Kafka clusters of identical size and configuration are deployed in two different geographic regions or data centers, in this case the London and Frankfurt regions. The choice of geographic locations usually depends on factors such as separation across Earth's tectonic plates, company or country compliance and regulatory requirements, and more. Only one cluster is active at a given point in time, and it receives and serves the clients.

As shown in the following picture, Kafka cluster one (KC1) is located in the London region and Kafka cluster two (KC2) is located in the Frankfurt region with the same size and configuration.

All the producers are producing data to KC1 (London region cluster), and all consumers are consuming the data from the same cluster. KC2 (Frankfurt region cluster) receives the data from the KC1 cluster through Mirror Maker as shown in Figure 1.

Figure 1: Kafka Active Standby (aka Active Passive) cluster setup with MirrorMaker

In simple terms, we can treat the mirror maker as a combination of producer and consumer, where it consumes the topics from the source cluster KC1 (London region) and produces the data to the destination cluster KC2 (Frankfurt region).

During a disaster affecting the primary/active cluster KC1 (London region), the standby cluster (Frankfurt region) becomes active, and all producer and consumer traffic is redirected to the standby cluster. This can be achieved through DNS redirection via a load balancer in front of the Kafka clusters. Figure 2 shows this scenario.

Figure 2: Kafka Active Standby setup - Standby cluster becomes active

When the London region KC1 cluster becomes available again, we need to set up a MirrorMaker instance to consume the data from the Frankfurt region cluster and produce it back to the London region KC1 cluster. This scenario is shown in Figure 3.

It's up to the business to keep the Frankfurt region (KC2) active or promote the London region cluster (KC1) back to primary once the data is fully synced/hydrated to the current state. In some cases, the standby cluster, i.e., the Frankfurt region cluster (KC2), continues to function as the active cluster and the London region cluster (KC1) becomes the standby instead.

It is essential to note that RTO (Recovery Time Objective) and RPO (Recovery Point Objective) values are higher than in the Active-Active setup. The Active-Standby setup cost is medium to high, as we need to maintain identical Kafka clusters in both regions.

Figure 3: Kafka Active Standby setup – Standby cluster data syncing back to KC1

Concerns with the Active-Standby setup

Kafka Mirror Maker Replication Lag (Δ):

Consider a scenario, which might happen very rarely, where we lose the active cluster completely, including its nodes and disks. The active Kafka cluster (London region, KC1) has m messages, and MirrorMaker is consuming from it and producing to the standby cluster. Due to network latency and the consume-produce-acknowledge cycle, the standby cluster (Frankfurt region, KC2) lags behind and holds only n messages. The delta (\(\Delta\)) is therefore m minus n:

\(\Delta = m - n\)

where:

  • m = Messages in the Active Kafka Cluster (London)
  • n = Messages in the Standby Kafka Cluster (Frankfurt)

This replication lag (\(\Delta\)) has the following impact on mission-critical applications:

Potential Message Loss:
If a failover occurs before the Standby cluster catches up, some messages (\(\Delta\)) may be missing.

Inconsistent Data:
Consumers relying on the Standby cluster may process incomplete or outdated data.

Delayed Processing:
Applications depending on real-time replication may experience latency in data availability.
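
One way to keep an eye on \(\Delta\) is to compare the log-end offsets of a mirrored topic on both clusters. The sketch below is illustrative only: the bootstrap addresses are placeholders, and it assumes the mirrored topic carries the same name on both clusters (with default MirrorMaker 2 settings the replicated topic would carry a source-cluster prefix instead).

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult.ListOffsetsResultInfo;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.common.TopicPartition;

public class ReplicationLagSketch {

    // Sum of log-end offsets across all partitions of a topic on one cluster
    static long endOffsetSum(String bootstrap, String topic) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap);
        try (Admin admin = Admin.create(props)) {
            int partitions = admin.describeTopics(List.of(topic))
                    .allTopicNames().get().get(topic).partitions().size();
            Map<TopicPartition, OffsetSpec> request = new HashMap<>();
            for (int p = 0; p < partitions; p++) {
                request.put(new TopicPartition(topic, p), OffsetSpec.latest());
            }
            return admin.listOffsets(request).all().get().values().stream()
                    .mapToLong(ListOffsetsResultInfo::offset).sum();
        }
    }

    public static void main(String[] args) throws Exception {
        long m = endOffsetSum("london-bootstrap:9092", "topic1");    // active cluster (placeholder)
        long n = endOffsetSum("frankfurt-bootstrap:9092", "topic1"); // standby cluster (placeholder)
        System.out.println("Approximate replication lag (delta) = " + (m - n));
    }
}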

Active-Active Setup:

The replication lag concern can be overcome with an Active-Active setup. Just like the Active-Standby setup, we have identical Kafka clusters deployed in the London and Frankfurt regions.

The producers of the ingestion pipeline write the data to both clusters at any given point in time, which is also known as dual writes. As a result, both clusters' data stays in sync, as shown in Figure 4.

Please note that the ingestion pipeline or producers must support dual writes. It is also important that these write pipelines are isolated, so that an issue with one cluster does not impact the other pipeline or create back pressure and producer latencies. For example, if your use case is log analytics or observability, using Logstash in your ingestion layer and creating two separate pipelines within one or more Logstash instances provides this isolation. Note that only one cluster is active for the consumers at any given point in time.

Figure 4: Active-Active setup or Dual writes.

If there are any issues with one cluster, as shown in Figure 5, the consumers need to be redirected with a DNS redirect, and RTO and RPO (and thus SLA impact) are near zero.

Figure 5: Active-Active setup, one cluster is unavailable

Once the London cluster is available and becomes active, data can be synced back to the London cluster from the Frankfurt cluster using MirrorMaker. As shown in Figure 6, consumers can then be switched back to the London cluster using a DNS redirect. It is also important to configure consumer groups to start where they left off.

In this setup, during failover, it is important either to configure the consumers' offsets to read messages from a particular point in time, for example replaying messages from the current time minus 15 minutes, or to set up a MirrorMaker flow between the active clusters that replicates only the consumer offsets topics from the London region cluster to the Frankfurt region cluster. During the failure, you can use source.auto.offset.reset=latest to read the data from the latest offsets in the Frankfurt region cluster.
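
As an illustration of the "replay from the current time minus 15 minutes" approach, a consumer can seek each assigned partition to the offset corresponding to that timestamp. This is a minimal sketch; the bootstrap address, topic, and group id are placeholders.

import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ReplayFromTimestampSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "frankfurt-bootstrap:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "cgg1"); // placeholder group id
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("topic1")); // placeholder topic
            consumer.poll(Duration.ofSeconds(5));  // join the group and receive a partition assignment

            long replayFrom = System.currentTimeMillis() - Duration.ofMinutes(15).toMillis();
            Map<TopicPartition, Long> query = new HashMap<>();
            for (TopicPartition tp : consumer.assignment()) {
                query.put(tp, replayFrom);
            }
            // Find the earliest offset at or after the timestamp and seek there
            for (Map.Entry<TopicPartition, OffsetAndTimestamp> e : consumer.offsetsForTimes(query).entrySet()) {
                if (e.getValue() != null) {
                    consumer.seek(e.getKey(), e.getValue().offset());
                }
            }
            // Subsequent poll() calls now start from the replay point
        }
    }
}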

If the consumer groups are not properly configured, you may see duplicate messages in your downstream systems, so it is essential to design the downstream to handle duplicate records. For example, in an observability use case where the downstream is OpenSearch, it is good practice to build a unique ID from the message timestamp and other parameters such as session IDs, so that a duplicate message overwrites the existing record and only one copy is kept (see the sketch after Figure 6).

Figure 6: Active-Active setup cluster syncing data with MirrorMaker
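
A minimal sketch of such a deterministic document ID, assuming each message carries a timestamp and a session ID (the field values shown are hypothetical):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Base64;

public class DedupIdSketch {
    // Build a deterministic document ID so that a replayed duplicate overwrites the same record
    static String documentId(long eventTimestampMillis, String sessionId) throws Exception {
        String key = eventTimestampMillis + "|" + sessionId;
        byte[] hash = MessageDigest.getInstance("SHA-256").digest(key.getBytes(StandardCharsets.UTF_8));
        return Base64.getUrlEncoder().withoutPadding().encodeToString(hash);
    }

    public static void main(String[] args) throws Exception {
        // The same (timestamp, sessionId) pair always yields the same ID
        System.out.println(documentId(1553657637686L, "session-42"));
    }
}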

Challenges and Complexities in Active-Active Kafka Setups

Message ordering challenges occur when writing to two clusters simultaneously: messages may arrive in different orders due to network delays between regions. This can be particularly problematic for applications that require strict message ordering, so it needs to be handled at data processing time or in downstream systems. For example, keying on a timestamp field within the message body allows downstream systems to order the messages by timestamp.

Data consistency issues can occur when establishing dual writes to two different clusters. Robust monitoring and reconciliation processes are required to ensure data integrity across clusters, because a write might succeed on one cluster but fail on the other, resulting in data inconsistency.

Producer complexity is an issue in the active-active setup. Producers need to handle writing to two clusters and manage failures independently for each cluster, so isolating each pipeline is important. This setup increases the complexity of the producer applications and requires more resources to manage dual writes effectively. There are also agents and open-source tools on the market, such as Elastic Logstash, that provide dual-pipeline options.
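
For illustration, an application-level dual write with independent failure handling per cluster could look like the sketch below; the bootstrap addresses and topic are placeholders, and in practice an isolated pipeline per cluster, as discussed above, is usually preferable.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class DualWriteSketch {

    static KafkaProducer<String, String> producerFor(String bootstrap) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap);
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        return new KafkaProducer<>(props);
    }

    public static void main(String[] args) {
        try (KafkaProducer<String, String> london = producerFor("london-bootstrap:9092");         // placeholder
             KafkaProducer<String, String> frankfurt = producerFor("frankfurt-bootstrap:9092")) {  // placeholder

            ProducerRecord<String, String> record = new ProducerRecord<>("topic1", "key", "value");

            // Each send has its own callback, so a failure on one cluster does not block the other
            london.send(record, (metadata, exception) -> {
                if (exception != null) System.err.println("London write failed: " + exception.getMessage());
            });
            frankfurt.send(record, (metadata, exception) -> {
                if (exception != null) System.err.println("Frankfurt write failed: " + exception.getMessage());
            });

            london.flush();
            frankfurt.flush();
        }
    }
}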

Consumer group management is complex. Managing consumer offsets across multiple clusters could be challenging, especially during failover scenarios, and can lead to message duplication. Consumer applications or downstream systems need to be designed to handle duplicate messages or implement deduplication logic.

Operational Overhead - Running active-active clusters requires sophisticated monitoring tools and increases operational costs significantly. Teams need to maintain detailed runbooks and conduct regular disaster recovery drills to ensure smooth failover processes.

Network Latency Considerations- Cross-region network latency and bandwidth costs become critical factors in active-active setups. Reliable network connectivity between regions is essential for maintaining data consistency and managing replication effectively.

Backup and Restore

Backup and restore is a cost-effective architectural pattern compared to the Active-Standby (AP) or Active-Active (AA) DR patterns. It is suitable for non-critical applications that can tolerate higher RTO and RPO values than the AA or AP patterns allow.

Backup and restore is cost-effective because we need only one Kafka cluster. In this architectural pattern, shown in Figure 7, all data is saved to a persistent location such as Amazon S3 or another data lake. If the Kafka cluster becomes unavailable, the data from this centralized repository can be replayed/restored to the cluster once it is available again. It is also common to build a new Kafka cluster environment using Terraform templates and replay the data into it (a simple replay sketch follows Figure 7).

Figure 7: Backup and restore architectural pattern
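
For illustration only, replaying a backed-up file into a topic could be as simple as the sketch below; the file name, bootstrap address, and topic are placeholders, and production setups usually rely on purpose-built backup and restore tooling rather than hand-rolled replay code.

import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;
import java.util.stream.Stream;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class RestoreReplaySketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "x.x.x.x:9092"); // placeholder broker address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        // backup.jsonl is a hypothetical local copy of a backup object, one message per line
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props);
             Stream<String> lines = Files.lines(Path.of("backup.jsonl"))) {
            lines.forEach(line -> producer.send(new ProducerRecord<>("topic1", line))); // placeholder topic
            producer.flush();
        }
    }
}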

Challenges and Complexities in Kafka Backup and Restore Setup

Recovery time impact is a major consideration; restoring (aka replaying) large volumes of data to Kafka can take significant time, resulting in a higher RTO. This makes the pattern suitable only for applications that can tolerate longer downtimes.

Data consistency challenges are also an issue. When restoring data from backup storage like Amazon S3, maintaining the exact message ordering and partitioning as the original cluster can be challenging. Special attention is needed to ensure data consistency during restoration.

Backup performance overhead is a further consideration. Regular backups to external storage can impact the performance of the production cluster. The backup process needs careful scheduling and resource management to minimize impact on production workloads.

Storage management is key. Managing large volumes of backup data requires effective storage lifecycle policies and cost controls. Regular cleanup of old backups and efficient storage utilization strategies are essential.

Coordination during recovery is mandatory. The recovery process requires careful coordination between backing up offset information and actual messages. Missing either can lead to message loss or duplication during restoration.

Conclusion on Kafka DR Strategies

As we explored earlier, choosing a DR strategy isn't one-size-fits-all; it comes with trade-offs. It often boils down to finding the right balance between what the business needs and what it can afford. The decision usually depends on several factors.

SLAs (e.g., 99.99%+ uptime) are pre-defined commitments to uptime and availability. An SLA requiring 99.99% uptime allows only 52.56 minutes of downtime per year.
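
The downtime budget follows directly from the uptime target: \(365 \times 24 \times 60 \times (1 - 0.9999) = 525{,}600 \times 0.0001 = 52.56\) minutes per year.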

The RTO metric defines the maximum acceptable downtime for a system or application after a failure before it causes unacceptable business harm. Knowledge, content, or training systems, for example, can tolerate an hour or more of downtime because they are not mission critical. For banking applications, on the other hand, acceptable downtime could be zero or a couple of minutes.

The RPO metric defines the maximum acceptable amount of data loss during a disruptive event, measured in time from the most recent backup. For example, an observability/monitoring application may accept losing the past hour of data, whereas a financial application cannot tolerate losing a single transaction record.

Associated costs vary. Higher availability and faster recovery typically come at a higher price. Organizations must weigh their need for uptime and data protection against the financial investment required.

DR Strategy | RTO (downtime tolerance) | RPO (data loss tolerance) | Use case | Cost
Active-Active | Near-zero (seconds) | Near-zero (seconds) | Mission-critical apps (e.g., banking, e-commerce, stock trading) | High
Active-Standby | Minutes to hours | Near-zero to minutes | Enterprise apps that need high availability | Medium to high (no dual-write ingestion pipeline cost)
Backup and Restore | Hours to days | Hours to days | Non-critical apps with occasional DR needs or cost-sensitive apps | Low

