Designing Resilient Event-Driven Systems at Scale

Key Takeaways

  • Event-driven architectures often break under pressure due to retries, backpressure, and startup latency, especially during load spikes.
  • Latency isn’t always the problem; resilience depends on system-wide coordination across queues, consumers, and observability.
  • Patterns like shuffle sharding, provisioning, and failing fast significantly improve durability and cost-efficiency.
  • Common failure modes include designing for average workloads, misconfigured retries, and treating all events equally.
  • Designing for resilience means anticipating operational edge cases, not just optimizing for happy paths.

Event-driven architectures (EDA) look great on paper: decoupled producers, scalable consumers, and clean async flows. But real systems are much messier than that.

Consider this common scenario: during a Black Friday event, your payment processing service receives five times the normal traffic. When that happens, your serverless architecture hits edge cases. For example, Lambda functions cold-start, your Simple Queue Service (SQS) queues back up as a result, and, independently, DynamoDB starts throttling. Somewhere in this chaos, customer orders start failing. This isn't a theoretical problem; it's a normal day for many teams.

And it's not limited to eCommerce. In SaaS platforms, feature launches lead to backend config spikes. In FinTech, where fraud activity can produce a huge influx of events, even a few milliseconds make a big difference. Day-to-day life offers plenty of other examples that follow the same pattern, such as popular media broadcasts and live events like the Super Bowl.

If you look at a high-level picture of the system, it’s fundamentally broken into three parts: producer, intermediate buffer, and consumer.

When you talk about resilience in these systems, it isn't just about staying available; it's also about staying predictable under pressure. Traffic spikes from upstream integrations, bottlenecks in downstream dependencies, and components doing unbounded retries all test how well your architecture handles the load. Real systems, in other words, have their own opinions.

In this article, we will talk about how to think about building resilient and scalable event processing systems. We will look at different operational events that disrupt reliability and scale, and use the learnings from them to design a better system.

Latency Isn’t the Only Concern

More often than not, when people talk about performance in event-driven systems, they talk about latency and forget that latency is only part of the story. For resilient systems, throughput, resource utilization, and how well data flows between components matter just as much.

Let's consider an example. You own a service whose underlying infrastructure depends on an SQS queue. Suddenly, a spike in traffic overwhelms the downstream systems, leading to their full or partial failure; that failure inflates retries and, in turn, skews your monitoring data. Additionally, if your consumer has a high startup time, whether due to cold starts or container load time, you now have contention between messages that need fast processing and infrastructure that's still getting ready. If you think about it, the failure mode is not the timeout. The failure is the setup, and it shows up as lag, retries, and increased cost to customers.

Now add dead letter queues (DLQs), exponential backoff, throttling policies, or stream partitions into the mix, and the problem becomes more complex. Instead of debugging a single function, you are untangling the contracts between components and figuring out what might be going on.

To design for resilience, we need to treat latency as a signal of pressure building up in the system, not just as a number to minimize. That shift in mindset is needed.

Given all this, let's look at some of the practical approaches that can be used to address the concerns identified:

Patterns That Scale Under Pressure

When talking about resiliency, I want you to think beyond point fixes like reducing latency, tuning retries, or lowering failure rates. Consider designing a system that degrades gracefully when it meets an unseen scenario and recovers automatically. Let's talk about some of these patterns at different layers of your architecture:

Design Patterns

Shard and Shuffle Shard

One of the foundational concepts in resilient system building is to degrade gracefully while containing the blast radius. One way to do that is to segment your customers and make sure a problematic customer cannot take the whole fleet down. You can take the design a step further with shuffle sharding: assigning each customer to a random subset of shards, which reduces the probability that well-behaved customers fully collide with noisy ones. Async systems backed by queues, for example, often hash all their customers onto a handful of queues. When a noisy customer shows up, it overwhelms its queue and in turn impacts every other customer hashed onto that same queue. With shuffle sharding, the probability of a noisy customer sharing all of its shards with any other customer drops drastically, so the failure stays isolated and the impact on others is minimized. You can see this concept in action in this blog: Handling billions of invocations – best practices from AWS Lambda.
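
To make the idea concrete, here is a minimal sketch of deterministic shuffle sharding. It is not tied to any particular queueing service; the function name, shard counts, and customer IDs are illustrative.

```python
import hashlib
import random


def shuffle_shard(customer_id: str, total_shards: int, shards_per_customer: int) -> list[int]:
    """Deterministically map a customer to a small, pseudo-random subset of shards."""
    # Seed the RNG with a hash of the customer ID so the assignment is
    # stable across calls but differs from customer to customer.
    seed = int.from_bytes(hashlib.sha256(customer_id.encode()).digest(), "big")
    rng = random.Random(seed)
    return sorted(rng.sample(range(total_shards), shards_per_customer))


# With 8 shards and 2 shards per customer there are 28 possible shard pairs,
# so two customers rarely share their full set; a noisy customer saturates
# only its own pair while most others keep at least one healthy shard.
print(shuffle_shard("customer-a", total_shards=8, shards_per_customer=2))
print(shuffle_shard("customer-b", total_shards=8, shards_per_customer=2))
```

In practice, each shard would map to its own queue or consumer pool, so a poison workload only affects the few customers that happen to share both of its shards.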

Provisioning for Latency-Sensitive Workload

Provisioning means pre-allocating resources, similar to reserving EC2 capacity upfront. It has a cost associated with it, so you need to be deliberate. Not all workloads need provisioned concurrency, but some do. In the FinTech industry, for example, fraud detection systems rely on real-time signals, and a fraudulent transaction that is not flagged within seconds can do lasting damage. Identify the paths where seconds matter and invest accordingly. You can take it a notch further and autoscale the provisioned concurrency to keep it cost-effective when the workload is spiky and time-sensitive. You can see this concept in action in this blog: How Smartsheet reduced latency and optimized costs in their serverless architecture.
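
As a rough sketch of what autoscaled provisioned concurrency can look like with boto3 and Application Auto Scaling (the function name, alias, and capacity numbers are placeholders, not a prescription):

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

# Register the Lambda alias as a scalable target for provisioned concurrency.
autoscaling.register_scalable_target(
    ServiceNamespace="lambda",
    ResourceId="function:fraud-detector:live",  # placeholder function:alias
    ScalableDimension="lambda:function:ProvisionedConcurrency",
    MinCapacity=5,
    MaxCapacity=100,
)

# Track utilization of the provisioned concurrency and scale out before it saturates.
autoscaling.put_scaling_policy(
    PolicyName="fraud-detector-pc-target-tracking",
    ServiceNamespace="lambda",
    ResourceId="function:fraud-detector:live",
    ScalableDimension="lambda:function:ProvisionedConcurrency",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 0.7,  # keep utilization around 70%
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "LambdaProvisionedConcurrencyUtilization"
        },
    },
)
```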

Infrastructure Patterns

Decouple Using Queues and Buffers

Resilient systems absorb load rather than rejecting it. Queues like SQS, Kafka, and Kinesis, and buffers like EventBridge, act as shock absorbers between producers and consumers. They protect consumers from bursty spikes and offer natural retry and replay semantics.

With Amazon SQS, you get powerful knobs: visibility timeout to control retry behavior, message retention for reprocessing, DLQs to isolate poison-pill messages, and batching/long polling to improve efficiency and reduce costs. If you need ordering and exactly-once processing, FIFO queues are a better fit. Similarly, Kafka and Kinesis offer high throughput via partitioning while preserving record order within each shard or partition.
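
Here is a minimal boto3 sketch that wires those knobs together; the queue names, timeouts, and receive count are illustrative values you would tune for your own workload.

```python
import json

import boto3

sqs = boto3.client("sqs")

# Dead letter queue that isolates poison-pill messages after repeated failures.
dlq_url = sqs.create_queue(QueueName="orders-dlq")["QueueUrl"]
dlq_arn = sqs.get_queue_attributes(
    QueueUrl=dlq_url, AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]

# Main queue: visibility timeout paces retries, retention leaves room for
# replay, and the redrive policy caps how many times a message is retried.
sqs.create_queue(
    QueueName="orders",
    Attributes={
        "VisibilityTimeout": "90",              # longer than the consumer's timeout
        "MessageRetentionPeriod": "345600",     # 4 days, room for reprocessing
        "ReceiveMessageWaitTimeSeconds": "20",  # long polling to cut empty receives
        "RedrivePolicy": json.dumps(
            {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": "5"}
        ),
    },
)
```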

For example, a real-time bidding system in an ad tech platform decouples high-volume clickstream data via Kinesis, using region-id as the partition key. Billing events, on the other hand, are routed through FIFO queues to guarantee order and avoid duplicate charges (especially during retries). This pattern ensures that each workload type can independently scale or fail without causing cascading impact across the system.

Operational Patterns

Fail Fast and Break Things

It's a riff on Meta/Facebook's famous engineering tenet, but here it's about the resilience mindset: if your consumer knows it's in trouble (e.g., it can't connect to a database or fetch config), fail quickly. This avoids wasted visibility timeouts and retries from poison-pill records, and it signals the platform to back off sooner rather than later. I once debugged an issue where a container-based consumer would hang on a failed DB auth call for thirty seconds. Once we added a five-second timeout and explicit error signaling, visibility timeout errors dropped and retries no longer compounded the failure. There are numerous examples of this sort. Another common one is processing the message at the head of the queue without any strict timeout, which lets a backlog build up behind it. This pattern is not about making systems aggressive; it is about making them more predictable and recoverable.
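
A hedged sketch of that mindset with boto3: tight connect/read timeouts, a capped number of SDK retries, and a handler that lets errors propagate instead of hanging. The table name and timeout values are illustrative.

```python
import boto3
from botocore.config import Config

# Fail fast: short timeouts and bounded SDK retries instead of hanging
# for tens of seconds on a dependency that is already down.
fast_fail = Config(
    connect_timeout=2,
    read_timeout=5,
    retries={"max_attempts": 2, "mode": "standard"},
)
dynamodb = boto3.client("dynamodb", config=fast_fail)


def handler(event, context):
    for record in event["Records"]:
        # If the dependency is unhealthy, this raises within seconds.
        # Letting the error propagate signals the platform to retry later
        # (and eventually route the message to the DLQ) instead of burning
        # the whole function timeout on a hung call.
        dynamodb.put_item(
            TableName="orders",  # illustrative table name
            Item={"orderId": {"S": record["messageId"]}},
        )
```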

Other design tools come in handy too and help improve overall resiliency, like batching and tuned polling intervals to reduce overhead, or lazy initialization to avoid loading big dependencies before they are needed.

Common Pitfalls (and How to Handle Them)

Even well-built systems often break not because of one big outage, but because of a slow buildup of architectural debt. This idea is beautifully captured in a paper I read a couple of years back, which gives an excellent explanation of metastable systems, when they break, and their catastrophic effects. The paper specifically discusses how a system transitions from a stable state to a vulnerable state under load, then progresses to a metastable state where the impact is long-lasting and manual intervention is typically required. I won't go into much detail here, only highlight that it calls for a similar mindset shift to avoid painful service outages.

Let’s look at some of the characteristics that lead to this:

Over-indexing on Average Load Instead of Spiky Behavior

Real-world traffic is hardly smooth; it's mostly unpredictable. If you tune batch sizes, memory, or concurrency for the fiftieth percentile, your system will break at the ninetieth percentile or higher. Even a well-architected system can crash under pressure if it is not designed to expect and absorb unpredictable loads. It's not an "if" question but a "when" question; the key is to be prepared for it, and most of the time there are ways to prepare. Consider latency-sensitive workloads processed through AWS Lambda functions: you can set an auto-scaling policy that adjusts the provisioned concurrency configuration based on CloudWatch metrics like invocation errors, latency, or queue depth. You can also run load tests in your test environment that exercise the higher percentiles (p95, p99).
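
One lightweight way to exercise those higher percentiles in a test environment is to replay traffic in bursts rather than at a smooth rate. A rough sketch against a test queue (the queue URL, burst size, and payload shape are placeholders):

```python
import json
import time
import uuid

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/orders-test"  # placeholder


def send_burst(messages_per_burst: int) -> None:
    """Send a burst of messages in batches of 10 (the SQS batch limit)."""
    for start in range(0, messages_per_burst, 10):
        entries = [
            {"Id": str(i), "MessageBody": json.dumps({"orderId": str(uuid.uuid4())})}
            for i in range(min(10, messages_per_burst - start))
        ]
        sqs.send_message_batch(QueueUrl=QUEUE_URL, Entries=entries)


# Alternate quiet periods with sharp spikes so consumers, retries, and
# autoscaling get exercised at p95/p99 conditions, not just the average.
for _ in range(5):
    send_burst(500)
    time.sleep(60)
```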

Treating Retries as a Panacea

Retries are cheap, until they aren't. If retries are your only line of defense, they may not be sufficient, and they have the potential to multiply failure. Retries can overwhelm downstream systems, and creating invisible traffic loops is all too easy when retry logic is not smart. This shows up in systems where every error, transient or not, gets retried with no cap, no delay, and no contextual awareness. That approach leads to throttled databases, increased latency, and, unfortunately all too often, total system collapse.

Instead, what you need is bounded retries to avoid infinite failure loops, and when you do retry, exponential backoff with jitter to avoid contention. Along with this, always keep context in mind: divide your errors into retryable and non-retryable buckets and retry selectively. When upstream is down, continuing to hit the network at the same rate won't get you much help. It won't help the service recover faster either, and could instead delay recovery because of the extra pressure the retries create. I wrote about retries and the dilemma that comes with them in much more detail in the article, Overcoming the Retry Dilemma in Distributed Systems.
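
As a small, library-agnostic sketch of that policy (the exception class and limits are illustrative):

```python
import random
import time


class NonRetryableError(Exception):
    """Validation failures, bad input: anything a retry cannot fix."""


def call_with_retries(operation, max_attempts=4, base_delay=0.2, max_delay=5.0):
    """Bounded retries with exponential backoff and full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except NonRetryableError:
            raise  # retrying will not help; fail immediately
        except Exception:
            if attempt == max_attempts:
                raise  # bounded: give up instead of looping forever
            # Full jitter: sleep a random amount up to the exponential cap,
            # so a crowd of retrying clients does not hammer the service in lockstep.
            time.sleep(random.uniform(0, min(max_delay, base_delay * 2 ** attempt)))
```

The two properties that matter most here are the cap on attempts and the jitter, which together prevent synchronized retry storms.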

Not Taking Observability Seriously

Expecting resilience is one thing; knowing your system is resilient is another. I often remind teams that 

"Observability separates the intentions from actualities". 

You may intend your system to be resilient, but only observability confirms whether that's true. It's not enough to monitor latency or error metrics. Resilient systems need clear resilience indicators that go beyond surface-level monitoring. These indicators should ask harder questions. How fast do you detect failure (time to detect)? How quick is recovery (time to recover)? Does the system fail gracefully? Is the blast radius contained to a tenant, availability zone, or region? Are retries helping or just hiding the real problem? How does the system handle backpressure or upstream outages? These are high-level signals that test your architecture under stress; they only make sense when viewed together, not in isolation.

You can implement these insights using CloudWatch metrics for queue depth, CloudWatch Logs Insights for retry patterns, and X-Ray to trace request flows across services. For example, in one case, a customer's system ran smoothly until a Lambda error started silently pushing messages to the DLQ. Everything appeared green until users reported missing data. The issue was only discovered hours later because no one had set an alarm on the DLQ size. Afterwards, the team added DLQ alerts and integrated them into their internal service level objective (SLO) dashboard.
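
A minimal sketch of the kind of alarm that would have caught it, using CloudWatch's built-in SQS metrics (the queue name and SNS topic are placeholders):

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm as soon as anything lands in the DLQ; a non-empty DLQ is never "green".
cloudwatch.put_metric_alarm(
    AlarmName="orders-dlq-not-empty",
    Namespace="AWS/SQS",
    MetricName="ApproximateNumberOfMessagesVisible",
    Dimensions=[{"Name": "QueueName", "Value": "orders-dlq"}],  # placeholder queue
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall"],  # placeholder topic
)
```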

Observability gives you the only lens to ask, "Is the system doing what I expect even under stress?" If the answer is "I don’t know", it’s time to level up!

Treating All Events Equally

Not all events are created equal. A customer order event isn't the same as a logging event. If your architecture treats them equally, you're either wasting resources or introducing risk. Consider a payment confirmation event sitting in a queue behind hundreds of low-priority logging events and delaying a business outcome. Worse still, those low-priority events can be retried or reprocessed for some reason and starve the critical ones. You need a way to differentiate between critical and low-priority events.

Either establish separate queues (high priority and low priority) or event routing rules that filter these events into two different Lambda functions. This separation also lets you use something like provisioned concurrency only for the high-priority queue, which keeps costs down. Teams often catch these issues too late, when costs spike, retries spiral, or SLAs break. But with the right signals and architectural intent, most issues can be avoided early or at least recovered from predictably.
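
A hedged sketch of the routing-rule approach on an EventBridge bus (the bus name, event pattern, and target ARN are placeholders):

```python
import json

import boto3

events = boto3.client("events")

# Route only high-priority events (e.g., payment confirmations) to the
# latency-sensitive consumer; everything else falls through to a
# lower-priority rule and queue.
events.put_rule(
    Name="high-priority-orders",
    EventBusName="orders-bus",  # placeholder bus
    EventPattern=json.dumps({
        "source": ["orders.service"],
        "detail": {"priority": ["high"]},
    }),
    State="ENABLED",
)

events.put_targets(
    Rule="high-priority-orders",
    EventBusName="orders-bus",
    Targets=[{
        "Id": "high-priority-consumer",
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:orders-critical",  # placeholder
    }],
)
```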

Final Thoughts

When we're architecting event-driven systems at scale, resilience isn't about avoiding failure; it's about embracing it. We're not chasing some mythical "perfect" system. Instead, we're building systems that can take a hit and keep running.

Think about it: robust retry mechanisms that don't cascade into system-wide failures, elasticity that absorbs traffic spikes without breaking a sweat, and failure modes that are predictable and manageable. That's the goal. But if you’re just starting, building a resilient system can feel overwhelming. Where do you even begin?

Start small! Try building a sample event-driven application using Amazon SQS and AWS Lambda. Don't try anything fancy in the beginning: just a simple queue and a Lambda function. Once you get that working, explore other features like DLQs and failure handling. Then use an EventBridge event bus and learn how events can be routed to different targets using rules. Once you get comfortable, layer in techniques like shuffle sharding and autoscaling provisioned concurrency based on metrics.
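
If it helps to have a starting point, that first iteration can be as small as the sketch below. It assumes an SQS trigger is already configured on the function; the body format and print statement are just placeholders to make the flow observable.

```python
import json


def handler(event, context):
    """Minimal consumer for an SQS-triggered Lambda."""
    for record in event["Records"]:
        payload = json.loads(record["body"])
        # Start with something trivially observable, then grow from here:
        # DLQs, partial batch failures, idempotency, and so on.
        print(f"processing message {record['messageId']}: {payload}")
    return {"processed": len(event["Records"])}
```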

If you are looking for practical examples and tutorials, Serverless Land is a great place to explore patterns, code, and architectural guidance tailored for AWS Native EDA systems.

Building resilience isn’t a single step, it’s a mindset. Start small, learn from how your system behaves, and layer in complexity as your confidence grows!

