When the Cloud Gets Bombed: Rethinking Resilience After AWS Middle East
"Structural damage, power disruptions, and water damage from fire suppression activities."
Over the weekend, something happened that most enterprise architects and business leaders probably never seriously modelled: drone strikes took out Amazon Web Services data centres in the Middle East.
According to AWS's own status update from March 2nd, two facilities in the UAE (ME-CENTRAL-1) were directly struck, while a strike near their Bahrain facility (ME-SOUTH-1) caused collateral damage. The result? Structural damage, power disruptions, and water damage from fire suppression activities. Two entire AWS regions, offline simultaneously, with recovery timelines measured in "we're working with local authorities" rather than hours.
This isn't a hypothetical scenario from a tabletop exercise. This actually happened. And it forces a conversation that many of us in the technology industry have been avoiding.
The Cloud Is a Place
We've gotten comfortable talking about "the cloud" as if it were some abstract, ethereal thing. It's everywhere and nowhere. It's infinitely scalable. It's someone else's problem.
But the cloud is not abstract. The cloud is buildings. Buildings with cooling systems and power feeds and fire suppression equipment. Buildings that exist in specific countries with specific governments, specific neighbours, and specific geopolitical realities.
The abstraction layer we rely on as architects and executives can obscure real-world fragility. We draw diagrams with little cloud icons and assume the magic happens. We forget that underneath those icons are concrete, steel, diesel generators, and people who need to be safe before anyone worries about your application's uptime.
We Planned for Failure, But Did We Plan for War?
Most enterprise disaster recovery strategies are designed around a particular set of assumptions: hardware fails randomly, power grids have outages, fibre gets cut by backhoes, natural disasters hit specific geographic areas. We design for these scenarios because they're statistically predictable.
Multi-region deployments, availability zones, automated failover, geo-redundant storage. We've built sophisticated systems to handle these risks. And they work. AWS, Azure, and Google Cloud have genuinely delivered resilience that most organizations could never have built for themselves.
But here's the uncomfortable question: how many disaster recovery plans explicitly model a military attack on infrastructure?
Traditional DR assumes randomness. A server fails here, a network switch fails there. We calculate mean time between failures and design accordingly.
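To put numbers on that: the textbook steady-state model ties availability to mean time between failures (MTBF) and mean time to repair (MTTR), and redundancy compounds the benefit as long as failures are independent. A rough sketch, with purely illustrative figures:

```python
# Steady-state availability from MTBF and MTTR (figures are illustrative only).
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time a component is expected to be up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# A single server: fails roughly every 10,000 hours, takes 4 hours to repair.
single = availability(10_000, 4)

# Two independent replicas in separate availability zones:
# the pair is only down when both happen to be down at once.
pair = 1 - (1 - single) ** 2

print(f"single server:  {single:.5%}")   # ~99.96%
print(f"redundant pair: {pair:.7%}")     # ~99.99998%
```

Notice how much of that model leans on failures being independent and random.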
Military conflict is not random. It's targeted and strategic. When infrastructure becomes a legitimate military objective, the risk model changes completely. You're no longer dealing with incremental, statistically predictable failures. You're dealing with deliberate, coordinated attacks designed to cause maximum disruption.
Multi-Region Is Not Multi-Geopolitical
I've sat in plenty of architecture reviews where teams proudly explain their multi-region deployment strategy. "We're in us-east-1 AND us-west-2! We're covered!"
And for many scenarios, they are. But "multi-region" within a single provider, or even within a single geopolitical theatre, is not true diversification.
ME-CENTRAL-1 and ME-SOUTH-1 are in different countries (UAE and Bahrain), but they share a geopolitical context. When tensions escalate in the Persian Gulf, both regions are exposed simultaneously, as we just witnessed.
Risk domains need to include political and military stability, not just seismic zones and floodplains. The question isn't just "are these regions far enough apart?" but "are these regions likely to be affected by the same geopolitical events?"
For global companies, distributing workloads across politically diverse regions may need to become a strategic risk management decision, not just a latency optimization.
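If I were sketching what that looks like in practice, I'd treat geopolitical theatre as an explicit axis in region selection, alongside latency and cost. A minimal illustration, in which the theatre groupings are my own rough labels rather than anything the providers publish:

```python
# Hypothetical mapping of cloud regions to geopolitical theatres.
# These groupings are illustrative assumptions, not an official taxonomy.
THEATRE = {
    "me-central-1": "gulf",
    "me-south-1": "gulf",
    "eu-west-1": "western-europe",
    "eu-central-1": "western-europe",
    "us-east-1": "north-america",
    "us-west-2": "north-america",
    "ap-southeast-2": "oceania",
}

def theatres_spanned(deployed_regions: list[str]) -> set[str]:
    """Return the distinct geopolitical theatres a deployment actually covers."""
    return {THEATRE.get(region, "unknown") for region in deployed_regions}

# Two regions, two countries, but one theatre and one shared risk.
print(theatres_spanned(["me-central-1", "me-south-1"]))  # {'gulf'}
print(theatres_spanned(["me-central-1", "eu-west-1"]))   # {'gulf', 'western-europe'}
```

A deployment review could then flag any tier-one workload whose theatre set contains exactly one entry, no matter how many regions it spans.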
Concentration Risk in Hyperscalers
Let me be clear: the hyperscalers (AWS, Azure, Google Cloud) provide genuine resilience within their design assumptions. They have engineering capabilities that dwarf what any individual company could build. They've made multi-availability-zone, multi-region architectures accessible to organizations that never could have built them alone.
But customers inherit geographic concentration risk whether they realize it or not.
The economic gravity of cloud has centralized workloads in fewer providers than we like to admit. When I look at most enterprise architectures, there's a primary cloud provider handling 80-90% of workloads, with maybe some token presence elsewhere. That's not multi-cloud resilience. That's economic convenience.
And when your primary provider has infrastructure in a conflict zone, you have exposure in that conflict zone, even if your applications are deployed elsewhere. Supply chain dependencies, operational priorities, and even engineering attention and resources are all affected during a major incident.
Conflict as a First-Class Risk Scenario
Boards discuss cybersecurity risks now. Ransomware is on every audit committee's radar. Good.
But few organizations explicitly model physical attacks on cloud infrastructure. Drone strikes. Missile attacks. State-level infrastructure targeting. This sounds like science fiction until it happens.
The reality is that escalating global tensions make physical targeting of infrastructure a plausible scenario, not a theoretical one. Critical infrastructure has always been a military target. Data centres are now critical infrastructure. The logic follows.
This isn't about fear-mongering. It's about updating our threat models to reflect the world as it actually is, not the world as we wish it were.
The Physical Layer Is Back
For years, we've pushed the physical layer out of our consciousness. Infrastructure-as-a-service meant we didn't have to think about power grids, cooling systems, fire suppression, or physical security. That was the provider's problem.
But during a physical attack, all of those systems become failure domains again.
Power delivery gets disrupted. Cooling systems fail. Fire suppression activates, and water damage cascades into extended outages. The AWS update explicitly mentions "fire suppression activities that resulted in additional water damage." A building fire got put out, which is good, but now you have water in places water shouldn't be.
And recovery timelines during active conflict differ fundamentally from recovery during storms. Access becomes complicated. Safety becomes paramount. Logistics become unpredictable. You can't just dispatch repair crews when there are active military operations in the area.
Hard Questions for Leadership Teams
If you're a CTO, CIO, or board member, this incident should prompt some uncomfortable questions:
If a region disappears for 30 days, what actually breaks? Not "what's our RTO?" but, genuinely: what business processes stop? What customer commitments can't be met? What data becomes inaccessible?
If two regions in the same geopolitical theatre are affected simultaneously, what then? Most multi-region strategies assume one region fails while others remain operational. What's the plan when that assumption doesn't hold?
Do we actually know our blast radius, or are we assuming the provider has it handled? AWS's shared responsibility model is clear about what they're responsible for. It doesn't include protecting you from geopolitical risk.
Does our business continuity plan assume regional inaccessibility for weeks? Not hours. Not days. Weeks. With no clear timeline for recovery.
Are failover tests designed around partial degradation or total regional loss? There's a big difference between "one availability zone is having issues" and "the entire region is a smoking crater."
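On the blast-radius question in particular, even a crude inventory beats assumption. As a starting point, a sketch along these lines counts EC2 instances per region with boto3; a real exercise would cover every service in use, plus the data and third-party dependencies behind them:

```python
# Rough per-region footprint: how much do we actually run in each AWS region?
# Needs read-only EC2 credentials; EC2 is just the example here, and a real
# inventory would span every service in use. Pagination is omitted for brevity.
import boto3

session = boto3.session.Session()

for region in session.get_available_regions("ec2"):
    ec2 = session.client("ec2", region_name=region)
    try:
        reservations = ec2.describe_instances()["Reservations"]
    except Exception:
        continue  # region not enabled for this account, or no permission
    instance_count = sum(len(r["Instances"]) for r in reservations)
    if instance_count:
        print(f"{region}: {instance_count} instances")
```

It won't answer the business-process questions on its own, but it turns "we assume the provider has it handled" into a number someone has to defend.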
Rethinking Business Continuity
I think this incident exposes a gap in how many organizations think about resilience.
We've optimized for steady state with occasional disruptions. Our architectures are designed to handle brief outages, partial degradation, and graceful failover. We measure success in nines of availability.
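Those nines translate into annual downtime budgets measured in minutes or hours. Put a multi-week regional outage into the same units and the mismatch is obvious; a quick back-of-the-envelope sketch:

```python
# Annual downtime implied by common availability targets,
# next to a sustained regional outage in the same units.
MINUTES_PER_YEAR = 365 * 24 * 60

for target in (0.999, 0.9999, 0.99999):
    budget = (1 - target) * MINUTES_PER_YEAR
    print(f"{target:.3%} availability -> {budget:7,.0f} minutes of downtime per year")

three_weeks = 3 * 7 * 24 * 60
print(f"three-week regional outage -> {three_weeks:7,} minutes")
```

Against a thirty-thousand-minute event, the difference between three nines and five nines is a rounding error.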
But can your organization operate with sustained, degraded capabilities? Not for an hour during a failover, but for weeks while a region is rebuilt? Do you have manual fallbacks for automated processes? Can you serve customers at reduced capacity, or is it all-or-nothing?
These questions move beyond technical architecture into operational readiness. And they require executive leadership to be part of the conversation, not just architects and SREs.
A New Maturity Model
Perhaps we need to think about organizational resilience maturity differently:
- Stage 1: On-premises risk (single site, single point of failure)
- Stage 2: Single-region cloud risk (better than on-prem, still geographically concentrated)
- Stage 3: Multi-region cloud risk (handles hardware failures and natural disasters)
- Stage 4: Multi-provider, multi-geopolitical resilience (handles provider-level and regional political risk)
- Stage 5: Operational readiness for sustained geopolitical disruption (organizational capability to function during prolonged instability)
Most organizations I work with are somewhere between stages 2 and 3. This incident suggests that stage 4 deserves more attention than it's getting.
The Core Reflection
For years, the narrative has been that cloud increases resilience beyond what any single company could build. And that's still largely true. I'm not arguing against cloud adoption. The hyperscalers have engineering capabilities and operational maturity that most organizations simply cannot match.
But resilience is contextual. It's not an absolute property. A system that's resilient against hardware failure and power outages may not be resilient against military attack. A system that's resilient against localized natural disasters may not be resilient against regional conflict.
The threat model has changed. Not hypothetically. Actually.
The cloud is not immune to the world's instability. It is embedded within it. Data centres exist in the same world as drone strikes and missile attacks and geopolitical conflict. The abstraction layer doesn't change that. It just makes it easier to forget.
The question is not whether the cloud is resilient. The question is whether your strategy accounts for the kinds of events we once dismissed as unthinkable.
This weekend, those events stopped being unthinkable. What happens next is up to you.