May 10, 2026

Single-AZ Isn't Wrong. Accidental Single-AZ Is.

A thermal event in one Amazon Web Services (AWS) Northern Virginia data center cut power to part of a single availability zone, knocking out the EC2 instances and EBS volumes running on the affected hardware for several hours. Coinbase, FanDuel, CME Group. Major platforms degraded because of the temperature inside a single building. Single-AZ workloads that happened to sit on the affected hardware were completely unavailable for the duration, with restore-from-backup as the only recovery path, if they had backups at all.

The reflexive industry response is "This is why everyone should be multi-AZ." It's not wrong, exactly, but it skips the harder question: was the single-AZ deployment a deliberate decision with the right RTO baked in, or an inherited assumption no one audited? Single-AZ isn't automatically wrong. Accidental single-AZ is.

To get to that distinction honestly, you have to understand what actually failed and why, then apply a real risk methodology to the result. That's what the rest of this is about.

What an Availability Zone Actually Is

An availability zone is one or more discrete data centers with independent power, cooling, and physical security. Inside a single AZ, AWS may operate multiple buildings, and the us-east-1 thermal event hit one of those buildings. AWS was able to shift traffic away from the impacted zone, but instances and volumes anchored to the affected hardware stayed down. The blast radius was smaller than the AZ, and still big enough to matter. AZs in a region are connected by low-latency redundant fiber, typically single-digit milliseconds round-trip, which makes cross-AZ replication a viable design choice. Cross-region usually isn't, at least not synchronously.

Here's the part that matters for the rest of this discussion: AZ isolation is a data-plane concept. The compute, block storage, and in-AZ networking inside the box are isolated from peer AZs by design. The APIs that manage those resources are not. Most architects know this; not all architectures reflect it.

The same point shows up in resource scoping. EC2 instances, EBS volumes, and RDS instances are AZ-scoped: they live in a particular AZ. ELB, S3, IAM, and Route 53 are regional or global. The control planes that orchestrate any of them are regional too.

The Failure Hierarchy

Failures don't all live at one tier. A useful risk model has to account for the full stack:

  • Instance-level. A single host fails; Auto Scaling Groups (ASGs) detect it and launch a replacement. Largely solved.
  • AZ-level. What happened in us-east-1. Mitigated by multi-AZ deployment of stateless tiers and replicated stateful services like RDS Multi-AZ. This is the tier where most "are we resilient?" conversations stop.
  • Regional-service / control-plane. The tier most architectures don't account for. EC2 RunInstances, EBS attach/detach, and ASG scaling are all regional APIs. When the control plane is processing every other customer's failover at once, even your healthy AZ can't scale or attach storage on the original timeline.
  • Region-level. Rare but real: February 2017 (S3 in us-east-1, for hours), December 2021 (DNS and SSO substrate). Only multi-region helps.
  • Global-service level. IAM, Route 53, CloudFront. Designed across regions, but historically with us-east-1 dependencies that have surfaced in prior incidents.

us-east-1 deserves its own note. It is the oldest region (online since 2006) and has the largest blast radius for global-service issues because so many production workloads live there. It has also historically housed control-plane components for some global services (IAM is the canonical example), and although AWS has progressively decentralized this, the legacy is real. The implication is that us-east-1 is not a neutral choice. Anchoring production-critical workloads there is an input to the risk model most architects never name explicitly.

Why Multi-AZ Workloads in us-east-1 Still Felt It

Multi-AZ at the data plane is necessary but not sufficient. The mechanisms that make multi-AZ save you (scaling out the surviving AZ, attaching replacement storage, absorbing redirected load) depend on conditions a multi-hour AZ outage doesn't reliably give you. Some hit regional APIs that are themselves overloaded; others assume capacity that was never provisioned.

Three examples make the pattern obvious.

  1. An Auto Scaling Group configured across us-east-1c and us-east-1d. 1c goes dark. The ASG detects the loss and tries to launch replacement capacity in 1d. RunInstances queues behind every other customer in the region doing the same thing. New instances that normally take 90 seconds now take 20 minutes. The ASG behavior is correct; the timeline is not.

  2. A stateful service with replicas in 1c and 1d. When 1c fails, 1d carries the load alone. Restoring redundancy means launching a fresh replica with attached EBS volumes, and EBS attach is a regional API. During the event it's processing every other customer's recovery at the same time. Attach calls slow, some time out, and the cluster runs without a peer until the control plane catches up. The AZ is healthy; the recovery path isn't.

  3. An ELB configured across both AZs. Health checks correctly mark 1c targets unhealthy and drain them. But the surviving 1d capacity was sized to absorb a few minutes of failover, not several hours of full load. Latency climbs. Some requests fail. The architecture passed every test except the one that lasted.

The practitioner takeaway is that designing for AZ failure means designing for the failover mechanism to be degraded too: pre-warmed capacity in each AZ, runtime independence from regional APIs during the event window, and stateful services that can run in a known-degraded mode without the control plane in front of them.
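The pre-warmed capacity point reduces to simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, and it assumes interchangeable instances with load spread evenly across AZs. It sizes each AZ so the survivors can carry peak load without a single RunInstances call during the event.

```python
import math


def per_az_capacity(peak_instances: int, az_count: int,
                    az_failures_tolerated: int = 1) -> int:
    """Instances to keep running in EACH AZ so the surviving AZs can
    carry peak load without launching anything new during the event."""
    surviving = az_count - az_failures_tolerated
    if surviving < 1:
        raise ValueError("at least one AZ must survive")
    return math.ceil(peak_instances / surviving)


def overprovision_ratio(peak_instances: int, az_count: int,
                        az_failures_tolerated: int = 1) -> float:
    """Steady-state fleet size relative to what peak load alone needs."""
    per_az = per_az_capacity(peak_instances, az_count, az_failures_tolerated)
    return per_az * az_count / peak_instances


# Hypothetical workload: 12 instances needed at peak.
for azs in (2, 3):
    print(azs, per_az_capacity(12, azs), overprovision_ratio(12, azs))
```

With 12 instances needed at peak, two AZs require 12 instances each (2x the fleet), while three AZs require 6 each (1.5x). Spreading across more AZs is one way to lower the cost of not depending on the control plane mid-event.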

What AWS Actually Promises

A single EC2 instance has a 99.5% monthly uptime SLA. That works out to roughly 3.6 hours per month of allowed downtime before AWS owes credits. A multi-AZ deployment in a region has a 99.99% SLA, or about 4.3 minutes per month. Those figures come from AWS's published Compute SLA.
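The conversion from an SLA percentage to a downtime allowance is worth keeping as a one-liner. This is a minimal sketch assuming a 30-day month; the function name is illustrative, and the SLA's own definitions of the measurement period still govern.

```python
def allowed_downtime_minutes(sla_percent: float, days_in_month: int = 30) -> float:
    """Downtime permitted per month before the SLA is breached."""
    minutes_in_month = days_in_month * 24 * 60
    return (1 - sla_percent / 100) * minutes_in_month


for label, sla in (("single instance", 99.5), ("multi-AZ", 99.99)):
    minutes = allowed_downtime_minutes(sla)
    print(f"{label}: {minutes:.1f} min/month ({minutes / 60:.2f} h)")
```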

The SLA is a credit floor, not a reliability target. Real reliability is usually higher, sometimes dramatically. But the SLA is what's enforceable, which makes it the number the risk register should use. The shared responsibility model means those numbers describe what AWS owes you, not what you owe your customers. AWS guarantees the resilience of the cloud. You guarantee the resilience of your architecture in the cloud. "We use AWS, so we have AWS uptime" is a category error that shows up in too many DR plans.

The risk assessment is the bridge between what AWS owes you and what your architecture actually needs. Those two numbers are rarely the same.

A Framework, Not a Checklist

The same risk methodology a security team uses for any threat applies cleanly to availability: estimate likelihood, estimate impact, choose a treatment, document the residual. The structure isn't novel; the inputs are.

Three likelihood estimates matter. AZ-level, customer-impacting events run in the low single-digit percent annualized across a region, which works out to under one percent for any specific AZ. Regional events run sub-one-percent. Control-plane storms (the kind that take "multi-AZ" architectures into the same hole) are the residual risk most architectures don't account for. AWS doesn't publish these, and they aren't actuarial, but defensible ranges exist from public incident history.

The impact side is where most of the work is. Customer SLA sets a contractual floor. Regulatory floors (PCI DSS contingency planning, HIPAA, SOC 2) set another. On top of those sits direct revenue impact per hour, broken into peak and off-peak. Then come the indirect costs: SLA credit exposure, renewal churn, reputational exposure. Last are the soft costs: engineering, support, and executive attention.

Treatment tiers are well-defined. Backup-and-restore is cheapest and slowest. Multi-AZ typically adds 30-50% to the affected tier. Multi-region, active-passive runs 1.7-2x baseline. Active-active runs around 2.5x plus an operational complexity tax: eventual consistency, cross-region replication, at least one engineer's worth of ongoing care.

One trap is worth naming. Contractual constraints can override expected-value reasoning. A customer SLA at 99.9% cannot be met with single-AZ EC2, which has a 99.5% AWS SLA, regardless of what the expected-loss math says. Run the expected-value math, but first confirm that the tier it points to can actually meet the contract.

Workload 1: An Internal Employee Tool

Start with a simple case: the HR portal, the internal wiki, the expense system. Roughly 200 employees, no external customers.

There's no contractual SLA. Internal tolerance at the 8-hour mark sits closer to "annoying" than "catastrophic". Direct revenue impact is zero. Productivity impact is roughly 200 employees × 8 hours × $50 an hour fully loaded, or about $80K per event. AZ-failure likelihood around 0.75% annualized puts expected annual loss near $600. Multi-AZ uplift for the affected tier runs $3-5K a month, or $36-60K a year.
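The arithmetic for this workload fits in a few lines. The sketch below uses the figures from the paragraph above; the function names are illustrative, and the 0.75% likelihood is the estimate used here, not a published number.

```python
def productivity_impact(employees: int, outage_hours: float,
                        loaded_hourly_rate: float) -> float:
    """Cost of idle staff for one outage."""
    return employees * outage_hours * loaded_hourly_rate


def expected_annual_loss(impact_per_event: float, annual_likelihood: float) -> float:
    """Impact weighted by the chance of the event in a given year."""
    return impact_per_event * annual_likelihood


impact = productivity_impact(employees=200, outage_hours=8, loaded_hourly_rate=50.0)
expected = expected_annual_loss(impact, annual_likelihood=0.0075)

# Multi-AZ uplift of $3-5K a month, annualized.
uplift_low, uplift_high = 3_000 * 12, 5_000 * 12

print(f"impact per event:      ${impact:,.0f}")
print(f"expected annual loss:  ${expected:,.0f}")
print(f"multi-AZ annual uplift: ${uplift_low:,} to ${uplift_high:,}")
```

The uplift exceeds the expected loss by roughly two orders of magnitude, which is why the comparison is not close.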

The math points cleanly at single-AZ, with a backup-and-restore tier documented at an 8-hour RTO and a 24-hour RPO.

The math is the easy part. The defensibility test is whether leadership has explicitly accepted that 8-hour RTO. If the CEO finds out the tool was down for a day and demands an explanation, the answer isn't a math problem, it's a conversation-that-was-never-had problem.

Workload 2: A Customer-Facing SaaS API

The Setup

A mid-market B2B SaaS at roughly $60M ARR, around 500 customers. Customer-facing transactional API plus a web tier. Stateful on RDS PostgreSQL, S3 for object storage. Customer SLA: 99.9% monthly. Not PCI- or HIPAA-bearing. Engineering: twelve people, no dedicated SRE function. The workload most mid-market organizations are actually running.

Step 1: Derive RTO/RPO

A 99.9% SLA gives you 43 minutes of allowed monthly downtime, planned and unplanned together. Reserve roughly a third of that for planned maintenance windows, and the unplanned RTO target lands around 30 minutes. On the data side, the customer-trust impact of lost transactional writes is meaningfully worse than the impact of being down, so the RPO needs to be tighter than the RTO. Set it at 5 minutes.
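The budget split can be written down explicitly. This is a sketch under the same assumptions as the paragraph above (30-day month, one third reserved for planned maintenance); the function and key names are illustrative.

```python
def downtime_budget(sla_percent: float, planned_share: float,
                    days_in_month: int = 30) -> dict:
    """Split the monthly downtime allowance into planned and unplanned minutes."""
    total = (1 - sla_percent / 100) * days_in_month * 24 * 60
    planned = total * planned_share
    return {"total": total, "planned": planned, "unplanned": total - planned}


budget = downtime_budget(sla_percent=99.9, planned_share=1 / 3)
rpo_minutes = 5  # chosen tighter than the RTO, per the reasoning above

for name, minutes in budget.items():
    print(f"{name:>9}: {minutes:.1f} min")
print("RPO tighter than RTO:", rpo_minutes < budget["unplanned"])
```

The unplanned share comes out just under 29 minutes, which is where the 30-minute RTO target comes from.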

Step 2: Quantify Impact

Direct revenue runs $60M divided by 8,760 hours, or about $6,850 an hour straight-line. Concentrated in business hours, that's roughly $20K at peak, $2K off-peak. Plan with a blended $10K. SLA credits typically trigger at the 99.9% breach point: 10% MRR credit on affected customers, around $17K a day of exposure once a breach starts. Renewal risk is harder to anchor but documented in B2B SaaS. Assume 1-2% renewal-cohort impact per visible outage. Soft costs (engineering, customer success, executive attention) are real but smaller. Model it out: a 4-hour outage runs around $100K total; a full-day event lands near $250K.
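Two of those inputs are mechanical enough to compute directly. The sketch below reproduces the straight-line hourly figure and the daily credit exposure; it assumes every customer is affected and a 30-day month, and the function names are illustrative. The peak, off-peak, and total-outage figures are modeling judgments and are not derived here.

```python
HOURS_PER_YEAR = 8_760


def straight_line_hourly_revenue(arr: float) -> float:
    """Annual recurring revenue spread evenly over every hour of the year."""
    return arr / HOURS_PER_YEAR


def daily_credit_exposure(arr: float, credit_share: float = 0.10,
                          days_in_month: int = 30) -> float:
    """SLA credit owed per day of breach, assuming all customers are affected."""
    monthly_recurring_revenue = arr / 12
    return monthly_recurring_revenue * credit_share / days_in_month


arr = 60_000_000
print(f"straight-line revenue: ${straight_line_hourly_revenue(arr):,.0f}/hour")
print(f"credit exposure:       ${daily_credit_exposure(arr):,.0f}/day")
```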

Step 3: Quantify Likelihood

An AZ-level event affecting a specific AZ runs around 0.75% annualized. That's an estimate practitioners can defend from public incident history, not an actuarial figure AWS publishes. Regional events run roughly 0.2% annualized. Control-plane storms affecting multi-AZ workloads don't have a clean number, but they're increasingly the dominant residual risk once you're past the AZ tier.

Step 4: Cost the Treatment Tiers

Assume the API and database tier currently runs about $20K a month. Single-AZ adds nothing on top; expected annualized loss is 0.75% × $250K + 0.2% × $250K, or about $2,375 a year. Multi-AZ doubles the database via RDS Multi-AZ (the stateless tier across AZs is near-free), total uplift $5-7K a month, or $60-84K a year. Multi-region, active-passive (warm standby with full replicas) runs 1.7-2x baseline, around $200K a year of uplift. Active-active runs 2.5x plus an operational complexity tax (eventual consistency, cross-region replication, one to two engineers' worth of ongoing care), $300K+ a year all in.
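The single-AZ expected loss and the multi-AZ uplift are easy to check. A minimal sketch, using the likelihood estimates and the $250K full-day event cost from the steps above; the constant and function names are illustrative.

```python
AZ_EVENT_LIKELIHOOD = 0.0075       # annualized estimate, not a published figure
REGIONAL_EVENT_LIKELIHOOD = 0.002  # annualized estimate, not a published figure
EVENT_COST = 250_000               # modeled full-day event


def expected_annual_loss(event_cost: float, annual_likelihoods: list) -> float:
    """Sum of likelihood-weighted losses across independent event types."""
    return sum(p * event_cost for p in annual_likelihoods)


single_az_loss = expected_annual_loss(
    EVENT_COST, [AZ_EVENT_LIKELIHOOD, REGIONAL_EVENT_LIKELIHOOD]
)
multi_az_uplift = tuple(monthly * 12 for monthly in (5_000, 7_000))

print(f"single-AZ expected loss: ${single_az_loss:,.0f}/year")
print(f"multi-AZ uplift:         ${multi_az_uplift[0]:,} to ${multi_az_uplift[1]:,}/year")
```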

Step 5: The Decision

Three decisions fall out of those numbers.

Single-AZ has the lowest expected annual loss on paper. But the 99.9% customer SLA cannot be met with single-AZ EC2. The AWS instance-level SLA caps at 99.5%, which permits about 3.6 hours of downtime a month against the 43 minutes the customer contract allows. Single-AZ fails on contract, not on math.

Multi-region passes the contract test, but the math doesn't justify it. Roughly $200K a year of uplift to mitigate about 0.2% of additional annualized exposure on a $250K event works out to around $500 a year of expected loss reduction. Multi-region fails on math.

Multi-AZ is the defensible decision. It meets the SLA. It mitigates the dominant likelihood. It accepts regional events and control-plane storms as documented residual risk, mitigated by a tested backup-and-restore tier with an under-8-hour RTO as the secondary fallback.
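The order of the two tests matters: contract first, math second. The sketch below encodes that ordering. It is illustrative only: the Tier fields are hypothetical, the multi-AZ uplift uses the midpoint of the $60-84K range, and the residual losses assume multi-AZ fully absorbs AZ events and multi-region fully absorbs regional ones.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Tier:
    name: str
    meets_customer_sla: bool
    annual_uplift: float
    residual_expected_loss: float

    @property
    def annual_cost(self) -> float:
        return self.annual_uplift + self.residual_expected_loss


def choose_tier(tiers: List[Tier]) -> Tier:
    """Contract test first, then the cheapest tier among those that pass."""
    eligible = [t for t in tiers if t.meets_customer_sla]
    if not eligible:
        raise ValueError("no tier satisfies the customer SLA")
    return min(eligible, key=lambda t: t.annual_cost)


TIERS = [
    Tier("single-AZ", False, 0, 2_375),
    Tier("multi-AZ", True, 72_000, 500),
    Tier("multi-region active-passive", True, 200_000, 0),
    Tier("active-active", True, 300_000, 0),
]

print("math alone:    ", min(TIERS, key=lambda t: t.annual_cost).name)
print("contract first:", choose_tier(TIERS).name)
```

Math alone picks single-AZ; applying the contract test first picks multi-AZ, matching the reasoning above.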

That decision becomes one line in a risk register:

RR-2026-014: Customer-Facing SaaS API Availability
Asset:                 Customer API + database tier (us-east-1)
Threat:                AZ-level infrastructure failure / regional control-plane disruption
Likelihood:            Medium (~0.75% annualized AZ; ~0.2% regional)
Impact:                High ($250K modeled per multi-hour event; SLA breach + credit exposure)
Inherent risk:         High
Treatment:             Multi-AZ deployment (RDS Multi-AZ; ASG across two AZs)
                       Backup/restore tier with <8h RTO as secondary fallback
Residual risk:         Medium (regional events / control-plane storms)
Residual accepted by:  VP Engineering / CTO
Verification:          Annual DR test (next: 2026-Q3)
Review:                2027-05

That block is what operationalizes the conversation: the bet, written down, with a named approver and a review date. It is also the simplest test of whether the risk assessment actually happened: if you can't produce one, you didn't.
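One way to keep such an entry from going stale is to hold it as structured data and check it on a schedule. The sketch below is an assumption about how a team might do that, not a standard format: the class, its fields, and the checks are all hypothetical, and the review month is pinned to its first day for comparison.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class RiskRegisterEntry:
    risk_id: str
    treatment: str
    residual_risk: str
    accepted_by: str
    next_verification: str
    review_due: date

    def problems(self, today: date) -> List[str]:
        """Return the reasons this entry no longer counts as a deliberate bet."""
        found = []
        if not self.accepted_by.strip():
            found.append("no named approver for the residual risk")
        if not self.next_verification.strip():
            found.append("no DR test scheduled")
        if self.review_due < today:
            found.append("review date has passed")
        return found


entry = RiskRegisterEntry(
    risk_id="RR-2026-014",
    treatment="Multi-AZ deployment; backup/restore with <8h RTO as fallback",
    residual_risk="Medium (regional events / control-plane storms)",
    accepted_by="VP Engineering / CTO",
    next_verification="2026-Q3",
    review_due=date(2027, 5, 1),  # "2027-05", day assumed
)

print(entry.problems(today=date(2026, 5, 10)))
```

An empty list means the entry still has an approver, a scheduled test, and a live review date.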

Workload 3: A Regulated Transactional Flow

Now a payment processing flow. The customer SLA is 99.95%, or 22 minutes a month. The regulatory floor is PCI DSS contingency planning, with documented BCP requirements. Revenue during peak transaction windows runs $50-200K an hour. A 4-hour outage during peak is around $400K of direct revenue plus regulatory exposure plus customer-trust damage in a financial-services context that doesn't forgive.

Single-AZ is non-compliant on SLA and inadequate against PCI's contingency expectations. Out. Multi-AZ meets the SLA, but a multi-hour regional outage during peak is contractually and reputationally unacceptable. Multi-region, active-passive runs $150-300K a year of uplift on top of multi-AZ. The expected-value math alone wouldn't quite get there, but regulatory pressure and customer-trust exposure push the decision over the line. Active-active was considered and rejected. Operational complexity (consistent ledger, double-bookkeeping risk) exceeds the marginal RTO improvement.

The decision is multi-region, active-passive with a 5-minute RTO and a sub-30-second RPO. Same methodology, different inputs, materially different right answer.

The Common Failure Mode

Two artifacts close the gap between assumed and actual recovery. The risk register entry: the documented bet, with the math, residual risks accepted, named approver, review cadence. The DR test: the verification that the bet is the bet you think it is.

Risk register without DR test is theory. DR test without risk register is a tactical exercise without a strategic anchor.

In real-world failures, the weak point is usually a runbook that had never been exercised. Teams rediscover their failover limits during the event itself, when it costs the most.

If your last DR test was a meeting, today is the day to schedule a real one.

The Deliberate Bet

Every architecture is a bet on availability. The only question is whether yours is one you'd defend in front of your board.

Single-AZ is a perfectly defensible bet for the right workload, in the right context, with leadership's accepted RTO sitting in the risk register. It becomes a liability the moment it's an inherited assumption no one has audited.

The cheap tier and the expensive tier can both be defensible architectures. The difference is whether the choice was made deliberately, with the residual risk named, accepted, and revisited, or made by default, by inertia, by no one in particular.

The organizations that struggle when an outage hits are rarely the ones with weak controls. They're the ones who discover, mid-incident, that the architecture they have isn't the architecture leadership thought they had.