When Load Balancers Fail: The 15-Hour AWS Outage That Broke the Internet

On October 20, 2025, Amazon Web Services experienced a catastrophic 15-hour outage in its US-EAST-1 region that crippled over 91 AWS services and thousands of customer applications worldwide [1]. The incident began with seemingly innocent DNS resolution failures for DynamoDB but cascaded into Network Load Balancer health check failures that brought down entire infrastructures. This wasn't just a technical glitch. It was a wake-up call for every developer relying on load balancing systems that seemed infallible until they weren't.

The Domino Effect: How Health Checks Can Betray You

Picture this: your application is running smoothly, handling thousands of requests per second, when suddenly your load balancer starts marking healthy servers as failed. That's exactly what happened during the AWS outage. The root cause? Health check systems in Layer 4 load balancers failed to account for network state propagation delays [2].

⚠️ Watch Out: Many developers assume health checks are infallible, but they're only as reliable as the network they traverse. When underlying infrastructure components experience delays, even sophisticated health checking algorithms can fail catastrophically.

The AWS incident revealed a critical flaw: rapid capacity removal without velocity controls. As health checks failed, the load balancer aggressively removed servers from rotation, creating a cascading failure that overwhelmed the remaining infrastructure [1].
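One way to picture a velocity control is a cap on how many servers may be pulled from rotation within a time window. The sketch below is only an illustration under that assumption, not a description of how AWS implements it; `RemovalLimiter`, `maxRemovals`, and `windowMs` are hypothetical names introduced here.

```javascript
// Hypothetical sketch: cap how fast servers can be removed from rotation.
class RemovalLimiter {
  constructor(maxRemovals, windowMs, now = () => Date.now()) {
    this.maxRemovals = maxRemovals; // removals allowed per window
    this.windowMs = windowMs;       // sliding window length in milliseconds
    this.now = now;                 // injectable clock, handy for tests
    this.removals = [];             // timestamps of recent removals
  }

  // Returns true (and records the removal) if removing a server is allowed now.
  tryRemove() {
    const t = this.now();
    // Forget removals that have aged out of the window.
    this.removals = this.removals.filter(ts => t - ts < this.windowMs);
    if (this.removals.length >= this.maxRemovals) {
      return false; // too many recent removals: keep the server in rotation
    }
    this.removals.push(t);
    return true;
  }
}

// Usage: allow at most 2 removals per 60 seconds, driven by a fake clock.
let fakeTime = 0;
const limiter = new RemovalLimiter(2, 60000, () => fakeTime);
const decisions = [limiter.tryRemove(), limiter.tryRemove(), limiter.tryRemove()];
fakeTime = 61000; // the first two removals have now aged out
decisions.push(limiter.tryRemove());
console.log(decisions); // [ true, true, false, true ]
```

When the limiter says no, the suspect server stays in rotation. That trades a few failed requests for protection against a burst of false health check failures emptying the whole pool.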

Layer 4 vs Layer 7: The Speed vs Intelligence Trade-off

Load balancing isn't one-size-fits-all. Understanding the fundamental difference between Layer 4 and Layer 7 load balancers can save your infrastructure from collapsing under pressure.

Layer 4 (Transport Layer) operates at the TCP/UDP level, offering fast performance with minimal overhead, typically around 1ms of latency [3]. It makes routing decisions based on IP addresses and port numbers without inspecting packet contents. This speed comes at a cost: no content-aware routing, and session persistence is limited to what addresses and ports can provide.

Layer 7 (Application Layer) inspects HTTP headers and application data, enabling intelligent routing based on content, paths, or host headers. This power comes with higher latency, typically 3-8ms of overhead [4]. But here's the plot twist: that extra processing time buys you crucial features like SSL termination, content-based routing, and sophisticated health checking.

💡 Insight: The sweet spot for most applications? A hybrid approach using Layer 7 for edge routing and Layer 4 for internal service-to-service communication.
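To make the Layer 4 versus Layer 7 distinction concrete, here is a minimal sketch of what each layer can "see" when it picks a backend. The function names `pickL4` and `pickL7`, the pool names, and the `example.com` hosts are all hypothetical, and the hash is a simple illustrative one rather than what any particular load balancer uses.

```javascript
// Hypothetical sketch of the information available at each layer.

// Layer 4: only addresses and ports are visible, so the choice is a hash of the tuple.
function pickL4(conn, backends) {
  const key = `${conn.srcIp}:${conn.srcPort}->${conn.dstIp}:${conn.dstPort}`;
  let hash = 0;
  for (const ch of key) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0; // keep it an unsigned 32-bit value
  }
  return backends[hash % backends.length];
}

// Layer 7: the HTTP request has been parsed, so host and path can drive routing.
function pickL7(request, pools) {
  if (request.path.startsWith('/api/')) return pools.api;
  if (request.host === 'static.example.com') return pools.static;
  return pools.web;
}

// Usage
const pools = { api: 'api-pool', static: 'static-pool', web: 'web-pool' };
console.log(pickL7({ host: 'www.example.com', path: '/api/users' }, pools));   // api-pool
console.log(pickL7({ host: 'static.example.com', path: '/logo.png' }, pools)); // static-pool

const conn = { srcIp: '203.0.113.7', srcPort: 51234, dstIp: '198.51.100.1', dstPort: 443 };
const backends = ['10.0.0.1', '10.0.0.2', '10.0.0.3'];
console.log(pickL4(conn, backends) === pickL4(conn, backends)); // true
```

The Layer 4 function never looks at the payload, which is where its speed comes from. The Layer 7 function can only run after the request has been parsed, which is where its extra latency comes from.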

The Algorithm Showdown: Round Robin, Least Connections, and IP Hash

Choosing the right load balancing algorithm is like picking the right tool for the job: each has strengths that shine in specific scenarios.

Round Robin is the simplest and most predictable: distribute requests evenly across all servers. It's perfect for homogeneous servers with identical capabilities. But what if your servers have different specs? Enter Weighted Round Robin, where stronger servers handle more traffic [5].

Least Connections shines when requests have varying processing times. Instead of blindly distributing requests, it routes new traffic to the server with the fewest active connections. This is particularly effective for applications with long-running requests or varying response times [6].

IP Hash maintains session affinity by consistently routing the same client IP to the same server. This is crucial for stateful applications but can lead to uneven distribution if your client IP pool is small [7].

🔥 Hot Take: Most developers default to round robin, but least connections often provides better real-world performance for modern web applications with variable request durations.
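The strategies above are small enough to sketch in a few lines each. The helper names below (`makeRoundRobin`, `makeWeightedRoundRobin`, `pickLeastConnections`, `pickByIpHash`) are hypothetical, the `active` field is an assumed per-server connection counter, and the weighted variant uses a naive "repeat the server" expansion purely for illustration.

```javascript
// Hypothetical sketches of the selection strategies described above.

// Round robin: cycle through the servers in order.
function makeRoundRobin(servers) {
  let i = 0;
  return () => servers[i++ % servers.length];
}

// Weighted round robin (naive): a server with weight 2 appears twice per cycle.
function makeWeightedRoundRobin(weighted) {
  const expanded = weighted.flatMap(({ server, weight }) => Array(weight).fill(server));
  return makeRoundRobin(expanded);
}

// Least connections: pick the server with the fewest active connections.
function pickLeastConnections(servers) {
  return servers.reduce((best, s) => (s.active < best.active ? s : best));
}

// IP hash: the same client IP always maps to the same server.
function pickByIpHash(clientIp, servers) {
  let hash = 0;
  for (const ch of clientIp) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0; // simple unsigned 32-bit hash
  }
  return servers[hash % servers.length];
}

// Usage
const rr = makeRoundRobin(['a', 'b', 'c']);
console.log([rr(), rr(), rr(), rr()]); // [ 'a', 'b', 'c', 'a' ]

const wrr = makeWeightedRoundRobin([
  { server: 'big', weight: 2 },
  { server: 'small', weight: 1 },
]);
console.log([wrr(), wrr(), wrr()]); // [ 'big', 'big', 'small' ]

const least = pickLeastConnections([
  { name: 'a', active: 12 },
  { name: 'b', active: 3 },
  { name: 'c', active: 7 },
]);
console.log(least.name); // b

const pool = ['s1', 's2', 's3'];
console.log(pickByIpHash('203.0.113.7', pool) === pickByIpHash('203.0.113.7', pool)); // true
```

The naive weighted expansion sends traffic to the heavier server in bursts; a production balancer would more likely interleave the picks. Note also that `pickByIpHash` remaps most clients whenever the server list changes size, which is one reason consistent hashing is often preferred for affinity.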

Building Resilient Architecture: Lessons from the Trenches

The AWS outage taught us that resilience isn't about preventing failures. It's about surviving them gracefully. Here's how to build load balancing systems that don't crumble under pressure:

    // Graceful degradation implementation (skeleton)
    class ResilientLoadBalancer {
      constructor(servers) {
        this.servers = servers;
        this.healthCheckInterval = 5000; // time between health checks (presumably milliseconds)
        this.failureThreshold = 3;       // failures tolerated before a server counts as failed
        this.recoveryTimeout = 30000;    // wait before retrying a failed server (presumably milliseconds)
      }

      healthCheck(server) {
        // Implement exponential backoff for health checks
        // Don't mark servers as failed immediately
        // Use velocity controls to prevent rapid capacity removal
      }
    }

The key is implementing graceful degradation rather than aggressive failure detection. Instead of immediately removing failed servers, implement a circuit breaker pattern that gradually reduces traffic to struggling servers while allowing them to recover [8].

🎯 Key Point: Your load balancer should be your most resilient component, not your most fragile. Implement health check delays, velocity controls, and gradual capacity removal to prevent cascading failures.

Real-World Case Study: Amazon Web Services

On October 20, 2025, AWS experienced a massive 15-hour outage in its US-EAST-1 region that affected over 91 AWS services and thousands of customer applications worldwide. The incident began with DNS resolution failures for DynamoDB and cascaded into Network Load Balancer health check failures.

Key Takeaway: Health check systems in Layer 4 load balancers must account for network state propagation delays and implement velocity controls to prevent rapid capacity removal. The incident showed that even sophisticated health checking algorithms can fail when underlying infrastructure components experience delays, making graceful degradation essential.
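The circuit breaker idea from this section can be sketched per server. This is the classic three-state breaker (closed, open, half-open), not the gradual traffic weighting the text describes, and it is not taken from any particular library. `ServerBreaker` is a hypothetical name; `failureThreshold` and `recoveryTimeout` reuse the skeleton's field names, with the timeout assumed to be in milliseconds.

```javascript
// Hypothetical sketch of a per-server circuit breaker.
class ServerBreaker {
  constructor({ failureThreshold = 3, recoveryTimeout = 30000, now = () => Date.now() } = {}) {
    this.failureThreshold = failureThreshold; // consecutive failures before opening
    this.recoveryTimeout = recoveryTimeout;   // ms to wait before probing again (assumed unit)
    this.now = now;                           // injectable clock for testing
    this.failures = 0;                        // consecutive failure count
    this.openedAt = null;                     // null means the breaker is closed
  }

  // 'closed' = normal traffic, 'open' = no traffic, 'half-open' = probe traffic only.
  state() {
    if (this.openedAt === null) return 'closed';
    return this.now() - this.openedAt >= this.recoveryTimeout ? 'half-open' : 'open';
  }

  // Record the outcome of a health check or request.
  record(success) {
    if (success) {
      this.failures = 0;
      this.openedAt = null; // any success closes the breaker again
      return;
    }
    this.failures += 1;
    if (this.failures >= this.failureThreshold) {
      this.openedAt = this.now(); // open, or re-open after a failed probe
    }
  }
}

// Usage with a fake clock
let clock = 0;
const breaker = new ServerBreaker({ now: () => clock });
breaker.record(false);
breaker.record(false);
const afterTwo = breaker.state();   // still tolerated
breaker.record(false);
const afterThree = breaker.state(); // threshold reached
clock = 30000;
const afterWait = breaker.state();  // recovery timeout elapsed
breaker.record(true);
const afterProbe = breaker.state(); // probe succeeded
console.log(afterTwo, afterThree, afterWait, afterProbe); // closed open half-open closed
```

A single failed check never removes a server, and an open breaker gets a chance to recover after the timeout instead of staying out of rotation indefinitely. Gradual reduction could be layered on top by lowering a server's weight as `failures` grows, which this sketch leaves out.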

Resilient Load Balancing Architecture

    flowchart TD
        A[Client Request] --> B[CDN]
        B --> C[Layer 7 Load Balancer]
        C --> D[WAF & Rate Limiting]
        D --> E[Health Check Engine]
        E --> F[Layer 4 Load Balancer]
        F --> G{Server Selection}
        G --> H[Server 1]
        G --> I[Server 2]
        G --> J[Server 3]
        E --> K[Circuit Breaker]
        K --> L[Graceful Degradation]
        L --> M[Auto Scaling Group]
        M --> N[Recovery Timeout]
        N --> E

Did you know? Cisco's LocalDirector, introduced in the mid-1990s, was among the first commercial load balancers, built to cope with the explosive growth of web traffic. Today, AWS describes its Network Load Balancer as capable of handling millions of requests per second.

Key Takeaways

• Layer 4 load balancers offer ~1ms latency but limited routing intelligence
• Layer 7 load balancers provide intelligent routing at a 3-8ms latency cost
• Implement health check delays and velocity controls to prevent cascading failures
• The least connections algorithm often outperforms round robin for variable request durations
• Circuit breaker patterns enable graceful degradation during partial outages

References

[1] Summary of the Amazon DynamoDB Service Disruption in the Northern Virginia (US-EAST-1) Region (article)
[2] Load Balancing Health Checks (documentation)
[3] Network Load Balancer Features (documentation)
[4] Application Load Balancer Features (documentation)
[5] Load Balancing Algorithms (documentation)
[6] IP Hash Load Balancing (documentation)
[7] Circuit Breaker Pattern (blog)
[8] TCP Protocol Specification (documentation)
[9] HTTP/1.1 Protocol Specification (documentation)
[10] Load Balancer Performance Metrics (documentation)
[11] Kubernetes Service Load Balancing (documentation)

Wrapping Up

The AWS outage wasn't just a failure—it was a lesson in humility. Even the most sophisticated load balancing systems can fail when underlying infrastructure experiences delays. The key takeaway? Build for resilience, not perfection. Implement health check delays, velocity controls, and graceful degradation patterns. Your load balancer should be your most resilient component, not your weakest link. Tomorrow, audit your health check configurations and ask yourself: what happens when the network lies?

Satishkumar Dhule
Software Engineer
