Networking & Systems

24 deep dives

Unix intermediate

10 Minutes to Clarity: Uber’s Open-Source Backbone and the Quest for Per-Tenant Error Counts

In Uber's world, metrics exploded across thousands of microservices, and Prometheus alone couldn’t scale to keep up [1]....

Linux advanced

The Slab Whisperer: How to Tame Tail Latency in a High-Concurrency Linux World

Picture this: a Linux host under heavy, multi-tenant load starts coughing up tail latencies of tens of milliseconds during b...

Linux intermediate

What Netflix Learned About Tail Latency on NUMA: A Linux Toolkit-Driven Debugging Journey

One night, Netflix encountered tail latency bursts on a quad-socket NUMA machine as containerized workloads sur...

Linux advanced

The Memory Trap: How to Diagnose Tail Latency in Containerized Linux Clusters

In Elastic Cloud, production memory pressure on Kubernetes nodes surfaced during bursts, driven by kernel memory account...

Unix intermediate

The Night Logs Escaped: Netflix’s Wake-Up Call and a Robust Way to Reclaim Open File Descriptors

One night, a busy Unix host started screaming in silence. Netflix’s Dynomite project faced a production race wh...

Unix beginner

When Petabytes Go Silent: A Netflix-Scale Journey Through Logs

Picture this: Netflix sits on a mountain of logs—petabytes pouring in from thousands of microservices—yet near real-time...

Linux intermediate

A NUMA Tale: Unraveling Tail Latency with Linux Memory Reclaim

In 2013, LinkedIn faced intermittent tail latency spikes on NUMA servers during peak ingestion for an online-graph workl...

Linux advanced

The Burst That Revealed the Hidden Cache War

It was a 50-server Microsoft production cluster—the OneRF lineage—that first whispered the truth: tail latency can spike...

Unix beginner

Two-Phase Compression: How Uber Tamed a Log Mountain Without Breaking the Bank

Uber’s Spark-driven data platform faced log volumes so monstrous that retention costs threatened to swallow the budget. ...

Linux beginner

The Kernel, the Firewall, and the Command Line: A DevOps Journey Through Linux Mastery

It started in Automattic's WordPress VIP infrastructure on Kubernetes: a routine firewall-rule reload slowed to a crawl,...

Unix beginner

The 24-Hour Log Hunt: A One-Liner That Surfaces Busy Users (And Why Knight Capital's Lesson Still Matters)

In August 2012, Knight Capital Group deployed a new trading system. In about 45 minutes, a faulty deployment flooded the...

Linux intermediate

Linux on Fire: A Netflix-style 60-Second Triage That Cracks Tail Latency

Picture this: a Linux node in a high-throughput data ingestion pipeline suddenly shows tail latency spikes after 1s duri...

Networking intermediate

Sticky Sessions at Scale: Booking.com's HAProxy Playbook and the Locality Dilemma

Booking.com scaled its global application delivery network using an internal LBaaS built around HAProxy to manage billio...

Linux intermediate

Latency Unmasked: A Triaged Journey Through Linux Kernel Hurdles

It started with a single, stubborn question: why would a Linux-powered Redis-backed web app experience 30-second tail la...

Linux advanced

The Midnight Mystery: Why Your Linux Server Lies About Memory

It was 3am when the pager went off. Production services were crashing, but `free -m` showed 8GB available RAM. I stared ...

Operating Systems advanced

The $2 Million Memory Mistake That Broke NVIDIA's GPU Demo

Picture this: It's GTC Europe 2018, and NVIDIA's team is preparing to showcase their revolutionary RAPIDS platform. The ...

Unix advanced

The Night Tasks Hung: A Production Story of Taming I/O Waits in Linux

Picture this: a production cluster rigged with cloud-scale services suddenly emits hung-task warnings, and every attempt...

Unix intermediate

The Silent Killer: When Your Linux Processes Vanish into Uninterruptible Sleep

Picture this: It's 2 AM and your monitoring dashboard is screaming. Dozens of unrelated processes are stuck in uninterru...

Linux intermediate

The Mysterious Case of the OOM Killer: How to Diagnose a Production Outage You Can’t Ignore

Seqera Labs faced a brutal wake-up call in mid-2022: Nextflow tasks on AWS EC2 containers began dying with OOM errors ev...

Operating Systems intermediate

The Vanishing CPU: A ClickHouse Case Study on Debugging with Kernel Memory Reclaim in the Clouds

Picture this: ClickHouse Cloud on GCP encounters random, unresponsive pods where CPU spikes to 100% and signals go unhea...

Unix advanced

When D-State Chaos Strikes: A Red Hat War Story That Teaches You to Debug Like a Pro

It was 3am when a flood of D-state processes left a Red Hat Enterprise Linux 7 machine unresponsive, a scene later ...

Linux advanced

NUMA in the Night: A Journey from Tail Latency to Locality

It was 3am when the pager woke the data hall. A Linux host in a multi-tenant analytics cluster began exhibiting in...

Unix beginner

One-Liner to Save the Day: Surfacing the Heaviest Directories in a Sea of Logs

Picture this: Uber’s logs were exploding, with up to 200TB of Spark-generated data on a single busy day and a monthly mo...

Networking advanced

When Load Balancers Fail: The 15-Hour AWS Outage That Broke the Internet

On October 20, 2025, Amazon Web Services experienced a catastrophic 15-hour outage in its US-EAST-1 region that crippl...
