Networking & Systems
24 deep dives
10 Minutes to Clarity: Uber’s Open-Source Backbone and the Quest for Per-Tenant Error Counts
In Uber's world, metrics exploded across thousands of microservices, and Prometheus alone couldn’t scale to keep up [1]....
The Slab Whisperer: How to Tame Tail Latency in a High-Concurrency Linux World
Picture this: a Linux host under heavy, multi-tenant load starts coughing up tens-of-millisecond tail latencies during b...
What Netflix Learned About Tail Latency on NUMA: A Linux Toolkit-Driven Debugging Journey
It was a night when Netflix encountered tail latency bursts on a quad-socket NUMA machine as containerized workloads sur...
The Memory Trap: How to Diagnose Tail Latency in Containerized Linux Clusters
In Elastic Cloud, production memory pressure on Kubernetes nodes surfaced during bursts, driven by kernel memory account...
The Night Logs Escaped: Netflix’s Wake-Up Call and a Robust Way to Reclaim Open File Descriptors
It was a night when a busy Unix host started screaming in silence. Netflix’s Dynomite project faced a production race wh...
When Petabytes Go Silent: A Netflix-Scale Journey Through Logs
Picture this: Netflix sits on a mountain of logs—petabytes pouring in from thousands of microservices—yet near real-time...
A NUMA Tale: Unraveling Tail Latency with Linux Memory Reclaim
In 2013, LinkedIn faced intermittent tail latency spikes on NUMA servers during peak ingestion for an online-graph workl...
The Burst That Revealed the Hidden Cache War
It was a 50-server Microsoft production cluster—the OneRF lineage—that first whispered the truth: tail latency can spike...
Two-Phase Compression: How Uber Tamed a Log Mountain Without Breaking the Bank
Uber’s Spark-driven data platform faced log volumes so monstrous that retention costs threatened to swallow the budget. ...
The Kernel, the Firewall, and the Command Line: A DevOps Journey Through Linux Mastery
It started in Automattic's WordPress VIP infrastructure on Kubernetes: a routine firewall-rule reload slowed to a crawl,...
The 24-Hour Log Hunt: A One-Liner That Surfaces Busy Users (And Why Knight Capital's Lesson Still Matters)
In August 2012, Knight Capital Group deployed a new trading system. In about 45 minutes, a faulty deployment flooded the...
Linux on Fire: A Netflix‑style 60‑Second Triage That Cracks Tail Latency
Picture this: a Linux node in a high‑throughput data ingestion pipeline suddenly shows tail latency spikes after 1s duri...
Sticky Sessions at Scale: Booking.com's HAProxy Playbook and the Locality Dilemma
Booking.com scaled its global application delivery network using an internal LBaaS built around HAProxy to manage billio...
Latency Unmasked: A Triaged Journey Through Linux Kernel Hurdles
It started with a single, stubborn question: why would a Linux-powered Redis-backed web app experience 30-second tail la...
The Midnight Mystery: Why Your Linux Server Lies About Memory
It was 3am when the pager went off. Production services were crashing, but `free -m` showed 8GB available RAM. I stared ...
The $2 Million Memory Mistake That Broke NVIDIA's GPU Demo
Picture this: It's GTC Europe 2018, and NVIDIA's team is preparing to showcase their revolutionary RAPIDS platform. The ...
The Night Tasks Hung: A Production Triage Story of Taming I/O Waits in Linux
Picture this: a production cluster rigged with cloud-scale services suddenly emits hung-task warnings, and every attempt...
The Silent Killer: When Your Linux Processes Vanish into Uninterruptible Sleep
Picture this: It's 2 AM and your monitoring dashboard is screaming. Dozens of unrelated processes are stuck in uninterru...
The Mysterious Case of the OOM Killer: How to Diagnose a Production Outage You Can’t Ignore
Seqera Labs faced a brutal wake-up call in mid-2022: Nextflow tasks on AWS EC2 containers began dying with OOM errors ev...
The Vanishing CPU: A ClickHouse Case Study on Debugging with Kernel Memory Reclaim in the Clouds
Picture this: ClickHouse Cloud on GCP encounters random, unresponsive pods where CPU spikes to 100% and signals go unhea...
When D-Stated Chaos Strikes: A Red Hat War Story That Teaches You to Debug Like a Pro
It was 3am when a flood of D-state processes made a Red Hat Enterprise Linux 7 machine go non-responsive, a scene later ...
NUMA in the Night: A Journey from Tail Latency to Locality
It was 3am when the pager woke the data hall. A Linux host in a multi‑tenant analytics cluster began exhibiting in...
One-Liner to Save the Day: Surfacing the Heaviest Directories in a Sea of Logs
Picture this: Uber’s logs were exploding, with up to 200TB of Spark-generated data on a single busy day and a monthly mo...
When Load Balancers Fail: The 15-Hour AWS Outage That Broke the Internet
On October 20, 2025, Amazon Web Services experienced a catastrophic 15-hour outage in their US-EAST-1 region that crippl...