A real-world alarm clock
In Expensify’s Onyx system, memory-heavy workloads on multi-socket servers produced tail latencies that spiked unpredictably during heavy I/O. The question wasn’t just how to measure latency, but how to prove that remote NUMA access was driving the delay, and how to steer the system back to locality without losing observability. The stakes are real: in multi-tenant environments, a single noisy neighbor can cascade into SLO violations for everyone. This section sketches the opening gambit and sets the stage for a disciplined, reproducible workflow [1]. The core idea: NUMA topology matters because memory access time depends on which node the memory resides on, so understanding the topology is the first compass rose in the journey [2].
The diagnostic toolkit you already own
Building on the premise that default Linux memory balancing can drift under load, the diagnostic plan uses only built-in utilities to confirm, locate, and mitigate remote NUMA effects.

1) Map the NUMA topology on the host:

       numactl --hardware

   The output lists the number of NUMA nodes, the CPUs attached to each node, and each node’s local memory size. This establishes the landscape before any measurements [3].

2) For a target process, inspect per-process NUMA accounting and remote allocations:

       numastat -p <pid>
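The per-process accounting from step 2 can be turned into a single number worth alerting on. Below is a minimal Python sketch that parses the bottom `Total` row of `numastat -p <pid>`-style output and reports what fraction of the process’s memory sits off its home node; the two-node sample text and the `remote_fraction` helper name are illustrative assumptions, not part of numastat itself.

```python
def remote_fraction(numastat_text, home_node=0):
    """Fraction of a process's memory resident off `home_node`, taken
    from the bottom `Total` row of `numastat -p`-style output."""
    for line in numastat_text.splitlines():
        parts = line.split()
        # Summary row shape: Total  <node0 MB> ... <nodeN MB> <grand total MB>
        if parts and parts[0] == "Total":
            per_node = [float(x) for x in parts[1:-1]]
            grand_total = float(parts[-1])
            return (grand_total - per_node[home_node]) / grand_total
    raise ValueError("no Total row found in numastat output")

# Illustrative two-node sample in the shape numastat -p prints (values in MB).
SAMPLE = """\
Per-node process memory usage (in MBs) for PID 1234 (onyx)
                           Node 0          Node 1           Total
                  --------------- --------------- ---------------
Heap                       120.50           40.25          160.75
Private                    300.00          150.00          450.00
----------------  --------------- --------------- ---------------
Total                      420.50          190.25          610.75
"""

print(f"remote fraction: {remote_fraction(SAMPLE):.1%}")  # roughly 31% remote
```

A steadily rising remote fraction under load is the signal that the workload’s pages are drifting off-node.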
The twist: what if the data path really is remote
Many developers discover that the mere presence of remote pages isn’t inherently bad—until sustained I/O and multi-threaded workloads magnify the problem. The twist is to quantify how much remote memory contributes to tail latency and to rule out other bottlenecks first. The core signal is clear: when a large fraction of memory accesses originate from a remote node, reads land on higher-latency paths and tail latency inflates. This aligns with NUMA fundamentals and the policy surface Linux exposes for balancing and binding [2][3][4].
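One way to quantify that fraction with default tooling is to sum the per-node page counts (the `N<node>=<pages>` fields) in `/proc/<pid>/numa_maps`. The sketch below does this for a sample excerpt; the two-line sample and the `pages_per_node` helper name are illustrative assumptions.

```python
import re
from collections import Counter

def pages_per_node(numa_maps_text):
    """Sum the N<node>=<pages> fields across every mapping line of a
    /proc/<pid>/numa_maps dump, returning resident pages per node."""
    counts = Counter()
    for line in numa_maps_text.splitlines():
        for node, pages in re.findall(r"\bN(\d+)=(\d+)", line):
            counts[int(node)] += int(pages)
    return counts

# Illustrative excerpt in the shape numa_maps uses (two anonymous mappings).
SAMPLE = (
    "7f2b4c000000 default anon=2048 dirty=2048 N0=1536 N1=512 kernelpagesize_kB=4\n"
    "7f2b50000000 default anon=1024 dirty=1024 N0=256 N1=768 kernelpagesize_kB=4\n"
)

counts = pages_per_node(SAMPLE)
total = sum(counts.values())
# Share of resident pages per node; a large share on a non-home node
# is the per-page confirmation of the numastat-level signal.
print({node: f"{pages / total:.0%}" for node, pages in sorted(counts.items())})
```

Reading the live file requires root or ownership of the target process; snapshotting it periodically lets you correlate remote-page growth with latency spikes.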
Safe mitigations that preserve observability
If remote access dominates, a conservative path is to enforce locality for the offending workload or thread groups while keeping the system observable. Bind both memory and CPUs to a local NUMA node for the workload:

       numactl --membind=0 --cpunodebind=0 <command>
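The memory half of that binding (`--membind`) relies on set_mempolicy(2), which only numactl/libnuma expose; the CPU half can be reproduced from Python’s standard library on Linux via `os.sched_setaffinity`. The sketch below is a hedged illustration of that CPU side only — the `pin_to_cpus` helper is an invented name, and which CPUs belong to which node is something you would read from `numactl --hardware` first.

```python
import os

def pin_to_cpus(cpus):
    """Restrict the calling process to the given CPU set (Linux only).
    Mirrors the CPU side of `numactl --cpunodebind`; the memory side
    (`--membind`) needs set_mempolicy(2), which the Python standard
    library does not wrap."""
    os.sched_setaffinity(0, set(cpus))  # pid 0 means the calling process
    return os.sched_getaffinity(0)

# Example: pin to the lowest CPU we are currently allowed to run on,
# then restore the original mask so the rest of the process is unaffected.
original = os.sched_getaffinity(0)
pinned = pin_to_cpus({min(original)})
print(f"pinned to CPUs: {sorted(pinned)}")
os.sched_setaffinity(0, original)
```

For production workloads, launching under `numactl` is still the simpler and more complete choice, since it covers memory placement as well.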
Real-world proof in practice
The Expensify case demonstrates that a pragmatic, tool-based approach can tame pathological remote allocations under heavy memory pressure. By mapping the topology, watching per-process NUMA accounting, and applying targeted locality controls, the team reduced tail latency and improved predictability across multi-tenant workloads. The lesson: default NUMA balancing isn’t a panacea under memory-heavy, concurrent workloads; manual balancing and selective interleaving can deliver measurable stability while keeping observability intact [1][2].
System Flow
```mermaid
flowchart TD
    A[User workload] --> B{NUMA nodes}
    B --> C[Local memory access]
    B --> D[Remote memory access]
    D --> E[Tail latency spike]
    C --> F[Mitigation: bind to local node]
    F --> G[Observability preserved]
    G --> H[Latency stabilizes]
```
Key Takeaways
- NUMA locality matters under memory-heavy, multi-threaded workloads.
- Use default tools to map topology, measure per-process NUMA accounting, and inspect per-page NUMA maps.
- Mitigate with membind/cpunodebind or selective interleaving while staying observable.
Did you know? Tail latency isn’t just about slow code; sometimes it’s about where memory lives in a multi-socket machine.
References
- [1] How Expensify achieves extreme concurrency with NUMA balancing (article)
- [2] Non-Uniform Memory Access (encyclopedia)
- [3] NUMA architecture in Linux (documentation)
- [4] numactl(8) manual
- [5] numastat(1) manual
- [6] Procfs (encyclopedia)
- [7] /proc/[pid]/numa_maps (procfs documentation)
- [8] CPU affinity (encyclopedia)
- [9] numactl (repository)
- [10] Linux kernel source (repository)
- [11] Cache coherence (encyclopedia)
Wrapping Up
The journey shows that the fastest way to tame NUMA-induced tail latency starts with a clear map of the topology, a disciplined measurement routine, and safe, locality-preserving mitigations. A real-world case like Expensify’s shows teams how to observe and act, leaving them with a reliable playbook for any multi-tenant workload. The next move is to make this workflow repeatable across clusters and automate the correlation between latency spikes and NUMA access patterns.