A real-world alarm clock
In Expensify’s Onyx system, memory-heavy workloads on multi-socket servers produced tail latencies that spiked unpredictably during heavy I/O. The question wasn’t just how to measure latency, but how to prove that remote NUMA access was driving the delay, and how to steer the system back to locality without losing observability. The stakes are real: in multi-tenant environments, a single noisy neighbor can cascade into SLO violations for everyone. This section sketches the opening gambit and sets the stage for a disciplined, reproducible workflow [1]. The core idea: NUMA topology matters because memory access time depends on which node the memory resides on, so understanding the topology is the first compass rose in the journey [2].
The diagnostic toolkit you already own
Building on the premise that default Linux memory balancing can drift under load, the diagnostic plan uses only built-in utilities to confirm, locate, and mitigate remote NUMA effects.

1) Map the NUMA topology on the host:

       numactl --hardware

   The output lists the number of NUMA nodes, the CPUs attached to each node, and each node’s local memory size. This establishes the landscape before any measurements [3].

2) For a target process, inspect per-process NUMA accounting and remote allocations:

       numastat -p <pid>
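The per-process accounting from step 2 can be turned into a single number worth alerting on. Below is a minimal Python sketch that parses the bottom `Total` row of `numastat -p <pid>`-style output and reports what fraction of the process’s memory sits off its home node; the two-node sample text and the `remote_fraction` helper name are illustrative assumptions, not part of numastat itself.

```python
def remote_fraction(numastat_text, home_node=0):
    """Fraction of a process's memory resident off `home_node`, taken
    from the bottom `Total` row of `numastat -p`-style output."""
    for line in numastat_text.splitlines():
        parts = line.split()
        # Summary row shape: Total  <node0 MB> ... <nodeN MB> <grand total MB>
        if parts and parts[0] == "Total":
            per_node = [float(x) for x in parts[1:-1]]
            grand_total = float(parts[-1])
            return (grand_total - per_node[home_node]) / grand_total
    raise ValueError("no Total row found in numastat output")

# Illustrative two-node sample in the shape numastat -p prints (values in MB).
SAMPLE = """\
Per-node process memory usage (in MBs) for PID 1234 (onyx)
                           Node 0          Node 1           Total
                  --------------- --------------- ---------------
Heap                       120.50           40.25          160.75
Private                    300.00          150.00          450.00
----------------  --------------- --------------- ---------------
Total                      420.50          190.25          610.75
"""

print(f"remote fraction: {remote_fraction(SAMPLE):.1%}")  # roughly 31% remote
```

A steadily rising remote fraction under load is the signal that the workload’s pages are drifting off-node.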
The twist: what if the data path really is remote
Many developers discover that the mere presence of remote pages isn’t inherently bad—until sustained I/O and multi-threaded workloads magnify the problem. The twist is to quantify how much remote memory contributes to tail latency and to rule out other bottlenecks first. The core signal is clear: when a large fraction of memory accesses originate from a remote node, reads land on higher-latency paths and tail latency inflates. This aligns with NUMA fundamentals and the policy surface Linux exposes for balancing and binding [2][3][4].
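One way to quantify that fraction with default tooling is to sum the per-node page counts (the `N<node>=<pages>` fields) in `/proc/<pid>/numa_maps`. The sketch below does this for a sample excerpt; the two-line sample and the `pages_per_node` helper name are illustrative assumptions.

```python
import re
from collections import Counter

def pages_per_node(numa_maps_text):
    """Sum the N<node>=<pages> fields across every mapping line of a
    /proc/<pid>/numa_maps dump, returning resident pages per node."""
    counts = Counter()
    for line in numa_maps_text.splitlines():
        for node, pages in re.findall(r"\bN(\d+)=(\d+)", line):
            counts[int(node)] += int(pages)
    return counts

# Illustrative excerpt in the shape numa_maps uses (two anonymous mappings).
SAMPLE = (
    "7f2b4c000000 default anon=2048 dirty=2048 N0=1536 N1=512 kernelpagesize_kB=4\n"
    "7f2b50000000 default anon=1024 dirty=1024 N0=256 N1=768 kernelpagesize_kB=4\n"
)

counts = pages_per_node(SAMPLE)
total = sum(counts.values())
# Share of resident pages per node; a large share on a non-home node
# is the per-page confirmation of the numastat-level signal.
print({node: f"{pages / total:.0%}" for node, pages in sorted(counts.items())})
```

Reading the live file requires root or ownership of the target process; snapshotting it periodically lets you correlate remote-page growth with latency spikes.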
Safe mitigations that preserve observability
If remote access dominates, a conservative path is to enforce locality for the offending workload or thread groups while keeping the system observable. Bind both memory and CPUs to a local NUMA node for the workload:

       numactl --membind=0 --cpunodebind=0 <command>
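The memory half of that binding (`--membind`) relies on set_mempolicy(2), which only numactl/libnuma expose; the CPU half can be reproduced from Python’s standard library on Linux via `os.sched_setaffinity`. The sketch below is a hedged illustration of that CPU side only — the `pin_to_cpus` helper is an invented name, and which CPUs belong to which node is something you would read from `numactl --hardware` first.

```python
import os

def pin_to_cpus(cpus):
    """Restrict the calling process to the given CPU set (Linux only).
    Mirrors the CPU side of `numactl --cpunodebind`; the memory side
    (`--membind`) needs set_mempolicy(2), which the Python standard
    library does not wrap."""
    os.sched_setaffinity(0, set(cpus))  # pid 0 means the calling process
    return os.sched_getaffinity(0)

# Example: pin to the lowest CPU we are currently allowed to run on,
# then restore the original mask so the rest of the process is unaffected.
original = os.sched_getaffinity(0)
pinned = pin_to_cpus({min(original)})
print(f"pinned to CPUs: {sorted(pinned)}")
os.sched_setaffinity(0, original)
```

For production workloads, launching under `numactl` is still the simpler and more complete choice, since it covers memory placement as well.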
Real-world proof in practice
The Expensify case demonstrates that a pragmatic, tool-based approach can tame pathological remote allocations under heavy memory pressure. By mapping the topology, watching per-process NUMA accounting, and applying targeted locality controls, the team reduced tail latency and improved predictability across multi-tenant workloads. The lesson: default NUMA balancing isn’t a panacea under memory-heavy, concurrent workloads; manual balancing and selective interleaving can deliver measurable stability while keeping observability intact [1][2].
System Flow
```mermaid
flowchart TD
    A[User workload] --> B{NUMA nodes}
    B --> C[Local memory access]
    B --> D[Remote memory access]
    D --> E[Tail latency spike]
    C --> F[Mitigation: bind to local node]
    F --> G[Observability preserved]
    G --> H[Latency stabilizes]
```
Key Takeaways
- NUMA locality matters under memory-heavy, multi-threaded workloads.
- Use default tools to map topology, measure per-process NUMA accounting, and inspect per-page NUMA maps.
- Mitigate with membind/cpunodebind or selective interleaving while staying observable.
Did you know? Tail latency isn’t just about slow code; sometimes it’s about where memory lives in a multi-socket machine.
References
- [1] How Expensify achieves extreme concurrency with NUMA balancing (article)
- [2] Non-Uniform Memory Access (encyclopedia)
- [3] NUMA architecture in Linux (documentation)
- [4] numactl(8) manual
- [5] numastat(1) manual
- [6] Procfs (encyclopedia)
- [7] /proc/[pid]/numa_maps (procfs documentation)
- [8] CPU affinity (encyclopedia)
- [9] numactl (repository)
- [10] Linux kernel source (repository)
- [11] Cache coherence (encyclopedia)
Wrapping Up
The journey shows that the fastest way to tame NUMA-induced tail latency starts with a clear map of the topology, a disciplined measurement routine, and safe, locality-preserving mitigations. A real-world case like Expensify’s shows teams how to observe and act, leaving them with a reliable playbook for any multi-tenant workload. The next move is to make this workflow repeatable across clusters and automate the correlation between latency spikes and NUMA access patterns.