Hooked by a Real-World Constraint
The opening crisis reveals a core truth: topology and data locality aren't afterthoughts; they're the backbone of HA. In CockroachCloud, the need wasn't just faster scheduling; it was placing stateful pods where data lives and performs best. When the default scheduler can't enforce this, a custom path emerges: one that encodes topology constraints directly into scheduling decisions. Kubernetes supports multiple schedulers in a cluster, and a pod can opt into a specific one via the schedulerName field in its spec, setting the stage for deterministic behavior [1].
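Opting a pod into an alternative scheduler is a one-line change in the pod spec. A minimal sketch follows; the scheduler name my-custom-scheduler is a placeholder, and the image tag is illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: db-pod
spec:
  # Pods that omit schedulerName are picked up by the default kube-scheduler.
  schedulerName: my-custom-scheduler
  containers:
    - name: cockroachdb
      image: cockroachdb/cockroach:latest  # illustrative image reference
```

If no scheduler with that name is running, the pod simply stays Pending, which is worth monitoring for when rolling out a custom scheduler.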
From Default to Designer’s Clockwork
Building on this, it helps to understand how the default kube-scheduler works: it uses a two-phase filtering and scoring process to match pods with nodes, considering resources, affinity/anti-affinity, taints/tolerations, and other constraints. Yet when the workload's requirements diverge (GPU memory topology, NVLink connectivity, or strict geographic proximity), a custom scheduler can replace or extend this logic with bespoke algorithms or plugins. In practice, teams deploy the scheduler as a separate service, wire it into the cluster API, and give it RBAC visibility to place pods exactly where they belong [2][3][4].
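The two-phase filter-then-score idea can be sketched in a few lines of Python. This is a toy model, not the real kube-scheduler code: the node fields and the "most free CPU wins" scoring rule are invented for illustration.

```python
def schedule(pod, nodes):
    """Toy two-phase scheduler: filter infeasible nodes, then score the rest."""
    # Phase 1: filtering — drop nodes that cannot run the pod at all
    # (insufficient CPU, or a zone constraint the node fails to satisfy).
    feasible = [n for n in nodes
                if n["free_cpu"] >= pod["cpu"]
                and pod.get("zone") in (None, n["zone"])]
    if not feasible:
        return None  # no feasible node: the pod stays Pending
    # Phase 2: scoring — rank feasible nodes; here, prefer the most free CPU.
    return max(feasible, key=lambda n: n["free_cpu"])["name"]

nodes = [
    {"name": "node-a", "free_cpu": 2, "zone": "us-east-1a"},
    {"name": "node-b", "free_cpu": 8, "zone": "us-east-1b"},
]
print(schedule({"cpu": 1, "zone": "us-east-1b"}, nodes))  # node-b
```

The real scheduler runs many filter and score plugins and normalizes their scores, but the shape of the decision is the same: cut the node set down to what is feasible, then pick the best of what remains.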
When to Consider a Custom Scheduler
Common scenarios push teams toward custom schedulers: GPU-intensive workloads needing precise hardware placement, latency-sensitive apps targeting nearby regions, or cost-aware deployments balancing spot and on-demand nodes. A machine learning platform, for instance, might choose a scheduler that weighs GPU memory availability and NVLink topology to maximize training throughput while minimizing cross-node data transfer [4][9]. This is a deliberate shift from generic scheduling to workload-aware orchestration.
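A workload-aware score for that ML example might blend several signals into one number per node. The sketch below is hypothetical: the field names, weights, and the linear blend are invented to show the shape of such a scoring plugin, not taken from any real one.

```python
def gpu_score(node, w_mem=0.7, w_link=0.3):
    """Toy scoring: weight free GPU memory against NVLink connectivity.
    Field names and weights are illustrative, not from a real plugin."""
    mem = node["free_gpu_mem_gb"] / node["total_gpu_mem_gb"]  # fraction free
    link = 1.0 if node["nvlink"] else 0.0                     # NVLink present?
    return w_mem * mem + w_link * link

nodes = [
    {"name": "gpu-1", "free_gpu_mem_gb": 40, "total_gpu_mem_gb": 80, "nvlink": True},
    {"name": "gpu-2", "free_gpu_mem_gb": 70, "total_gpu_mem_gb": 80, "nvlink": False},
]
best = max(nodes, key=gpu_score)
print(best["name"])  # gpu-1: NVLink outweighs gpu-2's extra free memory
```

Tuning the weights is exactly the kind of workload-specific knob the default scheduler cannot offer, and exactly why teams in these scenarios reach for a custom one.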
A Practical Pattern for the Brave
Implementing a custom scheduler isn't just about writing new logic; it's about designing for reliability. The scheduler runs as a separate service, follows the same watch-and-bind contract against the API server that the default scheduler does, and should be highly available with multiple replicas. It consumes cluster state from the API server, observes pod lifecycle events, and makes placement decisions that honor topology, locality, and constraints the default scheduler would struggle to enforce [2][4].
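The core control loop is small: claim pending pods addressed to this scheduler, pick a node, and bind. The sketch below models that loop against an in-memory stand-in for the API server; in a real deployment you would use a Kubernetes client library (client-go, or the kubernetes Python package) and POST a Binding object, and the placement logic in pick_node is a placeholder.

```python
def pick_node(pod, nodes):
    """Placeholder placement logic: honor a required zone if the pod has one."""
    want = pod["spec"].get("zone")
    for n in nodes:
        if want is None or n["zone"] == want:
            return n["name"]
    return None

def run_scheduler(api, scheduler_name="my-custom-scheduler"):
    """Toy control loop: claim pending pods addressed to this scheduler
    and bind each to a chosen node."""
    for pod in api.pending_pods():
        if pod["spec"].get("schedulerName") != scheduler_name:
            continue  # leave other pods to their own schedulers
        node = pick_node(pod, api.nodes())   # topology-aware placement
        if node is not None:
            api.bind(pod["name"], node)      # in reality: POST a Binding

class FakeAPI:
    """In-memory stand-in for the API server, for illustration only."""
    def __init__(self, pods, nodes):
        self._pods, self._nodes, self.bound = pods, nodes, {}
    def pending_pods(self): return self._pods
    def nodes(self): return self._nodes
    def bind(self, pod_name, node_name): self.bound[pod_name] = node_name

api = FakeAPI(
    pods=[{"name": "db-0",
           "spec": {"schedulerName": "my-custom-scheduler", "zone": "us-east-1a"}}],
    nodes=[{"name": "n1", "zone": "us-east-1b"},
           {"name": "n2", "zone": "us-east-1a"}],
)
run_scheduler(api)
print(api.bound)  # {'db-0': 'n2'}: the node in the pod's required zone
```

Note the schedulerName check: it is what keeps multiple schedulers in one cluster from fighting over the same pods.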
Real-World Proof in the Wild
The Cockroach Labs case isn't fiction. By locking StatefulSet pods to specific zones based on ordinal, they achieved deterministic placement, improved availability, and safer scale behavior in multi-region deployments. This blueprint shows how topology-aware scheduling can transform operations for stateful workloads that demand strict data locality [1].
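The ordinal-to-zone idea can be expressed as a tiny mapping function. This is a sketch of the pattern, not Cockroach Labs' actual code:

```python
def zone_for_pod(pod_name, zones):
    """Pin a StatefulSet pod to a zone by its ordinal (e.g. 'cockroachdb-4').
    With 3 zones, ordinals 0,3,6,... land in zones[0], 1,4,7,... in zones[1],
    and so on, so every scale-up preserves an even zone spread deterministically."""
    ordinal = int(pod_name.rsplit("-", 1)[1])  # StatefulSet names end in -<ordinal>
    return zones[ordinal % len(zones)]

zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
print(zone_for_pod("cockroachdb-4", zones))  # us-east-1b
```

Because the mapping depends only on the pod's name, it gives the same answer on every reschedule: exactly the determinism the case study is after.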
Design Principles for Your Own Scheduler
If topology and data locality matter for your HA story, a custom scheduler offers a principled path forward. Start by defining the constraints the default scheduler misses, instrument the plugin interfaces for clear traceability, and design for failure modes: retries, backoffs, and simulated outages. Treat the scheduling framework as a foundation rather than reinventing the wheel; leverage existing patterns for integration with RBAC, metrics, and observability [2][3][4].
Putting It All Together
The journey comes full circle when the team realizes the real win: deterministic placement as a feature, not a byproduct. The right scheduler isn't just about faster deployments; it's about making scale safe and predictable for workloads that refuse to be caged by generic heuristics. The pattern invites teams to codify operational realities, such as data locality, hardware topology, and regional latency, into the scheduling decision itself [1][2][4].

Real-World Case Study: Cockroach Labs

CockroachCloud runs multi-region CockroachDB clusters across several availability zones. The default Kubernetes scheduler could not guarantee zone-level placement during scale operations for StatefulSets, risking loss of zone presence and data availability. To achieve deterministic zonal placement, Cockroach Labs built a custom Kubernetes scheduler that locks StatefulSet pods to specific zones based on their ordinal.

Key Takeaway: When topology and data locality are critical for HA, the default scheduler may not suffice. The Kubernetes scheduling framework allows custom plugins to encode workload-specific constraints, enabling deterministic placement and safer scale behavior for stateful, multi-zone workloads.
Scheduling Process Flow
```mermaid
flowchart TD
    A[Kubernetes Cluster] --> B{Pod scheduling}
    B -->|Default| C[Default kube-scheduler]
    B -->|Custom| D[Custom Scheduler]
    C --> E[Evaluate resources, taints, affinities]
    D --> E
    E --> F[Pod assigned to Node]
    F --> G[Pod running]
```

Key Takeaways

- schedulerName enables multiple schedulers in a cluster
- Custom schedulers replace or extend the default kube-scheduler
- Topology-aware placement improves HA for stateful workloads
Did you know? Many developers discover that the real killer is data locality, not just raw compute capacity.
References
- [1] A Custom Kubernetes Scheduler to Orchestrate Highly Available Applications (article)
- [2] Scheduling Framework (documentation)
- [3] StatefulSet - Kubernetes (documentation)
- [4] Scheduler Performance Tuning (documentation)
- [5] Kubernetes (article)
- [6] Kubernetes - Kubernetes (blog)
- [7] Kueue: Kubernetes Queuing for Scheduling (repository)
- [8] What is Kubernetes? - AWS (documentation)
- [9] Container orchestration - Wikipedia (article)
- [10] HTTP/1.1: RFC 2616 (documentation)
Wrapping Up
Topology-aware scheduling turns resource placement into a deliberate architectural choice. When data locality and deterministic placement matter for high availability, the path from default scheduling to a custom scheduler becomes not just practical, but essential. The next step is to map real-world constraints into scheduling decisions and explore how plugins can encode those rules without sacrificing cluster health.