Global Distributed Consensus: The Missing Piece in Kubernetes
Early in my time on the Kubernetes team, a customer proposed something that was both brilliant and beyond what we were ready for: a global footprint of clusters, one per region, with a synchronized set of jobs. They were running a low-latency application and wanted to coordinate workloads across continents. The problem was, we hadn't figured out how to do that yet.
As an aside, we had an internal project called "Ubernetes" that aimed to solve this, but the complexity of tackling multi-cluster coordination in the API Server was too high. We had to leave it as an exercise for the user.
So we helped the customer build a solution, but it immediately raised a larger point for all of us. This wasn't just a deployment problem, which GitOps has since made more reliable. It was a consensus problem. The moment you have multiple systems that can't talk to each other, you have a recipe for things getting out of sync.
The Consensus We Have vs. The Consensus We Need
Kubernetes runs on etcd, which uses the Raft consensus algorithm. It's a proven model for what it was designed to do: keep a single cluster's state perfectly consistent. When you create a deployment or a pod dies, the etcd members agree on the new state of the world almost instantly, and every component watching the API server converges on it.
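To make that single-cluster guarantee concrete, here is a minimal sketch against the etcd v3 Go client, assuming an etcd member reachable at localhost:2379 (the endpoint and key are placeholders). A write returns only after a Raft quorum has committed it, and a default read is linearizable, so it reflects that commit no matter which member answers:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Endpoint is a placeholder for a single cluster's etcd members.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()

	// The Put only returns once a Raft quorum of etcd members has committed it.
	if _, err := cli.Put(ctx, "/registry/example/desired-replicas", "3"); err != nil {
		log.Fatal(err)
	}

	// Reads are linearizable by default: whichever member answers reflects that commit.
	resp, err := cli.Get(ctx, "/registry/example/desired-replicas")
	if err != nil {
		log.Fatal(err)
	}
	for _, kv := range resp.Kvs {
		fmt.Printf("%s = %s (revision %d)\n", kv.Key, kv.Value, kv.ModRevision)
	}
}
```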
But step outside that single cluster boundary, and all consensus guarantees break down.
Today, teams often work around this with a single, central source of truth, pushing desired state through write-only pipelines to declarative endpoints, like the Kubernetes API in each region. This works beautifully for stateless workloads. But the moment you introduce stateful data that needs to cross boundaries, the model falls apart.
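In practice that workaround often looks like the sketch below: one pipeline pushing the same manifest to every regional cluster, here by shelling out to kubectl. The context names and manifest path are assumptions for illustration. Each apply is independent, which is exactly the limitation: nothing ever asks the regions to agree with one another.

```go
package main

import (
	"log"
	"os/exec"
)

func main() {
	// Kubeconfig context names for the regional clusters (assumed for this sketch).
	regions := []string{"us-east", "eu-west", "ap-southeast"}

	for _, region := range regions {
		// Push the same declarative state to each region's API server.
		cmd := exec.Command("kubectl", "--context", region, "apply", "-f", "workload.yaml")
		out, err := cmd.CombinedOutput()
		if err != nil {
			// A failed region silently drifts from the others until the next push.
			log.Printf("region %s: apply failed: %v\n%s", region, err, out)
			continue
		}
		log.Printf("region %s: %s", region, out)
	}
}
```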
The Data Consensus Problem
Stateless workloads are simple (or as simple as you can get in the distributed system world). But what happens when a user crosses a regional boundary due to load balancing or data sovereignty laws? You now have to figure out how and where to move their state. The common solutions are just bandages. I've seen countless teams forced to build brittle, custom consensus mechanisms using complex service meshes and controllers that poll APIs across regions, simply because the right tools don't exist.
Consider an ML training pipeline that needs to process user data across multiple regions for compliance. With current Kubernetes tooling, you end up with:
- Separate, manually coordinated data pipelines in each region.
- No global view of the processing state.
- Race conditions when jobs span regions.
- Inconsistent data lineage tracking.
Each cluster thinks it's doing the right thing, but globally, you have no guarantees of correctness. This isn't theoretical. I've seen production systems fail because cluster A started processing a dataset while cluster B was already halfway through the same job. Both clusters were internally consistent, but the global system was broken.
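The toy Go sketch below (all names invented) shows why internal consistency isn't enough: each cluster consults only its own local record before claiming a dataset, so both can conclude they own it.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// localState is one cluster's private view of which datasets are claimed.
// Nothing replicates this map between clusters.
type localState struct {
	mu     sync.Mutex
	claims map[string]string // dataset -> owner, as seen by THIS cluster only
}

func (s *localState) tryClaim(dataset, cluster string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, taken := s.claims[dataset]; taken {
		return false
	}
	s.claims[dataset] = cluster
	return true
}

func main() {
	clusterA := &localState{claims: map[string]string{}}
	clusterB := &localState{claims: map[string]string{}}

	var wg sync.WaitGroup
	for _, c := range []struct {
		name  string
		state *localState
	}{{"cluster-a", clusterA}, {"cluster-b", clusterB}} {
		wg.Add(1)
		go func(name string, state *localState) {
			defer wg.Done()
			if state.tryClaim("dataset-2024-06", name) {
				fmt.Printf("%s: claimed dataset, starting processing\n", name)
				time.Sleep(10 * time.Millisecond) // pretend to work
			}
		}(c.name, c.state)
	}
	wg.Wait()
	// Both clusters print "claimed dataset": each is internally consistent,
	// but globally the same job runs twice.
}
```

A shared claim table or global lock service would prevent the duplicate run, but that is precisely the cross-cluster agreement stock Kubernetes doesn't provide.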
Why Raft and Paxos Aren't Enough
Raft works brilliantly within a cluster because it assumes low-latency, reliable network connections. Stretch it across regions, however, and its core assumptions break down. Cross-region latency is unpredictable, and network partitions are a daily reality, not a rare edge case. Paxos and its variants offer more flexibility across unreliable, wide-area links, but that flexibility comes with a complexity that makes them difficult to implement correctly at the scale of modern multi-cloud deployments.
Both algorithms were designed for consensus among similar nodes in controlled environments. The real challenge today isn't just agreeing on simple state changes. It's reaching consensus on complex data operations that might take hours and involve moving terabytes of data between regions.
The Path Forward: Data-Aware Consensus
The missing piece isn't just global consensus; it's data-aware global consensus.
Traditional consensus algorithms treat all operations equally. But data operations are long-running, resource-intensive, and have complex dependencies. A true global consensus mechanism for Kubernetes must understand these characteristics. It must know that moving 100TB of data is fundamentally different from starting a web server. It must coordinate not just the job, but where the data lives, how it's partitioned, and which cluster processes each part. Most importantly, it has to manage the reality that data jobs have phases, dependencies, and resource needs that change over time.
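To make that concrete, here is a hypothetical sketch of what a proposal in such a data-aware consensus layer would need to carry before clusters could meaningfully vote on it. Every type and field name below is invented for illustration; none of this is an existing Kubernetes or Expanso API.

```go
// Package dataconsensus sketches the shape of a proposal that a data-aware
// global consensus layer would have to reason about. Hypothetical types only.
package dataconsensus

import "time"

// Phase captures the fact that data jobs run in stages with dependencies
// and resource needs that change over time.
type Phase struct {
	Name       string        // e.g. "ingest", "transform", "train"
	DependsOn  []string      // phases that must be agreed complete first
	CPUCores   int           // resources requested for this phase
	MemoryGB   int
	MaxRuntime time.Duration // long-running: hours, not milliseconds
}

// DataPartition describes where the bytes live and what it costs to move them.
type DataPartition struct {
	ID            string
	Region        string  // where the data currently resides
	SizeTB        float64 // moving 100TB is not like starting a web server
	Sovereignty   string  // e.g. "gdpr-eu": constrains which clusters may process it
	EgressCostUSD float64 // the economics of data movement
}

// Proposal is what the clusters would actually agree on: not just "run this
// job", but which cluster processes which partition, and in what order.
type Proposal struct {
	JobID       string
	Partitions  []DataPartition
	Phases      []Phase
	Assignments map[string]string // partition ID -> cluster expected to process it
}
```

The point of the shape is that agreement covers data placement and phase ordering, not just the decision to run a job.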
This is the problem we are focused on at Expanso. We believe the solution requires building global consensus mechanisms that understand data locality, processing requirements, and the economics of data movement. A system must work with existing compute clusters and data warehouses, but true consensus has to start where the data is generated, no matter where that is.
Ultimately, distributed systems aren't just about distributing compute. They're about distributing data intelligently, and that requires a fundamentally different approach to consensus.
What's been your experience with multi-cluster coordination? Have you hit similar walls when trying to orchestrate data processing across regions?