The Distributed Data Dilemma: Kubernetes' Unsolved Puzzle

It was a question I heard constantly in the years after Kubernetes launched: "Can we run our database on Kubernetes?" The answer was always, "Yes, but it's complicated."

It wasn't complicated because of Kubernetes. It was complicated because of data.

By 2016, Kubernetes had proven that a new world of compute portability was possible. You could run applications anywhere. But the most critical workload—the data—was still chained to a single location, making the entire promise hollow. This crystallized something we'd known but avoided confronting. We'd solved the "day one" problem brilliantly. The hard problem? We'd kicked it down the road.

The "Day Two" Problem We Knew We Were Creating

I remember the late-night conversations in the early days of the project, deep in the push toward the 1.0 launch. The stateless compute orchestration was clicking into place: pods, services, deployments. It felt revolutionary. But we knew stateful workloads would be a challenge for anyone who didn't have a dedicated team to manage them.

We made a conscious choice: get the core compute abstraction right first. Ship early and solve the problems right in front of us, like declarative compute and flat networking, on a solid foundation with a clean API (Brian Grant and Daniel Smith, among many others, don't get enough credit for getting this right). We called everything else "day two" problems. The reasoning was sound: you can't solve distributed state if you haven't solved distributed compute.

But "day two" became "day two thousand." And the industry is still wrestling with the consequences.

The First Wave of Solutions: StatefulSets and the Single-Cluster Mindset

The community didn't ignore the problem. StatefulSets (originally, and terribly, named "PetSets") arrived in 2016, followed by the Container Storage Interface (CSI) and sophisticated operators for databases. These were essential steps.

StatefulSets gave pods stable identities. Persistent Volumes provided durable storage that survived pod restarts. CSI enabled storage vendors to integrate seamlessly. For the first time, you could run serious stateful workloads in a Kubernetes cluster.
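
As a minimal sketch (the names, image, and sizes are illustrative), a StatefulSet with a volumeClaimTemplate gives each replica a stable identity and its own PersistentVolumeClaim, dynamically provisioned through a CSI-backed StorageClass:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres                  # illustrative name
spec:
  serviceName: postgres           # headless Service that gives each pod a stable DNS name
  replicas: 3
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
      - name: postgres
        image: postgres:16
        volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:           # one PersistentVolumeClaim per replica
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: gp3       # illustrative CSI-backed StorageClass
      resources:
        requests:
          storage: 100Gi

Each replica gets a predictable name (postgres-0, postgres-1, postgres-2) and a volume that survives pod restarts and rescheduling within the cluster.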

Helm, Operators, and CRDs helped as well. But the community was still focused on the complexity of deploying a stateful workload to a single cluster.

A StatefulSet's pods have to run where their storage lives, and that storage is bound to a single place: a Persistent Volume is tied to an EBS volume in one availability zone of us-east-1, or to a Ceph cluster in your data center. The data becomes anchored to a physical location. This worked brilliantly for scaling up within a single cluster. It failed for scaling out across regions or clouds.
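
That anchoring is visible in the storage objects themselves. A dynamically provisioned, EBS-backed Persistent Volume carries node affinity that pins it to a single zone, roughly like this (the name, volume ID, and zone are illustrative):

apiVersion: v1
kind: PersistentVolume
metadata:
  name: pvc-6f1c2a9e              # illustrative; normally auto-generated
spec:
  capacity:
    storage: 100Gi
  accessModes: ["ReadWriteOnce"]
  csi:
    driver: ebs.csi.aws.com
    volumeHandle: vol-0abc123def456789a   # the underlying EBS volume
  nodeAffinity:                   # the volume can only be attached in this zone
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: topology.ebs.csi.aws.com/zone
          operator: In
          values:
          - us-east-1a

Any pod that mounts this volume can only be scheduled into us-east-1a. The volume, and the data on it, never follows the workload anywhere else.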

The Broken Promise: When Your Data Can't Follow Your App

The real-world implications often don't show up until something goes wrong. A common scenario is an application running in us-east-1 with its data stored on an EBS volume. A zone failure triggers self-healing, exactly what you want, and your multi-region failover brings the application pod back up in us-west-2. Kubernetes reports the pod as "Running" and your monitoring shows green.

But your application is now separated from its data by 3,000 miles of network latency. Every database query takes 100+ milliseconds. Your app becomes unusable. Kubernetes solved the compute problem perfectly and created a data problem in the process.

Kubernetes gave us application portability, but data gravity chained it right back down.

The Unsolved Puzzle: Why We Need Data Orchestration

The contrast is jarring. For compute, we have an elegant, declarative world:

kubectl apply -f deployment.yaml
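
The deployment.yaml behind that command is nothing more than a statement of desired state, something like this (the name and image are illustrative):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3                     # desired state: three copies, always
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: nginx:1.27
        ports:
        - containerPort: 80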

Kubernetes reads your desired state and makes it happen. Pods fail? New ones appear. Nodes disappear? Workloads migrate. It's orchestration at its finest.

For data, we're stuck in the imperative stone age:

aws s3 cp s3://source-bucket/data s3://dest-bucket/data --recursive
aws rds create-db-snapshot --db-instance-identifier prod-db
python migrate_database.py --pray-it-works

Brittle scripts, manual failovers, and cross-region copies that take hours and cost thousands. We manage data like it's 2010, while our applications live in 2025.

Kubernetes taught us to abstract applications from specific machines. The next evolution is to abstract data from specific locations. We don't need another storage driver or database operator. We need a data orchestrator—a control plane that treats data locality, dependencies, and movement as first-class citizens.
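
To make "first-class" concrete, here is a purely hypothetical sketch of what a declarative data manifest might look like, in the same spirit as a Deployment. None of these kinds or fields exist today; this is an illustration of the idea, not a description of any shipping API:

# Hypothetical resource, for illustration only
apiVersion: data.example.io/v1alpha1
kind: DataSet
metadata:
  name: orders
spec:
  source:
    s3:
      bucket: prod-orders          # where the data lives today
      region: us-east-1
  placement:
    followWorkload: orders-api     # keep a copy close to this application
    maxStaleness: 5m               # acceptable replication lag
  policy:
    encryptionAtRest: true
    retainCopies: 2

The point is not the specific fields. It's that placement, freshness, and policy for data would be declared once and continuously reconciled, the way replica counts are for compute, instead of being scripted by hand.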

The Vision: Orchestrating Data Like We Orchestrated Compute

This is what we're building at Expanso. The principles that revolutionized compute orchestration—declarative configuration, desired state management, intelligent scheduling—must now be applied to data. We're not building another database or storage system; we're building the control plane for distributed data.

The Kubernetes community solved the compute puzzle. Now it's time to solve the data puzzle.

What's the most painful "application moved, but the data didn't" story you've lived through? I'm interested in your perspective.