From Kubeflow to Real-World ML: Why Data Locality Matters Just as Much as Compute
When my co-founders, Jeremy Lewi and Vishnu Kannan, and I started Kubeflow back in 2017, we were trying to solve what felt like the biggest problem in machine learning. Brilliant data scientists would craft elegant models on their laptops, only to watch them fail spectacularly in production. The transition from a local Jupyter notebook to a distributed cluster was a minefield of broken dependencies and operational complexity. The promise of ML was being blocked by the messy reality of deployment.
Our goal was to make ML workloads as portable and scalable as any other modern application by bringing the operational discipline of Kubernetes to data science.
What We Got Right: Taming Compute
In many ways, we succeeded. Kubeflow created a common platform for the ML lifecycle. By building on Kubernetes, we gave teams a standard, declarative API for defining, deploying, and managing complex pipelines. You could define an entire workflow—from data preprocessing to distributed training to model serving—in a single, version-controlled manifest.
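To make that concrete, here is a minimal sketch of what such a manifest can look like when defined with the Kubeflow Pipelines SDK and compiled to a single file. The component names, base image, and dataset path are placeholders I made up for illustration, not pieces of any real pipeline.

```python
# Minimal sketch of a Kubeflow pipeline definition using the KFP SDK (v2-style).
# Component names, image, and the dataset path are illustrative placeholders.
from kfp import dsl, compiler

@dsl.component(base_image="python:3.11")
def preprocess(raw_path: str) -> str:
    # Placeholder preprocessing step; a real component would read, clean,
    # and write features, then return the output location.
    print(f"preprocessing {raw_path}")
    return raw_path + "/features"

@dsl.component(base_image="python:3.11")
def train(features_path: str):
    # Placeholder training step.
    print(f"training on {features_path}")

@dsl.pipeline(name="example-training-pipeline")
def training_pipeline(raw_path: str = "gs://example-bucket/raw"):
    features = preprocess(raw_path=raw_path)
    train(features_path=features.output)

if __name__ == "__main__":
    # Compiling produces the single, version-controlled manifest described above.
    compiler.Compiler().compile(training_pipeline, "training_pipeline.yaml")
```

The whole workflow lives in one artifact you can review, version, and redeploy, which is exactly what made the compute side feel solved.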
This was a big step forward. Teams could spin up a distributed TensorFlow or PyTorch training job across a cluster of GPUs without becoming experts in networking or systems administration. We had made real progress in solving the "works on my laptop, fails in production" problem by making the production environment a programmable, repeatable system.
The Blind Spot: A Mountain of Data
But in solving the compute problem, we inherited a massive blind spot: data. We made it easy to run a training job anywhere, but we did nothing to solve the excruciating pain of getting the data to the job. We soon discovered that the new bottleneck wasn't provisioning compute, but the "training data shuffle."
I was shocked to learn how many teams spent days, not hours, running massive s3 cp or gsutil cp commands. They were duplicating petabytes of data from a central lake to a high-performance, local file cache just so their GPUs could access it. The compute itself was fast, but the end-to-end process was glacial, bottlenecked by data copying and network bandwidth before a single training epoch could begin.
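A quick back-of-envelope calculation shows why those copy jobs dominate the schedule. The dataset sizes, link speed, and efficiency factor below are assumptions I picked to show the shape of the problem, not measurements from any particular team.

```python
# Back-of-envelope illustration of why bulk copies dominate end-to-end time.
# Sizes, bandwidth, and efficiency are assumed values for illustration only.

def copy_time_hours(dataset_tb: float, link_gbps: float, efficiency: float = 0.7) -> float:
    """Rough time to copy a dataset over a network link.

    dataset_tb: dataset size in terabytes
    link_gbps:  nominal link bandwidth in gigabits per second
    efficiency: fraction of nominal bandwidth actually achieved
    """
    dataset_bits = dataset_tb * 1e12 * 8          # TB -> bits
    effective_bps = link_gbps * 1e9 * efficiency  # Gbps -> bits/second
    return dataset_bits / effective_bps / 3600    # seconds -> hours

if __name__ == "__main__":
    for size_tb in (5, 50, 500):
        hours = copy_time_hours(size_tb, link_gbps=10)
        print(f"{size_tb:>4} TB over a 10 Gbps link: ~{hours:,.0f} hours just copying")
```

At 50 TB on a shared 10 Gbps link you are already looking at the better part of a day of pure copying, before a single GPU does useful work; at petabyte scale it becomes weeks.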
The Real Reason ML Projects Fail
This is the uncomfortable truth of modern ML. Most projects don't fail because the algorithm is wrong or the model architecture is flawed. They die a slow death during data engineering, long before a model sees production traffic.
The culprit is the immense friction of data logistics. A data pipeline breaks. A copy job times out. A permissions error blocks access. Each issue seems small, but together they grind progress to a halt until the business loses patience. A promising project gets shut down not because the idea was bad, but because the data plumbing broke first.
I saw this exact pattern among the customers of nearly every major tech company I have worked for. A team would have a brilliant idea, get a budget for a massive GPU cluster, and then spend the next six months building a fragile, custom pipeline just to feed the machines. The focus was always on the shiny, expensive compute, while the boring, essential work of data management was an afterthought.
The Next Evolution: Data-Aware Orchestration
The last decade was about perfecting compute orchestration. With tools like Kubernetes, we learned to treat servers as cattle, not pets. We decoupled applications from physical machines, giving us incredible scale and resilience for stateless workloads.
But we still treat our data like a pampered pet, locked in centralized monoliths like data lakes or warehouses. This forces our compute to come to the data, a model that is fundamentally broken for data-intensive workloads. The next evolution is to flip this on its head and bring the compute to the data.
Instead of copying a 50-terabyte dataset to a remote cluster, a data-aware orchestrator should be smart enough to run components of the pipeline directly where the data already lives. This isn't just about efficiency; it's a fundamental architectural shift. It treats data as the stable center of gravity and schedules ephemeral compute around it. This "Compute over Data" approach means intelligently deciding which parts of a job should move, rather than defaulting to moving all the data.
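Here is a deliberately simplified sketch of what that placement decision looks like: prefer nodes that already hold the input data, and only fall back to paying the transfer cost when nothing local is available. This is an illustration of the idea under made-up node and bandwidth figures, not the scheduling logic of Expanso or any real orchestrator.

```python
# Minimal sketch of data-aware placement: prefer nodes that already hold the
# input dataset, otherwise estimate the cost of moving it. Node inventory and
# bandwidth figures are invented for the example.
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    local_datasets: set[str] = field(default_factory=set)
    uplink_gbps: float = 10.0

def transfer_hours(size_tb: float, link_gbps: float) -> float:
    return (size_tb * 1e12 * 8) / (link_gbps * 1e9) / 3600

def place_job(dataset: str, size_tb: float, nodes: list[Node]) -> Node:
    """Pick the node with the lowest data-movement cost for this job."""
    def movement_cost(node: Node) -> float:
        # Zero cost if the data is already local; otherwise, the copy time.
        if dataset in node.local_datasets:
            return 0.0
        return transfer_hours(size_tb, node.uplink_gbps)
    return min(nodes, key=movement_cost)

if __name__ == "__main__":
    cluster = [
        Node("gpu-remote", uplink_gbps=10.0),
        Node("edge-near-data", local_datasets={"clickstream-2024"}, uplink_gbps=1.0),
    ]
    chosen = place_job("clickstream-2024", size_tb=50, nodes=cluster)
    print(f"schedule on: {chosen.name}")  # picks the node where the data already lives
```

The point is the inversion of the default: data location becomes an input to scheduling, instead of something the scheduler ignores and the team pays for later.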
This problem extends far beyond machine learning. Scientific computing, financial modeling, genomics, and ETL run into the same wall. The cost and complexity of moving data is a universal tax on innovation.
At Expanso, we are building the tools to enable this shift by making data locality a core primitive of the distributed stack. We believe the next generation of orchestration needs to manage data with the same grace and power that it manages compute.
What has been your experience? I'm interested to hear where the friction in your data pipelines has been the most painful.