The History of Expanso: It's About the Network (It Always Has Been)

Welcome to our series on the history of Expanso.

The Network is the Problem: How Expanso Began

The first time the scope of the problem facing tech became super clear to me, I was at Protocol Labs. We had IPFS storing and serving data at exabyte scale. The storage worked. The catalog worked. The network did not. Every user asked for the same thing: "Just move the data here so I can process it." We tried. The math killed us.

Do the simple version. If you need to send 1 GB to 100 nodes in one second, you need 100 GB/s of aggregate throughput. Now scale the dataset and the fan-out. You are into tens to hundreds of terabytes per second. Inside a rack, that is hard. Across a building or a continent, it is fantasy. And the data did not sit still. New events arrived every second. Every plan that started with "first, copy it all" ended in a queue.
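
To make the back-of-the-envelope math concrete, here is a tiny sketch using the same illustrative numbers (nothing here is a benchmark, just arithmetic):

    # Back-of-the-envelope fan-out math from the paragraph above.
    GB = 10**9   # bytes
    TB = 10**12  # bytes

    def aggregate_throughput(bytes_per_node, node_count, seconds):
        """Total bytes/sec the network must carry to deliver the payload everywhere in time."""
        return bytes_per_node * node_count / seconds

    # The simple version: 1 GB to 100 nodes in one second.
    print(aggregate_throughput(1 * GB, 100, 1) / GB)    # 100.0 GB/s

    # Scale the dataset and the fan-out: 100 GB to 1,000 nodes in one second.
    print(aggregate_throughput(100 * GB, 1000, 1) / TB)  # 100.0 TB/s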

I had seen the same pattern before. At Google with GKE and early Kubernetes users. At Microsoft and Amazon with ML teams doing massive ftp and rsync (!!) jobs just to start training. Different stacks. Same bottleneck. We kept dragging data to compute and paying for it in time, cost, and risk.

What Kubernetes taught me

I was lucky enough to be extremely early with the Kubernetes team. And days into the project, I realized I had to make a mental shift. To make the most of the platform, we had to help users stop asking "where does this run?" and start asking "what should run?". A scheduler matched intent to resources, and the goal was to abstract all of that away from the user. That unlocks scale because humans stop placing boxes.

But data never got that shift, because where data lives matters. Data (almost) never starts where it is processed; it starts where it is created - on a server, in a building, on a vehicle. And where data is CHANGING matters even more. The real world never sits inside one region. So the answer was not "centralize better." The answer was to bring the same orchestration mindset to data. Keep a clean, central API for describing work. Run that work where the data is born. Compute over data.

The first principles behind Expanso

We started Expanso to help people embrace the reality they faced every day: they needed a scalable, flexible platform that gave them the same capabilities we had given them with Kubernetes. But before we did, we agreed on a few hard rules.

  • As much as possible, do not make people rewrite. Platforms often fail when value shows up only after a full migration. Most teams have pipelines made of Bash, Python, containers, and vendor tools. Meet them there. Run what they already have.
  • Place compute jobs near inputs. If the bytes are in a bucket in eu-west-1, enable the job to run there. If the bytes are in a hospital network, run inside that network. Move only what you must. And DO NOT force users to think about all the complexity of distributed execution - getting jobs there, keeping them running, reporting status, and monitoring.
  • Embrace heterogeneity. The only universal constant is that everyone is slightly different: edge, on‑prem, many clouds, many policies. Hide that from the developer. Even within a single company, or a single application, things can be highly varied. We needed a system flexible enough to run a common job in lots of uncommon places.
  • Map to how data teams already work. Warehouses and lakes assume distributed consensus and eventual alignment. Let those systems keep doing what they do. Add distributed compute where it helps, then land results back in the same places.
  • Keep the surface simple. Plain scripts, pipelines, binaries, and containers are already well understood. Familiar auth. No new language. No exotic runtime. Let people use what they already know, and invent as little as possible.
  • Build security and compliance into placement. Data residency, least privilege, auditable movement. Defaults that keep raw data in place.
  • Enable the future. The entire world is moving to new paradigms; today it is machine learning, tomorrow it will be something else. Our job is not to invent the future, but to make sure our platform helps people embrace it faster and more reliably.

These are not slogans. They are constraints. They block a lot of "easy" designs. They also keep the system usable in the real world.

Why centralize-only hits a wall

Centralizing everything creates a queue at the center. Every copy, re-encode, and shuffle adds latency and new failure modes. Every region crossing adds cost and legal work. Every schema mismatch becomes a fire drill. Teams patch gaps with one-off batch jobs that become systems of their own. Iteration slows. Bills rise. Compliance risk spreads because too many systems see raw data.

Move compute to data and the shape of the pipeline changes. You cut transfers. You scale by adding more sites, not just more cores in one region. You meet privacy constraints by default because raw data stays put and only the keepers - summaries, features, or model updates - leave. The network stops being the system.

What "distributed data pipelines" means here

I use the term in a specific way. The Expanso control plane places and monitors work across many locations and administrative domains without pulling all data into one place. The Expanso edge means stores, clinics, factories, and devices, not just "small clouds" (though we love small clouds too!). Source means the system that first writes the data: a topic, a bucket, a database, or a sensor.

You describe what should happen near each dataset. Filter. Enrich. Join. Score. Train. The platform decides where and how to run that job given the data's location, the network, the policy, and the available runtime. You keep one workflow. The system executes it across many sites.
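
As a purely illustrative sketch of the idea - not Expanso's actual API or schema, and with made-up field names - a declarative job description separates the "what" from the "where" like this:

    # Hypothetical job description: declare the work and the constraints,
    # and let the scheduler decide placement. Field names are illustrative,
    # not Expanso's real schema.
    job = {
        "name": "filter-and-score-events",
        "task": {
            "image": "ghcr.io/example/event-scorer:1.4",   # your existing container
            "command": ["python", "score.py", "--window", "5m"],
        },
        "inputs": [
            {"source": "s3://telemetry-eu-west-1/raw/"},    # where the bytes already live
        ],
        "constraints": {
            "run_near": "inputs",        # place compute next to the data
            "data_residency": "eu",      # raw data never leaves the region
        },
        "outputs": [
            {"destination": "s3://analytics-results/scores/"},  # only results move
        ],
    }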

Where Expanso fits

Expanso runs your data jobs where your data already is and coordinates them across edge, cloud, and on‑prem. You keep your tools. We minimize data movement and make placement, execution, and audit reliable.

How Expanso works in practice

A few simple rules drive the system.

  • Place jobs near their inputs.
  • Move less by default. Summaries, features, model weights, and alerts move. Raw data does not unless you say so.
  • Adapt to heterogeneity. Different runtimes and policies across sites are abstracted behind one workflow.
  • Respect security and compliance. Identities, audit trails, data residency, and least privilege are part of placement and access.
  • Design for failure. Networks partition. Sites go dark. Work retries with backoff, checkpoints, and safe handoffs (a generic sketch of the retry pattern follows this list). You get clear visibility into what ran, where, on which inputs, and what it produced.
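
To make "retries with backoff" concrete, here is a generic sketch of the pattern in plain Python. It is not Expanso's internal scheduler code, just the standard shape of the technique:

    # Generic retry-with-exponential-backoff pattern; illustrative only.
    import random
    import time

    def run_with_retries(task, max_attempts=5, base_delay=1.0):
        """Run a callable, retrying with exponential backoff plus jitter."""
        for attempt in range(1, max_attempts + 1):
            try:
                return task()
            except Exception:
                if attempt == max_attempts:
                    raise                       # give up and surface the failure
                delay = base_delay * (2 ** (attempt - 1)) + random.random()
                time.sleep(delay)               # back off before the next attempt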

You do not have to replace your stack. Expanso plugs into sources you already use: object storage, streams, and databases. You package code in containers or run simple scripts. Outputs land back in your lakes, warehouses, registries, and dashboards. The control plane gives a single view across all sites.
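
As a small example of "move less by default," a job placed at the source can be an ordinary script that reads raw data locally and ships only an aggregate. The paths and field names below are hypothetical:

    # Minimal "filter at the source, ship only the summary" sketch.
    # Paths and field names are hypothetical.
    import json
    from collections import Counter

    summary = Counter()
    with open("/data/local/events.jsonl") as f:      # raw data stays on-site
        for line in f:
            event = json.loads(line)
            if event.get("severity", 0) >= 3:        # keep only what matters
                summary[event["type"]] += 1

    # Only this small aggregate leaves the site, not the raw events.
    with open("/data/outbox/summary.json", "w") as out:
        json.dump(summary, out)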

The outcomes you should expect

Performance follows locality. Local filtering and aggregation mean the network carries orders of magnitude fewer bytes. Streaming jobs show lower end‑to‑end latency. Batch windows shrink.

Cost tracks the same curve. Egress and inter‑region transfer fees drop. Storage churn drops because you create fewer intermediate copies. Compute costs flatten because you use idle capacity at the edge and across clouds instead of competing for the same central pool.

Security improves because fewer systems see raw data. Policies become data-aware. Jobs run where data is permitted to be processed, not where someone pointed a copy command. Compliance gets simpler to prove with per-site audit logs and explicit data-movement intents baked into the workflow.

What This Unlocks: From Impossible to Practical

This model of moving compute to data turns problems that were previously bottlenecked by the network into tractable, high-value work. By processing data at the source, you unlock new capabilities and efficiencies across industries.

Manufacturing & Industrial IoT

On a modern factory floor, thousands of sensors and high-resolution cameras generate terabytes of data daily to monitor production. Instead of attempting to stream this data deluge to the cloud—a slow and expensive process—Expanso enables you to deploy AI models directly on the factory edge.

  • What it unlocks: Run visual inspection models on the assembly line to catch microscopic defects in real-time. Analyze acoustic and vibration data from heavy machinery to predict failures before they happen. Only alerts, summary statistics, and valuable training examples (e.g., images of new defects) are sent to the cloud.
  • The Outcome: A shift from reactive repairs to predictive maintenance, dramatically reducing downtime. You achieve near-perfect quality control without overwhelming your network, and your central AI models get smarter with curated data from every production line.

Retail & Logistics

A large retail chain has thousands of stores, each a massive source of real-time data from point-of-sale systems, inventory scanners, and security cameras. Centralizing this data to optimize operations means decisions are always based on stale information.

  • What it unlocks: Score transactions for fraud at the register, in milliseconds. Analyze in-store video feeds locally to understand customer foot traffic and optimize layouts. Manage inventory and trigger replenishment orders from the store's back office based on real-time sales data.
  • The Outcome: Drastically reduced credit card fraud and improved customer experience. Supply chains become hyper-efficient, responding to local demand instantly and reducing both stock-outs and overstocking.

Healthcare & Life Sciences

Protected health information (PHI) is one of the most sensitive data types, governed by strict regulations like HIPAA that restrict how and where it can be moved or processed. This creates a huge barrier to large-scale medical research.

  • What it unlocks: Federated data preprocessing for medical research. Instead of pooling sensitive data, the research pipeline is sent to run inside the secure perimeter of each hospital. The pipelines run locally (and in advanced cases, models can even be trained locally), and only non-sensitive, aggregated updates (weights and parameters) are sent back to a central server to be combined into a global model; a minimal sketch of that averaging step follows this list.
  • The Outcome: Researchers can build powerful diagnostic AI from diverse, global datasets without ever compromising patient privacy. Hospitals participate in cutting-edge research while maintaining complete data sovereignty, accelerating breakthroughs in medicine.
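
For readers who want the aggregation step spelled out, here is a minimal sketch of simple federated averaging, assuming each hospital returns a plain weight vector. It is illustrative only, not Expanso-specific code:

    # Simple federated averaging: each site sends back only weight updates,
    # never patient records. The structures here are hypothetical.
    def average_weights(site_updates):
        """site_updates: list of weight vectors, one per hospital."""
        n_sites = len(site_updates)
        n_weights = len(site_updates[0])
        return [
            sum(update[i] for update in site_updates) / n_sites
            for i in range(n_weights)
        ]

    # Example: three hospitals, each returning a small weight vector.
    global_model = average_weights([
        [0.10, 0.52, 0.31],
        [0.12, 0.49, 0.35],
        [0.08, 0.55, 0.30],
    ])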

Finance & Insurance

Financial institutions operate globally, but customer data is subject to strict, country-specific data protection and residency laws such as GDPR. This makes it a legal and technical nightmare to run global risk analysis or detect sophisticated, cross-border fraud patterns.

  • What it unlocks: Run regional fraud detection and risk models within each legal jurisdiction. These local jobs can analyze raw transaction data in-country and output anonymized, aggregated risk scores or fraud alerts.
  • The Outcome: A real-time, global view of risk and financial crime that is built from locally compliant insights. The bank can protect itself and its customers without the immense cost and legal peril of moving raw financial records across borders.

Transportation & Autonomous Vehicles

A single autonomous vehicle can generate over a terabyte of sensor data per hour. Multiplying that across a fleet makes centralized processing a physical and economic impossibility.

  • What it unlocks: Process the vast majority of sensor and video data on the vehicle itself or at nearby 5G edge nodes. Filter for and upload only the rare, critical events—like a near-accident or an encounter with a new type of obstacle—that are needed to retrain the core driving models.
  • The Outcome: Massively accelerated training cycles for self-driving AI at a fraction of the data transmission cost. This makes developing and improving safer, more efficient autonomous systems economically viable.

Energy & Utilities

The modern energy grid is a distributed network of solar farms, wind turbines, and smart meters. Centralized control is too slow to react to rapid changes in renewable energy production or local demand, leading to instability and inefficiency.

  • What it unlocks: Deploy forecasting and load-balancing algorithms at the substation or directly at the renewable generation site. These edge jobs can make millisecond-level decisions to optimize power distribution locally, diverting energy to storage or adjusting output based on real-time conditions.
  • The Outcome: A more stable, resilient, and efficient power grid. It enables faster response to fluctuations, maximizes the use of green energy, and can even predict and isolate faults before they cause regional blackouts.

A familiar pattern, applied to data

If you lived through the microservices shift, you know this story. We broke monoliths to move faster. Then we needed orchestration to keep the system from collapsing under its own weight. Data pipelines have already done the first half. Sources are now everywhere. The orchestration piece is overdue.

Compute over data is that step. Describe the work once. Run it where the bytes live. Move only what adds value. Keep a single view of execution across sites.

We built Expanso for that world. If your network has become your system, what part of your pipeline would you run at the source first?