Orchestration Allows Microservices to Be Unreliable (And That's a Good Thing)
One of the first features I wanted to build for Kubernetes was service workflows: Service A starts, then B, then C. If B fails, A should know, and C shouldn't panic. Services need to know when their dependencies are ready.
On a whiteboard, that sounded trivial. In production, it's a nightmare.
Is a service healthy if the container starts but blocks on I/O? What if the probe returns "OK" while the API stalls for ten seconds? What if a job needs two dependencies and only one appears? These edge cases turn start-up ordering into a distributed minefield.
I showed the plan to Brian Grant, who had lived through every permutation of failure inside Google. He shook his head: "You're solving the wrong problem. Treat startup quirks as just another failure mode. Build for failure, full stop."
That was the first time I internalized a hard truth: all containers, all nodes, all networks fail—often and in weird ways. Workflows hide the issue instead of fixing it. Orchestration embraces it.
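In practice, "build for failure" often means a service never tries to sequence its own startup against anything else. It probes for what it needs, and if a dependency isn't there, it simply exits and lets the platform restart it on a backoff schedule. Here's a minimal Python sketch of that pattern, with a made-up billing dependency standing in for whatever the service actually needs:

    import sys
    import urllib.error
    import urllib.request

    def require(url):
        """Probe a dependency once. If it isn't ready, exit nonzero and let
        the orchestrator restart us; in Kubernetes the kubelet retries the
        container with increasing backoff (CrashLoopBackOff), which replaces
        any hand-rolled startup-ordering workflow."""
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if resp.status == 200:
                    return
        except (urllib.error.URLError, OSError):
            pass
        sys.exit("dependency not ready yet: " + url)

    require("http://billing/healthz")  # hypothetical dependency
    # ...only now start serving traffic...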
What Orchestration Actually Provides
People say orchestration "deploys containers." That's like saying a traffic system "turns lights green." The value is movement, not electricity.
Orchestration provides four fundamental capabilities that transform chaos into reliability:
Failure Handling: A crashed service is ejected from load balancers, restarted, and watched until healthy. Users never see the glitch.
Service Discovery: A service asks, "Where's auth today?" and gets a fresh, healthy endpoint. When auth moves or scales, answers update instantly.
Load Balancing: Requests flow to instances that are alive, local, and lightly loaded. The orchestrator reroutes on millisecond signals.
Scaling Decisions: Metrics across the fleet drive placement: CPU, latency, queue depth, even downstream saturation the code never measured.
This isn't just automation. It's emergent intelligence.
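From an application's point of view, most of this machinery is invisible. Take service discovery: in Kubernetes, a Service named auth is resolvable through cluster DNS, so the calling code asks for the service by name and the platform routes the request to a healthy instance. A small sketch, where the auth service name and /verify endpoint are invented for illustration:

    import json
    import urllib.request

    def verify_token(token):
        """Call the auth service by name. Cluster DNS resolves "auth" to the
        Service's address and the request lands on a ready pod, so this code
        never tracks IPs, instance lists, or who restarted overnight."""
        req = urllib.request.Request(
            "http://auth/verify",  # hypothetical endpoint behind a Service
            data=json.dumps({"token": token}).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req, timeout=2) as resp:
            return json.load(resp)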
The Reliability Paradox
Here's the counterintuitive part: well-architected microservices are individually less reliable than a monolith. They fail more often and in more unpredictable ways. This sounds like a disaster, but it's the secret to their strength.
When you build a monolith, you bet everything on one giant process that must never fail. When it does, everything stops. Your application is a single point of failure disguised as a single point of deployment.
When you build with an orchestrator, you design for constant, small failures. Services crash and restart. Network calls time out and retry. The system treats failure as a normal part of operations, not a crisis. The result is a system that's antifragile; it gets stronger under stress because it's constantly practicing recovery.
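Concretely, designing for constant, small failures mostly means bounding every network call and assuming it will sometimes fail. A hedged sketch of the client side of that bargain; the retry budget, timeout, and backoff numbers are illustrative, not recommendations:

    import random
    import time
    import urllib.error
    import urllib.request

    def call_with_retries(url, attempts=3, timeout=1.0):
        """Bound the call with a timeout, retry a few times with jittered
        backoff, then surface the failure. The orchestrator and load balancer
        do the rest: unhealthy instances drop out of rotation and
        replacements come up while we retry."""
        for attempt in range(attempts):
            try:
                with urllib.request.urlopen(url, timeout=timeout) as resp:
                    return resp.read()
            except (urllib.error.URLError, OSError):
                if attempt == attempts - 1:
                    raise  # retry budget spent; let the caller decide
                time.sleep((2 ** attempt) * 0.1 + random.uniform(0, 0.1))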
Why the "Microservices Are Unreliable" Complaint Misses the Point
Every few months, a blog post declares a return to the monolith because microservices are "too complex." Retreating to the monolith solves the wrong problem.
The complexity isn't in the microservices; it's in the distributed system you were already building. Your monolith talks to databases, caches, and external APIs. You still have network calls and partial failures. You just pretend they don't exist until they bring down the entire application.
Microservices make that complexity visible. Orchestration makes it manageable. You have to build services that tolerate failure; the orchestrator handles the recovery. Build for failure, and the rest falls into place.
The Bridge to Data
After years of seeing this pattern with compute, one thing became painfully clear: we learned to break apart application monoliths, but we're still building data monoliths.
I've seen the same pattern at nearly every company I've coached or worked at. Teams master microservices for their applications, but their data architecture is stuck in 1995. Everything gets copied to a central warehouse before any processing can happen, often with massive s3 cp jobs that take hours or days. Data pipelines are brittle, monolithic scripts that fail catastrophically and require manual intervention.
We spent a decade learning that distributed compute needs orchestration, then promptly forgot that lesson when dealing with data.
Why Data Pipelines Need the Same Orchestration Revolution
The principles that made microservices successful apply directly to data processing. We need the following (sketched in code after this list):
Service Discovery for Data: An analytics job shouldn't need to know where data lives. It should ask an orchestrator, "Where is the latest sensor data?" and get a direct, actionable answer.
Load Balancing for Processing: Route jobs based on data locality and resource capacity, not just round-robin.
Failure Handling for Pipelines: When a data processing job fails, the orchestrator should handle retries, route around failed nodes, and maintain processing guarantees automatically.
Automated Scaling for Data: The orchestrator should watch data volumes and processing queues to make intelligent scaling decisions for the entire data pipeline.
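To make that concrete, here is a rough sketch of the programming model this implies. The DataOrchestrator client and its find_latest and submit calls are hypothetical, not a real library; the point is the shape of "ask where the data is, then send the job to it," not any particular API:

    # Hypothetical client: none of these names refer to a real library.
    class DataOrchestrator:
        def find_latest(self, dataset):
            """Return the locations (region, node) holding the newest partition."""
            ...

        def submit(self, job, placement):
            """Run the job where the data already lives; retries, routing
            around failed nodes, and scaling are the orchestrator's problem."""
            ...

    orchestrator = DataOrchestrator()

    # Service discovery for data: ask, don't hard-code warehouse paths.
    location = orchestrator.find_latest("sensor-data")

    # Load balancing for processing: place the compute next to the bytes.
    # Failure handling and scaling belong to the orchestrator, not this script.
    orchestrator.submit(job="aggregate_hourly.py", placement=location)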
This isn't a theoretical exercise. At Expanso, we're applying the same battle-hardened orchestration principles from compute to the world of distributed data. The teams that master data orchestration will have the same edge that early Kubernetes adopters had. They'll move faster, build more reliable systems, and spend less time on operational firefighting.
The data orchestration transformation is just beginning.
What's your experience with microservice reliability? Have you seen the same monolithic patterns emerge in your data pipelines?