The ASIC Unbundling

Inference is two-thirds of all AI compute. Almost all of it still runs on chips designed for training. The market is finally noticing the mismatch.

Inference accounts for two-thirds of all AI compute now, and nearly all of it runs on chips that were designed for a fundamentally different job. The industry spent a decade optimizing silicon for training runs, concentrated bursts of GPU-saturating computation measured in days or weeks. Serving a model in production looks nothing like that. It's millions of small requests, around the clock, where the metric that matters is cost-per-token, not time-to-convergence.
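
To see why cost-per-token is the right lens, here's a back-of-the-envelope sketch in Python. Every number in it (hourly price, throughput, utilization) is a hypothetical placeholder, not a figure from any vendor:

```python
# Back-of-the-envelope serving economics. All numbers are hypothetical.
gpu_hourly_cost = 4.00        # $/hour for one accelerator
tokens_per_second = 2_500     # sustained decode throughput
utilization = 0.45            # bursty traffic rarely saturates the chip

tokens_per_hour = tokens_per_second * utilization * 3_600
cost_per_million_tokens = gpu_hourly_cost / tokens_per_hour * 1_000_000

print(f"effective tokens/hour: {tokens_per_hour:,.0f}")         # 4,050,000
print(f"cost per 1M tokens:    ${cost_per_million_tokens:.2f}")  # ~$0.99

# For training, the question is "how many GPU-hours to convergence?"
# For serving, every fraction of a cent per token compounds forever.
```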

That mismatch has been hiding in plain sight for years, papered over by Nvidia's extraordinary execution and the industry's collective unwillingness to question the GPU monoculture. But in the last six months, something structural has shifted. The monoculture is cracking, and the cracks are worth paying attention to.

The chips nobody's buying (yet)

Start with what Intel and Google announced last week: a multiyear collaboration on custom ASIC-based Infrastructure Processing Units. IPUs are not glamorous. They don't show up in benchmark wars or keynote demos. What they do is offload the unsexy infrastructure work (networking, packet processing, storage management, security enforcement) away from host CPUs so those CPUs can focus on actual computation.

This is a profoundly unglamorous investment. It's also exactly right.

Google didn't wake up one morning and decide IPUs sounded fun. They've been co-developing these chips with Intel since 2022, iterating through multiple generations, because at hyperscale the overhead costs of general-purpose infrastructure become staggering. When you're running millions of inference requests per second, the tax that networking and storage impose on your CPUs isn't a rounding error. It's a line item that dwarfs most companies' entire compute budgets.

The IPU bet reveals something important about where Google thinks the bottleneck actually lives. It's not raw FLOPS. It's the ratio of useful computation to total computation, and purpose-built silicon shifts that ratio dramatically.
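
A toy calculation makes the point. If you assume (and this fraction is purely illustrative, not an Intel or Google number) that 30% of host CPU cycles go to networking, storage, and security, then offloading that work to an IPU buys back roughly 43% more useful capacity without adding a single server:

```python
# Toy model of the useful-to-total computation ratio. The 30% overhead
# fraction is an illustrative assumption, not a published figure.
host_cpus = 10_000          # CPUs in a hypothetical fleet
infra_overhead = 0.30       # cycles lost to networking, storage, security

useful_before = host_cpus * (1 - infra_overhead)   # 7,000 CPU-equivalents
useful_after = host_cpus                           # IPU absorbs the overhead

gain = useful_after / useful_before - 1
print(f"capacity recovered by offload: {gain:.0%}")  # ~43%
```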

The monoculture cracks

Google's IPU play would be interesting on its own. But it's not on its own. Custom ASIC shipments are growing at 44.6% annually while GPU shipments grow at 16.1%, and the roster of companies building their own silicon has gotten long enough to be structurally significant.
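
Those two growth rates compound into very different futures. Here's a quick sketch; only the 44.6% and 16.1% rates come from the shipment data above, while the starting volumes are hypothetical, chosen just to show how fast the mix shifts:

```python
# Compounding the shipment growth rates above from hypothetical bases.
# Only the 44.6% and 16.1% rates come from the text; the starting
# volumes (1 and 5 arbitrary units) are made up.
asic_units, gpu_units = 1.0, 5.0

for year in range(1, 9):
    asic_units *= 1.446
    gpu_units *= 1.161
    share = asic_units / (asic_units + gpu_units)
    print(f"year {year}: ASIC share of shipments ~ {share:.0%}")

# Under these assumptions, ASICs pass half of all shipments around year 8.
```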

The roster covers just about everyone. Google's TPU Ironwood is now in its seventh generation, a decade into a custom silicon program that most of the industry dismissed as a science project in 2015. Amazon's Trainium3 ships on TSMC's 3nm process and stacks 144 chips into liquid-cooled racks that Amazon claims match Blackwell at rack-level performance. Microsoft's Maia 200 is deployed in Azure data centers, with Microsoft claiming 3x the FP4 performance of Trainium3. Meta has its MTIA inference accelerator. And OpenAI, which built its entire business on Nvidia GPUs, is now working with Broadcom on custom ASICs.

When the company that owes its existence to Nvidia hardware starts designing custom silicon with Broadcom, you would not be overreaching to say the assumptions underneath the GPU monoculture are changing.

The metric that matters

The conventional framing of all this is a "chip war," with AMD and Nvidia slugging it out over who has the bigger FLOPS number. That framing misses the point entirely.

AMD claims its MI400X delivers 40% more tokens-per-dollar than Nvidia's B200. Not 40% more raw performance. 40% more output per unit of cost. AMD is projecting 10x the performance of the MI300X and a 35x generational improvement in inference specifically, with rack-scale systems designed from the ground up for the workload that actually dominates production.
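
The two metrics can rank the same chips in opposite order. A minimal sketch, with entirely hypothetical chips and numbers (these are not AMD or Nvidia specs), shows how a part with fewer peak FLOPS can still win on tokens-per-dollar:

```python
# Hypothetical chips: neither row is an AMD or Nvidia spec.
# name: (peak_tflops, tokens_per_second_served, dollars_per_hour)
chips = {
    "chip_a": (2_000, 3_000, 6.00),   # the bigger FLOPS number
    "chip_b": (1_400, 2_800, 3.50),   # fewer FLOPS, cheaper to run
}

for name, (tflops, tps, cost) in chips.items():
    tokens_per_dollar = tps * 3_600 / cost
    print(f"{name}: {tflops:,} peak TFLOPS, {tokens_per_dollar:,.0f} tokens/$")

# chip_b wins the serving metric (~2.9M vs ~1.8M tokens/$) despite
# losing the FLOPS war by 30%.
```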

This is the shift that matters. When the industry metric moves from peak FLOPS (how fast can you train?) to tokens-per-dollar (how cheaply can you serve?), the entire value hierarchy rearranges. Nvidia built an extraordinary cathedral for training workloads. The market is discovering it needs a bazaar for inference.

And the bazaar, by its nature, is diverse. Different chips for different jobs, optimized for different tradeoffs. Google's TPU was purpose-built for the inference era. Amazon's Trainium is optimized for its own model architectures. Microsoft's Maia is tuned for Azure's specific workload mix. These aren't interchangeable commodities. They're specialized tools, each shaped by the particular constraints of the environment they serve.

The architecture lesson

If you've spent any time thinking about distributed systems, this pattern should feel deeply familiar.

The story of cloud computing's first decade was centralization: take everything, put it in one place, run it on general-purpose hardware. The story of the second decade has been the slow, sometimes painful recognition that centralization creates its own pathologies. Data gravity. Latency constraints. Regulatory boundaries. Cost structures that look elegant at small scale and punishing at large scale.

The hardware layer is now learning the same lesson. The GPU monoculture was its own form of centralization: one architecture, one vendor, one set of assumptions about what computation looks like. It worked brilliantly when AI was primarily a training problem, when the task was concentrated and the metric was speed-to-model. But inference is a fundamentally different beast. It's distributed by nature (happening everywhere, all the time, at wildly varying scales), heterogeneous by necessity (different models, different latency requirements, different cost constraints), and relentless in its demand for efficiency over raw power.

The ASIC unbundling is the hardware industry catching up to what the software architecture world has been learning for a decade: that the right computation, on the right hardware, at the right location, beats brute-force centralized processing. Every time.

New world, new risks

Specialization creates capability, but it also creates fragility.

When everyone ran on Nvidia GPUs, the supply chain was simple: Nvidia designed the chips, TSMC fabbed them, you bought them (if you could get them). The new world is structurally more complex. Google's TPUs run on different fab processes than Amazon's Trainium. Microsoft's Maia has different supply chain dependencies than Meta's MTIA. Intel's IPUs come from Intel's own foundries while AMD's MI400 series relies on TSMC's most advanced nodes.

This is silicon biodiversity, and like biological biodiversity, it's simultaneously more resilient (no single point of failure) and more complex (more failure modes, harder to manage). The industry is trading one kind of concentration risk (Nvidia dependency) for another kind of complexity risk (fragmented supply chains, incompatible ecosystems, balkanized software stacks).

The companies that navigate this well will be the ones that treat hardware heterogeneity as a first-class architectural concern, not an afterthought. That means workload orchestration that's silicon-aware. Deployment pipelines that can target different accelerators based on cost, latency, and availability. Data infrastructure that moves computation to where it makes sense, rather than assuming everything funnels through one type of chip in one type of data center.
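
What might silicon-aware orchestration look like in practice? Here's a deliberately simplified sketch (the fleet, names, and numbers are all hypothetical): pick the cheapest available accelerator that still meets the workload's latency budget.

```python
from dataclasses import dataclass

@dataclass
class Accelerator:
    name: str
    cost_per_hour: float     # $/hour
    p99_latency_ms: float    # measured serving latency for this model
    available: bool

# A hypothetical heterogeneous fleet; names and numbers are illustrative.
FLEET = [
    Accelerator("gpu-hbm",        12.0,  40, available=True),
    Accelerator("inference-asic",  4.0,  70, available=True),
    Accelerator("edge-npu",        0.5, 180, available=False),
]

def place(latency_budget_ms: float) -> Accelerator:
    """Cheapest available accelerator that meets the latency budget."""
    candidates = [a for a in FLEET
                  if a.available and a.p99_latency_ms <= latency_budget_ms]
    if not candidates:
        raise RuntimeError("no accelerator satisfies the latency budget")
    return min(candidates, key=lambda a: a.cost_per_hour)

print(place(50).name)    # interactive chat    -> gpu-hbm
print(place(100).name)   # batch summarization -> inference-asic
```

Real orchestrators layer on availability forecasting, data locality, and model-to-kernel compatibility, but the core decision loop looks like this: constraints first, then cost.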

Ok, so what do I do with this unbundling?

The ASIC unbundling isn't really about who wins the chip war. Nvidia will remain enormously important. AMD will take share. The hyperscaler custom chips will serve their own ecosystems.

The deeper significance is what happens when computation itself becomes heterogeneous. When the right chip for the job depends on where you're computing, what you're computing, and what tradeoffs you're willing to accept. When inference at the edge runs on different silicon than inference in the cloud, which runs on different silicon than training, which runs on different silicon than the infrastructure plumbing that holds it all together.

This is an architecture problem, and it's the same one that distributed systems engineers have been wrestling with for years: how do you build coherent systems out of fundamentally heterogeneous components?

The companies that understand this are already building for it. We (Expanso) certainly are :) The ones that don't will keep buying the most expensive GPU they can find and wondering why their inference costs won't come down.

The GPU monoculture served us well. It's time to let it go.


Want to learn how intelligent data pipelines can reduce your AI costs? Check out Expanso. Or don't. Who am I to tell you what to do.

NOTE: I'm currently writing a book based on what I've seen of the real-world challenges of data preparation for machine learning, focusing on operations, compliance, and cost. I'd love to hear your thoughts!