The Myth of Fungible Data
Gartner recently coined a new term: geopatriation. They're predicting that 75% of European and Middle Eastern enterprises will move workloads out of global clouds and back to local or sovereign environments by 2030, up from less than 5% today.
The industry response has been predictable. Compliance teams are spinning up new initiatives. Cloud providers are announcing sovereign cloud offerings. Microsoft is pledging $80 billion for AI data centers specifically to keep processing in-country for EU users. Everyone's treating this as a regulatory burden to be managed.
They're missing the point entirely.
The Assumption Nobody Questions
For two decades, we've operated under a belief so fundamental it rarely gets examined: data is fungible. A byte in Singapore is equivalent to a byte in Stuttgart. A sensor reading from a factory in Shenzhen carries the same informational weight as one from a plant in Sheffield.
This assumption is baked into every data lake architecture, every "single source of truth" initiative, every cloud migration business case. Pull all your data into a central repository, the thinking goes, and you unlock its collective value. Economies of scale. Network effects. The whole greater than the sum of its parts.
The assumption made sense when data meant transactional records and documents. An invoice is an invoice. A contract is a contract. Strip away the origin, aggregate at scale, run your analytics.
But that's not what enterprise data looks like anymore.
What Data Actually Is Now
Seventy-five percent of enterprise data is now generated outside traditional data centers. IoT sensors, edge devices, manufacturing equipment, connected vehicles, smart infrastructure. We're drowning in readings, measurements, and signals from 30.9 billion connected devices.
Consider a temperature sensor on a factory floor in Bavaria. It reports 32°C. Simple enough. Aggregate it with readings from your facilities worldwide, feed it into your predictive maintenance model, optimize globally.
Except that 32°C isn't just a number. It's 32°C on a specific machine with a specific thermal history. It's 32°C during a shift with specific operators running a specific production batch. It's 32°C under local humidity conditions that affect what that temperature means for equipment stress. It's 32°C calibrated against local standards, meaningful relative to that plant's baseline rather than some global average.
And it's 32°C subject to the EU Data Act governance requirements that came into force in September 2025, which now extends sovereignty considerations to non-personal industrial data.
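To make that concrete, here's a rough sketch in Python. The field names are mine, not any standard schema, but they capture the difference between what the reading carries at the source and what survives a typical ingestion pipeline.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# A minimal sketch of what a context-rich reading might carry.
# Field names are illustrative, not a standard schema.
@dataclass
class ContextualReading:
    value_c: float                # the number everyone aggregates
    machine_id: str               # which machine, with its own thermal history
    shift: str                    # who was running it, and on what batch
    batch_id: str
    ambient_humidity_pct: float   # local conditions that change what the value means
    calibration_ref: str          # the local calibration standard
    site_baseline_c: float        # "normal" for this plant, not a global average
    jurisdiction: str             # governance travels with the data (e.g. EU Data Act)
    recorded_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

reading = ContextualReading(
    value_c=32.0, machine_id="press-07", shift="night", batch_id="B-4411",
    ambient_humidity_pct=61.0, calibration_ref="DIN-site-2024",
    site_baseline_c=27.5, jurisdiction="EU",
)

# What a typical centralized pipeline keeps after ingestion:
stripped = {"value_c": reading.value_c, "ts": reading.recorded_at.isoformat()}
print(stripped)  # a number and a timestamp; everything that made it meaningful is gone
```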
Strip that reading from its context, transmit it to a central lake in Virginia, aggregate it with readings from twelve other countries, and what have you created? You've created noise that looks like signal. You've destroyed the very context that made the measurement valuable.
The Thousand Shards Problem
This isn't an edge case; it's the new normal.
IoT data exhibits systematic heterogeneity that traditional analytics architectures can't handle. Timestamp ranges vary by device. Sampling frequencies differ based on local conditions. Geographic locations carry meaning that can't be captured in metadata. Units of measurement follow local conventions. Calibration standards differ by jurisdiction.
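Here's a toy illustration of what that does to naive normalization. The records and fields are invented, but the failure mode is the real one: once you force everything into a common schema, two very different realities become the same row.

```python
# Two sites reporting the "same" measurement in incompatible ways.
# The records and field names here are invented for illustration.
site_a = {"temp": 89.6, "unit": "F", "period_s": 5,  "calib": "site-A-2023", "baseline": 71.6}
site_b = {"temp": 32.0, "unit": "C", "period_s": 60, "calib": "DIN-site-2024", "baseline": 30.0}

def to_common_schema(record):
    """Naive normalization: convert units, keep only what fits the global schema."""
    celsius = (record["temp"] - 32) * 5 / 9 if record["unit"] == "F" else record["temp"]
    return {"temp_c": round(celsius, 1)}  # calibration, sampling rate, baseline: dropped

print([to_common_schema(r) for r in (site_a, site_b)])
# [{'temp_c': 32.0}, {'temp_c': 32.0}]
# Identical rows, yet one site is 10 degrees over its local baseline
# and the other is barely 2 over, i.e. running routinely warm.
```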
Manufacturing facilities generate enormous sensor data volumes, but much of it sits in isolated systems because the preprocessing required to make it compatible with centralized analytics destroys what makes it useful. Smart buildings across a portfolio can't easily share insights because spatial and structural differences make direct comparison meaningless without deep contextual understanding.
The reason is that the data exists as contextually unique shards. Each shard carries meaning that's intrinsic to its origin. That meaning isn't metadata you can attach after the fact. It's not a tag you can add during ingestion. It's fundamental to what the data represents.
Why Context Isn't Metadata
The industry's instinctive response has been to add more metadata. Enrich the data with location tags, timestamp normalization, source identifiers. Build elaborate data catalogs and lineage tracking systems.
This solves the wrong problem.
Consider what happens when a predictive maintenance model trained on aggregated global data tries to predict failures at a specific facility. The model has learned patterns across thousands of machines in dozens of environments. It has no understanding that this particular machine runs hotter because of its position near a loading dock that opens frequently in winter. It doesn't know that the operators on the night shift have developed workarounds for a quirk in the control system that affects sensor behavior. It can't account for the fact that local power grid fluctuations create measurement artifacts that look like early-stage bearing failures.
The model will generate predictions that look authoritative. They'll be systematically wrong in ways that are hard to diagnose, because the errors don't look like errors; they look like normal variance.
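A toy example, with invented numbers, of how that plays out: a single global threshold false-alarms on the plant that runs hot and misses the real excursion at the plant that runs cool, while a rule anchored to each site's own baseline gets both right.

```python
# Toy numbers, invented for illustration: a single global anomaly threshold
# versus per-site baselines.
import statistics

readings = {
    "bavaria":   [31.8, 32.1, 32.0, 31.9],   # runs hot: sits next to a loading dock
    "sheffield": [24.0, 24.2, 31.9, 24.1],   # 31.9 here is a genuine excursion
}

global_mean = statistics.mean(v for site in readings.values() for v in site)
global_threshold = global_mean + 3.0          # one-size-fits-all rule

for site, values in readings.items():
    local_threshold = statistics.median(values) + 3.0   # rule anchored to this site
    for v in values:
        global_flag = v > global_threshold
        local_flag = v > local_threshold
        if global_flag != local_flag:
            print(f"{site} {v}: global says {global_flag}, local says {local_flag}")

# bavaria 32.1: global says True, local says False    (false alarm)
# sheffield 31.9: global says False, local says True  (missed failure signature)
```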
This is why 95% of enterprise GenAI pilots fail to deliver measurable business impact. It's not the models. The models are remarkably capable. It's that we're feeding them context-stripped data and expecting them to reconstruct meaning that was destroyed before they ever saw it.
Gartner predicts that 60% of AI projects will be abandoned through 2026 due to data that isn't AI-ready. The phrasing suggests the data needs to be transformed somehow, processed into a form AI can use. The reality is simpler and harder: the data was fine where it was. We broke it by moving it.
The Economic Inversion
For years, the economics of centralization seemed obvious. Compute was expensive and scarce, and data was cheap to move. Concentrate your processing power, ship data to it, achieve economies of scale.
That math has inverted.
Compute costs have collapsed everywhere except specialized AI accelerators. You can run sophisticated analytics on commodity hardware at the edge for pennies. Meanwhile, data transfer costs scale linearly with volume. Latency costs scale with distance. And compliance costs scale with every jurisdiction your data crosses.
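A back-of-the-envelope sketch, with illustrative numbers rather than anyone's actual rate card, shows how quickly the math turns against shipping everything:

```python
# Back-of-the-envelope sketch with illustrative numbers, not anyone's rate card.
RAW_GB_PER_SITE_PER_DAY = 50           # assumed raw sensor volume per facility
SITES = 40
EGRESS_PER_GB = 0.08                   # assumed $/GB for cross-region transfer
EDGE_COMPUTE_PER_SITE_PER_DAY = 1.50   # assumed $/day to summarize locally
SUMMARY_GB_PER_SITE_PER_DAY = 0.5      # insights shipped instead of raw streams

ship_everything = RAW_GB_PER_SITE_PER_DAY * SITES * EGRESS_PER_GB * 365
process_at_edge = (EDGE_COMPUTE_PER_SITE_PER_DAY
                   + SUMMARY_GB_PER_SITE_PER_DAY * EGRESS_PER_GB) * SITES * 365

print(f"ship everything:  ${ship_everything:>9,.0f}/yr")   # ~$58,400
print(f"process at edge:  ${process_at_edge:>9,.0f}/yr")   # ~$22,484
# And this ignores the centralized compute bill and the per-jurisdiction
# compliance overhead, both of which only widen the gap.
```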
137 countries now have data protection laws with varying localization requirements. The EU, China, India, Brazil, and dozens of other markets impose restrictions on cross-border data flows. Each transfer creates compliance overhead. Each jurisdiction crossing creates audit requirements.
The cloud providers understood this before most enterprises did. That's why they're building sovereign regions rather than arguing against sovereignty requirements. Distributed processing is becoming cheaper than centralized processing once you account for the full cost of data movement.
What Geopatriation Actually Reveals
This brings us back to geopatriation. Gartner frames it as enterprises responding to regulatory pressure. That's the surface story. The deeper dynamic is that enterprises are discovering, jurisdiction by jurisdiction, that their centralized data architectures were destroying value all along.
The sovereignty requirements didn't create the problem. They revealed it.
When you're forced to process data locally, you suddenly discover that local processing produces better results. The predictive maintenance model trained on just this facility's data outperforms the global model. The demand forecasting system using regional signals catches patterns the centralized system missed. The quality control algorithm running at the edge catches defects the cloud-based system never saw because the relevant context was lost in transit.
Enterprises that have implemented edge processing for compliance reasons are quietly discovering they should have done it for performance reasons years ago. The regulatory mandate became an accidental forcing function for better architecture.
The Path Forward
None of this means complete data localization. That creates its own problems: duplicated effort, inconsistent insights, the inability to identify patterns that genuinely do span geographies.
The opportunity is more nuanced. Process data where it lives when the context matters. Move insights rather than raw material. Build systems that understand data's contextual nature rather than assuming it away.
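In practice, "move insights rather than raw material" can be as simple as summarizing against the local baseline on site and shipping only the summary. A minimal sketch, with an invented publish() standing in for whatever transport you'd actually use:

```python
# A minimal sketch of "move insights, not raw material". publish() is a
# stand-in for whatever transport you actually use; the numbers are invented.
def summarize_shift(raw_readings, site_baseline_c):
    """Runs on site, where the context lives; only the summary leaves."""
    deviations = sorted(r - site_baseline_c for r in raw_readings)
    return {
        "n": len(deviations),
        "max_over_baseline_c": round(deviations[-1], 2),
        "p95_over_baseline_c": round(deviations[int(0.95 * len(deviations))], 2),
        "anomaly": deviations[-1] > 5.0,   # threshold tuned to this site, not globally
    }

def publish(payload):
    print("shipping", payload)   # a few hundred bytes instead of a day of raw samples

raw = [27.4, 27.9, 28.1, 33.2, 27.6] * 1000   # the raw stream never leaves the site
publish(summarize_shift(raw, site_baseline_c=27.5))
```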
Some data genuinely is fungible. Financial transactions, once properly recorded, can be aggregated globally. Document repositories can be centralized without losing meaning. But the fastest-growing categories of enterprise data, the sensor readings and IoT streams and operational measurements that increasingly drive competitive advantage, aren't fungible at all. They're shards of meaning that exist in relationship to their origin.
Geopatriation isn't a retreat from the cloud era. It's the beginning of a more honest reckoning with what data actually is and where its value actually lives.
The enterprises that figure this out first will build genuinely intelligent systems. The ones that keep pretending a byte in Singapore equals a byte in Stuttgart will keep wondering why their AI initiatives produce impressive demos and disappointing results.
The data was never fungible. We just told ourselves it was because centralization was easier to build.
Want to learn how intelligent data pipelines can reduce your AI costs? Check out Expanso. Or don't. Who am I to tell you what to do.
NOTE: I'm currently writing a book, based on what I've seen, about the real-world challenges of data preparation for machine learning, focusing on operational, compliance, and cost concerns. I'd love to hear your thoughts!