While Everyone Argues About AI Regulation, Data Is the Real Wild West
Last month, Colorado's AI Act went into full effect. California's considering similar legislation. New York has its own version in committee. Meanwhile, Sam Altman sat across from Brad Gerstner and said something quietly alarming: "I don't know how we're supposed to comply with that Colorado law. I would love them to tell us what we're supposed to do."
Not "we disagree with it." Not "it's too burdensome." Simply: we don't know what compliance even means.
I GET the intent here - a clean regulatory framework is an accelerant to AI development, not a source of friction. When you have things that many people, including prominent AI researchers, are saying could bring about the downfall of humanity, a little regulatory thought seems... good?
However, I also agree that fifty states creating fifty different interpretations of concepts like "algorithmic discrimination" is not ideal. This patchwork approach, as we've already seen with state-level privacy laws, will likely just result in a bunch of lawyers figuring out fifty different ways for companies to get sued by someone claiming harm from a chatbot.
But I ALSO think nobody's talking about the bottom two-thirds of the iceberg: while we're building this elaborate regulatory framework for AI, we've created zero coherent rules for the thing AI actually runs on. Data.
The Infrastructure Nobody Regulates
Jensen Huang dropped a number that should make your head spin: inference workloads are about to scale by a billion times. Not 10x. Not 100x. A billion times. Chain-of-reasoning models think through problems step by step, burning tokens at an unprecedented rate.
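To see why that number isn't crazy, run the arithmetic yourself. Here's a back-of-envelope sketch; every constant is an illustrative assumption of mine, not a figure Jensen cited:

```python
# Back-of-envelope: how chain-of-reasoning inflates inference load.
# Every constant is an illustrative assumption, not a measured figure.

CHAT_TOKENS_PER_QUERY = 500        # a typical single-shot chat answer
REASONING_MULTIPLIER = 50          # hidden "thinking" tokens per visible answer
QUERIES_PER_DAY = 1e9              # assume a large consumer service

chat_load = CHAT_TOKENS_PER_QUERY * QUERIES_PER_DAY
reasoning_load = chat_load * REASONING_MULTIPLIER

print(f"plain chat: {chat_load:.2e} tokens/day")
print(f"reasoning:  {reasoning_load:.2e} tokens/day")
# Stack more multipliers on top (agents calling agents, retries, tool
# use) and "a billion times" stops sounding like hyperbole.
```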
Now map that against what Satya Nadella said last week: Microsoft is "short on power and infrastructure" and has been for "many quarters." They're not compute-constrained in the traditional sense—they literally can't plug in all the GPUs they have because they don't have enough warm shells near power sources.
Connect these dots. We're about to generate a billion times more inference compute. That compute requires data: training data, context data, real-time data. And all that data has to move.
Where's it moving? Not just to hyperscale datacenters in Virginia and Oregon. It's moving to edge devices, to on-premises deployments, to robots on factory floors. Moving across state lines, across national borders, through networks we barely understand and certainly don't regulate coherently.
The Colorado AI Act? It focuses on model outputs and bias. It says nothing substantive about data sovereignty, data movement costs, or the physics of moving petabytes across networks that weren't designed for AI workloads.
The Ghost in the Machine: Edge Computing
Both Jensen and Satya hinted at something profound: the future of AI isn't just massive centralized compute farms. Sam Altman said it explicitly: "Someday we will make an incredible consumer device that can run a GPT-5 or GPT-6 capable model completely locally at a low power draw."
Think about what that means. Not "someday far in the future." Someday. Soon enough to matter for business planning.
Right now, if you want to run a sophisticated AI model, you're paying inference costs to someone's cloud. You're sending your data to their servers, getting tokens back, and hoping the economics work out. The unit economics of AI today look nothing like search - Satya admitted that search had magical economics because you built one index and amortized it across billions of queries. Chat burns GPU cycles for every interaction.
But what if the model runs locally? On your phone. In your car. On the robot in your warehouse. Suddenly, the economics flip. No inference costs. No latency from round-trips to datacenters. No bandwidth constraints.
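Here's a toy comparison of the two cost structures. Every number below is an assumption I picked for illustration, not a real price sheet:

```python
# Toy unit economics: cloud inference vs. a model running on-device.
# Every constant here is an assumption for illustration only.

PRICE_PER_1K_TOKENS = 0.002      # hypothetical cloud inference price, USD
TOKENS_PER_INTERACTION = 1_000
INTERACTIONS_PER_DAY = 200       # a heavily-used assistant
DAYS = 365 * 3                   # assumed device lifetime

cloud_cost = (TOKENS_PER_INTERACTION / 1_000) * PRICE_PER_1K_TOKENS \
             * INTERACTIONS_PER_DAY * DAYS
print(f"cloud inference over device lifetime: ${cloud_cost:,.2f}")

# Local inference: the marginal cost per interaction is ~electricity.
WATTS_DURING_INFERENCE = 10      # assumed on-device power draw
SECONDS_PER_INTERACTION = 5
KWH_PRICE = 0.15                 # assumed residential electricity rate
local_cost = (WATTS_DURING_INFERENCE * SECONDS_PER_INTERACTION / 3.6e6) \
             * KWH_PRICE * INTERACTIONS_PER_DAY * DAYS
print(f"local inference over device lifetime:  ${local_cost:,.2f}")
```

With these made-up numbers the cloud bill lands in the hundreds of dollars per device while the local electricity cost is under a dollar. Tweak the assumptions however you like; the shape of the result doesn't change much.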
And no coherent regulatory framework for any of it.
Three Regulatory Failures We're Ignoring
The patchwork AI regulation problem is real. But it's hiding three deeper issues that actually matter more:
Data residency requirements that ignore physics. Europe wants data to stay in Europe. China wants data to stay in China. California wants certain data to stay private. None of these regulatory regimes acknowledge that modern AI architectures require massive context windows, real-time updates, and distributed training. You can't just "keep the data in Germany" when your model needs to learn from global patterns. The latency costs alone make certain applications impossible.
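The physics here is checkable with a few lines. A minimal sketch, assuming rough distances and a typical fiber slowdown factor:

```python
# Speed-of-light floor on round-trip latency: why "keep the data in
# Germany" has a real cost for a service running elsewhere.
# Distances and the fiber slowdown factor are rough assumptions.

C_KM_PER_MS = 300                 # light in vacuum: ~300 km per millisecond
FIBER_FACTOR = 1.5                # light in fiber is ~2/3 c; routes aren't straight

def round_trip_ms(distance_km: float) -> float:
    """Best-case round-trip time over fiber, ignoring queuing and processing."""
    return 2 * distance_km * FIBER_FACTOR / C_KM_PER_MS

for name, km in [("Frankfurt -> Virginia", 6_500),
                 ("Frankfurt -> Singapore", 10_000)]:
    print(f"{name}: >= {round_trip_ms(km):.0f} ms per round trip")
# A chatty AI pipeline making dozens of round trips per request turns
# these floors into seconds of user-visible latency. No law repeals them.
```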
No standards for data movement costs. When Satya talks about needing $250 billion in Azure commitments from OpenAI over five years, a huge portion of that is about data movement: moving training data, storing it in distributed multi-region buckets, pre-processing it on local VMs to ready it for execution. Every byte costs money in bandwidth, CPU cycles, and latency. The result? When legal compliance comes as an afterthought, architectures end up as a pile of band-aids and inefficiency rather than anything that makes technical or economic sense.
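To put a number on it, here's a sketch using a hypothetical egress rate. The price and hop count are my assumptions, not any provider's actual tariff:

```python
# Back-of-envelope: what moving a petabyte costs at a hypothetical
# cloud egress rate. The rate below is an assumption, not a quote.

EGRESS_USD_PER_GB = 0.05          # assumed inter-region egress price
PETABYTE_GB = 1_000_000

one_move = PETABYTE_GB * EGRESS_USD_PER_GB
print(f"one cross-region copy of 1 PB: ${one_move:,.0f}")

# Now add the compliance-afterthought tax: every extra hop bolted on
# after the fact multiplies the bill.
hops = 3                          # e.g., collect -> EU buffer -> US training
print(f"same data through {hops} hops: ${one_move * hops:,.0f}")
```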
Edge deployment is likely going to be a regulatory black hole. For a long time. Once AI models run on edge devices - and they will, soon - what jurisdiction applies? If I'm using a locally-running model on my phone while traveling through three states, which state's AI regulations apply? If a robot in a warehouse uses a model trained in California but deployed in Texas using data from customers in fifty states, who's responsible for compliance? Nobody knows, because nobody's written the rules yet.
The Compute-Over-Data Inversion
I've been thinking about distributed systems for longer than I care to admit, and I keep coming back to a fundamental principle: moving data is expensive. Moving compute is cheap(er). We spent the last decade building centralized cloud architectures because centralization meant economies of scale. But AI broke that model.
When your inference workload scales by a billion times, centralization becomes a bottleneck, not an advantage. The physics don't work. You can't move that much data fast enough. You can't power that many data centers efficiently. You can't build network infrastructure quickly enough to handle the load.
The solution isn't bigger datacenters. It's distributing compute to where the data already lives.
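The asymmetry is easy to quantify. A minimal sketch, assuming a 1 PB dataset, a 500 MB containerized job, and a 10 Gbps sustained wide-area link:

```python
# Moving data vs. moving compute: compare the bytes on the wire.
# All sizes and link speeds are illustrative assumptions.

DATASET_BYTES = 1e15          # 1 PB of data sitting at the edge
JOB_IMAGE_BYTES = 500e6       # a containerized processing/inference job
LINK_GBPS = 10                # assumed sustained wide-area throughput

def transfer_days(nbytes: float, gbps: float) -> float:
    """Best-case transfer time in days, ignoring protocol overhead."""
    return nbytes * 8 / (gbps * 1e9) / 86_400

print(f"ship the data to compute: {transfer_days(DATASET_BYTES, LINK_GBPS):.1f} days")
print(f"ship compute to the data: "
      f"{transfer_days(JOB_IMAGE_BYTES, LINK_GBPS) * 86_400:.1f} seconds")
```

Days in one direction, seconds in the other. That gap is the whole argument for data locality.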
This is why Jensen and Satya keep talking about edge AI and "fungible fleets" across geographies and workloads. Why Sam says consumer devices will run frontier models locally. They're all describing the same architectural shift: from centralized compute with data movement, to distributed compute with data locality.
But our regulatory frameworks assume centralization. They assume you can identify where AI happens, who's responsible, and what jurisdiction applies. That assumption is about to become profoundly wrong.
The Robotics Wildcard
And, as usual, there's no problem that real-world implementation can't make worse. Look at robotics.
This isn't science fiction—Figure, Tesla, Boston Dynamics, and a dozen Chinese companies are shipping real robots that use real AI models. These robots need to make decisions in milliseconds. They can't wait for a round-trip to a datacenter.
So they'll run models locally. Trained on data from multiple jurisdictions. Updated via networks that cross state and national boundaries. Operating in physical spaces where privacy, safety, and liability rules differ dramatically.
Colorado's AI Act requires that people can request explanations for algorithmic decisions that affect them. Fine. Now produce that explanation for a robot that uses a locally-running vision model trained on 100 million images from 30 countries, making real-time decisions about navigation, object manipulation, and human interaction.
What compliance burden falls on the robot manufacturer vs. the model provider vs. the end user vs. the cloud service that occasionally updates the model? Nobody knows, because we're regulating the wrong layer.
What Actually Needs Regulating
If you want to regulate AI effectively, regulate the data layer. Set clear rules for:
- Data provenance and lineage. Not just "where did this data come from" but "who touched it, when, and how did it change?" Make data transformations auditable from source to training set (a minimal sketch of what that could look like follows this list).
- Cross-border data flow standards. Not blanket prohibitions but sensible frameworks that acknowledge the technical requirements of distributed training while protecting legitimate sovereignty concerns.
- Edge device accountability. Clear standards for who's responsible when locally-running models make decisions. Is it the device manufacturer? The model provider? The end user? The update service? Define the liability chain before millions of devices ship.
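For the provenance piece, here's a minimal sketch of what auditable lineage could look like: a content-hash-chained record for every transformation. The record format is my own illustration, not any proposed standard:

```python
# Minimal sketch of auditable data lineage: each transformation step
# records what went in, what came out, who ran it, and when, chained
# by content hashes. The record format is illustrative, not a standard.

import hashlib
import json
import time

def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def lineage_record(parent: str | None, actor: str, step: str, output: bytes) -> dict:
    record = {
        "parent": parent,                  # hash of the prior record; None at the source
        "actor": actor,                    # who touched the data
        "step": step,                      # how it changed
        "output_digest": digest(output),   # what came out
        "timestamp": time.time(),          # when
    }
    # Hash the record itself so any later tampering breaks the chain.
    record["record_digest"] = digest(json.dumps(record, sort_keys=True).encode())
    return record

raw = b"scraped text, consent-checked"
r1 = lineage_record(None, "ingest-service", "collect", raw)
r2 = lineage_record(r1["record_digest"], "pii-scrubber", "redact",
                    raw.replace(b"text", b"TEXT"))
print(json.dumps([r1, r2], indent=2))
```

An auditor walking the chain can verify every hop from raw source to training set, which is exactly the question regulators keep asking and nobody can currently answer.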
 
But we're not doing any of this. We're writing laws about chatbot outputs while ignoring the infrastructure those chatbots run on. It's like regulating cars by specifying wheel sizes while ignoring road standards, traffic laws, and fuel regulations.
The Path Forward
Federal preemption would help, as Sam and Satya both noted. One set of rules beats fifty competing ones. But even federal rules focused on AI outputs miss the point. The infrastructure is the thing. Data movement. Power requirements. Edge deployment standards. Model versioning and updates. Liability frameworks for distributed systems.
In five years, hyperscale datacenters will still have their place in AI, but I'll take Jensen's bet: I think distributed AI will be a billion times larger, across edge devices, on-premises systems, and specialized hardware. It'll run locally, update occasionally, and move data constantly. And unless we get this right, we'll still be arguing about chatbot bias while the real infrastructure remains unregulated.
The best time to regulate data infrastructure was ten years ago. The second best time is now, before we build another decade of AI on top of regulatory sand. That's not a foundation I'd bet on.
Want to learn how intelligent data pipelines can reduce your AI costs? Check out Expanso. Or don't. Who am I to tell you what to do.
NOTE: I'm currently writing a book, based on what I've seen in the field, about the real-world challenges of data preparation for machine learning - the operational, compliance, and cost side. I'd love to hear your thoughts!