From Pandas to Upstream Control: The Evolution PyData Needs Next
The Python data stack gave us Pandas, Dask, and Arrow. But we've created a new crisis: the ingest-it-all-first tax that's drowning us in noise.
Last Friday at PyData Seattle 2025, I gave a talk called "Taming the Data Tsunami." (I promise I'll link to it when it gets posted!) Room full of data engineers who've lived the same journey I have - Pandas on our laptops, then Dask clusters, then figuring out how to make everything talk to each other with Arrow.
The nods of recognition when I showed our successes? Expected. The uncomfortable silence when I showed what we've accidentally built? Also expected. We all know we're not done evolving; most of us just haven't worked out what comes next.
The Journey From Here to There
If you've worked with data in the past 15 years, you know this story.
2008: Pandas changed everything. Wes McKinney, frustrated with the tools available for data work at AQR Capital Management, built something better. Suddenly your entire dataset fit in memory and you could manipulate it like a spreadsheet that didn't crash. No more batch jobs. No more Excel limitations. Just you, your data, and a Jupyter notebook. Exploratory analysis went from being a monologue to being a conversation.
2014: Dask and Spark gave us scale. As data outgrew our laptops, single-machine ceilings became real bottlenecks. These frameworks solved the problem by partitioning your data, parallelizing computation, and letting you process terabytes without waiting days for results. The Pandas API we'd grown to love now ran on clusters.
2016: Arrow gave us a common language. The serialization tax between tools had become brutal—moving data from Pandas to Spark to your ML library meant paying conversion costs each time. Arrow provided a zero-copy, columnar memory format that every tool in the ecosystem could speak natively, and suddenly those language barriers disappeared.
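If you haven't felt that shift directly, here's a minimal sketch of the handoff Arrow enables, using pyarrow; the DataFrame contents are placeholders, and the point is that the shared columnar representation, not a serialized copy, is what gets passed between tools.

```python
import pandas as pd
import pyarrow as pa

# A plain pandas DataFrame standing in for whatever your pipeline produces.
df = pd.DataFrame({"sensor": ["a", "b", "a"], "reading": [0.3, 1.7, 0.9]})

# Convert it into Arrow's columnar in-memory format.
table = pa.Table.from_pandas(df)

# Any Arrow-aware tool (DuckDB, Polars, Spark, many ML libraries) can now
# consume those same columnar buffers without a bespoke conversion step.
df_again = table.to_pandas()
```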
Each step solved a real bottleneck. Each unlocked new possibilities. The Python ecosystem became the standard because we kept solving the next problem.
What We Built Instead
Unfortunately, we've been lied to. Not because anyone was ACTIVELY trying to deceive us (I don't know, maybe they were, but I'll give them the benefit of the doubt). The entire industry has been convinced that "ingest-it-all-first" is the correct default pattern, and when you actually look at the numbers, they're absolutely brutal.
Think about the economics for a second: you're paying 100% of the cost to move, store, and process your data when only a tiny fraction of it actually matters for your use case. And here's the kicker—many organizations don't even have the metrics in place to measure whether the data they're ingesting is valuable at all. They're just moving it because that's what everyone does.
Your carefully orchestrated Airflow DAGs, your meticulously crafted dbt models, your perfectly tuned ML pipelines—they're all being forced to waste computational cycles filtering through mountains of noise to find the actual signal. Consider that 60% of enterprise data is unstructured and generated outside traditional data centers, and 73% of organizations report they can't keep up with their processing demand. This isn't a problem you can optimize your way out of. This is structural.
Three Physics Problems We Keep Ignoring
This goes way beyond just looking at your cloud bills (though those are painful enough). There are three fundamental physical constraints that our "centralize everything" approach simply pretends don't exist, and they're starting to catch up with us:
Data Is Everywhere
Your data is being generated across multiple clouds, on-premise data centers, edge devices, and IoT sensors scattered across the globe. The idea that a central warehouse can be the only answer to this distribution problem is increasingly untenable.
When you actually try to move ALL of it to a central location, the logistics break down in predictable ways: egress fees compound every time data crosses network boundaries, network latency kills any hope of responsive queries, and managing dozens of parallel ETL pipelines becomes an incredibly brittle operation. The math simply doesn't work when you calculate the actual costs of centralization at scale.
Speed of Light Is Constant
Here's a fun fact that Einstein already figured out: you can't make decisions faster than the data can physically travel. If you're waiting for data to arrive at your central warehouse before you can act on it, you've already lost the opportunity to respond in real-time.
Think about IoT sensor data or application logs—these are massive, noisy streams being generated at the edge. Under the current model, you pay to transport the entire firehose to your central infrastructure, then filter it down to the 5% that actually matters, and THEN you can take action. By that point, whatever real-time opportunity existed is long gone. You're paying to transport junk, over-provisioning your central clusters to handle the flood, and calling it "real-time" when it's anything but.
Regulations Have Teeth
GDPR and CCPA don't give a damn about your elegant architecture. If you're moving EU customer data to a US-based cluster for "cleaning" and processing, you're in violation the moment that data crosses the boundary—before you've even done anything useful with it.
We've created what I call "toxic data lakes": raw storage where personally identifiable information lands completely un-redacted. These become high-value attack targets and insider threat nightmares, with compliance measures patched in as an afterthought instead of built into the architecture from day one. And here's the thing—"almost compliant" is the same as "completely exposed" when the regulators come knocking.
Don't Rip-and-Replace, Shift Left
Let me be clear: this isn't about replacing Pandas, Dask, or Arrow. These are revolutionary tools that fundamentally changed how we work with data, and they're still absolutely essential. They gave us the primitives we needed to build modern data systems. But we need to be thoughtful about the ingredients we're putting in our cake before we bake it.
Think of it this way: Pandas gave us the verbs for data manipulation. Dask gave us scale. Arrow gave us a common grammar for communication between tools. What's missing is the sentence structure—a framework for deciding what data actually deserves to be part of the conversation in the first place.
The Python ecosystem gave us powerful tools to analyze data anywhere. Now we need to get smart about WHICH data we're analyzing and WHERE it's coming from before we commit to moving it.
Three Implementation Patterns
At PyData, I walked through three concrete patterns for shifting control upstream—filtering, transformation, and governance applied at the source, before data ever hits your expensive central infrastructure. These aren't theoretical constructs; they're battle-tested approaches that organizations are using right now to cut costs and improve performance.
Playbook #1: Distributed Warehousing
The Problem: When your data is scattered across multiple regions and cloud providers, the traditional approach of moving everything to a central location for queries becomes painfully slow and prohibitively expensive.
The Pattern: Instead of centralizing everything, store data locally in open formats and query it federatively. You run the computation where the data already lives, then return only the aggregated results across the network.
The Stack:
- Iceberg, Delta Lake, or Hudi for transactional tables on object storage
- Trino and DuckDB for executing queries locally and aggregating results
- Arrow for zero-copy in-memory transport without serialization overhead
The key insight here is that your compute runs "near" where the data lives, allowing you to query massive datasets in place and deliver only the final result—which is orders of magnitude smaller—across the network.
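Here's a minimal sketch of that shape using DuckDB over Parquet; the path, the column names, and printing instead of actually shipping the result are all stand-ins for your own storage layout and transport (an Iceberg or Delta table and Arrow Flight in a fuller setup).

```python
import duckdb

# Run this where the data lives (a node in the same region as the object
# store), not in the central warehouse.
con = duckdb.connect()

# Query the regional table in place. In production this might be an Iceberg
# or Delta table; plain Parquet keeps the sketch self-contained.
daily_summary = con.sql("""
    SELECT site_id,
           date_trunc('day', event_ts) AS day,
           count(*)           AS readings,
           avg(temperature_c) AS avg_temp_c
    FROM read_parquet('/data/sensors/*.parquet')
    GROUP BY site_id, day
""").arrow()  # a pyarrow Table: kilobytes of aggregates, not the raw scan

# Only this small result crosses the network, via Arrow Flight, an HTTP
# POST, or an insert into a small central table.
print(f"shipping {daily_summary.num_rows} aggregate rows instead of raw data")
```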
Real-world example: One company I worked with was ingesting IoT sensor data from manufacturing facilities worldwide. By writing to local Iceberg tables and processing in-place, they only shipped summary statistics back to headquarters. The result? Massively reduced storage costs, streamlined ingestion pipelines, and queries that actually complete in reasonable time.
Playbook #2: Streamlined Pipelines
The Problem: Logs, metrics, and IoT data streams generate absolutely massive amounts of noise. Under traditional architectures, you're paying to transport and store all of it before you even start filtering for what matters.
The Pattern: Filter and aggregate at the source. Ship the answer, not the firehose.
The Stack:
- Vector or Benthos for edge collection and transformation
- Embedded query engines like DuckDB or SQLite running locally to execute complex SQL before transmission
- Stream processors for real-time transformation and aggregation
The workflow here is straightforward: process data locally at the generation point, and only send small, high-signal results over the network. Your central clusters handle final aggregation, not the raw stream.
When you manage this upstream, you get the opportunity to be smart about what you're moving. Use techniques like windowing, aggregation, and filtering BEFORE transport. Deploy lightweight agents on source machines, leave your existing ETL pipeline completely unchanged, and suddenly everything is faster and cheaper.
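To make that concrete, here's a minimal sketch of such an agent in plain Python, assuming a JSON-lines log with level and service fields; in practice Vector or Benthos (or an embedded DuckDB query) would do this declaratively, but the shape of the work is the same.

```python
import json
import time
from collections import Counter

WINDOW_SECONDS = 60  # tumbling one-minute windows


def run_edge_agent(log_path="/var/log/app/events.jsonl"):
    """Filter, window, and aggregate locally; ship only the rollup.

    The path and the record fields ("level", "service") are assumptions
    about the local log format.
    """
    window_start = time.time()
    error_counts = Counter()

    with open(log_path) as log:
        while True:
            line = log.readline()
            if line:
                record = json.loads(line)
                # Filter at the source: everything below ERROR stays here.
                if record.get("level") == "ERROR":
                    error_counts[record.get("service", "unknown")] += 1
            else:
                time.sleep(0.5)  # simple follow: wait for new lines

            # Window closed: ship a tiny rollup instead of the raw stream.
            if time.time() - window_start >= WINDOW_SECONDS:
                rollup = {"window_start": int(window_start),
                          "errors": dict(error_counts)}
                print(json.dumps(rollup))  # stand-in for your transport call
                error_counts.clear()
                window_start = time.time()
```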
The numbers are genuinely impressive: $2.5M → $18K in annual cost, a reduction of more than 99%. These aren't hypothetical figures; they come from real benchmarks.
Playbook #3: Upstream Governance
The Problem: Sensitive personally identifiable information is crossing network and regional boundaries before redaction happens. "Almost compliant" equals "exposure windows that regulations absolutely will not forgive."
The Pattern: Apply governance policies at the source and sanitize data before it ever leaves its region of origin.
The Stack:
- Open Policy Agent for declarative policy-as-code that's auditable and version-controlled
- Vector and Benthos with built-in processors for obfuscation, hashing, and PII filtering
- Control plane (like Expanso) to deploy policies and audit compliance globally
The approach is to define your transformations as declarative configuration and deploy them to edge agents. This ensures that only compliant, sanitized data enters your pipeline in the first place.
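As a minimal sketch of what a source-side sanitizer does, here's the idea in plain Python. The field names and the salt handling are assumptions; Vector and Benthos ship equivalent built-in processors, and in a real deployment the policy would come from version-controlled policy-as-code (OPA) rather than a literal dict.

```python
import hashlib
import os

# Stand-in for a policy that would normally be authored as policy-as-code
# (e.g., OPA) and pushed to the edge agents by a control plane.
POLICY = {
    "hash_fields": ["email", "user_id"],   # keep joinable, not identifiable
    "drop_fields": ["ip_address", "full_name", "street_address"],
}

HASH_SALT = os.environ.get("PII_HASH_SALT", "rotate-me")


def sanitize(record: dict) -> dict:
    """Redact one event before it crosses any network or regional boundary."""
    clean = dict(record)
    for field in POLICY["drop_fields"]:
        clean.pop(field, None)
    for field in POLICY["hash_fields"]:
        if field in clean:
            digest = hashlib.sha256((HASH_SALT + str(clean[field])).encode())
            clean[field] = digest.hexdigest()
    return clean


# The raw event never leaves its region; only the sanitized copy does.
raw = {"email": "user@example.eu", "ip_address": "203.0.113.7", "plan": "pro"}
print(sanitize(raw))
```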
The traditional "ship-everything-and-get-to-compliance-later" pattern creates a cascade of problems: overwhelming data deluge, privacy pitfalls, compliance chaos, audit nightmares, resource drain, performance lag, regulatory roulette, and delayed insights. Every single one of these is avoidable if you just apply governance upstream.
The Mindset Shift
Here's the thing: the technology isn't actually that hard. The tools exist, they're mature, and they've been battle-tested in production. The hard part is changing how we think about data architecture.
The Python ecosystem has trained us for over a decade to follow a specific mental model: "Get the data somewhere I can work with it, THEN analyze it." Pandas reinforced this pattern. Dask scaled it up when single machines weren't enough. Arrow optimized the data movement. But fundamentally, we've been papering over an increasingly expensive problem.
The new mental model is simpler but requires us to think earlier in the pipeline: "Decide what's worth analyzing BEFORE paying to move and store it."
Think about the difference between a funnel and a filter. Funnels collect everything indiscriminately and narrow down later. Filters make intelligent decisions at the source about what deserves to pass through. We've been building funnels. We need to start building filters.
Where It Goes
When I finished the talk, the questions weren't about whether this approach works; the numbers are too compelling to ignore. People wanted to know how to retrofit existing systems, how to convince their teams to change, how to prove the value before committing to the shift.
That's the right concern. This isn't a rip-and-replace migration. It's an additional layer that makes your existing infrastructure more efficient. Your Airflow DAGs still run. Your dbt models still compile. Your ML training pipelines still execute. They just operate on 50-70% less data, process faster, and cost a fraction of what they used to.
The Python data stack democratized analysis. Pandas, Dask, and Arrow made data work accessible and powerful. But like all successful systems, they optimized for the problems we had - and in doing so, created new ones. That's not a failure. That's how technology evolves.
The next evolution isn't about replacing what works. It's about adding the control layer we need to make our success sustainable. Because the alternative - continuing to ingest-it-all-first as data volumes double every two years - isn't actually an alternative at all.
Want to learn how intelligent data pipelines can reduce your AI costs? Check out Expanso. Or don't. Who am I to tell you what to do.
NOTE: I'm currently writing a book about the real-world challenges of data preparation for machine learning, based on what I've seen in the field, with a focus on operations, compliance, and cost. I'd love to hear your thoughts!