Your 2026 Resolution: Add Context to Your Data (Before It Breaks You)

"Wrong Context" is the Number One Reason Models Go Bad, Are You Addressing It Soon Enough?

Your 2026 Resolution: Add Context to Your Data (Before It Breaks You)

Last week I sat in a review with a customer where two teams spent forty minutes arguing about "active users." Not about strategy. Not about growth. About what the number meant.

One team counted anyone who logged in. The other excluded users who bounced in under 30 seconds. Neither knew which experiment flags were active when the data was pulled. The dashboard just showed a number. No definition. No lineage. No context.

This happens constantly. And it's about to get significantly worse.

Gartner predicts that 60% of AI projects will be abandoned by 2026 because organizations lack "AI-ready data." Most of the time, it's because the data traveling through these systems carries no meaning beyond the raw values.

The models can't tell the difference between a deprecated pricing page and current policy. They can't distinguish a test account from a real customer. They retrieve answers confidently, cite sources correctly, and still get everything wrong.

If you want to get ahead of the tsunami of "well it's vibe coded, but it looks pretty good" and ACTUALLY make your models work, now is the time you stop treating context as optional documentation and start treating it as infrastructure.

The Context Engineering Pivot

Something shifted in 2025. The industry stopped talking about "prompt engineering" and started talking about "context engineering."

Andrej Karpathy called it "the delicate art and science of filling the context window with just the right information for each step." MIT Technology Review documented the transition from "vibe coding" to systematic context management. Google's December release of their Agent Development Kit was entirely focused on context architecture.

The terminology change matters. "Prompt" implies a single instruction you craft carefully. "Context" implies an entire information environment you engineer deliberately.

And it turns out most organizations have been engineering that environment with all the care of a teenager cleaning their room by shoving everything under the bed.

David Lanstein, CEO of Atolio, put it bluntly in IBM's 2026 predictions: "The solution isn't bigger models, but smarter data. True value will come from feeding models high-quality, permission-aware structured data to generate intelligent, relevant and trustworthy answers."

The race for bigger context windows missed the point. A 200K token context window filled with undifferentiated garbage produces undifferentiated garbage outputs, just with more confident citations.

What Context Actually Means

When I talk about context, I don't mean adding a few comments to your SQL. I mean four distinct layers that most data systems ignore entirely.

Semantic context is what a value actually represents. Not just "this column is called revenue" but "this is recognized revenue under ASC 606, calculated monthly, excluding deferred amounts, as defined by the finance team's Q3 2025 policy update." When the definition changes, the context changes with it. When someone queries the data six months from now, they see what it meant then, not what it means today.

Operational context is the health and provenance of the data at query time. Is this number fresh? Did the upstream pipeline fail overnight? Is there an active incident affecting the source system? A dashboard that shows revenue without showing "by the way, the payment processor had a three-hour outage last night" is lying by omission.

Experimental context is which flags and tests were active when the data was generated. Your MAU number is meaningless if you don't know that 40% of users were in an onboarding experiment that changed the activation flow. The number isn't wrong. It's just uninterpretable without the experiment metadata traveling alongside it.

Human context is ownership and decision history. Who defined this metric? What decisions have been made based on it? Where's the design doc? When someone has a question, they shouldn't have to play archeologist in Slack to figure out who to ask.

Most data systems capture maybe one of these. Usually the semantic layer, poorly, in a data catalog that nobody updates and fewer people read.

Why RAG Doesn't Fix This

The popular assumption has been that Retrieval-Augmented Generation solves the context problem. Point your model at your documents, let it retrieve what it needs, problem solved.

InfoWorld's analysis last week explains why this breaks at scale: "RAG breaks at scale because organizations treat it like a feature of LLMs rather than a platform discipline. Models generate confidently incorrect answers because the retrieval layer returns ambiguous or outdated knowledge."

The failure mode is insidious. RAG with good retrieval but no context governance produces what I've started calling "hallucination with citations." The model cites a real document, with an accurate citation, but the answer is wrong. It looks impeccably sourced, but it's still wrong.

CX Today reported on exactly this pattern: "If the knowledge base is outdated, RAG just retrieves the wrong answer faster. If content is unstructured, like PDFs, duplicate docs, or inconsistent schemas, the model struggles to pull reliable context."

The problem isn't retrieval. The problem is that the documents themselves carry no context about their validity, scope, or temporal bounds. A PDF is just a PDF. It doesn't know that it was superseded by a newer version, that it only applies to EMEA customers, or that the pricing section was invalidated by a board decision last quarter.

When VentureBeat declared "RAG is dead" in their 2026 predictions, they were being provocative. But the underlying point stands: RAG without context governance is dying. The organizations that will succeed with retrieval-augmented systems are the ones treating their knowledge bases as living, context-rich assets rather than static document dumps.

The Toll Booth Is Coming

There's a harder version of this problem emerging, and most organizations haven't noticed it yet.

Constellation Research warns that "enterprise data tolls and API economics are going to be a headache" in 2026. Celonis is suing SAP over data access. The Information reported that Salesforce is raising prices on apps that tap into its data. Connector fees are trickling down to IT budgets.

"Connection fees are going to be the new cloud egress," Constellation writes.

In other words, if you don't own the context layer for your own data, you'll rent it from someone else. Every vendor building "AI-ready" connectors is essentially building a context layer on top of your data and charging you for access to the meaning of information you already own.

The Solutions Review predictions roundup said: "By the end of 2026, connectivity, governance, and context provisioning for AI agents will be built into every serious data platform."

Built in. Core infrastructure.

The organizations that treat context as someone else's problem will find themselves paying tolls to access the semantic meaning of their own customer data. The ones that invest now will own that layer.

Resolution #1: Ship Context With Every Event

The practical version of this starts at ingestion.

Every event entering your system should carry enough metadata that a reader (human or machine) can interpret it without external lookups. Not "user_id, timestamp, action" but "user_id, timestamp, action, schema_version, experiment_flags, source_system, data_classification."

This isn't aspirational. Anthropic's context engineering guide describes exactly this pattern: maintaining lightweight identifiers that allow systems to "dynamically load data into context at runtime using tools."

A transformation editor should show, live, which downstream dashboards and models will break if you drop a column. A query should surface its lineage alongside its results. A dashboard shouldn't just display a number; hovering over it should reveal the definition, the upstream tables, the freshness, and the last incident that affected it.

This requires tooling changes, yes. But mostly it requires treating context as a first-class output of every pipeline stage rather than an afterthought someone might add later.

Resolution #2: Make Context the Default in AI and Agents

TechCrunch's 2026 analysis identifies the Model Context Protocol as "quickly becoming the standard" for agent interoperability. OpenAI and Microsoft have embraced it. Google is standing up managed MCP servers.

The infrastructure for context-aware agents is arriving. The question is whether your data is ready to participate.

That means storing valid_from/valid_to timestamps on policy documents. It means tagging content with scope limitations (region, customer tier, product line). It means encoding data classification and retention rules at the source, not in a compliance spreadsheet nobody maintains.

Stanford HAI's predictions note that "2026 will hear more companies say that AI hasn't yet shown productivity increases." The organizations that do show productivity increases will be the ones whose agents can distinguish current reality from historical noise without human intervention.

An agent that refuses to take high-impact actions without verifying the environment, cohort, and guardrails is not cautious. It's correctly engineered. An agent that charges ahead on stale data with high confidence is the expensive kind of wrong.

Resolution #3: Measure Time to Trustworthy Insight

I wrote about Nicole Forsgren's new book last month. Her frameworks for developer productivity apply directly to data work, but with a crucial modification.

For data teams, the north star isn't deployment frequency or cycle time. It's time to trustworthy insight: from raw logs or events to a result you would put in front of an executive with confidence.

Most organizations can't measure this because they don't know when insight becomes trustworthy. The data arrives, transformations run, dashboards update, but confidence accrues informally. Someone senior enough eventually blesses the number based on vibes and experience.

Context infrastructure makes this measurable. If every metric carries its lineage, freshness, and incident history, you can ask: how long did it take from data landing to a metric with full provenance, no upstream incidents, and a defined owner? That's the number that matters.

When that number shrinks, you're actually improving. When people are just shipping dashboards faster without context, you're accumulating debt.

The Year We Stop Arguing About Definitions

Most New Year's resolutions fail by February.

Data resolutions framed as one-time efforts rather than infrastructure changes will fail as well. If you say "we'll document our metrics" as a static job, even if you staff it (which, let's be honest, you probably won't) it will become a Q1 project that never gets maintained. Your "data quality dashboard" becomes one more thing that nobody checks.

Context isn't a project. It's a property of how data moves through your organization. It either travels with its story or it doesn't.

The organizations that treat 2026 as the year context becomes default will stop having the same arguments in every meeting. The exec review becomes a discussion of strategy instead of a debate about what "active users" means. The AI agent produces answers that come with their own credibility assessment. The data team ships products instead of debugging why downstream consumers don't trust the numbers.

Gartner says 60% of AI projects will fail for lack of AI-ready data. The projects that succeed will be the ones that stopped treating data as numbers and started treating it as knowledge.

That's the resolution. Make the data know what it is.


Want to learn how intelligent data pipelines can reduce your AI costs? Check out Expanso. Or don't. Who am I to tell you what to do.

NOTE: I'm currently writing a book called "Zen and the Art of Data Maintenance" based on what I've seen about the real-world challenges of data preparation for machine learning, focusing on operational, compliance, and cost. I'd love to hear your thoughts!