The Loop Is Only as Good as the Metric

On Thursday I wrote about Karpathy's autoresearch, the 630-line training loop that runs 100 ML experiments overnight on a single GPU while you sleep. The post generated a lot of conversation, and most of it centered on the automation: agents doing research, models training themselves, the future of AI development as a lights-out factory.

But there's something in autoresearch that deserves more attention than the automation, something that explains why this particular loop produced real results while so many other "autonomous AI" projects produce noise. And it has nothing to do with the agent, the GPU, or the training code.

It's the metric.

Why Autoresearch Actually Works

Autoresearch uses a single evaluation criterion: validation bits per byte (val_bpb). Lower is better. The metric is independent of vocabulary size, which means the agent can change the tokenizer, the embedding dimensions, the entire model architecture, and the comparison remains valid. Every five-minute experiment produces a number. That number is either lower than the previous best (keep the change) or it isn't (discard it). There is no ambiguity.
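The keep/discard mechanism is simple enough to sketch in a few lines. Everything below is a hypothetical stand-in for illustration, not the actual autoresearch code: `train_and_eval` fakes a five-minute training run, `propose_change` fakes the agent's next idea, and the config is a toy.

```python
import random

random.seed(0)  # deterministic for illustration

def train_and_eval(config):
    # Placeholder for a ~5-minute training run that returns
    # validation bits per byte (val_bpb); lower is better.
    return 1.0 + abs(config["lr"] - 0.004) + random.random() * 0.01

def propose_change(config):
    # Placeholder for the agent proposing an experiment.
    new = dict(config)
    new["lr"] *= random.choice([0.5, 2.0])
    return new

config = {"lr": 0.001}
best_bpb = baseline_bpb = train_and_eval(config)
for _ in range(100):                 # ~100 experiments overnight
    candidate = propose_change(config)
    bpb = train_and_eval(candidate)
    if bpb < best_bpb:               # lower val_bpb: keep the change
        config, best_bpb = candidate, bpb
    # otherwise: discard and try the next idea
```

The entire loop hinges on that one comparison, `bpb < best_bpb`. Nothing else in the structure supplies judgment.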

This is not an accident; it is why the loop works.

If you gave the same agent the same compute and the same training code but replaced val_bpb with a vague instruction like "make the model better," the loop would produce nothing useful. The agent would have no way to distinguish a good experiment from a bad one. It would iterate endlessly without converging, accumulating changes that might help, might hurt, and would be impossible to rank. The automation would be impressive and the output would be worthless.

Hamel Husain, who has spent the last several years teaching over 3,000 engineers at companies like OpenAI, Anthropic, and Google how to build evaluation systems for AI products, puts this in a way I think is exactly right: success with AI hinges on how fast you can iterate, and your iteration speed is gated by the quality of your evaluation. If you can't measure whether a change made things better or worse, you can't iterate at all, no matter how fast your compute runs.

Autoresearch demonstrates this principle in the purest possible form. The agent runs 12 experiments per hour. The iteration speed is extraordinary. But that speed is only valuable because val_bpb gives unambiguous, immediate, correct feedback on every single experiment. Remove the metric and the speed is meaningless.

The Gap Between Models and Everything Else

In Friday's post, I argued that the autoresearch pattern generalizes beyond model training. Define an objective, let an agent iterate on the system, measure against a metric, keep or discard, repeat. That same loop could apply to marketing experiments, supply chain optimization, manufacturing process tuning, drug discovery pipelines.

That argument is correct, but it glosses over the hardest part: for model training, we have clean metrics. For nearly everything else, we don't.

val_bpb works because language modeling has a well-defined objective function. You can compute it automatically, it correlates with the thing you actually care about (model quality), and it doesn't require a human in the loop. That combination is rare. Most real-world optimization problems don't have a single number that tells you whether things got better.

Consider what happens when you try to apply the autoresearch pattern to a customer support chatbot. What's the metric? Response time? Customer satisfaction scores? Resolution rate? Escalation frequency? Each of these captures something real, but none of them captures everything, and optimizing aggressively for any single one will produce pathological behavior. A chatbot that minimizes response time will give shorter, less helpful answers. One that minimizes escalation will refuse to hand off to humans even when it should. One that maximizes satisfaction scores will learn to be agreeable rather than accurate.

This is the problem that Husain and his co-instructor Shreya Shankar have been methodically working through with their AI Evals course and in Husain's writing. Their central insight, the one I think the autoresearch enthusiasm systematically underweights, is that most teams fail at AI not because they can't build systems but because they can't evaluate them. The blog posts, the course, the consulting work: all circle the same thesis. Evaluation is the hard part, evaluation is the bottleneck, and if you skip it, nothing downstream works no matter how sophisticated your automation gets.

What Good Evaluation Actually Looks Like

Husain's framework is worth understanding in detail because it maps directly onto the challenge of scaling automated loops beyond model training.

The process starts with error analysis: manually reviewing real system outputs (he calls them "traces"), taking open-ended notes about what's going wrong, and categorizing failures. This is qualitative research methodology applied to AI systems, adapted from social science methods that have been refined over decades. You review at least 100 traces. You identify patterns. You keep reviewing until you stop finding new failure modes.

Only after you understand what's actually failing do you start building automated evaluations. And those evaluations aren't generic metrics like BERTScore or ROUGE or cosine similarity. They're binary pass/fail checks designed around the specific failure modes you identified through error analysis. Did the chatbot incorrectly schedule a tour? Pass or fail. Did the system hand off to a human when it should have? Pass or fail. Did the response contain fabricated information about a specific listing? Pass or fail.
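A binary eval suite of this kind is mechanically simple. Here's a hedged sketch of what checks built around specific failure modes might look like; the failure modes, field names, and listing IDs are invented for illustration, not taken from any real product:

```python
# Binary pass/fail checks, each targeting one failure mode found
# through manual error analysis. No scores, no Likert scales.

def no_fabricated_listing(trace, known_listings):
    """Fail if the response mentions a listing ID that doesn't exist."""
    return all(lid in known_listings for lid in trace["mentioned_listings"])

def handed_off_when_required(trace):
    """Fail if escalation was required but the bot didn't hand off."""
    return trace["handed_off"] or not trace["escalation_required"]

EVALS = [
    ("fabricated_listing", lambda t: no_fabricated_listing(t, {"A12", "B07"})),
    ("missed_handoff", handed_off_when_required),
]

def run_evals(trace):
    return {name: check(trace) for name, check in EVALS}

# A trace that mentions a real listing but fails to hand off:
trace = {
    "mentioned_listings": ["A12"],
    "handed_off": False,
    "escalation_required": True,
}
results = run_evals(trace)
# {'fabricated_listing': True, 'missed_handoff': False}
```

Each check is trivial on its own; the hard, unautomatable work is upstream, in the error analysis that told you these are the two failure modes worth checking.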

The insight is that generic metrics don't measure the most important problems with your specific AI product. Every product fails in its own specific ways, and evaluation systems need to be built around those specific failure modes, not around abstract notions of "quality" or "helpfulness." Husain has described evals as effectively becoming "living product requirements documents" that continuously test your AI against the things that actually matter for your users.

This is painstaking work. It requires domain expertise. It requires humans looking at data. It cannot (despite what some platforms suggest) be fully automated by another AI. As Husain put it on Lenny's Podcast: the most common misconception about evals is "Can't the AI just eval it?" People want that so much that companies sell it, but it doesn't work.

The Uncomfortable Implication

So here's where I think the real opportunity lives.

The reason autoresearch produces stunning results on model training is that model training has clean, automated, single-number evaluation built into the problem definition. The agent doesn't need to understand what "good" means. val_bpb tells it, instantly, for free, after every experiment.

The reason the same pattern will struggle to generalize, at least initially, is that most real-world problems don't have that. Customer support doesn't have val_bpb. Marketing doesn't have val_bpb. Healthcare, logistics, legal, finance, manufacturing: none of these domains have a single automated metric that reliably captures "did this get better."

Which means the actual bottleneck to scaling automated optimization loops across domains isn't compute. It isn't agents. It isn't training infrastructure. It's evaluation infrastructure. The organizations that will benefit most from autoresearch-style automation are the ones that have already done the hard work of building robust, domain-specific evaluation systems that can provide unambiguous feedback to an automated loop.

If you have a well-instrumented customer support system with binary pass/fail evaluations for your twelve most common failure modes, validated against human judgment, you can wire those evaluations into an automated iteration loop and let it run overnight. If you don't have that, you can run the loop all you want, but you'll be optimizing for something that doesn't correlate with what your users actually need.
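If those binary evals exist, wiring them into a loop mostly means collapsing them into one number the loop can compare, such as an aggregate pass rate. A minimal sketch, with all names invented:

```python
# Collapse a suite of binary evals into a single feedback signal.
# `respond`, the checks, and the traces are toy stand-ins.

def pass_rate(respond, eval_suite, test_traces):
    """Fraction of (trace, check) pairs that pass: the loop's metric."""
    results = [
        check(respond(trace))
        for trace in test_traces
        for check in eval_suite
    ]
    return sum(results) / len(results)

def respond(trace):
    # Stand-in for the system under test.
    return {"answer": trace["q"].upper(), "handed_off": True}

eval_suite = [
    lambda out: bool(out["answer"]),  # gave a non-empty answer
    lambda out: out["handed_off"],    # escalated when it should
]
traces = [{"q": "refund policy?"}, {"q": "tour times?"}]

score = pass_rate(respond, eval_suite, traces)
```

The resulting `score` plays exactly the role val_bpb plays in autoresearch: a single number, higher (or lower) is better, computed automatically after every change. The caveat is everything Husain warns about: the number is only as trustworthy as the checks behind it.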

The infrastructure investment that matters most right now is not more GPUs. It's not bigger models. It's not even the agentic frameworks everyone is building. It's evaluation systems: domain-specific, validated against human judgment, designed around real failure modes rather than generic quality scores, and structured to provide the kind of clean feedback signal that makes automated iteration actually converge on improvement rather than just churn.

What Karpathy and Husain Have in Common

There's a deeper connection between Karpathy's autoresearch and Husain's eval work that I think most people will miss because they operate in different communities and speak different vocabularies.

Karpathy stripped model training down to the absolute minimum viable system: one file, one GPU, one metric. The constraint is the feature. By making everything else as simple as possible, the feedback signal becomes maximally clear and the agent can iterate with maximum efficiency.

Husain's approach to evals follows the same principle. Start simple. Use spreadsheets, not platforms. Build binary pass/fail checks, not Likert scales. Review traces manually before automating anything. Strip away everything that adds noise to the signal, because the signal is the whole point.

Both of them are, in different ways, arguing the same thing: the value of an optimization loop is determined entirely by the quality of its feedback signal. Make the signal clean and simple and the loop produces remarkable results. Let the signal get noisy or indirect and the loop produces waste.

The organizations that understand this will build their evaluation infrastructure first and their automation second. Everyone else will build automation that looks impressive and produces nothing.

Karpathy gave us the loop. Husain has been giving us the metric. The companies that combine both are going to be the ones worth watching.


Want to learn how intelligent data pipelines can reduce your AI costs? Check out Expanso. Or don't. Who am I to tell you what to do.

NOTE: I'm currently writing a book based on what I have seen about the real-world challenges of data preparation for machine learning, focusing on operational, compliance, and cost concerns. I'd love to hear your thoughts!