The Model Is the Byproduct
Last Friday, Andrej Karpathy open-sourced a 630-line Python script and went to bed. By morning, an AI agent running on a single GPU had completed roughly 100 LLM training runs, each lasting exactly five minutes, autonomously modifying the neural network architecture, the optimizer, and the hyperparameters, evaluating the results, keeping improvements, discarding failures, and moving on to the next experiment. No foundation model. No API calls to a frontier lab. Just data, a training loop, and an agent that doesn't sleep.
Within 48 hours, the post had 8.6 million views. The repo hit 8,000 GitHub stars. Shopify CEO Tobi Lutke cloned it before bed on Saturday, pointed it at his own data, and woke up to a smaller model that outperformed a larger one he'd configured manually. A 19% improvement in validation scores. From a model trained from scratch. On his data. Overnight.
Most of the commentary has focused on the "AI doing research while you sleep" angle, and that's reasonable. It's a compelling image. But I think the coverage is systematically missing what actually matters about autoresearch, because the interesting thing isn't the automation of the research loop. The interesting thing is what the research loop produces, and what it produces from.
What Autoresearch Actually Does
The autoresearch repo is stripped down from Karpathy's earlier nanochat framework into the most minimal possible training setup: one file of training code (train.py), one file of data preparation (prepare.py), and one Markdown file (program.md) that tells the AI agent how to approach experimentation. The human writes the Markdown. The agent iterates on the Python. Every experiment runs for exactly five minutes, which means the results are optimized for your specific hardware (an H100 will find different optima than a Mac Mini M4), and that's a feature, not a limitation.
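The mechanics of that time box are easy to picture as a thin harness around the training script. The sketch below is illustrative, not the repo's actual code: it assumes a script that prints a final `val_bpb=<float>` line to stdout, launches it, and hard-stops it when the five-minute budget expires.

```python
import subprocess
import sys

# Five minutes per experiment, as described in the post.
TIME_BUDGET_S = 5 * 60

def run_experiment(script="train.py", budget_s=TIME_BUDGET_S):
    """Run one time-boxed training run and return its final validation
    bits-per-byte, parsed from an assumed 'val_bpb=<float>' output line."""
    proc = subprocess.Popen([sys.executable, script],
                            stdout=subprocess.PIPE, text=True)
    try:
        out, _ = proc.communicate(timeout=budget_s)
    except subprocess.TimeoutExpired:
        proc.kill()                      # hard stop at the budget
        out, _ = proc.communicate()      # collect whatever was printed
    lines = [l for l in out.splitlines() if l.startswith("val_bpb=")]
    return float(lines[-1].split("=", 1)[1]) if lines else None
```

A fixed wall-clock budget like this is also why the results are hardware-specific: the harness doesn't care how many steps fit inside the window, only what the metric says when time runs out.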
The evaluation metric is validation bits per byte, which measures how well the model predicts text and is independent of vocabulary size. That independence matters because it means the agent can freely change the tokenizer, the embedding dimensions, the entire architecture without breaking the comparison. Everything is a fair test against everything else.
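Concretely, bits per byte is summed cross-entropy (in nats) over the validation split, divided by ln 2 times the byte count of the raw text. Because the denominator is bytes rather than tokens, any tokenizer yields a comparable number. A small illustrative helper, not code from the repo:

```python
import math

def bits_per_byte(total_nats: float, total_bytes: int) -> float:
    """Convert summed cross-entropy in nats into bits per byte of raw
    validation text. Dividing by bytes (not tokens) makes the metric
    independent of vocabulary and tokenizer choices."""
    return total_nats / (math.log(2) * total_bytes)

# Toy numbers: 1000 tokens at an average loss of 3.0 nats/token,
# covering 4000 bytes of validation text.
bpb = bits_per_byte(1000 * 3.0, 4000)   # ~1.08 bits per byte
```

A model with a bigger vocabulary pays more loss per token but covers more bytes per token, so the ratio stays comparable across tokenizer changes.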
When Karpathy left it running for two days on a depth-12 model, the agent autonomously discovered roughly 20 additive improvements that transferred to larger models, cutting the benchmark time-to-GPT-2 from 2.02 hours to 1.80 hours. An 11% improvement found by a machine that tried approximately 700 different modifications while a human was doing other things. And the specific optimizations (attention scaling, regularization tuning, initialization corrections) were good enough that Karpathy merged them back into his production nanochat codebase.
That's not a toy demo. That's a research tool producing real, transferable results.
The Part Everyone Is Missing
The standard narrative about AI progress goes like this: labs build increasingly massive foundation models, companies access them through APIs, and the competitive moat belongs to whoever has the most GPUs and the most training data scraped from the internet. OpenAI builds five new Stargate sites. Hyperscalers commit $700 billion to AI infrastructure. The message is clear: scale is destiny.
Autoresearch inverts every assumption in that narrative.
It doesn't fine-tune a foundation model. It trains from scratch. The data you feed it isn't a massive internet scrape; it's whatever dataset matters for your specific problem. The compute isn't a hyperscale cluster; it's a single GPU that might be sitting under your desk. And the "researcher" iterating on the model isn't a team of PhD engineers at a frontier lab; it's an AI agent reading a Markdown file you wrote.
The output is a model that's purpose-built for your data, on your hardware, optimized to your constraints. Lutke's result at Shopify demonstrated this concretely: the agent-optimized smaller model, trained on Shopify's query-expansion data, beat a larger model that had been configured by humans. Not because smaller is inherently better, but because automated iteration on relevant data finds optima that manual tuning misses.
Karpathy himself framed it with characteristic directness: autoresearch is "just a recipe/idea - give it to your agent and apply to what you care about." That phrasing is worth paying attention to. He's not positioning this as a product. He's positioning it as a pattern. A way of thinking about the relationship between data, compute, and optimization that doesn't start with "first, get access to GPT-5."
Hardware Diversity as a Feature
One of the more unexpected results came from Hyperspace AI, which distributed the autoresearch loop across a peer-to-peer network. On the night of March 8th, 35 autonomous agents running on different hardware (H100s, consumer GPUs, CPU-only laptops) completed 333 experiments without any human supervision. The results were fascinating, not because the powerful machines won, but because the diversity of hardware produced diversity of approach.
The H100 agents used brute force, testing aggressive learning rates and large batch sizes because they had the throughput to burn. But the agents running on laptops, constrained by limited compute, were forced to be more creative. They focused on initialization strategies, normalization choices, architectural simplifications. One user running a Mac Mini M4 reported that 26 of 35 experiments failed or crashed, but the handful that succeeded revealed that the model improved by getting simpler.
In 17 hours, these distributed agents independently rediscovered ML techniques (RMSNorm, tied embeddings, specific initialization patterns) that took human researchers at labs like Google Brain years to formalize. Different hardware, different constraints, different search strategies, converging on known-good solutions through pure automated iteration.
This is what it looks like when the bottleneck shifts from "who has the most compute" to "who has the best experimental loop and the most relevant data."
From Models to Everything Else
Autoresearch currently optimizes neural network training. The agent modifies architecture and hyperparameters, trains for five minutes, and measures validation loss. But if you abstract the pattern, what you're looking at is more general: define an objective, let an agent iterate on the system, measure against a metric, keep or discard, repeat. That pattern doesn't require neural networks. It requires a measurable outcome and a system that can be modified.
Karpathy's own next-step vision points in this direction. He described making autoresearch "asynchronously massively collaborative," comparing it to SETI@home: distributed agents exploring different research directions simultaneously, contributing results back to a shared knowledge base. A fork already exists that implements exactly this, with agents registering experiments, publishing results, and syncing through a coordination layer. And his AgentHub project (2,000+ stars in its first 24 hours) replaces GitHub's human-centric collaboration model with one designed for agent swarms: no branches, no pull requests, just a growing graph of commits and a message board for agents to coordinate.
That same week, Eric Siu from Single Grain applied the autoresearch pattern to marketing. Instead of training code, the modifiable artifact is a landing page. Instead of validation loss, the metric is positive reply rate. Instead of 12 experiments per hour on model architecture, it's 12 experiments per hour on ad creative. His estimate: marketing teams currently run 30 to 50 experiments per year; this pattern enables 36,500 (a hundred a day).
Replace "marketing" with supply chain optimization, with clinical trial design, with manufacturing process tuning, with logistics routing. The loop is the same. Objective, iteration, measurement, selection. The specific domain is just the data you feed it and the metric you choose.
Where the Value Actually Lives
Gartner predicts that by 2027, organizations will use task-specific small models three times more often than general-purpose LLMs. By 2028, 30% of generative AI workloads are projected to run on-premises or on-device. Those numbers describe a world where the foundation-model-as-a-service paradigm coexists with (and is increasingly supplemented by) locally trained, purpose-built models optimized for specific data and specific tasks.
Autoresearch makes that world tangible. It demonstrates, concretely and measurably, that an automated loop running on your data produces models that compete with larger, more expensive alternatives. Not because big models are bad (they're remarkable), but because a model built specifically for your problem, from your data, on your hardware, through hundreds of automated experiments, will fit your actual needs in ways that a general-purpose API call cannot.
The competitive advantage in this world doesn't belong to whoever has the biggest model or the most GPUs. It belongs to whoever has the most relevant data and the discipline to build automated experimental loops around it. That's a fundamentally different game than the one the industry has been playing, and it has fundamentally different winners.
I want to be careful about overclaiming here. Autoresearch is 630 lines of code training small language models. It is not replacing Llama or GPT-5 or frontier research at scale. Foundation models are extraordinary tools, and the idea that they're suddenly irrelevant because Karpathy wrote a training loop would be silly.
But what autoresearch demonstrates is a direction. The trajectory it implies: that model creation becomes automated, that it happens locally, that it operates on your data rather than everyone's data, that the human's job shifts from building models to defining objectives and curating data. That trajectory is real, it's accelerating, and it's going to reshape how organizations think about AI infrastructure in ways that the "just call the API" narrative hasn't accounted for.
Karpathy opened the autoresearch README with a characteristically playful sci-fi scenario about autonomous agent swarms running across "compute cluster megastructures in the skies." Then he added: "This repo is the story of how it all began."
He might be joking. He also might not be.
Want to learn how intelligent data pipelines can reduce your AI costs? Check out Expanso. Or don't. Who am I to tell you what to do.
NOTE: I'm currently writing a book, based on what I've seen firsthand, about the real-world challenges of data preparation for machine learning, focusing on operations, compliance, and cost. I'd love to hear your thoughts!