Will I Make It To The Restaurant Before The Soup Dumplings Get Cold? (And Other Problems In Machine Learning)
 
I'm chronically late. Not because I want to be rude - I feel terrible about it every single time - but because I'm catastrophically bad at predicting how long it takes to get anywhere.
Turns out machine learning algorithms have the exact same problem.
Here's how it happens: Dinner is at 7pm. I know where the restaurant is. I have a perfectly clear route in my head: office door → hallway → elevator → street → subway → walk → restaurant. Very well defined. Call it 14 minutes, door to door.
The problem is, I get distracted. I'm deep in some problem, reading an article, debugging something. Suddenly it's 6pm. Then 6:30. Then 6:45. And I think: "Well, 14 minutes, so as long as I leave by 6:46, I'm fine."
Except the route is nondeterministic.
Are the cleaners in the office, so I have to take a longer way? Is the elevator busy? Is the subway running slow? Did I just miss it by 30 seconds? Is it raining and the sidewalks are full of people with big umbrellas?
What I confidently think of as a "14-minute journey" might actually take 25 minutes. I leave with just enough time to make it in the ideal case - because that's what I'm planning for in my head - and congratulations, now I've kept someone waiting (and the xiaolongbao are congealing) for 10-15 minutes because of things "out of my control."
Or, realistically, things that were entirely in my control, since I could have just left earlier. Sorry, everyone.
At 3am, it might be an incredibly fast journey. During the Puerto Rican Day Parade, much slower. But I don't know these things unless I understand exactly what the flow looks like at that specific moment.
The Route You Don't Even Know
The thing is, machine learning training barely understands what the route even is.
You give a training system an objective function - "is this a dog or a cat?" - and at the end, you tell it whether it got there or not. That's it.
No GPS along the way. No measurements. No iPhone telling you "turn left in 500 feet." Nothing during the journey. Just the final result: did you make it, and if not, by how much did you miss?
It's like my restaurant problem, times a billion. The algorithm has no idea if the actual best route might be to go through Queens and come all the way back down. It's just trying a billion different routes and remembering which ones work.
After billions of training examples (Chinchilla showed us how many billions we actually need), the model gets remarkably good at finding routes that work.
The Bigger Problem: Even When You Know The Route
Unlike me - I'm never going to walk that route a billion times, no matter how many dinners I'm late to - a machine learning algorithm can try many different routes in parallel, far more times than any human ever could. So the model generally has a good sense of the route. It's figured it out.
But here's the real nondeterminism problem: Even when you take the exact same route - same hallway, same elevator, same subway, same streets, same sequence of turns - you get different arrival times. Every. Single. Time.
Not by much. Maybe a minute or two. But it's never identical.
Same route. Different time. Every time.
You can imagine how frustrating this is. If there's one thing machines like, it's EXACTLY repeatable answers to questions. (Humans do too, but we're more tolerant of small changes.) And if you're doing things a billion times, you REALLY need the exact same answers to the exact same questions.
I know everyone is constantly talking about the latest thing and how this is the revolution, but Thinking Machines Lab just announced something a month ago that I genuinely think is a huge pivot point for our industry. They published "Defeating Nondeterminism in LLM Inference," and they didn't just explain the problem. They figured out how to make the same route take the same time, every single time.
It doesn't have a business model, but I have to believe every inference engine will be adopting it shortly.
What Everyone Thought Was Happening (But Was Wrong)
For years, the conventional wisdom blamed the problem on something akin to phantom traffic jams.
The accepted explanation, which you'll find repeated everywhere, was this:
"Floating-point arithmetic in GPUs is non-associative, meaning $(a+b)+c \neq a+(b+c)$ due to finite precision and rounding errors. Because GPUs run operations in parallel across many threads, the execution order is unpredictable. This random order leads to different rounding patterns each time, creating nondeterminism."
This explanation suggests that every time you take your trip, invisible, unpredictable traffic slows down random streets. The math itself has a "randomness" baked in from the thread scheduling. It seems plausible. Case closed. Nothing we can do about it.
Except the Thinking Machines team—Horace He, et al.—noticed something that didn't fit. They ran this simple experiment:
Python
import torch

# Two fixed matrices, multiplied 1,000 times on the GPU.
A = torch.randn(2048, 2048, device='cuda', dtype=torch.bfloat16)
B = torch.randn(2048, 2048, device='cuda', dtype=torch.bfloat16)
ref = torch.mm(A, B)
for _ in range(1000):
    # Every repetition should be bitwise identical to the reference result.
    assert (torch.mm(A, B) - ref).abs().max().item() == 0
Matrix multiplication on a GPU. Same matrices. 1,000 times in a row. The results were bitwise identical every single time.
There were no phantom traffic jams. The route, when taken as a single, uninterrupted journey, was perfectly deterministic. If the "random thread scheduling + floating point math" story were the whole truth, this code should have failed. But it didn't. The individual GPU operations are perfectly repeatable.
So, if it's not random, what's actually happening?
The Real Cause
The problem isn't random. The problem is that the way we measure the route changes based on unrelated factors, leading to microscopic, but critical, differences in the final calculation.
Let's go back to my dinner trip. Imagine I time my 14-minute journey with a hyper-accurate digital stopwatch that calculates down to the nanosecond, even if it only displays minutes and seconds.
- Scenario 1: One Request. I time the whole trip as one segment. I press 'start' at my desk and 'stop' at the restaurant. The stopwatch calculates a single, precise duration: 14.1123173271 minutes.
- Scenario 2: Three Requests Batched Together. I decide to time three segments separately: (1) office to subway, (2) the subway ride, and (3) the walk to the restaurant. I press the 'lap' button at each stage.
Here’s the crucial part: because of the non-associativity of floating-point math—the fact that $(a+b)+c$ can be a microscopically different number than $a+(b+c)$—the way the stopwatch's internal chip adds up the lap times gives a different result. It might calculate (lap1 + lap2) + lap3 and arrive at a final duration of 14.1123173274 minutes.
The difference is a rounding error a dozen decimal places out. It's completely imperceptible to me. But it is a different number.
This is exactly what happens in an inference server like vLLM. It processes requests in batches to maximize GPU utilization.
- Processing 1 sequence? The GPU performs one set of grouped calculations.
- Processing 10 sequences? It groups the calculations differently to be more efficient.
Each grouping changes the order of operations—not randomly, but systematically and deterministically based on the batch size. A different order leads to different floating-point rounding patterns. This creates a slightly different numerical result, which can cause the model to pick a different "most probable" token, leading to a completely different output that cascades from that point on.
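You can see this batch-size sensitivity for yourself with a test in the same spirit as the snippet above. (A sketch - the exact difference you get depends on your GPU, CUDA, and PyTorch versions, and on some setups it may even be zero.)
Python
import torch

# The same row, computed alone vs. as part of a larger batch. Mathematically
# identical; numerically, the kernel may organize its reductions differently
# depending on the batch shape.
torch.manual_seed(0)
A = torch.randn(2048, 2048, device='cuda', dtype=torch.bfloat16)
B = torch.randn(2048, 2048, device='cuda', dtype=torch.bfloat16)

row_alone   = torch.mm(A[:1], B)      # batch of one
row_batched = torch.mm(A, B)[:1]      # same row, pulled out of the full batch

print((row_alone - row_batched).abs().max())  # often nonzero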
The problem wasn't a phantom traffic jam. It was that our stopwatch was giving us a different measurement depending on how many lap times we asked it to record.
Why Addition Order Actually Matters
Floating-point arithmetic isn't associative. This isn't a bug - it's mathematics. The paper gives a perfect example:
(0.1 + 1e20) - 1e20 = 0
0.1 + (1e20 - 1e20) = 0.1
The order you do additions in fundamentally matters. Computers aren't infinitely large, so you represent real (infinite precision) numbers with finite precision. Rounding happens at each operation, and different operation orders produce different rounding patterns.
Kahan summation exists precisely because naive summation loses accuracy. The whole field of numerical analysis exists because these details matter.
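You don't need a GPU to watch this happen. A minimal sketch in plain Python:
Python
# Floating-point addition is not associative: grouping changes the rounding.
print((0.1 + 1e20) - 1e20)    # 0.0  -- the 0.1 is swallowed by the huge term
print(0.1 + (1e20 - 1e20))    # 0.1  -- the huge terms cancel first

# The same effect appears when a long sum is split into chunks of different
# sizes -- that is, when the "reduction strategy" changes.
import random
random.seed(0)
values = [random.uniform(-1.0, 1.0) for _ in range(100_000)]

def chunked_sum(xs, chunk_size):
    # Sum fixed-size chunks left to right, then sum the partial results.
    partials = [sum(xs[i:i + chunk_size]) for i in range(0, len(xs), chunk_size)]
    return sum(partials)

print(chunked_sum(values, 10) == chunked_sum(values, 1000))  # frequently False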
If I add up my trip segments like this: A + B + C, I might get one total. But if I group them differently, like (A + B) + C, the rounding in my mental math could produce a slightly different result. In machine learning, this is called "reduction strategy."
The paper introduces a potential solution:
"The requirement for batch invariance is that the reduction order for each element must be fixed regardless of the batch-size of the kernel. Note that this doesn't mean we must always use the same reduction strategy. For example, if we change the number of elements we're reducing over, we can still be batch-invariant even if our reduction strategy changes."
The Three Operations That Need Fixing
The solution sounds simple: take the same route every time, regardless of how many people you're grouping together. In practice, this requires rethinking how three fundamental GPU operations work.
Operation One: RMSNorm (The Simplest Case)
"Standard implementations parallelize by splitting the reduction across multiple workers. If you're normalizing a vector with 10,000 elements and have 10 workers, each handles 1,000 elements. But with 100 workers? Each handles 100 elements. Different split points = different reduction orders."
What is RMSNorm? Root Mean Square Normalization is used in models like LLaMA. You compute a scaling factor based on all values in a vector. Computing that factor requires reducing (combining) values across potentially thousands of dimensions.
The dinner analogy: Imagine calculating your total travel time by asking different friends to time different segments. One friend times "office to subway," another times the subway ride, another times "subway exit to restaurant." The order you add up their reports affects your final rounded total.
Small batch? Few workers timing segments. Large batch? Many workers. Different worker counts = different addition orders = different results.
The fix: Always split at the same boundaries. Always combine sub-totals in the same order. Some GPU cores might sit idle with small batches, but you get consistent computation.
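To make "always split at the same boundaries" concrete, here's a toy sketch of the idea in plain PyTorch - deliberately not the paper's actual kernel, just the shape of a reduction whose chunking never depends on batch size:
Python
import torch

def rmsnorm_fixed_order(x, weight, chunk=256, eps=1e-6):
    # Toy RMSNorm: x has shape (batch, hidden). The sum of squares over
    # `hidden` is always accumulated in the same fixed-size chunks, in the
    # same left-to-right order, no matter what the batch size is.
    hidden = x.shape[-1]
    acc = torch.zeros(x.shape[:-1], dtype=torch.float32, device=x.device)
    for start in range(0, hidden, chunk):
        piece = x[..., start:start + chunk].float()
        acc = acc + (piece * piece).sum(dim=-1)
    rms = torch.sqrt(acc / hidden + eps)
    return (x.float() / rms.unsqueeze(-1) * weight.float()).to(x.dtype)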
Operation Two: Matrix Multiplication (Different Route Segments)
"Modern GPU kernels tile operations - breaking matrices into blocks and computing block-by-block. Tile size typically depends on available parallelism. More sequences in your batch? Larger tiles. Fewer sequences? Smaller tiles. Different tiles = different accumulation patterns = different floating-point operation orders."
GPUs break matrices into smaller rectangular "tiles," compute products on those tiles, then combine results. This is how CUTLASS and Triton work.
The dinner analogy: Instead of timing your trip as one journey, you break it into segments. Maybe you measure every 5 blocks, then combine those. If the segment size changes based on how many people you're traveling with (5 blocks when alone, 3 blocks with two people), you get different rounding patterns.
The solution: Fix tile sizes regardless of batch configuration. You trade some GPU efficiency for deterministic ordering.
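As a toy sketch (again, not a real GPU kernel - the real thing lives inside CUTLASS or Triton), "fix the tile sizes" looks roughly like this:
Python
import torch

def matmul_fixed_k_tiles(A, B, k_tile=128):
    # Accumulate the product over the inner (K) dimension in fixed-size tiles,
    # always in the same order, so the rounding pattern does not change with
    # the batch configuration.
    K = A.shape[1]
    out = torch.zeros(A.shape[0], B.shape[1], dtype=torch.float32, device=A.device)
    for start in range(0, K, k_tile):
        out += A[:, start:start + k_tile].float() @ B[start:start + k_tile, :].float()
    return out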
Operation Three: Attention (The Actually Hard Problem)
This is where it gets genuinely difficult:
"Consider this scenario: You have 80 tokens in your KV cache and you're processing 48 new tokens. With a block size of 32, standard implementations need three blocks (two full, one masked) for cached values and two blocks (one full, one masked) for new values - five total blocks for 128 total elements. But if you had 0 cached tokens and were processing all 128 at once? Four blocks total. Same number of elements, different reduction organization."
Attention is the mechanism that lets transformers weigh how much each token should "pay attention" to other tokens. Modern implementations like FlashAttention and PagedAttention optimize this heavily, but they organize computation differently based on cache state and batch size.
The dinner analogy: Imagine calculating your arrival time not just from your own travel, but by checking with everyone already at the restaurant ("how long ago did you arrive?") and everyone currently traveling ("how far along are you?"). The order you process and combine these reports - and whether you batch "people already there" separately from "people traveling now" - affects your computed time due to rounding.
The fix requires two things:
- Update the KV cache before attention, ensuring consistent layout regardless of how many tokens you're processing.
- Move from "fixed number of splits" to "fixed split size" strategies:
"Instead of fixing the # of splits, we fix the size of each split and then end up with a varying number of splits. In this manner, we can guarantee that regardless of how many tokens we're processing, we always perform the identical reduction order."
Instead of saying "divide my route into 4 equal segments no matter how long," you say "every segment is exactly 5 blocks." Whether your trip is 20 blocks (exactly 4 segments) or 23 blocks (4.6 segments), each individual segment uses the same measurement pattern.
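The difference between the two strategies fits in a few lines of arithmetic. A sketch with made-up numbers, not the paper's actual configuration:
Python
import math

# Made-up numbers: 80 cached tokens plus 68 new ones, chosen to get an uneven total.
total_tokens = 80 + 68   # 148

# Fixed NUMBER of splits: the split size -- and the reduction order inside
# each split -- changes whenever the total length changes.
num_splits = 4
split_size_a = math.ceil(total_tokens / num_splits)    # 37 here, something else next time

# Fixed SIZE of each split: the number of splits varies instead, but every
# split reduces over the same-sized chunk in the same order.
split_size_b = 32
num_splits_b = math.ceil(total_tokens / split_size_b)  # 5 splits of up to 32 tokens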
This required contributing changes to PyTorch's FlexAttention. That's how deep into the stack this goes.
The Soup Dumplings Experiment
Here's where the rubber meets the road:
"We use Qwen/Qwen3-235B-A22B-Instruct-2507 and sample 1000 completions at temperature 0 with the prompt 'Tell me about Richard Feynman' (non-thinking mode), generating 1000 tokens each."
Temperature 0 should be the easy mode: always pick the single most likely next token. No creativity, no randomness. It should be perfectly deterministic. Same route, same time, always.
With standard vLLM:
- 80 different outputs from 1000 runs
- Most common output appeared 78 times (less than 8% of runs!)
- First divergence at token 103
- 992 completions said one thing
- 8 completions said something else
Same prompt. Same temperature. Same model. Different results.
But when they switched on batch-invariant kernels:
"... all of our 1000 completions are identical. This is what we would mathematically expect from our sampler, but we aren't able to achieve deterministic results without our batch-invariant kernels."
Same route. Same time. Every. Single. Time.
One thousand runs. One output. Bitwise identical.
The time I predicted in my head is exactly the time it took, every single time.
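If you want to run this kind of check against your own serving stack, the experiment itself is simple. A sketch, assuming an OpenAI-compatible endpoint like the one vLLM exposes - the URL, API key, model name, and run count are placeholders to swap for your own:
Python
from collections import Counter
from openai import OpenAI

# Placeholders -- point these at your own deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "Qwen/Qwen3-235B-A22B-Instruct-2507"

outputs = []
for _ in range(1000):
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": "Tell me about Richard Feynman"}],
        temperature=0,
        max_tokens=1000,
    )
    outputs.append(resp.choices[0].message.content)

counts = Counter(outputs)
print(f"{len(counts)} unique completions out of {len(outputs)} runs")
print("most common completion appeared", counts.most_common(1)[0][1], "times")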
The Performance Trade-Off
You don't get this for free. Their initial implementation runs about 2x slower (55 seconds vs 26 seconds). With optimization, 1.6x slower (42 seconds).
The paper is honest:
"Much of the slowdown comes from the fact that the FlexAttention integration in vLLM has not been heavily optimized yet. Nevertheless, we see that performance is not disastrous."
Is 1.6x slower acceptable? Depends.
For production serving billions of queries where milliseconds matter? Maybe not yet.
For research where reproducibility is paramount? Absolutely.
For model development and testing where you need exact repeatability? Without question.
For reinforcement learning from human feedback where policy drift can break training? This might be necessary, not optional.
True On-Policy RL: The Big Unlock
Here's where this goes from "nice infrastructure improvement" to "fundamentally changing what's possible." The paper drops this point almost casually, but its impact is profound:
"As researchers have noted, the different numerics between training and inference implicitly turns our on-policy RL into off-policy RL. ... [D]eterministic inference enables us to also modify our training stack to obtain bitwise identical results between sampling and training, thus resulting in true on-policy RL."
To understand why this is such a big deal, we need to quickly break down the terms.
What is On-Policy vs. Off-Policy RL?
In reinforcement learning, a policy is just the agent's strategy - in our case, the specific route to the restaurant.
On-Policy: You learn from the exact route you're taking, right now. You take the 6 train, it's slow, and you learn "the 6 train is slow at this time." The policy you're improving is the same one you're using to gather experience.
Off-Policy: You learn about a different route than the one you're currently on. You're stuck on the 6 train, but you check your phone to see how the F train is doing. You're learning about the F train's performance without actually riding it.
The Miscalibrated Watch Problem
The numerical drift between generating text (sampling) and learning from it (training) accidentally turns on-policy methods into off-policy ones.
It's like trying to optimize your route to the restaurant, but your watch is slightly miscalibrated each time, so you can't be sure if a change actually helped or if the measurement just varied.
The standard fix is a patch called importance weighting, where you try to mathematically correct for the drift. But you're just patching over a problem that shouldn't exist in the first place.
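For the curious, the standard correction weights each sampled token's contribution by the ratio between the probability the trainer assigns to it and the probability the sampler assigned when it generated it - roughly $w = \pi_{\text{train}}(a \mid s) / \pi_{\text{sample}}(a \mid s)$ (my notation, not the paper's). When trainer and sampler are bitwise identical, that ratio is exactly 1 and the correction becomes a no-op.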
Calibrating the Watch
Batch-invariant kernels solve this by ensuring the "route" of the calculations is bit-for-bit identical every time. This creates true on-policy RL. The results from the paper are striking:
- Without the patch: The model's performance quickly collapses.
- With the patch (Importance Weighting): Training works, but a small amount of drift remains, and the run wobbles around it.
- True On-Policy (Batch-Invariant): The drift is a flat line at zero. The training is perfectly stable.
As the paper notes:
"...when running 'True On-Policy RL', our KL-divergence stays flat at 0, indicating that there is no divergence between the training policy and sampling policy."
My watch is finally calibrated. Same route, same time. Now I can actually optimize.
This isn't just theoretical. It directly impacts the entire post-training phase of LLM development, making methods like RLHF and DPO more stable and reliable.
What Actually Changes Now
Let me be concrete, because lots of people scream "THIS CHANGES EVERYTHING," but it doesn't mean much without specific use cases.
Research Reproducibility Becomes Real
Right now, validating that model A outperforms model B requires running multiple trials and computing statistical significance. Not because the models are inherently random at temperature 0, but because your measurement apparatus is inconsistent.
With deterministic inference:
- Precise A/B tests with tight bounds
- Claims become testable with certainty
- Replication studies actually replicate exactly
- Meta-analyses don't worry about implementation differences
The replication crisis in AI research is partly about researchers not sharing details. But it's also about implementations subtly differing. Batch-invariant kernels remove one source of variation.
Debugging Gets Orders of Magnitude Easier
When a model produces bad output, reproducing it exactly means you can:
- Trace through execution step by step, including inspecting intermediate activations
- Add instrumentation and re-run with identical results
- Binary search through the token sequence to find where things went wrong
Nondeterministic systems make debugging probabilistic. "Well, usually it does X, but sometimes it does Y" is a developer's worst nightmare.
With determinism, debugging becomes systematic. Every run is identical. You can use all the normal debugging tools and trust that what you see is what you'll get next time.
Caching Becomes Bulletproof
Production LLM serving uses aggressive caching. Common queries, prefix caching, continuous batching with shared KV caches - all assume identical inputs produce identical outputs.
But with nondeterminism, that assumption is leaky. Cache hit rates are lower than they should be.
Deterministic inference means:
- Perfect cache hit rates for identical inputs
- Ability to cache and reuse intermediate computations
- Simpler cache invalidation logic
For companies serving millions of queries, this translates directly to infrastructure cost savings.
Model Testing Becomes Precise
Quality assurance for LLMs is currently statistical. You can't write a test that says "for this exact input, model must produce this exact output" because you can't guarantee exact outputs.
With deterministic inference:
- Write precise regression tests that detect when model updates change specific behaviors (see the sketch after this list)
- Build comprehensive test suites that don't flake
- Use differential testing between model versions confidently
Being able to trust your tests means faster iteration and fewer production surprises.
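A sketch of what that first bullet could look like in practice - the generate function and the golden string are placeholders for your own deterministic stack and captured outputs, not a real API:
Python
# Pytest-style sketch. `generate` and EXPECTED are placeholders.
EXPECTED = "..."  # golden output, captured once from a deterministic run

def generate(prompt: str) -> str:
    raise NotImplementedError("call your deterministic inference endpoint here")

def test_feynman_prompt_is_bitwise_stable():
    # With batch-invariant kernels and temperature 0, this can be an exact
    # string comparison instead of a fuzzy or statistical check.
    assert generate("Tell me about Richard Feynman") == EXPECTED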
Research Velocity Increases
Maybe the biggest effect: researchers spend less time on statistics, more time on science.
Right now, substantial ML research time goes to:
- Running enough trials to achieve statistical power
- Correcting for various sources of variance
- Arguing about whether differences are significant
With deterministic inference, you eliminate one major source of variance. Experiments become simpler. Results become clearer. More time on actual research questions, less on measurement methodology.
It's like how version control doesn't directly make your code better, but it removes friction and coordination overhead, so teams move faster. Deterministic inference removes friction from the research process.
Why This Is Infrastructure That Matters
I started this post talking about how I'm chronically late to dinner because I can't predict travel time. Machine learning has had the same problem, except worse: even when you know the route, you still can't predict the time.
Thinking Machines Lab could have kept this proprietary. Built "Deterministic Inference as a Service." Charged premium prices. Created a moat.
Instead, they:
- Published the full paper with complete technical details
- Released the code under permissive licenses
- Contributed improvements back to PyTorch FlexAttention
- Wrote extensive documentation
- Shared benchmark results
There's no business model here because there shouldn't be one.
The paper concludes:
"Modern software systems contain many layers of abstractions. In machine learning, when we run into nondeterminism and subtle numerical differences it can often be tempting to paper over them. After all, our systems are already 'probabilistic', so what's wrong with a little more nondeterminism? What's wrong with bumping up the atol/rtol on the failing unit test? The difference in logprobs between the trainer and the sampler probably isn't a real bug, right? We reject this defeatism."
"We reject this defeatism." What a line.
Everyone accepted nondeterminism as inevitable. Built workarounds. Adjusted tolerances. The entire ecosystem adapted to work around the problem rather than solving it.
Thinking Machines Lab asked: "What if we actually solve this?" Not "how do we minimize impact" or "how do we statistically correct for it," but "what's the root cause and how do we eliminate it?"
This is systems thinking applied to infrastructure. The problem isn't floating-point arithmetic or GPU concurrency per se - it's how we organize work across different batch configurations. The solution isn't to fight floating-point behavior, but to ensure consistent operational ordering regardless of context.
The Deeper Pattern
Most "infrastructure improvements" are about making existing things faster or cheaper. Speed up training. Reduce serving costs. Compress models. These matter - FlashAttention matters, quantization matters, efficient architectures matter.
But occasionally, someone fixes a problem that was so fundamental we stopped seeing it as a problem. We adapted. We built workarounds. The problem became part of the landscape.
Kubernetes didn't make containers faster - it made container orchestration not be a custom nightmare for every company. Git didn't make code better - it made collaboration not be a coordination nightmare. Rust didn't make systems programming faster - it made memory safety not require garbage collection.
These are foundational. They remove entire classes of problems. You don't work around them - you stop having to think about them entirely.
Batch-invariant kernels do this for LLM inference reproducibility. It's not a workaround. It's a solution. The problem just stops existing.
And here's the thing about foundational infrastructure: it only works if everyone adopts it. Network effects matter. Standards matter. Hoarding foundational infrastructure slows everyone down while extracting rent from the field.
Thinking Machines Lab understood this. They knew they could have built a business here. But they also understood that this particular innovation is more valuable if it becomes ubiquitous. Making it freely available means every inference engine can adopt these techniques. Every research lab can reproduce results. Every production deployment can get consistent behavior.
The field moves faster when the foundations are solid and shared.
What Happens Next
I expect batch-invariant kernels to show up in:
- vLLM as an opt-in flag initially, then possibly default
- TensorRT-LLM within months
- Text Generation Inference (HuggingFace)
- llama.cpp for local inference
- Major cloud providers' serving infrastructure
This won't be a competitive differentiator for long. It'll just become how inference works. Like how HTTPS used to be optional and is now expected. Like how Unicode support used to be a feature and is now assumed.
Which is exactly right.
We spend enormous energy talking about breakthrough model architectures. Mixture of Experts. State Space Models. Long context. These matter. Architecture matters.
But infrastructure like this - unglamorous, technically deep, freely shared - is what makes systematic progress possible. When the foundations are solid, everything built on top becomes more reliable. When measurements are consistent, optimization becomes possible. When experiments are reproducible, science can function.
Two Problems, One Solution
Remember, we started with two nondeterminism problems:
Problem #1: We don't know the route (ML training still doesn't tell you what matters)
Problem #2: Even when we DO know the route, we get different times every time (can't reproduce results)
Thinking Machines Lab solved Problem #2. They discovered the root cause - batch size changing computation order - and fixed it. Now if you take the same route, you get the same time. 100% of the time. No more random variations.
We still haven't solved Problem #1. People are working on interpretability, on attribution, on understanding which streets made the difference. That's the next frontier.
But now we can measure accurately. And measurement is the foundation of science.
My watch finally works. Same route, same time, every time.
Maybe I'll even make it to the restaurant before the soup dumplings get cold.
Onward.
The full paper "Defeating Nondeterminism in LLM Inference" includes extensive technical details, benchmarks, and ablation studies. The batch-invariant kernel implementations are available at github.com/thinking-machines-lab/batch-invariant-ops. Related work on FlexAttention improvements has been upstreamed to PyTorch.
It's a scorcher, go read it!
For more on why deterministic computation matters in ML systems, see Reproducible Machine Learning, Numerical Reproducibility in HPC, and the broader replication crisis in AI.
Want to learn how intelligent data pipelines can reduce your AI costs? Check out Expanso. Or don't. Who am I to tell you what to do.
NOTE: I'm currently writing a book based on what I have seen about the real-world challenges of data preparation for machine learning, focusing on operational, compliance, and cost. I'd love to hear your thoughts!