Unmasking the Myth of LLM Reproducibility

Struggling with LLM reproducibility? Our latest blog post unpacks the real reason behind inconsistent outputs: it's not just floating-point math, it's a lack of "batch invariance." Learn why the most common hypothesis is wrong and how truly reproducible LLM inference is now within reach.

9/11/2025 · 2 min read


Reproducibility is a cornerstone of scientific progress, but it remains a significant challenge with large language models (LLMs). While you might expect deterministic results with a "temperature of 0," which forces the model to choose the most probable token, LLM APIs and even open-source inference libraries often produce different outputs for the same input. The common explanation for this is a "concurrency + floating-point" hypothesis, which suggests that the non-associativity of floating-point arithmetic combined with the non-deterministic order of parallel GPU calculations leads to varying results.

However, this hypothesis only tells part of the story. While floating-point non-associativity is the underlying cause of numerical differences, it's not the source of non-determinism itself. The true culprit is batch non-invariance.
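
To see the numerical effect the hypothesis points to, here is a tiny illustration in plain Python (no GPU needed): adding the same three numbers in a different order gives a different answer, because floating-point addition rounds at every step.

```python
# Floating-point addition is not associative: grouping changes the result.
a, b, c = 0.1, 1e20, -1e20
print((a + b) + c)  # 0.0 -- the 0.1 is lost when it is absorbed into 1e20 first
print(a + (b + c))  # 0.1 -- same numbers, different grouping, different answer
```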

The Real Reason for Non-Determinism

The core issue is that the output of an LLM query is not independent of the other requests being processed at the same time. From a user's perspective, the server's load, and with it the batch size of the inference run, is a non-deterministic variable. When a kernel lacks batch invariance, a change in that variable can produce a different result.

In practice, this means that even if a kernel (a low-level program that performs computations on a GPU) is run-to-run deterministic, the overall system is not. A simple matrix multiplication, for example, is run-to-run deterministic: given the same input on the same hardware, it will always produce the same result. But run that same matrix multiplication as part of a different batch size, and the values computed for the very same input rows can change. This is because modern GPU kernels, optimized for performance, often change their internal strategy (e.g., how they parallelize a task) based on the batch size.
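
A minimal sketch of this effect, assuming PyTorch and a CUDA GPU, is below: it computes the same output row once on its own and once as part of a larger batch. On many GPUs the two results differ in the low bits, even though each individual call is perfectly deterministic run to run.

```python
import torch

# Requires a CUDA GPU; on CPU the difference may well be zero.
A = torch.randn(2048, 2048, device="cuda", dtype=torch.bfloat16)
B = torch.randn(2048, 2048, device="cuda", dtype=torch.bfloat16)

row_alone    = torch.mm(A[:1], B)    # first row computed with "batch size" 1
row_in_batch = torch.mm(A, B)[:1]    # same row computed inside a batch of 2048

# Same inputs, same hardware, same library -- but possibly a different internal
# kernel strategy for the two shapes, and hence possibly different bits.
print((row_alone - row_in_batch).abs().max())  # often nonzero on real GPUs
```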

This holds true for the three most common operations in an LLM's forward pass:

  • RMSNorm: The reduction order for each element must be fixed, regardless of the batch size. Optimizing for smaller batch sizes can break this invariance.

  • Matrix Multiplication: The standard strategy is to keep the entire reduction for each output tile within a single core. For smaller input sizes, however, kernels may switch to a "Split-K" strategy that splits the reduction across cores and then combines partial results, which breaks batch invariance (see the reduction-order sketch after this list).

  • Attention: This is the most complex of the three. It's necessary to ensure that the reduction order for a given token does not depend on how many other tokens are being processed simultaneously. Optimizations like "chunked prefill" and "prefix caching" can easily break this invariance if not handled correctly.
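
To make the reduction-order point concrete, here is an illustrative NumPy sketch (not a real kernel) contrasting one fixed sequential reduction with a "split" reduction that combines partial sums, the same kind of reordering a Split-K matmul or a batch-size-dependent RMSNorm strategy introduces.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4096).astype(np.float32)

# One fixed, sequential reduction order (what a batch-invariant kernel commits to).
seq = np.float32(0.0)
for v in x:
    seq += v

# "Split" reduction: reduce two halves separately, then combine the partial sums,
# as a Split-K kernel would. Same numbers, different reduction order.
half_a = np.float32(0.0)
for v in x[:2048]:
    half_a += v
half_b = np.float32(0.0)
for v in x[2048:]:
    half_b += v
split = half_a + half_b

# Typically False: the two orderings round differently and disagree in the low bits.
print(seq == split, float(seq), float(split))
```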

How to Achieve Reproducibility

The solution is to build kernels that are batch-invariant. This means ensuring the numerical calculations and reduction order remain identical, regardless of the batch size. While this may come with a small performance trade-off, it’s a necessary step to achieve truly reproducible results.
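
As a deliberately naive sketch of what batch invariance means, the NumPy function below (the name rmsnorm_batch_invariant and its structure are purely illustrative, not any library's API) reduces every row in the same fixed order over the hidden dimension, so a given row's result is bit-identical whether it is processed alone or inside a larger batch. A production batch-invariant kernel commits to the same property while remaining fast.

```python
import numpy as np

def rmsnorm_batch_invariant(x: np.ndarray, weight: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Illustrative batch-invariant RMSNorm: one 'program' per row, and the
    reduction over the hidden dimension always runs in the same sequential
    order, independent of how many rows are in the batch."""
    out = np.empty_like(x)
    for i in range(x.shape[0]):          # each row handled independently
        acc = np.float32(0.0)
        for v in x[i]:                   # fixed reduction order per row
            acc += np.float32(v) * np.float32(v)
        rms = np.sqrt(acc / np.float32(x.shape[1]) + np.float32(eps))
        out[i] = (x[i] / rms) * weight
    return out

# The row-0 result is identical whether row 0 is normalized alone or in a batch.
x = np.random.default_rng(1).standard_normal((8, 64)).astype(np.float32)
w = np.ones(64, dtype=np.float32)
assert np.array_equal(rmsnorm_batch_invariant(x[:1], w)[0],
                      rmsnorm_batch_invariant(x, w)[0])
```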

This approach transforms the QA process, enabling testers to confidently verify model behavior and developers to reproduce bugs. It’s a vital step in moving LLM development from an unpredictable art to a reliable, scientific practice.