This is exciting; it means there might be some hope for good automated tests.
Right now, what everyone has to do is create tags to test around the actual answer: your response carries tags that describe the answer, so you don't have to assert on the actual non-deterministic text.
Then you have benchmark tests that compare the actual response using a similarity score to check for drift over time.
If they can figure this issue out, those two test sets can be combined reliably and we can test the actual response in a functional test.
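To make that concrete, here's a rough sketch of the two tiers I mean (the client and its response shape are hypothetical stand-ins, not a real API):

```python
# Hypothetical sketch: tag-based functional tests plus a similarity-score
# benchmark today; an exact-output test once inference is deterministic.
from difflib import SequenceMatcher

def call_llm(prompt: str) -> dict:
    """Placeholder for your own inference client; assumed to return
    {'text': ..., 'tags': [...]} where the tags describe the answer."""
    raise NotImplementedError

def test_tags_describe_answer():
    # Tier 1: don't compare the non-deterministic text, compare the tags
    # attached to describe the answer.
    response = call_llm("What is the capital of France?")
    assert "geography" in response["tags"]
    assert "paris" in response["tags"]

GOLDEN = "The capital of France is Paris."

def test_response_drift_benchmark():
    # Tier 2: benchmark test that scores the actual text against a golden
    # answer and flags drift over time.
    response = call_llm("What is the capital of France?")
    score = SequenceMatcher(None, response["text"].lower(), GOLDEN.lower()).ratio()
    assert score > 0.8  # threshold tuned per prompt

def test_exact_output_once_deterministic():
    # If inference becomes bitwise deterministic, the two tiers collapse
    # into one ordinary functional test on the exact response.
    response = call_llm("What is the capital of France?")
    assert response["text"] == GOLDEN
```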
From the article:

Reproducibility is a bedrock of scientific progress. However, it’s remarkably difficult to get reproducible results out of large language models.
...
But why aren’t LLM inference engines deterministic? One common hypothesis is that some combination of floating-point non-associativity and concurrent execution leads to nondeterminism based on which concurrent core finishes first. We will call this the “concurrency + floating point” hypothesis for LLM inference nondeterminism.
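The floating-point half of that hypothesis is easy to see in isolation; a quick demo (mine, not the article's):

```python
# Floating-point addition is not associative: the grouping changes the
# rounding, so the same three numbers can sum to different values.
import numpy as np

a, b, c = np.float32(1e8), np.float32(-1e8), np.float32(1.0)
print((a + b) + c)  # 1.0
print(a + (b + c))  # 0.0 -- the 1.0 is lost when rounded against -1e8 first
```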
...
Although [floating-point non-associativity] is the underlying cause of non-identical outputs, it does not directly answer where the nondeterminism comes from. It doesn’t help us understand why floating-point values get added in different orders, when that happens, or how it can be avoided.
...
Although concurrent atomic adds do make a kernel nondeterministic, atomic adds are not necessary for the vast majority of kernels. In fact, in the typical forward pass of an LLM, there is usually not a single atomic add present.
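For context on why atomic adds matter at all, here's a small illustration (my own, not from the article) of how accumulation order alone changes a float32 sum, which is exactly what a nondeterministic atomic-add completion order amounts to:

```python
import numpy as np

rng = np.random.default_rng(0)
vals = (rng.standard_normal(10_000) * 1e4).astype(np.float32)

def accumulate(xs):
    # One float32 add at a time, in the given order.
    total = np.float32(0.0)
    for x in xs:
        total = np.float32(total + x)
    return total

shuffled = vals.copy()
rng.shuffle(shuffled)

print(accumulate(vals))      # one "completion order"
print(accumulate(shuffled))  # another order, same numbers
# The two totals usually differ in the low bits, even though no value changed.
```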
...
As it turns out, our request’s output does depend on the parallel user requests. Not because we’re somehow leaking information across batches — instead, it’s because our forward pass lacks “batch invariance”, causing our request’s output to depend on the batch size of our forward pass.
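A quick way to probe this yourself (a sketch, not the article's code; whether the difference is nonzero depends on your GPU, backend, and library versions):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
torch.manual_seed(0)

B, D = 2048, 4096
a = torch.randn(B, D, device=device)
w = torch.randn(D, D, device=device)

alone = torch.mm(a[:1], w)     # same row, batch size 1
in_batch = torch.mm(a, w)[:1]  # same row, batch size 2048

# Each run is deterministic, but the result for this one request can change
# with the batch it was computed in (i.e. the kernel is not batch-invariant).
print((alone - in_batch).abs().max())
```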
...
When you make a query to an inference endpoint, the amount of load the server is under is effectively “nondeterministic” from the user’s perspective. The load determines the batch size that the kernels are run under, and thus changes the eventual result of each individual request!
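A toy sketch of why the batch size ends up load-dependent (hypothetical server code, purely for illustration):

```python
# A batching server grabs whatever requests arrived in the current window
# and runs them as one forward pass, so the batch size tracks server load.
import queue

def collect_batch(pending: "queue.Queue[str]", max_batch: int = 32) -> list[str]:
    batch = []
    while len(batch) < max_batch and not pending.empty():
        batch.append(pending.get_nowait())
    return batch  # size depends on how many other users hit the server

pending = queue.Queue()
for i in range(5):  # 5 concurrent users this instant; could be 1 or 200
    pending.put(f"request {i}")
print(len(collect_batch(pending)))  # -> 5, purely a function of current load
```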
...
So, the easiest way to ensure batch invariance for matmuls is to compile one kernel configuration and use that for all shapes. Although we will lose some performance, this isn’t typically disastrous in LLM inference. In particular, split-k is most needed when both M and N are small, and luckily in our case, N (i.e. the model dim) is usually pretty large!
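A plain-Python sketch of that idea (the config fields and heuristic are made up for illustration, not a real kernel library's API):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MatmulConfig:
    block_m: int
    block_n: int
    block_k: int
    split_k: int  # >1 splits the reduction across blocks (changes add order)

def pick_config_fast(m: int, n: int, k: int) -> MatmulConfig:
    # Typical heuristic: switch to split-K when M and N are too small to
    # fill the GPU -- but that changes the reduction order with batch size.
    if m <= 64 and n <= 64:
        return MatmulConfig(64, 64, 64, split_k=4)
    return MatmulConfig(128, 128, 64, split_k=1)

def pick_config_batch_invariant(m: int, n: int, k: int) -> MatmulConfig:
    # Batch-invariant choice: one configuration for every shape, so the
    # per-output reduction order never depends on the batch size.
    return MatmulConfig(128, 128, 64, split_k=1)
```

The cost is that small-M shapes run with an oversized tile, which is the performance hit the quote says is usually tolerable because N (the model dim) stays large.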