Randomness
Temperature is only applied to output logit softmax
- When T=0, it selects the token with highest probability (deterministic, greedy decoding)
- When T approaches infinity, probability distribution becomes uniform, increasing randomness
Even with temperature = 0, you can still see nondeterminism in LLM inference because of batching. When the batch size changes, the GPU kernel may evaluate the same mathematical expression in a different order. For example, batch size 1 vs. batch size 2048 can lead to different tile sizes, different reduction orders, different Tensor Core instruction choices, and different parallelization strategies.
The main reason is not simply “GPU threads finish in random order so atomic-add ordering changes”; more precisely, the reduction strategy selected by the kernel itself can change with batch size. Because floating‑point arithmetic is not associative, changing the computation order can change results at the bit level, which can then cascade into different generated tokens.
RMSNorm, matmul, and attention all involve reductions. If the reduction order depends on the batch size, batch invariance breaks. The solution is to use a batch-invariant kernel (i.e., enforce a reduction strategy that does not change with batch size).
Defeating Nondeterminism in LLM Inference
Reproducibility is a bedrock of scientific progress. However, it’s remarkably difficult to get reproducible results out of large language models. For example, you might observe that asking ChatGPT the same question multiple times provides different results. This by itself is not surprising, since getting a result from a language model involves “sampling”, a process that converts the language model’s output into a probability distribution and probabilistically selects a token. What might be more surprising is that even when we adjust the temperature down to 0This means that the LLM always chooses the highest probability token, which is called greedy sampling. (thus making the sampling theoretically deterministic), LLM APIs are still not deterministic in practice (see past discussions here, here, or here). Even when running inference on your own hardware with an OSS inference library like vLLM or SGLang, sampling still isn’t deterministic (see here or here).
https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/

Why should we use Temperature in softmax?
I'm recently working on CNN and I want to know what is the function of temperature in softmax formula? and why should we use high temperatures to see a softer norm in probability distribution?Softmax
https://stackoverflow.com/questions/58764619/why-should-we-use-temperature-in-softmax/63471046#63471046
Transformer로 텍스트를 생성하는 다섯 가지 전략
Hugging face에서 정리한 자연어 생성 디코딩 전략 포스팅을 번역 & 정리한 포스트입니다 ❤️ Source - hugging face ❤️ 더 좋은 디코딩 전략으로 자연어 생성 모델의 성능 높이기 원본 포스팅: https://huggingface.co/blog/how-to-generate?fbclid=IwAR19kbEiW_sF19TeSr4BE4jQZSIqz0GzOFD2013fIGEH32DReW9pAFq6vDM 포스팅에서 소개하는 전략은 아래와 같이 표현할 수 있는 모든 auto-regressive 언어 모델에 적용 가능하다. 또한, 다섯 가지 디코딩 전략은 hugging face의 transformer 라이브러리에서 함수로 호출해 사용할 수 있다. import tensorflow as tf fro..
https://littlefoxdiary.tistory.com/46

Seonglae Cho