Normalize the values so that they sum to 1.0 (a probability distribution):
simply exponentiate each value and then divide by the sum of the exponentials (softmax is essentially a normalized exponential)
One useful property of the softmax operation is that if we add a constant to all the input values, the result will be the same. So we can find the largest value in the input vector and subtract it from all the values. This ensures that the largest value is 0.0, and the softmax remains numerically stable.
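A minimal sketch of this in NumPy (the function name `softmax` and the example logits are illustrative, not from any particular library):

```python
import numpy as np

def softmax(x):
    # Subtract the max so the largest exponent is exp(0) = 1,
    # which avoids overflow for large inputs.
    shifted = x - np.max(x)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

logits = np.array([1000.0, 1001.0, 1002.0])
print(softmax(logits))   # ~[0.09, 0.245, 0.665] -- finite, sums to 1.0
# A naive np.exp(logits) would overflow to inf here.
```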
What's with the name "softmax"? The "hard" version of this operation, argmax, simply finds the maximum value, sets it to 1.0, and assigns 0.0 to all other values.
Softmax is a "softer" version of that: the largest value is emphasized and pushed towards 1.0, while still maintaining a probability distribution over all input values.
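A quick illustration of the contrast on a toy vector (sketch only):

```python
import numpy as np

x = np.array([2.0, 1.0, 0.5])

# "Hard" max: a one-hot vector with 1.0 at the largest entry.
hard = np.zeros_like(x)
hard[np.argmax(x)] = 1.0
print(hard)   # [1. 0. 0.]

# "Soft" max: the largest entry dominates, but every entry keeps
# some probability mass and the result still sums to 1.0.
soft = np.exp(x - x.max()) / np.exp(x - x.max()).sum()
print(soft)   # ~[0.63, 0.23, 0.14]
```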
The following techniques are not commonly used in practice, because the overhead of softmax is considered low on GPUs and in LLM workloads.
SoftMax Techniques
The softmax function is the general form of the sigmoid.
- Generalizing the logit, originally defined for 2 classes, to K classes derives the softmax function.
- Setting K = 2 in the softmax function reduces it back to the sigmoid function (checked in the sketch after this list).
- Generalizing the sigmoid function to K classes yields the softmax function.
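A small numerical check of the K = 2 reduction (the scalar logit `z` is arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(x):
    exps = np.exp(x - np.max(x))
    return exps / exps.sum()

z = 1.7  # any scalar logit
# With K = 2, softmax over [z, 0] puts the same probability on the
# first class as sigmoid(z): e^z / (e^z + 1) = 1 / (1 + e^-z).
print(softmax(np.array([z, 0.0]))[0])  # ~0.8455
print(sigmoid(z))                      # ~0.8455
```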
Classification
The softmax function lets the K class scores produced by a neural network be interpreted as probabilities.
Sigmoid is used for activation and softmax for classification, but they are mathematically the same;
the only difference is whether 2 classes or K classes are handled.
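A short example of this use; the logits below are made up rather than taken from a real network:

```python
import numpy as np

def softmax(x):
    exps = np.exp(x - np.max(x))
    return exps / exps.sum()

# Hypothetical raw scores (logits) for K = 4 classes from a
# network's final linear layer.
logits = np.array([2.1, -0.3, 0.8, 0.1])
probs = softmax(logits)

print(probs)             # ~[0.67, 0.06, 0.18, 0.09] -- sums to 1.0
print(np.argmax(probs))  # 0, the predicted class
```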
Discrete
- Concrete softmax https://arxiv.org/abs/1611.00712
- Gumbel softmax https://arxiv.org/abs/1611.01144
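For reference, a minimal sketch of Gumbel-softmax sampling as described in those papers (the function name and temperature value are illustrative):

```python
import numpy as np

def gumbel_softmax_sample(logits, temperature=1.0, rng=np.random.default_rng()):
    # Sample Gumbel(0, 1) noise: g = -log(-log(u)), u ~ Uniform(0, 1).
    u = rng.uniform(size=logits.shape)
    gumbel = -np.log(-np.log(u))
    # Softmax over (logits + noise) / temperature gives a differentiable,
    # approximately one-hot sample; lower temperature -> closer to one-hot.
    y = (logits + gumbel) / temperature
    exps = np.exp(y - y.max())
    return exps / exps.sum()

logits = np.array([1.0, 2.0, 0.5])
print(gumbel_softmax_sample(logits, temperature=0.5))
```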