SoftMax Function

Creator: Alan Jo
Created: 2020 Nov 1 15:13
Editor: Alan Jo
Edited: 2024 Mar 12 5:43

Normalizes the input values so that they sum to 1.0 (a probability distribution)

Simply exponentiate the values and then divide by their sum (softmax is a kind of mean)
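In symbols (the standard definition, written here in LaTeX):

$$\operatorname{softmax}(x)_i = \frac{e^{x_i}}{\sum_{j=1}^{K} e^{x_j}}, \qquad i = 1, \dots, K$$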
One useful property of the softmax operation is that if we add a constant to all the input values, the result will be the same. So we can find the largest value in the input vector and subtract it from all the values. This ensures that the largest value is 0.0, and the softmax remains numerically stable.
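A minimal sketch of that max-subtraction trick, assuming NumPy as the array library; the subtracted constant cancels in numerator and denominator, so the result is unchanged:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax: subtract the max before exponentiating."""
    x = np.asarray(x, dtype=np.float64)
    shifted = x - x.max()      # largest entry becomes 0.0; the shift cancels out in the ratio
    exps = np.exp(shifted)     # all exponents are <= 0, so no overflow
    return exps / exps.sum()   # normalize so the outputs sum to 1.0
```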
What's with the name "softmax"? The "hard" version of this operation, called argmax, simply finds the maximum value, sets it to 1.0, and assigns 0.0 to all other values. In contrast, the softmax operation serves as a "softer" version of that.
The largest value is emphasized and pushed towards 1.0, while still maintaining a probability distribution over all input values.
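A quick illustrative comparison, assuming NumPy (the input values are only an example):

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0])
probs = np.exp(x - x.max()) / np.exp(x - x.max()).sum()
print(np.round(probs, 3))             # [0.042 0.114 0.844] -- largest value emphasized, still sums to 1.0
print((x == x.max()).astype(float))   # [0. 0. 1.]          -- the "hard" argmax one-hot version
```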
The following techniques are not commonly used, because the overhead of softmax is considered low on GPUs and in LLM inference.
SoftMax Techniques
 

The softmax function is the general form of the sigmoid

  • Generalizing the logit, defined for 2 classes, to K classes derives the softmax function
  • Setting K = 2 in the softmax function reduces it back to the sigmoid function (see the derivation after this list)
  • Generalizing the sigmoid function to K classes yields the softmax function
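A standard way to see the K = 2 reduction (a sketch, writing $\sigma$ for the sigmoid):

$$\operatorname{softmax}(x)_1 = \frac{e^{x_1}}{e^{x_1} + e^{x_2}} = \frac{1}{1 + e^{-(x_1 - x_2)}} = \sigma(x_1 - x_2)$$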
 
 

Classification

The softmax function lets the K class scores produced by a neural network be interpreted as probabilities.
sigmoid is used for activation and softmax for classification, but they are mathematically the same;
the only difference is whether 2 classes or K classes are handled.
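A minimal numerical check of that equivalence, assuming NumPy; a single logit z for binary classification corresponds to a two-class softmax over [z, 0]:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

z = 1.7                        # an arbitrary logit
print(sigmoid(z))              # 0.8455...
print(softmax([z, 0.0])[0])    # same value: a 2-class softmax is a sigmoid of the logit difference
```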
 
 
 

Discrete

 
 
 
 
 

Recommendations