Model Fitting, Fitting Probability Distribution, Parametric Learning
Methods find the most likely parameters that explain the data, and they boil down to an optimization problem over those parameters. Let the statistical experiment be a sample X₁, …, Xₙ of i.i.d. random variables in some measurable space Ω, usually Ω ⊆ ℝ; the model parameter is θ (hyperparameters are held fixed) and D denotes the data set.
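Under the i.i.d. assumption the likelihood of the whole sample factorizes into per-observation terms; this is the quantity that every estimator below manipulates (a standard identity, written out here for reference):
\begin{align*}
L(\theta) = p(X_1, \ldots, X_n \mid \theta) = \prod_{i=1}^{n} p(X_i \mid \theta),
\qquad
\log L(\theta) = \sum_{i=1}^{n} \log p(X_i \mid \theta)
\end{align*}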
- In MLE, we update the weights through backpropagation to maximize the likelihood of the data, obtaining an optimal point estimate.
- In MAP estimation, we update the weights through backpropagation to maximize the posterior probability, obtaining an optimal point estimate (both cases are sketched in the code after this list).
- In Bayesian inference, we do not keep a single weight vector; instead we compute, or approximate with gradient-based methods, the posterior distribution over the weights, obtaining a density estimate rather than a point estimate.
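A minimal sketch of the first two bullets, assuming a logistic-regression model and plain gradient descent as a stand-in for backpropagation (the model, the synthetic data, and the `fit` helper are illustrative choices, not from the notes above). Setting the prior precision to zero recovers MLE; a positive value gives MAP with a Gaussian prior:

```python
import numpy as np

# Illustrative setup: logistic-regression weights fitted by gradient descent on
# (i) the average negative log-likelihood (MLE) and
# (ii) the same objective plus a Gaussian log-prior term (MAP).

rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))
true_w = np.array([1.5, -2.0, 0.5])
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ true_w))).astype(float)

def fit(X, y, prior_precision=0.0, lr=0.5, steps=2000):
    """prior_precision = 0 recovers MLE; > 0 gives MAP with a N(0, (1/prior_precision) I) prior."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))          # predicted probabilities
        # gradient of (1/n) * sum_i -log p(y_i | x_i; w)  plus  gradient of -(1/n) log p(w)
        grad = (X.T @ (p - y) + prior_precision * w) / n
        w -= lr * grad
    return w

w_mle = fit(X, y)                        # maximize the likelihood only
w_map = fit(X, y, prior_precision=5.0)   # likelihood + Gaussian prior (an L2 penalty)
print("MLE estimate:", w_mle)
print("MAP estimate:", w_map)
```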
MLE is the intuitive baseline, MAP is a generalized MLE with a non-constant log-prior, and ERM (empirical risk minimization) is the further generalization that allows an arbitrary loss function and regularization term:
\begin{align*}
\hat{\theta}_{\mathrm{MAP}}
&= \arg\max_{\theta}\Bigl[\sum_{i=1}^n \log p(y_i\mid x_i;\theta) + \log p(\theta)\Bigr] \\[6pt]
&= \arg\min_{\theta}\Bigl[-\tfrac{1}{n}\bigl(\sum_{i=1}^n \log p(y_i\mid x_i;\theta) + \log p(\theta)\bigr)\Bigr] \\[6pt]
&= \arg\min_{\theta}\Bigl[\tfrac{1}{n}\sum_{i=1}^n \underbrace{-\log p(y_i\mid x_i;\theta)}_{\ell(x_i,y_i;\theta)} + \underbrace{-\tfrac{1}{n}\log p(\theta)}_{\Omega(\theta)}\Bigr]
\end{align*}
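For instance, with a zero-mean Gaussian prior (an assumed choice, not fixed by the notes above), the regularizer Ω(θ) reduces to a scaled squared L2 norm, so MAP coincides with L2-regularized ERM (weight decay):
\begin{align*}
p(\theta) = \mathcal{N}(\theta \mid 0, \sigma^2 I)
\quad\Longrightarrow\quad
\Omega(\theta) = -\tfrac{1}{n}\log p(\theta) = \tfrac{1}{2n\sigma^2}\lVert\theta\rVert_2^2 + \text{const}
\end{align*}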
Point Estimations

Parameter Estimation Notion