MLE

MLE is the special case of MAP estimation in which the prior is a uniform distribution (i.e., the prior probability is not taken into account).

\hat{\theta}_{mle} \in \argmax_\theta{p(\mathcal{D}|\theta)}

We usually assume that the training data is i.i.d., hence

p(\mathcal{D}|\theta) = \prod_{i=1}^np(Y_i|X_i, \theta)

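For computational reasons (a product of many small probabilities underflows, while a sum of logs does not), we work with the log-likelihood

\ell(\theta) = \log{p(\mathcal{D}| \theta)} = \log \mathcal{L}(\theta) = \sum_{i=1}^{n}\log p(Y_i|X_i, \theta)

and, instead of maximizing it, minimize the negative log-likelihood (NLL)

\text{NLL}(\theta) = -\ell(\theta) = -\log \mathcal{L}(\theta)

Because the logarithm is monotonically increasing, both problems have the same solution. Therefore we search

\hat{\theta}_{mle} \in \argmin_\theta \text{NLL}(\theta)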
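The computational motivation can be illustrated with a minimal sketch. The data here are hypothetical (2000 examples, each assigned probability 0.5 by the model), and the function names `likelihood` and `negative_log_likelihood` are made up for this illustration.

```python
import math

def likelihood(probs):
    """Product of the per-example probabilities p(Y_i | X_i, theta)."""
    result = 1.0
    for p in probs:
        result *= p
    return result

def negative_log_likelihood(probs):
    """Negative sum of log-probabilities, i.e. minus the log of the product above."""
    return -sum(math.log(p) for p in probs)

# Hypothetical data: the model assigns probability 0.5 to each of 2000 examples.
probs = [0.5] * 2000

print(likelihood(probs))               # underflows to 0.0 in double precision
print(negative_log_likelihood(probs))  # roughly 2000 * log(2), about 1386.29
```

The raw product is no longer usable for comparing parameter values once it underflows, whereas the NLL stays in a comfortable numeric range.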

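For example, with a Bernoulli model (a coin toss where Y_i = 1 denotes heads and \theta is the probability of heads), the NLL is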

\text{NLL}(\theta) = - \log \prod_{i=1}^n p(Y_i | \theta)

Expanding the probability p(Y_i | \theta):

= - \log \prod_{i=1}^n \theta^{1(Y_i = 1)} (1 - \theta)^{1(Y_i = 0)}

= - \sum_{i=1}^n \left[ 1(Y_i = 1) \log \theta + 1(Y_i = 0) \log(1 - \theta) \right]

Grouping terms for Y_i = 1 and Y_i = 0:

= - \left( N_1 \log \theta + N_0 \log(1 - \theta) \right)

where

N_j = \sum_{i=1}^n 1(Y_i = j), \quad j = 0, 1.

The total number of observations is

n = N_0 + N_1

Taking the derivative with respect to \theta:

\frac{d}{d\theta}\text{NLL}(\theta) = \frac{-N_1}{\theta} + \frac{N_0}{1-\theta}

Setting the derivative to zero and solving for \theta:

\frac{N_1}{\theta} = \frac{N_0}{1-\theta} \quad\Longrightarrow\quad \frac{\theta}{1-\theta} = \frac{N_1}{N_0} \quad\Longrightarrow\quad \theta = \frac{N_1}{N_0 + N_1}.
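As a sanity check on the derivative, the analytic expression can be compared against a central finite difference. This is only a sketch: the counts N_1 = 7 and N_0 = 3 are hypothetical, and the helper names are invented for this example.

```python
import math

def bernoulli_nll(theta, n1, n0):
    """NLL(theta) = -(N_1 log(theta) + N_0 log(1 - theta)), for 0 < theta < 1."""
    return -(n1 * math.log(theta) + n0 * math.log(1.0 - theta))

def bernoulli_nll_grad(theta, n1, n0):
    """Analytic derivative: -N_1 / theta + N_0 / (1 - theta)."""
    return -n1 / theta + n0 / (1.0 - theta)

def finite_difference(f, x, h=1e-6):
    """Central-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

n1, n0 = 7, 3  # hypothetical counts of ones and zeros

for theta in (0.2, 0.5, 0.7):
    analytic = bernoulli_nll_grad(theta, n1, n0)
    numeric = finite_difference(lambda t: bernoulli_nll(t, n1, n0), theta)
    print(theta, analytic, numeric)
```

At theta = N_1 / (N_0 + N_1) = 0.7 both values should be close to zero, matching the stationarity condition.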

The MLE is therefore given by

\hat{\theta}_{mle} = \frac{N_1}{n}

which is the empirical fraction of heads (observations with Y_i = 1).
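The closed-form result can be checked against a brute-force minimization of the NLL. The sketch below assumes a hypothetical list of ten coin tosses (1 = heads, 0 = tails) and uses a coarse grid search purely for illustration; the function names are not from any library.

```python
import math

def bernoulli_nll(theta, n1, n0):
    """NLL(theta) = -(N_1 log(theta) + N_0 log(1 - theta)), for 0 < theta < 1."""
    return -(n1 * math.log(theta) + n0 * math.log(1.0 - theta))

def bernoulli_mle(ys):
    """Closed-form MLE: the fraction of observations equal to 1."""
    n1 = sum(1 for y in ys if y == 1)
    return n1 / len(ys)

# Hypothetical coin tosses: 1 = heads, 0 = tails.
tosses = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]
n1 = sum(tosses)           # N_1, number of heads
n0 = len(tosses) - n1      # N_0, number of tails

theta_hat = bernoulli_mle(tosses)

# Grid search over (0, 1) in steps of 0.001; the endpoints are excluded
# because log(0) is undefined.
grid = [k / 1000 for k in range(1, 1000)]
theta_grid = min(grid, key=lambda t: bernoulli_nll(t, n1, n0))

print(theta_hat, theta_grid)  # both 0.7 for this data
```

The grid search agrees with N_1 / n here only because 0.7 happens to lie on the grid; in general it would return the nearest grid point.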