High state entropy (not action entropy)
Gather more information
Simplest: random actions (ε-greedy)
With (small) probability ε, act randomly (i.e. when a uniform random draw falls below the threshold ε) → the threshold is lowered gradually as learning proceeds → eventually reaches zero
With (large) probability 1-ε, act on current policy
- With a fixed ε, the agent can keep thrashing around (taking random actions) even once learning is done
One solution: lower ε over time (decaying ε-greedy)
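A minimal sketch of ε-greedy with a decaying ε, to make the steps above concrete. The function names, the linear schedule, and the default values (eps_start, eps_end, decay_steps) are assumptions chosen for illustration; the notes only say that ε is lowered over time, not how.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return an action index: random with probability epsilon, else greedy."""
    # rng.random() is uniform in [0, 1), so this branch fires with probability epsilon.
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Otherwise act on the current value estimates (greedy action).
    return max(range(len(q_values)), key=lambda a: q_values[a])


def decayed_epsilon(step, eps_start=1.0, eps_end=0.0, decay_steps=1000):
    """Linearly anneal epsilon from eps_start to eps_end over decay_steps (assumed schedule)."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)
```

Usage: inside a training loop, call `epsilon_greedy(q_values, decayed_epsilon(step))` at each step. Early on almost every action is random; once `step` reaches `decay_steps` the agent acts purely on its current estimates, so it stops thrashing around. An exponential schedule or a nonzero floor for ε would be equally reasonable choices.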
Explore areas whose badness is not (yet) established (optimism in the face of uncertainty)
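One common way to realise this kind of optimism is to add an uncertainty bonus to each action's value estimate, so rarely tried actions look attractive until they have been shown to be bad. The sketch below uses a UCB-style bonus for a simple bandit setting; the function name, the bonus form, and the constant c are assumptions for illustration, not something the notes specify.

```python
import math


def optimistic_action(q_values, counts, c=1.0):
    """Return the action maximizing estimated value plus an uncertainty bonus."""
    total = sum(counts)

    def score(a):
        if counts[a] == 0:
            # Never-tried actions are treated as maximally promising.
            return math.inf
        # The bonus shrinks as an action is tried more often (UCB-style, assumed form).
        return q_values[a] + c * math.sqrt(math.log(total) / counts[a])

    return max(range(len(q_values)), key=score)
```

Usage: `optimistic_action([0.5, 0.4], [100, 1])` prefers the second action even though its estimate is lower, because it has only been tried once and its badness is not yet established. With `c=0.0` the rule falls back to purely greedy selection among actions that have been tried.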
RL Exploration Notion