A problem in which an agent observes a context, selects a single action, and receives a reward for that action only; rewards for the actions it did not choose are never observed.
- Action space
- Reward design
- Short-term vs. long-term objectives
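
The observe, act, receive-reward loop from the definition can be sketched as a small simulation. This is a minimal illustration, not a reference implementation: it assumes discrete contexts, a tabular epsilon-greedy policy, and a binary click reward. The class name `EpsilonGreedyBandit`, the `simulate` helper, the context and action labels, and the click probabilities are all hypothetical.

```python
import random
from collections import defaultdict


class EpsilonGreedyBandit:
    """Tabular contextual bandit: one running-mean reward per (context, action)."""

    def __init__(self, actions, epsilon=0.1, seed=0):
        self.actions = list(actions)        # the action space
        self.epsilon = epsilon              # probability of exploring
        self.rng = random.Random(seed)
        self.counts = defaultdict(int)      # pulls per (context, action)
        self.values = defaultdict(float)    # mean reward per (context, action)

    def select(self, context):
        # Explore with probability epsilon, otherwise exploit the best estimate.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        return max(self.actions, key=lambda a: self.values[(context, a)])

    def update(self, context, action, reward):
        # Incremental mean: only the chosen action's estimate changes.
        key = (context, action)
        self.counts[key] += 1
        self.values[key] += (reward - self.values[key]) / self.counts[key]


def simulate(rounds=5000, seed=0):
    env = random.Random(seed + 1)
    # Hypothetical click probabilities, unknown to the agent.
    click_prob = {
        ("morning", "news"): 0.6, ("morning", "music"): 0.2,
        ("evening", "news"): 0.1, ("evening", "music"): 0.7,
    }
    bandit = EpsilonGreedyBandit(["news", "music"], epsilon=0.1, seed=seed)
    for _ in range(rounds):
        context = env.choice(["morning", "evening"])   # observe context
        action = bandit.select(context)                # select one action
        clicked = env.random() < click_prob[(context, action)]
        bandit.update(context, action, 1.0 if clicked else 0.0)  # reward
    return bandit
```

The reward here is an immediate click, which is a short-term signal; a different reward definition would plug into `update` without changing the loop.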
Naively, the bandit has to try every {item × explanation} combination many times before it can exploit the best one.
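
To make the cost of the naive approach concrete, the sketch below counts the arms when every {item × explanation} pair is treated as an independent arm. The function name `naive_exploration_cost` and the catalog sizes in the usage lines are illustrative assumptions, not figures from any real system.

```python
from itertools import product


def naive_exploration_cost(items, explanations, pulls_per_arm):
    """Return (number of arms, total pulls) when each pair is its own arm."""
    arms = list(product(items, explanations))   # every (item, explanation) pair
    return len(arms), len(arms) * pulls_per_arm


# Hypothetical sizes: 1,000 items, 20 explanations, 100 trials per pair.
n_arms, n_pulls = naive_exploration_cost(range(1000), range(20), 100)
print(n_arms, n_pulls)
```

Because the arm count is the product of the two sets, adding one explanation adds a whole new arm for every item, and that is before repeating the exploration for each context.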

BaRT: Explore, Exploit, Explain
Distribution aware rewards

Seonglae Cho