A problem where an agent observes a context, selects an action once, and receives a reward based on that action alone; unlike full reinforcement learning, there is no long-horizon state to carry between rounds.
- Action space
- Reward design
- Short-term vs. long-term objectives
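The observe-context, act, receive-reward loop can be sketched with a simple epsilon-greedy policy. This is a minimal illustration; the class and parameter names are hypothetical, not from any of the linked papers.

```python
import random

class EpsilonGreedyBandit:
    """Epsilon-greedy contextual bandit with per-(context, action) mean rewards."""

    def __init__(self, actions, epsilon=0.1):
        self.actions = actions
        self.epsilon = epsilon
        self.counts = {}   # (context, action) -> number of pulls
        self.values = {}   # (context, action) -> running mean reward

    def select(self, context):
        # Explore with probability epsilon; otherwise exploit the best-known arm.
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions,
                   key=lambda a: self.values.get((context, a), 0.0))

    def update(self, context, action, reward):
        # Incremental mean update for the pulled (context, action) pair.
        key = (context, action)
        n = self.counts.get(key, 0) + 1
        self.counts[key] = n
        mean = self.values.get(key, 0.0)
        self.values[key] = mean + (reward - mean) / n
```

With `epsilon=0` the policy is purely exploitative, which is the traditional recommender behavior the Spotify post contrasts against.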
Naively, the bandit has to try every possible combination of {item x explanation} many times before it can exploit the best combination.
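The cost of that naive scheme is multiplicative in the catalog sizes. A back-of-the-envelope sketch (the numbers below are illustrative, not from the paper):

```python
# Treating every {item x explanation} pair as an independent arm
# multiplies the size of the action space.
items = 1000
explanations = 20
independent_arms = items * explanations  # 20,000 arms

# If each arm needs roughly k pulls before its reward estimate is
# trustworthy, naive exploration cost scales with the product too.
k = 30
naive_samples = independent_arms * k  # 600,000 interactions
```

Sharing a model across arms (e.g., via context features, as BaRT does) avoids paying this product directly.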

BaRT: Explore, Exploit, Explain
Explore, Exploit, Explain: Personalizing Explainable Recommendations with Bandits - Spotify Research
The multi-armed bandit is an important framework for balancing exploration with exploitation in recommendation. Exploitation recommends content (e.g., products, movies, music playlists) with the highest predicted user engagement and has traditionally been the focus of recommender systems. Exploration recommends content with uncertain predicted user engagement for the purpose of gathering more information. The importance of...
https://research.atspotify.com/publications/explore-exploit-explain-personalizing-explainable-recommendations-with-bandits/
Distribution-aware rewards
Deriving User- and Content-specific Rewards for Contextual Bandits - Spotify Research
Given the overwhelming choices faced by users on what to watch, read and listen to online, recommender systems play a pivotal role in helping users navigate the myriad of choices. Most modern recommender systems are powered by interactive machine learning algorithms, which learn to adapt their recommendations by estimating a model of the user satisfaction...
https://research.atspotify.com/2019/05/deriving-user-and-content-specific-rewards-for-contextual-bandits/
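One way to make a reward "distribution-aware" is to score an interaction relative to what is typical for that user and for that content, rather than using the raw engagement signal. The z-score construction below is an illustrative sketch under that assumption, not the exact method from the Spotify paper.

```python
import statistics

def distribution_aware_reward(raw, user_history, content_history):
    """Score raw engagement relative to user- and content-level distributions."""

    def zscore(x, hist):
        # Not enough history, or no variance: treat the signal as neutral.
        if len(hist) < 2:
            return 0.0
        mu = statistics.mean(hist)
        sd = statistics.stdev(hist)
        return (x - mu) / sd if sd > 0 else 0.0

    # Average the user-relative and content-relative signals.
    return 0.5 * (zscore(raw, user_history) + zscore(raw, content_history))
```

A heavy listener's 30-second play then counts for less than the same play from a user who rarely streams, since each is judged against its own baseline.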

Seonglae Cho