Minimize the cumulative difference of our decisions compared to the optimal choice
The reason we typically use regret minimization in online learning is analogous to why we use a replay buffer for sample efficiency in off-policy learning. In online learning, data arrives sequentially, so we cannot efficiently train on all data points at once; yet a criterion that accounts for the whole sequence is appropriate, because online learning must cope with a changing environment, unlike traditional batch training. We also compare against an optimal point because online learning optimizes gradually on data that is unknown in advance, which is exactly where the notions of regret and the optimum in hindsight become useful.
Regret minimization and reward maximization are philosophically opposite: the former evaluates the decisions already made against the past, while the latter focuses on future rewards.
Regret is the cumulative performance difference of past decisions compared to the optimal choice.
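As a minimal formalization, using standard online-convex-optimization notation (the symbols $f_t$, $x_t$, and $\mathcal{X}$ are assumptions, not taken from these notes): if we play $x_t$ at round $t$ and then observe the loss $f_t$, the regret after $T$ rounds is

$$
R_T \;=\; \sum_{t=1}^{T} f_t(x_t) \;-\; \min_{x \in \mathcal{X}} \sum_{t=1}^{T} f_t(x),
$$

i.e., the cumulative loss of the decisions actually made minus the loss of the single best fixed decision in hindsight.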
The implementation is essentially regularized Stochastic Gradient Descent (see the sketch below).
Compared to plain loss minimization, we simply add a projection onto the feasible set in order to minimize regret, and we never know the optimal point in advance.
To implement this, we use a restriction (projection) trick: it keeps the iterate close to the optimal point while preventing the updates from exploding.
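A minimal sketch of this idea, assuming projected Online Gradient Descent on a convex feasible set (here an L2 ball); the helper names `project_l2_ball` and `online_gradient_descent`, the step size, and the toy quadratic losses are illustrative assumptions, not something specified in these notes:

```python
import numpy as np

def project_l2_ball(x, radius):
    """Project x back onto the L2 ball of the given radius (the 'restriction trick')."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

def online_gradient_descent(grad_fns, dim, radius=1.0, eta=0.1):
    """Projected OGD: x_{t+1} = Proj_X(x_t - eta * grad_f_t(x_t)).

    grad_fns yields one gradient function per round, mirroring data that
    arrives sequentially. Returns the decision played at each round.
    """
    x = np.zeros(dim)
    decisions = []
    for grad_f in grad_fns:
        decisions.append(x.copy())       # commit to a decision before seeing the loss
        x = x - eta * grad_f(x)          # regularized-SGD-style update
        x = project_l2_ball(x, radius)   # projection keeps the iterate from exploding
    return decisions

# Toy usage: quadratic losses f_t(x) = ||x - z_t||^2 with targets z_t revealed one at a time.
rng = np.random.default_rng(0)
targets = rng.normal(size=(100, 5))
grads = [lambda x, z=z: 2.0 * (x - z) for z in targets]
played = online_gradient_descent(grads, dim=5, radius=1.0, eta=0.05)
```

With a decaying step size (roughly $\eta_t \propto 1/\sqrt{t}$), this scheme achieves $O(\sqrt{T})$ regret for convex losses with bounded gradients on a bounded feasible set, which is the sense in which the cumulative gap to the best fixed decision stays small.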