Learning to Optimize via Posterior Sampling
Summarizing the paper that shares the title of this post.
Original paper here
Problem Formulation
Before we delve too deeply into the algorithm itself, we have to understand the problem we are trying to solve. In this context we are looking at an agent that has a set of actions it can take. When it performs an action, it receives a reward for that action. The agent chooses its action randomly according to a policy generated from the history of past actions and rewards. We also need some way to measure the performance of actions, which in this setting is called regret. At each step, the regret is the difference between the expected reward of the best possible action and the expected reward of the action actually taken; the cumulative regret sums this difference over the entire history.
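As a rough sketch (in my own notation, not necessarily the paper's), the cumulative regret over T steps can be written as

$$
\mathrm{Regret}(T) \;=\; \sum_{t=1}^{T} \Big( \max_{a} \mathbb{E}\big[r(a)\big] \;-\; \mathbb{E}\big[r(a_t)\big] \Big),
$$

where $a_t$ is the action taken at step $t$ and the expectation is over the reward distribution.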
Algorithms
This paper compares posterior sampling to Upper Confidence Bound (UCB) optimization algorithms. UCB algorithms maintain an optimistic upper confidence bound on the reward of each action and select the action with the highest bound. Posterior sampling instead chooses a new action according to the probability that it is the optimal one. So posterior sampling builds a model of the reward function (most likely an approximate model, like a Gaussian process, which I have talked about before), chooses an action that is optimal under a sample drawn from that model, then updates the posterior mean and variance once the reward is observed. This process continues until termination.
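As a minimal sketch of that loop, here is posterior sampling on a toy Bernoulli bandit with Beta posteriors; the arm count and reward probabilities below are made up for illustration, and the paper's setting is more general (e.g. the Gaussian process models mentioned above).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical bandit: 3 arms with unknown Bernoulli reward probabilities.
true_probs = np.array([0.3, 0.5, 0.7])
n_arms = len(true_probs)

# Beta(1, 1) priors on each arm's success probability.
alpha = np.ones(n_arms)
beta = np.ones(n_arms)

for t in range(1000):
    # Draw one plausible model from the posterior and act greedily on it,
    # so each arm is picked with its posterior probability of being optimal.
    theta = rng.beta(alpha, beta)
    action = int(np.argmax(theta))

    # Observe a reward and update the posterior for the chosen arm.
    reward = rng.random() < true_probs[action]
    alpha[action] += reward
    beta[action] += 1 - reward

print("Posterior means:", alpha / (alpha + beta))
```

The key step is that the arm is chosen greedily with respect to a single posterior sample, which is what makes each arm's selection probability match its posterior probability of being optimal.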
Posterior sampling can provide benefits in different situations depending on the context of the problem. One of its biggest advantages is that it can be used to optimize over complicated models without requiring extensive problem-specific statistical analysis. Also, when the set of possible actions is large, computing confidence bounds for every action becomes expensive, but since posterior sampling only requires drawing a sample from the posterior rather than characterizing it explicitly, the computational effort is reduced. Of course, we can only claim these advantages relative to UCB algorithms, since the authors did not test against other algorithms.
Results
When comparing the regret of posterior sampling to the various UCB algorithms, posterior sampling clearly outperformed all of them (meaning it accumulated the least regret). The authors suggest the discrepancy is largely due to loose choices of confidence bounds in the UCB algorithms, which in turn stem from limited statistical analysis. In addition, some UCB algorithms have free parameters that must be chosen by the user, which can hurt performance when the user is not familiar with how to tune them. Posterior sampling avoids that. Recently, I have been doing research into posterior sampling for use in Bayesian optimization and engineering design, and it shows considerable promise.