With this interactive dashboard you will understand how the Upper Confidence Bound algorithm works and how its parameters affect the result.
Upper Confidence Bound is a reinforcement learning algorithm for problems with incomplete information or uncertain rewards. It takes what it has learned so far into account when choosing future actions.
Assume you go to a casino and want to play the one-armed bandit machines. You see five of these slot machines and you don’t know which one to play. One round is one turn on a machine. In each round you have five possible actions, because you could play any of the five machines. The rewards are your wins or losses. These two elements, actions and rewards, are key to solving the multi-armed bandit problem, and you will find them in all reinforcement learning problems. Your goal is to find the strategy with the highest long-term reward.
The first input field is the iteration number. It defines how many rounds have been played, and by increasing it you can see step by step which bandit is chosen at that point in time. You can go back and forth to see the effect.
The other input parameters define the mean and standard deviation of each bandit’s returns. This information is unknown to the player.
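As a rough sketch of what these parameters control, assume each bandit’s return is drawn from a normal distribution with the given mean and standard deviation (the concrete numbers below are placeholders, not values from the dashboard, and the dashboard’s actual simulation may differ):

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Placeholder parameters for the five bandits; in the dashboard they come from the input fields.
means = [1.0, 2.0, 1.5, 0.5, 1.8]
stds = [2.0, 1.0, 1.5, 0.5, 1.0]

def pull(bandit: int) -> float:
    """Draw one return (win or loss) for the chosen bandit.

    The player only observes these samples, never the underlying
    mean and standard deviation.
    """
    return rng.normal(means[bandit], stds[bandit])
```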
For reference, the distribution of bandit returns up to the current round is plotted. Below this plot you can see the confidence bounds and average returns. The red confidence bound marks the bandit that is selected for the next round, because it currently has the highest upper confidence bound.
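A common way to compute these bounds is the UCB1 rule: a bandit’s upper confidence bound is its average observed return plus an exploration term that shrinks the more often the bandit has been played. The sketch below assumes this rule; the dashboard may use a slightly different exploration term or confidence level:

```python
import math

def select_bandit(counts: list[int], sums: list[float], t: int) -> int:
    """Return the index of the bandit with the highest upper confidence bound.

    counts[i] -- how often bandit i has been played in the first t rounds
    sums[i]   -- total return observed from bandit i so far
    """
    bounds = []
    for n_i, s_i in zip(counts, sums):
        if n_i == 0:
            bounds.append(float("inf"))  # untried bandits are played first
        else:
            avg = s_i / n_i  # average return of bandit i so far
            explore = math.sqrt(2 * math.log(t) / n_i)  # UCB1 exploration term
            bounds.append(avg + explore)
    return bounds.index(max(bounds))  # this bandit is played in the next round
```

Playing the selected bandit with a `pull` function like the one sketched above and updating `counts` and `sums` after each round closes the loop.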
The last graph shows the average return for each round and which bandit was chosen in that round. You can also see how often each bandit was chosen, and that even after the best bandit has apparently been identified, the other bandits are still tested once in a while: the confidence bound of a rarely played bandit stays wide, so it occasionally ends up with the highest upper bound and gets selected again.