In the multi-armed bandit problem we try to maximise our gain over time by "gambling on slot-machines" that have different (but unknown) expected outcomes.

The concept is typically used as an alternative to A/B-testing used in marketing research or website optimization. The benefit of viewing website optimization as a multi-armed bandit problem instead of an A/B-testing problem is that no pre-defined sample sizes are needed and the algorithm will start optimizing the outcome (e.g. click rate) from the beginning, while the A/B-test needs to run all predefined samples to make a conclusion.

The following post will illustrate how to implement and solve a very simple multi-armed bandit problem with a probabilistic algorithm.

In [1]:

```
%matplotlib inline import sys
import numpy as np
import scipy.stats as stats
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns sns.set_style('darkgrid')
np.random.seed(42)
# Print versions used
print('Python: {}.{}.{}'.format(*sys.version_info[:3]))
print('numpy: {}'.format(np.__version__))
print('matplotlib: {}'.format(matplotlib.__version__))
print('seaborn: {}'.format(sns.__version__))
#
```

We will simulate 3 bandits with each an underlying probability of winning ` p_bandits `

. The ` pull(i) `

method will "pull the arm" of bandit $i$ and randomly decides based on the underlying probability if the pull was a win (1) or not (0).

In [2]:

```
# Define the multi-armed bandits
nb_bandits = 3 # Number of bandits
# True probability of winning for each bandit
p_bandits = [0.45, 0.55, 0.60] def pull(i): """Pull arm of bandit with index `i` and return 1 if win,
else return 0.""" if np.random.rand() < p_bandits[i]: return 1 else: return 0
```

Each pull of a specifc bandit will give a win with a certain probability. The higher this probability the more likely pulling the arm of the bandit is going to result in a win. However we don't know what this probability is so we will have to model it based on our observations of a certain bandit resulting in a win or not. We are going to model the unkown probability with a parameter $\theta$. Based on oberving a single outcome $x$ we can model this as the distribution $P(\theta \mid x)$. Using Bayes rule we can write this as:

$$ P(\theta \mid x) = \frac{P(x \mid \theta)P(\theta)}{P(x)} $$

$P(\theta \mid x)$ is called the posterior distribution of $\theta$ after observing $x$. We will be able to compute this via the conditional distribution $P(x \mid \theta)$ and prior $P(\theta)$. The conditional distribution $P(x \mid \theta)$ in this case follows the Bernoulli distribution .

### Beta prior ¶

A good choice for the prior $P(\theta)$ would be the Beta distribution since it is a conjugate prior to the Bernoulli distribution. This means that if we choose the Beta distribution as a prior that the posterior will also be a Beta distribution. More specifically if the prior is $Beta(\alpha, \beta)$ then after observing a win $x=1$ the posterior would become $Beta(\alpha+1, \beta)$ or after observing a loss $x=0$ the posterior would become $Beta(\alpha, \beta+1)$.

This means that after every observation we can use the posterior as the prior for the next time we pull the bandit's arm. A more in-depth illustration can be found here .

The goal of the muli-armed bandit problem is to maximize reward (minimize regret). There is an exploitation-exploration tradeoff we have to make here. The more we pull the arm of our percieved best bandit the more certain we become of the probability of that bandit. But other bandits that we haven't pulled that often might have a lower expected probability but with higher uncertaintly. There is a chance that they are actually better than our percieved best bandit.

We will use Thompson sampling to overcome this problem. In Thompson sampling we will for each bandit sample the probability $\theta$ from the prior and pull the bandit with the highest sampled probability. And repeat this step until finished.

We will start with the prior $Beta(\alpha=1, \beta=1)$, which corresponds to a uniform prior between 0 and 1. The run is simulated for 1000 steps and the results at certain steps are plotted below.

In [3]:

```
# Define plotting functions
# Iterations to plot
plots = [1, 2, 5, 10, 25, 50, 100, 200, 500, 1000] def plot(priors, step, ax): """Plot the priors for the current step.""" plot_x = np.linspace(0.001, .999, 100) for prior in priors: y = prior.pdf(plot_x) p = ax.plot(plot_x, y) ax.fill_between(plot_x, y, 0, alpha=0.2) ax.set_xlim([0, 1]) ax.set_ylim(ymin=0) ax.set_title(f'Priors at step {step:d}')
#
```

In [4]:

```
# Simulate multi-armed bandit process and update posteriors # Setup plot
fig, axs = plt.subplots(5, 2, figsize=(12.0, 11))
axs = axs.flat # The number of trials and wins will represent the prior for each
# bandit with the help of the Beta distribution.
trials = [0, 0, 0] # Number of times we tried each bandit
wins = [0, 0, 0] # Number of wins for each bandit n = 1000
# Run the trail for `n` steps
for step in range(1, n+1): # Define the prior based on current observations bandit_priors = [ stats.beta(a=1+w, b=1+t-w) for t, w in zip(trials, wins)] # plot prior if step in plots: plot(bandit_priors, step, next(axs)) # Sample a probability theta for each bandit theta_samples = [ d.rvs(1) for d in bandit_priors ] # choose a bandit chosen_bandit = np.argmax(theta_samples) # Pull the bandit x = pull(chosen_bandit) # Update trials and wins (defines the posterior) trials[chosen_bandit] += 1 wins[chosen_bandit] += x plt.tight_layout()
plt.show()
```

In [5]:

```
# Print final outcome and number of test needed per bandit
emperical_p_bandits = [(1+w) / (1+t) for t, w in zip(trials, wins)]
for i in range(nb_bandits): print((f'True prob={p_bandits[i]:.2f}; ' f'Emperical prob={emperical_p_bandits[i]:.2f}; ' f'Trials={trials[i]:d}'))
#
```

Note above that the algorithm quickly converges to select the bandit with the highest probability of winning. After 1000 iterations it has chosen the winning bandit 10 times more often than one of the other bandits.