Roulette Wheels for Multi-Armed Bandits: A Simulation in R

One of my favorite data science blogs comes from James McCaffrey, a software engineer and researcher at Microsoft. He recently wrote a blog post on a method for allocating turns in a multi-armed bandit problem.

I really liked his post, and decided to take a look at the algorithm he described and code up a function to do the simulation in R.

Note: this is strictly an implementation of Dr. McCaffrey’s ideas from his blog post, and the algorithm should not be taken as my own.

You can find the .Rmd file for this post on my GitHub.

Background

The basic idea of a multi-armed bandit is that you have a fixed number of resources (e.g. money at a casino) and you have a number of competing places where you can allocate those resources (e.g. four slot machines at the casino). These allocations occur sequentially, so in the casino example, we choose a slot machine, observe the success or failure from our play, and then make the next allocation decision. Since we’re data scientists at a casino, hopefully we’re using the information we’re gathering to make better gambling decisions (is that an oxymoron?).

We want to choose the best place to allocate our resources, and maximize our reward for each allocation. However, we should shy away from a greedy strategy (just play the winner), because it doesn’t allow us to explore our other options.

There are different strategies for choosing where to allocate your next resource. One of the more popular choices is Thompson sampling, which usually involves sampling from a Beta distribution, and using the results of that sampling to determine your next allocation (out of scope for this blog post!).
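For reference, the core of Thompson sampling fits in just a few lines of R. This is only a sketch with made-up success/failure counts, and it isn’t used anywhere else in this post:

# made-up success/failure counts for three machines
successes <- c(2, 3, 5)
failures  <- c(8, 7, 5)
# draw one Beta sample per machine, then play the machine with the largest draw
draws <- rbeta(length(successes), shape1 = successes + 1, shape2 = failures + 1)
which.max(draws)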

Code: roulette_wheel

The following function implements the roulette wheel allocation, for a flexible number of slot machines.

The function starts by generating a warm start with the data. We need to gather information about our different slot machines, so we allocate a small number of resources to each one to collect information. After we do this, we start the real allocation. We pick the next machine to play based on where a draw from a random uniform distribution falls relative to the machines’ cumulative (normalized) observed success probabilities.

So, if our observed success probabilities are

machine   observed_prob   cumulative_prob   selection_range
      1             0.2               0.2           0.0-0.2
      2             0.3               0.5           0.2-0.5
      3             0.5               1.0           0.5-1.0

And our draw from the random uniform distribution was 0.7, we’d pick the third machine (0.7 falls between 0.5 and 1.0). This selection criterion is the main point of Dr. McCaffrey’s algorithm. For a better and more thorough explanation, I’d suggest reading his blog post.
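To make that concrete, here is the selection step in isolation (a sketch using the made-up probabilities from the table above, mirroring the logic inside the function below):

observed_probs   <- c(0.2, 0.3, 0.5)                      # made-up example values
probs_normalized <- observed_probs / sum(observed_probs)
cumu_probs       <- cumsum(probs_normalized)              # 0.2, 0.5, 1.0
draw <- 0.7                                               # pretend runif(1) returned 0.7
findInterval(draw, c(0, cumu_probs))                      # returns 3: play machine 3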

We then continue this process (playing a slot machine, observing the outcome, recalculating observed probabilities, and picking the next slot machine) until we run out of coins.

And here’s the code:

roulette_wheel <- function(coins = 40, 
                           starts = 5,
                           true_prob = c(0.3, 0.5, 0.7)){
  # must have enough coins to generate initial empirical distribution
  if (coins < (length(true_prob) * starts)){
    stop("To generate a starting distribution, each machine must be",
         " played ",
         starts,
         " times - not enough coins to do so.")
  }
  # allocate first ("warm up")
  SS <- sapply(true_prob, FUN = function(x) sum(rbinom(starts, 1, x)))
  FF <- starts - SS
  # calculate metrics used for play allocation
  probs <- SS / (SS + FF)
  probs_normalized <- probs / sum(probs)
  cumu_probs_normalized <- cumsum(probs_normalized)
  # update number of coins
  coins <- coins - (length(true_prob) * starts)
  # create simulation data.frame
  sim_df <- data.frame(machine = seq_along(true_prob),
                       true_probabilities = true_prob,
                       observed_probs = probs,
                       successes = SS,
                       failures = FF,
                       plays = SS + FF,
                       machine_played = NA,
                       coins_left = coins)
  # initialize before while loop
  sim_list <- vector('list', length = coins)
  i <- 1
  # play until we run out of original coins
  while(coins > 0){
    # which machine to play?
    update_index <- findInterval(runif(1), c(0, cumu_probs_normalized))
    # play machine
    flip <- rbinom(1, 1, true_prob[update_index])
    # update successes and failure for machine that was played
    SS[update_index] <- SS[update_index] + flip
    FF[update_index] <- FF[update_index] + (1-flip)
    # update metrics used for play allocation
    probs <- SS / (SS + FF)
    probs_normalized <- probs / sum(probs)
    cumu_probs_normalized <- cumsum(probs_normalized)
    # update number of coins
    coins <- coins - 1    
    # update simulation data.frame (very inefficient)
    sim_list[[i]] <- data.frame(machine = seq_along(true_prob),
                                true_probabilities = true_prob,
                                observed_probs = probs,
                                successes = SS,
                                failures = FF,
                                plays = SS + FF,
                                machine_played = seq_along(true_prob) == update_index,
                                coins_left = coins)
    i <- i + 1
  }
  # show success:failure ratio
  message("Success to failure ratio was ",
          round(sum(SS) / sum(FF), 2),
          "\n",
          paste0("(", 
                 paste0(SS, collapse = "+"), 
                 ")/(", 
                 paste0(FF, collapse = "+"), ")"))
  # return data frame of values from experiment
  rbind(sim_df, do.call('rbind', sim_list))
}

Data Analysis

I’ll show a brief example of what we can do with the data generated from this function.

set.seed(123)
rw1 <- roulette_wheel(coins = 5000, 
                      starts = 10, 
                      true_prob = c(0.1, 0.25, 0.5, 0.65))
## Success to failure ratio was 1.06
## (15+228+835+1490)/(213+662+826+731)
Table 1: Final simulation result

machine  true_probabilities  observed_probs  successes  failures  plays  machine_played  coins_left
      1                0.10          0.0658         15       213    228           FALSE           0
      2                0.25          0.2562        228       662    890           FALSE           0
      3                0.50          0.5027        835       826   1661           FALSE           0
      4                0.65          0.6709       1490       731   2221            TRUE           0
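(Table 1 is just the last few rows of the returned data frame, e.g. tail(rw1, 4).)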

Let’s look at how the observed probabilities changed over time:
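Here’s a quick ggplot2 sketch that draws this from the rw1 data frame, using coins_left as a (reversed) time axis; dashed lines mark the true probabilities:

library(ggplot2)

ggplot(rw1, aes(x = coins_left, y = observed_probs, colour = factor(machine))) +
  geom_line() +
  geom_hline(aes(yintercept = true_probabilities, colour = factor(machine)),
             linetype = "dashed") +
  scale_x_reverse() +   # time runs left to right as coins are spent
  labs(x = "coins left", y = "observed probability", colour = "machine")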

And how did our plays for each machine accumulate through time?
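A similar sketch works here, since the plays column already tracks the running total for each machine:

ggplot(rw1, aes(x = coins_left, y = plays, colour = factor(machine))) +
  geom_line() +
  scale_x_reverse() +
  labs(x = "coins left", y = "cumulative plays", colour = "machine")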

Boring!

Maybe if we run a simulation with far fewer coins, we’ll get a better sense of the variation in the number of plays each machine receives.

set.seed(1)
rw2 <- roulette_wheel(coins = 100, 
                      starts = 5, 
                      true_prob = c(0.1, 0.3, 0.65))
## Success to failure ratio was 0.82
## (1+16+28)/(11+26+18)
Table 2: Final simulation result

machine  true_probabilities  observed_probs  successes  failures  plays  machine_played  coins_left
      1                0.10          0.0833          1        11     12           FALSE           0
      2                0.30          0.3810         16        26     42           FALSE           0
      3                0.65          0.6087         28        18     46            TRUE           0

That shows our allocations a little bit better than the previous visualization.

Conclusion

This was a fun exercise for me, and it reminded me of a presentation I did in graduate school about a very similar topic. I also wrote a roulette wheel function in Python, and was moderately successful at that (it runs faster than my R function, but I’m less confident in how “pythonic” it is).

My biggest concern with this implementation is the situation in which the warm start produces all failures for a given slot machine. If a machine records zero successes during the warm start, its observed probability is zero, and it will never be selected for the rest of the simulation. To offset this, you could add a little “jitter” (technical term: epsilon) to the observed probabilities at each iteration. Another option would be to generate a second random uniform variable and, if that value is very small, pull a random lever rather than the one determined by the wheel; a sketch of that second option follows.
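Here’s roughly how that check could slot into the while loop (a sketch; epsilon is a hypothetical tuning parameter that isn’t part of the function above):

epsilon <- 0.05   # hypothetical exploration rate
if (runif(1) < epsilon) {
  # occasionally ignore the wheel and pull a random lever
  update_index <- sample(seq_along(true_prob), 1)
} else {
  update_index <- findInterval(runif(1), c(0, cumu_probs_normalized))
}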

Finally, I’d be interested in comparing the statistical properties of this algorithm and others that are used in sequential allocation problems…if I have the time.

