The policies submodule

Contents:

class multibeep.policies.UCB_p(base b, rng_class rng, float_t p)

Bases: multibeep.policies.base

UCB_p picks the arm with the largest Upper Confidence Bound $\mu_i + p \cdot \sigma_i$.

Parameters:
  • b (multibeep.bandits.bandit) – the bandit to be played
  • rng (multibeep.util.rng_class) – a valid random number generator
  • p (double) – prefactor of the standard deviation in the equation above. Controls the exploration vs. exploitation trade-off.
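
The selection rule above can be written down in a few lines. The following is an illustrative NumPy sketch only, assuming per-arm mean and standard deviation estimates as plain arrays; it is not the multibeep implementation or API.

    import numpy as np

    def ucb_p_select(means, stds, p):
        """Illustrative UCB_p rule: pick the arm maximizing mu_i + p * sigma_i."""
        scores = np.asarray(means) + p * np.asarray(stds)
        return int(np.argmax(scores))

    # Example: with p = 2 the poorly explored arm 1 (large sigma) wins
    # despite its lower mean estimate.
    print(ucb_p_select(means=[0.4, 0.3], stds=[0.05, 0.20], p=2.0))  # -> 1
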
class multibeep.policies.base

Bases: object

play_n_rounds(self, unsigned int n)

Automatically pulls multiple times.

Unless specified otherwise, the number of rounds to play equals the number of pulls.

Parameters: n (unsigned int) – number of rounds to be played
select_next_arm(self)

The policy suggests the next arm to pull.

Returns: unsigned int – the current index of the arm to pull
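
For orientation, play_n_rounds automates the loop of repeatedly asking the policy for an arm and pulling it. The sketch below is illustrative only; the bandit's pull method shown here is a hypothetical name assumed for the example, not taken from this documentation.

    # Illustrative sketch: `policy` is any multibeep.policies object,
    # `bandit.pull(...)` is a hypothetical method assumed for this example.
    def play_manually(policy, bandit, n):
        for _ in range(n):
            arm_index = policy.select_next_arm()  # policy suggests an arm
            bandit.pull(arm_index)                # hypothetical: pull that arm

    # In spirit, policy.play_n_rounds(n) automates exactly this loop.
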
class multibeep.policies.prob_match(base b, rng_class rng)

Bases: multibeep.policies.base

Probability Match pulls an arm at random, with each arm's probability of being chosen proportional to the probability that it has the highest mean.

Parameters:
  • b (multibeep.bandits.bandit) – the bandit to be played
  • rng (multibeep.util.rng_class) – a valid random number generator
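
One common way to realize probability matching is Thompson-style posterior sampling: draw one sample per arm from a posterior over its mean and pull the arm with the largest sample, so that over many rounds each arm is chosen with probability proportional to it having the highest mean. The Gaussian posteriors below are an assumption made for this illustrative sketch; it is not the multibeep implementation.

    import numpy as np

    def prob_match_select(post_means, post_stds, rng=None):
        """Illustrative probability matching via one posterior sample per arm."""
        rng = rng or np.random.default_rng()
        samples = rng.normal(post_means, post_stds)  # assumed Gaussian posteriors
        return int(np.argmax(samples))

    # Arm 1 has the higher posterior mean, so it is selected most often,
    # but arm 0 still gets pulled with some probability.
    picks = [prob_match_select([0.4, 0.5], [0.1, 0.1]) for _ in range(1000)]
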
class multibeep.policies.random(base b, rng_class rng)

Bases: multibeep.policies.base

The random policy picks an arm uniformly at random among all active arms.

Parameters:
  • b (multibeep.bandits.bandit) – the bandit to be played
  • rng (multibeep.util.rng_class) – a valid random number generator
class multibeep.policies.successive_halving(base b, unsigned int min_num_pulls, float_t frac_arms, float_t factor_pulls=0)

Bases: multibeep.policies.base

Successive Halving pulls every active arm a fixed number of times per round, then deactivates a fraction of the arms and increases the per-arm pull budget for the next round.

Parameters:
  • b (multibeep.bandits.bandit) – the bandit to be played
  • min_num_pulls (unsigned int) – number of pulls for every arm in the first round
  • frac_arms (double) – after each round, only frac_arms * b.number_of_active_arms() arms remain active; all others are deactivated.
  • factor_pulls (double) – every round, the number of pulls per arm is factor_pulls times the value of the previous round. The default is 0, which means that 1/frac_arms is used; with that choice the total number of pulls per round stays constant.
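
To make the schedule implied by these parameters concrete, here is an illustrative sketch of the bookkeeping only (not the library code). With the default factor_pulls = 0, i.e. 1/frac_arms, the total number of pulls per round stays roughly constant while the per-arm budget grows.

    def successive_halving_schedule(n_arms, min_num_pulls, frac_arms, factor_pulls=0.0):
        """Illustrative round-by-round schedule: (active arms, pulls per arm)."""
        if factor_pulls == 0.0:
            factor_pulls = 1.0 / frac_arms  # default keeps total pulls per round constant
        schedule = []
        active, pulls = n_arms, float(min_num_pulls)
        while active > 1:
            schedule.append((active, int(round(pulls))))
            active = max(1, int(frac_arms * active))  # keep a fraction, deactivate the rest
            pulls *= factor_pulls                     # larger budget for the survivors
        return schedule

    # 16 arms, 2 pulls each, keep half per round: 32 pulls in every round.
    print(successive_halving_schedule(16, 2, 0.5))  # [(16, 2), (8, 4), (4, 8), (2, 16)]
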

TODO

Policies that need to be coded:

  1. UCB1, UCB1-NORMAL

    P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002

  2. lil-UCB

    Jamieson et al. lil’ UCB: An Optimal Exploration Algorithm for Multi-Armed Bandits. JMLR: Workshop and Conference Proceedings, vol. 35:1–17, 2014

  3. KL-UCB

    Aurélien Garivier, Olivier Cappé: The KL-UCB Algorithm for Bounded Stochastic Bandits and Beyond. COLT 2011: 359-376

  4. MOSS

    J-Y. Audibert and S. Bubeck: Minimax Policies for Adversarial and Stochastic Bandits. Proceedings of the 22nd Annual Conference on Learning Theory 2009

  5. Epsilon-Greedy

    http://cs.mcgill.ca/~vkules/bandits.pdf

  6. SoftMax/Boltzmann Exploration

    http://cs.mcgill.ca/~vkules/bandits.pdf

  7. Poker:

    Vermorel, Joannes, and Mehryar Mohri. “Multi-armed bandit algorithms and empirical evaluation.” European conference on machine learning. Springer Berlin Heidelberg, 2005.

  8. f-race

    Birattari, Mauro, et al. “F-Race and iterated F-Race: An overview.” Experimental methods for the analysis of optimization algorithms. Springer Berlin Heidelberg, 2010. 311-336.

  9. Entropy search with different update rules
    1. Proper predictive posterior to compute the expected change in H(pmax)
    2. Sampling from the reward history to approximate the expected change in H(pmax)
    3. Joel’s sampling from the posterior and pretending the mean is fixed (expected ‘information of one mean’)