The policies submodule

Contents:

class multibeep.policies.UCB_p(base b, rng_class rng, float_t p)

Bases: multibeep.policies.base

UCB_p picks the arm with the largest Upper Confidence Bound $\mu_i + p \cdot \sigma_i$.

Parameters:
  • b (multibeep.bandits.bandit) – the bandit to be played
  • rng (multibeep.util.rng_class) – a valid random number generator
  • p (double) – prefactor of the standard deviation in the equation above. Controls the exploration vs. exploitation trade-off.
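
The selection rule above can be written down in a few lines. The following is an illustrative NumPy sketch only, assuming per-arm mean and standard deviation estimates as plain arrays; it is not the multibeep implementation or API.

    import numpy as np

    def ucb_p_select(means, stds, p):
        """Illustrative UCB_p rule: pick the arm maximizing mu_i + p * sigma_i."""
        scores = np.asarray(means) + p * np.asarray(stds)
        return int(np.argmax(scores))

    # Example: with p = 2 the poorly explored arm 1 (large sigma) wins
    # despite its lower mean estimate.
    print(ucb_p_select(means=[0.4, 0.3], stds=[0.05, 0.20], p=2.0))  # -> 1
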
class multibeep.policies.base

Bases: object

play_n_rounds(self, unsigned int n)

Automatically pulls multiple times.

Unless specified otherwise, the number of rounds to play equals the number of pulls.

Parameters: n (unsigned int) – number of rounds to be played
select_next_arm(self)

The policy suggests the next arm to pull.

Returns: unsigned int – the current index of the arm to pull
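
For orientation, play_n_rounds automates the loop of repeatedly asking the policy for an arm and pulling it. The sketch below is illustrative only; the bandit's pull method shown here is a hypothetical name assumed for the example, not taken from this documentation.

    # Illustrative sketch: `policy` is any multibeep.policies object,
    # `bandit.pull(...)` is a hypothetical method assumed for this example.
    def play_manually(policy, bandit, n):
        for _ in range(n):
            arm_index = policy.select_next_arm()  # policy suggests an arm
            bandit.pull(arm_index)                # hypothetical: pull that arm

    # In spirit, policy.play_n_rounds(n) automates exactly this loop.
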
class multibeep.policies.prob_match(base b, rng_class rng)

Bases: multibeep.policies.base

Probability Match pulls an arm at random, with each arm's probability of being chosen proportional to the probability that it has the highest mean.

Parameters:
  • b (multibeep.bandits.bandit) – the bandit to be played
  • rng (multibeep.util.rng_class) – a valid random number generator
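
One common way to realize probability matching is Thompson-style posterior sampling: draw one sample per arm from a posterior over its mean and pull the arm with the largest sample, so that over many rounds each arm is chosen with probability proportional to it having the highest mean. The Gaussian posteriors below are an assumption made for this illustrative sketch; it is not the multibeep implementation.

    import numpy as np

    def prob_match_select(post_means, post_stds, rng=None):
        """Illustrative probability matching via one posterior sample per arm."""
        rng = rng or np.random.default_rng()
        samples = rng.normal(post_means, post_stds)  # assumed Gaussian posteriors
        return int(np.argmax(samples))

    # Arm 1 has the higher posterior mean, so it is selected most often,
    # but arm 0 still gets pulled with some probability.
    picks = [prob_match_select([0.4, 0.5], [0.1, 0.1]) for _ in range(1000)]
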
class multibeep.policies.random(base b, rng_class rng)

Bases: multibeep.policies.base

The random policy picks an arm uniformly at random among all active arms.

Parameters:
  • b (multibeep.bandits.bandit) – the bandit to be played
  • rng (multibeep.util.rng_class) – a valid random number generator
class multibeep.policies.successive_halving(base b, unsigned int min_num_pulls, float_t frac_arms, float_t factor_pulls=0)

Bases: multibeep.policies.base

Successive Halving pulls every active arm a fixed number of times per round, then deactivates a fraction of the arms and increases the per-arm pull budget for the next round.

Parameters:
  • b (multibeep.bandits.bandit) – the bandit to be played
  • min_num_pulls (unsigned int) – number of pulls for every arm in the first round
  • frac_arms (double) – after each round, only frac_arms * b.number_of_active_arms() arms remain active; all others are deactivated.
  • factor_pulls (double) – every round, the number of pulls per arm is factor_pulls times the value of the previous round. The default is 0, which means that 1/frac_arms is used; with that choice the total number of pulls per round stays constant.
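
To make the schedule implied by these parameters concrete, here is an illustrative sketch of the bookkeeping only (not the library code). With the default factor_pulls = 0, i.e. 1/frac_arms, the total number of pulls per round stays roughly constant while the per-arm budget grows.

    def successive_halving_schedule(n_arms, min_num_pulls, frac_arms, factor_pulls=0.0):
        """Illustrative round-by-round schedule: (active arms, pulls per arm)."""
        if factor_pulls == 0.0:
            factor_pulls = 1.0 / frac_arms  # default keeps total pulls per round constant
        schedule = []
        active, pulls = n_arms, float(min_num_pulls)
        while active > 1:
            schedule.append((active, int(round(pulls))))
            active = max(1, int(frac_arms * active))  # keep a fraction, deactivate the rest
            pulls *= factor_pulls                     # larger budget for the survivors
        return schedule

    # 16 arms, 2 pulls each, keep half per round: 32 pulls in every round.
    print(successive_halving_schedule(16, 2, 0.5))  # [(16, 2), (8, 4), (4, 8), (2, 16)]
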

TODO

Policies that need to be coded:

  1. UCB1, UCB1-NORMAL

    P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002

  2. lil-UCB

    Jamieson et al. lil’ UCB: An Optimal Exploration Algorithm for Multi-Armed Bandits. JMLR: Workshop and Conference Proceedings, vol. 35:1–17, 2014

  3. KL-UCB

    Aurélien Garivier, Olivier Cappé: The KL-UCB Algorithm for Bounded Stochastic Bandits and Beyond. COLT 2011: 359-376

  4. MOSS

    J-Y. Audibert and S. Bubeck: Minimax Policies for Adversarial and Stochastic Bandits. Proceedings of the 22nd Annual Conference on Learning Theory 2009

  5. Epsilon-Greedy

    http://cs.mcgill.ca/~vkules/bandits.pdf

  6. SoftMax/Boltzmann Exploration

    http://cs.mcgill.ca/~vkules/bandits.pdf

  7. Poker:

    Vermorel, Joannes, and Mehryar Mohri. “Multi-armed bandit algorithms and empirical evaluation.” European conference on machine learning. Springer Berlin Heidelberg, 2005.

  8. f-race

    Birattari, Mauro, et al. “F-Race and iterated F-Race: An overview.” Experimental methods for the analysis of optimization algorithms. Springer Berlin Heidelberg, 2010. 311-336.

  9. Entropy search with different update rules
    1. Proper predictive posterior to compute the expected change in H(pmax)
    2. Sampling from the reward history to approximate the expected change in H(pmax)
    3. Joel’s sampling from the posterior and pretending the mean is fixed (expected ‘information of one mean’)