The policies submodule
Contents:
- class multibeep.policies.UCB_p(base b, rng_class rng, float_t p)
  Bases: multibeep.policies.base
  UCB_p picks the arm with the largest Upper Confidence Bound $\mu_i + p \cdot \sigma_i$.
  Parameters:
  - b (multibeep.bandits.bandit) – the bandit to be played
  - rng (multibeep.util.rng_class) – a valid random number generator
  - p (double) – prefactor to the standard deviation in the equation above; controls the exploration vs. exploitation trade-off.
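The selection rule itself is easy to illustrate. The sketch below is illustrative only, not the library's implementation, and uses made-up empirical estimates for three arms:

    import numpy as np

    # Illustrative only: the UCB_p selection rule applied to made-up
    # empirical estimates for three arms (numbers are not from the library).
    mu = np.array([0.40, 0.55, 0.50])     # empirical means mu_i
    sigma = np.array([0.20, 0.10, 0.30])  # empirical standard deviations sigma_i
    p = 2.0                               # exploration prefactor

    scores = mu + p * sigma               # upper confidence bounds
    best = int(np.argmax(scores))         # arm with the largest bound
    print(best, scores)                   # -> 2 [0.8  0.75 1.1 ]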
- class multibeep.policies.base
  Bases: object
  - play_n_rounds(self, unsigned int n)
    Automatically pull multiple times. If not specified otherwise, the number of rounds to play equals the number of pulls.
    Parameters: n (unsigned int) – number of rounds to be played
  - select_next_arm(self)
    The policy suggests the next arm to pull.
    Returns: unsigned int – the current index of the arm to pull
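Every policy in this submodule derives from this class and exposes these two methods. A hypothetical usage sketch, using only the names documented in this section; the construction of the bandit and of the random number generator is assumed and not shown:

    import multibeep.policies

    # Hypothetical sketch: `bandit` is assumed to be an already constructed
    # multibeep.bandits.bandit and `rng` a multibeep.util.rng_class instance;
    # their setup is not covered by this submodule.
    policy = multibeep.policies.UCB_p(bandit, rng, 2.0)

    # Either let the policy play several rounds on its own ...
    policy.play_n_rounds(100)

    # ... or query it for the next arm and drive the bandit manually.
    next_arm = policy.select_next_arm()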
- class multibeep.policies.prob_match(base b, rng_class rng)
  Bases: multibeep.policies.base
  Probability Match pulls arms at random, where each arm's probability of being pulled is proportional to the probability that it has the highest mean.
  Parameters:
  - b (multibeep.bandits.bandit) – the bandit to be played
  - rng (multibeep.util.rng_class) – a valid random number generator
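The description above specifies the pull probabilities but not how they are obtained. One common way to realize probability matching, shown here as an illustration and not necessarily how multibeep computes it, is to sample each arm's mean from a posterior and count how often each arm comes out on top:

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical Gaussian posteriors over the arms' means (made-up numbers).
    post_mean = np.array([0.40, 0.55, 0.50])
    post_std = np.array([0.05, 0.04, 0.08])

    # Monte Carlo estimate of P(arm i has the highest mean).
    samples = rng.normal(post_mean, post_std, size=(10000, 3))
    p_best = np.bincount(samples.argmax(axis=1), minlength=3) / samples.shape[0]

    # Probability matching: pull an arm with exactly these probabilities.
    chosen = int(rng.choice(3, p=p_best))
    print(p_best, chosen)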
- class multibeep.policies.random(base b, rng_class rng)
  Bases: multibeep.policies.base
  The random policy picks an arm uniformly at random among all active arms.
  Parameters:
  - b (multibeep.bandits.bandit) – the bandit to be played
  - rng (multibeep.util.rng_class) – a valid random number generator
- class multibeep.policies.successive_halving(base b, unsigned int min_num_pulls, float_t frac_arms, float_t factor_pulls=0)
  Bases: multibeep.policies.base
  Successive Halving pulls every active arm a fixed number of times per round and then deactivates all but the best-performing fraction of the arms; a sketch of the resulting schedule follows this list.
  Parameters:
  - b (multibeep.bandits.bandit) – the bandit to be played
  - min_num_pulls (unsigned int) – number of pulls for every arm in the first round
  - frac_arms (double) – after each round, only frac_arms * b.number_of_active_arms() arms remain active; all others are deactivated.
  - factor_pulls (double) – every round, the number of pulls per arm is factor_pulls times the value of the previous round. The default is 0, which means 1/frac_arms is used; with that choice the total number of pulls per round stays constant.
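As referenced above, here is a sketch of the round schedule these parameters imply. It is illustrative only, merely reproducing the arithmetic described in the parameter list; the rounding of the number of surviving arms is a guess:

    import math

    def successive_halving_schedule(n_arms, min_num_pulls, frac_arms, factor_pulls=0.0):
        """Return a list of (active arms, pulls per arm) per round; illustrative only."""
        # The default factor_pulls=0 means: grow the pulls per arm by 1/frac_arms,
        # which keeps the total number of pulls per round constant.
        if factor_pulls == 0.0:
            factor_pulls = 1.0 / frac_arms

        pulls, rounds = float(min_num_pulls), []
        while n_arms > 1:
            rounds.append((n_arms, int(round(pulls))))
            n_arms = max(1, math.floor(frac_arms * n_arms))  # deactivate the rest
            pulls *= factor_pulls
        return rounds

    # e.g. 16 arms, 3 pulls each in the first round, keep half of the arms each round
    print(successive_halving_schedule(16, 3, 0.5))
    # -> [(16, 3), (8, 6), (4, 12), (2, 24)]; 48 pulls in every round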
TODO
Policies that need to be coded:
- UCB1, UCB1-NORMAL
P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002
- lil-UCB
Jamieson et al. lil’ UCB: An Optimal Exploration Algorithm for Multi-Armed Bandits. JMLR: Workshop and Conference Proceedings, vol 35:1–17, 2014
- KL-UCB
Aurélien Garivier, Olivier Cappé: The KL-UCB Algorithm for Bounded Stochastic Bandits and Beyond. COLT 2011: 359-376
- MOSS
J-Y. Audibert and S. Bubeck: Minimax Policies for Adversarial and Stochastic Bandits. Proceedings of the 22nd Annual Conference on Learning Theory 2009
- Epsilon-Greedy
- SoftMax/Boltzmann Exploration
- Poker:
Vermorel, Joannes, and Mehryar Mohri. “Multi-armed bandit algorithms and empirical evaluation.” European conference on machine learning. Springer Berlin Heidelberg, 2005.
- f-race
Birattari, Mauro, et al. “F-Race and iterated F-Race: An overview.” Experimental methods for the analysis of optimization algorithms. Springer Berlin Heidelberg, 2010. 311-336.
- Entropy search with different update rules
- proper predictive posterior to compute expected change in H(pmax)
- Sampling from the reward history to approximate expected change in H(pmax)
- Joel’s sampling from the posterior and pretending the mean is fixed (expected ‘information of one mean’)