On adaptive estimation for dynamic Bernoulli bandits
File(s)
AFF_bandits.pdf (3.72 MB)
Accepted version
Author(s)
Lu, Xue
Adams, Niall
Kantas, Nikolas
Type
Journal Article
Abstract
The multi-armed bandit (MAB) problem is a classic example of the
exploration-exploitation dilemma. It is concerned with maximising the total
rewards for a gambler by sequentially pulling an arm from a multi-armed slot
machine where each arm is associated with a reward distribution. In static
MABs, the reward distributions do not change over time, while in dynamic MABs,
each arm's reward distribution can change, and the optimal arm can switch over
time. Motivated by many real applications where rewards are binary, we focus on
dynamic Bernoulli bandits. Standard methods like $\epsilon$-Greedy and Upper
Confidence Bound (UCB), which rely on the sample mean estimator, often fail to
track changes in the underlying reward for dynamic problems. In this paper, we
overcome the shortcoming of slow response to change by deploying adaptive
estimation in the standard methods and propose a new family of algorithms,
which are adaptive versions of $\epsilon$-Greedy, UCB, and Thompson sampling.
These new methods are simple and easy to implement. Moreover, they do not
require any prior knowledge about the dynamic reward process, which is
important for real applications. We examine the new algorithms numerically in
different scenarios, and the results show that our algorithms deliver solid
improvements in dynamic environments.
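As a rough illustration of the idea described in the abstract (replacing the sample-mean estimator in a standard bandit algorithm with a forgetting estimator that can track a drifting reward), here is a minimal Python sketch of epsilon-Greedy for Bernoulli arms with a fixed forgetting factor. The fixed factor lam, the function name forgetting_epsilon_greedy, and the two-arm usage block are illustrative assumptions; the paper's AFF methods adapt the forgetting factor over time rather than fixing it.

import random

# Illustrative sketch only: epsilon-Greedy for Bernoulli bandits with an
# exponentially weighted (fixed forgetting-factor) reward estimate in place
# of the sample mean. The paper's AFF methods tune the forgetting factor
# adaptively; the fixed lam below is a simplifying assumption.
def forgetting_epsilon_greedy(arms, horizon, epsilon=0.1, lam=0.95):
    """arms: list of callables, each returning a 0/1 reward when pulled."""
    k = len(arms)
    weighted_sum = [0.0] * k   # discounted sum of rewards per arm
    weighted_cnt = [0.0] * k   # discounted number of pulls per arm
    total_reward = 0
    for t in range(horizon):
        # Pull each arm once first; then explore with probability epsilon,
        # otherwise exploit the current discounted estimates.
        if t < k or random.random() < epsilon:
            a = t % k if t < k else random.randrange(k)
        else:
            estimates = [weighted_sum[i] / weighted_cnt[i] for i in range(k)]
            a = max(range(k), key=lambda i: estimates[i])
        r = arms[a]()
        # Discount past observations so the estimate can track a changing mean.
        for i in range(k):
            weighted_sum[i] *= lam
            weighted_cnt[i] *= lam
        weighted_sum[a] += r
        weighted_cnt[a] += 1.0
        total_reward += r
    return total_reward

# Hypothetical usage: two Bernoulli arms with fixed success probabilities here,
# which could be made time-varying to mimic a dynamic bandit.
if __name__ == "__main__":
    arms = [lambda: int(random.random() < 0.3),
            lambda: int(random.random() < 0.6)]
    print(forgetting_epsilon_greedy(arms, horizon=10000))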
Date Issued
2019-06-01
Date Acceptance
2019-05-22
Citation
Foundations of Data Science, 2019, 1 (2), pp.197-225
ISSN
2639-8001
Publisher
American Institute of Mathematical Sciences
Start Page
197
End Page
225
Journal / Book Title
Foundations of Data Science
Volume
1
Issue
2
Copyright Statement
© 2019 American Institute of Mathematical Sciences
Identifier
http://arxiv.org/abs/1712.03134v2
Subjects
stat.ML
cs.LG
Notes
Added another AFF version of the standard UCB algorithm; modified the AFF-TS algorithm; in the numerical studies, added comparisons to SW-UCB and D-UCB
Publication Status
Published