The Exploration-Exploitation Trade-Off in Sequential Decision Making Problems

Sykulski, Adam M.

File	Description	Size	Format

Sykulski-A-2011-PhD-Thesis.PDF		1.34 MB	Adobe PDF

Title:	The Exploration-Exploitation Trade-Off in Sequential Decision Making Problems
Authors:	Sykulski, Adam M.
Item Type:	Thesis or dissertation
Abstract:	Sequential decision making problems require an agent to repeatedly choose between a series of actions. Common to such problems is the exploration-exploitation trade-off, where an agent must choose between the action expected to yield the best reward (exploitation) or trying an alternative action for potential future benefit (exploration). The main focus of this thesis is to understand in more detail the role this trade-off plays in various important sequential decision making problems, in terms of maximising finite-time reward. The most common and best studied abstraction of the exploration-exploitation trade-off is the classic multi-armed bandit problem. In this thesis we study several important extensions that are more suitable than the classic problem to real-world applications. These extensions include scenarios where the rewards for actions change over time or the presence of other agents must be repeatedly considered. In these contexts, the exploration-exploitation trade-off has a more complicated role in terms of maximising finite-time performance. For example, the amount of exploration required will constantly change in a dynamic decision problem, in multiagent problems agents can explore by communication, and in repeated games, the exploration-exploitation trade-off must be jointly considered with game theoretic reasoning. Existing techniques for balancing exploration-exploitation are focused on achieving desirable asymptotic behaviour and are in general only applicable to basic decision problems. The most flexible state-of-the-art approaches, ε-greedy and ε-first, require exploration parameters to be set a priori, the optimal values of which are highly dependent on the problem faced. To overcome this, we construct a novel algorithm, ε-ADAPT, which has no exploration parameters and can adapt exploration on-line for a wide range of problems.
ε-ADAPT is built on newly proven theoretical properties of the ε-first policy and we demonstrate that ε-ADAPT can accurately learn not only how much to explore, but also when and which actions to explore.
Issue Date:	Apr-2011
Date Awarded:	Nov-2011
URI:	http://hdl.handle.net/10044/1/9073
DOI:	https://doi.org/10.25560/9073
Supervisor:	Adams, Niall; Jennings, Nick
Sponsor/Funder:	BAE Systems and EPSRC
Author:	Sykulski, Adam M.
Funder's Grant Number:	ALADDIN project
Department:	Mathematics
Publisher:	Imperial College London
Qualification Level:	Doctoral
Qualification Name:	Doctor of Philosophy (PhD)
Appears in Collections:	Mathematics PhD theses
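
Illustrative note: the abstract names the ε-greedy and ε-first policies without defining them. The sketch below shows their common textbook forms on a Bernoulli multi-armed bandit, to make the role of the a priori exploration parameter ε concrete. It is not code from the thesis. The function names (`run_bandit`, `epsilon_greedy`, `epsilon_first`), the Bernoulli reward model, and the tie-breaking rule are assumptions made for illustration. ε-ADAPT is not sketched because the abstract does not specify it.

```python
import random


def greedy_arm(counts, totals):
    """Return the arm with the highest empirical mean (untried arms count as 0)."""
    means = [totals[a] / counts[a] if counts[a] else 0.0 for a in range(len(counts))]
    # max() returns the lowest-index arm on ties (an arbitrary illustrative choice).
    return max(range(len(means)), key=lambda a: means[a])


def epsilon_greedy(epsilon):
    """With probability epsilon explore uniformly at random, otherwise exploit."""
    def choose(t, counts, totals, rng):
        if rng.random() < epsilon:
            return rng.randrange(len(counts))
        return greedy_arm(counts, totals)
    return choose


def epsilon_first(epsilon, horizon):
    """Explore uniformly for the first epsilon * horizon steps, then always exploit."""
    explore_steps = int(epsilon * horizon)
    def choose(t, counts, totals, rng):
        if t < explore_steps:
            return rng.randrange(len(counts))
        return greedy_arm(counts, totals)
    return choose


def run_bandit(arm_means, horizon, choose, seed=0):
    """Play a Bernoulli bandit for `horizon` steps and return the total reward."""
    rng = random.Random(seed)
    counts = [0] * len(arm_means)    # number of pulls per arm
    totals = [0.0] * len(arm_means)  # summed reward per arm
    total_reward = 0.0
    for t in range(horizon):
        arm = choose(t, counts, totals, rng)
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        totals[arm] += reward
        total_reward += reward
    return total_reward


if __name__ == "__main__":
    horizon = 1000
    arm_means = [0.2, 0.5, 0.8]  # hypothetical success probabilities
    for eps in (0.01, 0.1, 0.5):
        r_greedy = run_bandit(arm_means, horizon, epsilon_greedy(eps))
        r_first = run_bandit(arm_means, horizon, epsilon_first(eps, horizon))
        print(f"epsilon={eps}: greedy={r_greedy:.0f}, first={r_first:.0f}")
```

Both policies take ε as a fixed input, and the finite-time reward generally depends on that choice and on the horizon, which is the dependence the abstract says ε-ADAPT is designed to remove.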

Unless otherwise indicated, items in Spiral are protected by copyright and are licensed under a Creative Commons Attribution NonCommercial NoDerivatives License.