Learning and exploiting reward machines for reinforcement learning
File(s)
FurelosBlanco-D-2023-PhD-Thesis.pdf (4.04 MB)
Thesis
Author(s)
Furelos Blanco, Daniel
Type
Thesis or dissertation
Abstract
Reinforcement learning (RL) with non-Markovian rewards requires agents to learn history-dependent policies, which is particularly challenging in long-horizon or sparse-reward settings. Reward machines (RMs) are finite-state machines that represent non-Markovian reward functions in terms of high-level events. By compactly encoding histories of high-level events, RMs constitute an external memory that makes rewards Markovian, enabling standard RL algorithms to be applied in non-Markovian reward settings. The structure exposed by RMs also facilitates task decomposition, making policy learning more efficient when rewards are sparse. Nevertheless, the potential of RMs is limited by the complexity of handcrafting them and by the lack of reusability of RMs within larger RMs. In this thesis, we address these problems.
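
To make the role of an RM concrete, the following minimal Python sketch encodes a hypothetical reward machine for a toy "fetch coffee, then deliver it to the office" task; the states, events, and reward values are invented for illustration and are not taken from the thesis.

# Minimal sketch of a reward machine (hypothetical example, not from the thesis).
# States and high-level events are illustrative: u0 = "need coffee",
# u1 = "have coffee", u_acc = "coffee delivered".

class RewardMachine:
    def __init__(self, initial_state, transitions):
        # transitions: (rm_state, event) -> (next_rm_state, reward)
        self.initial_state = initial_state
        self.transitions = transitions

    def step(self, rm_state, event):
        # Events with no matching edge self-loop and yield zero reward.
        return self.transitions.get((rm_state, event), (rm_state, 0.0))

coffee_rm = RewardMachine(
    initial_state="u0",
    transitions={
        ("u0", "coffee"): ("u1", 0.0),     # picked up the coffee, no reward yet
        ("u1", "office"): ("u_acc", 1.0),  # delivered it: task complete
    },
)

# The agent conditions its policy on the pair (environment observation, rm_state),
# so the reward emitted by the RM is Markovian in this augmented state.
rm_state = coffee_rm.initial_state
for event in ["office", "coffee", "office"]:  # an example trace of high-level events
    rm_state, reward = coffee_rm.step(rm_state, event)
    print(event, "->", rm_state, reward)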

In the first part of the thesis, we devise a method for learning minimal RMs from traces of high-level events observed by an RL agent. The learning is powered by an inductive logic programming system and is triggered whenever the current RM does not correctly recognize an observed trace. To make RM learning more efficient, we introduce a symmetry-breaking mechanism that shrinks the search space whilst preserving completeness. We empirically demonstrate that exploiting a learned RM leads to performance on par with exploiting a handcrafted one.
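
The sketch below illustrates, in hypothetical Python terms and assuming the RewardMachine interface sketched above, the trigger for re-learning: the agent checks whether the current RM reproduces the rewards observed along a new trace and re-learns a minimal RM from all stored traces when it does not. The helper learn_minimal_rm is a placeholder standing in for the inductive logic programming call; it is not an API from the thesis or from any particular ILP system.

# Hypothetical sketch of the "re-learn on counterexample" loop.

def rm_is_consistent(rm, trace, observed_rewards):
    """Check that the current RM reproduces the rewards observed along a trace."""
    state = rm.initial_state
    for event, observed in zip(trace, observed_rewards):
        state, predicted = rm.step(state, event)
        if predicted != observed:
            return False
    return True

def learn_minimal_rm(traces):
    """Placeholder for inducing a minimal RM covering all traces (e.g. via ILP)."""
    raise NotImplementedError

def on_episode_end(rm, traces, new_trace, new_rewards):
    """Store the new trace; re-learn the RM only if it misclassifies the trace."""
    traces.append((new_trace, new_rewards))
    if not rm_is_consistent(rm, new_trace, new_rewards):
        rm = learn_minimal_rm(traces)  # learning is launched on a counterexample
    return rm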

In the second part of the thesis, we build hierarchies of RMs (HRMs) by endowing RMs with the ability to call each other, enabling the reusability of the RMs' structures and policies. In particular, the HRMs are exploited by treating each call as an independently solvable subtask, and learned through a curriculum-based method extending our RM learning approach. Our experiments reveal that (i) exploiting a handcrafted HRM leads to faster convergence than with a flat HRM, and (ii) learning an HRM is feasible in cases where its equivalent flat representation is not.
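
As a rough illustration of RMs calling each other, the toy Python sketch below runs a two-level hierarchy in which a root machine delegates a "make coffee" subtask to a callee machine and resumes once the callee accepts; all machine, state, and event names are invented, and the encoding greatly simplifies the formalism developed in the thesis.

# Toy sketch of a two-level hierarchy of reward machines (HRM); names and
# encoding are illustrative only. A call edge is taken by running the callee
# RM until it accepts, mirroring the idea that each call is an independently
# solvable subtask whose policy can be reused wherever the subtask appears.

MACHINES = {
    "make_coffee": {                      # reusable subtask: get beans, then brew
        "initial": "m0", "accepting": "m_acc",
        "edges": {("m0", "beans"): "m1", ("m1", "brew"): "m_acc"},
    },
    "root": {                             # root task: make coffee, then deliver
        "initial": "r0", "accepting": "r_acc",
        "edges": {("r0", "make_coffee"): "r1", ("r1", "office"): "r_acc"},
    },
}

def run(name, trace):
    """Consume events until the named RM accepts; recurse on call edges.
    Returns the unconsumed suffix of the trace, or None if the RM never accepts."""
    rm = MACHINES[name]
    state = rm["initial"]
    while state != rm["accepting"]:
        for (src, label), dst in rm["edges"].items():
            if src != state:
                continue
            if label in MACHINES:                      # call edge: solve the subtask
                rest = run(label, trace)
                if rest is not None:
                    state, trace = dst, rest
                    break
            elif trace and trace[0] == label:          # primitive high-level event
                state, trace = dst, trace[1:]
                break
        else:
            if not trace:
                return None                            # ran out of events: no acceptance
            trace = trace[1:]                          # skip an event with no matching edge
    return trace

print(run("root", ["beans", "brew", "office"]) == [])  # True: subtask, then delivery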
Version
Open Access
Date Issued
2023-10
Date Awarded
2024-02
URI
http://hdl.handle.net/10044/1/110047
DOI
https://doi.org/10.25560/110047
Copyright Statement
Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Licence (CC BY-NC-SA 4.0)
License URL
https://creativecommons.org/licenses/by-nc-sa/4.0/
Advisor
Russo, Alessandra
Broda, Krysia
Law, Mark
Jonsson, Anders
Publisher Department
Computing
Publisher Institution
Imperial College London
Qualification Level
Doctoral
Qualification Name
Doctor of Philosophy (PhD)