Optimal convergence rate for exact policy mirror descent in discounted Markov decision processes
File(s) convergence_pmd.pdf (424.32 KB)
Published version
Author(s)
Johnson, Emmeran
Pike-Burke, Ciara
Rebeschini, Patrick
Type
Conference Paper
Abstract
Policy Mirror Descent (PMD) is a general family of algorithms that covers a wide range of novel and fundamental methods in reinforcement learning. Motivated by the instability of policy iteration (PI) with inexact policy evaluation, PMD algorithmically regularises the policy improvement step of PI. With exact policy evaluation, PI is known to converge linearly with a rate given by the discount factor γ of a Markov Decision Process. In this work, we bridge the gap between PI and PMD with exact policy evaluation and show that the dimension-free γ-rate of PI can be achieved by the general family of unregularised PMD algorithms under an adaptive step-size. We show that both the rate and step-size are unimprovable for PMD: we provide matching lower bounds that demonstrate that the γ-rate is optimal for PMD methods as well as PI, and that the adaptive step-size is necessary for PMD to achieve it. Our work is the first to relate PMD to rate-optimality and step-size necessity. Our study of the convergence of PMD avoids the use of the performance difference lemma, which leads to a direct analysis of independent interest. We also extend the analysis to the inexact setting and establish the first dimension-optimal sample complexity for unregularised PMD under a generative model, improving upon the best-known result.
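For readers unfamiliar with the method the abstract studies, the following is a minimal, illustrative sketch of tabular PMD with exact policy evaluation under the negative-entropy mirror map (where the update reduces to a multiplicative/softmax step). The toy MDP, the constant step-size `eta`, and all variable names are assumptions for illustration only; in particular, the paper's adaptive step-size rule is not reproduced here.

```python
import numpy as np

# Hypothetical toy MDP: 2 states, 2 actions (an assumption, not from the paper).
np.random.seed(0)
nS, nA, gamma = 2, 2, 0.9
P = np.random.dirichlet(np.ones(nS), size=(nS, nA))  # P[s, a, s']
R = np.random.rand(nS, nA)                           # R[s, a]

def q_values(pi):
    """Exact policy evaluation: solve V = r_pi + gamma * P_pi V, return Q[s, a]."""
    r_pi = (pi * R).sum(axis=1)
    P_pi = np.einsum('sa,sat->st', pi, P)
    V = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)
    return R + gamma * P @ V

pi = np.full((nS, nA), 1.0 / nA)  # uniform initial policy
eta = 10.0  # illustrative constant step-size, NOT the paper's adaptive rule
for _ in range(200):
    Q = q_values(pi)
    # Negative-entropy mirror map => multiplicative (softmax) policy update.
    logits = np.log(np.clip(pi, 1e-300, None)) + eta * Q
    pi = np.exp(logits - logits.max(axis=1, keepdims=True))
    pi /= pi.sum(axis=1, keepdims=True)

# The iterates should approach the greedy policy that PI converges to.
greedy = np.argmax(q_values(pi), axis=1)
```

This is only a sketch of the algorithmic family the paper analyses; the paper's results concern the convergence rate and step-size schedule, not this particular implementation.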
Date Issued
2023
Date Acceptance
2023-09-21
Citation
Advances in Neural Information Processing Systems 36 (NeurIPS 2023), 2023, 36, pp.76496-76524
Publisher
Curran Associates, Inc.
Start Page
76496
End Page
76524
Journal / Book Title
Advances in Neural Information Processing Systems 36 (NeurIPS 2023)
Volume
36
Copyright Statement
Copyright © 2023 The Author(s).
Identifier
https://proceedings.neurips.cc/paper_files/paper/2023/hash/f0d7b528c31bc3f9a0d5bab515ed6ed5-Abstract-Conference.html
Source
Neural Information Processing Systems (NeurIPS 2023)
Publication Status
Published
Start Date
2023-12-11
Finish Date
2023-12-16
Coverage Spatial
New Orleans, LA, USA
Date Publish Online
2023