MSRL: Distributed reinforcement learning with dataflow fragments
File(s)
atc23-zhu-huanzhou.pdf (1.8 MB)
Published version
Author(s)
Type
Conference Paper
Abstract
A wide range of reinforcement learning (RL) algorithms have been proposed, in which agents learn from interactions with a simulated environment. Executing such RL training loops is computationally expensive, but current RL systems fail to support the training loops of different RL algorithms efficiently on GPU clusters: they either hard-code algorithm-specific strategies for parallelization and distribution; or they accelerate only parts of the computation on GPUs (e.g., DNN policy updates). We observe that current systems lack an abstraction that decouples the definition of an RL algorithm from its strategy for distributed execution.
We describe MSRL, a distributed RL training system that uses the new abstraction of a fragmented dataflow graph (FDG) to execute RL algorithms in a flexible way. An FDG is a heterogeneous dataflow representation of an RL algorithm, which maps functions from the RL training loop to independent parallel dataflow fragments. Fragments account for the diverse nature of RL algorithms: each fragment can execute on a different device using its own low-level dataflow implementation, e.g., an operator graph of a DNN engine, a CUDA GPU kernel, or a multi-threaded CPU process. At deployment time, a distribution policy governs how fragments are mapped to devices, without changes to the algorithm implementation. Our experiments show that MSRL exposes trade-offs between different execution strategies, while surpassing the performance of existing RL systems.
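To make the FDG abstraction concrete, the following is a minimal, hypothetical Python sketch of the idea the abstract describes: fragments wrap functions of the RL training loop, each with its own low-level backend, and a deployment-time distribution policy maps fragments to devices without touching the algorithm. All names here (Fragment, FDG, deploy) are illustrative assumptions, not MSRL's actual API.

# Minimal, hypothetical sketch of a fragmented dataflow graph (FDG).
# Names and structure are illustrative assumptions, not MSRL's API.
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class Fragment:
    """One function of the RL training loop, independently executable."""
    name: str
    fn: Callable            # e.g., environment stepping or policy update
    backend: str            # e.g., "dnn-graph", "cuda-kernel", "cpu-threads"

@dataclass
class FDG:
    """Fragments plus the data dependencies between them."""
    fragments: List[Fragment]
    edges: List[Tuple[str, str]]   # (producer, consumer) fragment names

def deploy(fdg: FDG, policy: Dict[str, str]) -> None:
    """Apply a distribution policy: map each fragment to a device at
    deployment time, without changing the algorithm implementation."""
    for frag in fdg.fragments:
        device = policy.get(frag.name, "cpu:0")
        print(f"placing {frag.name} ({frag.backend}) on {device}")

# Example: an actor/replay/learner loop split into three fragments.
fdg = FDG(
    fragments=[
        Fragment("act", lambda obs: None, "cpu-threads"),
        Fragment("replay", lambda batch: None, "cpu-threads"),
        Fragment("update", lambda batch: None, "dnn-graph"),
    ],
    edges=[("act", "replay"), ("replay", "update")],
)

# Swapping the policy changes placement only, not the algorithm.
deploy(fdg, {"update": "gpu:0"})                  # learner on one GPU
deploy(fdg, {"act": "gpu:1", "update": "gpu:0"})  # actors on a second GPU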
Date Issued
2023-07-10
Date Acceptance
2023-07-01
Citation
Proceedings of the 2023 USENIX annual technical conference, USENIX ATC 2023, 2023, pp.977-993
ISBN
978-1-939133-35-9
Publisher
USENIX Assoc
Start Page
977
End Page
993
Journal / Book Title
Proceedings of the 2023 USENIX annual technical conference, USENIX ATC 2023
Copyright Statement
© USENIX Association 2023. Open access to the Proceedings of the 2023 USENIX Annual Technical Conference is sponsored by King Abdullah University of Science and Technology.
Identifier
https://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=PARTNER_APP&SrcAuth=LinksAMR&KeyUT=WOS:001066454400062&DestLinkType=FullRecord&DestApp=ALL_WOS&UsrCustomerID=a2bf6146997ec60c407a63945d4e92bb
Source
USENIX Annual Technical Conference (USENIX ATC)
Subjects
Computer Science
Computer Science, Hardware & Architecture
Computer Science, Information Systems
Computer Science, Software Engineering
Computer Science, Theory & Methods
Science & Technology
Technology
Publication Status
Published
Start Date
2023-07-10
Finish Date
2023-07-12
Coverage Spatial
Boston, MA, USA