KungFu: Making Training in Distributed Machine Learning Adaptive

File: osdi20-mai.pdf
Description: Published version
Size: 5.12 MB
Format: Adobe PDF
Title: KungFu: Making Training in Distributed Machine Learning Adaptive
Authors: Mai, L
Li, G
Wagenlander, M
Fertakis, K
Brabete, A-O
Pietzuch, P
Item Type: Conference Paper
Abstract: When using distributed machine learning (ML) systems to train models on a cluster of worker machines, users must configure a large number of parameters: hyper-parameters (e.g. the batch size and the learning rate) affect model convergence; system parameters (e.g. the number of workers and their communication topology) impact training performance. In current systems, adapting such parameters during training is ill-supported. Users must set system parameters at deployment time, and provide fixed adaptation schedules for hyper-parameters in the training program. We describe KungFu, a distributed ML library for TensorFlow that is designed to enable adaptive training. KungFu allows users to express high-level Adaptation Policies (APs) that describe how to change hyper- and system parameters during training. APs take real-time monitored metrics (e.g. signal-to-noise ratios and noise scale) as input and trigger control actions (e.g. cluster rescaling or synchronisation strategy updates). For execution, APs are translated into monitoring and control operators, which are embedded in the dataflow graph. APs exploit an efficient asynchronous collective communication layer, which ensures concurrency and consistency of monitoring and adaptation operations.
Issue Date: 4-Nov-2020
Date of Acceptance: 31-Aug-2020
URI: http://hdl.handle.net/10044/1/85597
ISBN: 9781939133199
Publisher: USENIX
Journal / Book Title: Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation
Copyright Statement: © 2020 The Author(s).
Sponsor/Funder: Huawei Technologies Co. Ltd
Funder's Grant Number: YBN2017100016
Conference Name: USENIX Symposium on Operating Systems Design and Implementation (OSDI)
Publication Status: Published
Start Date: 2020-11-04
Finish Date: 2020-11-06
Conference Place: Virtual
Online Publication Date: 2020-11-04
Appears in Collections: Computing