Ako: Decentralised Deep Learning with Partial Gradient Exchange

File: ako-socc16.pdf
Description: Accepted version
Size: 669.72 kB
Format: Adobe PDF
Title: Ako: Decentralised Deep Learning with Partial Gradient Exchange
Authors: Pietzuch, P; Watcharapichat, P; Lopez Morales, V; Castro Fernandez, R
Item Type: Conference Paper
Abstract: Distributed systems for the training of deep neural networks (DNNs) with large amounts of data have vastly improved the accuracy of machine learning models for image and speech recognition. DNN systems scale to large cluster deployments by having worker nodes train many model replicas in parallel; to ensure model convergence, parameter servers periodically synchronise the replicas. This raises the challenge of how to split resources between workers and parameter servers so that the cluster CPU and network resources are fully utilised without introducing bottlenecks. In practice, this requires manual tuning for each model configuration or hardware type. We describe Ako, a decentralised dataflow-based DNN system without parameter servers that is designed to saturate cluster resources. All nodes execute workers that fully use the CPU resources to update model replicas. To synchronise replicas as often as possible subject to the available network bandwidth, workers exchange partitioned gradient updates directly with each other. The number of partitions is chosen so that the used network bandwidth remains constant, independently of cluster size. Since workers eventually receive all gradient partitions after several rounds, convergence is unaffected. For the ImageNet benchmark on a 64-node cluster, Ako does not require any resource allocation decisions, yet converges faster than deployments with parameter servers.
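Illustrative sketch (not taken from the paper): the abstract states that workers exchange gradient partitions directly, that the partition count keeps the used bandwidth constant regardless of cluster size, and that every worker eventually receives all partitions. The toy Python model below illustrates those three statements only. The round-robin schedule in `partition_for`, the choice of one partition per peer per round, and all function names are assumptions made for illustration; Ako's actual scheduling and partition sizing may differ.

```python
from typing import Dict, Set, Tuple


def partition_for(sender: int, receiver: int, round_no: int, num_partitions: int) -> int:
    """Hypothetical round-robin schedule: the partition index that
    `sender` ships to `receiver` in round `round_no`."""
    return (sender + receiver + round_no) % num_partitions


def simulate_coverage(
    num_workers: int, num_partitions: int, rounds: int
) -> Dict[Tuple[int, int], Set[int]]:
    """For every (sender, receiver) pair, record which partitions of the
    sender's gradient the receiver has seen after `rounds` rounds."""
    seen: Dict[Tuple[int, int], Set[int]] = {
        (s, r): set()
        for s in range(num_workers)
        for r in range(num_workers)
        if s != r
    }
    for t in range(rounds):
        for (s, r), parts in seen.items():
            parts.add(partition_for(s, r, t, num_partitions))
    return seen


def per_round_traffic(num_workers: int, gradient_size: int, num_partitions: int) -> float:
    """Values sent by one worker per round, assuming it sends exactly one
    partition (gradient_size / num_partitions values) to each other worker."""
    return (num_workers - 1) * gradient_size / num_partitions
```

Under these assumptions, every receiver has seen all of a sender's partitions after `num_partitions` rounds, and setting `num_partitions` to `num_workers - 1` keeps the per-worker traffic per round equal to one full gradient whatever the cluster size.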
Issue Date: 5-Oct-2016
Date of Acceptance: 12-Jul-2016
URI: http://hdl.handle.net/10044/1/39019
DOI: https://dx.doi.org/10.1145/2987550.2987586
ISBN: 978-1-4503-4525-5
Publisher: ACM
Start Page: 84
End Page: 97
Journal / Book Title: SoCC '16 Proceedings of the Seventh ACM Symposium on Cloud Computing
Copyright Statement: © ACM, 2016. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in Proceedings of the 7th ACM Symposium on Cloud Computing, SoCC 2016, 5 Oct 2016 http://doi.acm.org/10.1145/2987550.2987586
Conference Name: ACM Symposium on Cloud Computing 2016 (SoCC)
Publication Status: Published
Start Date: 2016-10-05
Finish Date: 2016-10-07
Conference Place: Santa Clara, CA, USA
Appears in Collections: Faculty of Engineering