Crossbow: scaling deep learning with small batch sizes on multi-GPU servers
File | Description | Size | Format
---|---|---|---
1901.02244.pdf | Accepted version | 1.05 MB | Adobe PDF
Title: | Crossbow: scaling deep learning with small batch sizes on multi-GPU servers |
Authors: | Koliousis, A Watcharapichat, P Weidlich, M Mai, L Costa, P Pietzuch, P |
Item Type: | Journal Article |
Abstract: | Deep learning models are trained on servers with many GPUs, and training must scale with the number of GPUs. Systems such as TensorFlow and Caffe2 train models with parallel synchronous stochastic gradient descent: they process a batch of training data at a time, partitioned across GPUs, and average the resulting partial gradients to obtain an updated global model. To fully utilise all GPUs, systems must increase the batch size, which hinders statistical efficiency. Users tune hyper-parameters such as the learning rate to compensate for this, which is complex and model-specific. We describe CROSSBOW, a new single-server multi-GPU system for training deep learning models that enables users to freely choose their preferred batch size, however small, while scaling to multiple GPUs. CROSSBOW uses many parallel model replicas and avoids reduced statistical efficiency through a new synchronous training method. We introduce SMA, a synchronous variant of model averaging in which replicas independently explore the solution space with gradient descent, but adjust their search synchronously based on the trajectory of a globally-consistent average model. CROSSBOW achieves high hardware efficiency with small batch sizes by potentially training multiple model replicas per GPU, automatically tuning the number of replicas to maximise throughput. Our experiments show that CROSSBOW improves the training time of deep learning models on an 8-GPU server by 1.3–4× compared to TensorFlow. |
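The SMA scheme described in the abstract can be illustrated with a minimal NumPy sketch on a toy quadratic loss. This is an assumption-laden simplification, not the authors' implementation: the replica count, learning rate, and `correction` coefficient are hypothetical, and the real system applies this across GPUs with batched gradients.

```python
import numpy as np

# Toy quadratic loss: L(w) = 0.5 * ||w - TARGET||^2, so grad(w) = w - TARGET.
TARGET = np.array([3.0, -1.0])

def grad(w):
    return w - TARGET

def sma_train(num_replicas=4, steps=300, lr=0.1, correction=0.1, seed=0):
    """Sketch of synchronous model averaging (SMA), as described in the
    abstract: replicas independently explore with gradient descent, but
    are adjusted synchronously against a globally-consistent average model.
    Hyper-parameter names and values here are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    replicas = [rng.normal(size=2) for _ in range(num_replicas)]
    center = np.mean(replicas, axis=0)  # the average ("central") model
    for _ in range(steps):
        # Independent exploration: one SGD step per replica.
        replicas = [w - lr * grad(w) for w in replicas]
        avg = np.mean(replicas, axis=0)
        # Synchronous adjustment: pull each replica toward the average
        # model, and move the average model along the replicas' trajectory.
        replicas = [w - correction * (w - center) for w in replicas]
        center = center + correction * (avg - center)
    return center

print(sma_train())  # converges near TARGET on this toy loss
```

The point of the structure is that each replica can use a small per-replica batch (here, a full gradient stands in for it) while the consensus model keeps the replicas statistically coordinated.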
Issue Date: | Jul-2019 |
Date of Acceptance: | 15-Jun-2019 |
URI: | http://hdl.handle.net/10044/1/75907 |
DOI: | 10.14778/3342263.3342276 |
ISSN: | 2150-8097 |
Publisher: | VLDB Endowment |
Journal / Book Title: | Proceedings of the VLDB Endowment |
Volume: | 12 |
Issue: | 11 |
Copyright Statement: | © 2019 The Author(s) |
Sponsor/Funder: | Huawei Technologies Co. Ltd |
Funder's Grant Number: | YBN2017100016 |
Keywords: | Science & Technology; Technology; Computer Science, Information Systems; Computer Science, Theory & Methods; Computer Science; Optimization; cs.DC; cs.LG |
Publication Status: | Published |
Conference Place: | Los Angeles, CA, USA |
Online Publication Date: | 2019-07-01 |
Appears in Collections: | Computing |