Crossbow: scaling deep learning with small batch sizes on multi-GPU servers
File | Description | Size | Format
---|---|---|---
1901.02244.pdf | Accepted version | 1.05 MB | Adobe PDF
Title: | Crossbow: scaling deep learning with small batch sizes on multi-GPU servers |
Authors: | Koliousis, A Watcharapichat, P Weidlich, M Mai, L Costa, P Pietzuch, P |
Item Type: | Journal Article |
Abstract: | Deep learning models are trained on servers with many GPUs, and training must scale with the number of GPUs. Systems such as TensorFlow and Caffe2 train models with parallel synchronous stochastic gradient descent: they process a batch of training data at a time, partitioned across GPUs, and average the resulting partial gradients to obtain an updated global model. To fully utilise all GPUs, systems must increase the batch size, which hinders statistical efficiency. Users tune hyper-parameters such as the learning rate to compensate for this, which is complex and model-specific. We describe CROSSBOW, a new single-server multi-GPU system for training deep learning models that enables users to freely choose their preferred batch size, however small, while scaling to multiple GPUs. CROSSBOW uses many parallel model replicas and avoids reduced statistical efficiency through a new synchronous training method. We introduce SMA, a synchronous variant of model averaging in which replicas independently explore the solution space with gradient descent, but adjust their search synchronously based on the trajectory of a globally-consistent average model. CROSSBOW achieves high hardware efficiency with small batch sizes by potentially training multiple model replicas per GPU, automatically tuning the number of replicas to maximise throughput. Our experiments show that CROSSBOW improves the training time of deep learning models on an 8-GPU server by 1.3–4× compared to TensorFlow. |
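The SMA scheme described in the abstract can be illustrated with a minimal NumPy sketch on a toy quadratic loss. This is an assumption-laden simplification, not the authors' implementation: the replica count, learning rate, and `correction` coefficient are hypothetical, and the real system applies this across GPUs with batched gradients.

```python
import numpy as np

# Toy quadratic loss: L(w) = 0.5 * ||w - TARGET||^2, so grad(w) = w - TARGET.
TARGET = np.array([3.0, -1.0])

def grad(w):
    return w - TARGET

def sma_train(num_replicas=4, steps=300, lr=0.1, correction=0.1, seed=0):
    """Sketch of synchronous model averaging (SMA), as described in the
    abstract: replicas independently explore with gradient descent, but
    are adjusted synchronously against a globally-consistent average model.
    Hyper-parameter names and values here are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    replicas = [rng.normal(size=2) for _ in range(num_replicas)]
    center = np.mean(replicas, axis=0)  # the average ("central") model
    for _ in range(steps):
        # Independent exploration: one SGD step per replica.
        replicas = [w - lr * grad(w) for w in replicas]
        avg = np.mean(replicas, axis=0)
        # Synchronous adjustment: pull each replica toward the average
        # model, and move the average model along the replicas' trajectory.
        replicas = [w - correction * (w - center) for w in replicas]
        center = center + correction * (avg - center)
    return center

print(sma_train())  # converges near TARGET on this toy loss
```

The point of the structure is that each replica can use a small per-replica batch (here, a full gradient stands in for it) while the consensus model keeps the replicas statistically coordinated.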
Issue Date: | Jul-2019 |
Date of Acceptance: | 15-Jun-2019 |
URI: | http://hdl.handle.net/10044/1/75907 |
DOI: | 10.14778/3342263.3342276 |
ISSN: | 2150-8097 |
Publisher: | VLDB Endowment |
Journal / Book Title: | Proceedings of the VLDB Endowment |
Volume: | 12 |
Issue: | 11 |
Copyright Statement: | © 2019 The Author(s) |
Sponsor/Funder: | Huawei Technologies Co. Ltd |
Funder's Grant Number: | YBN2017100016 |
Keywords: | Science & Technology; Technology; Computer Science, Information Systems; Computer Science, Theory & Methods; Computer Science; Optimization; cs.DC; cs.LG |
Publication Status: | Published |
Conference Place: | Los Angeles, CA, USA |
Online Publication Date: | 2019-07-01 |
Appears in Collections: | Computing |