Tackling Latency via Replication in Distributed Systems
File(s)main.pdf (819.96 KB)
Accepted version
Author(s)
Harrison, PG
Qiu, Z
Perez, JF
Type
Conference Paper
Abstract
Consistently high reliability and low latency are twin requirements
common to many forms of distributed processing;
for example, server farms and mirrored storage access.
To address them, we consider replication of requests with
canceling – i.e. initiate multiple concurrent replicas of a request
and use the first successful result returned, canceling
all outstanding replicas. This scheme has been studied recently,
but mostly for systems with a single central queue,
while server farms exploit distributed resources for scalability
and robustness. We develop an approximate stochastic
model to determine the response-time distribution in a system
with distributed queues, and compare its performance
against its centralized counterpart. Validation against simulation
indicates that our model is accurate for not only the
mean response time but also its percentiles, which are particularly
relevant for deadline-driven applications. Further,
we show that in the distributed set-up, replication with canceling
has the potential to reduce response times, even at
relatively high utilization. We also find that it offers response
times close to those of the centralized system, especially
at medium-to-high request reliability. These findings
support the use of replication with canceling as an effective
mechanism for both fault- and delay-tolerance.
common to many forms of distributed processing;
for example, server farms and mirrored storage access.
To address them, we consider replication of requests with
canceling – i.e. initiate multiple concurrent replicas of a request
and use the first successful result returned, canceling
all outstanding replicas. This scheme has been studied recently,
but mostly for systems with a single central queue,
while server farms exploit distributed resources for scalability
and robustness. We develop an approximate stochastic
model to determine the response-time distribution in a system
with distributed queues, and compare its performance
against its centralized counterpart. Validation against simulation
indicates that our model is accurate for not only the
mean response time but also its percentiles, which are particularly
relevant for deadline-driven applications. Further,
we show that in the distributed set-up, replication with canceling
has the potential to reduce response times, even at
relatively high utilization. We also find that it offers response
times close to those of the centralized system, especially
at medium-to-high request reliability. These findings
support the use of replication with canceling as an effective
mechanism for both fault- and delay-tolerance.
Date Issued
2016-03-12
Date Acceptance
2015-12-10
Citation
ICPE '16 Proceedings of the 7th ACM/SPEC on International Conference on Performance Engineering, 2016, pp.197-208
ISBN
978-1-4503-4080-9
Publisher
ACM
Start Page
197
End Page
208
Journal / Book Title
ICPE '16 Proceedings of the 7th ACM/SPEC on International Conference on Performance Engineering
Copyright Statement
© ACM, 2016. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ICPE '16 Proceedings of the 7th ACM/SPEC on International Conference on Performance Engineering, pp. 197-208. 12 Mar 2016 http://doi.acm.org/10.1145/2851553.2851562
Sponsor
Engineering & Physical Science Research Council (EPSRC)
Grant Number
EP/L00738X/1
Source
ACM/SPEC International Conference on Performance Engineering (ICPE),
Publication Status
Published
Start Date
2018-03-12
Finish Date
2016-03-18
Coverage Spatial
Delft, Netherlands