A minimally intrusive low-memory approach to resilience for existing transient solvers

File: Cantwell-Nielsen2019_Article_AMinimallyIntrusiveLow-MemoryA.pdf
Description: Published version
Size: 715.4 kB
Format: Adobe PDF
Title: A minimally intrusive low-memory approach to resilience for existing transient solvers
Authors: Cantwell, C; Nielsen, A
Item Type: Journal Article
Abstract: We propose a novel, minimally intrusive approach to adding fault tolerance to existing complex scientific simulation codes, used for addressing a broad range of time-dependent problems on the next generation of supercomputers. Exascale systems have the potential to allow much larger, more accurate and scale-resolving simulations of transient processes than can be performed on current petascale systems. However, with a much larger number of components, exascale computers are expected to suffer a node failure every few minutes. Many existing parallel simulation codes are not tolerant of these failures and existing resilience methodologies would necessitate major modifications or redesign of the application. Our approach combines the proposed user-level failure mitigation extensions to the Message-Passing Interface (MPI), with the concepts of message-logging and remote in-memory checkpointing, to demonstrate how to add scalable resilience to transient solvers. Logging MPI communication reduces the storage requirement of static data, such as finite element operators, and allows a spare MPI process to rebuild these data structures independently of other ranks. Remote in-memory checkpointing avoids disk I/O contention on large parallel filesystems. A prototype implementation is applied to Nektar++, a scalable, production-ready transient simulation framework. Forward-path and recovery-path performance of the resilience algorithm is analysed through experiments using the solver for the incompressible Navier-Stokes equations, and strong scaling of the approach is observed.
Issue Date: Jan-2019
Date of Acceptance: 25-Jun-2018
URI: http://hdl.handle.net/10044/1/61802
DOI: https://doi.org/10.1007/s10915-018-0778-7
ISSN: 0885-7474
Publisher: Springer Verlag
Start Page: 565
End Page: 581
Journal / Book Title: Journal of Scientific Computing
Volume: 78
Issue: 1
Copyright Statement: © The Author(s) 2018. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Sponsor/Funder: Commission of the European Communities
Funder's Grant Number: 671571
Keywords: Science & Technology
Physical Sciences
Mathematics, Applied
Mathematics
Exascale
Fault tolerance
Message-logging
MPI
Transient solvers
Parallel computing
FAULT-TOLERANCE
RECOVERY
0102 Applied Mathematics
0103 Numerical and Computational Mathematics
0802 Computation Theory and Mathematics
Applied Mathematics
Publication Status: Published
Online Publication Date: 2018-07-12
Appears in Collections: Aeronautics
Faculty of Engineering