Cache blocking strategies applied to flux reconstruction
File(s)
CacheBlockingFR_AAM.pdf (613.02 KB)
Accepted version
Author(s)
Akkurt, Semih
Witherden, Freddie
Vincent, Peter
Type
Journal Article
Abstract
On modern hardware architectures, the performance of Flux Reconstruction (FR) methods can be limited
by memory bandwidth. In a typical implementation, these methods comprise a chain of
distinct kernels. Often, a dataset which has just been written to main memory by one kernel is
read back immediately by the next kernel. One way to avoid such a redundant expenditure of memory
bandwidth is kernel fusion. However, on a practical level kernel fusion requires that the source for all
kernels be available, thus preventing calls to certain third-party library functions. Moreover, it can add
substantial complexity to a codebase. An alternative to full kernel fusion is cache blocking, but for this
to be effective the CPU cache has to be sufficiently large. Historically, the size of L1 and L2 caches prevented
cache blocking for high-order CFD applications. In recent years, however, L2 cache sizes have grown
from around 0.25 MiB to 1.25 MiB, making it possible to apply cache blocking in high-order CFD
codes. In this approach, kernels remain distinct, and are executed one after another on small chunks of
data that can fit in the cache, as opposed to on full datasets. These chunks of data stay in the cache and
whenever a kernel requests access to data that is already in the cache, memory bandwidth is conserved.
In this study, a data structure that facilitates cache blocking is considered, and a range of kernel grouping
configurations for an FR-based Euler solver are examined. A theoretical study is conducted for hexahedral
elements with no anti-aliasing at p = 3 and p = 4 in order to determine the predicted performance of
a few kernel grouping configurations. Then, these candidates are implemented in the PyFR solver and
the performance gains in practice are compared with the theoretical estimates that range between 2.05x
and 2.50x. An inviscid Taylor-Green Vortex test case is used as a benchmark, and the most performant
configuration leads to a speedup of approximately 2.81x in practice.
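The cache-blocking idea described in the abstract can be illustrated with a minimal sketch. This is not PyFR code: the two pointwise "kernels", the chunk size, and the array shapes are all hypothetical stand-ins, chosen only to contrast a naive kernel chain (intermediate result round-trips through main memory) with a cache-blocked loop (intermediate result stays cache-resident per chunk):

```python
import numpy as np

# Hypothetical stand-in kernels (not PyFR code).
def kernel_a(u):
    return 2.0 * u          # stand-in for e.g. a flux evaluation

def kernel_b(v):
    return v + 1.0          # stand-in for e.g. a divergence update

def chained(u):
    # Naive chain: kernel_a writes its full output array, and
    # kernel_b immediately reads it all back from main memory.
    v = kernel_a(u)
    return kernel_b(v)

def cache_blocked(u, block=4096):
    # Cache blocking: run both kernels back-to-back on one small
    # chunk at a time, so the intermediate result stays in cache
    # and the round-trip through main memory is avoided.
    out = np.empty_like(u)
    for i in range(0, u.size, block):
        chunk = u[i:i + block]
        out[i:i + block] = kernel_b(kernel_a(chunk))
    return out
```

Both paths compute the same result; the blocked variant simply reorders the work so each chunk's intermediate data is reused while still cache-resident, which is where the memory-bandwidth saving comes from.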
Date Issued
2022-02
Date Acceptance
2021-10-12
Citation
Computer Physics Communications, 2022, 271, pp.1-9
ISSN
0010-4655
Publisher
Elsevier BV
Start Page
1
End Page
9
Journal / Book Title
Computer Physics Communications
Volume
271
Copyright Statement
© 2022 Elsevier Ltd. All rights reserved. This manuscript is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International Licence http://creativecommons.org/licenses/by-nc-nd/4.0/
Sponsor
Engineering & Physical Science Research Council (EPSRC)
Identifier
https://www.sciencedirect.com/science/article/pii/S0010465521003052?via%3Dihub
Grant Number
EP/R030340/1
Subjects
Science & Technology
Technology
Physical Sciences
Computer Science, Interdisciplinary Applications
Physics, Mathematical
Computer Science
Physics
Cache blocking
Kernel fusion
Flux reconstruction
High-order
Computational fluid dynamics
Nuclear & Particles Physics
01 Mathematical Sciences
02 Physical Sciences
08 Information and Computing Sciences
Publication Status
Published
Article Number
108193
Date Publish Online
2021-10-28