Predictable thread coarsening
File(s)
finalVersionSubmitted.pdf (625.47 KB)
Accepted version
Author(s)
Stawinoga, Nicolai
Field, AJ
Type
Journal Article
Abstract
Thread coarsening on GPUs combines the work of several threads into one. We show how thread coarsening
can be implemented as a fully automated compile-time optimisation which estimates the optimal coarsening
factor based on a low-cost, approximate static analysis of cache line re-use and an occupancy prediction model.
We evaluate two coarsening strategies on three different NVidia GPU architectures. For NVidia reduction
kernels we achieve a maximum speedup of 5.08x and for the Rodinia benchmarks we achieve a mean speedup
of 1.30x over 8 of 19 kernels that were determined safe to coarsen.
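As a rough illustration of what "combining the work of several threads into one" means, the sketch below emulates GPU threads with plain CPU loops. It is not the authors' compiler transformation or their cost model. The function names `scale_baseline` and `scale_coarsened`, the scaling workload, and the strided index mapping are all assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Baseline: one logical thread per element, as in an uncoarsened kernel.
// Each iteration of the tid loop plays the role of one GPU thread.
std::vector<float> scale_baseline(const std::vector<float>& in, float k) {
    std::vector<float> out(in.size());
    const std::size_t num_threads = in.size();
    for (std::size_t tid = 0; tid < num_threads; ++tid) {
        out[tid] = k * in[tid];
    }
    return out;
}

// Coarsened: each logical thread performs the work of `factor` original
// threads, so only ceil(n / factor) threads are needed.
// Assumption: a strided mapping (tid, tid + num_threads, ...) is used so
// that neighbouring threads still touch neighbouring elements.
std::vector<float> scale_coarsened(const std::vector<float>& in, float k,
                                   std::size_t factor) {
    assert(factor > 0 && "coarsening factor must be positive");
    std::vector<float> out(in.size());
    const std::size_t num_threads = (in.size() + factor - 1) / factor;
    for (std::size_t tid = 0; tid < num_threads; ++tid) {
        for (std::size_t j = 0; j < factor; ++j) {
            const std::size_t i = tid + j * num_threads;
            if (i < in.size()) {  // guard the tail when n is not a multiple of factor
                out[i] = k * in[i];
            }
        }
    }
    return out;
}
```

Both functions produce the same output for any positive `factor`; only the number of logical threads and the work per thread change. Choosing that factor automatically is the problem the paper addresses.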
Date Issued
2018-06-01
Date Acceptance
2018-03-05
Citation
ACM Transactions on Architecture and Code Optimization, 2018, 15 (2)
ISSN
1544-3973
Publisher
Association for Computing Machinery
Journal / Book Title
ACM Transactions on Architecture and Code Optimization
Volume
15
Issue
2
Copyright Statement
© 2018 ACM. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ACM Transactions on Architecture and Code Optimization: https://dl.acm.org/citation.cfm?doid=3212710.3194242
Subjects
0803 Computer Software
Publication Status
Published
Article Number
ARTN 23