Implementing Monte Carlo tests with P-value buckets
File(s)
multthresh.pdf (559.99 KB)
Accepted version
Author(s)
Gandy, Axel
Hahn, Georg
Ding, Dong
Type
Journal Article
Abstract
Software packages usually report the results of statistical tests using
p-values. Users often interpret these by comparing them to standard thresholds,
e.g. 0.1%, 1% and 5%, which is sometimes reinforced by a star rating (***, **,
*). We consider an arbitrary statistical test whose p-value p is not available
explicitly, but can be approximated by Monte Carlo samples, e.g. by bootstrap
or permutation tests. The standard implementation of such tests usually draws a
fixed number of samples to approximate p. However, the probability that the
exact and the approximated p-value lie on different sides of a threshold (the
resampling risk) can be high, particularly for p-values close to a threshold.
We present a method to overcome this. We consider a finite set of
user-specified intervals which cover [0,1] and which can be overlapping. We
call these p-value buckets. We present algorithms that, with arbitrarily high
probability, return a p-value bucket containing p. We prove that for both a
bounded resampling risk and a finite runtime, overlapping buckets need to be
employed, and that our methods both bound the resampling risk and guarantee a
finite runtime for such overlapping buckets. To interpret decisions with
overlapping buckets, we propose an extension of the star rating system. We
demonstrate that our methods are suitable for use in standard software,
including for low p-value thresholds occurring in multiple testing settings,
and that they can be computationally more efficient than standard
implementations.
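The idea of returning a bucket rather than a point estimate can be illustrated with a minimal sketch. This is not the authors' algorithm: it swaps in a simple Hoeffding confidence interval with a union bound over successive looks, and stops as soon as the interval fits inside one bucket. The function name `bucket_test`, the parameters `eps`, `batch` and `max_samples`, the simulated sampler and the three example buckets are all assumptions made for illustration.

```python
import math
import random


def bucket_test(draw_exceedances, buckets, eps=0.01, batch=1000, max_samples=10**6):
    """Return (bucket, n_samples), or (None, n_samples) if max_samples is reached.

    draw_exceedances(n) must return how many of n fresh Monte Carlo samples
    are at least as extreme as the observed test statistic.
    buckets is a list of closed intervals (lo, hi) covering [0, 1].
    """
    exceed, n, k = 0, 0, 0
    while n < max_samples:
        k += 1
        exceed += draw_exceedances(batch)
        n += batch
        # Spend eps / (k (k + 1)) at the k-th look; these sum to at most eps,
        # so all intervals hold simultaneously with probability >= 1 - eps.
        eps_k = eps / (k * (k + 1))
        # Two-sided Hoeffding bound for a mean of n Bernoulli variables.
        half_width = math.sqrt(math.log(2.0 / eps_k) / (2.0 * n))
        p_hat = exceed / n
        lo = max(0.0, p_hat - half_width)
        hi = min(1.0, p_hat + half_width)
        # Stop once the whole interval lies inside a single bucket.
        for b_lo, b_hi in buckets:
            if b_lo <= lo and hi <= b_hi:
                return (b_lo, b_hi), n
    return None, n


# Hypothetical example: the true p-value sits exactly on the 5% threshold.
rng = random.Random(1)
p_true = 0.05


def simulate(n):
    # Stand-in for a bootstrap or permutation sampler (assumption).
    return sum(rng.random() < p_true for _ in range(n))


# Two classical buckets plus one overlapping bucket around the threshold.
buckets = [(0.0, 0.05), (0.04, 0.06), (0.05, 1.0)]
bucket, n_used = bucket_test(simulate, buckets)
print(bucket, n_used)
```

Without the overlapping bucket (0.04, 0.06), an interval around a p-value at exactly 5% never fits inside either classical bucket, so the loop would only end at `max_samples`. With it, the interval eventually becomes narrow enough to fit, which mirrors the abstract's point that overlapping buckets are needed for a finite runtime.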
Date Issued
2020-09-01
Date Acceptance
2019-10-30
Citation
Scandinavian Journal of Statistics: theory and applications, 2020, 47 (3), pp. 950-967
ISSN
0303-6898
Publisher
Wiley
Start Page
950
End Page
967
Journal / Book Title
Scandinavian Journal of Statistics: theory and applications
Volume
47
Issue
3
Copyright Statement
© 2019 Board of the Foundation of the Scandinavian Journal of Statistics. This is the peer reviewed version of the following article, which has been published in final form at https://onlinelibrary.wiley.com/doi/full/10.1111/sjos.12434. This article may be used for non-commercial purposes in accordance with Wiley Terms and Conditions for Use of Self-Archived Versions.
Identifier
http://arxiv.org/abs/1703.09305v4
Subjects
stat.ME
Publication Status
Published
Date Publish Online
2019-11-14