The application of Hadoop in structural bioinformatics
File(s)
Paper-v2.pdf (533.71 KB)
Accepted version
Author(s)
Alnasir, Jamie J
Shanahan, Hugh P
Type
Journal Article
Abstract
The paper reviews the use of the Hadoop platform in structural bioinformatics applications. For structural bioinformatics, Hadoop provides a new framework to analyse large fractions of the Protein Data Bank that is key for high-throughput studies of, for example, protein–ligand docking, clustering of protein–ligand complexes and structural alignment. Specifically, we review a number of implementations in the literature that use Hadoop for high-throughput analyses, and their scalability. We find that these deployments for the most part use known executables called from MapReduce rather than rewriting the algorithms. The scalability exhibits variable behaviour in comparison with other batch schedulers, particularly as direct comparisons on the same platform are generally not available. Direct comparisons of Hadoop with batch schedulers are absent in the literature, but we note there is some evidence that Message Passing Interface implementations scale better than Hadoop. A significant barrier to the use of the Hadoop ecosystem is the difficulty of the interface and of configuring a resource to use Hadoop. This will improve over time as interfaces to Hadoop (e.g. Spark) improve, usage of cloud platforms (e.g. Azure and Amazon Web Services (AWS)) increases and standardised approaches such as Workflow Languages (i.e. Workflow Definition Language, Common Workflow Language and Nextflow) are taken up.
Date Issued
2018-11-20
Date Acceptance
2018-10-05
Citation
Briefings in Bioinformatics, 2018, 21 (1), pp.96-105
ISSN
1467-5463
Publisher
Oxford University Press (OUP)
Start Page
96
End Page
105
Journal / Book Title
Briefings in Bioinformatics
Volume
21
Issue
1
Copyright Statement
© The authors 2018. Published by Oxford University Press on behalf of the Institute of Mathematics and its Applications. All rights reserved.
This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model). This is a pre-copy-editing, author-produced version of an article accepted for publication in Briefings in Bioinformatics following peer review. The definitive publisher-authenticated version Jamie J Alnasir, Hugh P Shanahan, The application of Hadoop in structural bioinformatics, Briefings in Bioinformatics, Volume 21, Issue 1, January 2020, Pages 96–105, is available online at: https://doi.org/10.1093/bib/bby106
Identifier
https://academic.oup.com/bib/article/21/1/96/5162997
Subjects
0601 Biochemistry and Cell Biology
0802 Computation Theory and Mathematics
0899 Other Information and Computing Sciences
Bioinformatics
Publication Status
Published
Date Publish Online
2018-11-20