Auto-CORPus: a natural language processing tool for standardising and reusing biomedical literature
File(s)
Frontiers_Auto-CORPus_revisionDec21_Supplementary_Material_final.pdf (1.84 MB)
Supporting information
fdgth-04-788124.pdf (3.13 MB)
Published version
OA Location
https://www.biorxiv.org/content/10.1101/2021.01.08.425887v2
Author(s)
Beck, Tim
Shorter, Tom
Hu, Sawyer
Li, Zhuoyu
Sun, Shujian
Type
Journal Article
Abstract
To analyse large corpora using machine learning and other Natural Language Processing (NLP) algorithms, the corpora need to be standardised. The BioC format is a simple, community-driven data structure for sharing text and annotations; however, there is limited access to biomedical literature in BioC format and a lack of bioinformatics tools to convert online publication HTML formats to BioC. We present Auto-CORPus (Automated pipeline for Consistent Outputs from Research Publications), a novel NLP tool for the standardisation and conversion of publication HTML and table image files to three convenient machine-interpretable outputs to support biomedical text analytics. Firstly, Auto-CORPus can be configured to convert HTML from various publication sources to BioC. To standardise the description of heterogeneous publication sections, the Information Artifact Ontology is used to annotate each section within the BioC output. Secondly, Auto-CORPus transforms publication tables to a JSON format to store, exchange and annotate table data between text analytics systems. The BioC specification does not include a data structure for representing publication table data, so we present a JSON format for sharing table content and metadata. Inline tables within full-text HTML files and linked tables within separate HTML files are processed and converted to the machine-interpretable table JSON format. Finally, Auto-CORPus extracts abbreviations declared within publication text and provides an abbreviations JSON output that relates each abbreviation to its full definition. This abbreviation collection supports text mining tasks such as named entity recognition by including abbreviations unique to individual publications that are not contained within standard bio-ontologies and dictionaries. The Auto-CORPus package is freely available with detailed instructions from GitHub at https://github.com/omicsNLP/Auto-CORPus.
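The abbreviation output described in the abstract can be illustrated with a minimal sketch. It is not the Auto-CORPus implementation and does not reproduce its JSON schema: the function name `extract_abbreviations`, the "long form (ABBR)" matching rule, and the flat abbreviation-to-definition mapping are all assumptions made for illustration.

```python
import json
import re

# Hypothetical pattern (an assumption, not taken from Auto-CORPus):
# a parenthesised token of 2-10 letters that starts with a capital.
PAREN_ABBR = re.compile(r"\(([A-Z][A-Za-z]{1,9})\)")


def extract_abbreviations(text):
    """Map each parenthesised abbreviation to the preceding words whose
    initials spell it (compared case-insensitively)."""
    found = {}
    for match in PAREN_ABBR.finditer(text):
        abbr = match.group(1)
        # Words that appear before the opening parenthesis.
        words = re.findall(r"[A-Za-z][A-Za-z-]*", text[:match.start()])
        # Take as many preceding words as the abbreviation has letters.
        candidate = words[-len(abbr):]
        initials = "".join(word[0] for word in candidate)
        if len(candidate) == len(abbr) and initials.lower() == abbr.lower():
            found[abbr] = " ".join(candidate)
    return found


sample = (
    "Natural Language Processing (NLP) tools support "
    "named entity recognition (NER)."
)
abbreviations = extract_abbreviations(sample)
# Serialise the mapping as JSON; the real tool's output layout may differ.
print(json.dumps(abbreviations, indent=2))
```

This simple initial-letter rule misses abbreviations that skip words or use inner letters, so it is only a starting point for understanding the kind of mapping the abstract describes.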
Editor(s)
Ruch, Patrick
Date Issued
2022-02-15
Date Acceptance
2022-01-21
Citation
Frontiers in Digital Health, 2022, Healthcare Text Analytics: Unlocking the Evidence from Free Text, Volume II, 4
URI
http://hdl.handle.net/10044/1/94647
DOI
https://www.dx.doi.org/10.3389/fdgth.2022.788124
ISSN
2673-253X
Publisher
Frontiers Media
Journal / Book Title
Frontiers in Digital Health
Volume
4
Copyright Statement
© 2022 Beck, Shorter, Hu, Li, Sun, Popovici, McQuibban, Makraduli,
Yeung, Rowlands and Posma. This is an open-access article distributed under the
terms of the Creative Commons Attribution License (CC BY). The use, distribution
or reproduction in other forums is permitted, provided the original author(s) and
the copyright owner(s) are credited and that the original publication in this journal
is cited, in accordance with accepted academic practice. No use, distribution or
reproduction is permitted which does not comply with these terms.
License URL
http://creativecommons.org/licenses/by/4.0/
Sponsor
Medical Research Council (MRC)
Identifier
https://www.frontiersin.org/articles/10.3389/fdgth.2022.788124/abstract
Grant Number
MR/S004033/1
Subjects
Natural Language Processing
text mining
health data
biomedical literature
Semantics
Notes
Published in the Health Informatics section of the journal
Edition
Healthcare Text Analytics: Unlocking the Evidence from Free Text, Volume II
Publication Status
Published
Article Number
ARTN 788124