Auto-CORPus: Automated and Consistent Outputs from Research Publications
OA Location
Author(s)
Type
Conference Paper
Abstract
The availability of improved natural language processing (NLP) algorithms and models enables researchers to analyse larger corpora using open source tools. Text mining of biomedical literature is one area in which NLP has been used in recent years and which retains large untapped potential. However, corpora must be standardized before they can be analysed using machine learning NLP algorithms. Summarizing data from the literature for storage in databases typically requires manual curation, especially when extracting data from result tables.
We present an automated pipeline that cleans HTML files from biomedical literature. The outputs are JSON files that contain the text for each section and the table data in machine-readable format, and that list the phenotypes, assays, chemical compounds, SNPs, P-values and abbreviations found in the article. We analysed a total of 2,441 Open Access articles from PubMed Central, from both Genome-Wide and Metabolome-Wide Association Studies, and developed a model to standardize the section headers based on the Information Artifact Ontology. Extraction of table data was developed on PubMed articles and fine-tuned using the equivalent publisher versions. As part of this work we found evidence of tables being converted to figures by authors as well as publishers. To address this, we have developed a pipeline that converts a table from an image back to text while keeping the table structure intact. We have fine-tuned the Tesseract optical character recognition (OCR) algorithm specifically for biomedical table data. Compared with the original Tesseract algorithm, this improved character recognition accuracy on table images from 53% to 90% when evaluated on 233 tables from 80 publications.
In summary, Auto-CORPus can be used to create a corpus for different fields where the section headers are standardised to allow NLP algorithms to be applied to specific paragraphs, rather than only on abstracts or the full text.
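As an illustration of what standardizing section headers can look like, the following is a minimal sketch only, not the model described in the abstract. It assumes a simple keyword lookup; the `KEYWORDS` table, the category labels and the `standardize_header` function are hypothetical names introduced here, and the labels are not verified Information Artifact Ontology terms.

```python
import re

# Hypothetical keyword table. The category labels are illustrative only and
# are not verified Information Artifact Ontology terms.
KEYWORDS = {
    "introduction": ["introduction", "background"],
    "methods": ["method", "materials", "experimental"],
    "results": ["result", "findings"],
    "discussion": ["discussion", "conclusion"],
}


def standardize_header(raw):
    """Map a raw section header to one or more illustrative category labels."""
    # Drop leading numbering such as "2." or "1.1 " and lower-case the rest.
    text = re.sub(r"^[\d.\s]+", "", raw).lower()
    # A header may match several categories, e.g. "Results and Discussion".
    labels = [
        label
        for label, words in KEYWORDS.items()
        if any(word in text for word in words)
    ]
    return labels or ["unknown"]
```

With headers mapped to a shared vocabulary in this way, downstream NLP code can select, for example, only the paragraphs labelled as methods across every article in a corpus.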
Date Issued
2021-06-15
Date Acceptance
2021-04-30
Citation
2021
Sponsor
Medical Research Council (MRC)
Grant Number
MR/S004033/1
Source
UK healthcare text analytics conference 2021
Publication Status
Published
Start Date
2021-06-17
Finish Date
2021-06-18
Coverage Spatial
London
Date Publish Online
2021-06-15