Exploiting and assessing multi-source data for supervised biomedical named entity recognition
File(s)
bty152.pdf (789.57 KB)
Published version
Author(s)
Galea, Dieter
Laponogov, Ivan
Veselkov, Kirill
Type
Journal Article
Abstract
Motivation:
Recognition of biomedical entities from scientific text is a critical component of natural language processing and automated information extraction platforms. Modern named entity recognition approaches rely heavily on supervised machine learning techniques, which are critically dependent on annotated training corpora. These approaches have been shown to perform well when trained and tested on the same source. However, in such a scenario, the performance and evaluation of these models may be optimistic, as such models may not necessarily generalize to independent corpora, resulting in potentially non-optimal entity recognition for large-scale tagging of widely diverse articles in databases such as PubMed.
Results:
Here we aggregated published corpora for the recognition of biomolecular entities (such as genes, RNA, proteins, variants, drugs, and metabolites), identified entity class overlap and performed a leave-corpus-out cross-validation strategy to test the efficiency of existing models. We demonstrate that accuracies of models trained on individual corpora decrease substantially for recognition of the same biomolecular entity classes in independent corpora. This behavior is possibly due to limited generalizability of entity-class-related features captured by individual corpora (model “overtraining”), which we investigated further at the orthographic level, as well as potential annotation standard differences. We show that the combined use of multi-source training corpora results in overall more generalizable models for named entity recognition, while achieving comparable individual performance. By performing learning-curve-based power analysis we further identified that performance is often not limited by the quantity of the annotated data.
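The leave-corpus-out strategy named in the Results can be illustrated with a minimal sketch: each corpus is held out in turn, a model is trained on the remaining corpora, and it is scored on the held-out one. Everything below is an assumption made for illustration and is not taken from the paper: the three toy corpora, the function names, the token-level accuracy metric, and the dictionary-lookup tagger, which only stands in for a real supervised NER model.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpora: each sentence is a list of (token, label) pairs.
CORPORA = {
    "corpus_a": [[("BRCA1", "GENE"), ("binds", "O"), ("RAD51", "GENE")]],
    "corpus_b": [[("TP53", "GENE"), ("and", "O"), ("BRCA1", "GENE")]],
    "corpus_c": [[("aspirin", "DRUG"), ("inhibits", "O"), ("TP53", "GENE")]],
}


def leave_corpus_out(corpora):
    """Yield (held_out_name, train_sentences, test_sentences), one fold per corpus."""
    for held_out in corpora:
        train = [
            sentence
            for name, sentences in corpora.items()
            if name != held_out
            for sentence in sentences
        ]
        yield held_out, train, corpora[held_out]


def train_lookup_tagger(sentences):
    """Stand-in model: memorise the most frequent label seen for each token."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for token, label in sentence:
            counts[token][label] += 1
    return {token: c.most_common(1)[0][0] for token, c in counts.items()}


def token_accuracy(model, sentences):
    """Fraction of tokens labelled correctly; unseen tokens default to 'O'."""
    pairs = [(model.get(tok, "O"), gold) for s in sentences for tok, gold in s]
    return sum(pred == gold for pred, gold in pairs) / len(pairs)


def run(corpora):
    """Return one held-out score per corpus."""
    return {
        name: token_accuracy(train_lookup_tagger(train), test)
        for name, train, test in leave_corpus_out(corpora)
    }


scores = run(CORPORA)
for name, score in scores.items():
    print(f"held out {name}: token accuracy {score:.2f}")
```

In this toy setup, an entity that appears only in the held-out corpus is missed, which is a simplified picture of the cross-corpus drop the abstract describes. A realistic evaluation would use an entity-level metric and a sequence model rather than a lookup table.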
Date Issued
2018-07-15
Date Acceptance
2018-02-14
Citation
Bioinformatics, 2018, 34 (14), pp.2472-2482
ISSN
1367-4803
Publisher
Oxford University Press (OUP)
Start Page
2472
End Page
2482
Journal / Book Title
Bioinformatics
Volume
34
Issue
14
Copyright Statement
© The Author(s) 2018. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
Sponsor
Biotechnology and Biological Sciences Research Council (BBSRC)
Commission of the European Communities
Grant Number
BB/L020858/1
634402
Subjects
01 Mathematical Sciences
06 Biological Sciences
08 Information And Computing Sciences
Bioinformatics
Publication Status
Published
Date Publish Online
2018-03-10