Exploiting and assessing multi-source data for supervised biomedical named entity recognition
File(s)
bty152.pdf (789.57 KB)
Published version
Author(s)
Galea, Dieter
Laponogov, Ivan
Veselkov, Kirill
Type
Journal Article
Abstract
Motivation:
Recognition of biomedical entities from scientific text is a critical component of natural language processing and automated information extraction platforms. Modern named entity recognition approaches rely heavily on supervised machine learning techniques, which are critically dependent on annotated training corpora. These approaches have been shown to perform well when trained and tested on the same source. However, in such a scenario, the performance and evaluation of these models may be optimistic, as such models may not necessarily generalize to independent corpora, resulting in potential non-optimal entity recognition for large-scale tagging of widely diverse articles in databases such as PubMed.
Results:
Here we aggregated published corpora for the recognition of biomolecular entities (such as genes, RNA, proteins, variants, drugs, and metabolites), identified entity class overlap and performed a leave-corpus-out cross-validation strategy to test the efficiency of existing models. We demonstrate that accuracies of models trained on individual corpora decrease substantially for recognition of the same biomolecular entity classes in independent corpora. This behavior is possibly due to limited generalizability of entity-class-related features captured by individual corpora (model “overtraining”), which we investigated further at the orthographic level, as well as potential annotation standard differences. We show that the combined use of multi-source training corpora results in overall more generalizable models for named entity recognition, while achieving comparable individual performance. By performing learning-curve-based power analysis we further identified that performance is often not limited by the quantity of the annotated data.
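Illustration (not part of the published abstract): the leave-corpus-out strategy mentioned above can be sketched roughly as follows. This is a minimal sketch of the splitting step only, assuming each corpus is a list of (tokens, labels) sentences. The function name `leave_corpus_out`, the corpus names, and the toy sentences are all hypothetical and are not taken from the paper or its code.

```python
from typing import Dict, Iterator, List, Tuple

# A sentence is a pair of parallel lists: tokens and their entity labels.
Sentence = Tuple[List[str], List[str]]


def leave_corpus_out(
    corpora: Dict[str, List[Sentence]],
) -> Iterator[Tuple[str, List[Sentence], List[Sentence]]]:
    """Yield (held_out_name, train, test), holding each corpus out once.

    The training set pools every other corpus; the test set is the
    held-out corpus alone, so evaluation is on an independent source.
    """
    names = sorted(corpora)  # fixed order so the folds are reproducible
    for held_out in names:
        train = [s for name in names if name != held_out for s in corpora[name]]
        test = list(corpora[held_out])
        yield held_out, train, test


# Toy data (hypothetical): three tiny "corpora" with invented sentences.
corpora = {
    "corpus_a": [(["BRCA1", "is", "a", "gene"], ["B-Gene", "O", "O", "O"])],
    "corpus_b": [
        (["TP53", "mutations"], ["B-Gene", "O"]),
        (["aspirin"], ["B-Drug"]),
    ],
    "corpus_c": [(["glucose", "levels"], ["B-Metabolite", "O"])],
}

folds = list(leave_corpus_out(corpora))
for name, train, test in folds:
    print(f"held out {name}: {len(train)} training / {len(test)} test sentences")
```

In a real experiment, a tagger would be trained on `train` and scored on `test` for each fold. Comparing those scores with same-corpus scores is what would reveal the generalization gap the abstract describes.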
Date Issued
2018-07-15
Date Acceptance
2018-02-14
Citation
Bioinformatics, 2018, 34 (14), pp.2472-2482
URI
http://hdl.handle.net/10044/1/57872
DOI
https://www.dx.doi.org/10.1093/bioinformatics/bty152
ISSN
1367-4803
Publisher
Oxford University Press (OUP)
Start Page
2472
End Page
2482
Journal / Book Title
Bioinformatics
Volume
34
Issue
14
Copyright Statement
© The Author(s) 2018. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
Sponsor
Biotechnology and Biological Sciences Research Council (BBSRC)
Commission of the European Communities
Grant Number
BB/L020858/1
634402
Subjects
01 Mathematical Sciences
06 Biological Sciences
08 Information And Computing Sciences
Bioinformatics
Publication Status
Published
Date Publish Online
2018-03-10