Towards a natural language processing pipeline and search engine for biomedical associations derived from scientific literature
Author(s)
Galea, Dieter
Type
Thesis
Abstract
Biomedical research is published at a rapid rate, with PubMed containing over 29 million
publications. A natural language processing (NLP) pipeline facilitating information extraction
is required. Existing pipelines achieve promising performance, but are often restricted to a
small number of bioentities (such as genes and diseases), ignore negative associations, and treat
new claims and background sentences equally. Here, different NLP tasks required to develop
a scalable and generalizable open source pipeline for biomedical association extraction that
tackles these limitations are investigated. In turn, this is used to build a repository of queryable
associations.
Starting by optimizing how biomedical language is represented in machine learning (ML)
models, state-of-the-art representations are obtained and subsequently used in downstream
tasks, including bioentity recognition. This latter work indicates that current recognition models
are poorly generalizable, resulting in unrealistic performance when applied at scale.
Additionally, it is shown here that acquiring more data does not improve ML-based entity
recognition performance. Beyond ML methods, this work presents a number of dictionary-based
approaches, and graph-based dictionaries from more than 13 sources covering metabolites,
genes/proteins, species, chemicals, toxins, drugs, diseases, foods, food compounds and
anatomy are compiled. These are used to annotate PubMed for subsequent association
extraction.
To achieve a diverse association extraction pipeline for 10 entity types, we attempt to find a
balance between generalizable rules and ML models. A neural model is trained to identify
novel association claims with 94% accuracy and a rule-based approach to identify negated
statements with up to 91% accuracy. A set of rules is devised to define associations.
Quantitative evaluation shows promising results; however, further work is required. Extracted
associations are stored in a graph database, enabling querying for associations reported in
literature, as well as discovering new potential indirect linkages. To demonstrate its future use,
a frontend proof of concept is presented.
Version
Open Access
Date Issued
2019-07
Date Awarded
2020-09
Copyright Statement
Creative Commons Attribution Licence
Advisor
Veselkov, Kirill
Takats, Zoltan
Sponsor
Imperial College London
Publisher Department
Department of Metabolism, Digestion and Reproduction
Publisher Institution
Imperial College London
Qualification Level
Doctoral
Qualification Name
Doctor of Philosophy (PhD)