Towards a natural language processing pipeline and search engine for biomedical associations derived from scientific literature
File(s)
Galea-D-2020-PhD-Thesis.pdf (5.91 MB)
Thesis
Author(s)
Galea, Dieter
Type
Thesis
Abstract
Biomedical research is published at a rapid rate, with PubMed containing over 29 million
publications. A natural language processing (NLP) pipeline facilitating information extraction
is required. Existing pipelines achieve promising performance, but are often restricted to a
small number of bioentities (such as genes and diseases), ignore negative associations, and treat
new claims and background sentences equally. Here, different NLP tasks required to develop
a scalable and generalizable open source pipeline for biomedical association extraction that
tackles these limitations are investigated. In turn, this is used to build a repository of queryable
associations.
Starting by optimizing how biomedical language is represented in machine learning (ML)
models, state-of-the-art representations are obtained and subsequently used in downstream
tasks, including bioentity recognition. The latter work indicates that current recognition models
are poorly generalizable, resulting in unrealistic performance when applied at scale.
Additionally, it is shown here that acquiring more data does not improve ML-based entity
recognition performance. Beyond ML methods, this work presents a number of dictionary-based
approaches and compiles graph-based dictionaries for more than 13 sources covering metabolites,
genes/proteins, species, chemicals, toxins, drugs, diseases, foods, food compounds and
anatomy. These are used to annotate PubMed for subsequent association
extraction.
To achieve a diverse association extraction pipeline for 10 entity types, we attempt to find a
balance between generalizable rules and ML models. A neural model is trained to identify
novel association claims with 94% accuracy, and a rule-based approach identifies negated
statements with up to 91% accuracy. A set of rules is devised to define associations.
Quantitative evaluation shows promising results; however, further work is required. Extracted
associations are stored in a graph database, enabling querying for associations reported in
literature, as well as discovering new potential indirect linkages. To demonstrate its future use,
a frontend proof of concept is presented.
Version
Open Access
Date Issued
2019-07
Date Awarded
2020-09
URI
http://hdl.handle.net/10044/1/83109
DOI
https://doi.org/10.25560/83109
Copyright Statement
Creative Commons Attribution Licence
License URL
Attribution 4.0 International
Advisor
Veselkov, Kirill
Takats, Zoltan
Sponsor
Imperial College London
Publisher Department
Department of Metabolism, Digestion and Reproduction
Publisher Institution
Imperial College London
Qualification Level
Doctoral
Qualification Name
Doctor of Philosophy (PhD)