Efficient Learning and Evaluation of Complex Concepts in Inductive Logic Programming

Santos, Jose Carlos Almeida Santos

doi:https://doi.org/10.25560/6409

File(s)

Santos-JCA-2011-PhD-Thesis.pdf (1018.9 KB)

Author(s)

Santos, Jose Carlos Almeida Santos

Type

Thesis or dissertation

Abstract

Inductive Logic Programming (ILP) is a subfield of Machine Learning with foundations in logic
programming. In ILP, logic programming, a subset of first-order logic, is used as a uniform
representation language for the problem specification and induced theories. ILP has been
successfully applied to many real-world problems, especially in the biological domain (e.g. drug
design, protein structure prediction), where relational information is of particular importance.
The expressiveness of logic programs gives flexibility in specifying the learning task and makes
the induced theories understandable. However, this flexibility comes at a high computational
cost, constraining the applicability of ILP systems. Constructing and evaluating complex concepts
remain two of the main issues that prevent ILP systems from tackling many learning
problems. These learning problems are interesting both from a research perspective, as they
raise the standards for ILP systems, and from an application perspective, as such target
concepts occur naturally in many real-world applications. Such complex concepts cannot
be constructed or evaluated by parallelizing existing top-down ILP systems or improving the
underlying Prolog engine. Novel search strategies and cover algorithms are needed.

The main focus of this thesis is on how to efficiently construct and evaluate complex hypotheses
in an ILP setting. In order to construct such hypotheses we investigate two approaches.
The first, the Top Directed Hypothesis Derivation framework, implemented in the ILP system
TopLog, involves the use of a top theory to constrain the hypothesis space. In the second approach
we revisit the bottom-up search strategy of Golem, lifting its restriction to determinate
clauses, which had rendered Golem inapplicable to many key areas. These developments led to
the bottom-up ILP system ProGolem. A challenge that arises with a bottom-up approach is the
coverage computation of long, non-determinate clauses, for which Prolog’s SLD-resolution is no longer
adequate. We developed a new Prolog-based theta-subsumption engine that is significantly
more efficient than SLD-resolution in computing the coverage of such complex clauses.

We provide evidence that ProGolem achieves the goal of learning complex concepts by presenting
a protein-hexose binding prediction application. The theory induced by ProGolem has
statistically significantly better predictive accuracy than that of other learners. More importantly,
the biological insights ProGolem’s theory provided were judged by domain experts to
be relevant and, in some cases, novel.

Date Issued

2010-12

Date Awarded

2011-03

URI

http://hdl.handle.net/10044/1/6409

DOI

https://doi.org/10.25560/6409

Advisor

Muggleton, Stephen

Sternberg, Michael

Sponsor

Wellcome Trust

Creator

Santos, Jose Carlos Almeida Santos

Grant Number

0807/12/Z/06/Z

Publisher Department

Computing

Publisher Institution

Imperial College London

Qualification Level

Doctoral

Qualification Name

Doctor of Philosophy (PhD)