Efficient Learning and Evaluation of Complex Concepts in Inductive Logic Programming
Author(s)
Santos, Jose Carlos Almeida Santos
Type
Thesis or dissertation
Abstract
Inductive Logic Programming (ILP) is a subfield of Machine Learning with foundations in logic
programming. In ILP, logic programming, a subset of first-order logic, is used as a uniform
representation language for the problem specification and induced theories. ILP has been
successfully applied to many real-world problems, especially in the biological domain (e.g. drug
design, protein structure prediction), where relational information is of particular importance.
The expressiveness of logic programs grants flexibility in specifying the learning task and understandability
to the induced theories. However, this flexibility comes at a high computational
cost, constraining the applicability of ILP systems. Constructing and evaluating complex concepts
remain two of the main issues that prevent ILP systems from tackling many learning
problems. These learning problems are interesting both from a research perspective, as they
raise the standards for ILP systems, and from an application perspective, where these target
concepts naturally occur in many real-world applications. Such complex concepts cannot
be constructed or evaluated by parallelizing existing top-down ILP systems or improving the
underlying Prolog engine. Novel search strategies and cover algorithms are needed.
The main focus of this thesis is on how to efficiently construct and evaluate complex hypotheses
in an ILP setting. In order to construct such hypotheses we investigate two approaches.
The first, the Top Directed Hypothesis Derivation framework, implemented in the ILP system
TopLog, involves the use of a top theory to constrain the hypothesis space. In the second approach
we revisit the bottom-up search strategy of Golem, lifting its restriction on determinate
clauses which had rendered Golem inapplicable to many key areas. These developments led to
the bottom-up ILP system ProGolem. A challenge that arises with a bottom-up approach is the
coverage computation of long, non-determinate, clauses. Prolog’s SLD-resolution is no longer
adequate. We developed a new, Prolog-based, theta-subsumption engine which is significantly
more efficient than SLD-resolution in computing the coverage of such complex clauses.
We provide evidence that ProGolem achieves the goal of learning complex concepts by presenting
a protein-hexose binding prediction application. The theory ProGolem induced has
a statistically significant better predictive accuracy than that of other learners. More importantly,
the biological insights ProGolem’s theory provided were judged by domain experts to
be relevant and, in some cases, novel.
programming. In ILP, logic programming, a subset of first-order logic, is used as a uniform
representation language for the problem specification and induced theories. ILP has been
successfully applied to many real-world problems, especially in the biological domain (e.g. drug
design, protein structure prediction), where relational information is of particular importance.
The expressiveness of logic programs grants flexibility in specifying the learning task and understandability
to the induced theories. However, this flexibility comes at a high computational
cost, constraining the applicability of ILP systems. Constructing and evaluating complex concepts
remain two of the main issues that prevent ILP systems from tackling many learning
problems. These learning problems are interesting both from a research perspective, as they
raise the standards for ILP systems, and from an application perspective, where these target
concepts naturally occur in many real-world applications. Such complex concepts cannot
be constructed or evaluated by parallelizing existing top-down ILP systems or improving the
underlying Prolog engine. Novel search strategies and cover algorithms are needed.
The main focus of this thesis is on how to efficiently construct and evaluate complex hypotheses
in an ILP setting. In order to construct such hypotheses we investigate two approaches.
The first, the Top Directed Hypothesis Derivation framework, implemented in the ILP system
TopLog, involves the use of a top theory to constrain the hypothesis space. In the second approach
we revisit the bottom-up search strategy of Golem, lifting its restriction on determinate
clauses which had rendered Golem inapplicable to many key areas. These developments led to
the bottom-up ILP system ProGolem. A challenge that arises with a bottom-up approach is the
coverage computation of long, non-determinate, clauses. Prolog’s SLD-resolution is no longer
adequate. We developed a new, Prolog-based, theta-subsumption engine which is significantly
more efficient than SLD-resolution in computing the coverage of such complex clauses.
We provide evidence that ProGolem achieves the goal of learning complex concepts by presenting
a protein-hexose binding prediction application. The theory ProGolem induced has
a statistically significant better predictive accuracy than that of other learners. More importantly,
the biological insights ProGolem’s theory provided were judged by domain experts to
be relevant and, in some cases, novel.
Date Issued
2010-12
Date Awarded
2011-03
Advisor
Muggleton, Stephen
Sternberg, Michael
Sponsor
Wellcome Trust
Creator
Santos, Jose Carlos Almeida Santos
Grant Number
0807/12/Z/06/Z
Publisher Department
Computing
Publisher Institution
Imperial College London
Qualification Level
Doctoral
Qualification Name
Doctor of Philosophy (PhD)