
Are grammatical representations useful for learning from biological
sequence data? - a case study
S. H. Muggleton, C. H. Bryant, A. Srinivasan, A. Whittaker, S. Topp, and C. Rawlings.
Journal of Computational Biology, 8(5):493-522, October 2001.
Abstract
This paper investigates whether Chomsky-like grammar representations are useful
for learning cost-effective, comprehensible predictors of members of
biological sequence families. The Inductive Logic Programming (ILP) Bayesian
approach to learning from positive examples is used to generate a grammar for
recognising a class of proteins known as human neuropeptide precursors
(NPPs). Collectively, five of the co-authors of this paper have extensive
expertise in NPPs and general bioinformatics methods. Their motivation for
generating an NPP grammar was that none of the existing bioinformatics methods
could provide sufficient cost savings during the search for new NPPs. Prior
to this project, experienced specialists at SmithKline Beecham had tried for
many months to hand-code such a grammar, without success. Our best
predictor makes the search for novel NPPs more than 100 times more
efficient than randomly selecting proteins for synthesis and testing them
for biological activity. As far as these authors are aware, this is both the
first biological grammar learnt using ILP and the first real-world scientific
application of the ILP Bayesian approach to learning from positive examples.
A group of features is derived from this grammar. Other groups of features of
NPPs are derived using other learning strategies. Amalgams of these groups
are formed. A recognition model is generated for each amalgam using C4.5 and
C4.5rules and its performance is measured using both predictive accuracy and
a new cost function, Relative Advantage (RA). The highest RA was
achieved by a model which includes grammar-derived features. This RA is
significantly higher than the best RA achieved without the use of the
grammar-derived features. Predictive accuracy is not a good measure of
performance for this domain because it does not discriminate well between NPP
recognition models: despite covering varying numbers of (the rare) positives,
all the models are awarded a similar (high) score by predictive accuracy
because they all exclude most of the abundant negatives.
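
The closing point about predictive accuracy can be illustrated with a small sketch. The counts below are hypothetical and are not taken from the paper; they only assume a test set in which positives are rare. The `Confusion` class and the two models are names invented for this illustration, and the sketch does not compute RA, whose definition the abstract does not give.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Counts of a binary recogniser's decisions on a test set."""
    tp: int  # positives correctly recognised
    fn: int  # positives missed
    fp: int  # negatives wrongly accepted
    tn: int  # negatives correctly rejected

    def accuracy(self) -> float:
        # Fraction of all test sequences that were classified correctly.
        total = self.tp + self.fn + self.fp + self.tn
        return (self.tp + self.tn) / total

    def recall(self) -> float:
        # Fraction of the (rare) positives that the model covers.
        return self.tp / (self.tp + self.fn)


# Hypothetical test set: 10 positives among 10,000 sequences.
# Both models wrongly accept 5 negatives; they differ only in positives found.
model_a = Confusion(tp=2, fn=8, fp=5, tn=9985)
model_b = Confusion(tp=8, fn=2, fp=5, tn=9985)

for name, m in (("A", model_a), ("B", model_b)):
    print(f"model {name}: accuracy={m.accuracy():.4f} recall={m.recall():.2f}")
```

With these assumed counts, model B covers four times as many positives as model A, yet the two accuracies differ by less than 0.001, because both are dominated by the correctly excluded negatives.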
S H Muggleton, stephen@cs.york.ac.uk
C H Bryant, bryant@cs.york.ac.uk
A Srinivasan, Ashwin.Srinivasan@comlab.ox.ac.uk
Last modified on Wednesday 9 April 2003 at 18:31. © 2003 ILPnet2