Background: We investigate whether annotation of gene
function can be improved using a classification scheme that is aware
that functional classes are organized in a hierarchy. The classifiers
look at phylogenic descriptors, sequence-based attributes, and
predicted secondary structure. We discuss three Bayesian models and
compare their performance in terms of predictive accuracy. These
models are the ordinary multinomial logit (MNL) model, a hierarchical
model based on a set of nested MNL models, and an MNL model with a
prior that introduces correlations between the parameters for classes
that are nearby in the hierarchy. We also provide a new scheme for
combining different sources of information. We use these models to
predict the functional class of Open Reading Frames (ORFs) from the
E. coli genome.
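The third model can be illustrated with a small sketch. The abstract does not spell out how the hierarchy-based prior is built, so the construction below is an assumption: each branch of the class hierarchy gets its own independent Gaussian parameter vector, and the MNL coefficients for a class are the sum of the branch parameters on its path from the root. The toy hierarchy `PATHS`, the function names, and the unit prior variance are all hypothetical and chosen only for illustration.

```python
import math
import random

# Hypothetical toy hierarchy: each class is listed with the branches on its
# path from the root. Classes 0 and 1 share branch "A"; 2 and 3 share "B".
PATHS = {
    0: ["A", "A0"],
    1: ["A", "A1"],
    2: ["B", "B2"],
    3: ["B", "B3"],
}


def sample_branch_params(n_features, sd, rng):
    """Draw one independent Gaussian parameter vector per branch (assumed prior)."""
    branches = sorted({b for path in PATHS.values() for b in path})
    return {b: [rng.gauss(0.0, sd) for _ in range(n_features)] for b in branches}


def class_coefficients(phi):
    """Sum branch parameters along each class's path to get its coefficients."""
    n = len(next(iter(phi.values())))
    return {
        c: [sum(phi[b][j] for b in path) for j in range(n)]
        for c, path in PATHS.items()
    }


def mnl_probabilities(beta, x):
    """Multinomial logit: softmax of the linear scores x . beta_c."""
    scores = {c: sum(w * xi for w, xi in zip(coef, x)) for c, coef in beta.items()}
    top = max(scores.values())  # subtract the maximum for numerical stability
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}


def prior_correlation(c1, c2, n_draws=20000, seed=0):
    """Monte Carlo estimate of the prior correlation between the
    (single) coefficient of class c1 and that of class c2."""
    rng = random.Random(seed)
    a, b = [], []
    for _ in range(n_draws):
        beta = class_coefficients(sample_branch_params(1, 1.0, rng))
        a.append(beta[c1][0])
        b.append(beta[c2][0])
    mean_a, mean_b = sum(a) / n_draws, sum(b) / n_draws
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b)) / n_draws
    var_a = sum((u - mean_a) ** 2 for u in a) / n_draws
    var_b = sum((v - mean_b) ** 2 for v in b) / n_draws
    return cov / math.sqrt(var_a * var_b)


if __name__ == "__main__":
    rng = random.Random(1)
    beta = class_coefficients(sample_branch_params(3, 1.0, rng))
    print(mnl_probabilities(beta, [0.5, -1.0, 2.0]))
    print("siblings:", prior_correlation(0, 1))
    print("different subtrees:", prior_correlation(0, 2))
```

Under this assumed construction, classes that share a branch share a term in their coefficients, so their parameters are correlated a priori (with equal branch variances, the sibling correlation is 0.5 in expectation), while classes in different subtrees remain uncorrelated. Dropping the shared branches recovers an ordinary MNL prior with independent class parameters.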
Results: The results from all three models show substantial improvement over previous methods, which were based on the C5 decision tree algorithm. The MNL model using a prior based on the hierarchy outperforms both the non-hierarchical MNL model and the nested MNL model. In contrast to previous attempts at combining the three sources of information in this dataset, our new approach to combining data sources produces a higher accuracy rate than applying our models to each data source alone.
Conclusion: Together, these results show that gene function can be predicted with higher accuracy than previously achieved, using Bayesian models that incorporate suitable prior information.
BMC Bioinformatics, 2006, 7:448, 9 pages.
Shahbaba, B. and Neal, R. M. (2006) "Gene function classification using Bayesian models with hierarchy-based priors", Technical Report No. 0606, Dept. of Statistics, 14 pages.
The class of models used here for gene function classification was introduced in the following technical report:
Shahbaba, B. and Neal, R. M. (2005) "Improving classification when a class hierarchy is available using a hierarchy-based prior", Technical Report No. 0510, Dept. of Statistics, 11 pages.