Beyond gap models: Reconstructing alignments and phylogenies under genomic-scale events.

Abstract

Multiple sequence alignment (MSA) has long been a mainstay of bioinformatics and has proved quite useful for aligning well-conserved protein and DNA sequences; some of these alignments have also been used with great success in phylogenetic reconstruction. Sequences with low percentage identity, on the other hand, typically yield poor alignments. Now that researchers want to produce alignments among widely divergent genomes, including both coding and noncoding sequences, we need to revisit both multiple alignment and phylogenetic reconstruction under more ambitious models, ones that take into account the plethora of genomic events rather than just substitutions and insertions/deletions (indels). We also need to revisit multiple sequence alignment and phylogeny reconstruction for datasets currently beyond the capability of existing methods because of high rates of site substitution, high indel rates, or large numbers of taxa or sites. Most current methods postulate only two types of events: substitutions (modeled with a substitution matrix, such as the PAM or BLOSUM matrices for protein data) and indels (rarely modeled beyond a simple affine cost function of the gap length). While these two events can indeed transform any sequence into any other, they form far too simplistic a model of genomic events: substitutions are neither location-independent nor neighbor-independent, and indels can be caused by a variety of complex events, such as uneven recombination, insertion of transposable elements, gene duplication/loss, and lateral transfer. Moreover, genomic rearrangement events can completely mislead procedures based on most current models, resulting in a total loss of alignment when a homologous element has undergone an inversion or a duplication. Computational biologists have been studying genome rearrangements for 20 years and have started work on duplication and loss events.
Taking these events into account in a multiple alignment will require the simultaneous construction of the alignment and of the phylogenetic tree, an approach also known as phylogenetic alignment. Until very recently, the computational complexity of phylogenetic alignment was widely viewed as too high, but the state of the art in phylogenetic reconstruction has advanced significantly over the last 10 years, in both accuracy and computational efficiency, so that what was then impossible is now merely difficult.
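
As a small illustration of two points made above, the following Python sketch shows (1) an affine gap cost of the kind most current methods use for indels and (2) why an inversion defeats a position-by-position comparison. It is not taken from the paper: the function names, the default penalties, the toy sequence, and the convention that the opening penalty covers the first gap position are all assumptions chosen for the example.

```python
def affine_gap_cost(length, gap_open=10.0, gap_extend=0.5):
    """Cost of one gap of `length` positions under an affine model.

    Assumed convention: `gap_open` pays for the first position and
    `gap_extend` for each additional one. Other tools use
    gap_open + gap_extend * length instead.
    """
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * (length - 1)


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq))


def identity(a, b):
    """Fraction of positions at which two equal-length strings agree."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Hypothetical homologous segment and its inverted copy in another genome.
segment = "AAAACCCC"
inverted = reverse_complement(segment)

# Compared position by position, the inverted copy looks unrelated ...
naive_identity = identity(segment, inverted)
# ... but undoing the inversion recovers the homology completely.
restored_identity = identity(segment, reverse_complement(inverted))
```

Under these assumptions a gap of five positions costs only slightly more than a gap of one, which is what distinguishes an affine cost from a linear one, while the inverted segment scores zero identity against its own ancestor unless the inversion is modeled explicitly.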

Publication
Pacific Symposium on Biocomputing, 13: 1-2, 2008