Algorithms for comparison of DNA sequences

Michael Brudno

Abstract

To compare entire genomes from different species, biologists increasingly need alignment methods that are efficient enough to handle long sequences and accurate enough to correctly align the conserved biological features between distant species. The sequencing of several mammalian genomes necessitated the development of tools for multiple alignment of large genomes. In this dissertation we describe several algorithms for alignment of long genomic sequences. The most basic of all alignment problems is that of local alignment. In this problem, one is asked to return all regions of similarity that score above a particular threshold under some distance metric. We developed CHAOS, a novel heuristic local alignment algorithm that is meant to model the evolution of non-coding regions of the genome and has been shown to perform well at alignment of very distant genomic sequences. In the problem of global alignment, one introduces an additional constraint of monotonicity: the set of similarities has to appear in the same order in the two sequences. We developed LAGAN, an algorithm for multiple alignment of long DNA sequences. This program was shown to correctly align exons from divergent organisms and has been used extensively by biologists over the last two years. As one example of LAGAN's use, we show the multiple alignment of the human, mouse and rat genomes built using this program. In order to overcome the limitations of local and global alignment methods, we have introduced the concept of glocal alignment, where one finds a transformation of the letters of one sequence into another (as in global alignment) while allowing for various rearrangement events (as in local alignment). An approach to this problem was implemented in the Shuffle-LAGAN program, which has been shown to be effective at finding genomic rearrangements.
Finally, we demonstrate a novel approach that uses variable-length words instead of fixed-length ones to find seeds: short matching words between a single sequence and a database of sequences, which are used to narrow down the search space for a local alignment algorithm. We demonstrate that these words (called var-mers) result in a fourfold decrease in the number of spurious seeds.
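
To make the notion of a seed concrete, the following is a minimal Python sketch of the conventional fixed-length baseline that the abstract contrasts with var-mers. It is not the var-mer method and is not taken from the dissertation; the function names (build_index, find_seeds), the word length k and the toy sequences are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def build_index(database, k):
    """Map every length-k word to the (sequence id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in enumerate(database):
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((seq_id, pos))
    return index


def find_seeds(query, index, k):
    """Return (query offset, sequence id, database offset) for each exact k-mer match."""
    seeds = []
    for qpos in range(len(query) - k + 1):
        word = query[qpos:qpos + k]
        # .get avoids creating empty entries in the defaultdict for unseen words
        for seq_id, dpos in index.get(word, ()):
            seeds.append((qpos, seq_id, dpos))
    return seeds


# Toy data, assumed for illustration only.
database = ["ACGTACGT", "TTGACGTT"]
index = build_index(database, k=4)
seeds = find_seeds("GACGT", index, k=4)
print(seeds)
```

Each seed marks a position pair that a local alignment algorithm would then try to extend. With a fixed word length, a shorter k tends to produce more chance matches and a longer k tends to miss diverged regions, which is the trade-off that variable-length words are meant to address.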

Publication

Date

January, 2004
