CHAOS: A Heuristic Algorithm for Local Alignment

Michael Brudno and Burkhard Morgenstern

I. Introduction

CHAOS (CHAins Of Scores) is a novel heuristic local alignment program optimized for intronic and intergenic DNA. Most of the methods for heuristic local alignment, such as BLAST[1] and FASTA[2] were developed in the late 80s, when the bulk of available sequences were proteomic. With the new availability of genomic sequences it is appropriate to refine the algorithms used for local alignment so that these more closely reflect the fashion in which the non-coding sequences are conserved.

The major shortcoming of BLAST is that it relies on long exact matches that are not likely to be conserved on the genomic level. Further, both BLAST and FASTA require significant ungapped homology between two regions before allowing gaps in the alignment. While both of these assumptions were appropriate for coding sequences, they are less likely to hold in other regions of the genome.

II Description

The main idea behind the algorithm lies in the chaining together of similar regions, or seeds. While 1 or 2 such seeds by themselves do not necessairly indicate a homology, several of them within a limited distance of each other indicate a common ancestry.

The algorithm is based on initially finding all of the seeds. A seed is a pair of k-long words with at least n identical base pairs (bp). A seed k1 can then be chained to the seed k2 whenever the indeces of k1 in both sequences are higher than the indeces of k2, and k1 and k2 are "near" each other, with "near" defined by both a distance and a gap criteria. The final score of a chain is the total number of matching bp in it. There is no explicit gap penalty for matching seeds which are seperated by an unequal number of bases in the two sequences.

The seeds are located using the trie [3] data structure. The seeds are chained together by conducting range queries by the number of the diaginal over the previously examined seeds stored in a skip-list data structure [4]. Because the seeds are not generated until they are needed, with a seed of length 7 aligning 200kbp verses 200kbp needs only 10mb of memory.

III Results

We have done preliminary tests of CHAOS by using it to anchor the global alignment program DIALIGN [5]. By picking the highest scoring increasing set of CHAOS hits between two sequences we allowed DIALIGN to compare only the sequences between the hits. In several tests the running time of DIALIGN had decreased 70-90%, while the quality scores fell by only 1-2%. We are currently testing CHAOS on the data set from Jareborg, et al [6].

IV Future Work and Availability

We are currently working on defining a sound statistical confidence value of a chain returned by CHAOS, and throughly testing it on a large dataset. CHAOS is freely available at http://www.stanford.edu/~brudno/chaos/

[1] Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D.J. "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs." Nucleic Acids Res. 25:3389-3402. 1997

[2] Pearson, WR. "Flexible sequence similarity searching with the FASTA3 program package." Methods Mol Biol. 2000;132:185-219.

[3] Fredkin, E. "Trie memory." Comm. ACM, vol. 3:490-500, 1960.

[4] Pugh, W. "Skip Lists: Skip lists: A probabilistic alternative to balanced trees." Comm ACM, 33:668-676, 1990.

[5] Morgenstern, B. "DIALIGN 2: improvement of the segment-to-segment approach to multiple sequence alignment."

[6] Jareborg, N. Birney, E & Durbin R. "Comparative analysis of noncoding regions of 77 orthologous mouse and human gene pairs." Genome Res 1999 Sep;9(9):815-24