Michael Brudno — Research

Current Projects:

Members of the group work on a diverse set of topics. Broadly, we work on Bioinformatics, applying techniques ranging from Theory, to Machine Learning, to Systems (on the computer science side) to Genome Analysis, Molecular Evolution, and Comparative Genomics (on the biology side). The projects below are some of the most active research areas within the group. One specific area of concentration is analysis of High Throughput Sequencing (HTS) data, including alignment, variant detection, visualization, and assembly.

Alignment & Mapping of Short Reads to a Genome

Together with Stephen Rumble (with contributions from Adrian Dalca, Marc Fiume, Vlad Yanovsky, and in a collaboration with Arend Sidow and his group) we developed SHRiMP -- the SHort Read Mapping Program. SHRiMP can align short reads to a reference genome quickly and accurately, while allowing for insertions/deletions. It also comes with special color-space options to handle reads made by the AB SOLiD technology. More recently Matei David, Misko Dzamba, and others have worked on the second version of SHRiMP (SHRiMP2), which significantly speeds up mapping without sacrificing sensitivity.
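
To illustrate what color-space reads look like, here is a minimal Python sketch of the two-base encoding used by SOLiD instruments, in which each color describes a pair of adjacent bases. This is not code from SHRiMP and the function names are hypothetical; the sketch assumes the standard mapping in which, with bases numbered A=0, C=1, G=2, T=3, a color is the XOR of the two base codes.

```python
# Two-base (color-space) encoding sketch; names are illustrative, not SHRiMP's API.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = "ACGT"


def encode_colors(bases):
    """Return the first base plus one color (0-3) per pair of adjacent bases."""
    colors = [BASE_CODE[a] ^ BASE_CODE[b] for a, b in zip(bases, bases[1:])]
    return bases[0], colors


def decode_colors(first_base, colors):
    """Rebuild the base sequence from the first base and the list of colors."""
    codes = [BASE_CODE[first_base]]
    for color in colors:
        # Each color, combined with the previous base, determines the next base.
        codes.append(codes[-1] ^ color)
    return "".join(CODE_BASE[c] for c in codes)
```

Because decoding is cumulative, a single miscalled color changes every base decoded after it, which is one reason to align such reads in color space rather than translating them to letters first.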

Variation Detection from HTS data

SNPs & micro-indels. With Adrian Dalca we developed VARiD, a variation detection tool that allows for the mixing of color-space and regular, base-space data, to predict SNPs and very small indels. VARiD is a Hidden Markov Model-based method that represents the genotype as hidden states, while the mapped reads and their bases and colors are emissions from the model. While VARiD has been shown to be the most accurate tool for predicting SNPs from color space data, it also allows mixing of various data-types to predict variants more accurately than can be obtained from color- and letter-space data alone.
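
The idea of treating genotypes as hidden states and read data as emissions can be sketched with a toy HMM decoded by the Viterbi algorithm. This is an illustration only, not VARiD's model: the three states, all probabilities, and the use of simple non-reference read counts as observations are assumptions made for the example. Transitions here do not depend on the previous state; a model that handles colors would need states that couple adjacent positions, since each color depends on two neighbouring bases.

```python
import math

# Toy genotype HMM; states, probabilities and names are illustrative assumptions.
STATES = ("ref", "het", "hom")
ALT_FRACTION = {"ref": 0.01, "het": 0.5, "hom": 0.99}  # expected share of non-reference reads
START = {"ref": 0.998, "het": 0.001, "hom": 0.001}
# For simplicity, the next genotype does not depend on the previous one.
TRANS = {prev: dict(START) for prev in STATES}


def log_emission(state, alt_reads, total_reads):
    """Binomial log-likelihood of seeing alt_reads non-reference reads."""
    p = ALT_FRACTION[state]
    return (math.log(math.comb(total_reads, alt_reads))
            + alt_reads * math.log(p)
            + (total_reads - alt_reads) * math.log(1 - p))


def viterbi(observations):
    """Most likely genotype path for a list of (alt_reads, total_reads) pairs."""
    scores = [{s: math.log(START[s]) + log_emission(s, *observations[0]) for s in STATES}]
    back = []
    for obs in observations[1:]:
        row, ptr = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda r: scores[-1][r] + math.log(TRANS[r][s]))
            row[s] = scores[-1][best] + math.log(TRANS[best][s]) + log_emission(s, *obs)
            ptr[s] = best
        scores.append(row)
        back.append(ptr)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]
```

Calling viterbi([(0, 20), (10, 20), (20, 20), (1, 20)]) labels four positions covered by 20 reads each; a single non-reference read at the last position is absorbed by the error allowance of the "ref" state.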

Structural variants. Our group has also worked extensively on detection of structural and copy-number variation in the human genome. Elango Cheran and Seunghak Lee developed the first rigorous framework for identifying structural variants (SVs) using paired-end data. Seunghak then extended this work into the MoDIL algorithm that exploits the key advantage of HTS data, the high sequencing coverage. MoDIL is able to identify smaller indels than was previously possible, and explicitly models both homozygous and heterozygous genome variation. MoGUL is a variation on the MoDIL framework that can handle many individuals sequenced at lower coverage.

Copy-number variants. With Paul Medvedev, Marc Fiume, Misko Dzamba, and Tim Smith we developed CNVer, a method for CNV detection that combines the depth-of-coverage (number of reads generated from a segment) and paired-end mapping information (such as MoDIL, above) within a unified computational framework called the donor graph, where paired ends are used to delineate the borders of CNVs. Combining the two information types allows us to mitigate the sequencing biases that cause uneven coverage and accurately predict CNVs.
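
The depth-of-coverage signal on its own can be sketched in a few lines of Python. This is a deliberately naive illustration, not CNVer or its donor graph: the fixed window size, the use of the median window as the "normal" two-copy state, and the function names are all assumptions for the example. It also ignores the sequencing biases mentioned above, which is exactly the weakness that combining coverage with paired-end information is meant to address.

```python
from statistics import median


def window_read_counts(read_starts, genome_length, window):
    """Count reads whose start position falls in each fixed-size window."""
    counts = [0] * ((genome_length + window - 1) // window)
    for pos in read_starts:
        counts[pos // window] += 1
    return counts


def copy_number_estimates(counts, baseline_copies=2):
    """Scale each window's count by the median count (assumed to be non-zero)."""
    typical = median(counts)
    return [round(baseline_copies * c / typical) for c in counts]
```

A window with twice the typical read count is reported as four copies and one with half the typical count as a single copy, but the window boundaries say nothing precise about where the variant starts or ends.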

Savant Browser for HTS Data

The Savant Genome Browser, developed by Marc Fiume, Andrew Brook, Eric Smith, and Vanessa Williams, is a desktop visualization tool for genomic data. It was primarily developed for visualizing high throughput (aka next generation) sequencing data, although it can be used to visualize virtually any genome-based dataset. Savant offers flexible bookmarking and a textual table view that allows for easy viewing and download of underlying data from any track. Savant also features a general plug-in framework, which enables users to compute directly on the data and create novel visualizations.

Algorithms for Genome Assembly

In a recent paper with Paul Medvedev, Konstantinos Georgiou, and Gene Myers we analyzed the complexity of several popular assembly paradigms, as well as the problem of assembly of double-stranded DNA molecules rather than single-stranded strings. Following up on this work, with Paul Medvedev we developed an algorithm for genome assembly with short, mated reads via convex optimization, and the all-pairs shortest path algorithm. It was published at RECOMB 2008. Together with Nilgun Donmez I am now working on expanding our assembly framework for assembly of a diploid organism with extensive genome variation.

Genome Variation in C. savignyi

Several members of our group are exploring the variation present in the genome of Ciona savignyi. In collaboration with Arend Sidow's group (Kerrin Small et al.) I previously worked to assemble and explain the extensive variation in this genome. Assembling the Ciona genome was especially difficult because of its high polymorphism rate: 5%, or 50-fold higher than in humans. More recently, in collaboration with Alexey Kondrashov and Yegor Bazykin, Nilgun Donmez and I explored evidence of positive selection in this genome. Currently Louie Dinh is looking at additional individuals sequenced with HTS technologies to identify selective sweeps and other population-genetic phenomena.

Past Projects:

DNA Alignment

I led the development of the LAGAN toolkit, which consists of several algorithms for sequence alignment. LAGAN was developed in Serafim Batzoglou's lab at Stanford; Chuong Do, Sanket Malde, Michael F. Kim and Mukund Sundararajan have contributed to various programs in the package. LAGAN has been cited in over 500 publications. Adrian Dalca and I have worked to generalize the sequence alignment scoring schemes into a common framework we call "Rectangle Scoring", implemented in the FRESCO Package.

LAGAN proper consists of three main parts:

(Multi-)LAGAN
LAGAN is a global aligner for long genomic sequences. It has proven effective at aligning not only closely related genomes, such as those of mammals, but also more distant ones: its alignments have demonstrated significant conservation of non-coding functional elements between organisms as distant as mammals and fish.
Shuffle-LAGAN
Shuffle-LAGAN is a glocal aligner (one that combines features of global and local alignment) for genomic sequences that have undergone rearrangements. The initial approach was for alignment of two sequences, which we have extended to alignment of whole genomes. A multiple sequence version of Shuffle-LAGAN is in the works.
CHAOS
CHAOS is a highly sensitive local aligner I wrote in a collaboration with Burkhard Morgenstern. It is used as the anchoring system in the LAGAN programs and in the CHAOS/DIALIGN alignment program. CHAOS has been used for C. intestinalis-C. savignyi comparisons and human-fish comparisons.
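
The difference between global and local alignment, which the three tools above build on, can be sketched with the textbook dynamic programs (Needleman-Wunsch and Smith-Waterman). This is not the LAGAN implementation: the scoring values and the function name are assumptions for the example, and filling the full table as done here is only practical for short sequences, which is why aligners for long genomic sequences rely on anchoring.

```python
def alignment_score(a, b, local=False, match=1, mismatch=-1, gap=-2):
    """Best global (Needleman-Wunsch) or local (Smith-Waterman) score, linear gaps."""
    # Global alignment pays for leading gaps; local alignment may start anywhere.
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, prev[j] + gap, curr[j - 1] + gap)
            if local:
                cell = max(cell, 0)      # a local alignment can restart at zero
                best = max(best, cell)   # and may end at any cell
            curr.append(cell)
        prev = curr
    return best if local else prev[-1]
```

For "ACGT" against "TTACGTTT" the global score is penalized for the four unmatched flanking bases, while the local score reflects only the shared "ACGT" core.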

Whole Genome Alignments

Working within the Rat Genome Consortium we developed some of the first methods for multiple alignment of whole genomes, and applied them to the comparison and analysis of the rat genome. More recently I worked with Inna Dubchak's group on developing methodologies for whole genome synteny mapping using the Shuffle-LAGAN algorithm.

SnowFlock: Parallelization with Virtual Machines

In collaboration with Andres Lagar Cavilla and Eyal de Lara, Joe Whitney and Stephen Rumble have been working to enable the use of Virtual Machines for parallelizable applications as part of the SnowFlock project. This work won the Best Paper Award at EuroSys 2009. GridCentric is a start-up based on the SnowFlock technology, for which I act as a scientific advisor.

Protein Sequence Alignment

I participated in the development of the ProbCons protein aligner that was written by Chuong (Tom) Do. This aligner combines the ideas of consistency introduced in previous programs such as DIALIGN and T-COFFEE, with a maximum expected accuracy parse of the alignment pair-HMM that leads to results more accurate than other alignment tools, but with no heuristics.
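
The maximum expected accuracy step can be sketched as a small dynamic program: given posterior probabilities that pairs of residues are aligned, choose the alignment that maximizes the sum of the posteriors of its aligned pairs. This is a simplified illustration, not ProbCons code. The function name is hypothetical, and the posterior matrix is taken as a given toy input, whereas in practice it would be computed from the pair-HMM.

```python
def max_expected_accuracy(posterior):
    """Return (score, pairs) maximizing the summed posterior of aligned pairs.

    posterior[i][j] is the assumed probability that residue i of the first
    sequence is aligned to residue j of the second.
    """
    n, m = len(posterior), len(posterior[0])
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(best[i - 1][j - 1] + posterior[i - 1][j - 1],
                             best[i - 1][j], best[i][j - 1])
    # Trace back to recover which residue pairs were aligned.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if best[i][j] == best[i - 1][j]:
            i -= 1
        elif best[i][j] == best[i][j - 1]:
            j -= 1
        else:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
    return best[n][m], pairs[::-1]
```

Note that there are no gap penalties in this step: leaving a residue unaligned simply contributes nothing to the score, so the result is driven entirely by the posterior probabilities.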

Alternative Splicing Regulation

Alternative splicing is an important regulatory mechanism known to be used in about half of all mammalian genes. During this process an exon present in DNA may be left out of the mature mRNA, and hence will not be translated into protein. This mechanism can be used to tailor the protein to the current needs of the cell, and many of the known alternatively spliced exons are either tissue-specific or development-specific. With John Conboy, Inna Dubchak, and Mikhail Gelfand we worked on identification of enhancers of alternative splicing.

Alignment Visualization

My work in sequence alignment has led me to think extensively about methods to interpret the resulting alignments for the biologist. This interest has led to my participation in both the VISTA and Phylo-VISTA projects with Inna Dubchak, Kelly Frazer and many others.