Members of the group work on a diverse set of topics. Broadly we work
on Bioinformatics, applying techniques ranging from
Theory, to Machine Learning, to Systems (on the computer science side)
to Genome Analysis, Molecular Evolution, and Comparative Genomics (on the
biology side). The projects below are some of the most active research
areas within the group. One specific area of concentration is the
analysis of High Throughput Sequencing (HTS) data, including alignment,
variant detection, visualization, and assembly.
Alignment & Mapping of Short Reads to a Genome
Together with Stephen Rumble (with contributions from Adrian Dalca, Marc Fiume, and Vlad Yanovsky, and in collaboration
with Arend Sidow and his
group) we developed
SHRiMP -- the SHort Read Mapping Program.
SHRiMP can align short reads to a reference genome quickly and accurately,
while allowing for insertions/deletions. It also comes with special
color-space options to handle reads made by the
AB SOLiD technology. More recently Matei David,
Misko Dzamba, and others have worked on the second version of SHRiMP (SHRiMP2), which significantly
speeds up mapping without sacrificing sensitivity.
Variation Detection from HTS data
SNPs & micro-indels. With Adrian Dalca we developed
VARiD, a variation detection tool
that allows for the mixing of color-space and regular, base-space data, to predict
SNPs and very small indels. VARiD is a Hidden Markov Model-based method that represents the genotype
as hidden states, while the mapped reads and their bases and colors are emissions from the model.
VARiD has been shown to be the most accurate tool for predicting SNPs from color-space data. It
also allows mixing of data types, which predicts variants more accurately than is possible from
either color-space or letter-space data alone.
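To make the distinction between color-space and base-space reads concrete, here is a minimal sketch of the two-base (di-base) encoding used by SOLiD reads. It is an illustration only, not code from VARiD; the function names `encode_colors` and `decode_colors` are hypothetical. Each color is the XOR of the two-bit codes of two adjacent bases, so a read is decoded from a known first base plus its colors.

```python
# Two-bit codes for the four bases; the color of a dinucleotide is the
# XOR of the codes of its two bases (e.g. AA -> 0, AC -> 1, AG -> 2, AT -> 3).
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = "ACGT"


def encode_colors(seq):
    """Return the list of colors for a base-space sequence."""
    return [BASE_CODE[a] ^ BASE_CODE[b] for a, b in zip(seq, seq[1:])]


def decode_colors(first_base, colors):
    """Rebuild the base-space sequence from its first base and its colors."""
    codes = [BASE_CODE[first_base]]
    for color in colors:
        # Each base is the previous base XORed with the color between them.
        codes.append(codes[-1] ^ color)
    return "".join(CODE_BASE[c] for c in codes)


reference = "ACGT"
variant = "ATGT"  # a single-base substitution at the second position
ref_colors = encode_colors(reference)
var_colors = encode_colors(variant)
# Positions at which the two color strings differ.
changed = [i for i, (r, v) in enumerate(zip(ref_colors, var_colors)) if r != v]
```

In this example the substitution changes two adjacent colors, whereas a single miscalled color would alter every base decoded after it. That difference is one reason color-space data calls for a dedicated model rather than a naive translation to bases.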
Structural variants. Our group has also worked extensively on
detection of structural and copy-number variation in the human genome. Elango Cheran and Seunghak Lee
developed the first rigorous framework for identifying structural variants (SVs) using paired-end data.
Seunghak then extended this work into the MoDIL algorithm that exploits the key advantage of HTS
data, the high sequencing coverage. MoDIL is able to identify smaller indels than was previously
possible, and explicitly models both homozygous and heterozygous genome variation. MoGUL is a variation on
the MoDIL framework that can handle many individuals sequenced at lower coverage.
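The intuition for why high coverage helps with small indels can be sketched with a much simpler statistic than the one MoDIL uses. The snippet below is an assumption-laden illustration, not the MoDIL algorithm: it assumes a single homozygous event, a known library insert-size mean and standard deviation, and a hypothetical helper named `estimate_indel`. The shift in the mean mapped distance of the read pairs spanning a locus estimates the indel size, and its standard error shrinks with the square root of the number of spanning pairs.

```python
import math
import statistics


def estimate_indel(mapped_distances, library_mean, library_sd):
    """Estimate an indel size from read pairs spanning one locus.

    A positive shift means the pairs map farther apart on the reference
    than the library predicts, which suggests a deletion in the donor;
    a negative shift suggests an insertion.
    """
    n = len(mapped_distances)
    shift = statistics.fmean(mapped_distances) - library_mean
    # The standard error of the mean falls as more pairs span the locus.
    std_error = library_sd / math.sqrt(n)
    return shift, shift / std_error


# 25 spanning pairs, each mapping 20 bp farther apart than the library mean.
shift, z_score = estimate_indel([220] * 25, library_mean=200, library_sd=10)
```

With a single spanning pair, a 20 bp shift is only two library standard deviations from the mean; with 25 pairs the same shift is ten standard errors away. A heterozygous event would instead produce a mixture of shifted and unshifted pairs, which this sketch does not model.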
Copy-number variants. With Paul Medvedev, Marc Fiume, Misko Dzamba, and Tim Smith
we developed CNVer, a method for CNV detection that combines the
depth-of-coverage (number of reads generated from a segment) and paired-end mapping information
(such as MoDIL, above) within a unified computational framework called the donor graph, where paired
ends are used to delineate the borders of CNVs. Combining the two information types allows us to
mitigate the sequencing biases that cause uneven coverage and accurately predict CNVs.
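As a rough illustration of the depth-of-coverage signal alone, the sketch below counts read start positions in fixed-size windows and normalizes each count by the median window. It is not CNVer and does not build a donor graph; the function names, the window size, and the use of the median as a baseline are all assumptions made for this example.

```python
from statistics import median


def window_counts(read_starts, genome_length, window):
    """Count reads whose 0-based start position falls in each window."""
    n_windows = (genome_length + window - 1) // window
    counts = [0] * n_windows
    for pos in read_starts:
        counts[pos // window] += 1
    return counts


def copy_ratio(counts):
    """Normalize window counts by the median count (assumed copy-neutral)."""
    baseline = median(counts)
    if baseline == 0:
        raise ValueError("median coverage is zero; cannot normalize")
    return [c / baseline for c in counts]


# Toy data: the third window receives twice as many reads as the others.
starts = [0, 1, 10, 11, 20, 21, 22, 23, 30, 31]
counts = window_counts(starts, genome_length=40, window=10)
ratios = copy_ratio(counts)
```

A ratio near 2 in a window hints at a duplication, but on real data the same signal can arise from sequencing bias, and window edges rarely coincide with the true breakpoints. Those are the weaknesses that paired-end information is used to address.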
Savant Browser for HTS Data
The Savant Genome Browser, developed by Marc Fiume, Andrew Brook,
Eric Smith, and Vanessa Williams, is a desktop visualization tool for genomic data. It was primarily
developed for visualizing high throughput (aka next generation) sequencing data, although it can
be used to visualize virtually any genome-based dataset. Savant offers flexible bookmarking and
a textual table view that allows for easy viewing and download of underlying data from any track.
Savant also features a general plug-in framework, which enables users to directly compute on
the data and create novel visualizations.
Algorithms for Genome Assembly
In a recent paper with
Konstantinos Georgiou and
Gene Myers we analyzed
the complexity of several popular assembly paradigms, as well as the
problem of assembly of double-stranded DNA molecules rather than
single-stranded strings. Following up on this work, with Paul Medvedev
we developed an algorithm for genome assembly with short, mated reads
via convex optimization, and the all-pairs shortest path algorithm. It was
published at RECOMB 2008.
Together with Nilgun Donmez I am now working on expanding our assembly framework for assembly of a
diploid organism with extensive genome variation.
Genome Variation in C. savignyi
Several members of our group are exploring the variation
present in the genome of Ciona savignyi. In collaboration with Arend Sidow's group (Kerrin Small et al.)
I previously worked to assemble and explain the extensive variation
in this genome. Assembling the Ciona genome was especially difficult
because of its high polymorphism rate: 5%, or 50-fold higher than in humans.
More recently, in collaboration with Alexey Kondrashov and Yegor Bazykin, Nilgun Donmez and I explored
evidence of positive
selection in this genome. Currently Louie Dinh is looking at additional
individuals sequenced with HTS technologies to identify selective sweeps and other population-genetic phenomena.
I led the development of the
LAGAN toolkit, which consists of several algorithms for sequence
alignment. LAGAN was developed in
Serafim Batzoglou's lab at
Stanford; Chuong Do, Sanket Malde,
Michael F. Kim and Mukund Sundararajan have contributed to various programs in the
package. LAGAN has been cited in over 500 publications. Adrian Dalca and I have worked to generalize
the sequence alignment scoring schemes into a common
framework we call "Rectangle Scoring", implemented in the
LAGAN proper consists of three main parts:
LAGAN is a global aligner for long genomic sequences.
It has proven effective not only at aligning closely related
genomes, such as those of
mammals, but also at revealing significant conservation of
non-coding functional elements between distant organisms such as
mammals and fish.
Shuffle-LAGAN is a glocal aligner (one that combines features of global and local
alignment) for genomic sequences that have undergone rearrangements. The initial
approach was for alignment of two sequences, which we have extended to
alignment of whole genomes. A multiple
sequence version of shuffle-LAGAN is in the works.
CHAOS is a highly sensitive local aligner I wrote in a collaboration with
Burkhard Morgenstern. It is used as the
anchoring system in the LAGAN programs and in the
CHAOS/DIALIGN alignment program. CHAOS has been used for
C. intestinalis-C. savignyi comparisons and
Whole Genome Alignments
Working within the Rat Genome Consortium we developed some of the first
methods for multiple alignment of whole genomes, and applied them
to the comparison and analysis of the rat genome. More recently I worked
with Inna Dubchak's group
on developing methodologies for whole genome synteny mapping using the
In collaboration with Andres
Lagar Cavilla and Eyal de
Lara, Joe Whitney and Stephen Rumble have been working to enable the use
of Virtual Machines for parallelizable applications as part of the SnowFlock project. This work won the
Best Paper Award at EuroSys 2009. GridCentric is a start-up based on the SnowFlock technology,
for which I act as a scientific advisor.
Protein Sequence Alignment
I participated in the development of the
protein aligner that was written by
Chuong (Tom) Do. This aligner combines the ideas of consistency
introduced in previous programs such as
with a maximum expected accuracy parse of the alignment pair-HMM that leads to results
more accurate than other alignment tools, but with no heuristics.
Alternative Splicing Regulation
Alternative splicing is an important regulatory mechanism known to be used in about half
of all mammalian genes. During this process an exon present in DNA may be left out of
the mature mRNA, and hence will not be converted into a protein. This mechanism can be
used to tailor the protein to the current needs of the cell, and many of the known
alternative splicing exons are either tissue-specific or development-specific. With
John Conboy, Inna Dubchak, and
we worked on identification of enhancers of alternative splicing.
My work in sequence alignment has led me to think extensively about methods to
interpret the resulting alignments for the biologist. This
interest has led to my participation in both
Phylo-VISTA projects with
Kelly Frazer and many others.