Analysis of Liquid Chromatography Mass Spectrometry (LC-MS) Data


In this work we aim to develop and apply statistical and computational methods for extracting information in novel ways (i.e., without use of MS/MS) from Liquid Chromatography-Mass Spectrometry (LC-MS) proteomic data. Advancement in the area of LC-MS analysis has the potential to make a significant impact on human health by playing a key role in the solution to problems such as (i) uncovering early disease markers, which can then lead to improved patient outcomes, (ii) drug development by way of identifying target molecules or pathways, and (iii) systematic monitoring of physiological responses to allow for a more comprehensive understanding of drug action and disease. It could also help with more general/basic problems confronting modern biology by providing a new kind of microscopic window into living organisms.




Related Papers:

  1. Analysis of sibling time series data: alignment and difference detection. (abstract and thesis)
    Jennifer Listgarten,
    Ph.D. Thesis, Department of Computer Science, University of Toronto, 2006.

  2. Bayesian Detection of Infrequent Differences in Sets of Time Series with Shared Structure. (abstract, paper)
    Jennifer Listgarten, Radford M. Neal, Sam T. Roweis, Rachel Puckrin and Sean Cutler,
    To appear in Advances in Neural Information Processing Systems 19, MIT Press, Cambridge, MA, 2007 (NIPS 2006).
    Best Student Paper, Honorable Mention

  3. Difference detection in LC-MS data for protein biomarker discovery. (abstract, paper and data set)
    Jennifer Listgarten, Radford M. Neal, Sam T. Roweis, Peter Wong and Andrew Emili,
    In Bioinformatics, 2007, 23:e198-e204 [by way of ECCB 2006 (European Conference on Computational Biology)]
    Best Student Paper, 3rd prize

  4. Practical proteomic biomarker discovery: taking a step back to leap forward. (abstract, paper)
    Jennifer Listgarten and Andrew Emili,
    In Drug Discovery Today, 2005, 10:1697-1702.

  5. Statistical and computational methods for comparative proteomic profiling using liquid chromatography-tandem mass spectrometry. (abstract, paper)
    Jennifer Listgarten and Andrew Emili,
    In Molecular and Cellular Proteomics, 2005, 4:419-434.

  6. Multiple Alignment of Continuous Time Series. (abstract, paper, slides, and audio demo)
    Jennifer Listgarten, Radford M. Neal, Sam T. Roweis and Andrew Emili,
    In Advances in Neural Information Processing Systems 17, MIT Press, Cambridge, MA, 2005.
    The Continuous Profile Models (CPM) Matlab Toolbox is now available.

    The Short Story:

    Our specific goals in this project are to:

    • Develop a principled method for alignment of the experimental LC time axis so that times can be compared across experiments, and so that experimental LC-MS replicates (numerous runs with the same biological specimen) can be combined to improve the signal-to-noise ratio. This first goal is a critical sub-task of the other two goals, and is an interesting problem in its own right. Principled alignment of time series data is useful in many domains, such as speech processing and time-series microarray experiments, to name but a few.

    • Develop a probabilistic model which allows us to predict which of two classes an unknown specimen belongs to.

    • Develop statistical and computational methods which allow us to reliably detect differences in protein composition between two related sets of specimens (e.g., cancer versus non-cancer), without use of tandem mass spectrometry (MS/MS).
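    As a point of reference for the first goal, the classic dynamic-programming approach to pairwise time alignment is dynamic time warping (DTW). The sketch below is a generic baseline for comparison only, not the model developed in this project:

```python
import numpy as np

def dtw_align(x, y):
    """Classic dynamic time warping: align series x to series y.

    Returns the optimal warping cost and the warping path as a list
    of (i, j) index pairs. A standard textbook baseline, not the CPM
    described on this page.
    """
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Backtrack through the cost matrix to recover the alignment path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return D[n, m], path[::-1]

# A series and a time-distorted copy of it align with low cost.
x = np.sin(np.linspace(0, 6, 50))
y = np.sin(np.linspace(0.5, 6.5, 60))
cost, path = dtw_align(x, y)
```

    Unlike the CPM, DTW aligns only pairs of series, has no notion of a latent trace, and does not model abundance rescaling, which is part of the motivation for the probabilistic approach taken here.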



    The Longer Story:

    Proteins Play a Starring Role

    The central dogma of molecular biology tells us that DNA is transcribed into mRNA, and mRNA is translated into protein, which in turn directly affects many events in the body, from very local events like cell death to more global events such as muscle contraction. Though DNA is often called the blueprint of life, there is, under most circumstances, no simple mapping between a person's DNA sequence and, for example, that person's health, intelligence or metabolism. In between DNA and disease lies a complex biological machinery, in which proteins play a fundamental role. As such, emphasis in molecular biology is shifting away from DNA sequencing and analysis toward the study of richer downstream events such as those centered on proteins.

    Measuring Protein Levels

    The biological research community is tackling the large-scale study of proteins, referred to as proteomics, with a range of high-throughput technologies which seek to achieve systematic and reliable identification and quantification of proteins from, for example, human blood serum specimens. One popular and promising technology is mass spectrometry. A mass spectrometer takes a specimen as input and produces a measure of the abundance of molecules (e.g., protein fragments) at every observed mass/charge ratio. From the pattern of abundance values, one can hope to infer which proteins are present and in what quantity. For mixtures which contain an enormous number of proteins, such as human serum, a specimen preparation step is used to physically separate components of the specimen on the basis of some property of the molecules, such as hydrophobicity. One powerful separation technique, Liquid Chromatography (LC), spreads out a single specimen over time, so that at multiple, unique time points, a less complex mixture can be fed into the mass spectrometer. This combined technique is referred to as Liquid Chromatography-Mass Spectrometry (LC-MS).

    Problem Set-Up

    A single specimen run through LC-MS results in the output of a two-dimensional spectrogram, with mass/charge on one axis and time of input to the mass spectrometer on the other. Protein fragments appear as very noisy two-dimensional peaks in this spectrogram. A typical specimen might contain many thousands of protein fragments and produce on the order of 1,000 unique mass/charge measurements at each of, say, 800 unique time points. These spectrograms have a number of properties which make them difficult to analyze. Most notably, the LC step results in a tremendous amount of variation in the time axis, due to factors such as variations in ambient pressure and temperature, and to physical/chemical properties of the chromatography procedure which cannot be made identical from run to run. As such, the time axis is variously shifted, compressed and expanded in complex, non-linear ways. Furthermore, the rate of the mechanical process which injects specimen from the LC into the MS cannot be made constant, further confounding the comparability of time measurements across experimental runs. Lastly, the measured abundance of molecules is inherently subject to both random and systematic noise, each of which is variable both within a given LC-MS run and between runs. Extracting information from LC-MS spectrograms is currently an open problem, and one in great need of analytical, computational and statistical research [2].
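    To make the time-axis problem concrete, the toy simulation below (all peak positions and warp parameters are invented for illustration) generates two "runs" of the same underlying chromatogram under different smooth, monotone time distortions. The same elution peak then appears at noticeably different apparent times:

```python
import numpy as np

def latent_trace(t):
    """A toy 'true' chromatogram: two Gaussian elution peaks."""
    return np.exp(-((t - 20.0) ** 2) / 8.0) + 0.5 * np.exp(-((t - 55.0) ** 2) / 18.0)

def warp_time(t, shift, stretch, wobble):
    """A smooth, monotone, non-linear distortion of the time axis,
    standing in for run-to-run LC variability (illustrative only)."""
    return shift + stretch * t + wobble * np.sin(t / 12.0)

t = np.linspace(0, 80, 800)          # ~800 time points, as in the text
run_a = latent_trace(warp_time(t, 1.5, 1.02, 2.0))
run_b = latent_trace(warp_time(t, -2.0, 0.97, 3.0))

# The same underlying peak now sits at different apparent times, so a
# naive point-by-point comparison across runs would be misleading.
peak_a = t[np.argmax(run_a)]
peak_b = t[np.argmax(run_b)]
```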

    Our specific goals are those listed in the Short Story above.

    Our Approach

    To date, our work has gone toward solving the time alignment problem simultaneously with the abundance normalization problem, where the latter refers to correction of systematic noise in the measured abundance levels. We have attacked this problem within the machine learning/graphical model paradigm which provides a flexible, powerful and theoretically well-founded basis for solving real world problems such as those presented above.

    In [3] we developed the Continuous Profile Model (CPM) for multiple alignment of time series, with application to LC-MS data (and speech). A form of conditional HMM (Hidden Markov Model), the CPM is a generative model in which each observed time series is a non-uniformly subsampled version of a single latent trace, to which local rescaling and additive noise are applied. After unsupervised training, the learned latent trace represents a canonical, high-resolution fusion of all the replicate LC-MS spectrograms. In addition, an alignment in time and scale of each observation to this trace can be found by inference in the model. Each hidden state in the CPM maps to (i) a latent trace time and (ii) a local scale factor. Thus one can view traversal through the hidden state space as a series of stochastic 'jumps' forward in time through the latent trace, along with application of a local scale factor at each time point visited. Additionally, we learn parameters which control a global scale factor and noise level for each spectrogram, and the time-sampling trend within a spectrogram. Since publication of this model in [3], we have had interest from other researchers seeking to use our algorithm, both for mass spectrometry data and in speech processing.
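    A much-simplified sketch of the CPM's generative story (with invented jump range, drift and noise values; not the published model or its training code) might look like the following, where each replicate is produced by stochastic forward jumps through the latent trace plus local rescaling and noise:

```python
import numpy as np

rng = np.random.default_rng(0)

# High-resolution latent trace (the canonical "fused" profile).
M = 600
idx = np.arange(M)
z = np.exp(-((idx - 150) ** 2) / 200.0) + 0.6 * np.exp(-((idx - 400) ** 2) / 500.0)

def sample_observation(z, T=200, sigma=0.02, rng=rng):
    """Generate one observed series from the latent trace: stochastic
    forward jumps through latent time, a slowly drifting local scale
    factor, and additive Gaussian noise. A simplified illustration of
    the CPM's generative story, not the published model."""
    obs = np.empty(T)
    tau = 0              # current latent-time index
    scale = 1.0          # local scale factor (random-walk drift)
    for t in range(T):
        tau = min(tau + rng.integers(1, 4), len(z) - 1)  # jump 1..3 steps
        scale *= np.exp(rng.normal(0.0, 0.01))           # local rescaling
        obs[t] = scale * z[tau] + rng.normal(0.0, sigma)
    return obs

# Three "replicates": same latent trace, different warps and noise.
replicates = np.stack([sample_observation(z) for _ in range(3)])
```

    In the real model, training recovers the latent trace and the per-spectrogram parameters from data by unsupervised learning; the sketch only runs the generative direction.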

    We have been working on extensions of the CPM to allow non-replicate data (i.e., specimens from different classes such as cancer versus non-cancer) to be aligned in a principled way. Typically one applies an alignment algorithm to data under the assumption that all specimens should map to one another, that is, without incorporating contextual knowledge of differences. However, if one knows, for example, that differences in the replicates are constrained to be sparse, or close together in time, or that two sets of time series are likely to diverge by a considerable amount in 1% of the time points, then it would be desirable to include this domain knowledge to improve alignment.
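    As a toy illustration of the kind of sparsity assumption described above (not the actual model extension), soft-thresholding the pointwise gap between two already-aligned traces keeps only localized differences that clearly exceed the noise level:

```python
import numpy as np

def sparse_difference(trace_a, trace_b, lam=0.3):
    """Estimate a sparse difference between two aligned traces by
    soft-thresholding their pointwise gap (the L1/lasso-style
    estimate). An illustrative stand-in for the sparsity constraint
    discussed in the text, not the actual model extension."""
    d = trace_a - trace_b
    return np.sign(d) * np.maximum(np.abs(d) - lam, 0.0)

t = np.linspace(0, 1, 200)
base = np.exp(-((t - 0.4) ** 2) / 0.02)
other = base.copy()
other[120:140] += 0.8                        # a genuine, localized difference
noisy = base + 0.05 * np.sin(37 * t)         # small everywhere-noise
diff = sparse_difference(other, noisy, lam=0.3)

# Only the genuine localized difference survives thresholding.
nonzero = np.nonzero(diff)[0]
```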

    In the aforementioned model extension, a second parameter requiring cross-validation (CV) is introduced. We also find indications that a maximum-likelihood estimate may in some cases be insufficient. We have thus been motivated to port our CPM into a fully Bayesian paradigm, which has the potential to benefit us in two ways: (1) it eliminates the need for CV, a CPU-intensive process whose usefulness is limited by its imprecision (granted, the Markov Chain Monte Carlo required for our Bayesian model may prove equally slow, but we are also working on speed-ups by making changes to the base CPM), and (2) it allows us to model the full posterior, and hence to examine regions of high probability density containing similar solutions which lie below the probability of the maximum-likelihood solution, but which may be superior.
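    The following toy sketch (an invented one-dimensional density, not our model's posterior) illustrates point (2): a random-walk Metropolis sampler shows that most probability mass can sit in a broad mode even when the maximum of the density lies in a narrow, taller one:

```python
import numpy as np

rng = np.random.default_rng(1)

def log_p(x):
    """Toy unnormalized log-density: a tall narrow mode at 0 and a
    lower but much broader mode at 5. The density's maximum is at 0,
    yet most of the mass lies around 5."""
    narrow = 0.2 * np.exp(-0.5 * (x / 0.05) ** 2) / 0.05      # peak height 4.0
    broad = 0.8 * np.exp(-0.5 * ((x - 5.0) / 2.0) ** 2) / 2.0  # peak height 0.4
    return np.log(narrow + broad)

# Random-walk Metropolis sampling of the toy density.
x, samples = 2.5, []
for _ in range(20000):
    prop = x + rng.normal(0.0, 1.0)
    if np.log(rng.random()) < log_p(prop) - log_p(x):
        x = prop
    samples.append(x)
samples = np.array(samples[2000:])  # discard burn-in

# The posterior mean is pulled strongly toward the broad mode at 5,
# even though the point of maximum density is the narrow mode at 0.
posterior_mean = samples.mean()
```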

    Upon successful completion of the tasks above, we should be in a position to align LC-MS spectrograms, both replicates and those from different classes. This will provide the foundation for development and application of pattern-recognition algorithms. Predictive models, for example, could be built naturally on top of this framework, by asking whether an unknown specimen aligns better, as measured probabilistically, to the cancer latent trace or to the non-cancer latent trace in the extended model. While the task of determining the precise set of proteins (or protein fragments) which differ between two classes is closely linked to that of building predictive models, one may be able to construct an excellent predictive model which in and of itself does not elucidate all of the class-specific protein differences, nor present those it does find in an easy-to-understand and useful manner. Thus a different suite of methods may be required for this task, as well as for inferring statistical significance (i.e., determining whether a discovered difference is likely to be spurious or reflects true, underlying biological differences) and for ranking hypotheses so that top-ranking ones can be confirmed by independent laboratory tests (thereby also validating our model).
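    To illustrate the predictive idea sketched above (with invented traces and a plain i.i.d. Gaussian noise model standing in for full CPM inference), one could compare an unknown specimen's likelihood under each class's latent trace and assign it to whichever class explains it better:

```python
import numpy as np

def gaussian_loglik(obs, trace, sigma=0.05):
    """Log-likelihood of an (already aligned) observation under a
    latent trace with i.i.d. Gaussian noise -- a deliberately simple
    stand-in for inference in the extended CPM."""
    resid = obs - trace
    return (-0.5 * np.sum((resid / sigma) ** 2)
            - obs.size * np.log(sigma * np.sqrt(2 * np.pi)))

def classify(obs, trace_cancer, trace_normal):
    """Assign the specimen to whichever class's latent trace explains
    it better, as measured by log-likelihood."""
    ll_c = gaussian_loglik(obs, trace_cancer)
    ll_n = gaussian_loglik(obs, trace_normal)
    return ("cancer" if ll_c > ll_n else "normal"), ll_c - ll_n

# Invented class traces: the "cancer" trace has one extra peak.
t = np.linspace(0, 1, 100)
trace_normal = np.exp(-((t - 0.3) ** 2) / 0.01)
trace_cancer = trace_normal + 0.4 * np.exp(-((t - 0.7) ** 2) / 0.005)

# A noisy specimen generated from the cancer trace is classified correctly.
rng = np.random.default_rng(2)
obs = trace_cancer + rng.normal(0.0, 0.05, size=t.size)
label, margin = classify(obs, trace_cancer, trace_normal)
```

    The margin (the log-likelihood difference) also gives a natural, probabilistically interpretable confidence in the assignment.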
