Analysis of Liquid Chromatography Mass Spectrometry (LC-MS) Data
In this work we aim to develop and apply statistical and computational
methods for extracting information in novel ways (i.e., without use of MS/MS)
from Liquid Chromatography-Mass Spectrometry (LC-MS) proteomic data. Advancement in
the area of LC-MS analysis has the potential to make a significant impact on
human health by playing a key role in the solution to problems such as (i)
uncovering early disease markers, which can then lead to improved patient
outcomes, (ii) drug development by way of identifying target molecules or
pathways, and (iii) systematic monitoring of physiological responses to allow
for a more comprehensive understanding of drug action and disease. It could
also help in more general/basic problems confronting modern biology by
providing a new kind of microscopic window into living organisms.
The Longer Story

Proteins Play a Starring Role

The central dogma of molecular biology informs us that DNA is transcribed into mRNA, and mRNA is translated into protein, which in turn directly affects many events in the body, from very local events like cell death to more global events such as muscle contraction. Though DNA is often called the blueprint of life, there is, under most circumstances, no simple mapping between a person's DNA sequence and that person's health, intelligence or metabolism, for example. In between DNA and disease lies a complex biological machinery, in which proteins play a fundamental role. As such, emphasis in molecular biology is shifting away from DNA sequencing and analysis to focus on the study of richer downstream events, such as those centered on proteins.

Measuring Protein Levels

The biological research community is tackling the large-scale study of proteins, referred to as proteomics, with a range of high-throughput technologies which seek to achieve systematic and reliable quantitation and identification of protein levels from, for example, human blood serum specimens. One popular and promising technology is mass spectrometry. A mass spectrometer takes a specimen as input and produces a measure of the abundance of molecules (e.g., protein fragments) occurring at every mass/charge ratio observed. From the pattern of abundance values, one can hope to infer which proteins are present and in what quantity. For mixtures which contain an enormous number of proteins, such as human serum, a specimen preparation step is used to physically separate components of the specimen on the basis of some property of the molecules, such as hydrophobicity. One powerful separation technique, Liquid Chromatography (LC), spreads out a single specimen over time, so that at multiple, unique time points a less complex mixture can be fed into the mass spectrometer.
This combined technique is referred to as Liquid Chromatography-Mass Spectrometry (LC-MS).

Problem Set-Up

A single specimen run through LC-MS results in the output of a two-dimensional spectrogram, with mass/charge on one axis and time of input to the mass spectrometer on the other. Protein fragments appear as very noisy two-dimensional peaks in this spectrogram. A typical specimen might contain many thousands of protein fragments and produce on the order of 1,000 unique mass/charge measurements at each of, say, 800 unique time points. These spectrograms have a number of properties which make them difficult to analyze. Most notably, the LC step results in a tremendous amount of variation in the time axis, due to factors such as variations in ambient pressure and temperature, as well as physical/chemical properties related to the chromatography procedure which cannot be made identical from run to run. As such, the time axis is variously shifted, compressed and expanded in complex, non-linear ways. Furthermore, the rate of the mechanical process which injects specimen from the LC into the MS cannot be made constant, further confounding the comparability of time measurements in different experimental runs. Lastly, the measured abundance of molecules is inherently subject to both random and systematic noise, each of which is variable both within a given LC-MS run and between such runs. Extracting information from LC-MS spectrograms is currently an open problem, and one in great need of analytical, computational and statistical research [2]. Our specific goals are to:
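To make the data layout concrete, the following is a small illustrative sketch (our own simplification, not part of the published work): a spectrogram is stored as a time-by-mass/charge array of the typical sizes mentioned above, and a smooth, monotonic, non-linear distortion of its time axis stands in for the run-to-run chromatography variation just described. All function names and distortion parameters are hypothetical.

```python
import numpy as np

# Illustrative sketch only: rows = time points, columns = mass/charge bins,
# using the typical sizes mentioned in the text.
rng = np.random.default_rng(0)
n_time, n_mz = 800, 1000
spectrogram = rng.random((n_time, n_mz))

def warp_time_axis(spec, shift=20.0, stretch=1.1, wobble=15.0):
    """Resample the time axis under a shifted, stretched, and locally
    compressed/expanded (but still monotonic) time grid — a toy stand-in
    for the non-linear time variation between LC-MS runs."""
    t = np.arange(spec.shape[0], dtype=float)
    warped_t = shift + stretch * t + wobble * np.sin(t / 120.0)
    warped_t = np.clip(warped_t, 0.0, spec.shape[0] - 1.0)
    lo = np.floor(warped_t).astype(int)
    hi = np.minimum(lo + 1, spec.shape[0] - 1)
    frac = (warped_t - lo)[:, None]
    # linear interpolation between neighbouring time rows
    return (1.0 - frac) * spec[lo] + frac * spec[hi]

replicate = warp_time_axis(spectrogram)  # a distorted "second run"
```

Aligning `replicate` back to `spectrogram` without knowing the warp is, in miniature, the time-alignment problem described above.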
Our Approach

To date, our work has gone toward solving the time alignment problem simultaneously with the abundance normalization problem, where the latter refers to correction of systematic noise in the measured abundance levels. We have attacked this problem within the machine learning/graphical model paradigm, which provides a flexible, powerful and theoretically well-founded basis for solving real-world problems such as those presented above.

In [3] we developed the Continuous Profile Model (CPM) to do multiple alignment of time series, with application to LC-MS data (and speech). A form of conditional HMM (Hidden Markov Model), the CPM is a generative model in which each observed time series is a non-uniformly subsampled version of a single latent trace, to which local rescaling and additive noise are applied. After unsupervised training, the learned latent trace represents a canonical, high-resolution fusion of all the replicate LC-MS spectrograms. As well, an alignment in time and scale of each observation to this trace can be found by inference in the model. Each hidden state in the CPM maps to (i) a latent trace time, and (ii) a local scale factor. Thus one can view traversal through the hidden state space as a series of stochastic 'jumps' forward in time through the latent trace, along with application of a local scale factor at each time point visited. Additionally, we learn parameters which control a global scale factor and noise level for each spectrogram, and the time sampling trend within a spectrogram. Since publication of this model in [3], we have had interest from other researchers seeking to use our algorithm, both for mass spectrometry data and for speech processing. We have been working on extensions of the CPM to allow non-replicate data (i.e., specimens from different classes, such as cancer versus non-cancer) to be aligned in a principled way.
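The traversal just described, in which each hidden state pairs a latent-trace time with a local scale factor and transitions 'jump' forward in time, can be sketched with a toy dynamic program. This is a heavily simplified illustration of the idea, not the published CPM implementation: the discrete scale set, Gaussian emission score, and jump limit below are all invented for the example.

```python
import numpy as np

def align_to_latent_trace(obs, trace, scales=(0.5, 1.0, 2.0), max_jump=3):
    """Viterbi-style alignment of one observed series to a latent trace.
    A hidden state is a pair (latent time m, scale index s); transitions
    jump forward 1..max_jump latent time steps and may switch scale."""
    T, M, S = len(obs), len(trace), len(scales)

    def emit(t, m, s):
        # toy Gaussian emission: obs[t] ~ scales[s] * trace[m] + noise
        return -0.5 * (obs[t] - scales[s] * trace[m]) ** 2

    dp = np.full((T, M, S), -np.inf)
    back = np.zeros((T, M, S, 2), dtype=int)
    for m in range(M):
        for s in range(S):
            dp[0, m, s] = emit(0, m, s)
    for t in range(1, T):
        for m in range(M):
            for s in range(S):
                best, arg = -np.inf, (0, 0)
                for jump in range(1, max_jump + 1):  # forward jumps only
                    pm = m - jump
                    if pm < 0:
                        break
                    for ps in range(S):
                        if dp[t - 1, pm, ps] > best:
                            best, arg = dp[t - 1, pm, ps], (pm, ps)
                if best > -np.inf:
                    dp[t, m, s] = best + emit(t, m, s)
                    back[t, m, s] = arg
    # backtrack from the best final state to recover the alignment path
    m, s = np.unravel_index(int(np.argmax(dp[-1])), (M, S))
    path = [(int(m), int(s))]
    for t in range(T - 1, 0, -1):
        m, s = back[t, m, s]
        path.append((int(m), int(s)))
    return path[::-1]
```

Because every transition moves strictly forward in latent time, the recovered path is monotonic, mirroring the CPM's view of each observation as a non-uniformly subsampled, locally rescaled version of the latent trace.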
Typically one applies an alignment algorithm to data under the assumption that all specimens should map to one another, that is, without incorporating contextual knowledge of differences. However, if one knows, for example, that differences between the replicates are constrained to be sparse, or close together in time, or that two sets of time series are likely to diverge by a considerable amount in 1% of the time points, then it would be desirable to include this domain knowledge to improve alignment. In the aforementioned model extension, a second parameter requiring cross-validation (CV) is introduced. Also, we find indications that a maximum-likelihood estimate may in some cases be insufficient. We have thus been motivated to port our CPM into a fully Bayesian paradigm, which has the potential to benefit us in two ways: (1) it eliminates the need to do CV, a CPU-intensive process which is limited by its imprecision (granted, the Markov Chain Monte Carlo required for our Bayesian model may prove to be equally slow, but we are also working on speed-ups for this by making changes to the base CPM), and (2) it allows us to model the full posterior, thereby letting us examine regions of high probability density containing similar solutions which individually lie below the probability of a maximum-likelihood solution, but which may be superior.

Upon successful completion of the task above, we should be in a position to align LC-MS spectrograms, both replicates and those from different classes. This will provide the foundation for development and application of pattern recognition algorithms, which, in the case of predictive models, could naturally be built on top of this framework (for example, by asking, for an unknown specimen, whether it aligns better to the cancer latent trace or to the non-cancer latent trace in the extended model, as measured probabilistically).
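The classification idea in the last sentence — deciding which class's latent trace an unknown specimen aligns to best — can be illustrated in miniature. The Gaussian scoring function below is a stand-in for full CPM inference (it ignores time warping entirely, and all names are hypothetical); it only conveys the shape of the decision rule.

```python
import numpy as np

def log_likelihood(specimen, latent_trace, noise_sd=1.0):
    """Stand-in score: log-probability of the specimen under i.i.d.
    Gaussian noise around a same-length latent trace (no time warping)."""
    resid = np.asarray(specimen) - np.asarray(latent_trace)
    n = resid.size
    return float(-0.5 * np.sum((resid / noise_sd) ** 2)
                 - n * np.log(noise_sd * np.sqrt(2.0 * np.pi)))

def classify(specimen, trace_by_class):
    """Assign the specimen to the class whose latent trace scores highest;
    the per-class scores are returned for inspection."""
    scores = {c: log_likelihood(specimen, tr)
              for c, tr in trace_by_class.items()}
    return max(scores, key=scores.get), scores
```

In the full model, each score would come from probabilistic inference over alignments to that class's learned latent trace, rather than from a fixed pointwise residual.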
While the task of determining the precise set of proteins (or protein fragments) which differ between two classes could be closely linked to that of building predictive models, one may be able to construct an excellent predictive model which in and of itself does not elucidate all of the class-specific protein differences, nor present those that it does find in an easy-to-understand and useful manner. Thus a different suite of methods may be required for this task, as well as for the task of inferring statistical significance (i.e., determining whether a discovered difference is likely to be spurious, or revealing of true, underlying biological differences), and for methods which allow hypotheses to be ranked so that top-ranking ones can be confirmed by independent laboratory tests (thereby also validating our model).
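One generic way to ask whether a discovered between-class difference is spurious is a permutation test: shuffle the class labels and see how often a difference as large as the observed one arises by chance. The sketch below illustrates that standard technique on a single feature (e.g., the aligned abundance at one peak); it is our own illustration, not a method described in the text.

```python
import numpy as np

def permutation_p_value(group_a, group_b, n_perm=2000, seed=0):
    """Two-sided permutation test on the difference of group means.
    Returns the usual add-one-smoothed p-value estimate."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([group_a, group_b])
    observed = abs(np.mean(group_a) - np.mean(group_b))
    n_a = len(group_a)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabelling of the pooled values
        perm_diff = abs(np.mean(pooled[:n_a]) - np.mean(pooled[n_a:]))
        if perm_diff >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)
```

A small p-value suggests the observed difference is unlikely under random relabelling, which is one ingredient for ranking candidate class-specific differences before independent laboratory confirmation.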