Difference Detection in LC-MS Data for Protein Biomarker Discovery
Jennifer Listgarten*, Radford M. Neal*, Sam T. Roweis*, Peter Wong**, and Andrew Emili**
* Department of Computer Science and ** Banting and Best Department of Medical Research, at the University of Toronto.
In Bioinformatics, 2007, 23:e198-e204 [by way of
ECCB 2006 (European Conference on Computational Biology)]
Best Student Paper, 3rd prize
Typos in equations on p.3 corrected on August 9th, 2006. (Thanks to Mark Robinson for
noticing.)
(download the data set below)
Motivation:
There is a pressing need for improved proteomic screening methods
allowing for earlier diagnosis of disease, systematic monitoring of
physiological responses, and the uncovering of fundamental mechanisms
of drug action. The combined platform of LC-MS
(Liquid-Chromatography-Mass-Spectrometry) has shown promise in moving
toward a solution in these areas. In this paper we present a
technique for discovering differences in protein signal between two
classes of samples of LC-MS serum proteomic data without use of tandem
mass spectrometry, gels, or labelling. This method works on data from
a lower-precision MS instrument, the type routinely used by and
available to the community at large today. We test our technique on a
controlled (spike-in) but realistic (serum biomarker discovery)
experiment which is therefore verifiable. We also develop a new
method for helping to assess the difficulty of a given spike-in
problem. Lastly, we show that the problem of class prediction,
sometimes mistaken for a solution to biomarker discovery, is actually a
much simpler problem.
Results:
Using precision-recall curves with experimentally extracted ground
truth, we show that i) our technique has good performance using 7
replicates from each class, ii) performance degrades with decreasing
number of replicates, and iii) the signal that we are teasing out is not
trivially available (i.e., the differences are not so large that
the task is easy). Lastly, we easily obtain perfect classification
results for data in which the problem of extracting differences does
not produce absolutely perfect results. This emphasizes the different
nature of the two problems and also their relative difficulties.
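As an illustration of the evaluation described above, the following is a minimal sketch of how a precision-recall curve can be computed from a ranked list of candidate differences and a ground-truth labelling. It is not the authors' implementation: the function name, the score values, and the labels are hypothetical, and the sketch assumes only that each candidate carries a difference score (higher meaning more likely to be a true difference) and a known true/false label.

```python
def precision_recall_curve(scores, is_true_difference):
    """Sweep a threshold down the ranked scores and return (precision, recall) pairs.

    scores: hypothetical per-candidate difference scores (higher = more confident).
    is_true_difference: ground-truth booleans, one per candidate.
    Tied scores are not grouped; they keep their input order.
    """
    if len(scores) != len(is_true_difference):
        raise ValueError("scores and labels must have the same length")
    total_true = sum(bool(t) for t in is_true_difference)
    if total_true == 0:
        raise ValueError("recall is undefined without any true differences")

    # Rank candidates from most to least confident.
    ranked = sorted(zip(scores, is_true_difference),
                    key=lambda pair: pair[0], reverse=True)

    curve = []
    true_positives = 0
    for n_called, (_, truth) in enumerate(ranked, start=1):
        true_positives += int(bool(truth))
        precision = true_positives / n_called   # fraction of calls that are correct
        recall = true_positives / total_true    # fraction of true differences found
        curve.append((precision, recall))
    return curve


# Made-up example values, for illustration only.
scores = [0.9, 0.8, 0.7, 0.4, 0.2]
labels = [True, False, True, True, False]
curve = precision_recall_curve(scores, labels)
for precision, recall in curve:
    print(f"precision={precision:.2f} recall={recall:.2f}")
```

Each point on the curve corresponds to calling the top-ranked candidates "different" and asking how many of those calls are correct (precision) and how many of the known differences have been recovered (recall).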
Availability:
Our data is publicly available as a benchmark for further
studies of this nature.