Difference Detection in LC-MS Data for Protein Biomarker Discovery

Jennifer Listgarten*, Radford M. Neal*, Sam T. Roweis*, Peter Wong**, and Andrew Emili**
* Department of Computer Science and ** Banting and Best Department of Medical Research, at the University of Toronto.

(pdf) In Bioinformatics, 2007, 23:e198-e204 [by way of ECCB 2006 (European Conference on Computational Biology)]
Best Student Paper, 3rd prize
Typos in equations on p.3 corrected on August 9th, 2006. (Thanks to Mark Robinson for noticing.)

(download the data set)

There is a pressing need for improved proteomic screening methods allowing for earlier diagnosis of disease, systematic monitoring of physiological responses, and the uncovering of fundamental mechanisms of drug action. The combined platform of LC-MS (Liquid Chromatography-Mass Spectrometry) has shown promise in moving toward a solution in these areas. In this paper we present a technique for discovering differences in protein signal between two classes of samples of LC-MS serum proteomic data without use of tandem mass spectrometry, gels, or labelling. This method works on data from a lower-precision MS instrument, the type routinely used by and available to the community at large today. We test our technique on a controlled (spike-in) but realistic (serum biomarker discovery) experiment, which is therefore verifiable. We also develop a new method for helping to assess the difficulty of a given spike-in problem. Lastly, we show that the problem of class prediction, sometimes mistaken for a solution to biomarker discovery, is actually a much simpler problem.

Using precision-recall curves with experimentally extracted ground truth, we show that i) our technique has good performance using 7 replicates from each class, ii) performance degrades with decreasing number of replicates, and iii) the signal that we are teasing out is not trivially available (i.e., the differences are not so large that the task is easy). Lastly, we easily obtain perfect classification results for data in which the problem of extracting differences does not produce absolutely perfect results. This emphasizes the different nature of the two problems and also their relative difficulties.
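For readers unfamiliar with the evaluation described above, the following is a minimal sketch of how a precision-recall curve can be computed from a ranked list of candidate differences and a ground-truth labelling. It is not the code used in the paper: the function name `precision_recall_curve`, the toy scores, and the toy truth labels are all hypothetical, and tied scores are simply left in the order Python's stable sort gives them.

```python
def precision_recall_curve(scores, is_true_difference):
    """Return a list of (precision, recall) pairs, one per ranked candidate.

    scores: confidence that each candidate feature differs between classes
            (higher means more confident). Hypothetical input.
    is_true_difference: ground-truth booleans, same order as scores.
    """
    # Rank candidates from most to least confident.
    ranked = sorted(zip(scores, is_true_difference),
                    key=lambda pair: pair[0], reverse=True)
    total_true = sum(1 for _, truth in ranked if truth)
    if total_true == 0:
        raise ValueError("ground truth contains no true differences")

    curve = []
    true_positives = 0
    # Lower the threshold one candidate at a time and record both measures.
    for n_called, (_, truth) in enumerate(ranked, start=1):
        if truth:
            true_positives += 1
        precision = true_positives / n_called    # fraction of calls that are real
        recall = true_positives / total_true     # fraction of real differences found
        curve.append((precision, recall))
    return curve


# Toy data, invented purely for illustration.
scores = [0.9, 0.8, 0.7, 0.4, 0.2]
truth = [True, False, True, False, True]
curve = precision_recall_curve(scores, truth)
for precision, recall in curve:
    print(f"precision={precision:.2f} recall={recall:.2f}")
```

Each point on the curve corresponds to calling the top-ranked candidates "different" and checking those calls against the ground truth, so a method that ranks true differences first keeps precision high as recall grows.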

Our data is publicly available as a benchmark for further studies of this nature.


Supplementary Materials

LC-MS Data Set:

  • This data is available for public use. Please cite this paper if using it.
  • The raw data files for the spike-in-serum versus serum-only can be found here (143M), and a list of the contents of this tar/zip file is here.
  • The raw data files for the spike-in-buffer (ground truth) can be found here (100M), and a list of the contents of this tar/zip file is here.
  • An explanation of the format of these raw *.dat files can be found here.
  • Information about the spiked-in peptides can be found here.

Continuous Profile Model (CPM) Links:

  • Here is a link to modifications of the original CPM which were used in this paper.
  • And here is a link to the original CPM paper.
  • The Continuous Profile Models (CPM) Matlab Toolbox is now available.

Also, here is a link to other papers we've written about protein biomarker discovery.
