The sequencing of the human genome was the first step in understanding the ways in which we are wired. However, this genetic blueprint provides only a "parts list": it offers neither information about how the human organism actually works, nor insight into the function of, or interactions among, the ~30 thousand constituent parts that comprise our genome. Considering that 30 years of worldwide molecular biology effort have annotated only about 10% of this gene set, and that we know even less about proteins, it is comforting that high-throughput data generation and analysis are now widely available.
By arraying tens of thousands of genes and analyzing the abundance of, and interactions among, proteins, it is now possible to measure the relative activity of genes and proteins in normal and diseased tissue. The technology and datasets of such profiling-based analyses are described here, along with the mathematical challenges that face the mining of the resulting datasets. We also describe the issues related to using this information in the clinical setting, and the future steps that will lead to drug design and development for complex diseases such as cancer.
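As a minimal illustration of what "relative activity" means in practice, the snippet below computes a log2 fold change between hypothetical tumor and normal expression values. The numbers and function are illustrative assumptions, not data or methods from this book; real profiling studies involve tens of thousands of genes, replicates, and statistical testing.

```python
import math

# Hypothetical microarray intensities for one gene (arbitrary units).
normal_expr = [120.0, 135.0, 110.0]
tumor_expr = [480.0, 510.0, 450.0]

def log2_fold_change(case, control):
    """Relative activity: log2 ratio of mean case vs. mean control signal."""
    mean_case = sum(case) / len(case)
    mean_control = sum(control) / len(control)
    return math.log2(mean_case / mean_control)

lfc = log2_fold_change(tumor_expr, normal_expr)
print(f"log2 fold change: {lfc:.2f}")  # ~2.0, i.e. roughly 4x higher in tumor
```

A positive value indicates higher activity in the diseased tissue; the log scale keeps up- and down-regulation symmetric around zero.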
He who knows useful things, not many things, is wise. Aeschylus (ca. 525-456 BC)
The nascent fields of bioinformatics and computational biology are currently an odd amalgam of everything from biologists with a computational bent, through physicists and mathematicians, to computer scientists and engineers sifting through mountains of data and grappling with biological questions. Much of the excitement comes from a collective sense that there is something truly new evolving. Hardware and software limitations are declaring themselves as major challenges to managing and interpreting the avalanche of data from high-throughput biological platforms. This "drinking from the fire hydrant" sensation continues to spark interest and draw technical skill from other domains. As we move toward true systems biology experimentation, it is increasingly obvious that experts in robotics, engineering, mathematics, physics, and computer science have become key players alongside traditional molecular biologists.
Life sciences applications are typically characterized by multimodal representations, a lack of complete and consistent domain theories, rapid evolution of domain knowledge, high dimensionality, and large amounts of missing information. Data in these domains require robust approaches to deal with missing and noisy information. Modern proteomics is no exception. As our understanding of protein structure and function becomes ever more complicated, we have reached a point where the actual management of data is a major hurdle to knowledge discovery. Many of the browse-through applications of yesterday are clearly not useful for computational manipulation. If the data were not created with data mining and decision support in mind, how well can they serve that purpose?
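To make the missing-data point concrete, here is a minimal sketch of one robust approach. The protein names, values, and choice of median imputation are illustrative assumptions, not methods from the text; the median is used simply because it resists noisy outliers better than the mean.

```python
# Hypothetical protein abundance measurements across four samples;
# None marks values lost to instrument noise or detection limits.
abundances = {
    "P53":  [10.2, None, 9.8, 10.5],
    "EGFR": [3.1, 2.9, None, None],
}

def impute_median(values):
    """Replace missing entries with the median of the observed ones."""
    observed = sorted(v for v in values if v is not None)
    mid = len(observed) // 2
    median = (observed[mid] if len(observed) % 2
              else (observed[mid - 1] + observed[mid]) / 2)
    return [median if v is None else v for v in values]

completed = {name: impute_median(vals) for name, vals in abundances.items()}
print(completed["P53"])  # [10.2, 10.2, 9.8, 10.5]
```

More sophisticated schemes (model-based or nearest-neighbor imputation) follow the same pattern: estimate the missing entry from what was observed, rather than discarding the record.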
We felt this book was a timely discussion of some of the key issues in the field. In subsequent chapters we discuss a number of examples from our own experience that represent some of the challenges of knowledge discovery in high-throughput proteomics. This discussion is by no means comprehensive, and does not attempt to highlight all relevant domains. However, we hope to provide the reader with an overview of what we envision as an important and emerging field in its own right by discussing the challenges and potential solutions to the problems presented. We have selected five specific domains to discuss: (1) mass spectrometry based protein analysis; (2) protein-protein interaction network analysis; (3) systematic high-throughput protein crystallization; (4) systematic and integrated analysis of multiple data repositories using a diverse set of algorithms and tools; and (5) systems biology. In each of these areas, we describe the challenges created by the type of data produced, and potential solutions to the problem of data mining within the domain. We hope this stimulates further discussion, and newer and better ways to deal with the problems at hand.
Biomedical research is drowning in data, yet starving for knowledge. Current challenges in biomedical research and clinical practice include information overload - the need to combine vast amounts of structured, semi-structured, and weakly structured data with vast amounts of unstructured information - and the need to optimize workflows, processes, and guidelines in order to increase capacity while reducing costs and improving efficiency. In this paper we provide a short overview of interactive and integrative solutions for knowledge discovery and data mining. In particular, we emphasize the benefits of including the end user in the interactive knowledge discovery process. We describe some of the most important challenges, including the need to develop and apply novel methods, algorithms, and tools for the integration, fusion, pre-processing, mapping, analysis, and interpretation of complex biomedical data, with the aim of identifying testable hypotheses and building realistic models. The HCI-KDD approach, a synergistic combination of methodologies from two areas, Human-Computer Interaction (HCI) and Knowledge Discovery & Data Mining (KDD), offers ideal conditions for solving these challenges, with the goal of supporting human intelligence with machine intelligence. There is an urgent need for integrative and interactive machine learning solutions, because no medical doctor or biomedical researcher can keep pace today with the increasingly large and complex data sets, often called "Big Data".