Research Statement


My research interests lie in the general area of database management systems, and more specifically in data reduction techniques, data mining, and approximate query answering. I also have an overall interest in improving and adapting the current database engine functionality.

Current Work

It is a common practice in a data warehouse environment to either expire old data, and thus make them no longer available, or move them to tertiary storage, and consequently make their access expensive. In both cases the goal is to save storage space. Usually, it is the detailed data that are removed from the system, and only data in aggregated form are kept for further processing. In my thesis I explore the question of how much information the aggregated data in a datacube carry about the detailed values that produced those aggregates, and how we can exploit this information. More specifically, based only in the information in the aggregates, I am interested in the problems of producing estimates of the detailed values, and automatically identifying values of interest to the analyst.

The problem of reconstructing detailed values from aggregated data is underspecified. Therefore, we employ the information theoretic principle of maximum entropy. The method is suitable for this estimation problem as it is designed for lack of data rather than an excess thereof. The overriding principle in maximum entropy is that when nothing is known about the real distribution then the estimated distribution should be as uniform as possible. We couple this algorithm with a technique that provides error guarantees for the approximation. Experimental evaluation with both real and synthetic datasets shows the applicability and efficiency of the proposed algorithms.

Using the same method based on maximum entropy, we also demonstrate the ability of the algorithm to identify ``interesting'' values in multidimensional datasets. Maximum entropy reconstruction from a number of aggregates is performed based on the assumption that the aggregates of interest are pairwise independent. Any data value that violates the pairwise independence assumption, will induce a larger reconstruction error than one that does not, and this data value is potentially interesting. The main advantage of our approach is that, contrary to previous work in the area, the deviations are reported without any need for parameter setting or fine tuning, and without any prior knowledge of the domain. Once again we used synthetic and real datasets for the experimental evaluation, and showed the effectiveness of our approach.

Subsequently, we remove the underlying assumption of the previous work that we have enough space to materialize all the aggregates needed. The above situation gives rise to an optimization problem that is NP-Hard. We propose several heuristics and meta-heuristics - some of which can incrementally produce the answer, which is useful for non-strict space constraints - and experimentally show that the problem is amenable to an efficient solution.

Future Work

During the years of my PhD study, I have acquired a solid understanding of data reduction techniques and data mining algorithms. Both these directions of research are essential for coping with the large sizes of datasets generated nowadays. Data summarization and approximate query answering in large data warehouses is important for efficiently answering decision support queries on large data marts. Data mining can help the analyst focus on interesting parts of the dataset, thus saving time and effort. I am planning to continue research in this context, by examining the interplay between lossy data representations and the insight that one can gain during this procedure. I believe that acquiring a better understanding of the issues related to the combination of summarization and data mining is a fruitful area of research.

I have a strong interest in directing my research efforts towards the creation of prototype systems that prove the concepts of the research ideas. I believe that this will help my students as well, since they will have the chance to get hands-on experience and demonstrate that their ideas work in practical situations. I gained invaluable experience towards this direction and the challenges it involves when I worked at Microsoft Research and IBM Almaden. In both cases, the implementation of a prototype within a commercial database management system was an integral part of the research project.

Recently I have been involved in the development of a general framework for fast incremental maintenance, and in the design of semantic cache management policies. I intend to continue research in the direction of improving current database engine functionality. New data types and applications bring along the need for database management systems to adapt and provide new solutions. Furthermore, databases are heading towards a more self-managing and self-tuning environment, which poses many intriguing research questions as well.

Finally, I have an interest in exploring the application of database techniques, or the development of new ones, in the web environment. Some of my past work addressed the problem of page retrieval latencies. Semantic caching is certainly relevant in this context, and data mining algorithms are being applied in this domain. It would be interesting to investigate the peculiarities of this environment leveraging the experience gained in the databases area.

During my PhD study, I had the opportunity to attend several major database conferences and meet database people in research and development from both the academia and the industrial research laboratories. I plan to maintain these ties and initiate collaborative projects. I also have a strong interest in cooperating with colleagues having expertise in areas related to data mining, like machine learning and statistics.