The problem of reconstructing detailed values from aggregated data is underspecified. Therefore, we employ the information theoretic principle of maximum entropy. The method is suitable for this estimation problem as it is designed for lack of data rather than an excess thereof. The overriding principle in maximum entropy is that when nothing is known about the real distribution then the estimated distribution should be as uniform as possible. We couple this algorithm with a technique that provides error guarantees for the approximation. Experimental evaluation with both real and synthetic datasets shows the applicability and efficiency of the proposed algorithms.
Using the same method based on maximum entropy, we also demonstrate the ability of the algorithm to identify ``interesting'' values in multidimensional datasets. Maximum entropy reconstruction from a number of aggregates is performed based on the assumption that the aggregates of interest are pairwise independent. Any data value that violates the pairwise independence assumption, will induce a larger reconstruction error than one that does not, and this data value is potentially interesting. The main advantage of our approach is that, contrary to previous work in the area, the deviations are reported without any need for parameter setting or fine tuning, and without any prior knowledge of the domain. Once again we used synthetic and real datasets for the experimental evaluation, and showed the effectiveness of our approach.
Subsequently, we remove the underlying assumption of the previous work that we have enough space to materialize all the aggregates needed. The above situation gives rise to an optimization problem that is NP-Hard. We propose several heuristics and meta-heuristics - some of which can incrementally produce the answer, which is useful for non-strict space constraints - and experimentally show that the problem is amenable to an efficient solution.
I have a strong interest in directing my research efforts towards the creation of prototype systems that prove the concepts of the research ideas. I believe that this will help my students as well, since they will have the chance to get hands-on experience and demonstrate that their ideas work in practical situations. I gained invaluable experience towards this direction and the challenges it involves when I worked at Microsoft Research and IBM Almaden. In both cases, the implementation of a prototype within a commercial database management system was an integral part of the research project.
Recently I have been involved in the development of a general framework for fast incremental maintenance, and in the design of semantic cache management policies. I intend to continue research in the direction of improving current database engine functionality. New data types and applications bring along the need for database management systems to adapt and provide new solutions. Furthermore, databases are heading towards a more self-managing and self-tuning environment, which poses many intriguing research questions as well.
Finally, I have an interest in exploring the application of database techniques, or the development of new ones, in the web environment. Some of my past work addressed the problem of page retrieval latencies. Semantic caching is certainly relevant in this context, and data mining algorithms are being applied in this domain. It would be interesting to investigate the peculiarities of this environment leveraging the experience gained in the databases area.
During my PhD study, I had the opportunity to attend several major database conferences and meet database people in research and development from both the academia and the industrial research laboratories. I plan to maintain these ties and initiate collaborative projects. I also have a strong interest in cooperating with colleagues having expertise in areas related to data mining, like machine learning and statistics.