Current Research

Data Science over Open Data and Massive Data Lakes

In 2016, Forbes reported that “data preparation accounts for about 80% of the work of data scientists,” where preparation includes finding and collecting data, cleaning and integrating data, and managing data for analysis. The same report concluded that this is the least enjoyable part of a data scientist’s job. As scientists, they would rather be deriving new knowledge and insights. The paradox is that without principled data management and preparation, those new insights are suspect at best. Data preparation and management in support of analysis are so time-consuming and unenjoyable because of a lack of tools, scientific frameworks, and mathematical foundations for principled data preparation. My research is helping to correct this deficit. As part of my methodology, I use open data, both because of its availability for scientific research and because of its importance to governments and society.