About DataScrew

Machine learning frameworks abound, and deep learning models are accessible and at the disposal of data scientists and practitioners. However, these models are only as good as the data they are trained on. Generating training data is not easy, as it requires manual effort and human curation. Yet for machine learning to become successful within an enterprise, the availability of relevant data to train models is key. Such data must be relevant to the organization and come from the data sets every enterprise has accumulated over the years. The goal of DataScrew is to unleash the potential of data within an organization by automating the process of creating and labelling training data sets from the plethora of data existing in an enterprise. This goes beyond relational data to encompass, in a unified manner, all available data types (text, JSON, image, video, etc.).
We are working on:
  • Intelligent semantic crawlers to automatically discover relevant data for a model and a prediction task
  • Data and relationship extraction with associated inferences to enhance the extracted data
  • Automated labelling of data sets for a specific task (binary or multi-class)
  • Easily incorporating vertical (domain-dependent) knowledge into the data collection task
  • Standardization and unification of the extracted data across data sources

Publications:
  • Data Acquisition for Improving Machine Learning Models, Li, Yu, Koudas, VLDB 2021
  • FILA: Online Auditing of Machine Learning Model Accuracy Under Finite Labelling Budget, Guan, Koudas, SIGMOD 2022