DataScrew

Machine learning frameworks abound. Deep learning models are accessible and at the disposal of data scientists and practitioners. However, these models are only as good as the data they are trained on. Generating training data is not easy, as it requires manual effort and human curation. Yet for machine learning to become successful within an enterprise, the availability of relevant data to train models is key. Such data must be relevant to the organization and come from the data sets every enterprise has accumulated over the years. The goal of DataScrew is to unleash the potential of data within an organization by automating the process of creating and labelling training data sets from the plethora of data existing in an enterprise. This goes beyond relational data to encompass all available data types (text, JSON, image, video, etc.) in a unified manner.
We are working on:

Intelligent semantic crawlers to automatically discover relevant data for a model and a prediction task
Data and relationship extraction with associated inferences to enhance the extracted data
Automated labelling of data sets for a specific task (binary or multi-class)
Easily incorporating vertical (domain-dependent) knowledge in the data collection task
Standardization and unification of the extracted data across data sources

Publications:

For up-to-date publications, see: Publications

DataScrew: Creating Massive Training Data Sets for Machine Learning
