Machine learning frameworks abound, and deep learning models are accessible and at the disposal of data scientists and practitioners. However, these models are
only as good as the data they are trained on. Generating training data is not easy, as it requires manual effort and human curation. Yet for machine learning to
become successful within an enterprise, the availability of relevant data to train models is key. Such data must be relevant to the organization and
come from the data sets the enterprise has accumulated over the years. The goal of DataScrew is to unleash the potential of data within an organization
by automating the creation and labelling of training data sets from the plethora of data existing in an enterprise. This goes beyond relational
data to encompass all available data types (text, JSON, image, video, etc.) in a unified manner.
We are working on:
- Intelligent semantic crawlers to automatically discover relevant data for a model and a prediction task
- Data and relationship extraction with associated inferences to enhance the extracted data
- Automated labelling of data sets for a specific task (binary or multi-class)
- Easy incorporation of vertical (domain-dependent) knowledge into the data collection task
- Standardization and unification of the extracted data across data sources
Publications: