Erkang (Eric) Zhu

[祝尔康] [Github] [LinkedIn] [Email]


I am a PhD student in Computer Science at the University of Toronto, Canada. I am a member of the Data Curation Lab and my supervisor is Renée J. Miller.

My research focuses on dataset search over massive collections of datasets, such as Open Data and data on the Web, without relying on human-curated metadata. I am also interested in randomized algorithms and their applications in data management.

In the lab, I am collaborating with Ken Pu and Fatemeh Nargesian on several projects. In the summer of 2016, I was fortunate to collaborate with Surajit Chaudhuri and Yeye He of the DMX Group at Microsoft Research in Redmond, WA. In 2014, I interned with the DynamoDB team at Amazon Web Services in Seattle.


Open Data and Web Data have become important sources of datasets for data scientists due to their massive quantities and broad coverage. However, many dataset publishers support only keyword search over human-curated metadata, which may be insufficient to reveal the underlying content. A data scientist often wants to find new datasets that can be joined with their own dataset in order to produce a comprehensive analysis. We solve this search problem for data repositories with hundreds of millions of datasets using a novel indexing technique based on locality-sensitive hashing, with a partitioning strategy that permits the use of containment as the relevance measure.
[Paper] [Code] [Demo]
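
To illustrate why containment, rather than Jaccard similarity, suits joinable-dataset search, here is a minimal sketch using exact set operations. It is not the indexing technique described above; the function names and toy data are assumptions made for illustration. It also shows the standard identity relating the two measures once set sizes are known, which suggests why grouping candidates by size can help a similarity-based index serve containment queries.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(query: set, candidate: set) -> float:
    """Fraction of the query's values that also appear in the candidate."""
    return len(query & candidate) / len(query) if query else 0.0


def jaccard_from_containment(t: float, query_size: int, candidate_size: int) -> float:
    """Jaccard similarity implied by containment t for the given set sizes.

    Follows from |Q & X| = t * |Q| and |Q | X| = |Q| + |X| - |Q & X|.
    """
    return t / (candidate_size / query_size + 1.0 - t)


# Hypothetical join column of a query table and two candidate columns.
query = {"ON", "QC", "BC", "AB"}
small = query | {"MB"}                                  # 5 values
large = query | {f"region-{i}" for i in range(96)}      # 100 values

for name, cand in [("small", small), ("large", large)]:
    print(name, containment(query, cand), jaccard(query, cand))
```

Both candidates fully contain the query column, so both are equally joinable, yet the Jaccard similarity of the larger candidate is far lower simply because it holds many other values.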

Generating Data Transformation for Join

Joining tables (e.g., pivot table, SQL JOIN) from multiple sources with heterogeneous formats requires writing a special-purpose data transformation program. This is rarely cost-effective, as most such joins are ad hoc. We propose a novel system that automatically generates data transformation programs without any user input, such as example pairs of joining rows or columns. The system achieves interactive response times on inputs of up to 10K rows and superior accuracy compared to existing techniques on real-world tables.
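
As a small illustration of the problem, the sketch below joins two toy tables whose key columns use different name formats. The transformation is written by hand here and only shows the kind of program that would otherwise have to be written for each ad hoc join; it is not output from the system, and all names and data are hypothetical.

```python
def transform(name: str) -> str:
    """Rewrite 'Last, First' as 'First Last' (split on the comma, swap, rejoin)."""
    last, first = [part.strip() for part in name.split(",", 1)]
    return f"{first} {last}"


def equi_join(left, right, key=lambda k: k):
    """Join (key, value) rows on equality after applying `key` to the left key."""
    index = {}
    for r_key, r_val in right:
        index.setdefault(r_key, []).append(r_val)
    return [
        (l_key, l_val, r_val)
        for l_key, l_val in left
        for r_val in index.get(key(l_key), [])
    ]


left = [("Doe, Jane", "Toronto"), ("Smith, John", "Seattle")]
right = [("Jane Doe", 2014), ("John Smith", 2016)]

naive = equi_join(left, right)                  # keys never match as written
joined = equi_join(left, right, key=transform)  # keys match after the rewrite
print(len(naive), len(joined))
```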

Data Sketches

Data sketches are probabilistic summaries of datasets. They can be used to estimate statistics such as cardinality and similarity with high accuracy. The advantage of data sketches is their small, fixed memory footprint, typically just a few kilobytes, which makes them well suited to processing very large datasets or streams. I developed a library of highly performant data sketch implementations in Python.
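
To make the idea concrete, here is a minimal MinHash sketch built only on the Python standard library. It is an illustrative sketch, not the API of the library mentioned above; the function names, the choice of salted BLAKE2b as the hash family, and the default of 128 hash functions are all assumptions.

```python
import hashlib


def minhash_signature(values, num_perm=128):
    """Return a fixed-size MinHash signature for a non-empty collection of strings.

    Each of the num_perm salted hash functions contributes the minimum
    64-bit hash value seen over all input values.
    """
    values = list(values)
    if not values:
        raise ValueError("cannot sketch an empty collection")
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a distinct salt acts as a distinct hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(v.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for v in values
        ))
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching signature slots."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


a = {str(i) for i in range(1000)}
b = {str(i) for i in range(500, 1500)}   # true Jaccard similarity is 500 / 1500

sig_a = minhash_signature(a, num_perm=256)
sig_b = minhash_signature(b, num_perm=256)
print(estimate_jaccard(sig_a, sig_b))
```

The signature length depends only on `num_perm`, not on the size of the input: 256 slots of 8 bytes each is 2 KB whether the set holds a thousand values or a billion. More slots reduce the estimation error at the cost of a larger sketch.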




Here are the courses I have been involved in teaching at the University of Toronto.

Invited presentations