Erkang (Eric) Zhu

[祝尔康] [Github] [LinkedIn] [Email]


I am a PhD student in Computer Science at the University of Toronto, Canada. I am a member of the Data Curation Lab and my supervisor is Renée J. Miller.

My current research focuses on building search engines for massive data repositories, such as Open Data and data on the Web, by developing innovative techniques in the domains of information retrieval, approximate nearest neighbour search, and machine learning.

In the lab, I am collaborating with Ken Pu and Fatemeh Nargesian on several projects. In the summer of 2016, I was fortunate to collaborate with Surajit Chaudhuri and Yeye He of the DMX Group at Microsoft Research in Redmond, WA. In 2014, I interned with the DynamoDB team at Amazon Web Services in Seattle.


Open Data and Web Data have become important sources of tabular data for data scientists due to their massive quantities and broad coverage. However, many publishers support only keyword search over human-curated metadata, which may be insufficient to reveal the underlying content. A data scientist often wants to find new tables that can be joined with their own tables in order to produce a comprehensive analysis. We solve this search problem for table repositories with hundreds of millions of tables using a novel indexing technique based on locality sensitive hashing, with a partitioning strategy that permits the use of containment as the relevance measure.
[Paper] [Code]
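
To illustrate why containment is a natural relevance measure for finding joinable tables, here is a minimal Python sketch comparing it with Jaccard similarity on exact sets. The function names and the toy column values are made up for illustration, and this is not the indexing technique itself, only the measure it targets.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(query: set, candidate: set) -> float:
    """Containment of the query in the candidate: |Q ∩ X| / |Q|."""
    return len(query & candidate) / len(query) if query else 0.0


# Hypothetical join column from the data scientist's own table.
query = {"Toronto", "Ottawa", "Montreal"}

# Two hypothetical candidate columns that both contain every query value.
small = {"Toronto", "Ottawa", "Montreal", "Calgary"}
large = small | {f"city_{i}" for i in range(96)}  # 100 distinct values

for name, column in [("small", small), ("large", large)]:
    print(name, round(jaccard(query, column), 2), containment(query, column))
```

Both candidates join with every row of the query column, so containment is 1.0 for each. Jaccard similarity, by contrast, drops from 0.75 to 0.03 as the candidate grows, which penalizes large columns even though they are equally joinable.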

Generating Data Transformation for Join

Joining tables (e.g., pivot table, SQL JOIN) from multiple sources with heterogeneous formats requires writing a special-purpose data transformation program. This can be cost-ineffective, as most such joins are ad hoc. We propose a novel system that automatically generates data transformation programs without any user input, such as joined row and column pairs. The system achieves interactive response times for up to 10K input rows and superior accuracy compared to existing techniques on real-world tables.

The Auto Join algorithm is used by Azure Machine Learning Workbench.

Data Sketches

Data sketches are probabilistic summaries of datasets. They can be used to estimate statistics such as cardinality and similarity with high accuracy. The advantage of data sketches is their small and fixed memory footprint, typically just a few KBs, making them extremely suitable for processing very large datasets or streams. I developed a library of highly performant data sketch implementations in Python.
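
As a concrete illustration of a data sketch, here is a minimal MinHash sketch in pure Python that estimates Jaccard similarity from a fixed-size summary. It is a simplified teaching example using only the standard library. The class name `MinHashSketch`, its parameters, and the hashing scheme are assumptions made for this example and do not represent the API or implementation of the library mentioned above.

```python
import hashlib
import random


class MinHashSketch:
    """Fixed-size summary of a set for estimating Jaccard similarity."""

    _PRIME = (1 << 61) - 1  # Mersenne prime used as the hash modulus

    def __init__(self, num_perm: int = 256, seed: int = 1):
        # Sketches are only comparable if built with the same num_perm and seed.
        rng = random.Random(seed)
        self.params = [
            (rng.randrange(1, self._PRIME), rng.randrange(0, self._PRIME))
            for _ in range(num_perm)
        ]
        # One running minimum per hash function; memory does not grow with the data.
        self.values = [self._PRIME] * num_perm

    def update(self, item: str) -> None:
        """Add one item to the sketch."""
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i, (a, b) in enumerate(self.params):
            v = (a * h + b) % self._PRIME
            if v < self.values[i]:
                self.values[i] = v

    def jaccard(self, other: "MinHashSketch") -> float:
        """Estimate Jaccard similarity as the fraction of matching minimums."""
        matches = sum(x == y for x, y in zip(self.values, other.values))
        return matches / len(self.values)


# Two streams of items whose true Jaccard similarity is 200 / 400 = 0.5.
sketch_a, sketch_b = MinHashSketch(), MinHashSketch()
for i in range(0, 300):
    sketch_a.update(f"item-{i}")
for i in range(100, 400):
    sketch_b.update(f"item-{i}")

estimate = sketch_a.jaccard(sketch_b)
print(f"estimated Jaccard similarity: {estimate:.2f}")
```

Each sketch stores only `num_perm` integers no matter how many items are added, which is what makes this kind of summary suitable for large datasets and streams. The estimate becomes more accurate as `num_perm` increases, at the cost of a larger sketch.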




Here are the courses I have been involved in teaching at the University of Toronto.

Invited presentations