Erkang (Eric) Zhu

[祝尔康] [Github] [LinkedIn] [Email]

I am a PhD student in Computer Science at the University of Toronto, Canada. I am a member of the Data Curation Lab and my supervisor is Renée J. Miller.

My research is in data discovery for massive data lakes (such as Open Data), large-scale similarity search, and randomized algorithms (data sketches).

In the lab, I am collaborating with Ken Pu and Fatemeh Nargesian on several projects. I worked on data mining on news articles at the Structured Data Group in Google Research NYC during the summer of 2018. In the summer of 2016, I was fortunate to collaborate with Surajit Chaudhuri and Yeye He of the DMX Group at Microsoft Research in Redmond, WA. In 2014, I interned on the DynamoDB team at Amazon Web Services in Seattle.


Data lakes (e.g., enterprise catalogs and Open Data portals) are data dumps if users cannot efficiently find the data in them. In this project we focus on Open Data lakes containing large numbers of tables (e.g., over 164K CSV tables), and we study the problem of finding tables that are joinable or unionable with respect to a user-provided query table. Solving this problem allows users to efficiently discover relevant tables for analytics. Our solutions are built on ideas from similarity search and data sketches. They do not rely on any metadata or schema, and do not require a precomputed join graph.
[Paper] [Code]

Another demo video with an intriguing example.
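The joinability search at the heart of this project can be illustrated with a brute-force baseline: treat each column as a set of values and rank candidates by set containment with the query column. This is only a sketch of the problem setting; the table and column names below are hypothetical, and the actual systems replace the linear scan with sketch-based indexes.

```python
def containment(query, candidate):
    # |Q intersect X| / |Q|: fraction of the query column's values found in X.
    q = set(query)
    return len(q & set(candidate)) / len(q)

def find_joinable(query_col, columns, threshold=0.5):
    # Brute-force linear scan over every column in the lake; sketch-based
    # indexes make this scale to hundreds of thousands of tables.
    scored = ((name, containment(query_col, col)) for name, col in columns.items())
    return sorted((pair for pair in scored if pair[1] >= threshold),
                  key=lambda pair: -pair[1])

# A hypothetical "data lake" of columns, keyed by table:column.
lake = {
    "cities.csv:name": ["Toronto", "Ottawa", "Montreal", "Calgary"],
    "parks.csv:city": ["Toronto", "Ottawa", "Banff"],
    "stocks.csv:ticker": ["AAPL", "MSFT", "GOOG"],
}
results = find_joinable(["Toronto", "Ottawa", "Montreal"], lake)
# cities.csv:name has containment 1.0, parks.csv:city 2/3; stocks is excluded.
```

The containment score (rather than plain Jaccard similarity) matters here because a small query column can still be perfectly joinable with a much larger one.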

Generating Data Transformation for Join

Joining tables (e.g., pivot table, SQL JOIN) from multiple sources with heterogeneous formats requires writing a special-purpose data transformation program. This can be cost-ineffective, as most such joins are ad hoc. We propose a novel system that automatically generates data transformation programs without any user input such as example row or column pairs. The system achieves interactive response times for inputs of up to 10K rows, and superior accuracy compared to existing techniques on real-world tables.
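To make the idea concrete, here is a toy sketch (not the Auto Join algorithm itself): it picks, from a tiny hand-written space of candidate transformations, the one that gives the best join coverage between two key columns, using only the column values and no user-provided pairs. The candidate names and example data are made up for illustration.

```python
# A tiny space of candidate transformation programs; a real system searches a
# much richer space (substring extraction, splitting, concatenation, etc.).
CANDIDATES = {
    "identity": lambda s: s,
    "lowercase": str.lower,
    "swap_around_comma": lambda s: " ".join(reversed(s.split(", "))),
}

def best_transformation(left_keys, right_keys):
    # Score each candidate by join coverage: the fraction of transformed
    # left-hand keys that find an exact match on the right-hand side.
    right = set(right_keys)
    def coverage(f):
        return sum(f(k) in right for k in left_keys) / len(left_keys)
    name = max(CANDIDATES, key=lambda n: coverage(CANDIDATES[n]))
    return name, coverage(CANDIDATES[name])

left = ["Doe, John", "Roe, Jane", "Poe, Edgar"]
right = ["John Doe", "Jane Roe", "Edgar Poe"]
# best_transformation(left, right) -> ("swap_around_comma", 1.0)
```

Notice that no row pairs were given: the search is driven entirely by how well each candidate program makes the two columns join.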

The Auto Join algorithm is used by Azure Machine Learning Workbench.

Data Sketches

Data sketches are probabilistic summaries of datasets. They can be used to estimate statistics such as cardinality and similarity with high accuracy. The advantage of data sketches is their small, fixed memory footprint (typically just a few KB), which makes them well suited to processing very large datasets and streams. I developed a library of highly performant data sketch implementations in Python.
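As a minimal illustration of the idea behind MinHash, one of the classic sketches, here is a self-contained Python version that estimates Jaccard similarity. This is a teaching sketch, not the library's implementation: the random permutations are simulated by salting a cryptographic hash.

```python
import hashlib

def _hash(i, value):
    # Simulate the i-th random permutation by salting a hash function.
    digest = hashlib.sha1(f"{i}:{value}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def minhash(items, num_perms=128):
    # Signature = per-"permutation" minimum hash over the set's members.
    # Memory is fixed by num_perms, independent of the set's size.
    return [min(_hash(i, v) for v in items) for i in range(num_perms)]

def jaccard_estimate(sig_a, sig_b):
    # The fraction of agreeing signature slots is an unbiased estimate
    # of the Jaccard similarity of the underlying sets.
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

a = {"apple", "banana", "cherry", "date"}
b = {"banana", "cherry", "date", "elderberry"}
est = jaccard_estimate(minhash(a), minhash(b))
# True Jaccard is 3/5 = 0.6; est should be close, with error on the
# order of 1/sqrt(num_perms).
```

The key property is that two signatures can be compared without ever revisiting the original sets, which is what makes sketch-based indexes over massive data lakes feasible.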




Here are the courses I have been involved in teaching at the University of Toronto.

Invited presentations