| Faculty name: | Renee Miller |
|---|---|
| Research area: | Databases |
| Campus address: | Bahen 7270 |
| Campus phone: | (416) 946-3621 |
| Email address: | miller [at] cs.toronto.edu |
| Number of students: | 1 |
Web data often comes as unstructured text or as semi-structured records. To support effective querying, this information needs to be converted into a structured format. For example, to execute a query at amazon.com that finds all 46" 120 Hz televisions, we need to identify and extract every television, along with its manufacturer, model, size, refresh frequency, etc., from the string record. Attributes such as year can be extracted with regular expressions, e.g., year ([0-9][0-9][0-9][0-9]). Categorical attributes such as TV manufacturer, model, and size, however, are more naturally captured using external tables called dictionaries.
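To make the distinction concrete, here is a minimal Python sketch that pulls a year out of a record with a regular expression and a manufacturer with a dictionary lookup. It is an illustration only, not the tool's implementation: the `MANUFACTURERS` set, the `extract` function, and the sample record mentioned after the code are all hypothetical.

```python
import re

# Hypothetical, hand-written dictionary; in the project, dictionaries
# like this are what the discovery tool is meant to produce.
MANUFACTURERS = {"samsung", "sony", "lg", "panasonic"}

# Four-digit year, following the pattern given in the text. The word
# boundaries keep it from matching digits inside a model number.
YEAR_RE = re.compile(r"\b([0-9][0-9][0-9][0-9])\b")


def extract(record: str) -> dict:
    """Return the attributes found in a product string record."""
    result = {}

    # Pattern-based attribute: a regular expression is enough.
    year = YEAR_RE.search(record)
    if year:
        result["year"] = int(year.group(1))

    # Categorical attribute: look each alphabetic token up in the dictionary.
    for token in re.findall(r"[A-Za-z]+", record):
        if token.lower() in MANUFACTURERS:
            result["manufacturer"] = token
            break

    return result
```

Calling `extract('Samsung UN46D6000 46" 120 Hz LED TV 2011')` would pick up the year through the pattern and the manufacturer through the lookup. A manufacturer missing from the set is silently skipped, which is why the coverage of the dictionary matters.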
For this project, you will build upon our Web-based Automated Dictionary Discovery tool. The tool takes as input a dataset and a set of parameters, and outputs a set of dictionaries. You will evaluate the performance of the underlying algorithms and consider new ways to improve the scalability and running time. This may involve building new data structures, re-designing some of the algorithms, and enhancing the functionality of the tool.