| Faculty name: | Renee Miller |
|---|---|
| Research area: | Databases |
| Campus address: | Bahen 7270 |
| Campus phone: | (416) 946-3621 |
| Email address: | miller [at] cs.toronto.edu |
| Number of students: | 1 |
| Skills required: | |
BibBase, a live web application, facilitates the dissemination of scientific publications over the Internet as part of a linked open data cloud. The service supports the storage, retrieval, and sharing of bibliographic data within the scientific community, and simplifies the task of managing publications in a structured manner.
This project aims to improve the quality of the data on BibBase.org and the usability of its graphical user interface, with applications in many areas of database research. The work will make it possible to collect and manage more accurate feedback from the non-expert user base. The existing set of roles within the system will be extended with a registered-user role, in addition to the existing OpenID user role, so that the permissions associated with each role can be made more fine-grained. With this well-defined hierarchy of roles, we intend to weight the feedback we receive according to the trustworthiness of its source. Using this feedback mechanism and statistics about BibBase users, we intend to learn more about the relationships among data on the World Wide Web and to refine our current link and duplicate detection mechanisms.
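As an illustration of weighting feedback by the trustworthiness of its source, the following minimal Python sketch combines user votes on whether two entries are duplicates. The role names, the weight values, and the function `weighted_duplicate_score` are assumptions made for this example; they are not part of the existing BibBase system.

```python
# Hypothetical trust weight per role. Both the role names and the
# numeric values are assumptions chosen for illustration only.
ROLE_WEIGHTS = {
    "anonymous": 0.25,
    "openid": 0.5,
    "registered": 1.0,
    "admin": 2.0,
}


def weighted_duplicate_score(votes):
    """Combine votes on a candidate duplicate pair into a score in [-1, 1].

    votes: iterable of (role, is_duplicate) tuples.
    A positive score means the weighted feedback favours "duplicate",
    a negative score favours "not a duplicate", and 0.0 means no usable
    feedback or an exact tie.
    """
    total_weight = 0.0
    signed_weight = 0.0
    for role, is_duplicate in votes:
        # Unknown roles contribute nothing rather than raising an error.
        weight = ROLE_WEIGHTS.get(role, 0.0)
        total_weight += weight
        signed_weight += weight if is_duplicate else -weight
    if total_weight == 0.0:
        return 0.0
    return signed_weight / total_weight


if __name__ == "__main__":
    votes = [("admin", True), ("anonymous", False), ("anonymous", False)]
    print(weighted_duplicate_score(votes))
```

In this sketch a single vote from a highly trusted role can outweigh several votes from less trusted roles, which is one possible reading of the weighting described above.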
Further work will introduce a web-based administrative toolkit that allows users with appropriate system roles and permissions to carry out manual data cleaning tasks on the existing BibBase data (for example, merging and unmerging duplicate entries). The resulting clean dataset will serve as ground truth for experiments on automatically finding high-quality links among BibBase entities, as well as links to external data sources (such as DBpedia, WordNet, and others).
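One way such a toolkit might keep merges reversible is to record each merge as a pointer from the duplicate entry to its canonical entry, so that unmerging simply removes the pointer. The sketch below is a hypothetical illustration in Python; the class `EntryClusters` and the entry identifiers are assumptions and do not describe the actual BibBase data model.

```python
class EntryClusters:
    """Tracks which entry ids are treated as the same publication.

    Hypothetical sketch: each merge stores a pointer from a duplicate
    entry to the entry it was merged into, so a merge can be undone.
    """

    def __init__(self):
        self._merged_into = {}  # duplicate id -> id it was merged into

    def resolve(self, entry_id):
        """Follow merge pointers to the canonical id for entry_id."""
        while entry_id in self._merged_into:
            entry_id = self._merged_into[entry_id]
        return entry_id

    def merge(self, canonical_id, duplicate_id):
        """Mark duplicate_id as a duplicate of canonical_id."""
        canonical = self.resolve(canonical_id)
        duplicate = self.resolve(duplicate_id)
        # Entries already in the same cluster need no new pointer;
        # skipping them also prevents cycles.
        if canonical != duplicate:
            self._merged_into[duplicate] = canonical

    def unmerge(self, entry_id):
        """Undo the merge recorded for entry_id, if there is one."""
        self._merged_into.pop(entry_id, None)


if __name__ == "__main__":
    clusters = EntryClusters()
    clusters.merge("entry-1", "entry-2")
    print(clusters.resolve("entry-2"))
    clusters.unmerge("entry-2")
    print(clusters.resolve("entry-2"))
```

Keeping the original entries and storing only pointers means no data is destroyed by a merge, so an incorrect merge can be reversed without restoring anything from a backup.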
The challenges involved in linking open data are a subject of active research. The results of this project have wide applications in duplicate detection, data mining, data cleaning, and other areas of database research.