Understanding Billions of Triples with Usage Summaries

Shahan Khatchadourian and Mariano P. Consens

Welcome to our website for our Semantic Web Challenge, Billion Triple Track (BTC) 2011 submission - last updated Oct 25, 2011

Linked Data is a way to share and consume interlinked semantic web datasets. Usage summaries can help to understand the structure within and across interlinked datasets by partitioning entities based on how they are described, such as grouping entities that are instances of the same types and described with the same predicates. Because Linked Data is growing to billions of triples, scalable techniques for generating usage summaries are essential. In this work, we implement a novel Hadoop-based technique for generating usage summaries of billions of triples. We analyze and compare usage summaries generated for the entire BTC 2010 and 2011 datasets. We generate usage summaries involving classes and predicates, and of recommended patterns, such as for inferencing and interlinking.
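To make the idea of partitioning entities by how they are described concrete, the following is a minimal in-memory sketch of a class and predicate grouping. It is not the Hadoop-based technique used in this work; it only illustrates the grouping criterion on a small list of triples. The function name `class_predicate_summary` and the `ex:` identifiers are hypothetical and chosen for illustration.

```python
from collections import defaultdict

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def class_predicate_summary(triples):
    """Group subjects that share the same set of classes and predicates.

    `triples` is an iterable of (subject, predicate, object) strings.
    Returns a dict mapping (classes, predicates) -> set of subjects,
    where both parts of the key are frozensets.
    """
    classes = defaultdict(set)
    predicates = defaultdict(set)
    for s, p, o in triples:
        if p == RDF_TYPE:
            classes[s].add(o)      # the subject is an instance of class o
        else:
            predicates[s].add(p)   # the subject is described with predicate p

    groups = defaultdict(set)
    for s in classes.keys() | predicates.keys():
        key = (frozenset(classes[s]), frozenset(predicates[s]))
        groups[key].add(s)
    return dict(groups)


# Hypothetical example data: two persons and one document.
example = [
    ("ex:a", RDF_TYPE, "ex:Person"), ("ex:a", "ex:name", "A"),
    ("ex:b", RDF_TYPE, "ex:Person"), ("ex:b", "ex:name", "B"),
    ("ex:c", RDF_TYPE, "ex:Document"), ("ex:c", "ex:title", "C"),
]
for (cls, preds), subjects in class_predicate_summary(example).items():
    print(sorted(cls), sorted(preds), sorted(subjects))
```

Each key plays the role of one group in the summary: `ex:a` and `ex:b` fall into the same group because they have the same types and predicates, while `ex:c` forms its own group.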

Downloads

  1. Extended BTC submission report (pdf)
  2. BTC poster slides (pdf)
  3. BTC submission (pdf)
  4. Result Datasets

Online Result Exploration

  1. SKOS usage neighbourhoods - interactive, shown and described in paper.
  2. FOAF usage neighbourhoods - interactive, shown and described in paper.
  3. Index usage neighbourhoods - interactive, described but not shown in paper.
  4. Inference usage neighbourhoods - interactive, described but not shown in paper.
  5. Interlink usage neighbourhoods - interactive, described but not shown in paper.
  6. Topic usage neighbourhoods - interactive, described but not shown in paper.
  7. CPO (class and predicate) usage neighbourhoods - not interactive, shown and described in paper.
  8. LinkedMDB interlink usage neighbourhoods - A Walkthrough - interactive, with sets of instances being described by each usage neighbourhood, not in paper.
  9. GoodRelations interlink usage neighbourhoods - interactive, with sets of instances being described by each usage neighbourhood, not in paper.

Result Files

Summary      BTC 2010    BTC 2011
nbr          download    download (650MB due to some usage neighbourhoods that contain a large number of blank node classes)
foaf         download    download
index        download    download
inference    download    download
interlink    download    download
skos         download    download
topic        download    download

The tab-delimited result files have the following columns:

  1. Type of summary followed by the usage neighbourhood. A usage neighbourhood is ordered by bisimulation label (usage, graph, and entity if included). For example, the usage neighbourhood "(usage,interlink)[C+geonames.org, P+geonames.org, P+linkedmdb.org|http://www.w3.org/2000/01/rdf-schema#seeAlso, P+nytimes.com]" groups instances that are typed with one or more classes by geonames.org ("C+geonames.org"; the actual class entity is not considered), that are described with predicates used by nytimes.com ("P+nytimes.com"; the actual predicate entity is not considered), and that are interlinked by linkedmdb.org using the predicate "rdfs:seeAlso" ("P+linkedmdb.org|http://www.w3.org/2000/01/rdf-schema#seeAlso").
  2. Number of class and predicate usage neighbourhoods that it captures. Each usage neighbourhood is constructed by aggregating usage neighbourhoods from the most detailed summary: (usage, graph main hostname, entity URL). The interlink example above captures 8 different class and predicate usage neighbourhoods.
  3. Number of non-blank-node instances. This example interlink usage neighbourhood describes 16 instances.
  4. Number of blank-node instances. No blank nodes are described with this usage neighbourhood.
  5. Total number of instances (the sum of blank and non-blank instances, 16 in this example). In some cases this field was not generated.
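
The column layout above can be read with a few lines of Python. This is a minimal sketch, assuming that the first column has exactly the shape of the interlink example (a parenthesised summary type followed by a bracketed list of labels separated by ", ") and that a label has the form usage+graph, optionally followed by "|" and an entity URL. The names `Label`, `ResultRow`, `parse_label`, and `parse_row` are hypothetical helpers, not part of the result files or of any tool from this work. When the fifth column is missing, the sketch computes the total as the sum of columns 3 and 4.

```python
import re
from typing import List, NamedTuple, Optional, Tuple


class Label(NamedTuple):
    usage: str              # "C" for class usage, "P" for predicate usage
    graph: str              # main hostname of the graph, e.g. "geonames.org"
    entity: Optional[str]   # entity URL, or None when it is not considered


class ResultRow(NamedTuple):
    summary: Tuple[str, ...]   # e.g. ("usage", "interlink")
    labels: List[Label]
    captured: int              # column 2
    non_blank: int             # column 3
    blank: int                 # column 4
    total: int                 # column 5, or non_blank + blank if absent


# Assumed shape of column 1: "(summary,type)[label, label, ...]"
_NEIGHBOURHOOD = re.compile(r"^\((?P<summary>[^)]*)\)\[(?P<labels>.*)\]$")


def parse_label(text: str) -> Label:
    usage, _, rest = text.partition("+")
    graph, _, entity = rest.partition("|")
    return Label(usage, graph, entity or None)


def parse_row(line: str) -> ResultRow:
    fields = line.rstrip("\n").split("\t")
    match = _NEIGHBOURHOOD.match(fields[0].strip())
    if match is None:
        raise ValueError("unexpected usage neighbourhood: %r" % fields[0])
    summary = tuple(part.strip() for part in match["summary"].split(","))
    labels = [parse_label(part.strip())
              for part in match["labels"].split(", ") if part.strip()]
    captured, non_blank, blank = (int(f) for f in fields[1:4])
    has_total = len(fields) > 4 and fields[4].strip() != ""
    total = int(fields[4]) if has_total else non_blank + blank
    return ResultRow(summary, labels, captured, non_blank, blank, total)


# The interlink example from the column descriptions, written as one row.
example_line = "\t".join([
    "(usage,interlink)[C+geonames.org, P+geonames.org, "
    "P+linkedmdb.org|http://www.w3.org/2000/01/rdf-schema#seeAlso, "
    "P+nytimes.com]",
    "8", "16", "0", "16",
])
row = parse_row(example_line)
print(row.summary, row.captured, row.non_blank, row.blank, row.total)
for label in row.labels:
    print(label)
```

Splitting labels on ", " is a simplification: an entity URL that itself contains a comma followed by a space would be split incorrectly, so a more careful parser may be needed for the full files.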

Our related work: