Understanding Billions of Triples with Usage Summaries

Shahan Khatchadourian and Mariano P. Consens

Welcome to the website for our Semantic Web Challenge, Billion Triple Track (BTC) 2011 submission. Last updated Oct 25, 2011.

Linked Data is a way to share and consume interlinked semantic web datasets. Usage summaries help to understand the structure within and across interlinked datasets by partitioning entities based on how they are described, for example by grouping entities that are instances of the same types and are described with the same predicates. Because Linked Data is growing to billions of triples, scalable techniques for generating usage summaries are essential. In this work, we implement a novel Hadoop-based technique for generating usage summaries of billions of triples. We analyze and compare usage summaries generated for the entire BTC 2010 and 2011 datasets. We generate usage summaries of classes and predicates, and of recommended patterns, such as those for inferencing and interlinking.
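To make the idea of partitioning entities by how they are described concrete, here is a minimal in-memory sketch in Python. It is not the Hadoop-based technique used in this work. It only groups subjects that share the same set of classes and the same set of predicates. The function name `class_predicate_groups` and the toy `ex:` data are hypothetical and chosen for illustration.

```python
from collections import defaultdict

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def class_predicate_groups(triples):
    """Group subjects that share the same set of classes and predicates."""
    classes = defaultdict(set)     # subject -> classes it is an instance of
    predicates = defaultdict(set)  # subject -> predicates that describe it
    for subject, predicate, obj in triples:
        if predicate == RDF_TYPE:
            classes[subject].add(obj)
        else:
            predicates[subject].add(predicate)

    groups = defaultdict(set)
    for subject in set(classes) | set(predicates):
        key = (frozenset(classes[subject]), frozenset(predicates[subject]))
        groups[key].add(subject)
    return dict(groups)


# Hypothetical toy data: ex:alice and ex:bob are described the same way,
# ex:carol additionally uses foaf:knows and so falls into a separate group.
FOAF = "http://xmlns.com/foaf/0.1/"
triples = [
    ("ex:alice", RDF_TYPE, FOAF + "Person"),
    ("ex:alice", FOAF + "name", "Alice"),
    ("ex:bob", RDF_TYPE, FOAF + "Person"),
    ("ex:bob", FOAF + "name", "Bob"),
    ("ex:carol", RDF_TYPE, FOAF + "Person"),
    ("ex:carol", FOAF + "name", "Carol"),
    ("ex:carol", FOAF + "knows", "ex:alice"),
]
groups = class_predicate_groups(triples)
for (cls, preds), members in groups.items():
    print(sorted(cls), sorted(preds), sorted(members))
```

Each key of `groups` plays the role of one class and predicate usage neighbourhood in this simplified setting, and the value is the set of instances it describes.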

Downloads

Extended BTC submission report (pdf)
BTC poster slides (pdf)
BTC submission (pdf)
Result Datasets

Online Result Exploration

SKOS usage neighbourhoods - interactive, shown and described in paper.
FOAF usage neighbourhoods - interactive, shown and described in paper.
Index usage neighbourhoods - interactive, described but not shown in paper.
Inference usage neighbourhoods - interactive, described but not shown in paper.
Interlink usage neighbourhoods - interactive, described but not shown in paper.
Topic usage neighbourhoods - interactive, described but not shown in paper.
CPO (class and predicate) usage neighbourhoods - not interactive, shown and described in paper.
LinkedMDB interlink usage neighbourhoods - A Walkthrough - interactive, with sets of instances being described by each usage neighbourhood, not in paper.
Goodrelations interlink usage neighbourhoods - interactive, with sets of instances being described by each usage neighbourhood, not in paper.

Result Files

Summary	BTC 2010	BTC 2011
nbr	download	download (650MB due to some usage neighbourhoods that contain a large number of blank node classes)
foaf	download	download
index	download	download
inference	download	download
interlink	download	download
skos	download	download
topic	download	download

The tab-delimited result files have the following columns:

Type of summary followed by the usage neighbourhood. A usage neighbourhood is ordered by bisimulation label (usage, graph, and entity if included). For example, the usage neighbourhood "(usage,interlink)[C+geonames.org, P+geonames.org, P+linkedmdb.org|http://www.w3.org/2000/01/rdf-schema#seeAlso, P+nytimes.com]" groups instances that are typed with one or more classes by geonames.org ("C+geonames.org"; the actual class entity is not considered), where predicates used by nytimes.com ("P+nytimes.com"; the actual predicate entity is not considered) describe instances that are interlinked by linkedmdb.org using the predicate "rdfs:seeAlso" ("P+linkedmdb.org|http://www.w3.org/2000/01/rdf-schema#seeAlso").
Number of class and predicate usage neighbourhoods that it captures. Each usage neighbourhood is constructed by aggregating usage neighbourhoods of the most detailed summary: (usage, graph main hostname, entity URL). The interlink example above captures 8 different class and predicate usage neighbourhoods.
Number of non-blank-node instances. This example interlink usage neighbourhood describes 16 instances.
Number of blank-node instances. No blank nodes are described with this usage neighbourhood.
Total number of instances (the sum of blank and non-blank instances, 16 in this example). In some cases this field was not generated.
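The column layout above can be read with a few lines of Python. The following is a sketch under stated assumptions rather than a reference parser: it assumes the first column always has the shape of the quoted interlink example, "(label)[member, member, ...]", that members are separated by a comma and a space, and that a missing total can be recomputed as the sum of the two instance counts. The names `Row`, `parse_member`, and `parse_row` are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Assumed shape of column 1, taken from the quoted example:
# "(usage,interlink)[C+geonames.org, P+geonames.org, ...]"
NEIGHBOURHOOD_RE = re.compile(r"^\((?P<label>[^)]*)\)\[(?P<members>.*)\]$")


@dataclass
class Row:
    label: Tuple[str, ...]                           # e.g. ("usage", "interlink")
    members: List[Tuple[str, str, Optional[str]]]    # (C or P, hostname, entity or None)
    captured: int                                    # class/predicate neighbourhoods captured
    non_blank: int                                   # non-blank-node instances
    blank: int                                       # blank-node instances
    total: int                                       # sum of the two counts above


def parse_member(text):
    """Split 'P+host|entity' into ('P', 'host', 'entity'); entity may be absent."""
    kind, _, rest = text.partition("+")
    host, sep, entity = rest.partition("|")
    return kind, host, (entity if sep else None)


def parse_row(line):
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 4:
        raise ValueError("expected at least 4 tab-separated columns")
    match = NEIGHBOURHOOD_RE.match(fields[0].strip().strip('"'))
    if match is None:
        raise ValueError("unrecognised usage neighbourhood: " + fields[0])
    label = tuple(part.strip() for part in match["label"].split(","))
    members = [parse_member(m) for m in match["members"].split(", ") if m]
    captured, non_blank, blank = (int(f) for f in fields[1:4])
    # The total column was not always generated, so fall back to the sum.
    has_total = len(fields) > 4 and fields[4].strip() != ""
    total = int(fields[4]) if has_total else non_blank + blank
    return Row(label, members, captured, non_blank, blank, total)


# Row built from the interlink example and the counts quoted above.
example_fields = [
    "(usage,interlink)[C+geonames.org, P+geonames.org, "
    "P+linkedmdb.org|http://www.w3.org/2000/01/rdf-schema#seeAlso, P+nytimes.com]",
    "8", "16", "0", "16",
]
row = parse_row("\t".join(example_fields))
print(row.label, row.captured, row.total)
```

Keeping the hostname and the optional entity separate makes it straightforward to filter rows by data source, for example all usage neighbourhoods in which linkedmdb.org contributes an interlinking predicate.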

Our related work:

Shahan Khatchadourian, Mariano P. Consens, ExpLOD: Summary-Based Exploration of Interlinking and RDF Usage in the Linked Open Data Cloud. ESWC 2010, Pages: 272-287. online version
Shahan Khatchadourian, Mariano P. Consens, Exploring RDF Usage and Interlinking in the Linked Open Data Cloud using ExpLOD. LDOW 2010 Demonstration. online version