Understanding Billions of Triples with Usage Summaries
Shahan Khatchadourian and Mariano P. Consens
Welcome to the website for our submission to the Semantic Web Challenge, Billion Triple Track (BTC) 2011. Last updated Oct 25, 2011.
Linked Data is a way to share and consume interlinked
semantic web datasets. Usage summaries can help to understand the
structure within and across interlinked datasets by partitioning
entities based on how they are described, such as grouping entities
that are instances of the same types and described with the same
predicates. Because Linked Data is growing to billions of triples,
scalable techniques for generating usage summaries are essential.
In this work, we implement a novel Hadoop-based technique for
generating usage summaries of billions of triples. We analyze and
compare usage summaries generated for the entire BTC 2010 and
2011 datasets. We generate usage summaries involving classes and
predicates, as well as usage summaries of recommended patterns, such
as those for inferencing and interlinking.
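To make the idea of partitioning entities by how they are described concrete, the following is a minimal in-memory sketch. It is not the Hadoop-based technique used in this work; the function name `usage_summary` and the `ex:` identifiers are hypothetical, and triples are assumed to be plain `(subject, predicate, object)` tuples.

```python
from collections import defaultdict

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def usage_summary(triples):
    """Partition subjects by the classes and predicates that describe them.

    Illustrative only: runs in memory on an iterable of (s, p, o) tuples,
    unlike the Hadoop-based technique described in the text.
    """
    classes = defaultdict(set)     # subject -> classes it is an instance of
    predicates = defaultdict(set)  # subject -> predicates used to describe it
    for s, p, o in triples:
        if p == RDF_TYPE:
            classes[s].add(o)
        else:
            predicates[s].add(p)

    # Subjects with identical (class set, predicate set) share one group.
    groups = defaultdict(set)
    for s in set(classes) | set(predicates):
        key = (frozenset(classes.get(s, ())), frozenset(predicates.get(s, ())))
        groups[key].add(s)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical example data, not taken from the BTC datasets.
    example = [
        ("ex:film1", RDF_TYPE, "ex:Film"),
        ("ex:film1", "ex:title", "First"),
        ("ex:film2", RDF_TYPE, "ex:Film"),
        ("ex:film2", "ex:title", "Second"),
        ("ex:person1", "ex:name", "Someone"),
    ]
    for (cls, preds), members in usage_summary(example).items():
        print(sorted(cls), sorted(preds), sorted(members))
```

In this sketch the two film entities fall into one group because they share the same class and predicate, while the untyped entity forms its own group.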
Downloads
- Extended BTC submission report (pdf)
- BTC poster slides (pdf)
- BTC submission (pdf)
- Result Datasets
Online Result Exploration
- SKOS usage neighbourhoods - interactive, shown and described in paper.
- FOAF usage neighbourhoods - interactive, shown and described in paper.
- Index usage neighbourhoods - interactive, described but not shown in paper.
- Inference usage neighbourhoods - interactive, described but not shown in paper.
- Interlink usage neighbourhoods - interactive, described but not shown in paper.
- Topic usage neighbourhoods - interactive, described but not shown in paper.
- CPO (class and predicate) usage neighbourhoods - not interactive, shown and described in paper.
- LinkedMDB interlink usage neighbourhoods - A Walkthrough - interactive, with sets of instances being described by each usage neighbourhood, not in paper.
- GoodRelations interlink usage neighbourhoods - interactive, with sets of instances being described by each usage neighbourhood, not in paper.
The tab-delimited result files have the following columns:
- Type of summary followed by the usage neighbourhood. A usage neighbourhood is ordered by bisimulation label (usage, graph, and entity if included). For example, the usage neighbourhood "(usage,interlink)[C+geonames.org, P+geonames.org, P+linkedmdb.org|http://www.w3.org/2000/01/rdf-schema#seeAlso, P+nytimes.com]" groups instances that are typed with one or more classes by geonames.org ("C+geonames.org"; the actual class entity is not considered), that are described with predicates used by nytimes.com ("P+nytimes.com"; the actual predicate entity is not considered), and that are interlinked by linkedmdb.org using the predicate "rdfs:seeAlso" ("P+linkedmdb.org|http://www.w3.org/2000/01/rdf-schema#seeAlso").
- Number of class and predicate usage neighbourhoods that the usage neighbourhood captures. Each usage neighbourhood is constructed by aggregating usage neighbourhoods from the most detailed summary: (usage, graph main hostname, entity URL). The interlink example above captures 8 different class and predicate usage neighbourhoods.
- Number of non-blank-node instances. The example interlink usage neighbourhood describes 16 instances.
- Number of blank-node instances. No blank nodes are described by the example usage neighbourhood.
- Total number of instances (the sum of blank-node and non-blank-node instances, 16 in this example). In some cases this field was not generated.
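As a rough illustration of how one row of a result file could be read, here is a minimal Python sketch. It assumes one record per line, no header row, an unquoted first column of the form "(summary type)[item, item, ...]", and no commas inside item URLs; these layout details, along with the names `UsageRow` and `parse_row`, are assumptions for illustration and not a specification of the files.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple

# Assumed shape of column 1: "(usage,interlink)[C+host, P+host|predicate, ...]"
LABEL_RE = re.compile(r"^\((?P<summary>[^)]*)\)\[(?P<items>.*)\]$")


@dataclass
class UsageRow:
    summary_type: Tuple[str, ...]  # e.g. ("usage", "interlink")
    neighbourhood: List[str]       # e.g. ["C+geonames.org", ...]
    captured: int                  # class and predicate neighbourhoods captured
    non_blank: int                 # non-blank-node instances
    blank: int                     # blank-node instances
    total: int                     # read from the file, or summed if missing


def parse_row(line: str) -> UsageRow:
    """Parse one tab-delimited row under the layout assumed above."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 4:
        raise ValueError("expected at least 4 tab-delimited columns")
    match = LABEL_RE.match(fields[0].strip())
    if match is None:
        raise ValueError("unrecognised usage neighbourhood label")
    summary_type = tuple(p.strip() for p in match.group("summary").split(","))
    items = [i.strip() for i in match.group("items").split(",") if i.strip()]
    captured, non_blank, blank = (int(f) for f in fields[1:4])
    # The total column was not always generated, so fall back to the sum.
    if len(fields) > 4 and fields[4].strip():
        total = int(fields[4])
    else:
        total = non_blank + blank
    return UsageRow(summary_type, items, captured, non_blank, blank, total)


if __name__ == "__main__":
    # Row built from the interlink example described in the column list.
    label = ("(usage,interlink)[C+geonames.org, P+geonames.org, "
             "P+linkedmdb.org|http://www.w3.org/2000/01/rdf-schema#seeAlso, "
             "P+nytimes.com]")
    row = parse_row("\t".join([label, "8", "16", "0", "16"]))
    print(row.summary_type, row.neighbourhood, row.total)
```

Falling back to the sum of the two instance counts mirrors the definition of the last column, so rows where that field was not generated can still be compared with the others.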
Our related work:
- Shahan Khatchadourian, Mariano P. Consens,
ExpLOD: Summary-Based Exploration of Interlinking and RDF Usage in the Linked Open Data Cloud.
ESWC 2010, Pages: 272-287.
online version
- Shahan Khatchadourian, Mariano P. Consens,
Exploring RDF Usage and Interlinking in the Linked Open Data Cloud using ExpLOD.
LDOW 2010 Demonstration.
online version