
Guide to the Provenance Vocabulary


The Provenance Vocabulary provides classes and properties to describe the provenance of data from the Web. Hence, this vocabulary enables providers of Web data to publish provenance-related metadata about their data. Notice, this vocabulary is not designed to describe provenance of other kinds of content such as documents. This guide explains how to use the Provenance Vocabulary. The document is aimed at developers of data publishing tools as well as at data providers who want to offer additional provenance information not provided by the tools used.


Introduction

The openness of the Web of Linked Data allows everyone to publish anything. Applications that are based on data from the Web have to evaluate the provenance of this data in order to estimate its reliability [1]. There are two main sources of provenance information about data: information recorded by the application that performs the provenance-based evaluation of the data, and information published by the providers of data or services. Only a small amount of provenance information about the processed data can be recorded by the applications themselves. Hence, to obtain more complete knowledge, the applications rely on provenance-related metadata from third parties such as the data providers. However, a recent study [2] revealed a general lack of provenance-related metadata about data on the Web. One reason, among others, might be the lack of suitable vocabularies to describe provenance of Web data. The Provenance Vocabulary aims to fill this void.

The effort for creation and for publication of this provenance-related metadata should be kept to a minimum. For this reason, it is encouraged to extend tools that publish data on the Web with a provenance component; this component should automatically generate provenance metadata where possible.

This guide to the Provenance Vocabulary is aimed at developers of data publishing tools as well as at data providers who want to offer additional provenance information not provided by the tools used. Users of the Provenance Vocabulary can discuss vocabulary-related questions on the prv-vocab-users mailing list. Issues with the vocabulary can be reported using the issue tracker.

All examples in this document are written in the Turtle RDF syntax. Throughout this document, the following namespaces are used:

@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

@prefix prv:      <http://purl.org/net/provenance/ns#> .
@prefix prvTypes: <http://purl.org/net/provenance/types#> .
@prefix prvFiles: <http://purl.org/net/provenance/files#> .
@prefix prvIV:    <http://purl.org/net/provenance/integrity#> .

@prefix cs:   <http://purl.org/vocab/changeset/schema#> .
@prefix irw:  <http://www.ontologydesignpatterns.org/ont/web/irw.owl#> .
@prefix http: <http://www.w3.org/2006/http#> .
@prefix doap: <http://usefulinc.com/ns/doap#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix void: <http://rdfs.org/ns/void#> .

Overview of the Provenance Vocabulary

The Provenance Vocabulary, which is defined as an OWL-DL ontology, is partitioned into a core ontology and supplementary modules. To avoid making the core ontology too complex, the modules provide less frequently used concepts and a broad range of extensions of the core concepts. At present, the Provenance Vocabulary has three modules: Types, Files, and Integrity Verification.

The vocabulary is designed very closely to the model for Web data provenance as presented in [2]. This model comprises two dimensions of Web data provenance: data creation and data access. Accordingly, the Provenance Vocabulary basically consists of three main parts: general terms, terms for data creation, and terms for data access.

Figure: Overview of the Provenance Vocabulary (ProvenanceVocabularyOverview.png)

The general terms include classes for the three general types of provenance elements introduced by the model: prv:Actor, prv:Execution, and prv:Artifact. prv:Actors are classified in prv:HumanActors and prv:NonHumanActors; prv:Artifacts are (mainly) classified in prv:DataItems and prv:Files.

Furthermore, the general terms include properties that relate individuals of the general classes with each other: an prv:Artifact was prv:yieldedBy an prv:Execution which may have used further prv:employedArtifacts. An prv:Execution was prv:performedBy an prv:Actor and might have had other prv:involvedActors. prv:Executions were prv:performedAt a specific time; an prv:Artifact might have been prv:serializedBy a prv:File; a prv:DataItem might have been prv:containedBy another prv:DataItem; a prv:DataItem might have been prv:precededBy a former version of this item; a prv:NonHumanActor was prv:operatedBy a prv:HumanActor; a prv:NonHumanActor may have prv:deployedSoftware. Notice, some of these properties are abstract (prv:yieldedBy, prv:involvedActor, and prv:employedArtifact) which means they are not intended to be used to describe instance data but to provide an abstract base for other properties.
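As a minimal sketch of how the non-abstract general properties fit together, the following description relates a data item to its containing item, its former version, and its serializing file, and relates a tool to its operator. The blank node labels and the use of http://example.org/Alice as operator are assumptions made for illustration only.

_:item rdf:type prv:DataItem ;
       prv:containedBy _:dataset ;    # the item is part of a larger data item
       prv:precededBy _:oldItem ;     # former version of this item
       prv:serializedBy _:file .      # file that encodes the item

_:dataset rdf:type prv:DataItem .
_:oldItem rdf:type prv:DataItem .
_:file    rdf:type prv:File .

_:tool rdf:type prv:NonHumanActor ;
       prv:operatedBy <http://example.org/Alice> .

<http://example.org/Alice> rdf:type prv:HumanActor .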

To allow for a wide range of applications the vocabulary does not prescribe a specific granularity by which provenance information has to be described. Hence, the classes defined in the core ontology are quite general. For instance, a prv:DataItem could be a whole RDF graph as well as a single RDF statement, depending on the granularity chosen. More specific specializations of the general classes are provided with the types module.
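To illustrate the point about granularity, both resources in the following sketch are typed as prv:DataItem; which of them a publisher describes depends on the granularity chosen. The labels _:graph and _:stmt are assumed for illustration.

# Coarse granularity: a whole RDF graph is the data item
_:graph rdf:type prv:DataItem ;
        rdf:type <http://www.w3.org/2004/03/trix/rdfg-1/Graph> .

# Fine granularity: a single RDF statement contained in that graph
_:stmt rdf:type prv:DataItem ;
       rdf:type rdf:Statement ;
       prv:containedBy _:graph .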

Describing Provenance with the Vocabulary

This section provides an informal overview on how to represent different kinds of provenance information with the Provenance Vocabulary. In general, data publishers are strongly encouraged to provide as much provenance information about their data as possible. This will support applications that evaluate provenance of processed data items. Furthermore, it is recommended to use (existing) HTTP-dereferenceable URIs for the identification of provenance elements where possible and, thus, to link to further information about these elements. This holds, in particular, for (human) actors (e.g. people, organizations, companies) that were involved in the history of described data items.
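The recommendation on identifiers can be sketched as follows. The first description identifies the actor by an HTTP-dereferenceable URI, so that consumers can look up further information about her; the second uses a blank node, which offers no such link. The labels _:creation1 and _:creation2 and the rdfs:label value are assumptions for illustration.

# Preferred: the actor is identified by an HTTP-dereferenceable URI
_:creation1 rdf:type prv:DataCreation ;
            prv:performedBy <http://example.org/Alice> .

# Less useful: the actor is an anonymous node that cannot be looked up
_:creation2 rdf:type prv:DataCreation ;
            prv:performedBy [ rdf:type prv:HumanActor ;
                              rdfs:label "Alice" ] .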

In the following, various examples illustrate the description of different kinds of data creations and data accesses. The section is completed by a discussion of related vocabularies that might enable further descriptions for certain aspects of provenance.

Data Creation

The terms in the data creation dimension describe how a prv:DataItem has been prv:createdBy a prv:DataCreation. The property prv:usedData refers to source prv:DataItems that have been used during the execution of the prv:DataCreation; prv:usedGuideline refers to prv:CreationGuidelines that have been used. Each prv:DataCreation has been prv:performedBy an prv:Actor which can be a prvTypes:DataCreatingEntity, a prvTypes:DataCreatingService, or a prvTypes:DataCreatingDevice as introduced by the types module.

A large number of data creations are based on the creation of a file that encodes the created data item. Since it is more convenient to describe these file-based data creations implicitly by referring to the creation of the file, the Provenance Vocabulary provides additional terms for these file-based descriptions. Therefore, it is also possible that a prv:File has been prv:createdBy a prv:DataCreation; this implies that the prv:DataItem that was prv:serializedBy the prv:File was also prv:createdBy the same prv:DataCreation. As an alternative to the property prv:usedData, it is possible to use the property prvFiles:usedDataFile, which refers to a prv:File that was serializing the source prv:DataItem used during a prv:DataCreation. Similarly, the property prvFiles:usedGuidelineFile refers to a prv:File that was serializing prv:CreationGuidelines.

Example: Manual Creation of an RDF Graph

The information that the manual creation of an RDF graph by Alice (represented by the URI http://example.org/Alice) was performed on July 10, 2009 can be described as follows:

<> rdf:type prv:DataItem ;
   rdf:type <http://www.w3.org/2004/03/trix/rdfg-1/Graph> ;
   prv:createdBy [ rdf:type prv:DataCreation ;
                   prv:performedAt "2009-07-10T12:00:00Z"^^xsd:dateTime ;
                   prv:performedBy <http://example.org/Alice> ] .

Example: Sensor Measurements

The following RDF data describes the provenance of a data item that was measured by a sensor; at the time the measurement was taken, Bob was responsible for this sensor.

_:a rdf:type prv:DataItem ;
    prv:createdBy [ rdf:type prv:DataCreation , prvTypes:Measurement ;
                    prv:performedAt "2009-07-10T12:00:00Z"^^xsd:dateTime ;
                    prv:performedBy <http://example.org/Sensor1> ] .

<http://example.org/Sensor1> rdf:type prv:Actor , prvTypes:Sensor ;
                             prv:operatedBy <http://example.org/Bob> .

Example: Data Transformation (Using Triplify)

The provenance of an RDF graph which was created by a Triplify service can be described as follows.

<> rdf:type prv:DataItem ;
   rdf:type <http://www.w3.org/2004/03/trix/rdfg-1/Graph> ;
   prv:createdBy [ rdf:type prv:DataCreation ;
                   prv:usedGuideline _:b3 ;
                   prv:performedBy <http://example.org/triplify> ] .

<http://example.org/triplify> rdf:type prvTypes:DataCreatingService ;
                              prv:operatedBy <http://example.org/Carol> ;
                              prv:deployedSoftware _:b1 .

_:b1 rdf:type doap:Version ;
     doap:revision "0.5" .
_:b2 rdf:type doap:Project ;
     doap:release _:b1 ;
     doap:homepage <http://triplify.org> .

As can be seen from the description, Carol was responsible for the sample Triplify service; the Triplify service runs Triplify v.0.5; during the creation, the service used a mapping identified by blank node _:b3. The mapping itself also has provenance. This provenance is indirectly part of the provenance of the created data item and, hence, related metadata should be added to the provenance description of the data item. For instance, the following data describes that the mapping was created by Carol.

_:b3 rdf:type prv:CreationGuideline ;
     rdf:type prvTypes:TriplifyMapping ;
     prv:createdBy [ prv:performedAt "2008-03-11T12:00:00Z"^^xsd:dateTime ;
                     prv:performedBy <http://example.org/Carol> ] .

Example: Coarse-Granular vs. Fine-Granular Descriptions

The previous example illustrates the description of the provenance of an RDF graph created by a Triplify service based on a mapping. As an alternative, the provenance can be described at a finer granularity, as the following description illustrates:

<> rdf:type prv:DataItem ;
   rdf:type <http://www.w3.org/2004/03/trix/rdfg-1/Graph> ;
   prv:createdBy [
               rdf:type prv:DataCreation ;
               prv:performedAt "2009-07-10T12:00:28Z"^^xsd:dateTime ;
               prv:usedGuideline [ rdf:type prv:CreationGuideline ;
                                   rdf:type prvTypes:QueryTemplate ;
                                   rdfs:label "SELECT ..." ;
                                   prv:containedBy _:b3 ];
               prv:performedBy <http://example.org/triplify>
                   ] .

According to this description the RDF graph was created by using a specific query template from the mapping introduced before.

It is also possible to provide an even more detailed description:

<> rdf:type prv:DataItem ;
   rdf:type <http://www.w3.org/2004/03/trix/rdfg-1/Graph> ;
   prv:createdBy [ rdf:type prv:DataCreation ;
                   prv:performedAt "2009-07-10T12:00:28Z"^^xsd:dateTime ;
                   prv:usedData _:d1 ;
                   prv:performedBy <http://example.org/triplify> ] .

_:d1 rdf:type prv:DataItem ;
     rdf:type prvTypes:QueryResult ;
     prv:createdBy [ rdf:type prv:DataCreation ;
                     rdf:type prvTypes:QueryExecution ;
                     prv:performedAt "2009-07-10T12:00:25Z"^^xsd:dateTime ;
                     prv:usedGuideline _:d2 ;
                     prv:performedBy <http://example.org/dbms> ] .

_:d2 rdf:type prv:CreationGuideline ;
     rdf:type prvTypes:SQLQuery ;
     rdfs:label "SELECT ..." ;
     prv:createdBy [
               rdf:type prv:DataCreation ;
               prv:performedAt "2009-07-10T12:00:24Z"^^xsd:dateTime ;
               prv:usedData [ rdf:type prvTypes:QueryTemplate ;
                              rdfs:label "SELECT ..." ;
                              prv:containedBy _:b3 ] ;
               prv:performedBy <http://example.org/triplify>
                   ] .

According to this description the RDF graph was created using a query result as source data. This query result, represented by the blank node _:d1, was created by a query execution that was performed by a database management system identified by http://example.org/dbms; the executed query was created from a query template which was part of the Triplify mapping.

Example: Change History

A very common type of provenance is information about the change history of data. A change to data is the creation of a new data item representing the new version of the data. In many cases this new data item was created using the data item that represents the preceding version as source data.

_:e1 rdf:type prv:DataItem ;
     rdf:type <http://www.w3.org/2004/03/trix/rdfg-1/Graph> ;
     foaf:primaryTopic <http://example.org/books/A1> ;
     prv:createdBy [ rdf:type prv:DataCreation ;
                     prv:performedAt "2009-07-10T12:00:28Z"^^xsd:dateTime ;
                     prv:performedBy <http://example.org/triplify> ;
                     prv:usedData _:e2 ;
                     prv:usedData _:e3 ;
                     prv:usedGuideline [ rdf:type cs:ChangeSet ;
                                         cs:subjectOfChange <http://example.org/books/A1> ;
                                         cs:addition _:e3 ]
                   ] ;
     prv:precededBy _:e2 .

_:e3 rdf:type rdf:Statement .

This example illustrates the description of the change history of an RDF graph. The blank node identifier _:e1 in the example represents the changed (i.e. new) version. This version has evolved from a previous version which is represented by the blank node _:e2. The actual changes are represented by a (virtual) guideline which is described using the Changeset vocabulary. As can be seen from the description, _:e1 was derived by adding statement _:e3 to _:e2. Notice, the preceding version and the added statement are source data items that were used during the creation of the new version. It is recommended to provide further provenance information about the preceding version as well as about the added statement.
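The recommendation to describe the provenance of the preceding version and of the added statement could be followed along these lines. This is only a sketch: the actors and timestamps shown for _:e2 and _:e3 are assumptions for illustration and are not given in the example above.

# Assumed for illustration: the preceding version was also created by the Triplify service
_:e2 rdf:type prv:DataItem ;
     rdf:type <http://www.w3.org/2004/03/trix/rdfg-1/Graph> ;
     prv:createdBy [ rdf:type prv:DataCreation ;
                     prv:performedAt "2009-07-09T12:00:00Z"^^xsd:dateTime ;
                     prv:performedBy <http://example.org/triplify> ] .

# Assumed for illustration: the added statement was created manually by Alice
_:e3 rdf:type prv:DataItem ;
     prv:createdBy [ rdf:type prv:DataCreation ;
                     prv:performedAt "2009-07-10T11:00:00Z"^^xsd:dateTime ;
                     prv:performedBy <http://example.org/Alice> ] .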

Example: File-based Data Creation

A very common case of creating a data item is the creation of a file that serializes the data item. For instance, an RDF graph that is encoded in an RDF/XML document resulting from an XSL transformation was created implicitly by that transformation. Such a file-based data creation can be described as illustrated by the following example:

_:rdfgraph rdf:type prv:DataItem ;
           prv:serializedBy _:rdfdoc ;
           prv:createdBy [ rdf:type prv:DataCreation ;
                           # ...
                           prv:usedData _:csvEncodedData ;
                           prv:usedGuideline _:transformationDef ] .

_:rdfdoc rdf:type prv:File ;
         irw:isEncodedIn <http://www.iana.org/assignments/media-types/application/rdf+xml> .

_:csvEncodedData rdf:type prv:DataItem ;
                 prv:serializedBy [
                  rdf:type prv:File ;
                  irw:isEncodedIn <http://www.iana.org/assignments/media-types/text/csv> 
                               ] .

_:transformationDef rdf:type prv:CreationGuideline ;
                    prv:serializedBy [
                      rdf:type prv:File ;
                      irw:isEncodedIn <http://www.iana.org/assignments/media-types/application/xslt+xml>
                                     ] .

This example describes that the RDF graph represented by blank node _:rdfgraph was created by using source data that was serialized in a CSV file. The creation was an XSL transformation of the CSV file, and it was guided by the transformation defined in an XSLT file.

While the RDF graph in the example was created by creating an RDF/XML document using a file-based transformation, the example describes the creation of the graph explicitly. For this kind of file-based data creation, where data items were created by creating the files that serialize them, it might be more intuitive for many users to describe the creation of the file instead of explicitly describing the creation of the data item serialized in the created file. For this reason, the property prv:createdBy can also be used for prv:Files. Furthermore, the files module of the Provenance Vocabulary provides the properties prvFiles:usedDataFile and prvFiles:usedGuidelineFile that can be used to refer to the prv:Files that were serializing the source data items and creation guidelines that have been used during a prv:DataCreation. The following description uses these properties and provides an alternative to the previous description:

_:rdfgraph prv:serializedBy _:rdfdoc .

_:rdfdoc rdf:type prv:File ;
         irw:isEncodedIn <http://www.iana.org/assignments/media-types/application/rdf+xml> ;
         prv:createdBy [ rdf:type prv:DataCreation ;
                         # ...
                         prvFiles:usedDataFile _:csvFile ;
                         prvFiles:usedGuidelineFile _:xsltFile ] .

_:csvFile rdf:type prv:File ;
          irw:isEncodedIn <http://www.iana.org/assignments/media-types/text/csv> .

_:xsltFile rdf:type prv:File ;
           irw:isEncodedIn <http://www.iana.org/assignments/media-types/application/xslt+xml> .

Notice, even if this example describes the creation of a file, this description should be understood as the description of a creation of a data item. In fact, the Provenance Vocabulary is defined to enable an OWL2 reasoner to infer that the data item that was serialized in the created file was created by the same prv:DataCreation as the file was. For instance, from the given description a reasoner can automatically infer that _:rdfgraph was prv:createdBy the data creation that _:rdfdoc was prv:createdBy. The sample description also refers to a source file and to a source guideline file. Nonetheless, the understanding here is still that the data and guideline encoded in the files were actually used for the creation of the resulting RDF graph. This could also be inferred by a reasoner, provided that the data serialized in the files had been identified in the description.
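The inferences discussed here can be sketched as follows. For readability the sketch names the data creation _:creation instead of using an anonymous node, and it introduces _:csvData as the data item serialized in the source file; both labels are assumptions for illustration. The last two statements are not asserted by the publisher; they show what a reasoner could be expected to derive.

# Asserted: file-based description with a named data creation
_:rdfgraph prv:serializedBy _:rdfdoc .
_:rdfdoc   prv:createdBy _:creation .
_:creation rdf:type prv:DataCreation ;
           prvFiles:usedDataFile _:csvFile .

# Additionally asserted: the data item serialized in the source file
_:csvData prv:serializedBy _:csvFile .

# Statements a reasoner could then be expected to infer
_:rdfgraph prv:createdBy _:creation .
_:creation prv:usedData _:csvData .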

Data Access

The data access dimension of the model for Web data provenance focuses on retrieving data items from the Web. When a Linked Data publishing system creates provenance-related metadata about a data item to be served, this item has not yet been retrieved. Hence, no access-related provenance information about the considered data item can be provided. However, it is recommended to provide information about the retrieval of source prv:DataItems and of prv:CreationGuidelines. The Provenance Vocabulary makes it possible to describe how an prv:Artifact has been prv:retrievedBy the execution of a prv:DataAccess from the Web. The retrieved prv:Artifact is an irw:WebRepresentation of the prv:accessedResource. The prv:accessedService is a prv:DataProvidingService which was prv:usedBy a prv:DataPublisher; furthermore, each prv:DataProvidingService is usually prv:operatedBy a prv:HumanActor. If the integrity of the retrieved prv:DataItem was checked, it is possible to describe the prvIV:IntegrityVerification and its prvIV:VerificationResult using terms defined in the integrity verification module.

Example: Retrieval of Source Data

The following RDF data describes provenance of a query result which was created by querying an RDF graph. Before creating this result the queried graph had been retrieved by the query engine. The queried graph is a specific representation of the data identified by URI http://acme.com/stores/bob/data/1245749021361.

_:f1 rdf:type prv:DataItem ;
     rdf:type prvTypes:QueryResult ;
     prv:createdBy [ rdf:type prv:DataCreation ;
                     rdf:type prvTypes:QueryExecution ;
                     prv:performedAt "2009-07-10T12:00:28Z"^^xsd:dateTime ;
                     prv:performedBy <http://example.org/sparqlendpoint> ;
                     # ...
                     prv:usedData _:f2 ] .

_:f2 rdf:type prv:DataItem ;
     rdf:type <http://www.w3.org/2004/03/trix/rdfg-1/Graph> ;
     prv:retrievedBy [ rdf:type prv:DataAccess ;
                       prv:performedAt "2009-07-10T12:00:21Z"^^xsd:dateTime ;
                       prv:performedBy <http://example.org/sparqlendpoint> ;
                       prv:accessedResource <http://acme.com/stores/bob/data/1245749021361> ] .

Example: Introducing the Accessed Service

In addition to the access time and the data accessor it is also possible to describe the service from which the accessed data item has been retrieved. For instance, the query result that was used by the Triplify service in Example: Coarse-Granular vs. Fine-Granular Descriptions was retrieved from the database management system identified by http://example.org/dbms. This information may be represented as follows:

_:d1 rdf:type prv:DataItem ;
     rdf:type prvTypes:QueryResult ;
     prv:createdBy [ rdf:type prv:DataCreation ;
                     rdf:type prvTypes:QueryExecution ;
                     # ...
                     prv:performedBy <http://example.org/dbms> ] ;
     prv:retrievedBy [ rdf:type prv:DataAccess ;
                       prv:performedAt "2009-07-10T12:00:24Z"^^xsd:dateTime ;
                       prv:performedBy <http://example.org/triplify> ;
                       prv:accessedService <http://example.org/dbms> ] .

Example: Publication Responsibilities 1

It is possible that the parties responsible for publishing a retrieved data item are known to the provider of the provenance information. In this case, this information should be made available, too. The Provenance Vocabulary distinguishes two kinds of responsibilities here: publishing the data and providing the publication service. Suppose that Bob, identified by URI http://example.org/Bob, uses the Linked Data server of the fictional company Acme Corp. to publish RDF data such as the RDF graph from Example: Retrieval of Source Data. This information can be added to the data as follows:

_:f2 rdf:type prv:DataItem ;
     rdf:type <http://www.w3.org/2004/03/trix/rdfg-1/Graph> ;
     prv:retrievedBy [ rdf:type prv:DataAccess ;
                       # ...
                       prv:accessedService _:h1 ] .

_:h1 rdf:type prv:DataProvidingService ;
     prv:usedBy <http://example.org/Bob> ;
     prv:operatedBy <http://dbpedia.org/resource/Acme_Corporation> .

Example: Publication Responsibilities 2

While a prv:DataPublisher may use a prv:DataProvidingService that is prv:operatedBy another party it is also possible that publishers use their own service.

_:i1 rdf:type prv:DataItem ;
     rdf:type <http://www.w3.org/2004/03/trix/rdfg-1/Graph> ;
     foaf:primaryTopic <http://www4.wiwiss.fu-berlin.de/eurostat/resource/countries/Polska> ;
     prv:retrievedBy [
             rdf:type prv:DataAccess ;
             prv:accessedService _:i3 ;
             prv:accessedResource <http://www4.wiwiss.fu-berlin.de/eurostat/data/countries/Polska>
                     ] .

_:i2 rdf:type void:Dataset ;
     rdfs:seeAlso <http://www4.wiwiss.fu-berlin.de/eurostat/all> ;
     void:exampleResource <http://www4.wiwiss.fu-berlin.de/eurostat/resource/countries/Polska> .

_:i3 rdf:type prv:DataProvidingService ;
     rdfs:label "D2R Server instance for the Eurostat dataset" ;
     rdfs:seeAlso <http://www4.wiwiss.fu-berlin.de/eurostat/> ;
     prv:usedBy <http://www4.wiwiss.fu-berlin.de/is-group/resource/groups/Group1> ;
     prv:operatedBy <http://www4.wiwiss.fu-berlin.de/is-group/resource/groups/Group1> ;
     prv:deployedSoftware _:i4 .

_:i4 rdf:type doap:Version ;
     doap:revision "0.7" .
<http://www4.wiwiss.fu-berlin.de/is-group/resource/projects/Project3> doap:release _:i4 .

This example describes an RDF graph that was retrieved as part of the Linked Data version of the Eurostat dataset. This graph was retrieved from a Linked Data server that runs D2R Server, v.0.7. The Web-based Systems Group of Freie Universität Berlin, identified by http://www4.wiwiss.fu-berlin.de/is-group/resource/groups/Group1, operates this server to publish the Linked Data version of Eurostat.

_:i5 rdf:type prv:DataItem ;
     rdf:type <http://www.w3.org/2004/03/trix/rdfg-1/Graph> ;
     foaf:primaryTopic <http://purl.org/net/tcm/tcm.lifescience.ntu.edu.tw/id/medicine/Ginkgo_biloba> ;
     prv:retrievedBy [
             rdf:type prv:DataAccess ;
             prv:accessedService _:i7 ;
             prv:accessedResource <http://purl.org/net/tcm/tcm.lifescience.ntu.edu.tw/data/medicine/Ginkgo_biloba>
                     ] .

_:i6 rdf:type void:Dataset ;
     rdfs:seeAlso <http://code.google.com/p/junsbriefcase/wiki/RDFTCMData> ;
     void:exampleResource <http://purl.org/net/tcm/tcm.lifescience.ntu.edu.tw/id/medicine/Ginkgo_biloba> .

_:i7 rdf:type prv:DataProvidingService ;
     rdfs:label "Pubby instance for the TCM dataset" ;
     prv:usedBy <http://users.ox.ac.uk/~zool0770/foaf.rdf#me> ;
     prv:operatedBy <http://users.ox.ac.uk/~zool0770/foaf.rdf#me> .

The second example describes an RDF graph that was retrieved as part of the Linked Data version of the Traditional Chinese Medicine Dataset. The representation of this graph was provided by a Linked Data server which is powered by Pubby. The researcher, identified by http://users.ox.ac.uk/~zool0770/foaf.rdf#me, operates this server to publish the Linked Data version of the dataset.

Example: Retrieval of Creation Guidelines

The previous examples focus on the retrieval of source data used during a data creation. In addition to source data many data creations involve the use of creation guidelines (cf. Example: Data Transformation (Using Triplify)). If these guidelines were not created by the operator of the corresponding data creation service (as in Example: Data Transformation (Using Triplify)) it should be described how and where they were retrieved from the Web. The following RDF data describes the provenance of an RDF graph that has been created using a Triplify mapping. This mapping was retrieved by Alice from a Web server that is used by the AKSW research group of the University of Leipzig.

<> rdf:type prv:DataItem ;
   prv:createdBy [ rdf:type prv:DataCreation ;
                   # ...
                   prv:usedGuideline _:j1 ] .

_:j1 rdf:type prv:CreationGuideline , prvTypes:TriplifyMapping ;
     rdfs:seeAlso <http://triplify.org/Configuration/WordPress/2.1> ;
     prv:retrievedBy [ rdf:type prv:DataAccess ;
                       prv:performedAt "2009-01-11T12:00:00Z"^^xsd:dateTime ;
                       prv:performedBy <http://example.org/Alice> ;
                       prv:accessedResource <http://triplify.org/Configuration/WordPress/2.1> ;
                       prv:accessedService _:j3 ] .

_:j3 rdf:type <http://www.ontologydesignpatterns.org/ont/web/irw.owl#WebServer> ;
     prv:usedBy _:j4 .

_:j4 rdf:type <http://swrc.ontoware.org/ontology#ResearchGroup> ;
     rdfs:label "Agile Knowledge Engineering and Semantic Web (AKSW)" ;
     rdfs:seeAlso <http://aksw.org> ;
     <http://swrc.ontoware.org/ontology#homepage> "http://aksw.org" .

Example: Services that Create the Provided Data

There are various cases in which a prv:DataProvidingService creates the provided data items itself; hence, this actor is also the prvTypes:DataCreatingService. For instance, a Triplify service provides Linked Data that it creates on the fly. The following description refers to the source data of a data creation. This source data was an RDF graph that has been retrieved from the Triplify service http://example.org/triplify2, which also created the graph.

_:k1 prv:createdBy [ rdf:type prv:DataCreation ;
                     # ...
                     prv:usedData _:k2 ] .

_:k2 rdf:type prv:DataItem ;
     rdf:type <http://www.w3.org/2004/03/trix/rdfg-1/Graph> ;
     prv:createdBy [ rdf:type prv:DataCreation ;
                     # ...
                     prv:performedBy <http://example.org/triplify2> ] ;
     prv:retrievedBy [ rdf:type prv:DataAccess ;
                       # ...
                       prv:accessedResource <http://example.org/ds2/data/object1> ;
                       prv:accessedService <http://example.org/triplify2> ] .

Other examples of data providing services that are also the data creating service are Web-accessible DBMS query services and SPARQL endpoints.
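For the SPARQL endpoint case, a description might look like the following sketch, which mirrors the Triplify example. The blank node label _:n1 and the explicit typing of the endpoint with both service classes are assumptions for illustration.

_:n1 rdf:type prv:DataItem ;
     rdf:type prvTypes:QueryResult ;
     prv:createdBy [ rdf:type prv:DataCreation ;
                     rdf:type prvTypes:QueryExecution ;
                     # the endpoint executed the query and thereby created the result
                     prv:performedBy <http://example.org/sparqlendpoint> ] ;
     prv:retrievedBy [ rdf:type prv:DataAccess ;
                       # the same endpoint also served the result
                       prv:accessedService <http://example.org/sparqlendpoint> ] .

<http://example.org/sparqlendpoint> rdf:type prv:DataProvidingService ,
                                             prvTypes:DataCreatingService .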

Example: Retrieval of Files

The Provenance Vocabulary can also be used to describe how a file has been retrieved from the Web, as the following description illustrates:

_:l1 prv:createdBy [ rdf:type prv:DataCreation ;
                     # ...
                     prvFiles:usedDataFile _:csvFile ] .

_:csvFile rdf:type prv:File ;
          prv:retrievedBy [ rdf:type prv:DataAccess ;
                            # ...
                            prv:accessedResource <http://example.org/files/mydata.csv> ] .

Related Vocabularies

The Provenance Vocabulary contains the fundamental terms to describe Web data provenance. However, it is not the aim of the Provenance Vocabulary to provide terms for very detailed descriptions of certain aspects of provenance. These details might be described using other, more focused vocabularies. Before using terms from such vocabularies it is strongly recommended to make sure that the semantics of these terms refers to things executed or created in the past, being consistent with the descriptions recorded in the provenance information. Exceptions can be made under the condition that, used in the context of describing provenance, the properties of these vocabularies are intended for describing facts that hold at the time of capturing the provenance description. This section briefly introduces some of the vocabularies that might be used together with the Provenance Vocabulary to provide more detailed information about certain aspects of the provenance of a data item.

Additional vocabularies that might be useful but are missing from the following discussion can be suggested at the prv-vocab-users mailing list. Suggestions that would be welcome in particular are vocabularies to describe services and the software they are running.

IRW Ontology

The Identity of Resources on the Web ontology (IRW) describes the identification of resources on the Web by defining the relationships between them and by describing their representations on the Web. IRW introduces classes such as irw:URI, irw:NonInformationResource, irw:InformationResource, irw:WebResource, irw:WebRepresentation, irw:WebServer (a specialization of prv:DataProvidingService), and irw:WebClient (equivalent to prvTypes:DataAccessor).

IRW may be used to provide more details about data accesses. For instance, the property irw:isEncodedIn might be used to describe the irw:MediaType of a prv:File; irw:isReferencedBy might be used to refer to the irw:URI of an irw:WebResource that is irw:realizedBy (i.e. represented by) a prv:DataItem or a prv:File retrieved from the Web (see the section on Dereferencable HTTP URIs below for an example).

HTTP Vocabulary in RDF

The HTTP Vocabulary in RDF makes it possible to describe the HTTP headers that have been exchanged between a client and a server during the execution of a data access. This vocabulary provides classes such as http:Connection, http:Request, http:Response, and http:MessageHeader.

The HTTP Vocabulary can serve as the basis for describing the execution of a data access on the Web in more detail. With this vocabulary it is possible to describe the messages exchanged during an HTTP-based data access, including 303 redirections and content negotiation that happened during the access. The types module of the Provenance Vocabulary provides the property prvTypes:exchangedHTTPMessage to refer to the http:Messages exchanged during a prvTypes:HTTPBasedDataAccess as illustrated in the following example:

<> rdf:type prv:DataItem ;
   prv:serializedBy [ rdf:type prv:File ;
                      prv:retrievedBy _:m1 ] .

_:m1 rdf:type prv:DataAccess , prvTypes:HTTPBasedDataAccess ;
     prv:accessedResource <http://dbpedia.org/data/Berlin.xml> ;
     prvTypes:exchangedHTTPMessage _:m2 ;
     prvTypes:exchangedHTTPMessage _:m3 ;
     prvTypes:exchangedHTTPMessage _:m4 ;
     prvTypes:exchangedHTTPMessage _:m5 .

_:m2 rdf:type http:Request ;
     http:httpVersion "1.1" ;
     http:methodName "GET" ;
     http:mthd <http://www.w3.org/2008/http-methods#GET> ;
     http:abs_path "/resource/Berlin" ;
     http:resp _:m3 ;
     http:headers (
           [ http:fieldName "Host";
             http:fieldValue "dbpedia.org";
             http:hdrName <http://www.w3.org/2008/http-header#host> ]
           [ http:fieldName "Accept";
             http:fieldValue "application/rdf+xml";
             http:hdrName <http://www.w3.org/2008/http-header#accept> ]
                  ) .

_:m3 rdf:type http:Response ;
     http:httpVersion "1.1" ;
     http:statusCodeNumber "303" ;
     http:sc <http://www.w3.org/2008/http-statusCodes#statusCode303> ;
     http:headers (
           [ http:fieldName "Location";
             http:fieldValue "http://dbpedia.org/data/Berlin.xml";
             http:hdrName <http://www.w3.org/2008/http-header#location> ]
           # ...
                  ) .

_:m4 rdf:type http:Request ;
     http:httpVersion "1.1" ;
     http:methodName "GET" ;
     http:mthd <http://www.w3.org/2008/http-methods#GET> ;
     http:abs_path "/data/Berlin.xml" ;
     # ...

Changeset Vocabulary

The Changeset Vocabulary describes changes to RDF-based resource descriptions. This vocabulary can be used to describe the change history of an RDF graph or of a linked dataset as illustrated in the change history example above.

Vocabularies for Organizations, Participation, and Roles

The Organization Ontology describes organizational structures, and the Participation Ontology provides terms to describe the roles of people within groups and organizations. These descriptions might be used to provide details about human actors that are part of the provenance of a data item. Even if such information about membership and roles is not strongly provenance-related, it could be very useful for assessing the data for which the provenance description is provided. Possible roles found in academic institutions are defined by the AIISO Roles Ontology. Similar definitions in the context of enterprises do not seem to exist yet.

Additional Vocabularies

Further vocabularies that might be useful are:

  • The Proof Markup Language (PML) describes justifications for results of an answering engine or an inference process. These results are data items. Hence, PML might be used to describe certain aspects of their creation.
  • The Ouzo Provenance Ontology presented in [3] describes the run of a (scientific) workflow, the processed data, and the entities responsible for the workflow run. A workflow run that produced a data item is an execution of a data creation. The Ouzo Provenance Ontology could be used to describe this type of data creation.
  • The SPIN SPARQL Syntax and the RDF Graph Patterns and Templates vocabulary allow users to describe (parts of) a SPARQL query. Hence, these vocabularies might be used to provide a detailed description of SPARQL queries that were executed to create query results. Such a query execution is a special kind of data creation for which the used guideline is the query.

Publishing Provenance-Related Metadata about Linked Data

While the Provenance Vocabulary may be used to describe the provenance of any kind of data, its creation was driven by the need for provenance-related metadata in the context of Linked Data. Making provenance information available in the Web of Linked Data requires not only a vocabulary but also guidance on how to publish the provenance information. For this reason, this section provides recommendations for publishing provenance-related metadata in the Web of Linked Data. These recommendations should be understood as a proposal, since a set of best practices still has to emerge. Publication practices can be discussed on the prv-vocab-users mailing list.

The primary location of metadata about a linked dataset is its voiD description. A voiD description should comprise general provenance information for the described dataset.
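
As a minimal sketch of what such general provenance information could look like, the following Turtle fragment attaches a creation to the dataset URI used in the DBpedia example of this guide. Typing a void:Dataset as a prv:DataItem is an assumption of this sketch, and the accessed source URI is only illustrative:

@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix prv:  <http://purl.org/net/provenance/ns#> .

# Assumption: the dataset as a whole is described as a data item.
<http://dbpedia.org/void/Dataset> rdf:type void:Dataset , prv:DataItem ;
    prv:createdBy [ rdf:type prv:DataCreation ;
                    # Illustrative source: data retrieved from Wikipedia
                    prv:usedData [ rdf:type prv:DataItem ;
                                   prv:retrievedBy [ rdf:type prv:DataAccess ;
                                                     prv:accessedResource <http://en.wikipedia.org/> ] ] ] .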

In addition to general provenance information about a linked dataset, it is recommended to provide more detailed information with each access to the dataset. There are basically three options to provide access to a linked dataset on the Web: dereferenceable HTTP URIs, RDF dumps, and SPARQL endpoints. While these options do not exclude each other, they require different approaches to publishing provenance.

Dereferenceable HTTP URIs

Publishing data about entities following the Linked Data principles requires the identification of these entities with URI references that can be resolved over the HTTP protocol into RDF data that describes the identified entity. Technically, a Linked Data publishing Web server provides representations of information resources from which RDF graphs can be extracted. These RDF graphs should contain provenance-related metadata about themselves and about the contained triples. Provenance of specific triples could be described using RDF reification. The provenance of the whole RDF graph should be expressed using a resource that represents the graph as illustrated in the following example:

<http://dbpedia.org/resource/Berlin> rdfs:label "Berlin"@en ;
                                     # ...
                                     georss:point "52.5 13.4" .

<> rdf:type prv:DataItem ;
   rdf:type <http://www.w3.org/2004/03/trix/rdfg-1/Graph> ;
   foaf:primaryTopic <http://dbpedia.org/resource/Berlin> ;
   informationrealization:realizes [
                          rdf:type irw:WebResource ;
                          irw:isReferencedBy [ rdf:type irw:URI ;
                                               irw:hasURIString "http://dbpedia.org/data/Berlin"^^xsd:anyURI ]
                                   ] ;
   prv:createdBy [ rdf:type prv:DataCreation ;
                   # ...
                   prv:usedData _:n1 ] .

<http://dbpedia.org/void/Dataset> rdf:type void:Dataset;
                                  void:exampleResource <http://dbpedia.org/resource/Berlin> .

_:n1 rdf:type prv:DataItem ;
     prv:retrievedBy [ rdf:type prv:DataAccess ;
                       # ...
                       prv:accessedResource <http://en.wikipedia.org/wiki/Berlin> ] ;
     informationrealization:realizes [
                   rdf:type irw:WebResource ;
                   irw:isReferencedBy [ rdf:type irw:URI ;
                                        irw:hasURIString "http://en.wikipedia.org/wiki/Berlin"^^xsd:anyURI ]
                                     ] .

The identifier <> represents the RDF graph that represents the data identified by http://dbpedia.org/data/Berlin. This relationship should be made explicit using the Information Realization ontology and the IRW ontology as demonstrated in the example. If possible, the provided provenance description should also comprise detailed provenance information about the source data (and creation guidelines) that were used during the creation of the RDF graph. Furthermore, the provenance description should cover the linked dataset the RDF graph belongs to. Instead of augmenting the graph itself with provenance metadata about its dataset, it is possible to simply link to the voiD description using an HTTP-dereferenceable URI that identifies the dataset (as can be seen in the example).
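
The option of describing the provenance of a specific triple via RDF reification, mentioned at the beginning of this section, could be sketched as follows. The fragment reifies the rdfs:label triple about Berlin and describes its creation. Typing the reified statement as a prv:DataItem is an assumption of this sketch, not a prescribed pattern:

@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix prv:  <http://purl.org/net/provenance/ns#> .

# Reified form of the triple
#   <http://dbpedia.org/resource/Berlin> rdfs:label "Berlin"@en .
_:s1 rdf:type rdf:Statement , prv:DataItem ;   # prv:DataItem typing is an assumption
     rdf:subject <http://dbpedia.org/resource/Berlin> ;
     rdf:predicate rdfs:label ;
     rdf:object "Berlin"@en ;
     prv:createdBy [ rdf:type prv:DataCreation ;
                     prv:usedData [ rdf:type prv:DataItem ;
                                    prv:retrievedBy [ rdf:type prv:DataAccess ;
                                                      prv:accessedResource <http://en.wikipedia.org/wiki/Berlin> ] ] ] .

Reification adds four triples per described statement, so it is probably best reserved for triples whose provenance differs from that of the surrounding graph.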

RDF dumps

A linked dataset can be provided as an RDF dump, that is, a (probably very large) document that contains the whole dataset serialized in one of the RDF serialization formats. Usually, an RDF dump represents a linked dataset as a single RDF graph. This graph could contain provenance-related metadata similar to the RDF graphs provided for dereferenceable URIs (cf. previous section). However, in this case the added provenance metadata describes the provenance of the whole dataset and is thus likely to be the same as the metadata provided with the voiD description of the dataset. In addition to this information, the metadata should also describe the provenance of the RDF dump itself.

It is also possible to serialize a linked dataset as a collection of Named Graphs. In this case each of these graphs could contain provenance-related metadata about itself. Alternatively, the document that serializes the collection of Named Graphs (using syntaxes such as TriX or TriG) could contain an additional Named Graph that describes the provenance of the other graphs.
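
The second alternative, a dedicated Named Graph that holds the provenance of the other graphs, could look like the following TriG sketch. The graph names under http://example.org/ are hypothetical, and the provenance description is reduced to a few terms used elsewhere in this guide:

@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix prv:  <http://purl.org/net/provenance/ns#> .

# A data graph of the dump (hypothetical graph name)
<http://example.org/graphs/berlin> {
    <http://dbpedia.org/resource/Berlin> rdfs:label "Berlin"@en .
}

# An additional graph that describes the provenance of the data graph
<http://example.org/graphs/provenance> {
    <http://example.org/graphs/berlin> rdf:type prv:DataItem ;
        rdf:type <http://www.w3.org/2004/03/trix/rdfg-1/Graph> ;
        prv:createdBy [ rdf:type prv:DataCreation ;
                        prv:usedData [ rdf:type prv:DataItem ;
                                       prv:retrievedBy [ rdf:type prv:DataAccess ;
                                                         prv:accessedResource <http://en.wikipedia.org/wiki/Berlin> ] ] ] .
}

Keeping the provenance in a separate graph leaves the data graphs unchanged, at the cost of requiring consumers to know which graph carries the metadata.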

SPARQL Endpoints

A third possibility to provide access to a linked dataset is via a SPARQL endpoint that enables the execution of SPARQL queries over the dataset. SPARQL, the query language for RDF data, defines four different query result forms: SELECT, CONSTRUCT, DESCRIBE, and ASK.

The result of CONSTRUCT and of DESCRIBE queries is an RDF graph. A provenance-enhanced SPARQL query engine could add provenance-related metadata to these result graphs.
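
A result graph enriched in this way might look like the following sketch. It assumes a hypothetical provenance-enhanced engine, assumes that the queried dataset can be referenced via prv:usedData, and attaches the executed query as the used guideline using prv:usedGuideline together with the SPIN terms sp:Construct and sp:text. This combination is an illustration rather than an established practice:

@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix prv:  <http://purl.org/net/provenance/ns#> .
@prefix sp:   <http://spinrdf.org/sp#> .

# Triple produced by the CONSTRUCT template
<http://dbpedia.org/resource/Berlin> rdfs:label "Berlin"@en .

# Provenance metadata added by the (hypothetical) query engine
<> rdf:type prv:DataItem ;
   prv:createdBy [ rdf:type prv:DataCreation ;
                   # Assumption: the queried dataset is treated as a data item
                   prv:usedData <http://dbpedia.org/void/Dataset> ;
                   # Assumption: the executed query is recorded as the guideline
                   prv:usedGuideline [ rdf:type sp:Construct ;
                                       sp:text "CONSTRUCT { <http://dbpedia.org/resource/Berlin> <http://www.w3.org/2000/01/rdf-schema#label> ?l } WHERE { <http://dbpedia.org/resource/Berlin> <http://www.w3.org/2000/01/rdf-schema#label> ?l }" ] ] .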

The result of a SELECT query is a set of variable bindings that can be represented as a table; ASK queries result in a boolean value. To exchange these types of results over the Web, SPARQL endpoints (i.e. Web services that implement the SPARQL protocol) usually serialize the results using the XML results format or the JSON format. How these result serializations can be extended with provenance descriptions remains future work. It should be noted that a vocabulary for SPARQL SELECT result sets exists; however, this vocabulary does not seem to be in wide use today.

References

[1] Olaf Hartig and Jun Zhao: Using Web Data Provenance for Quality Assessment. In Proceedings of the 1st Int. Workshop on the Role of Semantic Web in Provenance Management (SWPM) at ISWC, Washington, DC, USA, October 2009.

[2] Olaf Hartig: Provenance Information in the Web of Data. In Proceedings of the Linked Data on the Web (LDOW) Workshop at WWW, Madrid, Spain, April 2009.

[3] Jun Zhao: A Conceptual Model for e-Science Provenance. Ph.D. Thesis, University of Manchester, June 2007.
