The goal of ODCleanStore is to build a server which will store, clean, link, and score incoming RDF data and provide aggregated and integrated views on the data to Linked Data consumers. The motivation behind the project is described in the specification.
ODCleanStore accepts arbitrary RDF data, together with provenance metadata, through the webservice for publishers. The data are stored in the dirty database, where they are cleaned, scored, linked to other data, etc. Subsequently, the data are moved to the clean database, where they can be queried through the webservice for consumers. The response to a query consists of relevant RDF triples together with their provenance information and a quality estimate. OpenLink Virtuoso is used for storing the data.
The webservices will communicate in standard formats (RDF/XML, TriG) in order to integrate with an arbitrary producer or consumer of data.
A data acquisition module, Strigil, which will obtain information from (X)HTML pages or Excel spreadsheets, convert it to RDF, and feed it to ODCleanStore, is currently under development. In the future, a data visualization and analysis module built on the webservice for consumers will be developed.
Data accepted through the webservice for publishers are stored as named graphs in the dirty database. The ODCleanStore Engine takes these named graphs and runs them through a pipeline of transformers. A transformer is a Java class implementing a defined interface. Each transformer may modify the processed named graph (e.g. normalize values, deal with blank nodes) or attach a new named graph (e.g. quality assessment results, links to the data in the clean database, or links to other datasets). Custom transformers can easily be plugged into an arbitrary place in the processing pipeline.
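To make the pipeline idea concrete, here is a minimal, hypothetical sketch of the transformer concept in Java. The `Transformer` interface, the `Triple` record, and the `WhitespaceNormalizer` class are illustrative names only; the actual ODCleanStore interface is richer (it operates on named graphs stored in the database, not on in-memory lists).

```java
import java.util.ArrayList;
import java.util.List;

public class TransformerSketch {

    /** A single RDF triple, simplified to plain strings for illustration. */
    record Triple(String subject, String predicate, String object) {}

    /** Hypothetical transformer interface: receives the triples of a named
     *  graph and may modify them or derive new ones. */
    interface Transformer {
        List<Triple> transform(List<Triple> namedGraph);
    }

    /** Example transformer: trims whitespace in object values. */
    static class WhitespaceNormalizer implements Transformer {
        public List<Triple> transform(List<Triple> namedGraph) {
            List<Triple> out = new ArrayList<>();
            for (Triple t : namedGraph) {
                out.add(new Triple(t.subject(), t.predicate(), t.object().trim()));
            }
            return out;
        }
    }

    public static void main(String[] args) {
        List<Triple> graph = List.of(
            new Triple("ex:berlin", "rdfs:label", "  Berlin "));
        List<Triple> cleaned = new WhitespaceNormalizer().transform(graph);
        System.out.println(cleaned.get(0).object()); // prints "Berlin"
    }
}
```

Because each transformer exposes the same interface, the engine can chain any number of them in a pipeline and custom transformers can be inserted at any position.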
Several transformers are of special importance for the integration of data from various sources and for quality assessment, and are integrated into the web user interface: Data Normalization, Quality Assessment, and Object Identification. These are described in more detail below.
Once a named graph has passed through all the transformers in the pipeline, it is moved to the clean database and made available for queries.
Stored data can be accessed through a RESTful webservice. Two types of queries are supported: URI queries and keyword queries. Relevant triples from the clean database are returned for each query. Because the triples may originate from various sources of varying quality, modeled with different ontologies, the data are aggregated according to aggregation settings which may be supplied with the query.
The returned RDF triples are accompanied by the sources they come from and by a quality estimate based on the quality of the sources and on conflicts detected during the aggregation phase. More information about the provenance and quality score of each source named graph may be requested.
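As an illustration of the aggregation idea, the following Java sketch keeps the value from the highest-scored source and lowers the quality estimate when other sources disagree. This is only a toy model under assumed names (`Candidate`, `aggregateBest`); ODCleanStore's actual aggregation settings and scoring formula are more elaborate.

```java
import java.util.List;

public class AggregationSketch {

    /** A candidate object value for one (subject, predicate) pair,
     *  together with its source graph and that graph's quality score. */
    record Candidate(String value, String sourceGraph, double sourceScore) {}

    /** Aggregated result: the chosen value and a simple quality estimate. */
    record Result(String value, double quality) {}

    /** Toy "best source" aggregation: keep the value from the highest-scored
     *  source; penalize the estimate for each disagreeing source. */
    static Result aggregateBest(List<Candidate> candidates) {
        Candidate best = candidates.stream()
            .max((a, b) -> Double.compare(a.sourceScore(), b.sourceScore()))
            .orElseThrow();
        long disagreeing = candidates.stream()
            .filter(c -> !c.value().equals(best.value())).count();
        double quality = best.sourceScore() / (1 + disagreeing);
        return new Result(best.value(), quality);
    }

    public static void main(String[] args) {
        // Two sources disagree on a population value.
        List<Candidate> population = List.of(
            new Candidate("3500000", "graph-A", 0.9),
            new Candidate("3400000", "graph-B", 0.6));
        Result r = aggregateBest(population);
        System.out.println(r.value() + " (quality " + r.quality() + ")");
        // prints "3500000 (quality 0.45)"
    }
}
```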
In addition to URI and keyword queries, limited access to the clean database will be provided through a SPARQL endpoint.
Basic configuration of the whole application will be done through a simple website, based on the Apache Wicket framework.
The website will allow managing user accounts (restricting permissions to use various parts of the website and to insert data through input services) and ontologies, and configuring (custom) transformers and the engine.
The configuration is expected to be done through simple HTML forms. A more convenient (AJAX-based) user interface might be implemented in the future.
To provide a smooth user experience, methods of sharing user accounts across all related projects (especially the storage and the scraper) are being considered. Target users could then administer all projects through their individual websites without re-authenticating.
Data Normalization and Quality Assessment are special implementations of transformers.
Data Normalization is intended to be applied early in the data evaluation process to simplify the work of the other transformers. Its main goal is to remove inconsistencies in the forms in which the data is provided. This is achieved by rules that pair a pattern (data complying with certain conditions) with the transformation to apply to it. These pairs of patterns and transformations are stored in the database as rules, and the set of all rules can be modified through the web frontend.
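A minimal sketch of the pattern/transformation pairing, using a Java regex as a stand-in for a real pattern (the actual rules, their database storage, and the web frontend for editing them are out of scope here):

```java
import java.util.List;
import java.util.regex.Pattern;

public class NormalizationRuleSketch {

    /** A rule: values matching the pattern are rewritten by the replacement. */
    record Rule(Pattern pattern, String replacement) {}

    /** Apply every rule in order to a literal value. */
    static String applyRules(String literal, List<Rule> rules) {
        for (Rule r : rules) {
            literal = r.pattern().matcher(literal).replaceAll(r.replacement());
        }
        return literal;
    }

    public static void main(String[] args) {
        // Example rule: unify date formats DD.MM.YYYY -> YYYY-MM-DD.
        List<Rule> rules = List.of(new Rule(
            Pattern.compile("(\\d{2})\\.(\\d{2})\\.(\\d{4})"), "$3-$2-$1"));
        System.out.println(applyRules("24.05.2012", rules)); // prints "2012-05-24"
    }
}
```

Normalizing such representational differences early means later transformers (and the final aggregation) can compare values directly instead of each re-implementing the same cleanup.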
Quality Assessment assigns a score to each graph based on the coefficients of the different patterns present in the graph. Each time the score of a graph changes, the total score of its publisher (domain of origin) is updated. Again, the rules, formed of patterns and coefficients, are supplied as a special resource to this particular transformer.
The patterns are described as SPARQL conditions.
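The scoring mechanism can be sketched as follows, under loud assumptions: plain Java predicates stand in for SPARQL conditions, matched rules multiply the graph score by their coefficient, and the publisher score is taken as the average over that publisher's graphs. The real formula in ODCleanStore may differ.

```java
import java.util.List;
import java.util.function.Predicate;

public class QualityAssessmentSketch {

    /** A rule: a pattern (here a Java predicate instead of a SPARQL
     *  condition) and a penalty coefficient in (0, 1]. */
    record Rule(Predicate<List<String>> pattern, double coefficient) {}

    /** Start at 1.0 and multiply in the coefficient of every matched rule. */
    static double scoreGraph(List<String> graphTriples, List<Rule> rules) {
        double score = 1.0;
        for (Rule r : rules) {
            if (r.pattern().test(graphTriples)) score *= r.coefficient();
        }
        return score;
    }

    /** Assumed publisher score: average of the publisher's graph scores. */
    static double publisherScore(List<Double> graphScores) {
        return graphScores.stream().mapToDouble(Double::doubleValue)
            .average().orElse(0.0);
    }

    public static void main(String[] args) {
        // Rule: penalize graphs containing an empty population literal.
        List<Rule> rules = List.of(new Rule(
            g -> g.stream().anyMatch(t -> t.contains("ex:population \"\"")), 0.8));
        double s = scoreGraph(List.of("ex:berlin ex:population \"\" ."), rules);
        System.out.println(s);                               // prints "0.8"
        System.out.println(publisherScore(List.of(s, 1.0))); // prints "0.9"
    }
}
```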
Object identification (or linking) is also a special implementation of a transformer.
The main purpose of this process is to interlink URIs which represent the same real-world entity by generating owl:sameAs links. It can also be used to create other types of links between otherwise related URIs. The Silk framework is used as the linking engine. Sets of linkage rules for the engine are written in Silk-LSL, stored in the database, and can be managed through our web frontend.
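As a toy illustration of what a linkage rule does, the sketch below emits an owl:sameAs triple for every pair of URIs whose labels match after trivial normalization. In ODCleanStore this comparison logic is expressed declaratively in Silk-LSL and executed by the Silk engine, not hand-coded like this.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class LinkingSketch {

    /** Emit an owl:sameAs triple for each pair of URIs whose normalized
     *  labels are equal (a stand-in for a real Silk-LSL comparison). */
    static List<String> linkByLabel(Map<String, String> sourceLabels,
                                    Map<String, String> targetLabels) {
        List<String> links = new ArrayList<>();
        for (var s : sourceLabels.entrySet()) {
            for (var t : targetLabels.entrySet()) {
                if (s.getValue().trim().equalsIgnoreCase(t.getValue().trim())) {
                    links.add("<" + s.getKey() + "> owl:sameAs <" + t.getKey() + "> .");
                }
            }
        }
        return links;
    }

    public static void main(String[] args) {
        List<String> links = linkByLabel(
            Map.of("http://example.org/Berlin", "Berlin"),
            Map.of("http://dbpedia.org/resource/Berlin", "berlin "));
        links.forEach(System.out::println);
    }
}
```

Real linkage rules typically combine several such comparisons (string similarity metrics, numeric distances) with thresholds, which is exactly what Silk-LSL lets rule authors express without writing code.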