ODCleanStore provides means for storing, cleaning, linking, and scoring incoming RDF data, and provides aggregated and integrated views of the data to Linked Data consumers. In addition, we support the trustworthiness of the data with quality assessment and provenance tracking. Our goal is to create a data store that is easy to deploy and ready for use inside an enterprise or organization.
Our focus is on data processing and on queries over cleaned data. Nevertheless, the extraction process that feeds data to ODCleanStore is also important: a related project, Strigil, implements a web scraper and document extractor that produces RDF data and integrates with ODCleanStore as a store for the produced data.
Linked Data Manager (LDM) is a Java-based Linked (Open) Data management suite for scheduling and monitoring the Extract-Transform-Load (ETL) jobs required by web-based Linked Open Data portals, as well as for sustainable data management and data integration.
The LDM data processing pipeline is similar to the data processing pipeline in ODCleanStore. LDM is a counterpart of ODCleanStore in that it provides facilities for managing the extraction process but does not provide any permanent storage or direct access to the data. Thus an LDM Loader could be used to send data to ODCleanStore, and the data could then be accessed from there. Cooperation with LDM is currently being considered.
Linked Data Integration Framework (LDIF)
LDIF is an open-source Linked Data Integration Framework that can be used by Linked Data applications to translate Web data and normalize URIs while keeping track of data provenance.
The framework consists of a Scheduler, Data Import and an Integration component with a set of pluggable modules.
LDIF components encompass the whole process from data import and processing to integration and quality assessment. We use some LDIF components internally in ODCleanStore (Silk). The main difference is that LDIF is a framework that other applications can build on, while ODCleanStore is a ready-to-use solution that can be easily deployed and managed via a web interface. Differences in quality assessment and data aggregation with Sieve, a part of the LDIF framework, are described below.
Provenance in LDIF - see Figure 2 of LOD2 Deliverable 4.3.2
Sieve adds quality assessment and data fusion capabilities to the LDIF architecture. It uses metadata about named graphs to assess data quality, and is agnostic to provenance vocabulary and quality models. Sieve uses customizable scoring functions to output data quality descriptors. Based on these quality descriptors (and, optionally, other descriptors), Sieve can use configurable FusionFunctions to clean the data according to task-specific requirements.
Sieve offers functionality similar to our Conflict Resolution component; however, the purpose of Sieve in LDIF is different: it aggregates data at load time, when the data are stored in the clean database (unlike Conflict Resolution, which is used at query time). This may be suitable when the desired data are known in advance, but it is not sufficient for open Web environments, where every consumer has different requirements on the aggregated data. Furthermore, ODCleanStore provides a quality score for each result statement, whereas Sieve computes quality only for whole named graphs.
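The two-step approach described for Sieve (score each named graph from its metadata, then fuse conflicting values using those scores) can be illustrated with a minimal Python sketch. This is not Sieve's actual API or configuration format: the names `Quad`, `recency_score`, `score_graphs`, and `keep_best`, the choice of recency as the quality indicator, and the sample data are all assumptions made for illustration only.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Quad:
    """An RDF statement together with the named graph it came from."""
    subject: str
    predicate: str
    obj: str
    graph: str


def recency_score(last_updated: date, today: date, max_age_days: int = 365) -> float:
    """Hypothetical scoring function: 1.0 for a graph updated today,
    falling linearly to 0.0 once the graph is max_age_days old."""
    age = (today - last_updated).days
    return max(0.0, 1.0 - age / max_age_days)


def score_graphs(graph_metadata: dict[str, date], today: date) -> dict[str, float]:
    """Compute one quality score per named graph from its metadata."""
    return {g: recency_score(updated, today) for g, updated in graph_metadata.items()}


def keep_best(quads: list[Quad], scores: dict[str, float]) -> list[Quad]:
    """Hypothetical fusion function: for each (subject, predicate) pair,
    keep only the value coming from the highest-scored graph."""
    best: dict[tuple[str, str], Quad] = {}
    for q in quads:
        key = (q.subject, q.predicate)
        if key not in best or scores[q.graph] > scores[best[key].graph]:
            best[key] = q
    return list(best.values())


# Illustrative data: two graphs disagree on the same property value.
today = date(2024, 1, 1)
metadata = {"g1": date(2023, 1, 1), "g2": date(2023, 12, 1)}
quads = [
    Quad("ex:Prague", "ex:population", "1200000", "g1"),
    Quad("ex:Prague", "ex:population", "1300000", "g2"),
]

scores = score_graphs(metadata, today)
fused = keep_best(quads, scores)
```

Note that in this sketch every statement inherits the score of its graph, which mirrors the graph-level granularity attributed to Sieve above; it does not model the per-statement quality that ODCleanStore reports at query time.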
** Karma **
Integration systems in relational databases
The problem of integration of heterogeneous data (solved in ODCleanStore for RDF data) is solved by several systems for relational databases, e.g. Aurora or Fusionplex.