
ODCleanStore - SW Project Proposal

Supervisor: Tomáš Knap (tomas.knap@mff.cuni.cz)
Team members: 5 students (Jakub Daniel, Petr Jerman, Jan Michelfeit, Dušan Rychnovský, Tomas Soukup)
Language: Java
OS: Windows 7 / Windows Server 2008 / Linux

Motivation

The advent of Linked Data [1,2] in recent years has accelerated the evolution of the Web into a giant information space in which an unprecedented volume of resources offers information consumers a level of information integration and aggregation that has not been possible before. Consumers can now 'mash up' and readily integrate information for a myriad of end uses. Indiscriminate addition of information, however, brings inherent problems such as (1) poor-quality, (2) inaccurate, (3) irrelevant, or (4) fraudulent information. All of these come with an associated cost that ultimately affects decision making, system usage, and uptake.

The ability to assess the quality of information on the Web thus presents one of the most important aspects of information integration on the Web and will play a fundamental role in the continued adoption of the Linked Data principles [2].

Goal of the Project

The goal of the project is to build a Java Web application which will clean, link, and score incoming RDF data and provide aggregated and integrated views on the data to Linked Data consumers. The application will have a graphical user interface for application administration. The main parts of the application are:

Data Storage
The application will store the incoming data, together with its metadata, in OpenLink Virtuoso [3], a widely used RDF store with solid support. This task requires configuring the OpenLink Virtuoso database and setting up/building mechanisms (e.g. web services, JAR libraries) to communicate with the storage (to store/retrieve the data). We will use two separate data spaces - one to store the incoming data (the dirty database) and one to store clean data (the clean database).
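
For illustration, the following is a minimal sketch of how such a communication mechanism could store and retrieve data through Virtuoso's SPARQL endpoint using the Apache Jena library; the endpoint URL corresponds to Virtuoso's default, the graph URI is a hypothetical placeholder, and the final interface will be fixed during the analysis phase.

    // A minimal sketch, assuming Virtuoso's default SPARQL endpoint and a
    // placeholder named graph for the dirty data space; uses Apache Jena ARQ.
    import org.apache.jena.query.QueryExecution;
    import org.apache.jena.query.QueryExecutionFactory;
    import org.apache.jena.query.ResultSet;
    import org.apache.jena.query.ResultSetFactory;
    import org.apache.jena.update.UpdateExecutionFactory;
    import org.apache.jena.update.UpdateFactory;
    import org.apache.jena.update.UpdateRequest;

    public class VirtuosoStore {
        private static final String ENDPOINT = "http://localhost:8890/sparql";           // Virtuoso default
        private static final String DIRTY_GRAPH = "http://odcs.example.org/graph/dirty"; // placeholder

        /** Stores triples (in N-Triples syntax) into the dirty data space. */
        public void storeToDirty(String nTriples) {
            UpdateRequest insert = UpdateFactory.create(
                    "INSERT DATA { GRAPH <" + DIRTY_GRAPH + "> { " + nTriples + " } }");
            UpdateExecutionFactory.createRemote(insert, ENDPOINT).execute();
        }

        /** Retrieves all triples about a resource from a given (clean) graph. */
        public ResultSet loadResource(String graph, String resourceUri) {
            String query = "SELECT ?p ?o WHERE { GRAPH <" + graph + "> { <"
                    + resourceUri + "> ?p ?o } }";
            QueryExecution execution = QueryExecutionFactory.sparqlService(ENDPOINT, query);
            try {
                return ResultSetFactory.copyResults(execution.execSelect());
            } finally {
                execution.close();
            }
        }
    }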

Cleaning and scoring the data (Error Localization component)
The application will check whether the incoming data conform to:

  • the ontology used to describe these data
  • custom policies defined for all the data described by that ontology
  • custom policies defined for the data coming from a particular data source.

Based on these checks, the application corrects syntactic errors, scores the incoming data, and either sends the data to the clean database or drops them. We will also provide several sample sets of policies applicable to real-world data sources.
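
For illustration, one possible (hypothetical) representation of such a custom policy is a SPARQL ASK pattern matching offending data, combined with a score penalty applied whenever the pattern matches; the actual policy format will be decided during analysis.

    // A minimal sketch of a hypothetical policy representation: a SPARQL ASK
    // pattern that matches offending data, plus a penalty applied to the score
    // of the incoming graph when the pattern matches. Uses Apache Jena ARQ.
    import org.apache.jena.query.QueryExecution;
    import org.apache.jena.query.QueryExecutionFactory;
    import org.apache.jena.rdf.model.Model;

    public class AskPolicy {
        private final String askQuery;     // pattern matching violating data
        private final double scorePenalty; // how much a match lowers the score

        public AskPolicy(String askQuery, double scorePenalty) {
            this.askQuery = askQuery;
            this.scorePenalty = scorePenalty;
        }

        /** Returns the penalty to subtract from the quality score of the graph. */
        public double evaluate(Model incomingGraph) {
            QueryExecution execution = QueryExecutionFactory.create(askQuery, incomingGraph);
            try {
                return execution.execAsk() ? scorePenalty : 0.0;
            } finally {
                execution.close();
            }
        }
    }

    // Example: penalize publication dates lying in the future.
    // new AskPolicy("ASK { ?s <http://purl.org/dc/terms/date> ?d . FILTER(?d > NOW()) }", 0.2);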

Linking the data (Object Identification & Record Linkage Component)
Since the same resource can be identified by various URIs (and often is), the application will support the specification of rules that are applied to incoming resources in order to reveal whether a new incoming resource represents a new concept (not yet present in the clean database) or a concept already present in the clean database; in the latter case, the application will create a link specifying that the two resources represent the same concept. The component will also support the creation of arbitrary types of links between resources. We will use the Silk engine [4] and its specification language [5] to specify the policies.
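
For illustration, the following sketch (plain Java, not Silk itself) shows the kind of comparison a linkage rule typically expresses: two resources are proposed as representing the same concept when their labels are sufficiently similar. In the project the actual rules will be written in the Silk specification language [5] and executed by the Silk engine.

    // An illustrative sketch of a label-similarity comparison; thresholds and
    // the choice of metric are only examples.
    public class LabelSimilarityRule {
        private final double threshold;

        public LabelSimilarityRule(double threshold) {
            this.threshold = threshold;
        }

        /** True if the labels are similar enough to propose an owl:sameAs link. */
        public boolean sameConcept(String labelA, String labelB) {
            String a = labelA.toLowerCase();
            String b = labelB.toLowerCase();
            int maxLength = Math.max(a.length(), b.length());
            double similarity = maxLength == 0 ? 1.0 : 1.0 - (double) levenshtein(a, b) / maxLength;
            return similarity >= threshold;
        }

        // Standard dynamic-programming Levenshtein (edit) distance.
        private static int levenshtein(String a, String b) {
            int[][] d = new int[a.length() + 1][b.length() + 1];
            for (int i = 0; i <= a.length(); i++) d[i][0] = i;
            for (int j = 0; j <= b.length(); j++) d[0][j] = j;
            for (int i = 1; i <= a.length(); i++) {
                for (int j = 1; j <= b.length(); j++) {
                    int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                    d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                       d[i - 1][j - 1] + cost);
                }
            }
            return d[a.length()][b.length()];
        }
    }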

Providing data (Query Execution & Conflict Resolution Component)
The main purpose of the project is to provide data aggregated from various sources according to consumers' needs.

Data consumers can retrieve data about resources via the URI identifiers of these resources (see the Linked Data principles [2]) or by specifying keywords. The response contains all data known about the relevant resource, together with provenance metadata (who created it, when, etc.), a score based on the results of the Error Localization component, and the aggregated score of the particular data source.

When the data are retrieved and prepared for the consumer, conflicts among the data (various sources may provide conflicting values for the same RDF properties) are resolved using the conflict resolution policy preferred by the consumer; otherwise, default conflict resolution policies are used. The conflicts will be either precomputed in the database or computed on the fly.
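
For illustration, the following is a minimal sketch of one possible default conflict resolution policy, "the value from the best-scored source wins"; the ScoredValue type is a hypothetical placeholder, and other strategies (e.g. taking any value, or averaging numeric values) will be supported as well.

    // A minimal sketch of one possible default conflict resolution policy.
    import java.util.List;

    public class HighestScoreResolution {

        /** A candidate value of one RDF property, with the score of its source (hypothetical type). */
        public static class ScoredValue {
            public final String value;
            public final double sourceScore;

            public ScoredValue(String value, double sourceScore) {
                this.value = value;
                this.sourceScore = sourceScore;
            }
        }

        /** Resolves a conflict by keeping the value coming from the best-scored source. */
        public ScoredValue resolve(List<ScoredValue> conflictingValues) {
            ScoredValue best = null;
            for (ScoredValue candidate : conflictingValues) {
                if (best == null || candidate.sourceScore > best.sourceScore) {
                    best = candidate;
                }
            }
            return best;
        }
    }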

Consumers can also query the application using the SPARQL query language. In this case, however, the information about metadata and data quality scores may be limited and conflict resolution will not be supported.

Ontology Maintenance & Mapping
The project will maintain ontologies describing the consumed data (explicitly imported) and enable the creation of mappings between these ontologies, which is a crucial aspect when aggregating data. Mappings between ontologies will be taken into account during query answering.
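
For illustration, the following sketch shows a simplified way in which a property-equivalence mapping could be applied during query answering: the property requested by the consumer is expanded to all properties mapped as equivalent, so that data described by either ontology is returned. The in-memory table is only an illustration; the mappings themselves will be stored and managed together with the maintained ontologies.

    // A minimal sketch of expanding a requested property via an ontology mapping.
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    public class PropertyMapping {
        private final Map<String, Set<String>> equivalents = new HashMap<String, Set<String>>();

        /** Declares two properties as equivalent (in both directions). */
        public void addEquivalence(String propertyA, String propertyB) {
            link(propertyA, propertyB);
            link(propertyB, propertyA);
        }

        /** All properties that should be queried when the consumer asks for one of them. */
        public Set<String> expand(String property) {
            Set<String> result = new HashSet<String>();
            result.add(property);
            Set<String> mapped = equivalents.get(property);
            if (mapped != null) {
                result.addAll(mapped);
            }
            return result;
        }

        private void link(String from, String to) {
            Set<String> targets = equivalents.get(from);
            if (targets == null) {
                targets = new HashSet<String>();
                equivalents.put(from, targets);
            }
            targets.add(to);
        }
    }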

Roles
The application will support several user roles:

  • administrator - rights to assign other roles and to adjust settings of the application and its components
  • ontology creator - rights to adjust ontologies and ontology mappings
  • policy creator - rights to adjust policies of the Error Localization component specific to a given data source, to write policies for the Object Identification & Record Linkage component, and to debug policies of the Error Localization component
  • scraper - rights to insert data and to list registered ontologies
  • user - rights to query the application

Graphical User Interface
The application will provide a graphical user interface enabling:

  • management of all kinds of policies (policies for the Error Localization and Object Identification & Record Linkage components, conflict resolution policies, and ontology mapping policies); the GUI will also support debugging of policies for the Error Localization component by showing which data match the given policies
  • management of roles and rights in the application
  • management of settings of the individual components and of the whole application

Expected utilization of the team:

(including analysis, documentation, and testing of the respective parts):

  • Error localization component, analysis, documentation, testing of the component (1 person)
  • Query execution & Conflict resolution, analysis, documentation, testing of the component (1 person)
  • Object Identification, Record Linkage Component (includes ontology mappings), analysis, documentation, testing of the component (0.3 persons)
  • Application's core - creating and launching components; communication with data consumers/providers; storage configuration; roles management, analysis, documentation and testing of the whole application (1.7 persons)
  • Graphical User Interface, design, documentation, testing (1 person)

Expected Work Plan:

We will implement a prototype of the project with restricted functionality as soon as possible (by Month 4). The work plan, assuming a deadline in 9 months, is as follows:

  • general analysis, specification, architecture, specification of the modules (Months 1-2)
  • implementing a prototype with restricted functionality (Months 3-4)
  • testing the prototype (Month 5)
  • implementing the full specification (Months 5-7)
  • testing the full specification (Month 8)
  • further testing and refinement, documentation (user, programmer, and configuration guides) (Month 9)

Other requirements on the project

  • The final documentation of the project will be in English
  • The application will be freely available under Apache Software License [6]
  • It should be easy to incorporate other components, such as a component computing popularity of the data sources
  • It should be easy to adjust the query execution/conflict resolution component to take into account different types of custom policies submitted by the data consumer, such as context (provenance) policies (e.g. "Distrust data coming from the data source http://example.com", "Prefer data with the license ..."), content policies ("Distrust data older than 1 year"), or rating policies ("Prefer popular sources"); one such content policy is sketched below.
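
For illustration, the following minimal sketch shows how a content policy such as "Distrust data older than 1 year" might be represented: a predicate over the provenance metadata attached to a triple. The TripleMetadata type and its fields are hypothetical placeholders.

    // A minimal sketch of a consumer-supplied content policy.
    public class MaxAgePolicy {

        /** Hypothetical provenance metadata attached to a stored triple. */
        public static class TripleMetadata {
            public final String source;
            public final long insertedAtMillis;

            public TripleMetadata(String source, long insertedAtMillis) {
                this.source = source;
                this.insertedAtMillis = insertedAtMillis;
            }
        }

        private static final long MILLIS_PER_DAY = 24L * 60 * 60 * 1000;
        private final long maxAgeDays;

        public MaxAgePolicy(long maxAgeDays) {
            this.maxAgeDays = maxAgeDays;
        }

        /** Returns true if the triple is still trusted under this policy. */
        public boolean trusts(TripleMetadata metadata) {
            long ageDays = (System.currentTimeMillis() - metadata.insertedAtMillis) / MILLIS_PER_DAY;
            return ageDays <= maxAgeDays;
        }
    }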

Other notes

During the work on the project, we will stay in touch with the Digital Enterprise Research Institute (DERI) in Ireland [8], one of the biggest Semantic Web research institutes in the world, which hosts a specialized Linked Data Research Centre [9]. We will also discuss the proposed techniques with Carlo Batini's team at the University of Milano-Bicocca [7].

References

[1] Bizer, Ch., Heath, T. and Berners-Lee, T. Linked Data - The Story So Far. International
Journal on Semantic Web and Information Systems 5, 1-22 (2009).

[2] Berners-Lee, T. Linked Data - Design Issues.
http://www.w3.org/DesignIssues/LinkedData.html

[3] http://virtuoso.openlinksw.com/

[4] http://www4.wiwiss.fu-berlin.de/bizer/silk/

[5] http://www.assembla.com/spaces/silk/wiki/Link_Specification_Language

[6] http://www.apache.org/licenses/LICENSE-2.0

[7] http://www.unimib.it/go/page/Italiano/Elenco-Docenti/BATINI-CARLO

[8] Digital Enterprise Research Institute, National University of Ireland, Galway.
http://www.deri.ie/

[9] Linked Data Research Centre, DERI. http://linkeddata.deri.ie/

Incorporated into the text

  • or implicitly added when used in the incoming data (removed)

    • How should we behave, then, when we receive data that use, say, properties from an unknown ontology?
  • debugging policies for Silk? (dropped)

    • That probably depends on how far we manage to integrate the Silk GUI.
  • Should we state that sample policies on sample data will be part of the project? Or is it better not to mention it?

    • Probably yes, I will think of something cautious.
  • "everybody can submit the data" - only registered users should be able to do that.

    • Agreed, resolved.
  • Once we have the possible extensions written down, it might be good to state explicitly that they will not be implemented within the project (otherwise the committee could think of the same extensions and might want them).

    • I would probably not write that; they cannot demand from us anything that is not written there.
  • "specifying SPARQL queries - will be limited to particular users"
    Wasn't SPARQL supposed to be public? Otherwise we would have to have the USR users registered.

    • To what extent does conflict resolution apply to SPARQL queries? Will we implement the SPARQL CONSTRUCT?
    • We will keep it public for now; if necessary, we will adjust the Virtuoso Conductor configuration and the role settings.

Other notes

Communication between the data consumer and the storage is not encrypted.

Input & Output Restrictions

Data can be submitted to the storage via:

  • Web Service - everybody can submit data (the data is stored to the dirty database; the named graph is determined by the implementation of the storage); a sketch of a possible client is shown at the end of this section
  • JAR library - a scraper can submit data via a JAR library (both the scraper and the storage are implemented in Java)

Data can be retrieved by:

  • specifying the URI of the resource - available to everybody; the response contains the score of the retrieved triples and basic metadata about the triples (who created them, when, and their source). If interested, the consumer can request a further explanation of how the score was computed.
  • specifying keywords - available to everybody; the number of matching URIs can be limited to overcome performance issues; the response is as in the first case
  • specifying SPARQL queries - limited to particular users; some queries may be cancelled due to expected resource consumption, and information about metadata and scores of the triples may be missing

Furthermore, particular users can get information about ontologies.
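
For illustration, the following is a minimal sketch of what data submission through the input Web Service might look like from a scraper's point of view; the endpoint URL and the payload are placeholders, since the actual interface will be fixed during the analysis phase.

    // A minimal sketch of submitting RDF/XML data to the (placeholder) input Web Service.
    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class InputClient {
        public static void main(String[] args) throws Exception {
            String rdfXmlPayload =
                    "<rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\"/>";
            URL endpoint = new URL("http://localhost:8080/odcleanstore/input"); // placeholder
            HttpURLConnection connection = (HttpURLConnection) endpoint.openConnection();
            connection.setRequestMethod("POST");
            connection.setRequestProperty("Content-Type", "application/rdf+xml");
            connection.setDoOutput(true);
            OutputStream out = connection.getOutputStream();
            out.write(rdfXmlPayload.getBytes("UTF-8"));
            out.close();
            System.out.println("Response code: " + connection.getResponseCode());
        }
    }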

