Supervisor: Tomáš Knap (tomas.knap@mff.cuni.cz)
Team members: 5 students (Jakub Daniel, Petr Jerman, Jan Michelfeit, Dušan Rychnovský, Tomas Soukup)
Language: Java
OS: Windows 7 / Windows Server 2008 / Linux
The advent of Linked Data [1,2] in recent years accelerates the evolution of the Web into a giant information space where an unprecedented volume of resources will offer the information consumer a level of information integration and aggregation that has not been possible until now. Consumers can now 'mash up' and readily integrate information for use in a myriad of alternative end uses. Indiscriminate addition of information can, however, come with inherent problems, such as the provision of (1) poor-quality, (2) inaccurate, (3) irrelevant, or (4) fraudulent information. All of these come with an associated cost which will ultimately affect decision making, system usage, and uptake.
The ability to assess the quality of information on the Web thus presents one of the most important aspects of information integration on the Web and will play a fundamental role in the continued adoption of Linked Data principles [2].
The goal of the project is to build a Java Web application which will clean, link, and score incoming RDF data and provide aggregated and integrated views on the data to Linked Data consumers. The application will have a graphical user interface for application administration. The main parts of the application are:
Data Storage
The application will store the incoming data, together with its metadata, in Openlink Virtuoso [3], a popular RDF store with solid support. This task requires configuring the Openlink Virtuoso database and setting up/building mechanisms (e.g. web services, JAR libraries) to communicate with the storage (to store/retrieve the data). We will use two data spaces: one to store the incoming data (the dirty database) and one to store clean data (the clean database).
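Assuming the data spaces are exposed through Virtuoso's SPARQL endpoint, communication with the storage largely reduces to building SPARQL/Update and CONSTRUCT statements. The sketch below only illustrates this idea; the graph URIs and the class name are placeholders chosen for the example, not fixed project identifiers:

```java
// Sketch: storage communication reduced to building SPARQL statements.
// The two graph URIs below are illustrative assumptions, not project names.
public final class VirtuosoStatements {

    /** Hypothetical URIs of the two data spaces. */
    static final String DIRTY_GRAPH = "http://example.org/graphs/dirty";
    static final String CLEAN_GRAPH = "http://example.org/graphs/clean";

    /** SPARQL/Update statement storing one incoming triple into the dirty graph. */
    public static String insertStatement(String s, String p, String o) {
        return "INSERT DATA { GRAPH <" + DIRTY_GRAPH + "> { "
                + "<" + s + "> <" + p + "> <" + o + "> . } }";
    }

    /** CONSTRUCT query retrieving everything known about a resource from the clean graph. */
    public static String retrieveQuery(String resource) {
        return "CONSTRUCT { <" + resource + "> ?p ?o } "
                + "WHERE { GRAPH <" + CLEAN_GRAPH + "> { <" + resource + "> ?p ?o } }";
    }
}
```

In practice these statements would be sent to the Virtuoso endpoint by the web services or JAR libraries mentioned above.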
Cleaning and scoring the data (Error Localization component)
The application will check whether the incoming data conform with:
Based on these checks, we correct syntactic errors, score the incoming data, and either send the data to the clean database or drop them. We will also provide several sample sets of policies applicable to real-world data sources.
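As a minimal sketch of the scoring step, a graph's score could be the fraction of its triples that pass the checks; the 0.5 publishing threshold below is an assumption made for illustration only, not a project decision:

```java
// Sketch of the scoring step in the Error Localization component.
// The 0.5 threshold is an illustrative assumption.
public final class GraphScorer {

    /** Score = fraction of triples that passed the syntactic checks. */
    public static double score(int totalTriples, int erroneousTriples) {
        if (totalTriples == 0) return 0.0;
        return 1.0 - (double) erroneousTriples / totalTriples;
    }

    /** Decide whether a scored graph goes to the clean database or is dropped. */
    public static boolean sendToCleanDatabase(double score) {
        return score >= 0.5; // assumed threshold
    }
}
```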
Linking the data (Object Identification & Record Linkage Component)
Since the same resource can be (and often is) identified by various URIs, the application will support the specification of rules that are applied to incoming resources to reveal whether a new incoming resource represents a new concept (not yet present in the clean database) or a concept already present there. In the latter case, the application will create a link stating that the two resources represent the same concept. The component will also support the creation of arbitrary types of links between resources. We will use the Silk engine [4] and its specification language [5] to specify the linkage policies.
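In the project itself the linkage rules will be written in the Silk specification language [5]; the following self-contained Java sketch only illustrates the principle of such a rule (label comparison via normalized Levenshtein similarity, with an assumed 0.9 threshold) and is not the Silk API:

```java
// Illustrative linkage rule: emit an owl:sameAs link when two resource
// labels are similar enough. The 0.9 threshold is an assumption.
public final class LinkageRule {

    /** Classic dynamic-programming edit distance. */
    public static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        return d[a.length()][b.length()];
    }

    /** Normalized similarity in [0, 1]. */
    public static double similarity(String a, String b) {
        int max = Math.max(a.length(), b.length());
        return max == 0 ? 1.0 : 1.0 - (double) levenshtein(a, b) / max;
    }

    /** Returns an owl:sameAs triple when labels are similar enough, else null. */
    public static String link(String uriA, String labelA, String uriB, String labelB) {
        if (similarity(labelA, labelB) >= 0.9) {
            return "<" + uriA + "> <http://www.w3.org/2002/07/owl#sameAs> <" + uriB + "> .";
        }
        return null;
    }
}
```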
Providing data (Query Execution & Conflict Resolution Component)
The main purpose of the project is to provide data aggregated from various sources and tailored to consumers' needs.
Data consumers can retrieve data about resources via the URI identifiers of these resources (see the Linked Data principles) or by specifying keywords. The response contains all the data known about the relevant resource, together with provenance metadata (who created the data, when, etc.) and with a score based on the results of the Error Localization component and the aggregated score of the particular data source.
When the data are retrieved and prepared for the consumer, we resolve conflicts among the data (various sources may provide conflicting values for the same RDF property) according to the conflict resolution policy preferred by the consumer; if no policy is specified, default conflict resolution policies are used. The conflicts will be either precomputed in the database or resolved on the fly.
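A minimal sketch of such a policy follows, assuming each candidate value carries the aggregated score of its source; the policy names (BEST, ANY, AVG) are illustrative defaults, not the project's final list:

```java
import java.util.List;

// Sketch of conflict resolution among values provided by different sources.
// Policies and their names are illustrative assumptions.
public final class ConflictResolver {

    public enum Policy { BEST, ANY, AVG }

    /** A candidate value together with the aggregated score of its source. */
    public static final class Scored {
        final String value;
        final double score;
        public Scored(String value, double score) { this.value = value; this.score = score; }
    }

    public static String resolve(List<Scored> candidates, Policy policy) {
        if (candidates.isEmpty()) return null;
        switch (policy) {
            case BEST: { // value from the highest-scored source
                Scored best = candidates.get(0);
                for (Scored c : candidates) if (c.score > best.score) best = c;
                return best.value;
            }
            case ANY:    // any value, here simply the first one seen
                return candidates.get(0).value;
            case AVG: {  // numeric values only: score-weighted mean
                double num = 0, den = 0;
                for (Scored c : candidates) {
                    num += Double.parseDouble(c.value) * c.score;
                    den += c.score;
                }
                return den == 0 ? null : String.valueOf(num / den);
            }
        }
        return null;
    }
}
```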
Consumers can also query the application using the SPARQL query language. In this case, however, the information about metadata and data quality scores may be limited, and conflict resolution will not be supported.
Ontology Maintenance & Mapping
The project will maintain ontologies describing the consumed data (explicitly imported) and enable the creation of mappings between these ontologies, which is a crucial aspect when aggregating data. Mappings between ontologies will be taken into account during query answering.
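One simple way to take mappings into account during query answering is to expand every queried property to the set of its known equivalents, as in the sketch below (the FOAF/Dublin Core property URIs used in testing are only an example of an equivalence, not a mapping the project commits to):

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch: ontology mappings kept as pairs of equivalent property URIs,
// used to expand predicates at query-answering time.
public final class OntologyMapping {

    private final Map<String, Set<String>> equivalents = new HashMap<>();

    /** Declares p and q as equivalent properties (symmetric). */
    public void addEquivalence(String p, String q) {
        equivalents.computeIfAbsent(p, k -> new HashSet<>()).add(q);
        equivalents.computeIfAbsent(q, k -> new HashSet<>()).add(p);
    }

    /** All predicates to try when the query asks for p (p itself included). */
    public Set<String> expand(String p) {
        Set<String> out = new HashSet<>();
        out.add(p);
        out.addAll(equivalents.getOrDefault(p, Collections.emptySet()));
        return out;
    }
}
```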
Roles
The application will support several user roles:
Graphical User Interface
The application will provide a graphical user interface enabling:
(including analysis, documentation, and testing of the respective parts):
We will implement a prototype of the project with restricted functionality as soon as possible (in Month 4). The work plan is as follows, assuming a deadline in 9 months:
During the work on the project, we will stay in touch with the Digital Enterprise Research Institute (DERI) in Ireland [8], one of the largest Semantic Web research institutes in the world, with a specialized Linked Data Research Centre [9]. We will also discuss the proposed techniques with Carlo Batini's team at the University of Milano-Bicocca [7].
[1] Bizer, Ch., Heath, T. and Berners-Lee, T. Linked Data - The Story So Far. International
Journal on Semantic Web and Information Systems 5, 1-22 (2009).
[2] Berners-Lee, T. Linked Data - Design Issues.
http://www.w3.org/DesignIssues/LinkedData.html
[3] http://virtuoso.openlinksw.com/
[4] http://www4.wiwiss.fu-berlin.de/bizer/silk/
[5] http://www.assembla.com/spaces/silk/wiki/Link_Specification_Language
[6] http://www.apache.org/licenses/LICENSE-2.0
[7] http://www.unimib.it/go/page/Italiano/Elenco-Docenti/BATINI-CARLO
[8] Digital Enterprise Research Institute, National University of Ireland, Galway.
http://www.deri.ie/
[9] Linked Data Research Centre, DERI. http://linkeddata.deri.ie/
Communication between data consumer and storage is not encrypted.
Input & Output Restrictions
Data can be submitted to the storage via:
Data can be retrieved by:
Furthermore, particular users can get information about ontologies.