
Future extensions

Tomas Knap Mifeet Petr Jerman

Future Work for ODCleanStore:

Data Normalizer

  • a possibility to view (debug) logs
  • a possibility to mark a certain rule and track where that rule is applied
  • score should take into account completeness of data in the graph (w.r.t. other graphs)
  • score should take into account timeliness of the data
    • volatility is important: if a date of birth does not change over ten years of updates, that is fine; if the price of a product does not change over ten years of updates, it might be suspicious.
  • support for SQL queries
  • action DELETE SUBGRAPH
  • more detailed logging; a special recovery log for recovering from a destructive change in the database

  • Machine learning - derive rules from the given set of named graphs, derive volatility of data (to support score for timeliness)

  • Further advanced data correction techniques (be careful of functional dependencies)
  • Concept of a special database where the data for manual cleansing should be kept
    • When wrong data is edited manually, a provenance record should be kept, and that information should be taken into account when computing the score for the publisher/domain.
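The timeliness/volatility idea above can be sketched as an exponential decay on the quality score, where the decay rate comes from the estimated volatility of the property. This is a minimal illustration only; the function name, the decay model, and the parameters are assumptions, not part of ODCleanStore:

```python
import math

def timeliness_score(base_score, age_days, expected_change_days):
    """Hypothetical timeliness weighting: the faster a value is expected
    to change (low expected_change_days, i.e. high volatility), the more
    an old observation is penalized."""
    decay_rate = 1.0 / expected_change_days  # volatile data decays fast
    return base_score * math.exp(-decay_rate * age_days)

# A date of birth is essentially immutable, so age barely matters:
dob_score = timeliness_score(0.9, age_days=3650, expected_change_days=1e9)
# A product price is volatile, so a ten-year-old value scores near zero:
price_score = timeliness_score(0.9, age_days=3650, expected_change_days=30)
```

Deriving `expected_change_days` per property is exactly the kind of volatility estimate the machine-learning bullet above could provide.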

Query Execution

  • Further QA policies:

    • provenance policies, e.g., "Distrust data coming from the data source http://example.com" or "Distrust data older than one year (created at least one year ago)"
    • content policies, e.g., "Distrust data with completeness less than ..."
    • rating policies, e.g., "Distrust the least popular sources"
  • support of Conflict Resolution for SPARQL queries

  • paging of the results
  • sorting results by relevance
  • a possibility to restrict results to resources with a given rdf:type; for URI search the restriction applies to the searched URI, for keyword search it applies to the subjects in the result
  • a possibility to query all resources of the given type (rdf:type)
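The rdf:type restriction could be realized by injecting an extra type triple pattern into the generated query. A rough sketch of that idea follows; the function name and query shape are illustrative and do not reflect ODCleanStore's actual query builder:

```python
def uri_search_query(uri, rdf_type=None):
    """Build a SPARQL query listing the properties of the given URI,
    optionally restricted to resources of a given rdf:type
    (illustrative sketch, not ODCleanStore's real query generation)."""
    type_pattern = f"<{uri}> a <{rdf_type}> ." if rdf_type else ""
    return (
        "SELECT ?p ?o WHERE {\n"
        f"  <{uri}> ?p ?o .\n"
        f"  {type_pattern}\n"
        "}"
    )

q = uri_search_query("http://example.com/r1",
                     rdf_type="http://xmlns.com/foaf/0.1/Person")
```

For keyword search, the same pattern would instead be attached to the subject variables of the result.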

Conflict Resolution

  • quality estimation should take into account who created the data or other data provenance elements.
  • a user may set their (dis)trust in a publisher
  • take into account data timeliness
  • TOP-K conflict handling policy (the K best values); parameterized conflict handling policies (e.g., the separator for CONCAT)
  • conflict resolution as a library, which will enable building customized clean databases (customized data marts) with conflicts already resolved.
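A TOP-K policy amounts to keeping the K conflicting values with the highest quality estimate; the parameterized-CONCAT idea can be shown in the same sketch. The function and the scoring are made up for illustration and are not ODCleanStore's conflict-resolution API:

```python
import heapq

def resolve_top_k(conflicting_values, k, separator=", "):
    """Keep the k values with the highest quality score, then join them
    with a configurable separator (a parameterized CONCAT).
    conflicting_values: list of (value, quality) pairs from different
    sources (illustrative sketch only)."""
    best = heapq.nlargest(k, conflicting_values, key=lambda pair: pair[1])
    return separator.join(value for value, _ in best)

values = [("Berlin", 0.9), ("Munich", 0.4), ("Bërlin", 0.7)]
resolve_top_k(values, k=2)  # keeps the two best-rated values
```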

Engine

  • detect identical updates
  • should offer both the data processed by the pipeline and the original data submitted to ODCleanStore; consumers may want to see the original data, and pipeline operators may want to use it
    • Original data submitted to ODCleanStore should be stored together with the cleansed data
  • Concept of multiple pipelines for the same data? (more configurations, profiles)
    • raw data, a conservative pipeline (QA, simple linkers, basic Data Normalizers), an aggressive pipeline
    • Realization: it is possible to create more pipelines {name}-1, {name}-2, ... The Input WS submits data to all pipelines and stores the data as alternative graphs in the clean database; a data consumer may specify that they are interested in the conservative data.
    • Conflict resolution transformer

Custom Transformers

  • configuration of when the transformer should be launched (every time / according to source/ontology - similarly as for other policies)
  • the possibility to configure custom transformers as the data is inserted to ODCleanStore using input WS

Web frontend

  • user in the admin role may edit arbitrary pipeline/group of policies
  • anyone can create their own copy (fork) of a group of rules; such a group must have a unique label
  • possibility of forking single rules
  • possibility of showing only one's own groups of rules
  • sorting of tables
  • searching in groups/pipelines
  • visibility of groups of rules/pipelines -> private or public
  • notification and accepting changes in editing the groups
    • if the author edits a group, all who are using the group are notified and have the possibility to accept or reject the changes (at the global/transformer level)
    • if the user does not accept the changes, the old version of the rules is used in their pipelines
    • if the user accepts the changes, the new version is used
    • the user should have the possibility to replace the group with a different fork
    • the user should be informed about the changes when logged in
    • notifications can also be sent by e-mail
  • users will have prepared templates of transformers, which may be used directly in a pipeline
    • template is named, contains groups of rules
    • when inserted to pipeline, a new copy of the template is created and inserted to the pipeline
    • changes of the groups of rules the template uses are automatically accepted
      • templates may be created/edited only by ADM (and can be read by PIC and ADM)
  • new groups of rules, with permissions at the level of pipelines/rules
  • Management of data in the quarantine (who is responsible?)
  • Management of users
  • Possibility for the groups of rules to contain further groups
  • Possibility to upload transformer using administration interface
  • Possibility to download an ontology from a given URI
  • Possibility to re-run affected pipelines when a group of rules is deleted
  • Possibility to stop/pause pipeline execution
  • Possibility to show for the group the list of pipelines using it
  • Possibility to show for the transformer the list of pipelines using it
  • The navigation among the pages should be clickable
  • Filtering of the outputs according to single attributes
  • GUI: management of loads to the raw data mart (how often), management of the data marts (how often, which data, where - DB, file, ...)

Other

  • for rules (QA, DN), check the validity of created rules (try launching them on some test data so that the SPARQL does not fail on a syntax error)
  • role Corrector - able to edit data
  • Linking to external data sources - periodic/manual launching of a component with well-defined interfaces (like a transformer), with the possibility to be managed in the frontend; see meeting 17.6.
  • a possibility to add label to the inserted graph
  • create a user-friendly tool for loading data into ODCleanStore, with a possibility to split big files into small graphs, ...
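Full rule validation, as the QA/DN bullet above suggests, means executing the rule against the SPARQL engine on test data. As a cheap first line of defense before that, obviously broken rules could be rejected by a syntactic sanity check. The sketch below is exactly that and nothing more - it is not a SPARQL parser, and the accepted keyword set is an assumption:

```python
def quick_rule_check(rule):
    """Cheap pre-flight check for a SPARQL rule: balanced braces and a
    recognized leading keyword.  Not a parser - real validation should
    still execute the rule on a small test graph in the SPARQL engine."""
    depth = 0
    for ch in rule:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                return False  # closing brace before any opening one
    if depth != 0:
        return False          # unbalanced braces
    words = rule.split()
    if not words:
        return False
    # illustrative keyword list; a real check would follow the grammar
    return words[0].upper() in {"SELECT", "CONSTRUCT", "ASK", "DESCRIBE",
                                "DELETE", "INSERT", "PREFIX", "WITH"}

quick_rule_check("DELETE { ?s ?p ?o } WHERE { ?s ?p ?o }")  # passes
quick_rule_check("SELECT * WHERE { ?s ?p ?o")               # rejected
```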

Visualization (Output WS)

  • Create GUI depicting quality scores/justifications, coloring results
  • Create GUI which will enable user to annotate the given SPARQL query - sample annotation: select source XY for var ?a
