
Future extensions

Tomas Knap Mifeet Petr Jerman

Future Work for ODCleanStore:

Data Normalizer

  • a possibility to view (debug) logs
  • a possibility to mark a certain rule and track where that rule is applied
  • score should take into account completeness of data in the graph (w.r.t. other graphs)
  • score should take into account timeliness of the data
    • volatility is important: if a date of birth does not change over ten years of updates, that is fine; if the price of a product does not change over ten years of updates, it might be suspicious.
  • support for SQL queries
  • action DELETE SUBGRAPH
  • more detailed logging; a special recovery log for recovering from a destructive change in the database

  • Machine learning - derive rules from the given set of named graphs, derive volatility of data (to support score for timeliness)

  • Further advanced data correction techniques (be careful of functional dependencies)
  • Concept of a special database where the data for manual cleansing should be kept
    • When wrong data is edited manually, a provenance record should be kept, and that information should be taken into account when computing the score for the publisher/domain.
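The timeliness/volatility idea above can be sketched as an exponential decay on the quality score, where the decay rate comes from the estimated volatility of the property. This is a minimal illustration only; the function name, the decay model, and the parameters are assumptions, not part of ODCleanStore:

```python
import math

def timeliness_score(base_score, age_days, expected_change_days):
    """Hypothetical timeliness weighting: the faster a value is expected
    to change (low expected_change_days, i.e. high volatility), the more
    an old observation is penalized."""
    decay_rate = 1.0 / expected_change_days  # volatile data decays fast
    return base_score * math.exp(-decay_rate * age_days)

# A date of birth is essentially immutable, so age barely matters:
dob_score = timeliness_score(0.9, age_days=3650, expected_change_days=1e9)
# A product price is volatile, so a ten-year-old value scores near zero:
price_score = timeliness_score(0.9, age_days=3650, expected_change_days=30)
```

Deriving `expected_change_days` per property is exactly the kind of volatility estimate the machine-learning bullet above could provide.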

Query Execution

  • Further QA policies:

    • provenance policies, e.g., "Distrust data coming from the data source http://example.com" or "Distrust data older than one year (created at least one year ago)"
    • content policies, e.g., "Distrust data with completeness less than ..."
    • rating policies, e.g., "Distrust the least popular sources"
  • support of Conflict Resolution for SPARQL queries

  • paging of the results
  • sorting results by relevance
  • a possibility to restrict results to resources with a given rdf:type; for URI search the restriction applies to the searched URI, for keyword search it applies to the subjects in the result
  • a possibility to query all resources of the given type (rdf:type)
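The rdf:type restriction could be realized by injecting an extra type triple pattern into the generated query. A rough sketch of that idea follows; the function name and query shape are illustrative and do not reflect ODCleanStore's actual query builder:

```python
def uri_search_query(uri, rdf_type=None):
    """Build a SPARQL query listing the properties of the given URI,
    optionally restricted to resources of a given rdf:type
    (illustrative sketch, not ODCleanStore's real query generation)."""
    type_pattern = f"<{uri}> a <{rdf_type}> ." if rdf_type else ""
    return (
        "SELECT ?p ?o WHERE {\n"
        f"  <{uri}> ?p ?o .\n"
        f"  {type_pattern}\n"
        "}"
    )

q = uri_search_query("http://example.com/r1",
                     rdf_type="http://xmlns.com/foaf/0.1/Person")
```

For keyword search, the same pattern would instead be attached to the subject variables of the result.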

Conflict Resolution

  • quality estimation should take into account who created the data or other data provenance elements.
  • a user may set their (dis)trust in a publisher
  • take into account data timeliness
  • TOP-K conflict handling policy (the K best values); parameterized conflict handling policies (e.g., the separator for CONCAT)
  • conflict resolution as a library, which will enable building customized clean databases (customized data marts) with conflicts already resolved.
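A TOP-K policy amounts to keeping the K conflicting values with the highest quality estimate; the parameterized-CONCAT idea can be shown in the same sketch. The function and the scoring are made up for illustration and are not ODCleanStore's conflict-resolution API:

```python
import heapq

def resolve_top_k(conflicting_values, k, separator=", "):
    """Keep the k values with the highest quality score, then join them
    with a configurable separator (a parameterized CONCAT).
    conflicting_values: list of (value, quality) pairs from different
    sources (illustrative sketch only)."""
    best = heapq.nlargest(k, conflicting_values, key=lambda pair: pair[1])
    return separator.join(value for value, _ in best)

values = [("Berlin", 0.9), ("Munich", 0.4), ("Bërlin", 0.7)]
resolve_top_k(values, k=2)  # keeps the two best-rated values
```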

Engine

  • detect identical updates
  • should offer both the data processed by the pipeline and the original data submitted to ODCleanStore; consumers may want to see the original data, and pipeline operators may want to use it
    • Original data submitted to ODCleanStore should be stored together with the cleansed data
  • Concept of multiple pipelines for the same data? (more configurations, profiles)
    • raw data, a conservative pipeline (QA, simple linkers, basic Data Normalizers), an aggressive pipeline
    • Realization: it is possible to create more pipelines {name}-1, {name}-2, ... The Input WS submits data to all pipelines and stores the data as alternative graphs in the clean database; a data consumer may specify that they are interested in the conservative data.
    • Conflict resolution transformer

Custom Transformers

  • configuration of when the transformer should be launched (every time / according to source/ontology - similarly as for other policies)
  • the possibility to configure custom transformers as the data is inserted to ODCleanStore using input WS

Web frontend

  • user in the admin role may edit arbitrary pipeline/group of policies
  • anyone can create their own copy (fork) of a group of rules; such a group must have a unique label
  • possibility of forking single rules
  • possibility of showing only one's own groups of rules
  • sorting of tables
  • searching in groups/pipelines
  • visibility of groups of rules/pipelines -> private or public
  • notification and accepting changes in editing the groups
    • if the author edits a group, all who are using the group are notified and have the possibility to accept or reject the changes (at the global/transformer level)
    • if the user does not accept the changes, the old version of the rules is used in their pipelines
    • if the user accepts the changes, the new version is used
    • the user should have the possibility to replace the group with a different fork
    • the user should be informed about the changes when logged in
    • notifications can also be sent by e-mail
  • users will have prepared templates of transformers, which may be used directly in a pipeline
    • template is named, contains groups of rules
    • when inserted to pipeline, a new copy of the template is created and inserted to the pipeline
    • changes of the groups of rules the template uses are automatically accepted
      • templates may be created/edited only by ADM (and can be read by PIC and ADM)
  • new groups of rules, with permissions at the level of pipelines/rules
  • Management of data in the quarantine (who is responsible?)
  • Management of users
  • Possibility for the groups of rules to contain further groups
  • Possibility to upload transformer using administration interface
  • Possibility to download an ontology from a given URI
  • Possibility to re-run affected pipelines when a group of rules is deleted
  • Possibility to stop/pause pipeline execution
  • Possibility to show for the group the list of pipelines using it
  • Possibility to show for the transformer the list of pipelines using it
  • The navigation among the pages should be clickable
  • Filtering of the outputs according to single attributes
  • GUI: management of loads to the raw data mart (how often), management of the data marts (how often, which data, where - DB, file, ...)

Other

  • for rules (QA, DN), check the validity of created rules (try launching them on some test data so that the SPARQL does not fail on a syntax error)
  • role Corrector - able to edit data
  • Linking to external data sources - periodic/manual launching of a component with well-defined interfaces (like a transformer), with the possibility to be managed in the frontend; see meeting 17.6.
  • a possibility to add label to the inserted graph
  • create a user-friendly tool for loading data into ODCleanStore, with a possibility to split big files into small graphs, ...
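Full rule validation, as the QA/DN bullet above suggests, means executing the rule against the SPARQL engine on test data. As a cheap first line of defense before that, obviously broken rules could be rejected by a syntactic sanity check. The sketch below is exactly that and nothing more - it is not a SPARQL parser, and the accepted keyword set is an assumption:

```python
def quick_rule_check(rule):
    """Cheap pre-flight check for a SPARQL rule: balanced braces and a
    recognized leading keyword.  Not a parser - real validation should
    still execute the rule on a small test graph in the SPARQL engine."""
    depth = 0
    for ch in rule:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                return False  # closing brace before any opening one
    if depth != 0:
        return False          # unbalanced braces
    words = rule.split()
    if not words:
        return False
    # illustrative keyword list; a real check would follow the grammar
    return words[0].upper() in {"SELECT", "CONSTRUCT", "ASK", "DESCRIBE",
                                "DELETE", "INSERT", "PREFIX", "WITH"}

quick_rule_check("DELETE { ?s ?p ?o } WHERE { ?s ?p ?o }")  # passes
quick_rule_check("SELECT * WHERE { ?s ?p ?o")               # rejected
```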

Visualization (Output WS)

  • Create GUI depicting quality scores/justifications, coloring results
  • Create GUI which will enable user to annotate the given SPARQL query - sample annotation: select source XY for var ?a
