From: Rob V. <rv...@do...> - 2015-02-11 12:07:09
Max

Comments inline:

From: Max - Micrologiciel <ma...@mi...>
Date: Wednesday, 4 February 2015 16:26
To: Rob Vesse <rv...@do...>
Subject: Re: About the SPIN Processor

> Hi Rob,
>
> Thanks for your answer.
>
> Maybe I've been giving this too much hard thought lately, but I tried to
> get back on track quickly while I have the time ;)
> I may have been somewhat confused (thus confusing) while explaining my
> concerns.
>
> Let me try to make those points clearer now that I've had some time to
> rethink them.
>
> Mainly, the framework will have to compensate for the underlying storage
> engine's lack of support for the SPIN requirements.
> Hopefully those will be addressed someday, so that in time the
> implementation should be relevant only for the dotNetRDF in-memory
> storage. However, this is not the case now and it won't be for a (long?)
> time.

Most likely so, and of course there is always the risk that SPIN gets
replaced with some newer approach in the future.

> Meanwhile, we should be able to provide local support through work-around
> strategies.

Ideally yes

> Related to transaction support, I have been considering the following
> strategy. Regarding isolation, I believe we can provide isolation between
> transactions by using "temporary" graphs in the underlying storage:
> * during SPARQL evaluation, FROM/FROM NAMED/GRAPH clauses and/or triple
>   patterns can be rewritten to use those graphs (according to how those
>   graphs are built and what they contain).

FROM is difficult to rewrite because where there are multiple FROM clauses
the default graph is the merge of those graphs, so any duplicates you could
get by querying those graphs separately need suppressing. Restricting use
of FROM/FROM NAMED may be an option

> * to avoid bloating the underlying storage with whole graph copies and
>   performing a full graph diff (with all possible concurrency issues) at
>   commit time, it may be better to store only the local additions and
>   removals (those would need to be retrieved separately anyway to filter
>   the rules and constraints to run in the SPIN workflow)

This is what the existing transactional update support does

> For instance, the graph pattern
>> GRAPH <g> { ?s ?p ?o . }
> could be rewritten into
>> {
>>   GRAPH <g> { ?s ?p ?o . }
>>   MINUS { GRAPH <mytrans:removals:g> { ?s ?p ?o . } }
>> } UNION {
>>   GRAPH <mytrans:additions:g> { ?s ?p ?o . }
>> }
>
> To provide for concurrency, I am thinking of maintaining a special graph
> in the storage that would work as a transaction log to keep track of
> transactions, graph changes and commits:
> * I think keeping this log directly in the storage would allow for
>   consistent time-stamping and possible distributed usage between
>   processes/threads.

Possibly, though this has the extra cost of round trips to the remote
storage. Distributed usage is likely ill advised and I would go for a
simpler design rather than try adding distributed support immediately

> * I feel all transaction changes (with transaction log updates) can also
>   be performed with the "diff graphs" at commit time within a single
>   SPARQL update request, thus performing the commit atomically (see the
>   sketch below).
> * the "SpinWrappedStorage" implementation would then become responsible
>   for ensuring effective local concurrency within the framework.
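> For instance (just a sketch, reusing the illustrative diff graph names
> from above), the commit of the pending changes to <g> might be issued as
> a single update request:
>> # apply the transaction's diffs to <g>, then discard them
>> DELETE { GRAPH <g> { ?s ?p ?o . } }
>> WHERE  { GRAPH <mytrans:removals:g> { ?s ?p ?o . } } ;
>> INSERT { GRAPH <g> { ?s ?p ?o . } }
>> WHERE  { GRAPH <mytrans:additions:g> { ?s ?p ?o . } } ;
>> DROP SILENT GRAPH <mytrans:removals:g> ;
>> DROP SILENT GRAPH <mytrans:additions:g>
> Since the operations within a single request are applied atomically, the
> whole commit would stand or fall together.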
> This means we could provide a transactional layer upon any storage
> provider. Does that sound reasonable?
> To try and simplify this I was thinking about using the .NET
> System.Transactions library. Would you have any advice or experience with
> it, or do you know of any alternative that would help?

No, never used it personally. It appears to be primarily tied to SQL Server
and requires MSDTC, which would make it non-portable

> As for the SPARQL caveats and extensions, I feel this really concerns the
> core of the SPIN evaluation engine, since the goal is to rely as much as
> possible on the underlying storage's performance. (I'm fully aware that
> relying only on SPARQL rewriting will also rule out any extension
> mechanism without a direct implementation by the underlying storage.)
>
> For instance, consider the getFather example here:
> http://composing-the-semantic-web.blogspot.fr/2009/01/understanding-spin-functions.html
>
> Say we want to execute this SPIN query:
>> SELECT ?child ?father
>> WHERE {
>>   ?child a kennedys:Person .
>>   BIND ( :getFather(?child) AS ?father )
>> }
>
> Given the definition of the getFather function, we could efficiently
> rewrite the query into the equivalent form:
>> SELECT ?child ?father
>> WHERE {
>>   ?child a kennedys:Person .
>>   OPTIONAL {
>>     SELECT ?child ?father WHERE {
>>       ?child kennedys:parent ?father .
>>       ?father kennedys:gender kennedys:male .
>>     }
>>   }
>> }

Yep that makes sense

> This would only require a SPARQL rewrite and would avoid local evaluation
> of the query (this could represent a non-negligible performance gain in
> IO and processing for complex functions).
> However, the SPIN recommendation does not constrain the SELECT modifiers
> usable in a function's spin:body query. The definition could then declare
> a LIMIT clause, making the SPARQL substitution process unusable, since
> evaluation would require a correlated sub-query, which SPARQL cannot
> handle.

What do you mean by SPARQL substitution here? dotNetRDF will avoid index
joins where they may alter the results of the query so I don't think that
is specifically an issue. I think doing any kind of static substitution
during rewrite is probably ill advised and you should likely just rely on
the underlying query engine to restrict appropriately based on the join. If
people have put solution modifiers into their functions then this may not
give the correct results. For the in-memory implementation you could
register SPIN functions as SPARQL extension functions and evaluate them
with substitution if necessary.

> Rewriting the function calls by turning the query "inside-out" might be
> possible, but I've still not evaluated what multiple different function
> calls would translate into, nor how the resulting query would scale on
> the server.
>
> So would it be safe/wise to rule out those cases right now? IMHO, that
> would hinder the interest and usefulness of the library.

I wouldn't rule those out but I think there might be better ways to
approach this as you've suggested later in your email

> If not, this would require handling the evaluation of SPIN extensions
> (whatever the source of the extension, i.e. SPIN definitions, SPINx or
> any 3rd party extension mechanism) locally through the dotNetRDF engine
> whenever a SPARQL substitution cannot be effective. But that raises the
> problem of finding an acceptable strategy to reduce IO and local
> computation as much as possible, which is where I find myself lacking.
>
> So far (supposing the getFather function does not substitute well into
> SPARQL) we could either:
> 1) naive: precompute, in a temporary graph in the storage, the result for
>    each potential binding of ?child and directly query a left join
>>   => would require much IO to get the Multiset and send back the results
>>      as a graph
>>   => would most probably generate too much unused computation compared
>>      to filtered-out results
> 2) define a "NestedJoin" algorithm that pulls a Multiset of the variables
>    used by the function, pre-bound by the LHS Multiset (using a VALUES
>    clause, for instance?), for local evaluation (see the sketch below)
>>   => would not work against a remote storage if the function arguments
>>      involve blank nodes (unless we can enforce some kind of
>>      skolemization or filtering?)
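> For instance (a rough sketch; the two individuals are made up and stand
> in for whatever bindings the LHS Multiset actually produced), the
> pre-bound evaluation for getFather could be shipped to the server as:
>> SELECT ?child ?father
>> WHERE {
>>   # candidate arguments injected from the LHS Multiset
>>   VALUES ?child { kennedys:JosephKennedy kennedys:JohnKennedy }
>>   ?child kennedys:parent ?father .
>>   ?father kennedys:gender kennedys:male .
>> }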
This sounds like the index joins we already do in many cases, however as
already noted these can't be applied in certain situations

> 3) split the query into a remote query pulling pre-joined/pre-filtered
>    data (if possible) as a Multiset or Graph from the expanded query
>    patterns, and evaluate the remaining algebra and computations locally?
>>   => as I am not fluent in the algebra evaluation algorithms and API, I
>>      feel this approach would be too much for me alone :P
>>   => this also includes being able to identify which SPARQL extensions
>>      are natively supported by the underlying storage (even if this is
>>      more related to configuration and possible SPARQL Service
>>      Description support).
>
> There may be more ideas to dig into, but I have reached the limits of my
> imagination for now, so feel free to add to these if you see any other
> way to handle this ;)

In some sense it might simplify things to just do an in-memory
implementation for the time being and consider supporting arbitrary storage
layers as a later extension, i.e. an iterative development approach.

Rob

> Thanks for your advice.
>
> Cheers,
> Max.
>
>
> 2015-02-03 22:38 GMT+01:00 Rob Vesse <rv...@do...>:
>> Max
>>
>> Thanks for the updates, comments are inline:
>>
>> From: Max - Micrologiciel <ma...@mi...>
>> Date: Thursday, 29 January 2015 03:58
>> To: Rob Vesse <rv...@do...>
>> Subject: About the SPIN Processor
>>
>>> Hi Rob,
>>>
>>> First of all, let me wish you a happy and successful year for 2015.
>>
>> Thanks and same to you too
>>
>>> I'm still working on the inclusion of the SPIN layer into dotNetRDF.
>>> Since last year's first draft, much of my work has been more
>>> experimental (so not really committable) than formal, and most often
>>> aimed at checking whether and how the different issues I encountered
>>> could be handled.
>>>
>>> So before going further (I've been delaying this too much already...) I
>>> wanted to get your advice on the issues I encountered.
>>>
>>> Here is a summary of where I stand for now.
>>>
>>> About SPIN, my first conclusions came to this:
>>> * since SPIN user-defined functions and properties rely mainly on
>>>   SPARQL, it should be possible to handle those through SPARQL
>>>   rewriting.
>>
>> Yes I think that would be a reasonable approach, the current API may
>> make this harder than it needs to be. Hopefully the 1.9 changes will
>> make this much easier in the longer term
>>
>>> * since SPIN provides data-integrity features
>>>   (constructors/rules/constraints...), this requires capturing each
>>>   SPARQL Update command in order to run the SPIN pipeline afterwards.
>>> * since those data-integrity features may signal violations, the
>>>   command results must be cancelled somehow.
>>>   This implies that there must be some transactional support in the
>>>   processor.
>>
>> Yes, however the SPARQL specs already require that the updates within a
>> request (of which there may be many) are applied atomically, so any
>> SPARQL processor will already need to support transactions in some sense
>>
>>> Based on the current state of the art, we are faced with the following
>>> issues:
>>> * pipelining the SPIN integrity chain requires handling multiple SPARQL
>>>   updates/queries in a single transactional context.
>>>> * HTTP being stateless, there is no way (yet? see
>>>>   http://people.apache.org/~sallen/sparql11-transaction/) to span
>>>>   transactions over multiple requests
>>
>> Yes this is an issue, some 3rd party stores define their own protocols
>> for transactions e.g. Stardog
>>
>> If you have a 3rd party store that doesn't support any kind of
>> transactions then the solution may be simply to say that we can't
>> support that.
>>
>>>> * subsequently, supporting transactions locally requires handling
>>>>   proper isolation between clients but also possible transaction
>>>>   concurrency problems.
>>
>> Yep, right now dotNetRDF's in-memory implementation uses MRSW (Multi
>> Reader or Single Writer) concurrency so we avoid concurrency issues by
>> only allowing a single write transaction to be in progress and blocking
>> all reads while transactions are in progress
>>
>>>> * it also requires simulating the transactional environment on the
>>>>   underlying server, to reduce as much as possible the memory consumed
>>>>   by dotNetRDF or by the storage server.
>>
>> Yes ideally the server should manage the transactions but of course if
>> you are trying to layer SPIN over a server that doesn't support SPIN
>> then some state necessarily has to be maintained by the client.
>>
>> This perhaps begs the question of how general the SPIN implementation
>> should be and whether it should be limited to a subset of suitable
>> stores.
>>
>>> * SPIN-to-SPARQL rewriting also raises some problems due to:
>>>> * how sub-queries are processed according to the recommendation
>>>> * some difficulties in finding an equivalent evaluation strategy for
>>>>   some forms of property paths.
>>
>> Can you elaborate on what you mean by this?
>>
>> Is the sub-query stuff related to the use of SPIN functions and
>> templates which potentially require substituting some constants into the
>> sub-query prior to execution?
>>
>>> Going a bit further, I tried experimenting with a simple SWP layer on
>>> top of the stack, with some success, until I discovered my prototype
>>> was biased by a Sesame bug in the handling of optional sub-queries.
>>> Anyway, I was directly confronted with how to handle the natively
>>> provided SWP functions which cannot be converted into SPARQL. The
>>> problem also arises at the basic SPIN level if you consider extensions
>>> like SPINx, so it may be best and simpler to handle the case here?
>>
>> I would start by getting the core working and worry about how to add the
>> extra layers later. Presumably some of the non-SPARQL things could be
>> implemented using the existing extension function API
>>
>>> Also, I see the 1.9 rewrite is coming along well, and it introduces
>>> many API changes that could make the implementation easier.
>>
>> Yes although much slower than I would have liked since I have very
>> little time to work on this these days.
>> The changes are going to be quite invasive as you've probably noticed,
>> but this is necessary to address a lot of the shortcomings in the
>> current API and to make it easier to improve the query engine going
>> forward.
>>
>> I keep hoping to be able to start putting out some limited alpha
>> releases of the new API at some point this year, but then I said that in
>> 2014 and never got far enough to do that. The new query engine still has
>> some big pieces missing (a query parser and results IO support, for a
>> start) before it could be meaningfully used. Maybe it'll be ready later
>> this year if I can find the time to get it into a sufficiently usable
>> state.
>>
>> Rob
>>
>>> Since you have a much more global view of dotNetRDF and of the
>>> RDF/SPARQL ecosystem than me, your advice would be welcome. If you're
>>> available, I'd rather discuss this with you so we can decide how
>>> efforts and contributions may best be directed.
>>>
>>> Please tell me what you think about this.
>>>
>>> Thanks for your consideration,
>>> Max.