Ticket #267 (closed enhancement: fixed)
Support evaluation of 3rd party operators
| Reported by: | thompsonbry | Owned by: | thompsonbry |
|---|---|---|---|
| Priority: | critical | Milestone: | |
| Component: | Query Engine | Version: | TERMS_REFACTOR_BRANCH |
| Keywords: | | Cc: | gjdev, mrpersonick, mroycsi |
Description
The openrdf platform has an ExternalSet operator which can be used by stacked sails to integrate external databases into a SPARQL query plan. This amounts to an arbitrary extension operator, which accepts a single openrdf BindingSet object as its input and produces a CloseableIteration from which its results may be drained. This ticket was raised to address integration issues for a 3rd party GIS/free text index which makes use of the ExternalSet.
As of the QUADS_QUERY_BRANCH, bigdata has an extensible operator model. By default, evaluation of an operator plan using the QueryEngine will pipeline chunks of IBindingSets through operators. A variety of operators already exist, including PipelineJoin, SubqueryOp, ConditionalRoutingOp, etc.
There are several "impedance mismatches" involved. First, bigdata uses a different API for managing binding sets and works internally with IVs rather than RDF Value objects. Second, bigdata operators accept sources from which they can draw multiple input solutions, and generate chunks of output solutions (evaluation is vectored). Third, evaluation is controlled by the QueryEngine, which schedules operators for evaluation passes as chunks of intermediate solutions become available for that operator. The openrdf EvaluationStrategy interface presumes that evaluation occurs during a visitation pattern traversal of the openrdf operator tree. Bigdata uses that visitation to translate the openrdf operator tree into a bop (bigdata operator model) and then submits the bop plan for query optimization, join ordering optimization, and finally evaluation.
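The second mismatch (one input solution at a time versus vectored evaluation) can be illustrated with a minimal sketch. It does not use the openrdf or bigdata APIs: the class `VectoredAdapter`, the method `evaluateChunk`, and the use of a plain `Function` returning an `Iterator` as a stand-in for an ExternalSet-style operator are all assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

/**
 * Hypothetical adapter (not part of openrdf or bigdata): wraps an operator
 * that accepts ONE input solution and returns an iterator of results, and
 * drives it from a chunk of inputs, emitting fixed-size chunks of outputs.
 */
public class VectoredAdapter {

    public static <B> List<List<B>> evaluateChunk(
            List<B> inputChunk,
            Function<B, Iterator<B>> singleInputOperator, // stand-in for the external operator
            int outputChunkSize) {
        if (outputChunkSize <= 0) {
            throw new IllegalArgumentException("outputChunkSize must be positive");
        }
        List<List<B>> outputChunks = new ArrayList<>();
        List<B> current = new ArrayList<>(outputChunkSize);
        for (B input : inputChunk) {
            // One invocation of the external operator per input solution.
            Iterator<B> results = singleInputOperator.apply(input);
            while (results.hasNext()) {
                current.add(results.next());
                if (current.size() == outputChunkSize) {
                    outputChunks.add(current);            // emit a full chunk
                    current = new ArrayList<>(outputChunkSize);
                }
            }
        }
        if (!current.isEmpty()) {
            outputChunks.add(current);                    // emit the final partial chunk
        }
        return outputChunks;
    }
}
```

In this sketch the external operator is still called once per input solution; only the packaging of inputs and outputs is vectored. A real integration would also have to close each result iteration and translate between the two binding set representations.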
I have outlined three possible paths forward here:
1. "Run" the ExternalSet first and then just push the results into the bigdata query evaluation. openrdf only allows a single source BindingSet to be specified, but bigdata query evaluation actually accepts a stream of source solutions. If we expose a means to specify that source solution stream, then the output of the resource queried by the ExternalSet could simply be fed into the bigdata query evaluation. Since the ExternalSet produces openrdf BindingSets, those would need to be efficiently translated into IVs. BigdataBindingSetResolverator does the reverse (efficiently translating IVs into materialized RDF values), so we would need to write another class to resolve a binding set stream against the lexicon, obtaining IVs.
2. Write a bigdata operator (BOp) which encapsulates the logic required to query the external resource and annotate the ExternalSet object with sufficient information to enable bigdata to translate it into the appropriate extension BOp.
3. Use a "magic predicate" similar to the bigdata "search" predicate and modify the BigdataEvaluationStrategyImpl3 class to recognize and handle that predicate, generating an appropriate bigdata IPredicate with an IAccessPathExpander annotation. The expander would embody the logic to query the external resource.
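The translation step in option (1), resolving a stream of binding sets against the lexicon to obtain IVs, might look roughly like the sketch below. It is not the bigdata API: `ChunkedIvResolver`, `resolveChunk`, the use of `String` for RDF values and `Long` for IVs, and the `batchLookup` function standing in for the lexicon are all hypothetical. The point it illustrates is resolving a whole chunk with a single batched lookup over the distinct values rather than one lookup per binding. Dropping solutions that mention a value unknown to the lexicon is a simplifying assumption of this sketch, not a statement of what the real class should do.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

/**
 * Hypothetical sketch: resolve a chunk of solutions whose bindings are RDF
 * values (modelled as String) into solutions whose bindings are internal
 * identifiers (modelled as Long), using one batched lexicon lookup per chunk.
 */
public class ChunkedIvResolver {

    public static List<Map<String, Long>> resolveChunk(
            List<Map<String, String>> chunk,
            Function<Set<String>, Map<String, Long>> batchLookup) { // stand-in for the lexicon
        // 1. Collect the distinct values across the whole chunk.
        Set<String> distinctValues = new LinkedHashSet<>();
        for (Map<String, String> solution : chunk) {
            distinctValues.addAll(solution.values());
        }
        // 2. One batched lookup for the chunk.
        Map<String, Long> ids = batchLookup.apply(distinctValues);
        // 3. Rewrite each solution in terms of the resolved identifiers.
        List<Map<String, Long>> resolved = new ArrayList<>(chunk.size());
        for (Map<String, String> solution : chunk) {
            Map<String, Long> out = new LinkedHashMap<>();
            boolean complete = true;
            for (Map.Entry<String, String> binding : solution.entrySet()) {
                Long id = ids.get(binding.getValue());
                if (id == null) {        // value not known to the lexicon
                    complete = false;    // assumption: drop the solution
                    break;
                }
                out.put(binding.getKey(), id);
            }
            if (complete) {
                resolved.add(out);
            }
        }
        return resolved;
    }
}
```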
Options (1) and (2) both imply that the query to the external resource would occur before the rest of the query was evaluated. Option (3) raises the possibility that the integration would be a full-fledged bigdata operator, but comes with additional implementation requirements, since the bigdata query optimizers must be able to handle the operator. That could mean both self-reporting the estimated cardinality of the operator and supporting "cutoff" evaluation for the runtime query optimizer (https://sourceforge.net/apps/trac/bigdata/ticket/64).
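To make the "cutoff" requirement more concrete, here is a minimal sketch of one way such an evaluation could be sampled. It is an illustration only and does not reflect how the bigdata runtime query optimizer is actually implemented: `CutoffSample`, `sample`, and the `Function` standing in for the operator are hypothetical names. The idea shown is to stop evaluating once a limit on output solutions is reached and to extrapolate an outputs-per-input ratio from what was consumed so far.

```java
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

/**
 * Hypothetical sketch of cutoff evaluation: run an operator over its inputs,
 * stop as soon as `limit` outputs have been produced, and report how many
 * inputs were consumed so a rough outputs-per-input ratio can be estimated.
 */
public class CutoffSample {
    public final int inputsConsumed;
    public final int outputsProduced;
    public final boolean cutoff; // true if evaluation stopped at the limit

    private CutoffSample(int inputsConsumed, int outputsProduced, boolean cutoff) {
        this.inputsConsumed = inputsConsumed;
        this.outputsProduced = outputsProduced;
        this.cutoff = cutoff;
    }

    /** Rough estimate; the last input may have been only partly drained. */
    public double estimatedOutputsPerInput() {
        return inputsConsumed == 0 ? 0.0 : (double) outputsProduced / inputsConsumed;
    }

    public static <B> CutoffSample sample(
            List<B> inputs,
            Function<B, Iterator<B>> operator, // stand-in for the operator under test
            int limit) {
        if (limit <= 0) {
            throw new IllegalArgumentException("limit must be positive");
        }
        int consumed = 0;
        int produced = 0;
        for (B input : inputs) {
            consumed++;
            Iterator<B> results = operator.apply(input);
            while (results.hasNext()) {
                results.next();
                produced++;
                if (produced >= limit) {
                    return new CutoffSample(consumed, produced, true); // stop early
                }
            }
        }
        return new CutoffSample(consumed, produced, false); // ran to completion
    }
}
```

Because the final input can be abandoned part-way through, the ratio from a cutoff run tends to underestimate the true fan-out; an actual optimizer would need to account for that.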
Overall, option (1) would appear to be the simplest. I'd like to get feedback from both Mike and Gerjon before proceeding on that basis.