
Ticket #267 (closed enhancement: fixed)

Opened 2 years ago

Last modified 15 months ago

Support evaluation of 3rd party operators

Reported by: thompsonbry Owned by: thompsonbry
Priority: critical Milestone:
Component: Query Engine Version: TERMS_REFACTOR_BRANCH
Keywords: Cc: gjdev, mrpersonick, mroycsi

Description

The openrdf platform has an ExternalSet operator which can be used by stacked sails to integrate external databases into a SPARQL query plan. This amounts to an arbitrary extension operator, which accepts a single openrdf BindingSet object as its input and produces a CloseableIteration from which its results may be drained. This ticket was raised to address integration issues for a 3rd party GIS/free text index which makes use of the ExternalSet.

As of the QUADS_QUERY_BRANCH, bigdata has an extensible operator model. By default, evaluation of an operator plan using the QueryEngine will pipeline chunks of IBindingSets through operators. A variety of operators already exist, including PipelineJoin, SubqueryOp, ConditionalRoutingOp, etc.

There are several "impedance mismatches" involved. First, bigdata uses a different API for managing binding sets and works internally with IVs rather than RDF Value objects. Second, bigdata operators accept sources from which they can draw multiple input solutions, and generate chunks of output solutions (evaluation is vectored). Third, evaluation is controlled by the QueryEngine, which schedules operators for evaluation passes as chunks of intermediate solutions become available for that operator. The openrdf EvaluationStrategy interface presumes that evaluation occurs during a visitation pattern traversal of the openrdf operator tree. Bigdata uses that visitation to translate the openrdf operator tree into a bop (bigdata operator model) plan and then submits the bop plan for query optimization, join ordering optimization, and finally evaluation.

I have outlined three possible paths forward here:

1. "Run" the ExternalSet first and then just push the results into the bigdata query evaluation. openrdf only allows a single source BindingSet to be specified, but bigdata query evaluation actually accepts a stream of source solutions. If we expose a means to specify that source solution stream, then the output of the resource queried by the ExternalSet could simply be fed into the bigdata query evaluation. Since the ExternalSet produces openrdf BindingSets, those would need to be efficiently translated into IVs. BigdataBindingSetResolverator does the reverse (efficiently translating IVs into materialized RDF values), so we would need to write another class to resolve a binding set stream against the lexicon, obtaining IVs.

2. Write a bigdata operator (BOp) which encapsulates the logic required to query the external resource and annotate the ExternalSet object with sufficient information to enable bigdata to translate it into the appropriate extension BOp.

3. Use a "magic predicate" similar to the bigdata "search" predicate and modify the BigdataEvaluationStrategyImpl3 class to recognize and handle that predicate, generating an appropriate bigdata IPredicate with an IAccessPathExpander annotation. The expander would embody the logic to query the external resource.

Options (1) and (2) both imply that the query to the external resource would occur before the rest of the query was evaluated. Option (3) raises the possibility that the integration would be a full-fledged bigdata operator, but comes with additional implementation requirements since the bigdata query optimizers must be able to handle the operator. That could imply both being able to self-report the estimated cardinality of the operator and support for "cutoff" evaluation in support of the runtime query optimizer (https://sourceforge.net/apps/trac/bigdata/ticket/64).

Overall, option (1) would appear to be the simplest. I'd like to get feedback from both Mike and Gerjon before proceeding on that basis.

Change History

Changed 2 years ago by gjdev

Ok, I'm just starting to realize the impact of Bigdata's query evaluation vs. Sesame's top-down approach... I can't really comment on the Bigdata internals, so I'll focus on the ExternalSet implementation and how we use it in our geospatial solution. Note that any of the three implementation solutions will help us. Our sail will use a standard top-down Sesame-based QueryEvaluationStrategy for non-conforming queries, until an AST (TupleExpr) subtree is found that does conform. Those subtrees will be sent to the BigdataSailConnection. This may make the entire query run very slowly, though.

Regarding the issues raised:
* All values (solutions) returned by the ExternalSet are values that do actually exist in the database (we just build an additional index, but [at least currently] a geometry literal will still be stored by bigdata itself as well). This means that conversion to IV should be straightforward. But note that this is specific to our geospatial extension. Other third-party implementations of ExternalSet may (and in fact the name suggests that this is its intended use) return solutions that are not part of the store.
* Running the evaluation first and streaming the results into the rest of the query will be good enough for almost all of our queries. We do have some queries that have the ExternalSet on the right hand side of a left join, where the solutions of the lhs need to be streamed into the ExternalSet. We do want to be able to allow something like that even for normal joins (i.e. they need bindings as input that are resolved by another part of the query). Not having support for that in Bigdata means some rarely used queries will be very slow. In summary: running the ExternalSet first and streaming the results into the rest of the query will be good enough for 95% of our queries. I'm more than happy with such a solution if the alternatives are more complex or take more time to implement.
* Regarding cardinality and cutoff: ExternalSet has an interface method for cardinality. We currently always set that to a low value so that the ExternalSet is pushed to the top of a top-down evaluation in Sesame as much as possible. According to one of my colleagues it is possible to get a postgres query plan for our ExternalSet query via JDBC. That will make it possible to provide cardinality, estimated query time, and a bunch of other statistics for the query. Cutoff seems easy enough to implement, just stop asking for solutions? Comparing ExternalSet cardinality with Bigdata cardinality imho doesn't make sense though, unless the differences in overhead and latency costs are taken into account as well.

I don't really have enough knowledge of Bigdata to understand the difference between a BOp and an IAccessPathExpander, but don't you mean Option (2) where you say Option (3) raises ...? Otherwise I'm confused.

My understanding of all this is that option (2) is the complete solution, and would provide a lot of room for performance optimizations in later stages. Option (1) is a workable solution that would give close to optimal performance for 95% of our queries. Option (3) I don't understand enough to comment on.

Changed 2 years ago by thompsonbry

  • status changed from new to accepted

Yes, you are correct in that I have confused the labeling of options (2) and (3) in the summary.

Concerning cutoff evaluation, the purpose is to estimate the cardinality of alternative join paths when deciding the best join ordering for the query. We start with the set of access paths (called predicates). The set of predicates is a "join graph". The set of possible joins is defined by shared variables among the predicates. We sample each access path and then sample some joins based on those access path samples. This gives us an estimate of the join hit ratio for each of the initial joins. Based on that we choose the initial set of joins for the join paths. The algorithm then iteratively explores extensions of those initial join paths until it has identified the join path (aka join ordering) with the lowest estimated cost based on the query and the data.

The role of cutoff joins is to halt the production of the various joins eagerly so that a sample of their join hit ratio may be developed without causing the full join product to be materialized.
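
To make the idea concrete, here is a minimal sketch of a cutoff join used to estimate a join hit ratio. It is not the bigdata runtime optimizer: the class and method names (CutoffJoinSketch, Estimate, cutoffJoin) are hypothetical, solutions are reduced to plain strings, and the right-hand access path is modelled as an in-memory map.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only; these names are not part of bigdata. */
final class CutoffJoinSketch {

    /** Counts gathered by a cutoff join before it was halted. */
    static final class Estimate {
        final int inputsConsumed;
        final int outputsProduced;

        Estimate(int inputsConsumed, int outputsProduced) {
            this.inputsConsumed = inputsConsumed;
            this.outputsProduced = outputsProduced;
        }

        /** Output solutions per input solution; 0 if nothing was consumed. */
        double hitRatio() {
            return inputsConsumed == 0 ? 0.0 : (double) outputsProduced / inputsConsumed;
        }
    }

    /**
     * Joins sampled left-hand keys against an index of right-hand matches,
     * halting as soon as at least {@code limit} outputs have been produced.
     */
    static Estimate cutoffJoin(List<String> sample, Map<String, List<String>> index, int limit) {
        int in = 0;
        int out = 0;
        for (String key : sample) {
            if (out >= limit) {
                break; // cutoff: do not materialize the full join
            }
            in++;
            out += index.getOrDefault(key, Collections.<String>emptyList()).size();
        }
        return new Estimate(in, out);
    }
}
```

The ratio is computed only over the inputs actually consumed, which is what lets the join be halted early without biasing the estimate toward the unread part of the sample.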

I think that it may just not make sense to attempt a native bigdata operator implementation for an external resource. The required integration level to support query optimization is likely too high, especially when crossing a JDBC connection to another database.

Likewise, I expect that performance will degrade remarkably whenever you are forced to order the external set operator into any position other than the first in the query plan. However, it is possible to request that bigdata fully materialize the intermediate solutions feeding an operator. So, if you have your external spatial join in the middle of a query plan, you could materialize the results up to that point, feed those results into your external spatial join, and then feed the results back into bigdata's query engine. But the easier way to accomplish the same effect is to partition your query such that you run a bigdata native query for the "left" side feeding the spatial join. You can then materialize those results yourself and feed them into the spatial join. Finally, you can feed the results of the spatial join into the "right" hand side of the query. The left and right sides of the query would be separate native bigdata query plans which you then stitch together with your external spatial join. That would give you the means to batch the intermediate solutions into your spatial join.
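
A rough sketch of that stitching, under heavy simplification: each plan fragment is a hypothetical Stage that maps a materialized batch of solutions to its outputs, and solutions are plain string maps rather than bigdata or openrdf binding sets. None of these names exist in bigdata.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only; these names are not part of bigdata. */
final class PartitionedPlanSketch {

    /** A plan fragment: maps a materialized batch of solutions to its output solutions. */
    interface Stage {
        List<Map<String, String>> apply(List<Map<String, String>> input);
    }

    /** Runs left plan, then the external join on the whole batch, then the right plan. */
    static List<Map<String, String>> run(Stage left, Stage externalJoin, Stage right) {
        // Seed the left plan with a single empty solution.
        List<Map<String, String>> seed =
                Collections.singletonList(Collections.<String, String>emptyMap());
        // Fully materialize the left side so the external join sees one batch.
        List<Map<String, String>> leftOut = new ArrayList<>(left.apply(seed));
        // The external join is called once, not once per solution.
        List<Map<String, String>> joined = new ArrayList<>(externalJoin.apply(leftOut));
        // Feed the joined solutions into the right-hand plan.
        return right.apply(joined);
    }
}
```

The point of the shape is that the external resource is invoked once per query with a batch, rather than once per intermediate solution.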

I'll proceed along the lines of option (1), which is to expose a means to feed a stream of binding sets into a bigdata query. You can use that both for "run first" query plans and for query plans where you partition the plan into a left query plan, a "fully materialized external set query", and a right query plan.

Changed 2 years ago by gjdev

Agreed.

Changed 2 years ago by thompsonbry

Raised priority. I would like to get this feature into the quads query branch release.

Changed 2 years ago by thompsonbry

  • priority changed from major to critical

Changed 2 years ago by thompsonbry

  • owner changed from thompsonbry to mrpersonick
  • status changed from accepted to assigned

Mike, Please take a look at this once you wrap up the native SPARQL operators. Thanks, Bryan

Changed 2 years ago by thompsonbry

Mike wrote:

For a single incoming BindingSet, when you prepare a query from the BigdataSailRepositoryConnection, you can then make calls to setBinding on that query to seed it with variable bindings. You leave off the "?".

final BigdataSailRepository repo = getRepository();
final BigdataSailRepositoryConnection cxn = repo.getReadOnlyConnection();
final TupleQuery tupleQuery =
        cxn.prepareTupleQuery(QueryLanguage.SPARQL, query);
tupleQuery.setBinding("x", bdVal1);
tupleQuery.setBinding("y", bdVal2);
TupleQueryResult result = tupleQuery.evaluate();

Eventually, those bindings are turned into a Sesame BindingSet, which flows through to the BigdataSailConnection.evaluateTupleQuery method:

public CloseableIteration<? extends BindingSet, QueryEvaluationException> evaluate(
        TupleExpr tupleExpr, Dataset dataset,
        final BindingSet bindings, final boolean includeInferred)

And that in turn flows through to the BigdataEvaluationStrategyImpl3.evaluate method:

CloseableIteration<BindingSet, QueryEvaluationException> evaluate(
        final TupleExpr tupleExpr, final BindingSet bs,
        final Properties queryHints) throws QueryEvaluationException;

Which eventually finds its way to the doEvaluateNatively() methods, where it is currently unused (bug). We seed the queries with an empty binding set.

I think for the Gerjon thing, you should give him a method on the BigdataSailQuery interface like this:

void setBindings(final CloseableIteration<BindingSet, QueryEvaluationException> bindings);

That would allow him to pipe his results from his external GIS query right into a bigdata query. You'd then have to implement that method on BigdataSailTupleQuery, BigdataSailGraphQuery, and BigdataSailBooleanQuery so that they pass the incoming iteration of BindingSet objects into the BigdataSailConnection.evaluate method (you'd need a new evaluate method that can accept that incoming stream).

You'd wrap the iteration of Sesame binding sets with a striterator that could convert them into bigdata binding sets by stripping the IVs off the BigdataValues.

Then you'd just have to find a way to pipe that iteration into a pipeline operator.

So he'd use it like this:

final BigdataSailRepository repo = getRepository();
final BigdataSailRepositoryConnection cxn = repo.getReadOnlyConnection();
final TupleQuery externalQuery =
        cxn.prepareTupleQuery(QueryLanguage.SPARQL, externalQueryString);
final TupleQueryResult externalResults = externalQuery.evaluate();
final TupleQuery bigdataQuery =
        cxn.prepareTupleQuery(QueryLanguage.SPARQL, bigdataQueryString);
bigdataQuery.setBindings(externalResults);
final TupleQueryResult results = bigdataQuery.evaluate();

Changed 2 years ago by thompsonbry

  • owner changed from mrpersonick to thompsonbry

Changed 2 years ago by thompsonbry

  • status changed from assigned to accepted

Changed 2 years ago by gjdev

Mike, there is no need to do anything at the Repository/BigdataSailQuery level. If I can somehow pass a stream/iteration of Sesame binding sets to the BigdataSailConnection.evaluate method, that will do the trick. All the magic of our IndexingSail is done at the Sail level.

Changed 2 years ago by thompsonbry

Gerjon,

Yes, I meant the TupleQuery (or really the AbstractQuery). We were thinking of adding this feature to the BigdataXXXQuery classes to parallel the existing setBinding(name,value) method on AbstractQuery. However, the BigdataXXXQuery#evaluate() method does this:

    public TupleQueryResult evaluate() throws QueryEvaluationException {

        final TupleExpr tupleExpr = getParsedQuery().getTupleExpr();

        try {

            CloseableIteration<? extends BindingSet, QueryEvaluationException> bindingsIter;

            final BigdataSailConnection sailCon = (BigdataSailConnection) getConnection()
                    .getSailConnection();

            bindingsIter = sailCon.evaluate(tupleExpr, getActiveDataset(),
                    getBindings(), getIncludeInferred(), queryHints);

            bindingsIter = enforceMaxQueryTime(bindingsIter);

            return new TupleQueryResultImpl(new ArrayList<String>(tupleExpr
                    .getBindingNames()), bindingsIter);

        } catch (SailException e) {

            throw new QueryEvaluationException(e);

        }

    }

You can see that it is a pretty thin wrapping around BigdataSailConnection#evaluate(...). We will be overriding that method to support this feature.

The core implementation of BigdataSailConnection#evaluate() currently has this signature:

        /**
         * 
         * @param bindings
         *            Bindings which will be imposed on the initial solutions
         *            pushed into the query pipeline.
         *            
         * @param includeInferred
         *            The <i>includeInferred</i> argument is applied in two
         *            ways. First, inferences are stripped out of the
         *            {@link AccessPath}. Second, query time expansion of
         *            <code>foo rdf:type rdfs:Resource</code>, owl:sameAs, etc.
         *            <p>
         *            Note: Query time expansion can be disabled independently
         *            using {@link Options#QUERY_TIME_EXPANDER}, but not on a
         *            per-query basis.
         * 
         * @param queryHints
         *            A set of properties that are parsed from a SPARQL query.
         *            See {@link QueryHints#PREFIX} for more information.
         */
        public synchronized CloseableIteration<? extends BindingSet, QueryEvaluationException> evaluate(
                TupleExpr tupleExpr,//
                Dataset dataset,//
                BindingSet bindings,//
                final boolean includeInferred,//
                final Properties queryHints//
                ) throws SailException {

I am of two minds concerning how to override this method. Either we can simply pass in an

final CloseableIteration<BindingSet, QueryEvaluationException> bindings

object in place of the

BindingSet bindings

or we can fuse both pieces of information (by imposing the given BindingSet on all BindingSets materialized from the CloseableIteration). It seems like you would prefer for us to simply pass in the CloseableIteration instead of the BindingSet.

Either way the BindingSet and/or CloseableIteration would then be passed through to the EvaluationStrategy (BigdataEvaluationStrategyImpl3).

Looking over BigdataEvaluationStrategyImpl3 and the EvaluationStrategy interface, my inclination is to NOT modify the various evaluate() methods there but to instead set the CloseableIteration as a field on the EvaluationStrategy instance and find a hook at the start of the query evaluation for the top-level of the query in evaluateNatively(...). My thinking here is that the CloseableIteration is only used for the initial BindingSets pushed into the query pipeline (see doEvaluateNatively() ~ line 1040). This code is inline below.

        // Wrap the input binding sets (or an empty binding set if there is no
        // input).
        final IAsynchronousIterator<IBindingSet[]> source = newBindingSetIterator(
                bs != null ? toBindingSet(bs) : new ListBindingSet());

        IRunningQuery runningQuery = null;
        try {

            // Submit query for evaluation.
            runningQuery = queryEngine.eval(queryId, query, source);

            /*
             * Wrap up the native bigdata query solution iterator as Sesame
             * compatible iteration with materialized RDF Values.
             */
            return iterator(runningQuery, database, required);

You can see that we are already pushing in a stream of initial binding sets. If a BindingSet was given, we wrap that as an IAsynchronousIterator visiting a single IBindingSet. If none was given, the IAsynchronousIterator will visit a single empty IBindingSet. If a CloseableIteration was given, then we just need to wrap that as an IAsynchronousIterator and all should be good.

I am hesitant to change the basic visitation pattern of the EvaluationStrategy, which passes the current BindingSet into each method in turn, since that might have unforeseen consequences. For that reason, I'd rather pass both the BindingSet and the CloseableIteration into BigdataSailConnection#evaluate() and then set the CloseableIteration as a field on the BigdataEvaluationStrategyImpl3 and consult that field from within doEvaluateNatively().

Given the existing openrdf API and the means to specify an initial BindingSet using AbstractQuery#setBinding(name,value), it seems that we must either handle or disallow the condition where setBinding(...) is used to provide some initial bound values AND the CloseableIteration is non-null. By far the easiest thing is to simply throw an exception if both values are non-null in BigdataSailConnection#evaluate(...) and BigdataEvaluationStrategyImpl3#doEvaluateNatively(...).
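
The three cases for the initial solutions, plus the proposed rejection when both inputs are supplied, can be sketched as follows. This is an illustration with stand-in types (string maps and java.util.Iterator in place of BindingSet and CloseableIteration); SeedSolutionsSketch and initialSolutions are hypothetical names, not bigdata code.

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.Map;

/** Hypothetical illustration only; these names are not part of bigdata. */
final class SeedSolutionsSketch {

    /**
     * Chooses the stream of initial solutions pushed into the query pipeline.
     *
     * @param single a single initial solution, or null
     * @param stream a stream of initial solutions, or null
     */
    static Iterator<Map<String, String>> initialSolutions(
            Map<String, String> single, Iterator<Map<String, String>> stream) {
        if (single != null && stream != null) {
            // Disallowed rather than fused, which is the simpler of the two choices.
            throw new IllegalArgumentException(
                    "Specify either a single binding set or a stream, not both");
        }
        if (stream != null) {
            return stream; // caller supplied the whole stream
        }
        // One solution: the given one, or an empty one when nothing was given.
        Map<String, String> only =
                single != null ? single : Collections.<String, String>emptyMap();
        return Collections.singletonList(only).iterator();
    }
}
```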

Please let me know what you think.

Thanks,
Bryan

Changed 2 years ago by thompsonbry

One other wrinkle to consider. Evaluation occurs after we have resolved any RDF Values in the Query or the initial BindingSet to bigdata's internal values (IVs). This is done within BigdataSailConnection#evaluate(...):

                final Object[] newVals = replaceValues(dataset, tupleExpr, bindings);
                dataset = (Dataset) newVals[0];
                bindings = (BindingSet) newVals[1];

We will have to take a similar step for each solution drawn from the CloseableIteration. In order to be efficient when translating BindingSets (openrdf) to IBindingSets (bigdata), this process must be batched in a manner similar to the BigdataSolutionResolverator. That resolution step can be achieved using a chunked resolver pattern established within BigdataEvaluationStrategyImpl3#doEvaluateNatively(...).
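
A minimal sketch of such a chunked resolver, assuming a lexicon that can resolve a whole set of values in one call. Everything here is a stand-in: BatchLexicon is a hypothetical interface, RDF values are strings, IVs are Long ids, and a null id simply marks a value the lexicon does not know (how to represent that case is a separate question).

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical illustration only; these names are not part of bigdata. */
final class ChunkedResolverSketch {

    /** Assumed batch lookup: resolves many distinct values in a single call. */
    interface BatchLexicon {
        Map<String, Long> resolve(Set<String> values);
    }

    /** Resolves every solution from the source, one lexicon call per chunk. */
    static List<Map<String, Long>> resolveAll(
            Iterator<Map<String, String>> source, int chunkSize, BatchLexicon lexicon) {
        if (chunkSize < 1) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<Map<String, Long>> out = new ArrayList<>();
        List<Map<String, String>> chunk = new ArrayList<>(chunkSize);
        while (source.hasNext()) {
            chunk.add(source.next());
            if (chunk.size() == chunkSize || !source.hasNext()) {
                resolveChunk(chunk, lexicon, out);
                chunk.clear();
            }
        }
        return out;
    }

    private static void resolveChunk(List<Map<String, String>> chunk,
            BatchLexicon lexicon, List<Map<String, Long>> out) {
        // Collect the distinct values in the chunk so each is looked up once.
        Set<String> distinct = new LinkedHashSet<>();
        for (Map<String, String> solution : chunk) {
            distinct.addAll(solution.values());
        }
        Map<String, Long> ids = lexicon.resolve(distinct);
        // Rewrite each solution in terms of the resolved ids.
        for (Map<String, String> solution : chunk) {
            Map<String, Long> resolved = new LinkedHashMap<>();
            for (Map.Entry<String, String> e : solution.entrySet()) {
                resolved.put(e.getKey(), ids.get(e.getValue())); // null if unknown
            }
            out.add(resolved);
        }
    }
}
```

With five input solutions and a chunk size of two, the lexicon is consulted three times instead of five, and repeated values inside a chunk cost a single lookup.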

Bryan

Changed 2 years ago by thompsonbry

Given that we will have efficient (chunked/vectored) bi-directional resolution of openrdf BindingSets and bigdata IBindingSets, it would be possible to set up an ExtensionBOp (bigdata operator) template which might be the target of the openrdf ExternalSet operator. However, that is a more complex path and I will start with the proposed solution described above.

Changed 2 years ago by thompsonbry

Mike, I've written the bulk translation class. Could you look at the loop below and let me know whether this is doing the right thing by using a DummyIV when the RDF Value is not known to the database? (I believe that the Values will always be known for Gerjon's use case, but this might not be true in general).

Thanks,
Bryan

        for(Binding binding : bindingSet) {
            
            final String name = binding.getName();
            
            final Value value = binding.getValue();
            
            final BigdataValue outVal = map.get(value);

            if (outVal != null) {

                final Constant<?> c;
                
                if (outVal.getIV() == null) {

                    c = new Constant(DummyIV.INSTANCE);
                    
                } else {
                    
                    c = new Constant(outVal.getIV());
                    
                }
                
                out.set(Var.var(name), c);

            }
            
        }

Changed 2 years ago by thompsonbry

The feature is implemented. See TestSetBindingSets for an example of how to make use of this feature.

I've made the Properties object for the QueryHints null if there are no query hints and modified the various classes which look things up in that Properties object.

I've backed out the changes which expose this facility in the BigdataSailQuery interface. For now, the mechanism is available to people who directly invoke BigdataSailConnection#evaluate(...). I prefer this approach with its minimum propagation of the API change until we get some feedback on the utility of this integration.

Committed revision r4790.

Changed 2 years ago by thompsonbry

Backed out the change which allowed queryHints to be null. We need to dynamically set the query hints sometimes (for example, the queryId from the NanoSparqlServer). We can't do that if there is a null reference rather than an empty Properties object.

Committed revision r4791.

Changed 2 years ago by thompsonbry

  • status changed from accepted to closed
  • resolution set to fixed

Changed 22 months ago by thompsonbry

  • status changed from closed to reopened
  • resolution fixed deleted

Changed 22 months ago by thompsonbry

  • cc mroycsi added

We are looking at how to change the mechanism for 3rd party operator integration as part of the refactoring surrounding the SPARQL 1.1 support. Based on a discussion with mroycsi, we are exploring handling this as an extension of the SPARQL "federation" support, using the SERVICE keyword. Matt is going to take a stab at a service registry for service URIs beginning with a "java://" protocol. These services will be "native" bigdata extensions. They will be provided with a copy of the AST for the graph group pattern which they are to evaluate and an IBindingSet containing input bindings. The service will implement a closeable iterator pattern, which will provide for asynchronous interrupt. The solutions will be drained by bigdata as part of normal query evaluation.

We will support such "java" services both within the named subquery extension (part of the query prolog), in the standard graph pattern not triples section, and in subselects. The advantage of named subqueries is that they are run once before the start of the main query. SubSelect queries may either be pipelined (run for each solution which flows through the query engine) or run-once, in which case they are internally translated into a named query with an include of the named solution set. People who desire an integration with a service which can deliver IBindingSets containing IVs to bigdata should implement this native service interface and register their service on a ServiceRegistry within bigdata.
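
As a rough picture of what registering and resolving such a service could look like, here is a sketch with made-up types. ServiceRegistrySketch, NativeService, and the example URI are all hypothetical; the graph pattern is passed as a string and bindings as a plain map, standing in for the AST and IBindingSet described above. It is not the API that was eventually committed.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical illustration only; these names are not part of bigdata. */
final class ServiceRegistrySketch {

    /** A native service: evaluates a graph pattern against input bindings. */
    interface NativeService {
        Iterator<Map<String, Object>> call(String groupGraphPattern,
                Map<String, Object> inputBindings);
    }

    private final ConcurrentHashMap<String, NativeService> services =
            new ConcurrentHashMap<>();

    /** Registers a service under a URI which must use the java:// scheme. */
    void register(String serviceUri, NativeService service) {
        if (!serviceUri.startsWith("java://")) {
            throw new IllegalArgumentException("Not a java:// service URI: " + serviceUri);
        }
        if (services.putIfAbsent(serviceUri, service) != null) {
            throw new IllegalStateException("Already registered: " + serviceUri);
        }
    }

    /** Returns the registered service, or null if the URI is unknown. */
    NativeService lookup(String serviceUri) {
        return services.get(serviceUri);
    }
}
```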

The more general case of services which communicate at the level of SPARQL binding sets supports services running in another process or on another machine. The SPARQL 1.1 WD does not provide a mechanism specifically for communicating bindings to a SERVICE, but we could rewrite the group graph pattern to include BIND(x AS y) clauses. Doing so would make it possible for a remote service to be integrated into pipelined evaluation. However, as a first step, we would only support "run-once" evaluation for such truly remote / cross process services. That is, they would be turned into named subquery / include patterns.

Named subqueries are an Anzo DCS extension. An example is provided below:

     WITH {
       SELECT ?x WHERE {?x rdf:type foaf:Person}
     } AS %namedSet1
     SELECT ?x ?o
      WHERE {
        ?x rdfs:label ?o .
        INCLUDE %namedSet1
     }

The WITH clause introduces the named subquery, which is associated with the name "%namedSet1" in this example. The INCLUDE clause references the named solution set and causes it to be joined into the main query at that location within the graph pattern.

The development branch already supports this named subquery pattern in the AST model, as well as standard SPARQL 1.1 subquery. These features are not yet available at the SPARQL layer. We are currently working to port various query optimizations from the old code base into the new AST model, at which point we will cut over from the existing Sesame TupleExpr integration to the new AST model.

Changed 22 months ago by thompsonbry

  • version changed from QUADS_QUERY_BRANCH to TERMS_REFACTOR_BRANCH

mroycsi has committed an initial version of the Java service invocation API.

I got something very basic working last night based on a simple ServiceCall interface:

public interface ServiceCall {
    /**
     * Evaluate the service call given the running query
     * @param runningQuery
     * @return Iterator over a set of solution binding sets
     */
    public IAsynchronousIterator<IBindingSet[]> call(IRunningQuery runningQuery);
}

I also added a simple ServiceOp based on the NamedSubqueryOp which inserts the results of that call into a hash like the named subquery.

The ast2bop finds all the serviceNodes and creates the ServiceOps for them like the NamedSubqueries, and when a serviceNode is found within the system, it references the hashtable with the same mechanism as the namedsubquery.

Committed Revision: r5160.

Changed 22 months ago by thompsonbry

  • status changed from reopened to accepted

Changed 22 months ago by thompsonbry

Fiddling with the SPARQL SERVICE integration to better understand it and define how it will work. I need to modify the sparql.jjt file to support federation before I can really get into this. I will probably do that by updating to openrdf 2.5-dev.

Changed 22 months ago by thompsonbry

Wrote an AST optimizer to recognize the presence of the search magic predicates in a named subquery (and to reject them if they are found in the main WHERE clause). The magic predicates are extracted, validated, and replaced by SERVICE AST node for the search. The magic predicates are provided to that search service as its graph pattern. The code for actually doing the search has been migrated to this API, but not yet tested. The code probably does not yet handle the case where the subject is bound (i.e., the search variable is actually a constant), though I have marked this issue in a few places.

Various cleanup in AST node toString(indent) methods.

Added isConstant(), isVariable(), and isFunction() to ValueExpressionNode.

Updated to revision r5167.

Changed 22 months ago by mrpersonick

Rather than create a new ticket, I will add to this ticket.

In the release 1.0 branch, I changed the BigdataSailConnection and BigdataEvaluationStrategyImpl3 to handle an incoming stream of bigdata IBindingSets in addition to its current handling of a stream of Sesame BindingSets. I did this by changing the method signatures to accept an Object rather than a CloseableIteration, and then verifying the type later.

This feature needs to be captured in the terms branch.

Changed 22 months ago by thompsonbry

Implemented and integrated an "extension" service suitable for integrating data from "external" sources.

Bug fix to DataSetSummary. It was checking for unknown graphs with iv != null but must use iv.isNull(). Before this change it would recognize a URI with a 0L TermId as "known".
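
The distinction behind that fix can be illustrated in isolation. The TermId class below is a simplified stand-in written for this example (the real bigdata class differs); the only behavior assumed from the description above is that a 0L term identifier means "not in the lexicon" and is reported by isNull().

```java
/** Hypothetical illustration only; TermId here is a simplified stand-in. */
final class NullIvSketch {

    /** Stand-in for a term identifier; 0L is the "null" (unassigned) identifier. */
    static final class TermId {
        final long id;

        TermId(long id) {
            this.id = id;
        }

        boolean isNull() {
            return id == 0L;
        }
    }

    /** The faulty test: a non-null reference is treated as a known term. */
    static boolean isKnownBuggy(TermId iv) {
        return iv != null;
    }

    /** The corrected test: the identifier itself must also be non-null. */
    static boolean isKnownFixed(TermId iv) {
        return iv != null && !iv.isNull();
    }
}
```

A reference check alone cannot detect the unassigned case, because the object exists even when the identifier it wraps is 0L.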

Bug fix to an AST builder test for GROUP BY which had the wrong expected data for the assignment node.

Bug fix to the handling of unions when generating the AST from the parse tree. A UnionNode is now always embedded in a JoinGroupNode. This can cause nesting such as JoinGroupNode( JoinGroupNode( UnionNode(...))). The unit tests of the parse tree to AST transform have been updated to expect this. Such semantically inoperative join groups are now eliminated by a new AST optimizer.

This change should also fix the exception thrown for Bigdata2ASTSPARQLSyntaxTest."syntax-union-02.rq":

com.bigdata.rdf.sparql.ast.UnionNode cannot be cast to com.bigdata.rdf.sparql.ast.JoinGroupNode

at com.bigdata.rdf.sparql.ast.UnionNode.addChild(UnionNode.java:11)
at com.bigdata.rdf.sail.sparql.GroupGraphPattern.buildGroup(GroupGraphPattern.java:180)
at com.bigdata.rdf.sail.sparql.GroupGraphPatternBuilder.visit(GroupGraphPatternBuilder.java:390)
at com.bigdata.rdf.sail.sparql.GroupGraphPatternBuilder.visit(GroupGraphPatternBuilder.java:77)

Added an ASTOptimizer to eliminate join groups which are the parent of a single join group child. This required the application of the leftOrEmpty(left) pattern in Rule2BOpUtility.
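
The rewrite itself is simple to picture on a toy tree. The sketch below is not the bigdata optimizer: Node is a hypothetical class, node kinds are plain strings, and concerns a real implementation has to respect (graph context, OPTIONAL groups, filters attached to the parent) are ignored.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration only; these names are not part of bigdata. */
final class JoinGroupFlattenSketch {

    /** A toy AST node: a kind such as "JoinGroup", "Union" or "SP", plus children. */
    static final class Node {
        final String kind;
        final List<Node> children;

        Node(String kind, Node... children) {
            this.kind = kind;
            this.children = new ArrayList<>(Arrays.asList(children));
        }

        @Override
        public String toString() {
            return children.isEmpty() ? kind : kind + children;
        }
    }

    /** Replaces any join group whose only child is a join group with that child. */
    static Node simplify(Node node) {
        // Work bottom-up so that chains of nested groups collapse fully.
        for (int i = 0; i < node.children.size(); i++) {
            node.children.set(i, simplify(node.children.get(i)));
        }
        boolean redundant = node.kind.equals("JoinGroup")
                && node.children.size() == 1
                && node.children.get(0).kind.equals("JoinGroup");
        return redundant ? node.children.get(0) : node;
    }
}
```

Applied to JoinGroup(JoinGroup(Union(...))), this yields JoinGroup(Union(...)): the union stays embedded in a join group, and only the inoperative outer group is dropped.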

StartOp is unnecessary and is no longer generated. I had to apply the leftOrEmpty(left) pattern in a number of places in both AST2BOpUtility and Rule2BOpUtility.

Added required shallow and deep constructors for all AST classes.

Replaced ServiceOp with ServiceCallJoinOp. ServiceOp was an "at-once" operator. We will use a pipelined join for a service call so we can pipeline the named subquery in which it appears with the total result set materialized onto a named solution set.

Removed the 'lex' parameter to the ServiceFactory. Use AbstractTripleStore.getLexiconRelation().getNamespace() if you need this, or more likely just AbstractTripleStore.getValueFactory() or #getLexiconRelation().

Added data driven unit tests for bd:search.

At this point, full text search queries are not backward compatible as the bd:search magic predicate must appear in a named subquery. However, I think that we can deal with this by first rewriting the search magic predicates into a SERVICE call and then lifting the SERVICE call out into a named subquery. That should provide transparent backward compatibility.

If the "external" service is not IV aware, then it should use the BigdataOpenRDFBindingSetsResolverator to efficiently resolve Sesame BindingSets to bigdata IBindingSets.

Committed revision r5174.

Changed 22 months ago by thompsonbry

Rewrote the ASTEmptyGroupOptimizer and added some unit tests. It was identifying the situations in which it should perform a rewrite but it was failing to effect the rewrite.

Modified the ASTSearchOptimizer to recognize magic search predicates within the main WHERE clause as well as within named subqueries.

Added an AST optimizer to lift out ServiceNodes which appear in contexts in which they would be evaluated more than once. Such ServiceNodes are now lifted into a named subquery and replaced by the include of the named subquery solution set.

JoinGroupNode.toString(int) was not displaying the graph context. This was fixed by testing for that class in GroupNodeBase and explicitly incorporating that information into the rendered AST string.

Cleaned up toString(int) for NamedSubqueriesNode and NamedSubqueryRoot.

Committed revision r5175.

Changed 16 months ago by thompsonbry

I have brought the SPARQL parser up to openrdf 2.6.3. This includes the SERVICE and BINDINGS clause productions. Those productions are not yet hooked into the bigdata AST. That is the next step. We also want to provide bi-directional SERVICE invocation against a within-JVM service using materialized RDF/XML values (rather than bigdata IVs).

Changed 16 months ago by thompsonbry

I have added the grammar productions to handle the BINDINGS clause, added support for the BINDINGS clause to the AST model, and written unit tests for the BINDINGS clause handling at the parser to AST translation layer. The BINDINGS clause is NOT yet being examined by the query engine. There is TCK coverage for this so we do not need to write more tests at the AST evaluation level, but we do need to modify the query plan generator. I have added a utility class to compute summary statistics for the BINDINGS clause and a test suite for that utility class. However, this too is not yet in use.

I am working through the invocation of a Sesame aware in-JVM SERVICE. We will present it with an openrdf BindingSet whose bindings are BigdataValue objects with their IVs set. Obviously, any variables projected into the SERVICE need to be materialized first so we can present the SERVICE with the materialized BigdataValue objects. This suggests that we need to compute a materialization requirement for the SERVICE call. Coming out of the service, we need to resolve the openrdf bindings to bigdata IBindingSets. My initial take on this is to use the IV on the BigdataValue IF the binding is a BigdataValue and otherwise to generate a mock IV and then set its value cache to the RDF Value (using BigdataValueFactory#asValue() to ensure that it is turned into a BigdataValue for the store).
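The resolution rule in the last two sentences can be sketched roughly as follows. This is only an illustration under stated assumptions: Value, IV and StoreValue are hypothetical stand-ins defined inside the sketch, not the real openrdf or bigdata classes, and the way the mock IV is created is invented for the example.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Illustrative stand-ins only; none of these are the real openrdf or bigdata types. */
final class BindingResolutionSketch {

    /** Hypothetical stand-in for an RDF Value. */
    interface Value {
        String stringValue();
    }

    /** Hypothetical stand-in for an internal value (IV) with a cache for the materialized Value. */
    static final class IV {
        final long id;
        final boolean mock;
        Value cache;

        IV(long id, boolean mock) {
            this.id = id;
            this.mock = mock;
        }
    }

    /** Hypothetical stand-in for a store-aware Value which may already carry its IV. */
    static final class StoreValue implements Value {
        final String label;
        final IV iv; // null when no IV has been assigned

        StoreValue(String label, IV iv) {
            this.label = label;
            this.iv = iv;
        }

        @Override
        public String stringValue() {
            return label;
        }
    }

    /** Mock IVs get negative ids here purely so they are easy to spot (an assumption). */
    private static final AtomicLong MOCK_IDS = new AtomicLong(-1);

    /** Use the IV already on the value when there is one, otherwise wrap the value in a mock IV. */
    static IV resolve(Value v) {
        if (v instanceof StoreValue && ((StoreValue) v).iv != null) {
            return ((StoreValue) v).iv;
        }
        // Rough analogue of turning a foreign Value into a store Value before caching it.
        Value storeValue = (v instanceof StoreValue) ? v : new StoreValue(v.stringValue(), null);
        IV mockIv = new IV(MOCK_IDS.getAndDecrement(), true);
        mockIv.cache = storeValue;
        return mockIv;
    }
}
```

A value coming back from a bigdata aware service would take the first branch; a value from a plain openrdf service would take the second and carry its RDF Value along in the cache.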

SERVICE expressions involving a URI constant are accepted and parsed into the AST. There is now a basic test suite to verify the invocation of within JVM services, including both services which are bigdata aware and services which just talk openrdf.

We still need to handle cases where the service reference is an expression which evaluates to a constant and the case where it is a variable not known to be bound. However, those cases are not strictly necessary for the purpose of integrating an "internal" service.

What SHOULD work at this point is:

- Bigdata "internal" SERVICEs.
- Openrdf "internal" SERVICEs.

What DOES NOT work at this point is:

- Remote SERVICE invocations (we are not integrated yet with openrdf's federation support).
- A serviceRef which is a variable.
- The BINDINGS clause (not integrated in the query plan generator).

Committed revision r6053.

Changed 16 months ago by thompsonbry

I have modified the ServiceCallJoin to resolve openrdf BindingSets against the lexicon when the SERVICE is not a bigdata native service implementation. This is done using the same logic that handles batch resolution of the binding sets for normal query evaluation.

Committed Revision r6055.

Changed 16 months ago by thompsonbry

  • status changed from accepted to closed
  • resolution set to fixed

I have vectored the internal service evaluation code for both bigdata and openrdf "services". The ServiceCallJoin now uses a JVM hash join internally to correlate the source solutions with the solutions returned from the service.
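For readers unfamiliar with the technique, a minimal in-memory hash join looks something like the sketch below. It is not the ServiceCallJoin code: plain Maps stand in for binding sets, the choice to index the source side and probe with the service side is an assumption, and the join variables are assumed to be bound on both sides.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: Map<String, String> stands in for a binding set. */
final class HashJoinSketch {

    /** Convenience builder: solution("x", "1", "y", "a") gives {x=1, y=a}. */
    static Map<String, String> solution(String... kv) {
        Map<String, String> s = new LinkedHashMap<>();
        for (int i = 0; i + 1 < kv.length; i += 2) {
            s.put(kv[i], kv[i + 1]);
        }
        return s;
    }

    /** Joins source solutions with service solutions on the given join variables. */
    static List<Map<String, String>> join(List<Map<String, String>> source,
            List<Map<String, String>> service, List<String> joinVars) {
        // Build phase: index the source solutions by their join key.
        Map<List<String>, List<Map<String, String>>> index = new HashMap<>();
        for (Map<String, String> s : source) {
            index.computeIfAbsent(key(s, joinVars), k -> new ArrayList<>()).add(s);
        }
        // Probe phase: look up each service solution and merge compatible pairs.
        List<Map<String, String>> out = new ArrayList<>();
        for (Map<String, String> r : service) {
            List<Map<String, String>> hits = index.getOrDefault(key(r, joinVars), Collections.emptyList());
            for (Map<String, String> s : hits) {
                Map<String, String> merged = merge(s, r);
                if (merged != null) {
                    out.add(merged);
                }
            }
        }
        return out;
    }

    /** The join key is the list of values bound to the join variables. */
    private static List<String> key(Map<String, String> solution, List<String> joinVars) {
        List<String> k = new ArrayList<>(joinVars.size());
        for (String v : joinVars) {
            k.add(solution.get(v));
        }
        return k;
    }

    /** Merges two solutions, returning null if any shared variable has conflicting values. */
    private static Map<String, String> merge(Map<String, String> a, Map<String, String> b) {
        Map<String, String> m = new LinkedHashMap<>(a);
        for (Map.Entry<String, String> e : b.entrySet()) {
            String previous = m.putIfAbsent(e.getKey(), e.getValue());
            if (previous != null && !previous.equals(e.getValue())) {
                return null;
            }
        }
        return m;
    }
}
```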

If the serviceRef is a variable, then the source solutions are grouped by the effective service URI and then vectored to each target service. The code parallelizes the service calls across different target services.
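A rough sketch of that grouping and fan-out, assuming plain Maps in place of IBindingSets and a caller-supplied function in place of the real service call. The class and method names are hypothetical, and skipping solutions with an unbound serviceRef is a simplification made for the example only.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.BiFunction;

/** Illustrative only: Map<String, String> stands in for a binding set. */
final class ServiceGroupingSketch {

    /** Convenience builder: solution("svc", "http://a", "x", "1"). */
    static Map<String, String> solution(String... kv) {
        Map<String, String> s = new LinkedHashMap<>();
        for (int i = 0; i + 1 < kv.length; i += 2) {
            s.put(kv[i], kv[i + 1]);
        }
        return s;
    }

    /** Groups the source solutions by the value bound to the serviceRef variable. */
    static Map<String, List<Map<String, String>>> groupByService(
            List<Map<String, String>> solutions, String serviceVar) {
        Map<String, List<Map<String, String>>> groups = new LinkedHashMap<>();
        for (Map<String, String> s : solutions) {
            String uri = s.get(serviceVar);
            if (uri == null) {
                continue; // unbound serviceRef: skipped in this sketch (an assumption)
            }
            groups.computeIfAbsent(uri, k -> new ArrayList<>()).add(s);
        }
        return groups;
    }

    /** Sends each group to its target service as one chunk, one parallel task per service. */
    static Map<String, List<Map<String, String>>> callAll(
            Map<String, List<Map<String, String>>> groups,
            BiFunction<String, List<Map<String, String>>, List<Map<String, String>>> serviceCall) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, groups.size()));
        try {
            Map<String, Future<List<Map<String, String>>>> futures = new LinkedHashMap<>();
            for (Map.Entry<String, List<Map<String, String>>> e : groups.entrySet()) {
                final String uri = e.getKey();
                final List<Map<String, String>> chunk = e.getValue();
                Callable<List<Map<String, String>>> task = () -> serviceCall.apply(uri, chunk);
                futures.put(uri, pool.submit(task));
            }
            // Collect the results once every service call has been started.
            Map<String, List<Map<String, String>>> results = new LinkedHashMap<>();
            for (Map.Entry<String, Future<List<Map<String, String>>>> f : futures.entrySet()) {
                results.put(f.getKey(), f.getValue().get());
            }
            return results;
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(ex);
        } catch (ExecutionException ex) {
            throw new IllegalStateException(ex.getCause());
        } finally {
            pool.shutdownNow();
        }
    }
}
```

All tasks are submitted before any result is awaited, so calls to different target services overlap while each service still sees its solutions as a single vectored chunk.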

I have added a TIMEOUT annotation for the ServiceCallJoin. The default timeout is MAX_LONG milliseconds. There is no means yet to specify this timeout from SPARQL. A simple syntax for this can be easily imagined [SERVICE SILENT TIMEOUT 1000 ?uri {}]. However, there are a LOT of things you might want to control (HTTP POST versus GET, SPARQL results format type, etc., etc.). Query hints would seem to be the more general purpose way to handle that stuff. In order to keep the query hints out of the SPARQL query that we send to a remote SPARQL end point, the hints would have to go OUTSIDE of the SERVICE and bind to the PRIOR "join".
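One common way to bound a call like this on the JVM is Future.get with a timeout. The sketch below shows that pattern only; it is not the ServiceCallJoin implementation. The class name, the method, and the reading of "MAX_LONG" as Long.MAX_VALUE are assumptions.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Illustrative only: bounds an arbitrary call with a timeout in milliseconds. */
final class ServiceTimeoutSketch {

    /** Assumed default: effectively "no timeout". */
    static final long DEFAULT_TIMEOUT_MILLIS = Long.MAX_VALUE;

    /** Runs the call on a worker thread and cancels it if the timeout expires first. */
    static <T> T callWithTimeout(Callable<T> serviceCall, long timeoutMillis) {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        Future<T> future = worker.submit(serviceCall);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the running call
            throw new IllegalStateException("service call timed out after " + timeoutMillis + " ms", e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            worker.shutdownNow();
        }
    }
}
```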

I have added a factory for remote services and an interface for remote service calls, and a default implementation which will evaluate a SERVICE clause against a remote SPARQL end point. The generated query is formed from the text "image" of the original SERVICE's graph pattern. This remote SPARQL end point functionality is not yet tested - I want to add support for BINDINGS in the query plan generator first so we can test this against bigdata. One known problem is that the prefix declarations from the original query are not being attached to the ServiceNode and therefore are not present in the generated query.

As the functionality is now present for 3rd party service integrations which are not "bigdata aware", I am going to close this issue and continue work under [2] (remote SERVICE evaluation) and [3] (BINDINGS clause support).

[1] https://sourceforge.net/apps/trac/bigdata/ticket/267 (Support evaluation of 3rd party operators)
[2] https://sourceforge.net/apps/trac/bigdata/ticket/449 (SPARQL 1.1 Federation)
[3] https://sourceforge.net/apps/trac/bigdata/ticket/501 (SPARQL 1.1 BINDINGS clause is ignored)

Committed Revision r6056.

Changed 15 months ago by thompsonbry

Modified the RemoteRepository to use a pattern for the connect options which can be overridden.

Integrated access to the ClientConnectionManager into the QueryEngine. The ClientConnectionManager is lazily obtained from a factory and is shutdown() when the QueryEngine is shutdown.

Refactored the RemoteServiceCallImpl to use the RemoteRepository, which is now backed by the http components package. This required a change in the ServiceRegistry API.

The RemoteRepository now uses more sophisticated defaults for the Accept header. There is a new class to support this (AcceptHeaderFactory).

Committed revision r6195.

Changed 15 months ago by thompsonbry

Changed how we handle the configuration of the ClientConnectionManager and removed all of the RemoteRepository constructors except the one in which the caller specifies both the ClientConnectionManager and the Executor. These resources MUST be explicitly managed. Bigdata does that internally by hanging them off of the QueryEngine and the IIndexManager respectively. Remote applications need to provide their own management for those resources.

Changed the ServiceFactory#create() method, replacing the parameters with an interface to make future versioning of this information painless. Added access to the ClientConnectionManager and the IServiceOptions so these are now both available to the ServiceCall (assuming that the ServiceFactory implementation passes them through).

Committed revision r6208.
