Thread: [Exist-development] optimizing positional predicates for fast paging

eXist-db is a feature rich Open Source native XML database

Brought to you by: deliriumsky, dizzzz, windauer, wolfgang_m

exist-development

[Exist-development] optimizing positional predicates for fast paging

From: Michael S. <so...@if...> - 2009-12-13 20:33:57

I wonder whether eXist is currently optimizing xquery evaluations that use
positional predicates, or similar.  It doesn't seem like it, based on some
poking around in the code, and I think this could be a very fruitful area.
What I mean is an expression like:

collection("/")/ (//elem) [position() = (1 to 10)]

- when there are a lot of //elem, evaluation can be short-cut by ignoring
all but the first 10: ie the first "page" of results.  Currently it
*appears* to me as if the entire (//elem) node sequence is being
constructed, and then filtered.

In web applications, a great many queries are limited in exactly this way:
[position() = ($n to $n + $page-size)], so it seems like an area that merits
attention.  And the query-writer doesn't have enough expressive power to
push the positional predicate into the innermost expression, especially if
they are passing paging parameters via a wrapping API layer like REST.  

There is also a problem when coding FLWOR expressions (which you have to do
in order to sort the results), where the only option for paging is to wrap
the entire FLWOR in a subsequence function call (or postfix a predicate like
above).  Some kind of internal expression rewriting that pushes the paging
predicate into the FLWOR evaluation could yield a huge performance
improvement in such cases.  Even when results are sorted, one might see a
performance gain by using a value index scan.

I started to look into implementing this myself, but I wanted to see whether
any work had been done already, or if anyone had any other thoughts about
whether this is worthwhile?

-Mike

Re: [Exist-development] optimizing positional predicates for fast paging

From: Wolfgang M. <wol...@ex...> - 2009-12-14 08:37:12

> I wonder whether eXist is currently optimizing xquery evaluations that use
> positional predicates, or similar.

I did implement shortcuts for the expensive axes: following and
preceding. For example, LocationStep.getFollowing checks the property
hasPositionalPredicate. If set to true, the method evaluates the
predicate in advance to get its numeric value. This is then forwarded
to the NodeSet method which calculates the join:

return currentSet.selectFollowing(contextSet, position, contextId);

We could apply similar optimizations to other axes, though their
execution paths are more complex as you have more possibilities to
choose from.

However, my positional predicate optimization would only work for
//elem[1], not //elem[position() = 1]. But I guess it would be
possible to add a rewrite rule here to automatically simplify the
expression.

> collection("/")/ (//elem) [position() = (1 to 10)]

Note that using subsequence instead of the predicate can be a lot
faster sometimes.

Wolfgang

Re: [Exist-development] optimizing positional predicates for fast paging

From: Michael S. <so...@if...> - 2009-12-15 13:01:08

Thanks for your answer, Wolfgang - that's just the sort of thing I was
wondering about.

I walked through the logic in selectFollowing: I can see it would be
possible, but complicated to implement something similar for other axes and
for a somewhat more complex predicate (compare position against a constant
sequence).  Maybe that's worth doing - I'm not sure though.

Frankly the main cases I am interested in are not really axis selections,
but ordered sequences of nodes from different documents where the ordering
is determined either by a value (that may be indexed) or by the lucene
score.  These correspond to searches in our applications where the results
are ordered alphabetically by title, or by relevance.  So I asked the other
question mostly just to try and understand better what kind of optimization
strategies you are using.

I'll keep looking and let you know if I have anything helpful to offer!

I'm also curious about the answer to Joe's question re: the difference
between subsequence and predicate: if you have a moment could you elucidate?

-Mike

Also: my e-mails to this list seem to need moderation.  If you'd rather not
be bothered about that, I could move this discussion over to the exist-open
list?

> -----Original Message-----
> From: Wolfgang Meier [mailto:wol...@ex...] 
> Sent: Monday, December 14, 2009 3:37 AM
> To: Michael Sokolov
> Cc: exi...@li...
> Subject: Re: [Exist-development] optimizing positional 
> predicates for fast paging
> 
> > I wonder whether eXist is currently optimizing xquery 
> evaluations that 
> > use positional predicates, or similar.
> 
> I did implement shortcuts for the expensive axes: following 
> and preceding. For example, LocationStep.getFollowing checks 
> the property hasPositionalPredicate. If set to true, the 
> method evaluates the predicate in advance to get its numeric 
> value. This is then forwarded to the NodeSet method which 
> calculates the join:
> 
> return currentSet.selectFollowing(contextSet, position, contextId);
> 
> We could apply similar optimizations to other axes, though 
> their execution paths are more complex as you have more 
> possibilities to choose from.
> 
> However, my positional predicate optimization would only work 
> for //elem[1], not //elem[position() = 1]. But I guess it 
> would be possible to add a rewrite rule here to automatically 
> simplify the expression.
> 
> > collection("/")/ (//elem) [position() = (1 to 10)]
> 
> Note that using subsequence instead of the predicate can be a 
> lot faster sometimes.
> 
> Wolfgang
>

Re: [Exist-development] optimizing positional predicates for fast paging

From: Joe W. <jo...@gm...> - 2009-12-14 12:46:10

Mike and Wolfgang,

A great topic!  I'd certainly welcome optimizations like this.  I
wonder:  If a FLWOR expression has an 'order by' clause, is it even
possible to optimize the expression's positional predicate (or
subsequence)?  Or does 'order by' make optimizations impossible?

>> collection("/")/ (//elem) [position() = (1 to 10)]
>
> Note that using subsequence instead of the predicate can be a lot
> faster sometimes.

I didn't realize that.  Can you generalize about when subsequence
would be faster?  And would you say it's always at least as fast as
the predicate?

Joe

Re: [Exist-development] optimizing positional predicates for fast paging

From: Wolfgang M. <wol...@ex...> - 2009-12-15 22:00:18

> A great topic!  I'd certainly welcome optimizations like this.  I
> wonder:  If a FLWOR expression has an 'order by' clause, is it even
> possible to optimize the expression's positional predicate (or
> subsequence)?  Or does 'order by' make optimizations impossible?

The "order by" has to be processed to determine the position of each
item in the result sequence. It would be an option to use the range
index as Mike suggested to speed up the sorting.

>> Note that using subsequence instead of the predicate can be a lot
>> faster sometimes.
>
> I didn't realize that.  Can you generalize about when subsequence
> would be faster?  And would you say it's always at least as fast as
> the predicate?

Mike used position() within the predicate: [position() = (1 to 10)].
This will indeed be slower than a corresponding subsequence($input, 1,
10). Reason: the predicate has to be evaluated once for every item in
the context sequence. This doesn't make a big difference if you have a
simple expression like $a[position() = (1 to 10)], but it could result
in a substantial performance loss for more complex expressions
involving a path expression like a//*/b[position() = (1 to 10)].

To be on the safe side, I would recommend subsequence() if you need to
select *more than one* item.

Wolfgang

Re: [Exist-development] optimizing positional predicates for fast paging

From: Joe W. <jo...@gm...> - 2011-01-31 12:29:17

Wolfgang,

Gary's e-mail had me searching the archives, and I came across this
interesting point you raised:

On Tue, Dec 15, 2009 at 5:00 PM, Wolfgang Meier <wol...@ex...> wrote:
>>> Note that using subsequence instead of the predicate can be a lot
>>> faster sometimes.
>>
>> I didn't realize that.  Can you generalize about when subsequence
>> would be faster?  And would you say it's always at least as fast as
>> the predicate?
>
> Mike used position() within the predicate: [position() = (1 to 10)].
> This will indeed be slower than a corresponding subsequence($input, 1,
> 10). Reason: the predicate has to be evaluated once for every item in
> the context sequence. This doesn't make a big difference if you have a
> simple expression like $a[position() = (1 to 10)], but it could result
> in a substantial performance loss for more complex expressions
> involving a path expression like a//*/b[position() = (1 to 10)].
>
> To be on the safe side, I would recommend subsequence() if you need to
> select *more than one* item.

To confirm, is this still the case?  Or is the query optimizer now
taking care of this case?

If it is still the case, it seems to me this would really be worth
adding to the Performance Tuning article - do you agree?

Thanks,
Joe

Re: [Exist-development] optimizing positional predicates for fast paging

From: Wolfgang M. <wol...@ex...> - 2011-02-01 09:01:43

Hi Joe,

No, the optimizer does not handle this (yet). We should indeed add it to the
tuning guide.

Wolfgang

Am 31.01.2011 13:29 schrieb "Joe Wicentowski" <jo...@gm...>:
> Wolfgang,
>
> Gary's e-mail had me searching the archives, and I came across this
> interesting point you raised:
>
> On Tue, Dec 15, 2009 at 5:00 PM, Wolfgang Meier <wol...@ex...>
wrote:
>>>> Note that using subsequence instead of the predicate can be a lot
>>>> faster sometimes.
>>>
>>> I didn't realize that.  Can you generalize about when subsequence
>>> would be faster?  And would you say it's always at least as fast as
>>> the predicate?
>>
>> Mike used position() within the predicate: [position() = (1 to 10)].
>> This will indeed be slower than a corresponding subsequence($input, 1,
>> 10). Reason: the predicate has to be evaluated once for every item in
>> the context sequence. This doesn't make a big difference if you have a
>> simple expression like $a[position() = (1 to 10)], but it could result
>> in a substantial performance loss for more complex expressions
>> involving a path expression like a//*/b[position() = (1 to 10)].
>>
>> To be on the safe side, I would recommend subsequence() if you need to
>> select *more than one* item.
>
> To confirm, is this still the case? Or is the query optimizer now
> taking care of this case?
>
> If it is still the case, it seems to me this would really be worth
> adding to the Performance Tuning article - do you agree?
>
> Thanks,
> Joe