From: Michael S. <so...@if...> - 2009-12-13 20:33:57
|
I wonder whether eXist is currently optimizing xquery evaluations that use positional predicates, or similar. It doesn't seem like it, based on some poking around in the code, and I think this could be a very fruitful area. What I mean is an expression like: collection("/")/ (//elem) [position() = (1 to 10)] - when there are a lot of //elem, evaluation can be short-cut by ignoring all but the first 10: ie the first "page" of results. Currently it *appears* to me as if the entire (//elem) node sequence is being constructed, and then filtered. In web applications, a great many queries are limited in exactly this way: [position() = ($n to $n + $page-size)], so it seems like an area that merits attention. And the query-writer doesn't have enough expressive power to push the positional predicate into the innermost expression, especially if they are passing paging parameters via a wrapping API layer like REST. There is also a problem when coding FLWOR expressions (which you have to do in order to sort the results), where the only option for paging is to wrap the entire FLWOR in a subsequence function call (or postfix a predicate like above). Some kind of internal expression rewriting that pushes the paging predicate into the FLWOR evaluation could yield a huge performance improvement in such cases. Even when results are sorted, one might see a performance gain by using a value index scan. I started to look into implementing this myself, but I wanted to see whether any work had been done already, or if anyone had any other thoughts about whether this is worthwhile? -Mike |
From: Wolfgang M. <wol...@ex...> - 2009-12-14 08:37:12
|
> I wonder whether eXist is currently optimizing xquery evaluations that use > positional predicates, or similar. I did implement shortcuts for the expensive axes: following and preceding. For example, LocationStep.getFollowing checks the property hasPositionalPredicate. If set to true, the method evaluates the predicate in advance to get its numeric value. This is then forwarded to the NodeSet method which calculates the join: return currentSet.selectFollowing(contextSet, position, contextId); We could apply similar optimizations to other axes, though their execution paths are more complex as you have more possibilities to choose from. However, my positional predicate optimization would only work for //elem[1], not //elem[position() = 1]. But I guess it would be possible to add a rewrite rule here to automatically simplify the expression. > collection("/")/ (//elem) [position() = (1 to 10)] Note that using subsequence instead of the predicate can be a lot faster sometimes. Wolfgang |
From: Michael S. <so...@if...> - 2009-12-15 13:01:08
|
Thanks for your answer, Wolfgang - that's just the sort of thing I was wondering about. I walked through the logic in selectFollowing: I can see it would be possible, but complicated to implement something similar for other axes and for a somewhat more complex predicate (compare position against a constant sequence). Maybe that's worth doing - I'm not sure though. Frankly the main cases I am interested in are not really axis selections, but ordered sequences of nodes from different documents where the ordering is determined either by a value (that may be indexed) or by the lucene score. These correspond to searches in our applications where the results are ordered alphabetically by title, or by relevance. So I asked the other question mostly just to try and understand better what kind of optimization strategies you are using. I'll keep looking and let you know if I have anything helpful to offer! I'm also curious about the answer to Joe's question re: the difference between subsequence and predicate: if you have a moment could you elucidate? -Mike Also: my e-mails to this list seem to need moderation. If you'd rather not be bothered about that, I could move this discussion over to the exist-open list? > -----Original Message----- > From: Wolfgang Meier [mailto:wol...@ex...] > Sent: Monday, December 14, 2009 3:37 AM > To: Michael Sokolov > Cc: exi...@li... > Subject: Re: [Exist-development] optimizing positional > predicates for fast paging > > > I wonder whether eXist is currently optimizing xquery > evaluations that > > use positional predicates, or similar. > > I did implement shortcuts for the expensive axes: following > and preceding. For example, LocationStep.getFollowing checks > the property hasPositionalPredicate. If set to true, the > method evaluates the predicate in advance to get its numeric > value. This is then forwarded to the NodeSet method which > calculates the join: > > return currentSet.selectFollowing(contextSet, position, contextId); > > We could apply similar optimizations to other axes, though > their execution paths are more complex as you have more > possibilities to choose from. > > However, my positional predicate optimization would only work > for //elem[1], not //elem[position() = 1]. But I guess it > would be possible to add a rewrite rule here to automatically > simplify the expression. > > > collection("/")/ (//elem) [position() = (1 to 10)] > > Note that using subsequence instead of the predicate can be a > lot faster sometimes. > > Wolfgang > |
From: Joe W. <jo...@gm...> - 2009-12-14 12:46:10
|
Mike and Wolfgang, A great topic! I'd certainly welcome optimizations like this. I wonder: If a FLWOR expression has an 'order by' clause, is it even possible to optimize the expression's positional predicate (or subsequence)? Or does 'order by' make optimizations impossible? >> collection("/")/ (//elem) [position() = (1 to 10)] > > Note that using subsequence instead of the predicate can be a lot > faster sometimes. I didn't realize that. Can you generalize about when subsequence would be faster? And would you say it's always at least as fast as the predicate? Joe |
From: Wolfgang M. <wol...@ex...> - 2009-12-15 22:00:18
|
> A great topic! I'd certainly welcome optimizations like this. I > wonder: If a FLWOR expression has an 'order by' clause, is it even > possible to optimize the expression's positional predicate (or > subsequence)? Or does 'order by' make optimizations impossible? The "order by" has to be processed to determine the position of each item in the result sequence. It would be an option to use the range index as Mike suggested to speed up the sorting. >> Note that using subsequence instead of the predicate can be a lot >> faster sometimes. > > I didn't realize that. Can you generalize about when subsequence > would be faster? And would you say it's always at least as fast as > the predicate? Mike used position() within the predicate: [position() = (1 to 10)]. This will indeed be slower than a corresponding subsequence($input, 1, 10). Reason: the predicate has to be evaluated once for every item in the context sequence. This doesn't make a big difference if you have a simple expression like $a[position() = (1 to 10)], but it could result in a substantial performance loss for more complex expressions involving a path expression like a//*/b[position() = (1 to 10)]. To be on the safe side, I would recommend subsequence() if you need to select *more than one* item. Wolfgang |
From: Joe W. <jo...@gm...> - 2011-01-31 12:29:17
|
Wolfgang, Gary's e-mail had me searching the archives, and I came across this interesting point you raised: On Tue, Dec 15, 2009 at 5:00 PM, Wolfgang Meier <wol...@ex...> wrote: >>> Note that using subsequence instead of the predicate can be a lot >>> faster sometimes. >> >> I didn't realize that. Can you generalize about when subsequence >> would be faster? And would you say it's always at least as fast as >> the predicate? > > Mike used position() within the predicate: [position() = (1 to 10)]. > This will indeed be slower than a corresponding subsequence($input, 1, > 10). Reason: the predicate has to be evaluated once for every item in > the context sequence. This doesn't make a big difference if you have a > simple expression like $a[position() = (1 to 10)], but it could result > in a substantial performance loss for more complex expressions > involving a path expression like a//*/b[position() = (1 to 10)]. > > To be on the safe side, I would recommend subsequence() if you need to > select *more than one* item. To confirm, is this still the case? Or is the query optimizer now taking care of this case? If it is still the case, it seems to me this would really be worth adding to the Performance Tuning article - do you agree? Thanks, Joe |
From: Wolfgang M. <wol...@ex...> - 2011-02-01 09:01:43
|
Hi Joe, No, the optimizer does not handle this (yet). We should indeed add it to the tuning guide. Wolfgang Am 31.01.2011 13:29 schrieb "Joe Wicentowski" <jo...@gm...>: > Wolfgang, > > Gary's e-mail had me searching the archives, and I came across this > interesting point you raised: > > On Tue, Dec 15, 2009 at 5:00 PM, Wolfgang Meier <wol...@ex...> wrote: >>>> Note that using subsequence instead of the predicate can be a lot >>>> faster sometimes. >>> >>> I didn't realize that. Can you generalize about when subsequence >>> would be faster? And would you say it's always at least as fast as >>> the predicate? >> >> Mike used position() within the predicate: [position() = (1 to 10)]. >> This will indeed be slower than a corresponding subsequence($input, 1, >> 10). Reason: the predicate has to be evaluated once for every item in >> the context sequence. This doesn't make a big difference if you have a >> simple expression like $a[position() = (1 to 10)], but it could result >> in a substantial performance loss for more complex expressions >> involving a path expression like a//*/b[position() = (1 to 10)]. >> >> To be on the safe side, I would recommend subsequence() if you need to >> select *more than one* item. > > To confirm, is this still the case? Or is the query optimizer now > taking care of this case? > > If it is still the case, it seems to me this would really be worth > adding to the Performance Tuning article - do you agree? > > Thanks, > Joe |