From: Wolfgang M. <wol...@ex...> - 2012-05-09 12:40:55
Hi Ron,

> Am I understanding correctly that this is the desired behaviour for the
> 'full' query variant:
>
>   collection("/db/test")/*/conversion[to-currency = $from-code][year = $year]/rate
>
> ...but this optimization (unintendedly) fails in the 'variable' variant:
>
>   $activity-transform:conversion-rates[to-currency = $from-code][year = $year]/rate
>
> ?

Yes. I tried to improve the query optimizer to automatically inline expressions in cases like this, but it is difficult.

> I'm intrigued, since in my apps I tend to 'abstract' common paths in complex
> queries into 're-usable' variables, e.g.:

This can be more efficient in some cases, e.g. if you are in a loop and have to transform query results for output. But you want to avoid it where the node sets can be large. When optimizing queries, the primary goal has to be: reduce your node sets as early as possible!

Let's assume a data set of 100,000 documents in /db/data containing 1 million <address> elements. The user wants to find some <address> where <name> = "Doe", and there are only two matching addresses in the db. You could formulate the query using variables:

  let $data := collection("/db/data")
  let $addresses := $data/address
  return $addresses[name = "Doe"]

If we look at the size of the generated node sets, we have 100,000 document references in $data and 1 million references to address elements in $addresses. Finally, $addresses[name = "Doe"] picks 2 address elements out of the 1 million.

Now compare this to the simple XPath:

  collection("/db/data")/address[name = "Doe"]

eXist's optimizer will skip collection("/db/data")/address entirely and start by looking up name = "Doe". This requires a single index access. eXist then works bottom-up to find the 2 address elements containing that name. If you add up the processed node sets, the query using let holds 1,100,002 node references in memory, whereas the second query can be evaluated by passing around just 2 node references! The performance win is often HUGE!
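To make the contrast concrete, here is a sketch of how the variable-based version can be restructured so the optimizer still sees a complete path with its predicate. The collection path and element names follow the example above; the surrounding FLWOR body is invented for illustration:

```xquery
(: Avoid binding the broad intermediate set:              :)
(:   let $addresses := collection("/db/data")/address     :)
(: Instead, keep context and filter together so the       :)
(: optimizer can resolve the predicate via the index:     :)
let $matches := collection("/db/data")/address[name = "Doe"]
return
    (: $matches now holds only the 2 relevant nodes; any
       further processing operates on a tiny node set :)
    for $addr in $matches
    return $addr/name
```

The key design point: the variable binding happens after the selective predicate, not before it, so at no point does the query materialize the 1 million intermediate references.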
Does this mean you need to rewrite every expression? No. Once I have limited my initial data set to 2 address elements, it doesn't matter how complex my code for processing the query result is. What costs performance is the primary selection, which reduces the initial node set to those nodes the user actually wants. This selection should be done as early as possible. For the optimizer it is important that it sees a complete XPath expression with a defined context and one or more filters, e.g. a[b = "c"], a[. = "b"], a[b/c = "d"]. $a[b = "c"] should be optimized as well, but it will always be less efficient.

> This looks more efficient than using the full paths for each $field
> expression, but actually the opposite is true?

For the initial selection, yes. Once you know you're operating on smaller sets, it does make sense to keep repeatedly used things in variables.

Wolfgang
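The closing point can be sketched as follows: do the selective, index-backed filter first, then freely factor repeatedly used sub-paths of the small result into variables. This is only an illustrative sketch; the output element names (<result>, <who>, <greeting>) and the reuse of $name are made up, while the collection path and predicate follow the example above:

```xquery
(: primary selection first: index-driven, yields only 2 nodes :)
let $hits := collection("/db/data")/address[name = "Doe"]
for $addr in $hits
(: cheap now: $hits is tiny, so caching sub-paths costs nothing :)
let $name := $addr/name
return
    <result>
        <who>{ string($name) }</who>
        <greeting>{ concat("Hello, ", string($name)) }</greeting>
    </result>
```

Binding $name here is exactly the "re-usable variable" style from the question, and it is harmless because it happens after the primary selection has already shrunk the node set.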