From: Wolfgang M. <wol...@ex...> - 2012-05-09 12:40:55
Hi Ron,

> Am I understanding correctly that this is the desired behaviour for the
> 'full' query variant:
>
>   collection("/db/test")/*/conversion[to-currency = $from-code][year = $year]/rate
>
> ...but this optimization (unintendedly) fails in the 'variable' variant:
>
>   $activity-transform:conversion-rates[to-currency = $from-code][year = $year]/rate
>
> ?

Yes. I tried to improve the query optimizer to automatically inline expressions in cases like this, but it is difficult.

> I'm intrigued, since in my apps I tend to 'abstract' common paths in complex
> queries into 're-usable' variables, e.g.:

This can be more efficient in some cases, e.g. if you are in a loop and have to transform query results for output. But you want to avoid it where the node sets can be large. When optimizing queries, the primary goal has to be: reduce your node sets as early as possible!

Let's assume a data set of 100,000 documents in /db/data containing 1 million <address> elements. The user wants to find some <address> where <name> = "Doe", and there are only two matching addresses in the db. You could formulate the query using variables:

  let $data := collection("/db/data")
  let $addresses := $data/address
  return $addresses[name = "Doe"]

If we look at the size of the generated node sets, we have 100,000 document references in $data and 1 million references to address elements in $addresses. Finally, $addresses[name = "Doe"] picks 2 address elements out of the 1 million.

Now compare this to the simple XPath:

  collection("/db/data")/address[name = "Doe"]

eXist's optimizer will skip collection("/db/data")/address entirely and start by looking up name = "Doe". This requires a single index access. eXist then works bottom-up to find the 2 address elements containing that name. If you add up the processed node sets, the query using let holds 1,100,002 node references in memory, whereas the second query can be evaluated by passing around just 2 node references! The performance win is often HUGE!
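To make the contrast concrete, here is a sketch of how the variable-based version can be restructured so the optimizer still sees a complete path with its predicate. The collection path and element names follow the example above; the surrounding FLWOR body is invented for illustration:

```xquery
(: Avoid binding the broad intermediate set:              :)
(:   let $addresses := collection("/db/data")/address     :)
(: Instead, keep context and filter together so the       :)
(: optimizer can resolve the predicate via the index:     :)
let $matches := collection("/db/data")/address[name = "Doe"]
return
    (: $matches now holds only the 2 relevant nodes; any
       further processing operates on a tiny node set :)
    for $addr in $matches
    return $addr/name
```

The key design point: the variable binding happens after the selective predicate, not before it, so at no point does the query materialize the 1 million intermediate references.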
Does this mean you need to rewrite every expression? No. Once I have limited my initial data set to 2 address elements, it doesn't matter how complex my code for processing the query result is. What costs performance is the primary selection, which reduces the initial node set to those nodes the user actually wants. This selection should be done as early as possible. For the optimizer it is important that it sees a complete XPath expression with a defined context and one or more filters, e.g. a[b = "c"], a[. = "b"], a[b/c = "d"]. $a[b = "c"] should be optimized as well, but it will always be less efficient.

> This looks more efficient than using the full paths for each $field
> expression, but actually the opposite is true?

For the initial selection, yes. Once you know you're operating on smaller sets, it does make sense to keep repeatedly used things in variables.

Wolfgang
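The closing point can be sketched as follows: do the selective, index-backed filter first, then freely factor repeatedly used sub-paths of the small result into variables. This is only an illustrative sketch; the output element names (<result>, <who>, <greeting>) and the reuse of $name are made up, while the collection path and predicate follow the example above:

```xquery
(: primary selection first: index-driven, yields only 2 nodes :)
let $hits := collection("/db/data")/address[name = "Doe"]
for $addr in $hits
(: cheap now: $hits is tiny, so caching sub-paths costs nothing :)
let $name := $addr/name
return
    <result>
        <who>{ string($name) }</who>
        <greeting>{ concat("Hello, ", string($name)) }</greeting>
    </result>
```

Binding $name here is exactly the "re-usable variable" style from the question, and it is harmless because it happens after the primary selection has already shrunk the node set.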