From: Stefan M. <ste...@un...> - 2011-05-01 12:28:39
Hi Stuart,

I just pulled your project from git, but to be honest I found it rather hard to get everything in place to get it running. After changing everything so that I can work with less RAM (I'm on 32 bit on my home machine), I finally got it up and running, but the experiments have issues: here I see it starting the eXist instance again and again. As I'm a little limited in time right now, I stopped pursuing this, as I assume it is due to some assumptions about the system in the makefile. I therefore tried to focus on the things I could comment on without actually testing it myself. If you could provide a makefile that takes just the URI and password of an eXist-db instance, without all the setup stuff, I'd be happy to test it, though.

That said, I have assembled a few suggestions (just suggestions) based on my experience. I do not claim that they will solve all your problems.

> xhtml with XSLT in the browser for display. I'm using 64 bit Java on a
> 64 bit Ubuntu and giving Java 6GB of RAM (of 8GB physical RAM
> available).

That is huge. We work with below 2GB and sometimes dozens of concurrent users and rather complex queries (filtering on structural and indexed criteria, intersection of result sets, etc.), and that is sufficient. I don't know whether you are hitting the memory limit, but I doubt it. One difference may be that we put our results in a session variable on the server and serialize only the portion that fits in the browser; from there you can navigate from result page to result page. Some queries would yield a million hits, which would obviously take considerable time and memory to serialize. In our experience, this is an area where significant performance gains can be made.

> I have several GB of XML I'd like to put into He Kupu Tawhito, but I'm
> having difficulty scaling above a couple of dozen MB of XML. Others in
> the TEI community have reported similar issues, informally.
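To illustrate the session-variable approach, here is a rough sketch. The function names are from eXist's session and request extension modules; the query itself and the parameter names are hypothetical, just to show the pattern:

```xquery
(: Run the expensive query once, cache the full result in the HTTP
   session, and serialize only one page of hits per request.
   "page" and "kupu" are hypothetical request parameters. :)
let $page := xs:integer(request:get-parameter("page", "1"))
let $page-size := 15
let $hits :=
    let $cached := session:get-attribute("hits")
    return
        if (exists($cached)) then
            $cached
        else
            let $result := //w[@lemma eq request:get-parameter("kupu", ())]
            return (session:set-attribute("hits", $result), $result)
return
    (: only this small slice ever gets serialized to the browser :)
    subsequence($hits, ($page - 1) * $page-size + 1, $page-size)
```

The point is that the million-hit result set stays on the server; each request only pays the serialization cost of one page.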
I would assume that you should certainly not be limited to a couple of dozen MB. Something has to be wrong, then.

> with Jens Østergaard Petersen convinced me that it may be worth having
> another go at scaling this, and thus this message.

I remember your harsh statement about the "failure to scale up" on TEI-L (without any hedging indicating that your queries might have a share in this). Hmm. Let's see; I guess it is still possible to improve them.

> The best example I can give is this graph which shows query time vs
> the size of the contents, both with and without collection.xconf.
>
> https://spreadsheets.google.com/ccc?key=0AtkIjlDqC2H4dFBQM0V4SjVTdTdlQnFfWWpDaks5d0E&hl=en_GB&authkey=CJDPuY8F

As far as I can see from your makefile, this measures the overall time of the request to the query. While in practice this is what actually matters, it is interesting to ask where the time goes. As I was not able to quickly test it with the makefile, I assume from the code that the resulting page is of moderate size (it only fetches the first 15 whatever-they-might-be, right?). If the time is spent selecting the elements from the database, that is probably an indicator that indexes are either not used, or that the query optimizer was not able to optimize your query.

> The 50 chapter mark represents 5.9 MB of TEI/XML in a single file.

That's small. The performance you are seeing is exceptionally bad.

> (1) In my makefile I'm loading my files using:
>
> $(EXIST_HOME)/bin/client.sh
> uri=xmldb:exist://localhost:8081/exist/xmlrpc -m
> /db/system/config/db/he_kupu_tawhito/ -p collection.xconf --no-gui
> $(EXIST_HOME)/bin/client.sh
> uri=xmldb:exist://localhost:8081/exist/xmlrpc -m /db/he_kupu_tawhito/
> -p korero/www.biblegateway.com-sampler/import.words.xml --no-gui
>
> Will that result in that collection.xconf applying to import.words.xml ?

The procedure looks fine to me.
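If you want to separate node-selection time from serialization and HTTP overhead, a crude in-query measurement can help. This is only a sketch: util:system-time() is an eXist extension function, the query shown is hypothetical, and count() is used here as a cheap way to force evaluation:

```xquery
(: Crude in-query timing: how long does selecting the nodes take,
   independent of serializing them to the client? :)
let $start   := util:system-time()
let $hits    := //w[@lemma eq "aroha"]        (: hypothetical query :)
let $count   := count($hits)                  (: force evaluation :)
let $elapsed := util:system-time() - $start   (: xs:dayTimeDuration :)
return
    <timing hits="{$count}"
            selection-seconds="{seconds-from-duration($elapsed)}"/>
```

If the selection itself is fast and the overall request is still slow, the time is going into serialization or transport rather than the index lookup.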
> (2) My collection.xconf and main xQuery are at:
>
> https://github.com/stuartyeates/He-Kupu-Tawhito/blob/master/collection.xconf

I think you could use qname indexes instead of path indexes for all the indexes you defined. The query optimizer will do a much better job then.

> https://github.com/stuartyeates/He-Kupu-Tawhito/blob/master/kupu.xql

It could be worthwhile using a full-text index for some cases (e.g. multiple values in @corresp). An appropriate full-text index with the whitespace analyzer (I don't know what the standard analyzer would do with the "#"), together with ft:contains, could help, especially for the cases where @corresp holds several values. But from your query I would assume that a qname or ngram index should be sufficient.

> <create qname="lemma" type="xs:string"/>

You don't use this anywhere, right?

> <create qname="@xml:id" type="xs:string"/>
> <create qname="@xml:lang" type="xs:string"/>
> <create qname="@lemma" type="xs:string"/>

You should really make these qname (not path) indexes; I changed the quoted lines accordingly. An index for @corresp, which you make use of in your query, is missing. It could be useful to define a qname index and an ngram or full-text index there, depending on the kind of queries you want to perform.

> (3) I've tinkered a reasonable amount with my xquery, but I won't
> profess to being an expert:
>
> https://github.com/stuartyeates/He-Kupu-Tawhito/blob/master/kupu.xql
>
> Is there anything obvious I'm doing wrong?

I found that it sometimes helps to break multiple predicates into several let expressions:

> let $words := $this//w[@lemma=$kupu][@xml:lang=$reo]/@xml:id

let $words := $this//w[@lemma eq $kupu]
let $words := $words[@xml:lang eq $reo]

Try to apply the most restrictive predicate first. I also think it is a good habit to use eq when you do not intend to deal with sequences ($kupu and $reo are just string values, right?). That shouldn't affect performance here, though.
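Put together, the index configuration I have in mind would look roughly like this. This is a sketch against your collection.xconf, not a drop-in replacement; in particular, the TEI namespace binding and the choice of an ngram index on @corresp are my assumptions:

```xml
<collection xmlns="http://exist-db.org/collection-config/1.0">
    <index xmlns:tei="http://www.tei-c.org/ns/1.0">
        <!-- qname (not path) range indexes: the query optimizer
             handles these much better -->
        <create qname="@xml:id"   type="xs:string"/>
        <create qname="@xml:lang" type="xs:string"/>
        <create qname="@lemma"    type="xs:string"/>
        <!-- the missing index on @corresp; the additional ngram
             index helps substring-style lookups on it -->
        <create qname="@corresp"  type="xs:string"/>
        <ngram qname="@corresp"/>
    </index>
</collection>
```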
> let $thisid := $this/@xml:id
> let $thishash := concat('#', $thisid)

You don't use these anywhere later. Why define them, then?

> let $others :=
> //p[contains($this/@corresp,@xml:id)][(concat('#',@xml:id)=$this/@corresp)
> or (concat('#',$this/@xml:id)=@corresp)] |
> //p[contains(@corresp,$this/@xml:id)][(concat('#',@xml:id)=$this/@corresp)
> or (concat('#',$this/@xml:id)=@corresp)]

I would suspect that this expression is causing much of the trouble! You can split it up like the one above, and then:

- contains($this/@corresp,@xml:id): this does not use indexes! Try something like "@xml:id = tokenize($this/@corresp, ' ')" (note that I do not use "eq" here). Or do you use fn:contains because of a prepended "#"? Then maybe something along the lines of "@xml:id = (for $i in tokenize($this/@corresp, ' ') return substring-after($i, '#'))". I assume that you do this for 15 $this and do not have a huge number of @corresp values. This way you can make use of the index on @xml:id.

- concat('#',@xml:id)=$this/@corresp: this does not use indexes either. Even with an index defined on @xml:id, it would not be used because of the concat. The way you have written it, I would suspect that eXist has, in the worst case, to do a concat on every tei:p element you have (linear scaling, then). Try "@xml:id eq substring-after($this/@corresp,'#')" to use the index on @xml:id instead.

- concat('#',$this/@xml:id)=@corresp: similar here, but this time it is the missing index on @corresp.

The second part of the union has the same issues: the second predicate is exactly the same, and for the first, just swap @xml:id and @corresp in my first explanation.

> Is using an index to pull a <p/> out of a 500 MB single file slower
> than pulling the same <p/> out of a 50 KB file sitting in a collection
> of 10K files?

I would not expect the big file to be slower.

> (5) TEI uses the standard xml:id and xml:lang tags and I make
> extensive use of both of these. I run xmllint over my input files to
> check for duplicate xml:ids.
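Putting those pieces together, an index-friendly version of the whole $others lookup might look like this. A sketch only, untested against your data; it assumes @corresp holds space-separated "#"-prefixed references, as your concat calls suggest:

```xquery
(: Index-friendly rewrite of the $others lookup. Assumes @corresp
   holds space-separated references of the form "#some-id". :)
let $this-refs :=
    for $ref in tokenize($this/@corresp, '\s+')
    return substring-after($ref, '#')
let $others :=
    (: p elements that $this points at; the "=" against a sequence
       of ids can use the range index on @xml:id :)
    //p[@xml:id = $this-refs]
    |
    (: p elements that point back at $this; exact match only, so it
       can use an index on @corresp :)
    //p[@corresp = concat('#', $this/@xml:id)]
return $others
```

Note the limitation of the second branch: a plain equality only matches p elements whose @corresp holds exactly one reference. If @corresp can be multi-valued there too, that is where an ngram or full-text index on @corresp (with ngram:contains or ft:contains) would earn its keep.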
> Does eXist have any special support for
> these? are there any common traps?

You can define indexes on them just as on any other attribute. I don't know what you mean by special support, though.

> (6) Currently I don't take any steps to update the indexes. Does eXist
> build the indexes listed in the collection.xconf as documents are
> loaded?

Yes, eXist updates the indexes when you store documents, so this shouldn't be a problem. I think the issue is your query.

> (7) Are there other common traps and pitfalls that I should be
> checking? I've read http://exist.sourceforge.net/tuning.html

Read http://exist.sourceforge.net/indexing.html -- I think you will find some very important things there, especially on the kinds of indexes certain functions (e.g. fn:contains) benefit from, and why qname indexes are preferable.

I hope I figured out correctly what you are trying to do. If I am wrong with some of my assumptions or solutions, please feel free to correct me. I hope you find something useful in this response.

Cheers,
Stefan