From: Stefan M. <ste...@un...> - 2011-05-01 12:28:39
Hi Stuart,

I just pulled your project from git, but to be honest I found it rather hard to get everything in place to get it running. After changing everything so that I can work with less RAM (I'm on 32 bit on my home machine), I finally got it up and running, but the experiments have issues: here I see it starting the eXist instance again and again. As I'm a little limited in time right now, I stopped pursuing this, as I assume it is due to some assumptions about the system in the makefile. I therefore tried to focus on the things I could comment on without actually testing it myself. If you could provide a makefile that takes just the URI and password of an eXist-db instance, without all the setup stuff, I'd be happy to test it, though.

That said, I have assembled a few suggestions (just suggestions) based on my experience. I do not claim that they will solve all your problems.

> xhtml with XSLT in the browser for display. I'm using 64 bit Java on a
> 64 bit Ubuntu and giving Java 6GB of RAM (of 8GB physical RAM
> available).

That is huge. We work with below 2GB and sometimes dozens of concurrent users and rather complex queries (filtering on structural and indexed criteria, intersection of result sets, etc.), and that is sufficient. I don't know whether you are hitting the memory limit, but I doubt it. One difference may be that we put our results in a session variable on the server and serialize only the portion that fits in the browser; from there you can navigate from result page to result page. Some queries would yield a million hits, which would obviously take considerable time and memory to serialize. In our experience, this is an area where significant performance gains can be made.

> I have several GB of XML I'd like to put into He Kupu Tawhito, but I'm
> having difficulty scaling above a couple of dozen MB of XML. Others in
> the TEI community have reported similar issues, informally.
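To illustrate the session-variable approach, here is a rough sketch. The function names are from eXist's session and request extension modules; the query itself and the parameter names are hypothetical, just to show the pattern:

```xquery
(: Run the expensive query once, cache the full result in the HTTP
   session, and serialize only one page of hits per request.
   "page" and "kupu" are hypothetical request parameters. :)
let $page := xs:integer(request:get-parameter("page", "1"))
let $page-size := 15
let $hits :=
    let $cached := session:get-attribute("hits")
    return
        if (exists($cached)) then
            $cached
        else
            let $result := //w[@lemma eq request:get-parameter("kupu", ())]
            return (session:set-attribute("hits", $result), $result)
return
    (: only this small slice ever gets serialized to the browser :)
    subsequence($hits, ($page - 1) * $page-size + 1, $page-size)
```

The point is that the million-hit result set stays on the server; each request only pays the serialization cost of one page.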
I would assume that you should certainly not be limited to a couple of dozen MB. Something has to be wrong, then.

> with Jens Østergaard Petersen convinced me that it may be worth having
> another go at scaling this, and thus this message.

I remember your harsh statement about the "failure to scale up" on TEI-L (without any hedging indicating that your queries might have a share in this). Hmm. Let's see; I guess it is still possible to improve them.

> The best example I can give is this graph which shows query time vs
> the size of the contents, both with and without collection.xconf.
>
> https://spreadsheets.google.com/ccc?key=0AtkIjlDqC2H4dFBQM0V4SjVTdTdlQnFfWWpDaks5d0E&hl=en_GB&authkey=CJDPuY8F

As far as I can see from your makefile, this measures the overall time of the request to the query. While in practice this is what actually matters, it is interesting to ask where the time goes. As I was not able to quickly test it with the makefile, I assume from the code that the resulting page is of moderate size (it only fetches the first 15 whatever-they-might-be, right?). If the time is spent selecting the elements from the database, that is probably an indicator that indexes are either not used, or that the query optimizer was not able to optimize your query.

> The 50 chapter mark represents 5.9 MB of TEI/XML in a single file.

That's small. The performance you are seeing is exceptionally bad.

> (1) In my makefile I'm loading my files using:
>
> $(EXIST_HOME)/bin/client.sh
> uri=xmldb:exist://localhost:8081/exist/xmlrpc -m
> /db/system/config/db/he_kupu_tawhito/ -p collection.xconf --no-gui
> $(EXIST_HOME)/bin/client.sh
> uri=xmldb:exist://localhost:8081/exist/xmlrpc -m /db/he_kupu_tawhito/
> -p korero/www.biblegateway.com-sampler/import.words.xml --no-gui
>
> Will that result in that collection.xconf applying to import.words.xml ?

The procedure looks fine to me.
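If you want to separate node-selection time from serialization and HTTP overhead, a crude in-query measurement can help. This is only a sketch: util:system-time() is an eXist extension function, the query shown is hypothetical, and count() is used here as a cheap way to force evaluation:

```xquery
(: Crude in-query timing: how long does selecting the nodes take,
   independent of serializing them to the client? :)
let $start   := util:system-time()
let $hits    := //w[@lemma eq "aroha"]        (: hypothetical query :)
let $count   := count($hits)                  (: force evaluation :)
let $elapsed := util:system-time() - $start   (: xs:dayTimeDuration :)
return
    <timing hits="{$count}"
            selection-seconds="{seconds-from-duration($elapsed)}"/>
```

If the selection itself is fast and the overall request is still slow, the time is going into serialization or transport rather than the index lookup.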
> (2) My collection.xconf and main xQuery are at:
>
> https://github.com/stuartyeates/He-Kupu-Tawhito/blob/master/collection.xconf

I think you could use qname indexes instead of path indexes for all the indexes you defined. The query optimizer will do a much better job then.

> https://github.com/stuartyeates/He-Kupu-Tawhito/blob/master/kupu.xql

It could be worthwhile using a full-text index for some cases (e.g. multiple values in @corresp). An appropriate full-text index with the whitespace analyzer (I don't know what the standard analyzer would do with the "#"), together with ft:contains, could help, especially for the cases where @corresp holds several values. But from your query I would assume that a qname or ngram index should be sufficient.

> <create qname="lemma" type="xs:string"/>

You don't use this anywhere, right?

> <create qname="@xml:id" type="xs:string"/>
> <create qname="@xml:lang" type="xs:string"/>
> <create qname="@lemma" type="xs:string"/>

You should really make these qname (not path) indexes; I changed the quoted lines accordingly. An index for @corresp, which you make use of in your query, is missing. It could be useful to define a qname index and an ngram or full-text index there, depending on the kind of queries you want to perform.

> (3) I've tinkered a reasonable amount with my xquery, but I won't
> profess to being an expert:
>
> https://github.com/stuartyeates/He-Kupu-Tawhito/blob/master/kupu.xql
>
> Is there anything obvious I'm doing wrong?

I found that it sometimes helps to break multiple predicates into several let expressions:

> let $words := $this//w[@lemma=$kupu][@xml:lang=$reo]/@xml:id

let $words := $this//w[@lemma eq $kupu]
let $words := $words[@xml:lang eq $reo]

Try to apply the most restrictive predicate first. I also think it is a good habit to use eq when you do not intend to deal with sequences ($kupu and $reo are just string values, right?). That shouldn't affect performance here, though.
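Put together, the index configuration I have in mind would look roughly like this. This is a sketch against your collection.xconf, not a drop-in replacement; in particular, the TEI namespace binding and the choice of an ngram index on @corresp are my assumptions:

```xml
<collection xmlns="http://exist-db.org/collection-config/1.0">
    <index xmlns:tei="http://www.tei-c.org/ns/1.0">
        <!-- qname (not path) range indexes: the query optimizer
             handles these much better -->
        <create qname="@xml:id"   type="xs:string"/>
        <create qname="@xml:lang" type="xs:string"/>
        <create qname="@lemma"    type="xs:string"/>
        <!-- the missing index on @corresp; the additional ngram
             index helps substring-style lookups on it -->
        <create qname="@corresp"  type="xs:string"/>
        <ngram qname="@corresp"/>
    </index>
</collection>
```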
> let $thisid := $this/@xml:id
> let $thishash := concat('#', $thisid)

You don't use these anywhere later. Why define them, then?

> let $others :=
> //p[contains($this/@corresp,@xml:id)][(concat('#',@xml:id)=$this/@corresp)
> or (concat('#',$this/@xml:id)=@corresp)] |
> //p[contains(@corresp,$this/@xml:id)][(concat('#',@xml:id)=$this/@corresp)
> or (concat('#',$this/@xml:id)=@corresp)]

I would suspect that this expression is causing much of the trouble! You can split it up like the one above, and then:

- contains($this/@corresp,@xml:id): this does not use indexes! Try something like "@xml:id = tokenize($this/@corresp, ' ')" (note that I do not use "eq" here). Or do you use fn:contains because of a prepended "#"? Then maybe something along the lines of "@xml:id = (for $i in tokenize($this/@corresp, ' ') return substring-after($i, '#'))". I assume that you do this for 15 $this and do not have a huge number of @corresp values. This way you can make use of the index on @xml:id.

- concat('#',@xml:id)=$this/@corresp: this does not use indexes either. Even with an index defined on @xml:id, it would not be used because of the concat. The way you have written it, I would suspect that eXist has, in the worst case, to do a concat on every tei:p element you have (linear scaling, then). Try "@xml:id eq substring-after($this/@corresp,'#')" to use the index on @xml:id instead.

- concat('#',$this/@xml:id)=@corresp: similar here, but this time it is the missing index on @corresp.

The second part of the union has the same issues: the second predicate is exactly the same, and for the first, just swap @xml:id and @corresp in my first explanation.

> Is using an index to pull a <p/> out of a 500 MB single file slower
> than pulling the same <p/> out of a 50 KB file sitting in a collection
> of 10K files?

I would not expect the big file to be slower.

> (5) TEI uses the standard xml:id and xml:lang tags and I make
> extensive use of both of these. I run xmllint over my input files to
> check for duplicate xml:ids.
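Putting those pieces together, an index-friendly version of the whole $others lookup might look like this. A sketch only, untested against your data; it assumes @corresp holds space-separated "#"-prefixed references, as your concat calls suggest:

```xquery
(: Index-friendly rewrite of the $others lookup. Assumes @corresp
   holds space-separated references of the form "#some-id". :)
let $this-refs :=
    for $ref in tokenize($this/@corresp, '\s+')
    return substring-after($ref, '#')
let $others :=
    (: p elements that $this points at; the "=" against a sequence
       of ids can use the range index on @xml:id :)
    //p[@xml:id = $this-refs]
    |
    (: p elements that point back at $this; exact match only, so it
       can use an index on @corresp :)
    //p[@corresp = concat('#', $this/@xml:id)]
return $others
```

Note the limitation of the second branch: a plain equality only matches p elements whose @corresp holds exactly one reference. If @corresp can be multi-valued there too, that is where an ngram or full-text index on @corresp (with ngram:contains or ft:contains) would earn its keep.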
> Does eXist have any special support for
> these? are there any common traps?

You can define indexes on them just as on any other attribute. I don't know what you mean by special support, though.

> (6) Currently I don't take any steps to update the indexes. Does eXist
> build the indexes listed in the collection.xconf as documents are
> loaded?

Yes, eXist updates the indexes when you store documents, so this shouldn't be a problem. I think the issue is your query.

> (7) Are there other common traps and pitfalls that I should be
> checking? I've read http://exist.sourceforge.net/tuning.html

Read http://exist.sourceforge.net/indexing.html -- I think you will find some very important things there, especially on the kinds of indexes certain functions (e.g. fn:contains) benefit from, and why qname indexes are preferable.

I hope I figured out correctly what you are trying to do. If I am wrong with some of my assumptions or solutions, please feel free to correct me. I hope you find something useful in this response.

Cheers,
Stefan