From: Ron V. d. B. <ron...@ka...> - 2004-03-02 16:43:49
Hi,

I keep stumbling into problems with highly unpredictable query results. I have a collection of 1748 documents. The following structure illustrates the part of my data I want to query:

    <TEI.2>
      ...
      <letHeading>
        <author>Stijn Streuvels</author>
        <addressee>Joris Lannoo</addressee>
        <placeLet>Ingooigem</placeLet>
        <dateLet>1947-05-12</dateLet>
      </letHeading>
      ...
    </TEI.2>

The following queries illustrate the kind of tests I run for some <author> - <addressee> pairs:

    collection('/db/tinyDALF')/TEI.2//letHeading[author&='streuvels' and addressee&='lannoo']
    collection('/db/tinyDALF')/TEI.2//letHeading[near(author,'streuvels') and near(addressee, 'lannoo')]
    collection('/db/tinyDALF')/TEI.2//letHeading[match-all(author,'streuvels') and match-all(addressee, 'lannoo')]

    collection('/db/tinyDALF')/TEI.2//letHeading[author&='streuvels' and addressee&='joris lannoo']
    collection('/db/tinyDALF')/TEI.2//letHeading[near(author,'streuvels') and near(addressee, 'joris lannoo')]
    collection('/db/tinyDALF')/TEI.2//letHeading[match-all(author,'streuvels') and match-all(addressee, 'joris', 'lannoo')]

    collection('/db/tinyDALF')/TEI.2//letHeading[author&='stijn streuvels' and addressee&='lannoo']
    collection('/db/tinyDALF')/TEI.2//letHeading[near(author,'stijn streuvels') and near(addressee, 'lannoo')]
    collection('/db/tinyDALF')/TEI.2//letHeading[match-all(author, 'stijn', 'streuvels') and match-all(addressee, 'lannoo')]

    collection('/db/tinyDALF')/TEI.2//letHeading[author&='stijn streuvels' and addressee&='joris lannoo']
    collection('/db/tinyDALF')/TEI.2//letHeading[near(author,'stijn streuvels') and near(addressee, 'joris lannoo')]
    collection('/db/tinyDALF')/TEI.2//letHeading[match-all(author, 'stijn', 'streuvels') and match-all(addressee, 'joris', 'lannoo')]

These tests show instability in the number of documents returned, in the following respects:

* the different functions used: the near() and match-all() functions often return too few results, whereas the '&=' operator mostly (but not always) performs well.
* the different arguments passed to the functions: the queries with multiple keyword strings often performed worse.
* different versions of my data: run on a flat collection structure (all data under 1 collection), the tests produce more accurate results than those run on a version containing some 30-odd of the documents under a separate subcollection.
* different collections of the same data (indexed identically under different collection names). However, once indexed, the same tests produce the same results within the same version of eXist.
* different snapshots of eXist: I've copied the indexed *.dbx files into the /WEB-INF/data folder of the CVS version, the stable 1.0 version and the latest snapshots (20040225, 20040227, 20040302), producing strongly varying results for the tests.

I have no clue what causes this instability; some issues I can think of are:

* functioning of string functions
* combination of test conditions with the "and" operator
* indexing troubles

I have 3 concrete questions:

1) I feel quite isolated with my problem, but do realise that these errors are particularly clear since those parts of my data are very controllable. Are there perhaps other users out there with more data-centric parts who observe, or have observed, similar phenomena when querying their data?

2) I remember a posting some time ago (2004-01-21) about queries on the mondial.xml file returning unexpected results. Since these problems seem to be quite related, maybe their solutions are too? Was it purely a matter of debugging erroneous functions?

3) I index my data using the client.bat script in GUI mode, by just creating collections and selecting the appropriate files / directories. Are there perhaps some specific settings that I may have overlooked and that may cause such problems?
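For anyone who wants to regenerate the test set, the twelve queries listed earlier form a small matrix: two author keyword sets, two addressee keyword sets, and three query forms. The following Python sketch builds that matrix as strings. It is only an illustration: the helper names (`fulltext_op`, `near`, `match_all`, `build_queries`) are hypothetical, whitespace inside the predicates is normalised, and it does not talk to eXist at all.

```python
from itertools import product

# Path taken from the queries in this message.
BASE = "collection('/db/tinyDALF')/TEI.2//letHeading"


def fulltext_op(element, keywords):
    """The '&=' form: all keywords in one quoted string."""
    phrase = " ".join(keywords)
    return f"{element} &= '{phrase}'"


def near(element, keywords):
    """The near() form: all keywords in one quoted string."""
    phrase = " ".join(keywords)
    return f"near({element}, '{phrase}')"


def match_all(element, keywords):
    """The match-all() form: each keyword as a separate argument."""
    args = ", ".join(f"'{k}'" for k in keywords)
    return f"match-all({element}, {args})"


FORMS = (fulltext_op, near, match_all)
AUTHORS = (["streuvels"], ["stijn", "streuvels"])
ADDRESSEES = (["lannoo"], ["joris", "lannoo"])


def build_queries():
    """Return the 2 x 2 x 3 = 12 query strings, in the order used above."""
    queries = []
    for author, addressee in product(AUTHORS, ADDRESSEES):
        for form in FORMS:
            left = form("author", author)
            right = form("addressee", addressee)
            queries.append(f"{BASE}[{left} and {right}]")
    return queries


if __name__ == "__main__":
    for query in build_queries():
        print(query)
```

Running each generated string against the same collection and noting the result count per query gives a table in which the three forms of one keyword pair can be compared directly.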
One (relative) point of relief is that these problems do not seem to occur on the same data indexed and queried with the eXist-0.9.2 version (using the XPath query interface, of course). This points to causes related to changes in the indexing/querying(?) code of eXist-1.0b, rather than problems inherent to my data.

I tried to send a (perhaps prohibitively) large copy of my data to wol...@so... but am not sure whether it has arrived in good order. Therefore, I'll try to attach a trimmed-down version of my data (which can be zipped to 704 kB) to this message, as well as a query file I used. If this is not the way to do it, I'd be happy to follow any alternatives suggested.

I'd warmly appreciate any reaction,

Ron

-- 
Ron Van den Branden
Wetenschappelijk attaché
Centrum voor Teksteditie en Bronnenstudie (CTB)
Koninklijke Academie voor Nederlandse Taal- en Letterkunde (KANTL)
Koningstraat 18 / b-9000 Gent / Belgium
e-mail : ron...@ka...
http://www.kantl.be/ctb/staff/ron.htm