From: Michael B. <mbe...@mb...> - 2005-05-17 10:34:27
|
OK, I have now added a DEBUG with "NOT using..." to the else branch of the relevant method in FunMatches.java, and recompiled, and on a query which with 2005-04-01 definitely used the range index, a build from CVS of 2005-05-14 now appears not to be doing so.

Stored in db/system/config/db/and/sources I have sources.xconf, which reads

<?xml version="1.0" encoding="UTF-8"?>
<collection xmlns="http://exist-db.org/collection-config/1.0"
            xmlns:exist="http://exist.sourceforge.net/NS/exist">
  <index xmlns:x="http://www.foo.com">
    <fulltext default="all">
      <exclude path="//front"/>
      <exclude path="//back"/>
      <exclude path="//note"/>
    </fulltext>
    <create path="//lg/l" type="xs:string"/>
  </index>
</collection>

but on a query against that collection on the appropriate path

exist:/and/sources>find //lg/l [matches(.,'fesauntes')]
//lg/l [matches(.,'fesauntes')]
found 1 hits in 432ms.

I see logged

2005-05-17 11:26:20,353 [main] DEBUG (XQuery.java [compile]:111) - Query diagnostics: /ROOT/descendant-or-self::lg/child::l[matches(self::node(), "fesauntes")]
2005-05-17 11:26:20,354 [main] DEBUG (XQuery.java [compile]:113) - Compilation took 4
2005-05-17 11:26:20,355 [main] DEBUG (XQueryContext.java [getStaticallyKnownDocuments]:478) - reading collection /db/and/sources
2005-05-17 11:26:20,416 [main] DEBUG (FunMatches.java [evalWithIndex]:187) - NOT Using index ...
2005-05-17 11:26:20,775 [main] DEBUG (LocalXPathQueryService.java [execute]:350) - query took 420 ms.

Help!!!

Michael Beddow
|
From: Wolfgang M. <wol...@ex...> - 2005-05-17 12:36:47
|
> OK I have now added a DEBUG with "NOT using..." to the else branch of the
> relevant method in FunMatches.java, and recompiled, and on a query which
> with 2005-04-01 definitely used the range index, a build from CVS of
> 2005-05-14 now appears not to be doing so.

I checked this with the current CVS. I stored the following configuration into /db/system/config/db/mondial (slightly modified from samples/mondial.xconf):

<?xml version="1.0" encoding="ISO-8859-1"?>
<collection xmlns="http://exist-db.org/collection-config/1.0">
  <!--
    Defines a bunch of numeric indexes on the mondial collection.
    This file should be stored into /db/system/config/db/mondial.
  -->
  <index>
    <fulltext default="all" attributes="yes">
      <exclude path="/mondial//name"/>
    </fulltext>
    <create path="/mondial//population" type="xs:integer"/>
    <create path="/mondial//population_growth" type="xs:double"/>
    <create path="/mondial//infant_mortality" type="xs:double"/>
    <create path="/mondial//inflation" type="xs:double"/>
    <create path="/mondial//name" type="xs:string"/>
  </index>
</collection>

I then stored mondial.xml into /db/mondial/mondial.xml. Here are the commands I used:

bin/client.sh -l -m /db/system/config/db/mondial -p mondial.xconf
bin/client.sh -l -m /db/mondial -p mondial.xml

I started the client again and executed a query:

//province[fn:matches(name, ".*Dh.*")]

As expected, the log output shows:

17 May 2005 14:30:30,527 [Thread-4] DEBUG (FunMatches.java [evalWithIndex]:179) - Using index ...
17 May 2005 14:30:30,716 [Thread-4] DEBUG (LocalXPathQueryService.java [execute]:350) - query took 333 ms.

I tried various kinds of queries, but the index is always used.

Something strange is going on here ...

Wolfgang
|
From: Michael B. <mbe...@mb...> - 2005-05-17 16:04:32
|
> I tried various kinds of queries, but the index is always used.
>
> Something strange is going on here ...
>

Indeed....

I eventually gave up trying to get range indexing to work on my existing collections with the current build. I zapped my entire data directory, then stored one document, preceded by a corresponding xconf file defining a range index. (This recreated the setup I first used to test mixed content string range matching with the April 1 build: I fetched both the data doc and the xconf afresh from the filestore, not from my eXist data backup store.)

The good news is that according to the logs the range index is being used again now (this is without changing the binaries, and the document and the xconf are in the same places as before).

The slightly worse news is that the mixed content handling is behaving oddly.

exist:/db/and/sources>find //lg/l [matches(.,'fesauntes')]
//lg/l [matches(.,'fesauntes')]
found 0 hits in 914ms.

If I step back up a level in the XPath to deactivate the range index, I get the expected match:

find //lg[matches(.,'fesauntes')]
found 1 hits in 42ms.
show 1
<lg>
  [non-matching <l>s left out of results for clarity]
  <l lang="ME" id="ANH-002-0050-L0017-M">A nye of fesau<expan rend="italic">n</expan>tes, a coveye of p<expan rend="italic">er</expan>dryz,</l>
  [non-matching <l>s left out of results for clarity]
</lg>

So the string "fesauntes", split as it is across more than one element, is no longer found by the range index, at least not with a term I would have expected to work. However, if I make that

//lg/l [matches(.,' fesauntes')]

(i.e. with a leading space before the f) I get the match on the range index:

//lg/l [matches(.,' fesauntes')]
found 1 hits in 57ms.
show 1
<l lang="ME" id="ANH-002-0050-L0017-M">A nye of fesau<expan rend="italic">n</expan>tes, a coveye of p<expan rend="italic">er</expan>dryz,</l>

Very strange indeed. I will now restore the other 60,000-odd documents to my collections, and see what happens then.

Michael
|
From: Michael B. <mbe...@mb...> - 2005-05-17 17:22:42
|
My collections are now all back in place, and range-indexing is still working. There must have been something in the actual *.dbx files I had before that was blocking it.

BUT

(a) I still find that in order to get a match on a word via the range index, I have to precede it by a space (or a \b)

(b) I am observing a performance difference on fulltext vs range indices of a similar order to what Sjur reported if I use literal matches (though not if I use regexes)

Case 1a

exist:/db/and/sources>find //lg/l [match-any(.,'rei')]
//lg/l [match-any(.,'rei')]
found 914 hits in 39ms.

Case 1b (leading space used to ensure match)

exist:/db/and/sources>find //lg/l [matches(.,' rei')]
//lg/l [matches(.,' rei')]
found 913 hits in 246ms.

Obviously these times vary from run to run, as is usual with Java programs involving string manipulation, but the order of difference between full-text and range-indexed is pretty constant.

If I use regex notation to let me employ the same matching expression for both approaches, the timing difference becomes much smaller, and the range index matches come out in front, because the fulltext version slows down dramatically (I'm not really clear why, since a fixed string between \b markers shouldn't be particularly expensive for the matching engine).

Case 2a

exist:/db/and/sources>find //lg/l [match-any(.,'\brei\b')]
//lg/l [match-any(.,'\brei\b')]
found 394 hits in 346ms.

Case 2b

exist:/db/and/sources>find //lg/l [matches(.,'\brei\b')]
//lg/l [matches(.,'\brei\b')]
found 396 hits in 248ms.

I've not yet been able to track down the differences in the hit totals. They should be identical in both cases, but they are in any case not wildly divergent.

Michael
|
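The gap between the 913 hits for ' rei' (Case 1b) and the 396 hits for '\brei\b' (Case 2b) may simply reflect that the two patterns ask different questions: a leading space also accepts longer words that begin with "rei" and misses a line that starts with the word, whereas \b...\b accepts only the whole word. A minimal Java sketch of that difference, assuming java.util.regex semantics for \b (the thread does not say which regex engine eXist uses here); the class name and the sample strings are made up for illustration:

```java
import java.util.regex.Pattern;

// Illustrative only: the class name and sample strings are hypothetical.
final class BoundarySketch {

    // Substring-style matching, in the spirit of XQuery's matches():
    // true if the pattern is found anywhere in the text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        String[] samples = { "le rei", "la reine", "rei de France" };
        for (String s : samples) {
            System.out.println(s
                    + " | ' rei': " + found(" rei", s)
                    + " | '\\brei\\b': " + found("\\brei\\b", s));
        }
    }
}
```

In this sketch "la reine" is accepted only by the leading-space pattern, and "rei de France" only by the word-boundary pattern, so the two hit counts need not agree.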
From: Michael B. <mbe...@mb...> - 2005-05-17 19:41:58
|
It's rather ironic that after holding forth here recently about the true nature and function of the "except" operator, I was still under the illusion that tracking down the source of the hit count discrepancy between fulltext index and range index hits in my last post would be time-consuming. In fact, of course, as our cat disdainfully pointed out to me by a quick prod of her tail, pinpointing where to look is a ridiculously easy set operation.

find //lg/l [matches(.,'\brei\b')] except //lg/l [match-any(.,'\brei\b')]
found 2 hits in 577ms.

Even better, on inspecting those 2 hits, the essential correctness of eXist stands vindicated. My xconf file for that collection tells the fulltext indexer not to index notes. And the two hits the range-index match finds that the fulltext match doesn't return are in the text content of note elements. I.e. the discrepancy shows entirely correct behaviour.

Now: if that little problem re the need for a leading space in the range index match term can just get cleared up, I will be too usefully occupied developing code to keep on responding to my own posts, so everyone will be happy.

Michael Beddow
|
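The same check can be sketched outside XQuery as a plain set difference over node identifiers. The Java below is only an illustration under stated assumptions: the class name and the ids are hypothetical, and XQuery's "except" additionally works on node identity and returns nodes in document order, which this sketch does not model.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: a set difference standing in for XQuery's "except".
final class ExceptSketch {

    // Returns the elements of 'left' that do not occur in 'right',
    // keeping the order in which they appear in 'left'.
    static <T> Set<T> except(Collection<T> left, Collection<T> right) {
        Set<T> result = new LinkedHashSet<>(left);
        result.removeAll(right);
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical ids: two hits sit inside note elements, which the
        // fulltext index was told to skip.
        List<String> rangeHits = Arrays.asList("l-1", "l-2", "note-l-7", "note-l-9");
        List<String> fulltextHits = Arrays.asList("l-1", "l-2");
        System.out.println(except(rangeHits, fulltextHits)); // prints [note-l-7, note-l-9]
    }
}
```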
From: Wolfgang M. <wol...@gm...> - 2005-05-18 06:49:30
|
Hi Michael,

> Case 1b (leading space used to ensure match)
> exist:/db/and/sources>find //lg/l [matches(.,' rei')]
> //lg/l [matches(.,' rei')]
> found 913 hits in 246ms.

This points to another conceptual problem in the regex handling. I will try to clean up the whole regexp and range index stuff and present a fixed version soon.

Wolfgang
|
From: Wolfgang M. <wol...@gm...> - 2005-05-24 17:45:29
|
> This points to another conceptual problem in the regex handling. I
> will try to clean up the whole regexp and range index stuff and
> present a fixed version soon.

I have now fixed this problem. To speed up index lookups, eXist used the first characters of the pattern to limit the portion of the btree that needs to be scanned. However, in XQuery, regular expressions match if any substring matches, so the first character of the pattern does not need to be the first character in the string.

eXist now uses the first characters only if the regex starts with a "^" anchor.

Wolfgang
|
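The idea behind the fix can be sketched roughly as follows. This is not eXist's actual code: a TreeMap stands in for the btree, and the class and method names (PrefixScanSketch, literalPrefix, matchingKeys) are hypothetical. The sketch derives a literal prefix only when the pattern starts with "^", and otherwise falls back to scanning every key, since an unanchored pattern may match anywhere in the indexed string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.regex.Pattern;

// Illustrative only: a sorted map stands in for the range index btree.
final class PrefixScanSketch {

    // Returns the literal characters that follow a leading '^', or "" when
    // the pattern is unanchored or too complex to analyse safely.
    static String literalPrefix(String regex) {
        if (!regex.startsWith("^") || regex.indexOf('|') >= 0) {
            return ""; // unanchored, or alternation: assume no usable prefix
        }
        StringBuilder prefix = new StringBuilder();
        for (int i = 1; i < regex.length(); i++) {
            char c = regex.charAt(i);
            if (Character.isLetterOrDigit(c) || c == ' ') {
                prefix.append(c);
                continue;
            }
            // A following *, ? or {n,m} can make the last literal optional.
            if ((c == '*' || c == '?' || c == '{') && prefix.length() > 0) {
                prefix.setLength(prefix.length() - 1);
            }
            break;
        }
        return prefix.toString();
    }

    // Returns the index keys matched by the regex, narrowing the scan to
    // keys that start with the literal prefix when one is available.
    static List<String> matchingKeys(NavigableMap<String, Integer> index, String regex) {
        Pattern pattern = Pattern.compile(regex);
        String prefix = literalPrefix(regex);
        NavigableMap<String, Integer> range = prefix.isEmpty()
                ? index
                : index.subMap(prefix, true, prefix + Character.MAX_VALUE, false);
        List<String> hits = new ArrayList<>();
        for (String key : range.keySet()) {
            if (pattern.matcher(key).find()) { // substring semantics
                hits.add(key);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        NavigableMap<String, Integer> index = new TreeMap<>();
        index.put("A nye of fesauntes, a coveye of perdryz,", 1);
        index.put("fesauntes first", 2);
        System.out.println(matchingKeys(index, "fesauntes"));  // full scan
        System.out.println(matchingKeys(index, "^fesauntes")); // prefix scan
    }
}
```

With the unanchored pattern both keys are returned; with the anchored one only the key that begins with "fesauntes" is, and only the matching slice of the map is visited.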