From: Ron V. d. B. <ron...@ka...> - 2010-09-30 15:35:57
Hi,

I'm investigating how feasible it is to implement some kind of faceted searching with the current state of eXist (tested against 1.5 trunk). Revisiting an older thread (<http://exist-open.markmail.org/thread/g4qdfvpt562jdfh2>) and checking how it is done in eXist's bibliographic webapp demo, it appears that the util:index-keys() function is the quickest way to look up index terms. After experimenting with this function, I noticed that it behaves differently depending on the index it is run against. These differences could motivate an approach for faceted search, but I'm not sure my understanding is correct. Hence, I'm looking for advice or comments on my analysis.

My observations are based on the following test document ('/db/test/test.xml'):

    <test>
        <p>this is the first paragraph in a test document</p>
        <p>this is the second paragraph in a test document</p>
    </test>

...the following index configuration ('/db/system/config/db/test/collection.xconf'):

    <collection xmlns="http://exist-db.org/collection-config/1.0">
        <index xmlns:xlink="http://www.w3.org/1999/xlink">
            <fulltext default="none" attributes="no"/>
            <lucene>
                <text match="//p"/>
            </lucene>
            <create path="//p" type="xs:string"/>
        </index>
    </collection>

...and the following basic xquery:

    (: callback invoked once per index key; wraps the key and its statistics in an <entry> :)
    declare function local:term-callback($term as xs:string, $data as xs:int+) as element() {
        <entry>
            <term>{$term}</term>
            <frequency>{$data[1]}</frequency>
            <documents>{$data[2]}</documents>
            <position>{$data[3]}</position>
        </entry>
    };

    let $pool := collection('/db/test')//p
    let $local:key := util:function(xs:QName("local:term-callback"), 2)
    return util:index-keys($pool, '', $local:key, 1000)

Depending on the index used, util:index-keys() returns different results:

* when run against a range index (util:index-keys($pool, '', $local:key, 1000)), the complete indexed strings for those nodes are returned:

    <entry>
        <term>this is the first paragraph in a test document</term>
        <frequency>1</frequency>
        <documents>1</documents>
        <position>1</position>
    </entry>
    <entry>
        <term>this is the second paragraph in a test document</term>
        <frequency>1</frequency>
        <documents>1</documents>
        <position>2</position>
    </entry>

* when run against the lucene FT index (util:index-keys($pool, '', $local:key, 1000, 'lucene-index')), the tokenized words are returned:

    <entry>
        <term>document</term>
        <frequency>2</frequency>
        <documents>1</documents>
        <position>1</position>
    </entry>
    <entry>
        <term>first</term>
        <frequency>1</frequency>
        <documents>1</documents>
        <position>2</position>
    </entry>
    <entry>
        <term>paragraph</term>
        <frequency>2</frequency>
        <documents>1</documents>
        <position>3</position>
    </entry>
    <entry>
        <term>second</term>
        <frequency>1</frequency>
        <documents>1</documents>
        <position>4</position>
    </entry>
    <entry>
        <term>test</term>
        <frequency>2</frequency>
        <documents>1</documents>
        <position>5</position>
    </entry>

This would suggest that the util:index-keys() function on a range index can be used as an index-based (hence much faster) alternative for distinct-values(). Hence, it could make sense to index the same nodes twice:

- range index for quick lookup of distinct values for a node (when e.g. a list of name facets is to be retrieved)
- lucene FT index for quick lookup of individual terms (when e.g. a list of index terms is to be retrieved)

Still, some questions remain:

- The documentation on util:index-keys() is quite sparse. Apart from the function reference, http://demo.exist-db.org/exist/indexing.xml#N10557 is the only place where its behaviour is illustrated. Is it correct to infer from the documentation that util:index-keys() without the fifth parameter (which names the index) looks in the range index by default, while one has to specify 'lucene-index' explicitly as the fifth parameter to make it look in the lucene FT index?
- I noticed (even with this very simple example) that running util:index-keys() against the lucene FT index is significantly slower (2000-3000 ms in the java admin client) than against the range index (0-10 ms). Is this normal behaviour, or is something wrong with my index configuration?
Any hints much appreciated,

Kind regards,

Ron

--
Ron Van den Branden
Wetenschappelijk attaché / Project Officer
Centrum voor Teksteditie en Bronnenstudie - CTB (KANTL)
Centre for Scholarly Editing and Document Studies
Koninklijke Academie voor Nederlandse Taal- en Letterkunde
Royal Academy of Dutch Language and Literature
Koningstraat 18 / b-9000 Gent / Belgium
tel: +32 9 265 93 51 / fax: +32 9 265 93 49
E-mail : ron...@ka...
www.kantl.be/ctb