From: Wolfgang M. <wol...@gm...> - 2007-10-30 10:34:05
Hi Kai,

> Here some examples (docs + index conf attached). The following query yields 11 hits with
> the "old" configuration and 5 hits with the qname configuration:

The differing query results were caused by differences in whitespace handling and text tokenization between the standard full text index and the index configured by QName. Fixing this problem wasn't that easy, especially since the old indexer code was a bit chaotic. I thus decided to migrate the relevant parts of the code to our new modularized indexing framework, which has a much cleaner design (the switch to the new architecture was planned for later this year, but now we have already done a part of it).

The index configuration is now more consistent:

    <collection xmlns="http://exist-db.org/collection-config/1.0">
        <index>
            <fulltext default="none">
                <!-- 1. tokenizer splits text at element boundaries -->
                <include path="/elem"/>
                <create qname="elem"/>
                <!-- 2. ignore element boundaries, index as mixed content -->
                <include path="/elem" content="mixed"/>
                <create qname="elem" content="mixed"/>
            </fulltext>
        </index>
    </collection>

Without the content="mixed" attribute, the tokenizer will split the text at element boundaries, i.e.

    <p><span>un</span><span>expected</span></p>

will result in 2 tokens in the index: "un" and "expected". If you add content="mixed", "unexpected" will be treated as 1 token!

For your use case - query the entire <content> element with all subelements - you should create an index without the content="mixed" attribute.

Wolfgang
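
P.S. To make the difference between the two modes concrete, here is a rough Python sketch. It is not the eXist indexer code, only an illustration of the idea under simplifying assumptions: the function names are made up for this example, and a token is assumed to be any run of word characters.

```python
import re
import xml.etree.ElementTree as ET

# Assumption for this sketch: a token is a run of word characters.
WORD = re.compile(r"\w+")

def tokens_split_at_boundaries(xml_text):
    """Tokenize every text node on its own, so element boundaries end a token
    (the behaviour described above when content="mixed" is absent)."""
    root = ET.fromstring(xml_text)
    tokens = []
    for chunk in root.itertext():
        tokens.extend(WORD.findall(chunk))
    return tokens

def tokens_mixed(xml_text):
    """Concatenate all text nodes first, then tokenize, so element boundaries
    are ignored (the behaviour described above for content="mixed")."""
    root = ET.fromstring(xml_text)
    return WORD.findall("".join(root.itertext()))

doc = "<p><span>un</span><span>expected</span></p>"
print(tokens_split_at_boundaries(doc))  # ['un', 'expected']
print(tokens_mixed(doc))                # ['unexpected']
```

A search for the word "unexpected" would therefore only match the second token list, which is why the two configurations can return different hit counts for the same query.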