From: Christian W. <cwi...@gm...> - 2020-12-02 09:40:49
Dear eXist users,

I am trying to improve search recall for my application, which contains premodern Chinese text. My documents have a <tei:seg> element for every phrase, and these are in turn contained in <tei:p> elements. At the moment, I am simply defining an ngram index on tei:seg, but that of course limits matches to the contents of a single tei:seg element.

To overcome this limitation, I am defining an ngram index on tei:p as well, in the expectation that the ngrams will be constructed by concatenating the tei:seg elements that make up a paragraph. So for example:

    <tei:p><tei:seg>ABCD.</tei:seg><tei:seg>EFGH</tei:seg></tei:p>

With such a text, I would expect to be able to search for "CD.EF" and find one match. However, there is no match for //ngram:contains(., "CD.EF"), nor for //ngram:wildcard-contains(., "CD.EF").

The reason for this expectation is that the documentation for the ngram module says: "Note: a ngram match on mixed content may span multiple nodes." (this is in the documentation for the ngram:filter-matches function). Since there are no parameters when setting up an ngram index, I would expect that a search on elements with mixed content, like tei:p, would be able to find a term across tei:seg elements.

Is this a bug or am I missing something?

Any help appreciated,

Christian
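
For anyone trying to reproduce this, the index setup described above would correspond roughly to a collection.xconf like the following. This is only a sketch reconstructed from the description: binding the tei prefix to the standard TEI namespace is an assumption, and the actual configuration may contain further index definitions.

```xml
<!-- Sketch of a collection.xconf with ngram indexes on both elements.
     Assumes the tei prefix is bound to the standard TEI namespace. -->
<collection xmlns="http://exist-db.org/collection-config/1.0">
    <index xmlns:tei="http://www.tei-c.org/ns/1.0">
        <!-- existing index: matches are limited to a single phrase -->
        <ngram qname="tei:seg"/>
        <!-- additional index, intended to match across tei:seg boundaries -->
        <ngram qname="tei:p"/>
    </index>
</collection>
```

After a change to collection.xconf, the collection has to be reindexed before the new index definition takes effect.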