[Exist-open] ngram index with mixed content

Dear eXist users,

I am trying to improve search recall for my application, which contains 
premodern Chinese text.  My documents have a <tei:seg> element for every 
phrase, and these are in turn contained in <tei:p> elements. At the 
moment, I am simply defining an ngram index on tei:seg, but that of 
course limits matches to the contents of a single tei:seg element.  To 
overcome this limitation, I am defining an ngram index on tei:p as well, 
in the expectation that the ngrams will be constructed by concatenating 
the text of the tei:seg elements that make up a paragraph.  So for example:

<tei:p><tei:seg>ABCD.</tei:seg><tei:seg>EFGH</tei:seg></tei:p>

With such a text, I would expect to be able to search for "CD.EF" and 
find one match. However, neither of the following returns a match:

//ngram:contains(., "CD.EF")
//ngram:wildcard-contains(., "CD.EF")
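
For reference, a minimal sketch of the index configuration described 
above. It assumes the indexes are declared in the collection's 
collection.xconf and that the tei prefix is bound to the standard TEI 
namespace; the actual file may differ in its details.

```xml
<collection xmlns="http://exist-db.org/collection-config/1.0">
    <!-- Assumption: tei is bound to the standard TEI P5 namespace -->
    <index xmlns:tei="http://www.tei-c.org/ns/1.0">
        <!-- Existing index: matches are limited to a single phrase -->
        <ngram qname="tei:seg"/>
        <!-- Additional index: intended to allow matches across tei:seg boundaries -->
        <ngram qname="tei:p"/>
    </index>
</collection>
```

The collection has to be reindexed after the configuration is changed 
before the new tei:p index takes effect.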

My expectation is based on the documentation for the ngram module, 
which says:

"Note: a ngram match on mixed content may span multiple nodes. " (this 
is in the documentation for the ngram:filter-matches function).

Since there are no parameters when setting up an ngram index, I would 
expect a search on an element with mixed content, such as tei:p, to 
find a term that spans several tei:seg elements.

Is this a bug or am I missing something?  Any help appreciated,

Christian
