From: Joe W. <jo...@gm...> - 2020-12-02 19:01:54
Hi Christian,

Could you please try:

//tei:p[ngram:contains(., "CD.EF")]

Instead of:

//ngram:contains(., "CD.EF")

Joe

On Wed, Dec 2, 2020 at 4:41 AM Christian Wittern <cwi...@gm...> wrote:

> Dear eXist users,
>
> I am trying to improve search recall for my application, which contains
> premodern Chinese text. My documents are set up to have <tei:seg>
> elements for every phrase, which are in turn contained by tei:p
> elements. At the moment, I am simply defining an ngram index on tei:seg,
> but that of course limits the matches to the contents of one tei:seg
> element. To overcome this limitation, I am defining an ngram index on
> tei:p as well, in the expectation that the ngrams will be constructed by
> concatenating the tei:seg elements that make up a paragraph. So for example:
>
> <tei:p><tei:seg>ABCD.</tei:seg><tei:seg>EFGH</tei:seg></tei:p>
>
> With such a text, I would expect to be able to search for "CD.EF" and
> find one match. However, there is no match for
>
> //ngram:contains(., "CD.EF")
>
> and also none with
>
> //ngram:wildcard-contains(., "CD.EF")
>
> The reason for this assumption is that the documentation for the ngram
> module says:
>
> "Note: a ngram match on mixed content may span multiple nodes." (this
> is in the documentation for the ngram:filter-matches function).
>
> Since there are no parameters when setting up an ngram index, I would
> expect that elements with mixed content like the tei:p element would be
> able to find a term across tei:seg elements.
>
> Is this a bug or am I missing something? Any help appreciated,
>
> Christian
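
For context, the index setup described in the quoted message (an ngram index on both tei:seg and tei:p) would typically be declared in the collection's collection.xconf roughly as follows. This is only a minimal sketch based on that description, not Christian's actual configuration; it assumes the documents use the standard TEI namespace and that no other indexes are defined.

```xml
<collection xmlns="http://exist-db.org/collection-config/1.0">
    <!-- Assumption: tei is bound to the standard TEI namespace -->
    <index xmlns:tei="http://www.tei-c.org/ns/1.0">
        <!-- Existing index: matches are limited to a single phrase -->
        <ngram qname="tei:seg"/>
        <!-- Additional index on the paragraph, intended to allow
             matches that cross tei:seg boundaries -->
        <ngram qname="tei:p"/>
    </index>
</collection>
```

With a configuration like this, a query has to address the indexed element itself, as in //tei:p[ngram:contains(., "CD.EF")], for the tei:p index to be used. Documents already stored in the collection need to be reindexed after the configuration changes.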