Re: [Exist-open] ngram index with mixed content

Hi Christian,

Could you please try:

  //tei:p[ngram:contains(., "CD.EF")]

Instead of:

  //ngram:contains(., "CD.EF")
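
In case it helps, here is a minimal sketch of the first form as a complete query. The collection path is hypothetical, and I am assuming the ngram index on tei:p is already configured and that the ngram prefix is bound by eXist as in your original query:

```xquery
xquery version "3.1";

(: Standard TEI namespace; adjust if your documents use another one. :)
declare namespace tei = "http://www.tei-c.org/ns/1.0";

(: Hypothetical collection path; replace with the location of your data.
   The predicate sits on tei:p, the element that carries the ngram index,
   so the context item "." passed to ngram:contains is an indexed node. :)
collection("/db/apps/my-app/data")//tei:p[ngram:contains(., "CD.EF")]
```

The point of the predicate form is that ngram:contains is evaluated against each tei:p rather than used as a path step on its own.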

Joe

On Wed, Dec 2, 2020 at 4:41 AM Christian Wittern <cwi...@gm...> wrote:

> Dear eXist users,
>
> I am trying to improve search recall for my application, which contains
> premodern Chinese text.  My documents are set up to have <tei:seg>
> elements for every phrase, which are in turn contained by tei:p
> elements. At the moment, I am simply defining an ngram index on tei:seg,
> but that of course limits the matches to the contents of one tei:seg
> element.  To overcome this limitation, I am defining an ngram index on
> tei:p as well, in the expectation that the ngrams will be constructed by
> concatenating the tei:seg elements that make up a paragraph.  For example:
>
> <tei:p><tei:seg>ABCD.</tei:seg><tei:seg>EFGH</tei:seg></tei:p>
>
> With such a text, I would expect to be able to search for "CD.EF" and
> find one match. However, there is no match for
>
> //ngram:contains(., "CD.EF"), nor for //ngram:wildcard-contains(.,
> "CD.EF")
>
> The reason for this expectation is that the documentation for the ngram
> module says:
>
> "Note: a ngram match on mixed content may span multiple nodes. " (this
> is in the documentation for the ngram:filter-matches function).
>
> Since there are no parameters when setting up an ngram index, I would
> expect that elements with mixed content like the tei:p element would be
> able to find a term across tei:seg elements.
>
> Is this a bug or am I missing something?  Any help appreciated,
>
> Christian
