Re: [Exist-open] lucene queries in xml form

Dear David,

Just to be sure: are "За", "морския" and "таралеж" in the same field?

Each word (as Lucene defines a word), except for stopwords, is inserted as a term in the index, and a word in that sense never contains a space. If you query for a phrase using Lucene query syntax, you query for a sequence of terms, not for a string of space-separated words, so space normalization does not enter into the equation.
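
To illustrate the point about terms, here is a minimal sketch in plain XQuery. It assumes whitespace-only tokenization, which is only a rough stand-in for the whitespace analyzer; local:terms and local:contains-phrase are hypothetical helpers written for this example, not eXist or Lucene functions, and the stored string is a made-up pretty-printed variant of the title from the thread.

```xquery
xquery version "3.0";

(: Hypothetical helper: splits on runs of whitespace only, as a rough
   stand-in for a whitespace analyzer. Not the real Lucene tokenizer. :)
declare function local:terms($text as xs:string) as xs:string* {
    tokenize(normalize-space($text), ' ')
};

(: Hypothetical helper: true if $phrase occurs as a contiguous run of
   terms somewhere inside $field. :)
declare function local:contains-phrase(
    $field as xs:string*,
    $phrase as xs:string*
) as xs:boolean {
    some $i in 1 to count($field) - count($phrase) + 1
    satisfies deep-equal(subsequence($field, $i, count($phrase)), $phrase)
};

(: Assumed sample: the title as it might look after line wrapping. :)
let $stored := 'Физиолог. За
      морския   таралеж'
let $query := 'За морския таралеж'
return
    <result>
        <terms>{string-join(local:terms($stored), '|')}</terms>
        <match>{local:contains-phrase(local:terms($stored), local:terms($query))}</match>
    </result>
```

The extra line break and spaces in the stored string disappear once the text is split into terms, so the phrase comparison only ever sees the sequence of terms. Note that under this assumption the full stop stays attached to "Физиолог.", since nothing but whitespace separates tokens.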

By default, Lucene searches match complete terms, so special (wildcard) syntax is needed only to search for substrings, not to get exact matches. I do not see any immediate benefit in using XML query syntax for this.

Cheers,

Jens

On Aug 10, 2013, at 12:44 PM, "Birnbaum, David J" <dj...@pi...> wrote:

> Dear Dmitriy (cc eXist-open),
> 
> I am looking for an exact match, except that some long strings may have
> been wrapped during editing (pretty-printing in <oXygen/>), and therefore
> could have what for my purposes would be unwanted, extra white space
> between words. Since applying normalize-space() to all of the potential
> targets at query time would atomize, negating the benefit of using a range
> index, I asked on the list a few days ago whether there might be a way to
> build a range index that would use the results of applying
> normalize-space() when indexing, so that it wouldn't have to be done at
> query time (cf., in XSLT I can build a key using the results of applying
> normalize-space() to the elements to be returned). Dannes responded to
> that earlier post and suggested that within an eXist environment the
> Lucene index might be better for my purposes than a range index.
> 
> In production a user will input a search term to be matched not only
> exactly, but exhaustively, that is, not as a substring, so what I was
> working toward was using the <regex> capability of Lucene in eXist, so
> that I could wrap the user's input string in "^" and "$" delimiters. When
> my experimentation with <regex> failed to return results, I wasn't sure
> where the problem lay, so I tried to simplify the task, for
> trouble-shooting purposes, by using <phrase> instead of <regex>.
> According to http://exist-db.org/exist/apps/doc/lucene.xml, <phrase>
> "[s]earches for a group of terms occurring in the correct order. The
> element may either contain explicit <term> elements or text content. Text
> will be automatically tokenized into a sequence of terms." Changing the
> relevant part of the code to:
> 
>    <query>
>        <phrase>
>            <term>За</term>
>            <term>морския</term>
>            <term>таралеж</term>
>        </phrase>
>    </query>
> 
> also produces no results. Changing it to:
> 
>    <query>
>        <phrase>морския</phrase>
>    </query>
> 
> 
> does produce results.
> 
> So:
> 
> 1. A one-word phrase seems to work. What have I misunderstood about how to
> get beyond one word, to find a longer phrase?
> 
> 2. Is this a sensible strategy in the first place? My goal, as I mention
> above, is to find an exact (except for white-space-normalization),
> complete (exhaustive) match in a way that doesn't have to atomize everything
> and apply the normalize-space() function to each potential target
> individually at query time.
> 
> Thanks,
> 
> David
> __ 
> 
> From:  Dmitriy Shabanov <sha...@gm...>
> Date:  Saturday, August 10, 2013 12:14 PM
> To:  David Birnbaum <dj...@pi...>
> Cc:  "exi...@li..." <exi...@li...>
> Subject:  Re: [Exist-open] lucene queries in xml form
> 
> 
> If it was tokenized (which is the default), then each token becomes a term.
> Are you looking for an *exact* match?
> 
> On Sat, Aug 10, 2013 at 2:03 PM, Birnbaum, David J <dj...@pi...>
> wrote:
> 
> Dear existentialists,
> 
> I'm puzzled by the results of the following query, which I'm running in
> eXide (Version: 2.1dev, SVN Revision: 18374, Build: 20130416). I've
> constructed Lucene (whitespace analyzer) and range indexes for the <bg>
> element.
> 
> let $testDoc := doc('/db/repertorium/aux/titles_cyrillic.xml')//title
> let $cyrillic := 'За морския таралеж'
> let $query :=
>    <query>
>        <phrase>За морския таралеж</phrase>
>    </query>
> return
>    <results>
>        <lucene_raw>{$testDoc[ft:query(bg, $cyrillic)]}</lucene_raw>
>        <lucene_query>{$testDoc[ft:query(bg, $query)]}</lucene_query>
>        <range>{$testDoc[contains(bg,$cyrillic)]}</range>
>    </results>
> 
> The results are:
> 
> <results>
> <lucene_raw>
> <title>
> <!--  xinei  -->
> <bg>Физиолог. За морския таралеж</bg>
> <en>Physiologos. About the sea urchin</en>
> <ru>Физиолог. О морском еже</ru>
> </title>
> </lucene_raw>
> <lucene_query/>
> <range>
> <title>
> <!--  xinei  -->
> <bg>Физиолог. За морския таралеж</bg>
> <en>Physiologos. About the sea urchin</en>
> <ru>Физиолог. О морском еже</ru>
> </title>
> </range>
> </results>
> 
> That <lucene_raw> comes back with results tells me that the Lucene index
> is accessible. That <range> comes back with results tells me that the
> exact string occurs in a <bg> element. In that case, shouldn't the Lucene
> index be able to find that exact string when it is expressed as a
> <phrase>? I modeled the structure of the formatted query on the "cauldron
> boil" example at http://exist-db.org/exist/apps/doc/lucene.xml.
> 
> 
> 
> 
> -- 
> Dmitriy Shabanov
> 
> 
