From: Jens Ø. P. <oe...@gm...> - 2013-08-10 14:00:38
Dear David,

Just to be sure: are "За", "морския" and "таралеж" in the same field?

Each word (according to the Lucene definition), except stopwords, is inserted as a term in the index, and a word (according to the Lucene definition) has no space anywhere. If you query for a phrase using Lucene query syntax, you query for a sequence of terms, not for a string of space-separated words, so space normalization does not enter into the equation.

By default, Lucene searches match complete terms, so you need special (wildcard) syntax only to search for substrings, not to get exact matches.

I do not see the immediate benefits of using XML query syntax for this.

Cheers,
Jens

On Aug 10, 2013, at 12:44 PM, "Birnbaum, David J" <dj...@pi...> wrote:

> Dear Dmitriy (cc eXist-open),
>
> I am looking for an exact match, except that some long strings may have
> been wrapped during editing (pretty-printing in <oXygen/>), and therefore
> could have what for my purposes would be unwanted, extra white space
> between words. Since applying normalize-space() to all of the potential
> targets at query time would atomize, negating the benefit of using a range
> index, I asked on the list a few days ago whether there might be a way to
> build a range index that would use the results of applying
> normalize-space() when indexing, so that it wouldn't have to be done at
> query time (cf., in XSLT I can build a key using the results of applying
> normalize-space() to the elements to be returned). Dannes responded to
> that earlier post and suggested that within an eXist environment the
> Lucene index might be better for my purposes than a range index.
>
> In production a user will input a search term to be matched not only
> exactly, but exhaustively, that is, not as a substring, so what I was
> working toward was using the <regex> capability of Lucene in eXist, so
> that I could wrap the user's input string in "^" and "$" delimiters.
> When my experimentation with <regex> failed to return results, I wasn't sure
> where the problem lay, so I tried to simplify the task, for
> trouble-shooting purposes, by using <phrase> instead of <regex>.
> According to http://exist-db.org/exist/apps/doc/lucene.xml, <phrase>
> "[s]earches for a group of terms occurring in the correct order. The
> element may either contain explicit <term> elements or text content. Text
> will be automatically tokenized into a sequence of terms." Changing the
> relevant part of the code to:
>
> <query>
>   <phrase>
>     <term>За</term>
>     <term>морския</term>
>     <term>таралеж</term>
>   </phrase>
> </query>
>
> also produces no results. Changing it to:
>
> <query>
>   <phrase>морския</phrase>
> </query>
>
> does produce results.
>
> So:
>
> 1. A one-word phrase seems to work. What have I misunderstood about how to
> get beyond one word, to find a longer phrase?
>
> 2. Is this a sensible strategy in the first place? My goal, as I mention
> above, is to find an exact (except for white-space-normalization),
> complete (exhaustive) match in a way that doesn't have to atomize everything
> and apply the normalize-space() function to each potential target
> individually at query time.
>
> Thanks,
>
> David
> __
>
> From: Dmitriy Shabanov <sha...@gm...>
> Date: Saturday, August 10, 2013 12:14 PM
> To: David Birnbaum <dj...@pi...>
> Cc: "exi...@li..." <exi...@li...>
> Subject: Re: [Exist-open] lucene queries in xml form
>
> If it was tokenized (it is by default), then each token becomes a term. Are you
> looking for an *exact* match?
>
> On Sat, Aug 10, 2013 at 2:03 PM, Birnbaum, David J <dj...@pi...>
> wrote:
>
> Dear existentialists,
>
> I'm puzzled by the results of the following query, which I'm running in
> eXide (Version: 2.1dev, SVN Revision: 18374, Build: 20130416). I've
> constructed Lucene (whitespace analyzer) and range indexes for the <bg>
> element.
>
> let $testDoc := doc('/db/repertorium/aux/titles_cyrillic.xml')//title
> let $cyrillic := 'За морския таралеж'
> let $query :=
>   <query>
>     <phrase>За морския таралеж</phrase>
>   </query>
> return
>   <results>
>     <lucene_raw>{$testDoc[ft:query(bg, $cyrillic)]}</lucene_raw>
>     <lucene_query>{$testDoc[ft:query(bg, $query)]}</lucene_query>
>     <range>{$testDoc[contains(bg,$cyrillic)]}</range>
>   </results>
>
> The results are:
>
> <results>
>   <lucene_raw>
>     <title>
>       <!-- xinei -->
>       <bg>Физиолог. За морския таралеж</bg>
>       <en>Physiologos. About the sea urchin</en>
>       <ru>Физиолог. О морском еже</ru>
>     </title>
>   </lucene_raw>
>   <lucene_query/>
>   <range>
>     <title>
>       <!-- xinei -->
>       <bg>Физиолог. За морския таралеж</bg>
>       <en>Physiologos. About the sea urchin</en>
>       <ru>Физиолог. О морском еже</ru>
>     </title>
>   </range>
> </results>
>
> That <lucene_raw> comes back with results tells me that the Lucene index
> is accessible. That <range> comes back with results tells me that the
> exact string occurs in a <bg> element. In that case, shouldn't the Lucene
> index be able to find that exact string when it is expressed as a
> <phrase>? I modeled the structure of the formatted query on the "cauldron
> boil" example at http://exist-db.org/exist/apps/doc/lucene.xml.
>
> --
> Dmitriy Shabanov
>
> _______________________________________________
> Exist-open mailing list
> Exi...@li...
> https://lists.sourceforge.net/lists/listinfo/exist-open
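
A rough illustration of the point made in the reply at the top of this message, that a phrase query compares a sequence of terms rather than a space-separated string: the sketch below is plain Python, not eXist or Lucene code, and only imitates whitespace tokenization. The function names (whitespace_terms, contains_phrase, is_exhaustive_match) are hypothetical and exist only for this example. It assumes that terms are produced by splitting on runs of whitespace, which is roughly what a whitespace analyzer does; it does not model stopwords, lowercasing, or anything else a real analyzer might apply.

```python
def whitespace_terms(text):
    # Assumption: a term is any maximal run of non-whitespace characters.
    # str.split() with no argument splits on any run of spaces, tabs or
    # newlines, so line wrapping and extra blanks disappear at this step.
    return text.split()


def contains_phrase(field_text, phrase):
    # True if the terms of `phrase` occur consecutively, in order,
    # somewhere in the terms of `field_text`.
    terms = whitespace_terms(field_text)
    wanted = whitespace_terms(phrase)
    if not wanted:
        return False
    n = len(wanted)
    return any(terms[i:i + n] == wanted for i in range(len(terms) - n + 1))


def is_exhaustive_match(field_text, phrase):
    # True only if the whole field consists of exactly the phrase's terms.
    return whitespace_terms(field_text) == whitespace_terms(phrase)


# A <bg> value as it might look after pretty-printing wrapped it.
wrapped = "Физиолог.   За морския\n      таралеж"

print(contains_phrase(wrapped, "За морския таралеж"))
print(is_exhaustive_match(wrapped, "За морския таралеж"))
print(is_exhaustive_match(wrapped, "Физиолог. За морския таралеж"))
```

Under these assumptions the wrapped value and the unwrapped search string yield the same terms, so no normalize-space() step is needed for the comparison. The sketch also shows that a phrase match and an exhaustive match are different questions: the phrase is found inside the field, but it is not the whole field.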