From: Ron V. d. B. <ron...@ka...> - 2011-08-09 14:30:24
Hi all,

I am further exploring the technique employed by the KWIC module to construct the context for displaying search hits from the <exist:match> elements that highlight the individual search results. All goes well for individual search terms, but I noticed some interesting phenomena when phrases are passed to ft:query().

In one way, phrase searches are interesting because they produce a single <exist:match> wrapper around the entire phrase. However, when phrase searches are broadened to a proximity search, apparently NONE of the phrase terms is highlighted when they are non-adjacent. (For documentation, see <http://demo.exist-db.org/exist/lucene.xml#d39580e756>.)

For example, consider the following query (based on the Shakespeare sample files shipped with eXist; results are the same in eXist-1.4.x and trunk):

    //TITLE[ft:query(., '"tomb belonging"')]/util:expand(.)

or its XML syntax counterpart:

    //TITLE[ft:query(.,<query><phrase>tomb belonging</phrase></query>)]/util:expand(.)

Both return:

    <TITLE>SCENE III. A churchyard; in it a <exist:match xmlns:exist="http://exist.sourceforge.net/NS/exist">tomb belonging</exist:match> to the Capulets.</TITLE>

OTOH, when 2 non-adjacent terms are searched for in a proximity search, no <exist:match> wrappers are injected:

    //TITLE[ft:query(., '"churchyard belonging"~5')]/util:expand(.)

or its XML syntax counterpart:

    //TITLE[ft:query(.,<query><phrase slop="5">churchyard belonging</phrase></query>)]/util:expand(.)

Both return:

    <TITLE>SCENE III. A churchyard; in it a tomb belonging to the Capulets.</TITLE>

The XML syntax provides a workaround, by isolating each search term in a <term> element:

    //TITLE[ft:query(.,<query><phrase slop="5"><term>churchyard</term> <term>belonging</term></phrase></query>)]/util:expand(.)

which returns:

    <TITLE>SCENE III. A <exist:match xmlns:exist="http://exist.sourceforge.net/NS/exist">churchyard</exist:match>; in it a tomb <exist:match xmlns:exist="http://exist.sourceforge.net/NS/exist">belonging</exist:match> to the Capulets.</TITLE>

But then, if this syntax is used for a real phrase search (with adjacent terms), it results in separate <exist:match> search highlights as well:

    //TITLE[ft:query(.,<query><phrase><term>tomb</term> <term>belonging</term></phrase></query>)]/util:expand(.)

which returns:

    <TITLE>SCENE III. A churchyard; in it a <exist:match xmlns:exist="http://exist.sourceforge.net/NS/exist">tomb</exist:match> <exist:match xmlns:exist="http://exist.sourceforge.net/NS/exist">belonging</exist:match> to the Capulets.</TITLE>

(BTW, by isolating search terms in <term> elements, <phrase> thus effectively behaves like <near>, with or without <term>:

    //TITLE[ft:query(.,<query><near>tomb belonging</near></query>)]/util:expand(.)

returns:

    <TITLE>SCENE III. A churchyard; in it a <exist:match xmlns:exist="http://exist.sourceforge.net/NS/exist">tomb</exist:match> <exist:match xmlns:exist="http://exist.sourceforge.net/NS/exist">belonging</exist:match> to the Capulets.</TITLE>

)

Of course, when constructing a query in XML syntax, one could differentiate between:

- real phrase searches (searching for adjacent terms): <phrase> without further <term> elements
- proximity searches (searching for possibly non-adjacent terms): <phrase> (or <near>) with each term in a separate <term> element

Still, I'm wondering if it wouldn't make sense to make the 'chaining' of adjacent search hits configurable, so that both <near> and <phrase> searches could return a single <exist:match> element for adjacent search terms, when told to do so. Alternatively, would it be possible to leave the current behaviour, but make <phrase> searches highlight non-adjacent terms too?

Do others perhaps have useful experience with this?
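
The distinction between real phrase searches and proximity searches described above could be sketched as a small helper that builds the XML query. This is only a minimal sketch: local:phrase-query() and its parameters are hypothetical names of my own, not part of eXist, and I am assuming that the absence of whitespace between the generated <term> elements makes no difference to ft:query().

```xquery
xquery version "1.0";

(: Hypothetical helper, not part of eXist: builds the XML query passed to ft:query(). :)
declare function local:phrase-query($terms as xs:string+, $slop as xs:integer?) as element(query) {
    <query>
        <phrase>{
            if (exists($slop)) then (
                (: proximity search: slop attribute plus one <term> per word,
                   the form that produced separate highlights in the examples above :)
                attribute slop { $slop },
                for $t in $terms return <term>{ $t }</term>
            ) else
                (: real phrase search: plain text content, no <term> elements :)
                string-join($terms, ' ')
        }</phrase>
    </query>
};

(: proximity search for two possibly non-adjacent terms :)
//TITLE[ft:query(., local:phrase-query(('churchyard', 'belonging'), 5))]/util:expand(.)
```

Passing the empty sequence as the second argument, as in local:phrase-query(('tomb', 'belonging'), ()), would yield the plain <phrase>tomb belonging</phrase> form instead.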

Kind regards,

Ron

--
Ron Van den Branden
Wetenschappelijk attaché / Senior Researcher
Reviews Editor LLC. The Journal of Digital Scholarship in the Humanities
Centrum voor Teksteditie en Bronnenstudie - CTB (KANTL)
Centre for Scholarly Editing and Document Studies
Koninklijke Academie voor Nederlandse Taal- en Letterkunde
Royal Academy of Dutch Language and Literature
Koningstraat 18 / b-9000 Gent / Belgium
tel: +32 9 265 93 51 / fax: +32 9 265 93 49
E-mail : ron...@ka...
http://www.kantl.be/ctb