From: Ron V. d. B. <ron...@ka...> - 2011-09-09 09:00:36
Hi Wolfgang,

On Thursday 8 September 2011 17:47:06, Wolfgang Meier wrote:
>
> Instead of just highlighting the match, we could tag all preceding and
> following tokens without much additional cost. This could probably
> also include the relative position of the token to the match, so you
> would end up with something like <context pos="-1">...</context>
> <match>...</match> <context pos="+1">...</context>.
>

This sounds great! Such a 'native' segmentation would undoubtedly perform much faster. I guess it would also make further interaction with such collocation data easier. For example, if a collocation table shows that "great" occurs at position 3 after the search term "eXist", I can imagine that users would want a link from there to an "exact proximity search", where "eXist" occurs exactly 3 words before "great". That's something the Lucene search syntax doesn't support, does it?

>
> I suppose Lucene does store the total number of words per indexed
> document somewhere (it should be relevant for computing weights), so
> we could add a function to retrieve it.
>

Ditto: that would be very useful!

> P.S.: I plan to integrate your improved version of the kwic module. I
> just wanted to test it on some of my existing apps first to see if it
> breaks backwards compatibility or not.

That's nice to hear. Please make sure to test the version at <http://www.kantl.be/ctb/download/kwic.xql>, which has some improvements and fixes some dumb errors compared to the one I posted on eXist-open.

Kind regards,

Ron
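
To make the position-tagged context and the "exact proximity search" idea concrete, here is a minimal sketch in Python. It is only an illustration of the concept: it does not use eXist or Lucene, the tokenizer is a naive assumption, and all function names (tokenize, tag_context, collocations, exact_proximity) are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    # Naive word tokenizer (an assumption; not Lucene's analyzer).
    return re.findall(r"\w+", text)


def tag_context(tokens, match_index, width=3):
    # Pair each token around the match with its relative position.
    # Position 0 is the match itself, -1 the token before it, +1 the one after.
    start = max(0, match_index - width)
    end = min(len(tokens), match_index + width + 1)
    return [(i - match_index, tokens[i]) for i in range(start, end)]


def collocations(text, term, width=3):
    # Count how often each (word, relative position) pair occurs
    # around every occurrence of the search term.
    tokens = tokenize(text)
    table = Counter()
    for i, token in enumerate(tokens):
        if token.lower() == term.lower():
            for pos, ctx in tag_context(tokens, i, width):
                if pos != 0:
                    table[(ctx.lower(), pos)] += 1
    return table


def exact_proximity(text, term, other, distance):
    # Return token indexes where `term` occurs and `other` occurs
    # exactly `distance` tokens later (negative distance means earlier).
    tokens = [t.lower() for t in tokenize(text)]
    hits = []
    for i, token in enumerate(tokens):
        j = i + distance
        if token == term.lower() and 0 <= j < len(tokens) and tokens[j] == other.lower():
            hits.append(i)
    return hits


text = "eXist is really great and eXist runs on Java"
table = collocations(text, "eXist")
hits = exact_proximity(text, "eXist", "great", 3)
```

With this sample text, the collocation table records "great" at position +3 relative to "eXist", and the exact proximity lookup for that pair finds the first occurrence of "eXist". A real implementation would of course work on the index rather than on re-tokenized text.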