From: Wolfgang M. <wol...@ex...> - 2011-09-08 15:47:22
Hi,

> With eXist's util:expand() function for 'materializing' search results
> in their context nodes, it is possible to build nice KWIC displays (as
> aptly illustrated in eXist's own KWIC display module). Recently [1] I
> explored ways of linguistically exploiting such KWIC searches (mostly by
> adding sorting possibilities for the left and right contexts). One step
> further would be the construction of collocation tables for all words
> occurring in certain contexts of search words. Yet, I think I am
> stumbling on eXist's current limitations here, since what would make
> such data really useful, information on the (relative) occurrence of
> words at certain context positions, is currently not available in eXist
> (and would be prohibitively expensive to compute).

After reading your post recently, I thought about how we could better
support collocation tables and similar features. One possibility would be
to provide additional information when expanding the search results: to
find the position of the full text match, eXist always needs to tokenize
the text again by passing it through Lucene's analyzer. Instead of just
highlighting the match, we could tag all preceding and following tokens
without much additional cost. This could probably also include the
position of each token relative to the match, so you would end up with
something like

    <context pos="-1">...</context> <match>...</match> <context pos="+1">...</context>

> Actually, this is
> also the case for a simple 'frequency list': while the util:index-keys()
> function does allow one to construct a list of all indexed terms, with
> the number of their absolute occurrences, I think it is ratios linguists
> are interested in most. That would require additional information on the
> total number of words occurring in the collection being queried.

I suppose Lucene does store the total number of words per indexed document
somewhere (it should be relevant for computing weights), so we could add a
function to retrieve it.

Wolfgang

P.S.: I plan to integrate your improved version of the kwic module. I just
wanted to test it on some of my existing apps first to see if it breaks
backwards compatibility or not.
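
To make the tagging idea above more concrete, here is a rough sketch of how such output could be turned into a collocation table on the client side. It is only an illustration: the <context pos="..."> and <match> markup is the proposal from this mail and not something util:expand() produces today, the sample fragment and the names collocation_table and relative are made up, and Python is used purely for convenience.

```python
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict

# Hypothetical output of an extended util:expand(). The <context pos="...">
# markup is only the proposal discussed here, not an existing eXist feature.
SAMPLE = """
<results>
  <p><context pos="-2">the</context> <context pos="-1">quick</context> <match>fox</match> <context pos="+1">jumps</context></p>
  <p><context pos="-1">red</context> <match>fox</match> <context pos="+1">jumps</context> <context pos="+2">high</context></p>
</results>
"""


def collocation_table(xml_text):
    """Count how often each token occurs at each position relative to the match."""
    table = defaultdict(Counter)
    root = ET.fromstring(xml_text.strip())
    for ctx in root.iter("context"):
        pos = int(ctx.get("pos"))  # int() accepts signed strings such as "+1" and "-1"
        token = (ctx.text or "").strip().lower()
        if token:
            table[pos][token] += 1
    return table


def relative(counter):
    """Turn the absolute counts for one position into ratios."""
    total = sum(counter.values())
    return {token: count / total for token, count in counter.items()}


table = collocation_table(SAMPLE)
for pos in sorted(table):
    print(pos, dict(table[pos]), relative(table[pos]))
```

The ratios here are relative to the matches only. Ratios against the whole collection would additionally need the total word count mentioned above, which this sketch does not have.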