From: Quim S. <qs...@gt...> - 2001-10-17 14:29:42
> -----Original Message-----
> From: htd...@li...
> [mailto:htd...@li...] On behalf of Tod Thomas
> Sent: Wednesday, 17 October 2001 15:23
> To: htd...@li...
> Subject: [htdig-dev] Bayesian Algorithm Part II

[snip]

> A common marketing thread for a number of the vendors was the
> utilization of the 'bayesian algorithm'.

Some vendors, like Autonomy, do use this kind of technique in their
search engines. 'Bayesian', or probabilistic, models are computed from
the indexed documents. The longer the query, the better the results.
As with vector-space similarity measures, this is particularly useful
when searching for 'documents similar to this one' or when classifying
documents. Results tend to be poorer with short queries.

> I have to admit that those products tended to outperform
> the others, particularly when it came to language specific
> searches. It was almost as if the search could understand the concept behind
> the request, not just the words.

In fact, what is evaluated is the 'blend' of the words, i.e. the
roughly estimated probability of finding a given set of words in each
document.

> That's when I began to wonder if anybody in the htdig
> development community had looked into implementing 'bayesian'
> searching, or if htdig could do 'it', hence my vague post.

The htdig databases are positively not prepared to deal with such
techniques, IMHO. They are not intended to be. The power of htdig is
based on 'classical' boolean queries.

> My theory was that most new market trends (worth paying
> attention to) are usually already, or quickly will be, reflected in the open
> source development community.

Entering deep water -- opinion unreliable from this point :)

My perception is rather the inverse. Open source has traditionally
been bound to research and innovation. It is now being used by
companies as an innovation channel, so market trends emerge later
from there...

> I suspected this might be the case with htdig
> too.
> Given the individual response that I got there is at least some
> interest so by elaborating a little on my original post maybe I can learn more. Comments?
>

Closely related to the probabilistic approach is Xapian, a.k.a. Omseek,
a.k.a. Omsee, a.k.a. Open Muscat, an open-source project intended as a
probabilistic search-engine framework. It was initially financed by
Brightstation, but was left on its own some time ago. It now lives on
SourceForge.

More in the research field, there's libbow/rainbow by Andrew McCallum
et al. from CMU, which includes Bayesian classifiers, vector-space
algorithms, and other nice artifacts.

Here at gtd, we're experimenting internally with some new
vector-space-based search and classification algorithms. What we have
looks quite promising, but AFAIK it's not going to be open-sourced --
for now.

-- Quim
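
For illustration only: the probabilistic ranking discussed in this message (estimating how likely a given set of query words is under each document) can be sketched as below. This is a minimal sketch, assuming a unigram query-likelihood model with add-one smoothing. It is not how htdig, Autonomy, or Xapian work internally, and the function names and toy documents are invented for the example.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (deliberately naive)."""
    return text.lower().split()


def query_log_likelihood(query, doc_tokens, vocab_size):
    """Log-probability of the query words under a unigram model of one document.

    Add-one (Laplace) smoothing keeps unseen words from zeroing the score.
    """
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    score = 0.0
    for word in tokenize(query):
        score += math.log((counts[word] + 1) / (total + vocab_size))
    return score


def rank(query, docs):
    """Return document names sorted from most to least likely to generate the query."""
    tokenized = {name: tokenize(text) for name, text in docs.items()}
    vocab = {w for toks in tokenized.values() for w in toks}
    scores = {
        name: query_log_likelihood(query, toks, len(vocab))
        for name, toks in tokenized.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy corpus, invented for illustration only.
DOCS = {
    "search": "probabilistic search engine ranks documents by query probability",
    "cooking": "boil the pasta and add salt to the water",
}

print(rank("probabilistic document search", DOCS))
```

Each query word adds one term to the sum, so a longer query gives the model more evidence with which to separate documents. That is one intuition for why such models tend to do better on long queries than on short ones.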