RE: [htdig-dev] Bayesian Algorithm Part II


> -----Original Message-----
> From: htd...@li...
> [mailto:htd...@li...] On behalf of Tod Thomas
> Sent: Wednesday, October 17, 2001 15:23
> To: htd...@li...
> Subject: [htdig-dev] Bayesian Algorithm Part II
[snip]
>
> A common marketing thread for a number of the vendors was the
> utilization of the 'bayesian algorithm'.

Some vendors, like Autonomy, do use this kind of technique in their search
engines.
'Bayesian', or probabilistic, models are computed from the indexed documents.
The longer the query, the better the results. As with vector-space
similarity measures, this is particularly useful when searching for
'documents similar to this' or when classifying documents. Results tend to
be poorer with short queries.

> I have to admit that those products tended to outperform
> the others, particularly when it came to language specific
> searches.  It was almost as if the search could understand the concept
> behind the request, not just the words.

In fact, what is evaluated is the 'blend' of the words, i.e. the more or
less roughly estimated probability of finding a given set of words in each
document.
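
To make that concrete, here is a minimal sketch in Python of one common
probabilistic ranking scheme: a unigram query-likelihood model with add-one
smoothing. This is an assumption on my side for illustration only; it is not
what Autonomy or any other vendor actually ships, and the names tokenize,
query_log_likelihood and rank are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer; real engines do stemming, stop words, etc.
    return text.lower().split()


def query_log_likelihood(query, doc, vocab_size, alpha=1.0):
    # log P(query | doc) under a unigram model with additive smoothing,
    # so that unseen words do not drive the probability to zero.
    counts = Counter(tokenize(doc))
    total = sum(counts.values())
    score = 0.0
    for word in tokenize(query):
        p = (counts[word] + alpha) / (total + alpha * vocab_size)
        score += math.log(p)
    return score


def rank(query, docs):
    # Return document indices, most probable first.
    vocab = {w for d in docs for w in tokenize(d)} | set(tokenize(query))
    scored = [(query_log_likelihood(query, d, len(vocab)), i)
              for i, d in enumerate(docs)]
    return [i for _, i in sorted(scored, reverse=True)]


docs = [
    "open source search engine for web sites",
    "recipes for baking bread at home",
    "probabilistic search engine framework",
]
print(rank("probabilistic search framework", docs))
```

Each query word contributes one more factor to the estimate, which is one way
to see why longer queries tend to discriminate better between documents.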

>
> That's when I began to wonder if anybody in the htdig
> development community had looked into implementing 'bayesian'
> searching, or if htdig could do 'it', hence my vague post.
>

The htdig databases are positively not prepared to deal with such
techniques, IMHO. They are not intended to be. The power of htdig is based
on 'classical' boolean queries.
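
For contrast, a 'classical' boolean AND query over an inverted index looks
roughly like the sketch below. It is a generic illustration, not htdig's
actual database layout or code; build_index and boolean_and are hypothetical
names.

```python
from collections import defaultdict


def build_index(docs):
    # Map each word to the set of ids of the documents that contain it.
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def boolean_and(index, terms):
    # A document matches only if it contains every query term.
    postings = [index.get(t.lower(), set()) for t in terms]
    if not postings:
        return set()
    return set.intersection(*postings)


docs = [
    "open source search engine for web sites",
    "recipes for baking bread at home",
    "probabilistic search engine framework",
]
index = build_index(docs)
print(sorted(boolean_and(index, ["search", "engine"])))
```

A document either matches or it does not; the index stores no per-document
word probabilities, which is what a probabilistic model would need.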

> My theory was that most new market trends (worth paying
> attention to) are usually already, or quickly will be, reflected in the
> open source development community.

Entering deep water -- opinion unreliable from this point :)
My perception is, if anything, the inverse. Open source has traditionally
been tied to research and innovation. It's now being used by companies as
an innovation channel, so that market trends emerge later from there...

> I suspected this might be the case with htdig
> too.  Given the individual response that I got there is at least some
> interest so by elaborating a little on my original post maybe I can learn
> more.  Comments?
>

Closely related to the probabilistic stuff is Xapian, a.k.a. Omseek, a.k.a.
Omsee, a.k.a. Open Muscat, an open-source project intended as a
probabilistic search-engine framework. It was initially financed by
Brightstation and was left on its own some time ago. It now lives on
SourceForge.

More in the research field, there's libbow/rainbow by Andrew McCallum et al.
from CMU, including bayesian classifiers, vector-space algorithms, and other
nice artifacts.

Here at gtd, we're experimenting internally with some new vector-space based
search and classification algorithms. What we have looks quite promising,
but AFAIK it's not to be open-sourced -- for now.

--
Quim