From: Wagner T. <wa...@wt...> - 2000-09-06 23:49:05
Igor,
That's the way I was thinking. I'll think more deeply on this and get back
to you in the next few days (tomorrow is a holiday in Brazil). I'll try to
make it as open and flexible as I can.
Cheers,
Wagner.
> -----Original Message-----
> From: gru...@li...
> [mailto:gru...@li...]On Behalf Of Igor
> Stojanovski
> Sent: Quarta-feira, 6 de Setembro de 2000 20:05
> To: gru...@li...
> Subject: [Grub-develop] Wagner: Ranking mechanism for the words
>
>
> To Wagner:
>
> -----Original Message-----
> From: Wagner Teixeira [mailto:wa...@wt...]
> Sent: Wednesday, September 06, 2000 1:58 PM
> To: Igor Stojanovski
> Subject: RE: Ranking system
>
>
> > We are getting close to the point when the server will actually have the
> > capability to index. In order to do this, we need a module that will take
> > the contents of a page (as a stream of bytes), and do a ranking based on
> > preset parameters. It will have to be able to extract the significant
> > text from the HTML tags to base the ranking upon.
>
> Ok, I understand the point. I'll try to make a generic class that allows
> ranking of other kinds of documents in the future, like MS Word, PDF,
> PostScript, etc.
> [ozra] That's a good idea. But focus on HTML for now.
>
>
> What I understood is: I'll make a parser that will return all relevant
> words with their ranking positions (0 to 10, for instance), so the caller
> can index each word with its ranking position. The user will get this
> position when he searches and chooses the best documents based on the
> engine's ranking, right?
> [ozra] Yes. Say, we will not process beyond the first 1000 words of any
> particular document.
>
> This may be a little hard to swallow, but here is my suggestion on how the
> Ranker may be implemented. This is not set in stone, and you may want to
> implement it in a different way. I am sure you will get at least a few
> tips from my suggestion.
>
> Sample use:
>
> {
> // this is a class that implements the module;
> // process_word, compute_weight, sum_weights are
> // function pointers (see below)
> // the first argument can be CUMULATIVE or NONCUMULATIVE (see below)
> Ranker rank( CUMULATIVE, process_word, compute_weight, sum_weights );
>
> rank.begin_doc( "http://www.yahoo.com", 203314002 ); // URL ID
>
> while ( more input from a client ) {
>
> rank.process( buffer from the input stream );
> }
>
> rank.end_doc();
> ...
> }
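For concreteness, here is a minimal C++ sketch of what a Ranker class matching the sample use could look like. This is an assumption, not the actual design: only the NONCUMULATIVE path is shown, the tokenizer just splits on whitespace instead of parsing HTML, and every member beyond the names in the sample use (url_, position_, the demo callbacks count_word/simple_weight) is invented for illustration.

```cpp
#include <sstream>
#include <string>

enum word_type { REGULAR, TITLE, ANCHOR };
enum rank_mode { CUMULATIVE, NONCUMULATIVE };

struct word_info {
    const char    *the_word;   // probably word ID would be better
    const char    *from_url;   // probably URL ID would be better
    word_type      type;
    unsigned short position;
    unsigned short weight;
    unsigned short occurrence;
};

typedef int  (*process_word_fn)(word_info *);
typedef void (*compute_weight_fn)(word_info *);

class Ranker {
public:
    Ranker(rank_mode mode, process_word_fn pw, compute_weight_fn cw)
        : mode_(mode), process_word_(pw), compute_weight_(cw), position_(0) {}

    void begin_doc(const char *url, long /*url_id*/) {
        url_ = url;
        position_ = 0;
    }

    // NONCUMULATIVE path only: split the buffer on whitespace and hand
    // each word to the callbacks (a real version would parse HTML and
    // set TITLE/ANCHOR types).
    int process(const char *buffer) {
        if (mode_ != NONCUMULATIVE)
            return -1;  // only the simple path is sketched here
        std::istringstream in(buffer);
        std::string w;
        while (in >> w) {
            word_info info = {};
            info.the_word   = w.c_str();
            info.from_url   = url_.c_str();
            info.type       = REGULAR;
            info.position   = position_++;
            info.occurrence = 1;
            compute_weight_(&info);
            if (process_word_(&info) == -1)
                return -1;  // fatal DB error reported by the callback
        }
        return 0;
    }

    void end_doc() {}

private:
    rank_mode         mode_;
    process_word_fn   process_word_;
    compute_weight_fn compute_weight_;
    std::string       url_;
    unsigned short    position_;
};

// Demo callbacks for trying the class out.
static int g_words = 0;
int  count_word(word_info *info)    { (void)info; ++g_words; return 0; }
void simple_weight(word_info *info) { info->weight = (info->type == TITLE) ? 5 : 1; }
```

The point of the sketch is only that the caller never sees the parsing loop; it just feeds buffers and gets callbacks per word.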
>
>
> ==============
> About process_word()
>
> // this struct is used as a parameter in process_word (see below)
> struct word_info {
>
> char *the_word; // probably word ID would be better
> char *from_url; // probably URL ID would be better
> enum word_type type; // ex: REGULAR, TITLE, ANCHOR, etc.
> // tells where the current word is positioned.
> // if CUMULATIVE, then the only valid values are
> // REGULAR and TITLE; TITLE means that at least
> // one occurrence of the word is contained
> // in the title (to enable searching for URLs
> // by title).
> unsigned short position;
> unsigned short weight;
> unsigned short occurrence;
> };
>
>
> // this function is called for every single word that is encountered in a doc
> // info is the parameter which gives all needed info about the word in process
> int process_word( struct word_info *info ) {
>
> // do an insert to the database about the word;
> // don't worry about that
>
> // if return == -1, then a fatal error occurred with the DB
> }
>
>
> ==============
> About compute_weight()
>
> This is a user-defined function which computes the weight for a single
> word. It may use the word_info struct, where the type and position are
> given, and the weight is then computed.
>
> For example:
>
> void compute_weight( struct word_info *info ) {
>
> switch ( info->type ) {
> case REGULAR:
> info->weight = 1;
> break;
> case TITLE:
> info->weight = 5;
> break;
> }
>
> info->weight += ( info->position > 100 ) ? 0 :
> ( 100 - info->position ) / 10;
> }
>
>
> ==============
> About (NON)CUMULATIVE
>
> If CUMULATIVE is chosen, then Ranker will not repeat the same words, but
> rather build a summary for each unique one, and then call process_word().
> The weight will then be the sum of all weights. NONCUMULATIVE will call
> process_word() for each single occurrence, and sum_weights() will not be
> used.
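One way the CUMULATIVE summary could be kept internally is a per-document table keyed by word, merged on each occurrence and flushed through process_word() at end_doc(). This is only a sketch of that idea; the `summary` struct and `accumulate` helper are hypothetical names, not part of the proposed interface.

```cpp
#include <map>
#include <string>

// Per-word running summary for one document (hypothetical).
struct summary {
    unsigned occurrences;   // how many times the word was seen
    int      total_weight;  // sum of the per-occurrence weights
};

// Merge one occurrence (word, weight) into the table. In CUMULATIVE
// mode, process_word() would then be called once per table entry at
// end_doc(), with total_weight as the summed weight.
void accumulate(std::map<std::string, summary> &table,
                const std::string &word, int weight) {
    summary &s = table[word];  // zero-initialized on first insert
    s.occurrences  += 1;
    s.total_weight += weight;
}
```

Feeding in the "The" example from further below (weights 15 and 10) would leave one entry with 2 occurrences and a total weight of 25.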
>
> ==============
> About sum_weights()
>
> This function will be called only if CUMULATIVE is used, in order to
> compute the relevance of multiple instances of the same word. For
> example, if a word occurs a thousand times and we simply sum up the
> weights of the words, then we would get a really big weight for that
> particular word in the document. Instead, we may want to implement a
> function that adds less and less weight as the occurrences of a word
> grow. The example below simply doesn't add any more relevance to a word
> after the tenth occurrence. Note that this is not an attempt to get rid
> of spamming.
>
>
> // current_total_weight is the total weight for a single word
> void sum_weights( int *current_total_weight,
> struct word_info *next_word_info )
> {
> // example code:
> // don't add any cumulative weight if the word appeared
> // more than 10 times
> if ( next_word_info->occurrence > 10 )
> return;
>
> *current_total_weight += next_word_info->weight;
> }
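Instead of a hard cutoff, a possible variant of sum_weights() lets each new occurrence contribute a shrinking share, so the total grows roughly logarithmically. This is just an alternative sketch; the struct below is abbreviated to the two fields the function touches.

```cpp
// Abbreviated word_info: only the fields this variant uses.
struct word_info {
    unsigned short weight;      // per-occurrence weight
    unsigned short occurrence;  // 1-based occurrence count
};

// Diminishing-returns version of sum_weights(): occurrence n adds
// weight / n, so for weight 10 the contributions run 10, 5, 3, 2, ...
void sum_weights_diminishing( int *current_total_weight,
                              struct word_info *next_word_info )
{
    *current_total_weight +=
        next_word_info->weight / next_word_info->occurrence;
}
```

With weight 10, four occurrences accumulate 10 + 5 + 3 + 2 = 20 instead of 40, without ever flat-lining completely.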
>
> These functions will be called pretty often. For the sake of efficiency,
> we may consider making them macros, which would not break modularity, and
> would let them compile inline.
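An `inline` function gets the same call-elimination a macro would, while keeping type checking and scoping. As a sketch, here is compute_weight() from above written that way; the weights match the worked example below (TITLE at position 0 scores 15, REGULAR at position 2 scores 10). The abbreviated struct is for illustration only.

```cpp
enum word_type { REGULAR, TITLE };

// Abbreviated word_info: only the fields compute_weight touches.
struct word_info {
    word_type      type;
    unsigned short position;
    unsigned short weight;
};

// Inline alternative to a macro: base weight by type, plus a bonus
// that decays over the first 100 word positions.
inline void compute_weight( word_info *info )
{
    info->weight  = ( info->type == TITLE ) ? 5 : 1;
    info->weight += ( info->position > 100 ) ? 0
                  : ( 100 - info->position ) / 10;
}
```

The compiler is free to expand each call site just as the preprocessor would, so the modularity argument holds either way.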
>
> ==============
> Here is a sample, using the functions above:
>
> <html>
> <head>
> <title>The Title</title>
> </head>
> <body>
> <p>The United <b>States</b></p>
> </body>
> </html>
>
> NONCUMULATIVE: process_word() will be called with the following info:
> (occurrence is probably not important in this case)
>
> The -- position = 0, type = TITLE, weight = 5 + 10 = 15
> Title -- position = 1, type = TITLE, weight = 5 + 9 = 14
> The -- position = 2, type = REGULAR, weight = 1 + 9 = 10
> United -- position = 3, type = REGULAR, weight = 1 + 9 = 10
> States -- position = 4, type = REGULAR, weight = 1 + 9 = 10
>
> CUMULATIVE: process_word() will be called with the following info:
> (note that position has two values for "The"; this will be needed for
> searching on phrases)
>
> The -- occurrences = 2, position = (0, 2), type = TITLE, weight = 15+10 = 25
> Title -- occurrences = 1, position = 1, type = TITLE, weight = 14
> United -- occurrences = 1, position = 3, type = REGULAR, weight = 10
> States -- occurrences = 1, position = 4, type = REGULAR, weight = 10
>
> The words to be ranked and indexed are located either between the TITLE,
> HEAD, or possibly the META tags.
>
> That's it for now. The way the database is set up right now, I think the
> first way (NONCUMULATIVE) would work better for it, and it is easier to
> implement. Give it a thought and let me know what you think.
>
> Cheers,
>
> ozra.
>
> --------------------------------------------------------------
> Igor Stojanovski Grub.Org Inc.
> Chief Technical Officer 5100 N. Brookline #830
> Oklahoma City, OK 73112
> oz...@gr... Voice: (405) 917-9894
> http://www.grub.org Fax: (405) 848-5477
> --------------------------------------------------------------
>
> _______________________________________________
> Grub-develop mailing list
> Gru...@li...
> http://lists.sourceforge.net/mailman/listinfo/grub-develop