From: Wagner T. <wa...@wt...> - 2000-09-06 23:49:05
Igor,

That's the way I was thinking. I'll think more deeply on this and come back in the next few days (tomorrow is a holiday in Brazil). I'll try the most open and flexible way I can.

Cheers,
Wagner.

> -----Original Message-----
> From: gru...@li...
> [mailto:gru...@li...]On Behalf Of Igor
> Stojanovski
> Sent: Wednesday, September 6, 2000 20:05
> To: gru...@li...
> Subject: [Grub-develop] Wagner: Ranking mechanism for the words
>
>
> To Wagner:
>
> -----Original Message-----
> From: Wagner Teixeira [mailto:wa...@wt...]
> Sent: Wednesday, September 06, 2000 1:58 PM
> To: Igor Stojanovski
> Subject: RE: Ranking system
>
>
> > We are getting close to the point when the server will actually have
> > the capability to index. In order to do this, we need a module that
> > will take the contents of a page (as a stream of bytes) and do a
> > ranking based on preset parameters. It will have to be able to
> > extract the significant text from the HTML tags to base the ranking
> > upon.
>
> Ok, I understand the point. I'll try to make a generic class that will
> allow ranking of other kinds of documents in the future, like MS Word,
> PDF, PostScript, etc.
>
> [ozra] That's a good idea. But focus on HTML for now.
>
> What I understood is: I'll make a parser that will return all relevant
> words, each with its ranking position (0 to 10, for instance), so the
> caller can index each word with its ranking position. The user will get
> this position when he searches, and will choose the best documents based
> on the engine's ranking, right?
>
> [ozra] Yes. Say, we will not process beyond the first 1000 words of any
> particular document.
>
> This may be a little hard to swallow, but it's my suggestion on how the
> Ranker may be implemented. This is not set in stone, and you may want to
> implement it in a different way. I am sure you will get at least a few
> tips from my suggestion.
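Wagner's idea of a generic, format-agnostic ranking class could be sketched as an abstract interface with one concrete parser per document format. This is only an illustration under assumed names (`DocumentParser`, `HtmlParser`, `RankedWord` are all hypothetical); the 0-to-10 rank scale follows the suggestion above, and the HTML handling is deliberately toy-grade:

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical word-with-rank pair, using the 0..10 scale suggested above.
struct RankedWord {
    std::string word;
    int rank;
};

// Hypothetical generic interface: one concrete subclass per document
// format (HTML now; MS Word, PDF, PostScript could be added later).
class DocumentParser {
public:
    virtual ~DocumentParser() {}
    // Extract all relevant words from a raw byte stream, each with a rank.
    virtual std::vector<RankedWord> parse(const std::string& raw) = 0;
};

// Toy HTML parser: words inside <title> get rank 10, body text gets
// rank 1, and everything inside tags is skipped.
class HtmlParser : public DocumentParser {
public:
    std::vector<RankedWord> parse(const std::string& raw) override {
        std::vector<RankedWord> out;
        std::string word;
        bool inTag = false, inTitle = false;
        for (std::size_t i = 0; i < raw.size(); ++i) {
            char c = raw[i];
            if (!inTag && std::isalnum(static_cast<unsigned char>(c))) {
                word += c;
                continue;
            }
            if (!word.empty()) {  // flush the word before tag state changes
                out.push_back({word, inTitle ? 10 : 1});
                word.clear();
            }
            if (c == '<') {
                inTag = true;
                if (raw.compare(i, 7, "<title>") == 0)  inTitle = true;
                if (raw.compare(i, 8, "</title>") == 0) inTitle = false;
            } else if (c == '>') {
                inTag = false;
            }
        }
        if (!word.empty()) out.push_back({word, inTitle ? 10 : 1});
        return out;
    }
};
```

A `PdfParser` or `WordParser` could later plug into the same interface without the indexing code changing.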
>
> Sample use:
>
> {
>     // Ranker is a class that implements the module.
>     // process_word, compute_weight and sum_weights are
>     // function pointers (see below).
>     // The first argument can be CUMULATIVE or NONCUMULATIVE (see below).
>     Ranker rank( CUMULATIVE, process_word, compute_weight, sum_weights );
>
>     rank.begin_doc( "http://www.yahoo.com", 203314002 );  // URL and its ID
>
>     while ( more input from the client ) {
>
>         rank.process( buffer from the input stream );
>     }
>
>     rank.end_doc();
>     ...
> }
>
> ==============
> About process_word()
>
> // This struct is used as a parameter to process_word() (see below).
> struct word_info {
>
>     char *the_word;        // probably a word ID would be better
>     char *from_url;        // probably a URL ID would be better
>     enum word_type type;   // e.g. REGULAR, TITLE, ANCHOR, etc.
>                            // Tells where the current word is positioned.
>                            // If CUMULATIVE, then the only valid values
>                            // are REGULAR and TITLE, where TITLE means
>                            // that at least one occurrence of the word is
>                            // contained in the title (to enable searching
>                            // for URLs by title).
>     unsigned short position;
>     unsigned short weight;
>     unsigned short occurrence;
> };
>
> // This function is called for every single word encountered in a doc.
> // info is the parameter that gives all the needed information about the
> // word being processed.
> int process_word( struct word_info *info ) {
>
>     // Do an insert into the database about the word;
>     // don't worry about that.
>
>     // A return value of -1 means a fatal error occurred with the DB.
> }
>
> ==============
> About compute_weight()
>
> This is a user-defined function which computes the weight for a single
> word. It may use the word_info struct, where the type and position are
> given, and the weight is then computed.
>
> For example:
>
> void compute_weight( struct word_info *info ) {
>
>     switch ( info->type ) {
>     case REGULAR:
>         info->weight = 1;
>         break;
>     case TITLE:
>         info->weight = 5;
>         break;
>     }
>
>     info->weight += ( info->position > 100 ) ?
>         0 : ( 100 - info->position ) / 10;
> }
>
> ==============
> About (NON)CUMULATIVE
>
> If CUMULATIVE is chosen, then the Ranker will not repeat the same words,
> but rather build a summary for each unique one, and then call
> process_word(). The weight will then be the sum of all the weights.
> NONCUMULATIVE will call process_word() for each single occurrence, and
> sum_weights() will not be used.
>
> ==============
> About sum_weights()
>
> This function will be called only if CUMULATIVE is used, in order to
> compute the relevance of multiple instances of the same word. For
> example, if a word is contained a thousand times and we simply sum up
> the weights of the words, then we would get a really big weight for that
> particular word in the document. Instead, we may want to implement a
> function that adds less and less weight as the occurrences of a word
> grow. The example below simply doesn't add any more relevance to a word
> after the tenth occurrence. Note that this is not an attempt to get rid
> of spamming.
>
> // current_total_weight is the total weight for a single word
> void sum_weights( int *current_total_weight,
>                   struct word_info *next_word_info )
> {
>     // Example code: don't add any cumulative weight if the word
>     // appeared more than 10 times.
>     if ( next_word_info->occurrence > 10 )
>         return;
>
>     *current_total_weight += next_word_info->weight;
> }
>
> These functions will be called pretty often. For the sake of efficiency,
> we may consider making them macros, which would not break modularity and
> would make them compile inline.
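A self-contained version of sum_weights() makes the cutoff behavior easy to check: a word occurring a thousand times contributes no more total weight than one occurring ten times. Only the two word_info fields that sum_weights() actually reads are kept here; the full struct is defined earlier in this message.

```cpp
#include <cassert>

// Trimmed-down word_info: only the fields sum_weights() reads.
struct word_info {
    unsigned short weight;
    unsigned short occurrence;  // 1-based count of this word so far
};

// As in the message above: ignore everything past the tenth occurrence.
void sum_weights(int *current_total_weight, struct word_info *next_word_info) {
    if (next_word_info->occurrence > 10)
        return;
    *current_total_weight += next_word_info->weight;
}
```

A damping scheme instead of a hard cutoff (say, dividing the added weight by the occurrence count) would let later occurrences still count, just less; either policy fits the same callback slot.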
>
> ==============
> Here is a sample, using the functions above:
>
> <html>
> <head>
> <title>The Title</title>
> </head>
> <body>
> <p>The United <b>States</b></p>
> </body>
> </html>
>
> NONCUMULATIVE: process_word() will be called with the following info
> (occurrence is probably not important in this case):
>
> The    -- position = 0, type = TITLE,   weight = 5 + 10 = 15
> Title  -- position = 1, type = TITLE,   weight = 5 + 9  = 14
> The    -- position = 2, type = REGULAR, weight = 1 + 9  = 10
> United -- position = 3, type = REGULAR, weight = 1 + 9  = 10
> States -- position = 4, type = REGULAR, weight = 1 + 9  = 10
>
> CUMULATIVE: process_word() will be called with the following info
> (note that position has two values; this will be needed for searching
> on phrases):
>
> The    -- occurrences = 2, position = (0, 2), type = TITLE,   weight = 15 + 10 = 25
> Title  -- occurrences = 1, position = 1,      type = TITLE,   weight = 14
> United -- occurrences = 1, position = 3,      type = REGULAR, weight = 10
> States -- occurrences = 1, position = 4,      type = REGULAR, weight = 10
>
> The words to be ranked and indexed are located between either the TITLE,
> HEAD, or possibly the META tags.
>
> That's it for now. The way the database is set up right now, I think
> that the first way (NONCUMULATIVE) would work better for it, and it is
> easier to implement. Give it a thought and let me know what you think.
>
> Cheers,
>
> ozra.
>
> --------------------------------------------------------------
> Igor Stojanovski            Grub.Org Inc.
> Chief Technical Officer     5100 N. Brookline #830
>                             Oklahoma City, OK 73112
> oz...@gr...                  Voice: (405) 917-9894
> http://www.grub.org         Fax: (405) 848-5477
> --------------------------------------------------------------
>
> _______________________________________________
> Grub-develop mailing list
> Gru...@li...
> http://lists.sourceforge.net/mailman/listinfo/grub-develop
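A compilable version of the compute_weight() sketch in the message above can be checked against the sample tables. Note that the position bonus relies on integer division, which is exactly what produces the 15 / 14 / 10 weights shown; the enum is trimmed to the two types the example uses.

```cpp
#include <cassert>

enum word_type { REGULAR, TITLE };

// Trimmed word_info: only the fields compute_weight() touches.
struct word_info {
    enum word_type type;
    unsigned short position;
    unsigned short weight;
};

void compute_weight(struct word_info *info) {
    // Base weight from where the word appears.
    switch (info->type) {
    case REGULAR: info->weight = 1; break;
    case TITLE:   info->weight = 5; break;
    }
    // Early words get a bonus: +10 at position 0, decaying to 0 by
    // position 100 (integer division).
    info->weight += (info->position > 100)
                        ? 0
                        : (100 - info->position) / 10;
}
```

For "The Title" at positions 0 and 1 this gives 5 + 10 = 15 and 5 + 9 = 14, matching the NONCUMULATIVE table in the sample.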