[Grub-develop] Wagner: Ranking mechanism for the words

To Wagner:

-----Original Message-----
From: Wagner Teixeira [mailto:wa...@wt...]
Sent: Wednesday, September 06, 2000 1:58 PM
To: Igor Stojanovski
Subject: RE: Ranking system

> We are getting close to the point when the server will actually have the
> capability to index.  In order to do this, we need a module that will take
> the contents of a page (as a stream of bytes), and do a ranking based on
> preset parameters.  It will have to be able to extract the significant text
> from the HTML tags to base the ranking upon.

Ok, I understand the point. I'll try to make a generic class that will later
allow ranking of other kinds of documents, like MSWord, PDF, PostScript,
etc.
[ozra] That's a good idea.  But focus on HTML for now.

What I understood is: I'll make a parser that will return all relevant words
with their ranking position (0 to 10, for instance), so the caller can index
each word with its ranking position. The user will get this position when he
searches and chooses the best documents based on the engine's ranking, right?
[ozra] Yes.  Say, we will not process beyond the first 1000 words of any
particular document.
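
A minimal sketch of the "extract the significant text from the HTML tags"
step, with the 1000-word cap applied.  Everything named here (extract_words,
word_cb, MAX_WORDS, MAX_WORD_LEN) is hypothetical and not part of any existing
Grub code.  It assumes the whole page sits in one buffer, treats any run of
alphanumeric bytes outside <...> as a word, and ignores comments, scripts,
entities, and tags split across two buffers.

```c
#include <ctype.h>
#include <stddef.h>

#define MAX_WORDS    1000  /* cap suggested above */
#define MAX_WORD_LEN 64    /* assumption: longer words are truncated */

/* Hypothetical callback: receives each word and its 0-based position. */
typedef void (*word_cb)(const char *word, unsigned short position, void *ctx);

/* Report the words found outside <...> tags in buf[0..len).
 * cb may be NULL, in which case words are only counted.
 * Returns the number of words seen (at most MAX_WORDS). */
static unsigned extract_words(const char *buf, size_t len, word_cb cb, void *ctx)
{
    char word[MAX_WORD_LEN + 1];
    size_t wlen = 0;
    unsigned count = 0;
    int in_tag = 0;

    for (size_t i = 0; i <= len && count < MAX_WORDS; i++) {
        /* Treat the end of the buffer as a separator to flush the last word. */
        int c = (i < len) ? (unsigned char)buf[i] : ' ';

        if (in_tag) {
            if (c == '>')
                in_tag = 0;
            continue;
        }
        if (isalnum(c)) {
            if (wlen < MAX_WORD_LEN)
                word[wlen++] = (char)c;
            continue;
        }
        if (c == '<')
            in_tag = 1;     /* a tag also ends the current word */
        if (wlen > 0) {     /* any non-alphanumeric byte ends a word */
            word[wlen] = '\0';
            if (cb)
                cb(word, (unsigned short)count, ctx);
            count++;
            wlen = 0;
        }
    }
    return count;
}
```

A real parser would also have to remember which tag it is inside (TITLE,
ANCHOR, etc.) so that each word can be given a type; this sketch only shows
the tag skipping and the word limit.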

This may be a little hard to swallow, but it's my suggestion on how the Ranker
may be implemented.  This is not set in stone, and you may want to implement
it in a different way. I am sure you will get at least a few tips from my
suggestion.

Sample use:

{
  // this is a class that implements the module;
  // process_word, compute_weight, sum_weights are
  //    function pointers (see below)
  // first argument could be CUMULATIVE or NONCUMULATIVE (see below)
  Ranker rank( CUMULATIVE, process_word, compute_weight, sum_weights );

  rank.begin_doc( "http://www.yahoo.com", 203314002 ); // URL ID

  while ( more input from a client ) {

    rank.process( buffer from the input stream );
  }

  rank.end_doc();
  ...
}

==============
About process_word()

// this struct is used as a parameter in process_word (see below)
struct word_info {

	char *the_word;  // probably word ID would be better
	char *from_url;  // probably URL ID would be better
	enum word_type type; // ex: REGULAR, TITLE, ANCHOR, etc.
	                     // tells where the current word is positioned
	                     // if CUMULATIVE, then valid values are only
	                     // REGULAR and TITLE, where TITLE means that at
	                     // least one occurrence of the word is contained
	                     // in the title (to enable searching for URLs
	                     // by title).
	unsigned short position;
	unsigned short weight;
	unsigned short occurance;
};

// this function is called for every single word that is encountered in a doc;
// info is the parameter which gives all needed info about the word being
// processed
int process_word( struct word_info *info ) {

	// do an insert to the database about the word,
	// don't worry about that

	// return -1 if a fatal error occurred with the DB
	return 0;
}

==============
About compute_weight()

This is a user-defined function which computes the weight for a single word.
It may use the word_info struct, where the type and position are given, and
the weight is then computed.

For example:

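void compute_weight( struct word_info *info ) {

	switch ( info->type ) {
	case REGULAR:
	  info->weight = 1;
	  break;
	case TITLE:
	  info->weight = 5;
	  break;
	default:
	  info->weight = 0;  // other types (e.g. ANCHOR) are not weighted here
	  break;
	}

	// words nearer the start of the document get a bonus of up to 10
	info->weight += ( info->position > 100 ) ? 0 :
	                ( 100 - info->position ) / 10;
}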
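
The position bonus in that example relies on integer division, which is easy
to misread.  Below is a small sketch that isolates it; position_bonus is a
hypothetical helper name, not something the Ranker would need to expose.

```c
/* Position bonus from the compute_weight() example: 10 only for position 0,
 * 9 for positions 1-10, 8 for positions 11-20, and so on down to 0 from
 * position 91 onward. */
static unsigned short position_bonus(unsigned short position)
{
    if (position > 100)
        return 0;
    return (unsigned short)((100 - position) / 10);  /* integer division */
}
```

This is why, in the sample further down, the words at positions 1 through 4
all get the same bonus of 9.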

==============
About (NON)CUMULATIVE

If CUMULATIVE is chosen, then Ranker will not repeat the same words, but
rather build a summary for each unique one, and then call process_word() once
per unique word.  The weight will then be the total accumulated by
sum_weights().  NONCUMULATIVE will call process_word() for each single
occurrence, and sum_weights() will not be used.

==============
About sum_weights()

This function will be called only if CUMULATIVE is used, in order to compute
relevance for multiple instances of the same word.  For example, if a word
appears a thousand times and we simply sum up the weights of its occurrences,
then we would get a really big weight for that particular word in the
document.  Instead we may want to implement a function that adds less
and less weight as the occurrences of a word grow.  The example below simply
doesn't add any more relevance to a word after the tenth occurrence.  Note
that this is not an attempt to get rid of spamming.

// current_total_weight is the total weight for a single word
void sum_weights( int *current_total_weight,
	  struct word_info *next_word_info )
{
	// example code:
	// don't add any cumulative weight if word appeared
	// more than 10 times
	if ( next_word_info->occurance > 10 )
	  return;

	*current_total_weight += next_word_info->weight;
}

These functions will be called pretty often.  For the sake of efficiency, we
may consider making them macros, which would not break modularity, and would
make them compile inline.
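
As one possible take on "add less and less weight as the occurrences grow",
here is a sketch that divides each new weight by the occurrence number.  The
name sum_weights_damped and the 1/n damping rule are assumptions made for
illustration, as is the convention that occurance is 1 for the first instance
of a word.  It is written as a C99 static inline function, which gets the
inlining benefit without turning the function into a macro.

```c
/* Only the fields this sketch needs; they mirror word_info above
 * (the field spelling "occurance" is kept from the original struct). */
struct word_info {
    unsigned short weight;
    unsigned short occurance;  /* assumed 1-based: 1 for the first instance */
};

/* The n-th occurrence of a word contributes weight / n (integer division),
 * so repeated words still count, but each repeat counts for less. */
static inline void sum_weights_damped(int *current_total_weight,
                                      const struct word_info *next_word_info)
{
    if (next_word_info->occurance == 0)
        return;  /* guard against division by zero on unset input */

    *current_total_weight += next_word_info->weight / next_word_info->occurance;
}
```

With a weight of 10 per occurrence, the first three occurrences add 10, 5 and
3, and anything from the eleventh occurrence on adds nothing.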

==============
Here is a sample, using the functions above:

<html>
  <head>
    <title>The Title</title>
  </head>
  <body>
    <p>The United <b>States</b></p>
  </body>
</html>

NONCUMULATIVE: process_word() will be called with the following info:
(occurance is probably not important in this case)

The    -- position = 0, type = TITLE,   weight = 5 + 10 = 15
Title  -- position = 1, type = TITLE,   weight = 5 + 9  = 14
The    -- position = 2, type = REGULAR, weight = 1 + 9  = 10
United -- position = 3, type = REGULAR, weight = 1 + 9  = 10
States -- position = 4, type = REGULAR, weight = 1 + 9  = 10

CUMULATIVE: process_word() will be called with the following info:
(note that position has two values; this will be needed for phrase searches)

The    -- occurrences = 2, position = (0, 2), type = TITLE,   weight = 15 + 10 = 25
Title  -- occurrences = 1, position = 1,      type = TITLE,   weight = 14
United -- occurrences = 1, position = 3,      type = REGULAR, weight = 10
States -- occurrences = 1, position = 4,      type = REGULAR, weight = 10
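
The CUMULATIVE numbers above can be reproduced with a short sketch.  The
names summarize, struct summary and MAX_UNIQUE are hypothetical, the lookup is
a plain linear search with case-sensitive matching, and the list of positions
per word is left out to keep it short.  compute_weight() is written with a
break after each case so that the weights match the table.

```c
#include <string.h>

enum word_type { REGULAR, TITLE };

struct word_info {
    const char *the_word;
    enum word_type type;
    unsigned short position;
    unsigned short weight;
    unsigned short occurance;  /* field spelling kept from the original */
};

static void compute_weight(struct word_info *info)
{
    switch (info->type) {
    case REGULAR: info->weight = 1; break;
    case TITLE:   info->weight = 5; break;
    }
    info->weight += (info->position > 100) ? 0 : (100 - info->position) / 10;
}

static void sum_weights(int *current_total_weight,
                        const struct word_info *next_word_info)
{
    if (next_word_info->occurance > 10)
        return;  /* no extra weight after the tenth occurrence */
    *current_total_weight += next_word_info->weight;
}

#define MAX_UNIQUE 16  /* arbitrary limit for this sketch */

struct summary {
    const char *word;
    enum word_type type;        /* TITLE if any occurrence was in the title */
    int total_weight;
    unsigned short occurrences;
};

/* Fold n word occurrences into one summary per unique word.
 * out must have room for MAX_UNIQUE entries.
 * Returns the number of unique words, or -1 if out is full. */
static int summarize(struct word_info *words, int n, struct summary *out)
{
    int unique = 0;

    for (int i = 0; i < n; i++) {
        int j = 0;
        while (j < unique && strcmp(out[j].word, words[i].the_word) != 0)
            j++;
        if (j == unique) {              /* first time this word is seen */
            if (unique == MAX_UNIQUE)
                return -1;
            out[j].word = words[i].the_word;
            out[j].type = REGULAR;
            out[j].total_weight = 0;
            out[j].occurrences = 0;
            unique++;
        }
        words[i].occurance = ++out[j].occurrences;
        compute_weight(&words[i]);
        sum_weights(&out[j].total_weight, &words[i]);
        if (words[i].type == TITLE)
            out[j].type = TITLE;
    }
    return unique;
}
```

Feeding it the five words of the sample page in document order gives four
summaries, with "The" ending up as TITLE with two occurrences and a total
weight of 25.  In CUMULATIVE mode the Ranker would then call process_word()
once per summary.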

The words to be ranked and indexed are located within the TITLE, HEAD, or
possibly the META tags.

That's it for now.  The way the database is set right now, I think that the
first way (NONCUMULATIVE) would work better for it, and it is easier to
implement.  Give it a thought and let me know what you think.

Cheers,

ozra.

--------------------------------------------------------------
Igor Stojanovski                                 Grub.Org Inc.
Chief Technical Officer                 5100 N. Brookline #830
                                       Oklahoma City, OK 73112
oz...@gr...                            Voice: (405) 917-9894
http://www.grub.org                        Fax: (405) 848-5477
--------------------------------------------------------------