From: Wagner T. <wa...@wt...> - 2000-09-06 23:49:05
Igor,

That's the way I was thinking. I'll think more deeply on this and come back in the next few days (tomorrow is a holiday in Brazil). I'll try the most open and flexible way I can.

Cheers,
Wagner.

> -----Original Message-----
> From: gru...@li...
> [mailto:gru...@li...]On Behalf Of Igor
> Stojanovski
> Sent: Wednesday, September 6, 2000 20:05
> To: gru...@li...
> Subject: [Grub-develop] Wagner: Ranking mechanism for the words
>
>
> To Wagner:
>
> -----Original Message-----
> From: Wagner Teixeira [mailto:wa...@wt...]
> Sent: Wednesday, September 06, 2000 1:58 PM
> To: Igor Stojanovski
> Subject: RE: Ranking system
>
>
> > We are getting close to the point when the server will actually have
> > the capability to index. In order to do this, we need a module that
> > will take the contents of a page (as a stream of bytes) and do a
> > ranking based on preset parameters. It will have to be able to
> > extract the significant text from the HTML tags to base the ranking
> > upon.
>
> Ok, I understand the point. I'll try to make a generic class that will
> allow ranking of other kinds of documents in the future, like MS Word,
> PDF, PostScript, etc.
>
> [ozra] That's a good idea. But focus on HTML for now.
>
> What I understood is: I'll make a parser that will return all relevant
> words, each with its ranking position (0 to 10, for instance), so the
> caller can index each word with its ranking position. The user will get
> this position when he searches, and will choose the best documents based
> on the engine's ranking, right?
>
> [ozra] Yes. Say, we will not process beyond the first 1000 words of any
> particular document.
>
> This may be a little hard to swallow, but it's my suggestion on how the
> Ranker may be implemented. This is not set in stone, and you may want to
> implement it in a different way. I am sure you will get at least a few
> tips from my suggestion.
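Wagner's idea of a generic, format-agnostic ranking class could be sketched as an abstract interface with one concrete parser per document format. This is only an illustration under assumed names (`DocumentParser`, `HtmlParser`, `RankedWord` are all hypothetical); the 0-to-10 rank scale follows the suggestion above, and the HTML handling is deliberately toy-grade:

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical word-with-rank pair, using the 0..10 scale suggested above.
struct RankedWord {
    std::string word;
    int rank;
};

// Hypothetical generic interface: one concrete subclass per document
// format (HTML now; MS Word, PDF, PostScript could be added later).
class DocumentParser {
public:
    virtual ~DocumentParser() {}
    // Extract all relevant words from a raw byte stream, each with a rank.
    virtual std::vector<RankedWord> parse(const std::string& raw) = 0;
};

// Toy HTML parser: words inside <title> get rank 10, body text gets
// rank 1, and everything inside tags is skipped.
class HtmlParser : public DocumentParser {
public:
    std::vector<RankedWord> parse(const std::string& raw) override {
        std::vector<RankedWord> out;
        std::string word;
        bool inTag = false, inTitle = false;
        for (std::size_t i = 0; i < raw.size(); ++i) {
            char c = raw[i];
            if (!inTag && std::isalnum(static_cast<unsigned char>(c))) {
                word += c;
                continue;
            }
            if (!word.empty()) {  // flush the word before tag state changes
                out.push_back({word, inTitle ? 10 : 1});
                word.clear();
            }
            if (c == '<') {
                inTag = true;
                if (raw.compare(i, 7, "<title>") == 0)  inTitle = true;
                if (raw.compare(i, 8, "</title>") == 0) inTitle = false;
            } else if (c == '>') {
                inTag = false;
            }
        }
        if (!word.empty()) out.push_back({word, inTitle ? 10 : 1});
        return out;
    }
};
```

A `PdfParser` or `WordParser` could later plug into the same interface without the indexing code changing.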
>
> Sample use:
>
> {
>     // Ranker is a class that implements the module.
>     // process_word, compute_weight and sum_weights are
>     // function pointers (see below).
>     // The first argument can be CUMULATIVE or NONCUMULATIVE (see below).
>     Ranker rank( CUMULATIVE, process_word, compute_weight, sum_weights );
>
>     rank.begin_doc( "http://www.yahoo.com", 203314002 );  // URL and its ID
>
>     while ( more input from the client ) {
>
>         rank.process( buffer from the input stream );
>     }
>
>     rank.end_doc();
>     ...
> }
>
> ==============
> About process_word()
>
> // This struct is used as a parameter to process_word() (see below).
> struct word_info {
>
>     char *the_word;        // probably a word ID would be better
>     char *from_url;        // probably a URL ID would be better
>     enum word_type type;   // e.g. REGULAR, TITLE, ANCHOR, etc.
>                            // Tells where the current word is positioned.
>                            // If CUMULATIVE, then the only valid values
>                            // are REGULAR and TITLE, where TITLE means
>                            // that at least one occurrence of the word is
>                            // contained in the title (to enable searching
>                            // for URLs by title).
>     unsigned short position;
>     unsigned short weight;
>     unsigned short occurrence;
> };
>
> // This function is called for every single word encountered in a doc.
> // info is the parameter that gives all the needed information about the
> // word being processed.
> int process_word( struct word_info *info ) {
>
>     // Do an insert into the database about the word;
>     // don't worry about that.
>
>     // A return value of -1 means a fatal error occurred with the DB.
> }
>
> ==============
> About compute_weight()
>
> This is a user-defined function which computes the weight for a single
> word. It may use the word_info struct, where the type and position are
> given, and the weight is then computed.
>
> For example:
>
> void compute_weight( struct word_info *info ) {
>
>     switch ( info->type ) {
>     case REGULAR:
>         info->weight = 1;
>         break;
>     case TITLE:
>         info->weight = 5;
>         break;
>     }
>
>     info->weight += ( info->position > 100 ) ?
>         0 : ( 100 - info->position ) / 10;
> }
>
> ==============
> About (NON)CUMULATIVE
>
> If CUMULATIVE is chosen, then the Ranker will not repeat the same words,
> but rather build a summary for each unique one, and then call
> process_word(). The weight will then be the sum of all the weights.
> NONCUMULATIVE will call process_word() for each single occurrence, and
> sum_weights() will not be used.
>
> ==============
> About sum_weights()
>
> This function will be called only if CUMULATIVE is used, in order to
> compute the relevance of multiple instances of the same word. For
> example, if a word is contained a thousand times and we simply sum up
> the weights of the words, then we would get a really big weight for that
> particular word in the document. Instead, we may want to implement a
> function that adds less and less weight as the occurrences of a word
> grow. The example below simply doesn't add any more relevance to a word
> after the tenth occurrence. Note that this is not an attempt to get rid
> of spamming.
>
> // current_total_weight is the total weight for a single word
> void sum_weights( int *current_total_weight,
>                   struct word_info *next_word_info )
> {
>     // Example code: don't add any cumulative weight if the word
>     // appeared more than 10 times.
>     if ( next_word_info->occurrence > 10 )
>         return;
>
>     *current_total_weight += next_word_info->weight;
> }
>
> These functions will be called pretty often. For the sake of efficiency,
> we may consider making them macros, which would not break modularity and
> would make them compile inline.
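A self-contained version of sum_weights() makes the cutoff behavior easy to check: a word occurring a thousand times contributes no more total weight than one occurring ten times. Only the two word_info fields that sum_weights() actually reads are kept here; the full struct is defined earlier in this message.

```cpp
#include <cassert>

// Trimmed-down word_info: only the fields sum_weights() reads.
struct word_info {
    unsigned short weight;
    unsigned short occurrence;  // 1-based count of this word so far
};

// As in the message above: ignore everything past the tenth occurrence.
void sum_weights(int *current_total_weight, struct word_info *next_word_info) {
    if (next_word_info->occurrence > 10)
        return;
    *current_total_weight += next_word_info->weight;
}
```

A damping scheme instead of a hard cutoff (say, dividing the added weight by the occurrence count) would let later occurrences still count, just less; either policy fits the same callback slot.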
>
> ==============
> Here is a sample, using the functions above:
>
> <html>
> <head>
> <title>The Title</title>
> </head>
> <body>
> <p>The United <b>States</b></p>
> </body>
> </html>
>
> NONCUMULATIVE: process_word() will be called with the following info
> (occurrence is probably not important in this case):
>
> The    -- position = 0, type = TITLE,   weight = 5 + 10 = 15
> Title  -- position = 1, type = TITLE,   weight = 5 + 9  = 14
> The    -- position = 2, type = REGULAR, weight = 1 + 9  = 10
> United -- position = 3, type = REGULAR, weight = 1 + 9  = 10
> States -- position = 4, type = REGULAR, weight = 1 + 9  = 10
>
> CUMULATIVE: process_word() will be called with the following info
> (note that position has two values; this will be needed for searching
> on phrases):
>
> The    -- occurrences = 2, position = (0, 2), type = TITLE,   weight = 15 + 10 = 25
> Title  -- occurrences = 1, position = 1,      type = TITLE,   weight = 14
> United -- occurrences = 1, position = 3,      type = REGULAR, weight = 10
> States -- occurrences = 1, position = 4,      type = REGULAR, weight = 10
>
> The words to be ranked and indexed are located between either the TITLE,
> HEAD, or possibly the META tags.
>
> That's it for now. The way the database is set up right now, I think
> that the first way (NONCUMULATIVE) would work better for it, and it is
> easier to implement. Give it a thought and let me know what you think.
>
> Cheers,
>
> ozra.
>
> --------------------------------------------------------------
> Igor Stojanovski            Grub.Org Inc.
> Chief Technical Officer     5100 N. Brookline #830
>                             Oklahoma City, OK 73112
> oz...@gr...                  Voice: (405) 917-9894
> http://www.grub.org         Fax: (405) 848-5477
> --------------------------------------------------------------
>
> _______________________________________________
> Grub-develop mailing list
> Gru...@li...
> http://lists.sourceforge.net/mailman/listinfo/grub-develop
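A compilable version of the compute_weight() sketch in the message above can be checked against the sample tables. Note that the position bonus relies on integer division, which is exactly what produces the 15 / 14 / 10 weights shown; the enum is trimmed to the two types the example uses.

```cpp
#include <cassert>

enum word_type { REGULAR, TITLE };

// Trimmed word_info: only the fields compute_weight() touches.
struct word_info {
    enum word_type type;
    unsigned short position;
    unsigned short weight;
};

void compute_weight(struct word_info *info) {
    // Base weight from where the word appears.
    switch (info->type) {
    case REGULAR: info->weight = 1; break;
    case TITLE:   info->weight = 5; break;
    }
    // Early words get a bonus: +10 at position 0, decaying to 0 by
    // position 100 (integer division).
    info->weight += (info->position > 100)
                        ? 0
                        : (100 - info->position) / 10;
}
```

For "The Title" at positions 0 and 1 this gives 5 + 10 = 15 and 5 + 9 = 14, matching the NONCUMULATIVE table in the sample.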