From: Igor S. <oz...@gr...> - 2000-09-06 23:01:17
To Wagner:

-----Original Message-----
From: Wagner Teixeira [mailto:wa...@wt...]
Sent: Wednesday, September 06, 2000 1:58 PM
To: Igor Stojanovski
Subject: RE: Ranking system

> We are getting close to the point when the server will actually have the
> capability to index. In order to do this, we need a module that will take
> the contents of a page (as a stream of bytes), and do a ranking based on
> preset parameters. It will have to be able to extract the significant text
> from the HTML tags to base the ranking upon.

Ok, I understand the point. I'll try to make a generic class that will allow
ranking of other kinds of documents in the future, like MSWord, PDF,
PostScript, etc.

[ozra] That's a good idea. But focus on HTML for now.

What I understood is: I'll make a parser that will return all relevant words
with their ranking position (0 to 10, for instance), so the caller can index
each word with its ranking position. The user will get this position when he
searches and chooses the best documents based on the engine's ranking, right?

[ozra] Yes. Say, we will not process beyond the first 1000 words of any
particular document.

This may be a little hard to swallow, but it's my suggestion on how the
Ranker may be implemented. This is not set in stone, and you may want to
implement it in a different way. I am sure you will get at least a few tips
from my suggestion.

Sample use:

{
    // Ranker is a class that implements the module;
    // process_word, compute_weight, sum_weights are
    // function pointers (see below)
    // first argument is either CUMULATIVE or NONCUMULATIVE (see below)
    Ranker rank( CUMULATIVE, process_word, compute_weight, sum_weights );

    rank.begin_doc( "http://www.yahoo.com", 203314002 );  // URL ID
    while ( more input from a client ) {
        rank.process( buffer from the input stream );
    }
    rank.end_doc();
    ...
}

==============
About process_word()

// this struct is used as a parameter in process_word (see below)
struct word_info {
    char *the_word;          // probably word ID would be better
    char *from_url;          // probably URL ID would be better
    enum word_type type;     // ex: REGULAR, TITLE, ANCHOR, etc.
                             // tells where the current word is positioned
                             // if CUMULATIVE, then valid values are only
                             // REGULAR and TITLE, where TITLE means that at
                             // least one occurrence of the word is contained
                             // in the title (to enable searching for URLs
                             // by title).
    unsigned short position;
    unsigned short weight;
    unsigned short occurance;
};

// this function is called for every single word that is encountered in a doc
// info is the parameter which gives all needed info about the word in process
int process_word( struct word_info *info ) {
    // do an insert to the database about the word,
    // don't worry about that
    // if return == -1, then a fatal error occurred with the DB
    return 0;
}

==============
About compute_weight()

This is a user-defined function which computes the weight for a single word.
It may use the word_info struct, where the type and position are given, and
the weight is then computed. For example:

void compute_weight( struct word_info *info ) {
    switch ( info->type ) {
        case REGULAR: info->weight = 1; break;
        case TITLE:   info->weight = 5; break;
    }
    info->weight += ( info->position > 100 ) ? 0 : ( 100 - info->position ) / 10;
}

==============
About (NON)CUMULATIVE

If CUMULATIVE is chosen, then Ranker will not repeat the same words, but
rather build a summary for each unique one, and then call process_word().
The weight will then be the sum of all weights. NONCUMULATIVE will call
process_word() for each single occurrence, and sum_weights() will not be used.

==============
About sum_weights()

This function will be called only if CUMULATIVE is used, in order to compute
relevance for multiple instances of the same word.
For example, if a word occurs a thousand times and we simply sum up the
weights of the occurrences, then we would get a really big weight for that
particular word in the document. Instead we may want to implement a function
that adds less and less weight as the occurrences of a word grow. The example
below simply doesn't add any more relevance to a word after the tenth
occurrence. Note that this is not an attempt to get rid of spamming.

// current_total_weight is the total weight for a single word
void sum_weights( int *current_total_weight, struct word_info *next_word_info ) {
    // example code:
    // don't add any cumulative weight if the word appeared
    // more than 10 times
    if ( next_word_info->occurance > 10 ) return;
    *current_total_weight += next_word_info->weight;
}

These functions will be called pretty often. For the sake of efficiency, we
may consider making them macros, which would not break modularity, and would
make them compile inline.

==============
Here is a sample, using the functions above:

<html>
<head>
<title>The Title</title>
</head>
<body>
<p>The United <b>States</b></p>
</body>
</html>

NONCUMULATIVE:
process_word() will be called with the following info:
(occurance is probably not important in this case)

The    -- position = 0, type = TITLE,   weight = 5 + 10 = 15
Title  -- position = 1, type = TITLE,   weight = 5 + 9 = 14
The    -- position = 2, type = REGULAR, weight = 1 + 9 = 10
United -- position = 3, type = REGULAR, weight = 1 + 9 = 10
States -- position = 4, type = REGULAR, weight = 1 + 9 = 10

CUMULATIVE:
process_word() will be called with the following info:
(note that position has two values; this will be needed for search on phrases)

The    -- occurrences = 2, position = (0, 2), type = TITLE,   weight = 15 + 10 = 25
Title  -- occurrences = 1, position = 1,      type = TITLE,   weight = 14
United -- occurrences = 1, position = 3,      type = REGULAR, weight = 10
States -- occurrences = 1, position = 4,      type = REGULAR, weight = 10

The words to be ranked and indexed are located either between the TITLE,
HEAD, or possibly the META tag.

That's it for now. The way the database is set up right now, I think that the
first way (NONCUMULATIVE) would work better for it, and it is easier to
implement. Give it a thought and let me know what you think.

Cheers,

ozra.

--------------------------------------------------------------
Igor Stojanovski                 Grub.Org Inc.
Chief Technical Officer          5100 N. Brookline #830
                                 Oklahoma City, OK 73112
oz...@gr...                      Voice: (405) 917-9894
http://www.grub.org              Fax: (405) 848-5477
--------------------------------------------------------------
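
As a quick check of the weight arithmetic in the sample, here is a minimal C sketch of the two weight functions working together. It is only an illustration under a few assumptions: word_info is trimmed to the fields the functions touch, the default case in compute_weight() (no base weight for other word types) is my own guess, and cumulative_weight() is a hypothetical helper that stands in for what the Ranker might do internally in CUMULATIVE mode. None of this is meant as the final Ranker interface.

```c
#include <assert.h>
#include <stddef.h>

/* Word types named in the message; only REGULAR and TITLE are weighted here. */
enum word_type { REGULAR, TITLE, ANCHOR };

/* Trimmed-down word_info: just the fields the weight functions read or write. */
struct word_info {
    enum word_type type;
    unsigned short position;
    unsigned short weight;
    unsigned short occurance;   /* spelling follows the struct in the message */
};

/* Weight of one occurrence: base weight by type plus a bonus for early position. */
void compute_weight( struct word_info *info ) {
    switch ( info->type ) {
        case TITLE:   info->weight = 5; break;
        case REGULAR: info->weight = 1; break;
        default:      info->weight = 0; break;  /* assumption: no base weight */
    }
    info->weight += ( info->position > 100 ) ? 0 : ( 100 - info->position ) / 10;
}

/* Accumulate weight, ignoring everything past the tenth occurrence. */
void sum_weights( int *current_total_weight, struct word_info *next_word_info ) {
    if ( next_word_info->occurance > 10 ) return;
    *current_total_weight += next_word_info->weight;
}

/* Hypothetical helper (not part of the proposed Ranker): total CUMULATIVE
   weight of one word, given the type and position of each occurrence. */
int cumulative_weight( const enum word_type *types,
                       const unsigned short *positions, size_t n ) {
    int total = 0;
    assert( types != NULL && positions != NULL );
    for ( size_t i = 0; i < n; i++ ) {
        struct word_info info;
        info.type = types[i];
        info.position = positions[i];
        info.weight = 0;
        info.occurance = (unsigned short)( i + 1 );  /* 1-based occurrence count */
        compute_weight( &info );
        sum_weights( &total, &info );
    }
    return total;
}
```

For the word "The" in the sample document (TITLE at position 0, REGULAR at position 2), cumulative_weight() gives 15 + 10 = 25, which agrees with the CUMULATIVE listing. A word repeated twelve times far down the page would only have its first ten occurrences counted.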