From: Ben G. Wu <gg...@ab...> - 2000-09-07 09:17:09
|
My university blocks ICQ. I could use IRC or GAIM. cheers, Ben On Thu, 07 Sep 2000, Wagner Teixeira wrote: > > For those who want to reach me online, my ICQ# is 12729386 > > Igor and Kord: I think it could be especially useful for you who manage the > project. > > Cheers, > > Wagner. |
From: Wagner T. <wa...@wt...> - 2000-09-06 23:53:35
|
For those who want to reach me online, my ICQ# is 12729386. Igor and Kord: I think it could be especially useful for you who manage the project. Cheers, Wagner. |
From: Wagner T. <wa...@wt...> - 2000-09-06 23:49:05
|
Igor, That's the way I was thinking. I'll think about this more deeply and come back in the next few days (tomorrow is a holiday in Brazil). I'll aim for the most open and flexible design I can. Cheers, Wagner. > -----Original Message----- > From: gru...@li... [mailto:gru...@li...] On Behalf Of Igor Stojanovski > Sent: Wednesday, 6 September 2000 20:05 > To: gru...@li... > Subject: [Grub-develop] Wagner: Ranking mechanism for the words > > [snip: Igor's full ranking proposal, reproduced in his message below] |
From: Igor S. <oz...@gr...> - 2000-09-06 23:01:17
|
To Wagner:

-----Original Message-----
From: Wagner Teixeira [mailto:wa...@wt...]
Sent: Wednesday, September 06, 2000 1:58 PM
To: Igor Stojanovski
Subject: RE: Ranking system

> We are getting close to the point when the server will actually have the
> capability to index. In order to do this, we need a module that will take
> the contents of a page (as a stream of bytes), and do a ranking based on
> preset parameters. It will have to be able to extract the significant text
> from the HTML tags to base the ranking upon.

Ok, I understand the point. I'll try to make a generic class that allows
ranking of other kinds of documents in the future, like MSWord, PDF,
PostScript, etc.

[ozra] That's a good idea. But focus on HTML for now.

What I understood is: I'll make a parser that will return all relevant words
with their ranking position (0 to 10, for instance), so the caller can index
each word with its ranking position. The user will get this position when he
searches and chooses the best documents based on the engine's ranking, right?

[ozra] Yes. Say, we will not process beyond the first 1000 words of any
particular document.

This may be a little hard to swallow, but here is my suggestion on how the
Ranker may be implemented. This is not set in stone, and you may want to
implement it in a different way. I am sure you will get at least a few tips
from my suggestion.

Sample use:

    {
        // this is a class that implements the module;
        // process_word, compute_weight, sum_weights are
        // function pointers (see below)
        // the first argument can be CUMULATIVE or NONCUMULATIVE (see below)
        Ranker rank( CUMULATIVE, process_word, compute_weight, sum_weights );

        rank.begin_doc( "http://www.yahoo.com", 203314002 );  // URL ID

        while ( more input from a client ) {
            rank.process( buffer from the input stream );
        }

        rank.end_doc();
        ...
    }

==============
About process_word()

    // this struct is used as a parameter in process_word (see below)
    struct word_info {
        char *the_word;         // probably word ID would be better
        char *from_url;         // probably URL ID would be better
        enum word_type type;    // ex: REGULAR, TITLE, ANCHOR, etc.
                                // tells where the current word is positioned.
                                // if CUMULATIVE, the only valid values are
                                // REGULAR and TITLE, where TITLE means that at
                                // least one occurrence of the word is contained
                                // in the title (to enable searching for URLs
                                // by title).
        unsigned short position;
        unsigned short weight;
        unsigned short occurrence;
    };

    // this function is called for every single word encountered in a doc;
    // info is the parameter which gives all needed info about the word in process
    int process_word( struct word_info *info ) {
        // do an insert into the database about the word,
        // don't worry about that

        // if return == -1, then a fatal error occurred with the DB
    }

==============
About compute_weight()

This is a user-defined function which computes the weight for a single word.
It may use the word_info struct, where the type and position are given, and
the weight is then computed. For example:

    void compute_weight( struct word_info *info ) {
        switch ( info->type ) {
        case REGULAR:
            info->weight = 1;
            break;
        case TITLE:
            info->weight = 5;
            break;
        }

        info->weight += ( info->position > 100 ) ? 0 :
                        ( 100 - info->position ) / 10;
    }

==============
About (NON)CUMULATIVE

If CUMULATIVE is chosen, then Ranker will not repeat the same words, but
rather build a summary for each unique one, and then call process_word().
Weight will then be the sum of all weights. NONCUMULATIVE will call
process_word() for each single occurrence, and sum_weights() will not be used.

==============
About sum_weights()

This function will be called only if CUMULATIVE is used, in order to compute
relevance for multiple instances of the same word. For example, if a word is
contained a thousand times and we simply sum up the weights of the words,
then we would get a really big weight for that particular word in the
document. Instead we may want to implement a function that will add less and
less weight as the occurrences of a word grow. The example below simply
doesn't add any more relevance to a word after the tenth occurrence. Note
that this is not an attempt to get rid of spamming.

    // current_total_weight is the total weight for a single word
    void sum_weights( int *current_total_weight,
                      struct word_info *next_word_info )
    {
        // example code:
        // don't add any cumulative weight if the word appeared
        // more than 10 times
        if ( next_word_info->occurrence > 10 )
            return;

        *current_total_weight += next_word_info->weight;
    }

These functions will be called pretty often. For the sake of efficiency, we
may consider making them macros, which would not break modularity and would
make them compile inline.

==============
Here is a sample, using the functions above:

    <html>
    <head>
    <title>The Title</title>
    </head>
    <body>
    <p>The United <b>States</b></p>
    </body>
    </html>

NONCUMULATIVE: process_word() will be called with the following info
(occurrence is probably not important in this case):

    The    -- position = 0, type = TITLE,   weight = 5 + 10 = 15
    Title  -- position = 1, type = TITLE,   weight = 5 + 9  = 14
    The    -- position = 2, type = REGULAR, weight = 1 + 9  = 10
    United -- position = 3, type = REGULAR, weight = 1 + 9  = 10
    States -- position = 4, type = REGULAR, weight = 1 + 9  = 10

CUMULATIVE: process_word() will be called with the following info
(note that position has two values; this will be needed for searching on phrases):

    The    -- occurrences = 2, position = (0, 2), type = TITLE,   weight = 15 + 10 = 25
    Title  -- occurrences = 1, position = 1,      type = TITLE,   weight = 14
    United -- occurrences = 1, position = 3,      type = REGULAR, weight = 10
    States -- occurrences = 1, position = 4,      type = REGULAR, weight = 10

The words to be ranked and indexed are located within the TITLE, HEAD, or
possibly the META tags.

That's it for now. The way the database is set up right now, I think that the
first way (NONCUMULATIVE) would work better for it, and it is easier to
implement. Give it a thought and let me know what you think.

Cheers,

ozra.

--------------------------------------------------------------
Igor Stojanovski              Grub.Org Inc.
Chief Technical Officer       5100 N. Brookline #830
                              Oklahoma City, OK 73112
oz...@gr...                   Voice: (405) 917-9894
http://www.grub.org           Fax: (405) 848-5477
-------------------------------------------------------------- |
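The proposal above leaves the HTML-to-word step implicit: something has to strip the tags, track whether it is inside the title, and hand (word, position, type) triples to the weighting code. Below is a minimal sketch of such a tokenizer, assuming only the scheme described in Igor's message; the Ranker-style names reuse his proposal, but the tokenizer itself (and its handling of only TITLE vs. REGULAR text) is an illustrative assumption, not code from the Grub CVS.

    // Hypothetical sketch -- not the actual Grub Ranker. It walks a buffer of
    // HTML, skips tags, notes when it is inside <title>, and emits one
    // word_info per word using the weighting scheme proposed above.
    #include <cctype>
    #include <string>

    enum word_type { REGULAR, TITLE };

    struct word_info {
        std::string    the_word;
        word_type      type;
        unsigned short position;
        unsigned short weight;
    };

    static void compute_weight( word_info *info ) {
        info->weight  = ( info->type == TITLE ) ? 5 : 1;
        info->weight += ( info->position > 100 ) ? 0 : ( 100 - info->position ) / 10;
    }

    // Calls process_word() for each of the first max_words words found.
    void rank_html( const char *html, size_t len,
                    void (*process_word)( word_info * ),
                    unsigned max_words = 1000 )
    {
        bool in_tag = false, in_title = false;
        unsigned short position = 0;
        std::string word, tag;

        for ( size_t i = 0; i < len && position < max_words; ++i ) {
            char c = html[i];
            if ( c == '<' ) { in_tag = true; tag.clear(); continue; }
            if ( c == '>' ) {
                in_tag = false;
                if ( tag == "title" )  in_title = true;
                if ( tag == "/title" ) in_title = false;
                continue;
            }
            if ( in_tag ) { tag += (char)std::tolower( (unsigned char)c ); continue; }

            if ( std::isalnum( (unsigned char)c ) ) {
                word += c;
            } else if ( !word.empty() ) {
                word_info info;
                info.the_word = word;
                info.type     = in_title ? TITLE : REGULAR;
                info.position = position++;
                compute_weight( &info );
                process_word( &info );    // e.g. insert into the index DB
                word.clear();
            }
        }
        // a real parser would also flush the final word and handle entities,
        // comments, scripts, META/ANCHOR types, etc.
    }

For the sample document in the message above, this emits TITLE words at positions 0-1 and REGULAR words at positions 2-4, matching the NONCUMULATIVE table.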
From: Rodrigo D. <ro...@vr...> - 2000-09-06 17:15:18
|
Igor Stojanovski wrote: > Me and Rodrigo have been discussing whether the classes posted on the CVS > under server/dbfront/ are OK. Rodrigo suggests that we should use mysql++ > instead of Connection.*, Result.* and Row.*. > > Rodrigo wrote: > > First of all, I really really really really think you should > consider > using mysql++ when using C++...it already has all the stuff you're redoing > here...if you want an additional abstraction layer, you can re-write your > code > to use MySQL++, although I don't think that would be necessary... > [ozra] Since we would use MySQL only at the beginning and then very likely > move to a different database, those classes would serve as an abstraction to > any DB we might switch to in the future. With these classes it would be > very easy to do. Plus, we can closely control how much extra overhead > classes cause on the program. > > about your > program, it's missing a semicolon at the end of the query string...it was > also > missing an include for stdio.h in Row.h....oh and, eeeewwww your makefile > looks > ugly...also, what the heck is %p in a printf?? I changed it to %s...still > SEGFAULTs though... > [ozra] %p is for printing pointer value in hex. Sorry I put all that > extranious crap in there -- it was all there for testing purposes. I fixed > the problem. The reason it crashed was because I wasn't using the classes > correctly. Now it's querying and getting results correctly. Go and look at > those files again. Ohhh okay... > but really, try to restructure your program to use MySQL++, > 'cause that way you're using something that we KNOW works... > [ozra] These classes are taken out of a MySQL book. They should work well > in real life, and they use a handful of MySQL C API functions within, which > makes it easy to check if they work well or not. They will make the > conversion into a different database seemless (or almost). > > Also, you're mixing C and C++ too much, that gets confusing...like, > use > cout << "text" << endl; instead of printf("text\n");, it's more clear... > [ozra] I don't personally mind mixing them. > > I'm > sending you a little MySQL++ program that I made so that you can take a look > at > all these points and see how it should be done...it's not really structured > into classes but it uses C++ classes from MySQL and other standard > C++ classes... > [ozra] You are mixing C with C++ concepts too :) Adding cin's and cout's > does not make code pure C++. I know =c) But that's impossible to avoid since many C functions are not rewritten as C++ classes...such as lstat() which I use in the program I sent... > But anyway. I am not saying your are not right about using mysql++. Look > at my points and tell me if I am still wrong by using these classes. You're not *wrong*, it's just that we can save a lot of time by using a library which is done, working and mantained by someone rather than writing one that is almost like it(mysql++ has Connection, Row, Query, and many other classes too) by ourselves... > Cheers, Max |
From: Igor S. <oz...@gr...> - 2000-09-06 16:52:47
|
Rodrigo and I have been discussing whether the classes posted on the CVS under server/dbfront/ are OK. Rodrigo suggests that we should use mysql++ instead of Connection.*, Result.* and Row.*.

Rodrigo wrote:

First of all, I really really really really think you should consider using mysql++ when using C++...it already has all the stuff you're redoing here...if you want an additional abstraction layer, you can re-write your code to use MySQL++, although I don't think that would be necessary...

[ozra] Since we would use MySQL only at the beginning and then very likely move to a different database, those classes would serve as an abstraction over any DB we might switch to in the future. With these classes it would be very easy to do. Plus, we can closely control how much extra overhead the classes add to the program.

About your program, it's missing a semicolon at the end of the query string...it was also missing an include for stdio.h in Row.h...oh and, eeeewwww, your makefile looks ugly...also, what the heck is %p in a printf?? I changed it to %s...still SEGFAULTs though...

[ozra] %p is for printing a pointer value in hex. Sorry I put all that extraneous crap in there -- it was all there for testing purposes. I fixed the problem. The reason it crashed was that I wasn't using the classes correctly. Now it's querying and getting results correctly. Go and look at those files again.

But really, try to restructure your program to use MySQL++, 'cause that way you're using something that we KNOW works...

[ozra] These classes are taken out of a MySQL book. They should work well in real life, and they use only a handful of MySQL C API functions internally, which makes it easy to check whether they work or not. They will make the conversion to a different database seamless (or almost).

Also, you're mixing C and C++ too much, and that gets confusing...like, use cout << "text" << endl; instead of printf("text\n");, it's clearer...

[ozra] I don't personally mind mixing them.

I'm sending you a little MySQL++ program that I made so that you can take a look at all these points and see how it should be done...it's not really structured into classes but it uses C++ classes from MySQL and other standard C++ classes...

[ozra] You are mixing C with C++ concepts too :) Adding cin's and cout's does not make code pure C++.

But anyway, I am not saying you are not right about using mysql++. Look at my points and tell me if I am still wrong to use these classes.

Cheers,

ozra. |
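Igor's argument for keeping thin Connection/Result/Row wrappers is that the rest of the server should only ever see an abstract database interface, so that MySQL can be swapped out later. A minimal sketch of that idea follows; the class and method names here are illustrative assumptions, not the actual server/dbfront/ classes, while the MySQL calls are the standard C API functions such a wrapper would sit on.

    // Hypothetical sketch of the abstraction argument -- not the real dbfront code.
    // Callers program against DbConnection/DbResult; only the MySQL subclass
    // knows the MySQL C API, so switching databases means adding one subclass.
    #include <mysql/mysql.h>
    #include <stdexcept>
    #include <string>
    #include <vector>

    typedef std::vector<std::string> DbRow;

    class DbResult {
    public:
        virtual ~DbResult() {}
        virtual bool next( DbRow &row ) = 0;    // fetch the next row, false at end
    };

    class DbConnection {
    public:
        virtual ~DbConnection() {}
        virtual DbResult *query( const std::string &sql ) = 0;   // caller deletes
    };

    // MySQL-specific implementation, hidden behind the generic interface.
    class MySqlResult : public DbResult {
        MYSQL_RES *res_;
        unsigned   cols_;
    public:
        explicit MySqlResult( MYSQL_RES *res )
            : res_( res ), cols_( mysql_num_fields( res ) ) {}
        ~MySqlResult() { mysql_free_result( res_ ); }
        bool next( DbRow &row ) {
            MYSQL_ROW r = mysql_fetch_row( res_ );
            if ( !r ) return false;
            row.clear();
            for ( unsigned i = 0; i < cols_; ++i )
                row.push_back( r[i] ? r[i] : "" );   // SQL NULL -> empty string
            return true;
        }
    };

    class MySqlConnection : public DbConnection {
        MYSQL handle_;
    public:
        MySqlConnection( const char *host, const char *user,
                         const char *pass, const char *db ) {
            mysql_init( &handle_ );
            if ( !mysql_real_connect( &handle_, host, user, pass, db, 0, 0, 0 ) )
                throw std::runtime_error( mysql_error( &handle_ ) );
        }
        ~MySqlConnection() { mysql_close( &handle_ ); }
        DbResult *query( const std::string &sql ) {      // assumes a SELECT
            if ( mysql_query( &handle_, sql.c_str() ) != 0 )
                throw std::runtime_error( mysql_error( &handle_ ) );
            MYSQL_RES *res = mysql_store_result( &handle_ );
            if ( !res )
                throw std::runtime_error( mysql_error( &handle_ ) );
            return new MySqlResult( res );
        }
    };

Whether a thin layer like this or mysql++ wins is exactly the tradeoff being debated above: the wrapper buys portability and control over overhead, at the cost of maintaining code a library already provides.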
From: Wagner T. <wa...@wt...> - 2000-09-05 23:24:03
|
Gentlemen (especially Ben and Ozra), I've just set up monitoring of GRUB's CVS tree at SourceForge for my account, so as soon as you update the source code, I'll be notified. This is my attempt to get us using CVS as the single source, at least for CSTAT. Cheers, Wagner. |
From: Igor S. <oz...@gr...> - 2000-09-05 21:50:31
|
There seems to be a lot of confusion about the CSTAT module, which is largely my fault. Ben had finished the UNIX counterpart of it and sent the module to me. Then I sent that revision to Wagner. In the meantime, Ben posted the code on CVS, and at about the same time Wagner finished the Windows part of the module. Just now I figured out that the CVS version has a few added capabilities (the max_files part, for example) compared to the one Ben had previously e-mailed to me. I suggest that Ben gets the module to the point where it is completely finished. Wagner has translated parts of the code to Windows, and he can finish the rest once Ben is done. Sorry for the inconvenience. And yes, we should use CVS as much as possible. Cheers, ozra. -------------------------------------------------------------- Igor Stojanovski Grub.Org Inc. Chief Technical Officer 5100 N. Brookline #830 Oklahoma City, OK 73112 oz...@gr... Voice: (405) 917-9894 http://www.grub.org Fax: (405) 848-5477 -------------------------------------------------------------- |
From: Kostadin D. <ko...@ho...> - 2000-09-05 18:04:21
|
Hi All, I have been working on the Client Communication part of the project. It is pretty much an implementation of the TCP/HTTP protocols. Recently, the library I have been modifying to fit our purposes (tcp4u) crashed on the Linux box, possibly because it has trouble with multithreading. This, in addition to the bulk of the tcp4u library (it even includes UDP and SMTP support) and its lack of support for HTTP 1.1, has made me reconsider whether we should use tcp4u or something else for this purpose. I looked through Mozilla.org's netlib and W3C's libwww. Here's what I saw about them:

netlib: the current version does not support a multithreaded approach and instead uses a loop/callback mechanism; supports only parts of the HTTP 1.1 spec; good documentation.

libwww: too big and robust; supports XML, CSS and everything else; fully supports HTTP 1.1.

The reason I have outlined these here is that I was wondering if anybody knows of another library we can look at, or if you have any comments on the current/proposed functionality of this. Give me some feedback!!! Kosta |
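For context on the job these candidate libraries do for the crawler, here is a minimal sketch of fetching a page with a plain HTTP/1.0 GET over POSIX sockets. It is an illustration only (the module under discussion wraps tcp4u rather than raw sockets) and deliberately omits the redirect handling, per-request timeouts, and HTTP 1.1 features (persistent connections, chunked transfer) that the library comparison above is really about; the User-Agent string and port are assumptions.

    // Hypothetical sketch: a bare-bones HTTP/1.0 GET, showing the work that
    // tcp4u/netlib/libwww wrap (resolve, connect, request, read response).
    #include <netdb.h>
    #include <sys/socket.h>
    #include <unistd.h>
    #include <cstring>
    #include <string>

    // Fetch "http://<host>/<path>" and return the raw response (headers + body),
    // or an empty string on failure.
    std::string http_get( const std::string &host, const std::string &path )
    {
        struct addrinfo hints, *res = 0;
        std::memset( &hints, 0, sizeof( hints ) );
        hints.ai_family   = AF_INET;
        hints.ai_socktype = SOCK_STREAM;
        if ( getaddrinfo( host.c_str(), "80", &hints, &res ) != 0 )
            return "";

        int fd = socket( res->ai_family, res->ai_socktype, res->ai_protocol );
        if ( fd < 0 || connect( fd, res->ai_addr, res->ai_addrlen ) != 0 ) {
            freeaddrinfo( res );
            if ( fd >= 0 ) close( fd );
            return "";
        }
        freeaddrinfo( res );

        std::string request = "GET " + path + " HTTP/1.0\r\n"
                              "Host: " + host + "\r\n"
                              "User-Agent: grub-client-sketch\r\n"
                              "Connection: close\r\n\r\n";
        send( fd, request.data(), request.size(), 0 );

        std::string response;
        char buf[4096];
        ssize_t n;
        while ( ( n = recv( fd, buf, sizeof( buf ), 0 ) ) > 0 )
            response.append( buf, n );   // caller still has to parse the
                                         // status line, headers, redirects
        close( fd );
        return response;
    }

A production crawler needs everything this sketch leaves out -- timeouts, following 3xx redirects, status-code handling, persistent connections -- which is exactly the feature list being weighed between tcp4u, netlib, and libwww.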
From: Igor S. <oz...@gr...> - 2000-09-01 22:09:30
|
I haven't made any changes to the module, and I don't think Ben has either. I only changed "#define DEFAULT_MAX_SIZE" for testing purposes. So there is no updated source code. The newer version of the TThread module can be found on the CVS (look at http://cvs.sourceforge.net/cgi-bin/cvsweb.cgi/client/TThread/?cvsroot=grub), but the changes do not affect how the library is used. Cheers, ozra. -------------------------------------------------------------- Igor Stojanovski Grub.Org Inc. Chief Technical Officer 5100 N. Brookline #830 Oklahoma City, OK 73112 oz...@gr... Voice: (405) 917-9894 http://www.grub.org Fax: (405) 848-5477 -------------------------------------------------------------- Wagner wrote: Where can I find the updated source code? |
From: Wagner T. <wa...@wt...> - 2000-09-01 21:50:24
|
Where can I find the updated source code? > -----Original Message----- > From: gru...@li... > [mailto:gru...@li...]On Behalf Of Igor > Stojanovski > Sent: Sexta-feira, 1 de Setembro de 2000 18:36 > To: gru...@li... > Subject: [Grub-develop] CSTAT > > > Good. Wagner is working on the Windows part of the module. > > > -------------------------------------------------------------- > Igor Stojanovski Grub.Org Inc. > Chief Technical Officer 5100 N. Brookline #830 > Oklahoma City, OK 73112 > oz...@gr... Voice: (405) 917-9894 > http://www.grub.org Fax: (405) 848-5477 > -------------------------------------------------------------- > > Ben wrote: > > Hi, guys, I posted the cstat module (for logging and statistic) under the > client > directory. Ignor said this module can be used in the server side as well. > The > module is only for logging at the moment. And I only tested it on my > linuxppc > box. > > cheers, > Ben > > _______________________________________________ > Grub-develop mailing list > Gru...@li... > http://lists.sourceforge.net/mailman/listinfo/grub-develop |
From: Igor S. <oz...@gr...> - 2000-09-01 21:32:46
|
Good. Wagner is working on the Windows part of the module. -------------------------------------------------------------- Igor Stojanovski Grub.Org Inc. Chief Technical Officer 5100 N. Brookline #830 Oklahoma City, OK 73112 oz...@gr... Voice: (405) 917-9894 http://www.grub.org Fax: (405) 848-5477 -------------------------------------------------------------- Ben wrote: Hi guys, I posted the cstat module (for logging and statistics) under the client directory. Igor said this module can be used on the server side as well. The module is only for logging at the moment, and I have only tested it on my linuxppc box. cheers, Ben |
From: Ben G. Wu <gg...@ab...> - 2000-09-01 20:43:35
|
Hi guys, I posted the cstat module (for logging and statistics) under the client directory. Igor said this module can be used on the server side as well. The module is only for logging at the moment, and I have only tested it on my linuxppc box. cheers, Ben |
From: Igor S. <oz...@gr...> - 2000-08-28 16:03:13
|
-----Original Message----- From: Igor Stojanovski [mailto:oz...@gr...] Sent: Monday, August 28, 2000 10:59 AM To: Doug Semig Subject: RE: [Grub-develop] CRW The main disadvantage to the current version of CRW is servicing hundreds or more connections at the time. As a single thread is responsible for one connection, we need hundreds of threads to service the connections. The problem is not the overhead for creating/destroying the threads, but rather the burden on the kernel for maintaining/scheduling them. Even though CRW has a way of maintaining maximum number of threads running at a time, I would like our server to be pretty comfortable servicing hundreds of connections at the same time. Doug Semig wrote: > It's not all that complicated. I've done this stuff before. I'm more than > capable of whipping up a program like this if you want to see it. [ozra] Go ahead and write the program. We might implement it in grub. > I might have some code for reading a configuration file laying around, > too...I wonder if I can find that. Do you need something like that as well? [ozra] That would be excellent, too. Cheers, ozra. -------------------------------------------------------------- Igor Stojanovski Grub.Org Inc. Chief Technical Officer 5100 N. Brookline #830 Oklahoma City, OK 73112 oz...@gr... Voice: (405) 917-9894 http://www.grub.org Fax: (405) 848-5477 -------------------------------------------------------------- |
From: Doug S. <dou...@se...> - 2000-08-27 09:42:11
|
An interesting quote from the Linux Threads FAQ ( http://pauillac.inria.fr/~xleroy/linuxthreads/faq.html ): ...you'd better restructure your application so that it doesn't need more than, say, 100 threads. For instance, in the case of a multithreaded server, instead of creating a new thread for each connection, maintain a fixed-size pool of worker threads that pick incoming connection requests from a queue. --The Linux Threads FAQ, Question D.10 By the way, I'm mainly posting this to see how long it takes the message to get delivered through the list. But I thought it should be on topic instead of a "Test--Please Ignore" kind of post. Doug |
From: Doug S. <dou...@se...> - 2000-08-26 09:54:32
|
Nevermind about the how-to-submit-a-patch question. I've never used a web interface to submit patches before, so I wasn't expecting one...it's cool that SourceForge provides this function. They really have it going on! This makes me want to join 8,000 projects and start coding 24 hours a day! But I ****really**** hate this ever-increasing delay in these mailing lists. I subscribe to about fifteen mailing lists...it took five seconds for an e-mail to get from Austrailia to the PHP list server (where ever that is) to my mailbox here in Michigan. I have yet to see several of my posts to Grub's lists. Doug |
From: Doug S. <dou...@se...> - 2000-08-26 09:21:30
|
Well it looks like no matter what a developer does, they cannot do any worse than forking an instance of the program! I'm not very happy about this info because I believe my Cistron Radius server (that I'm running at my ISP) is based upon Livingston's, and that's how Livingston's Radius server was written. There's a little difference between the way I described and the way the netwoking book described it. I was considering putting the accept() in it's own thread. They stuck the accept() in the worker thread and protected it with a mutex. I take it they then use that thread to service the entire connection. So a worker thread wouldn't run at top speed...it would have to sometimes wait on network input. That's even a simpler model than what I was thinking about. I think I like their model because my model relies on some kind of message passing (queues, kernel messages, shared memory, whatever) and theirs just keeps everything inside the worker thread that accept()ed the connection. It's not all that complicated. I've done this stuff before. I'm more than capable of whipping up a program like this if you want to see it. I might have some code for reading a configuration file laying around, too...I wonder if I can find that. Do you need something like that as well? Speaking of which....how would we submit patches? We probably shouldn't use the lists to submit patches in case some folks have a modem connection to the 'net (which is how I'm working right now...when I'm physically not at work, I just dial in!). Do you have a "patch (at) grub (dot) org" mailbox? Or perhaps an anonymous ftp server with an incoming directory? I know that linux kernel developers just send their patches to the kernel mailing list (and to Linus). The big advantage is that their patch is immediately distributed to all the other developers that will want to try/test it. The disadvantages include the fact that it can quickly fill a mailbox that has a quota on it (mine doesn't, but there are folks that have quotas on their mailboxes), and there are folks who still dial in to the internet (like me at my home!), and it's not a real patch management scheme. Seems like SourceForge would already have some kind of a patch management scheme in place. But I don't remember seeing anything like that. I think I'll go and look through their feature requests and suggest one if it hasn't been suggested already. Doug At 03:02 PM 8/25/00 -0500, you wrote: >Doug has a point on the use of prethreaded server rather than creating a >thread on per connection basis. And I have thought that option myself. > >If you have a web site with millions of requests a day, prethreaded server >is definately necessary. However, we are not going to have NEARLY as many >connections as that. If we had 100,000 clients distributed, which may be >enough to index the whole internet every single day, and each one of them >connected every day with the server -> we would have 100,000 connections, >which will take up a minute or two in CPU time per day for thread >creation/destruction (on an average PC). Creating a new thread is faster >than fork()ing almost in order of 100 times. > >I abandoned the idea of doing a preforked/prethreaded server because it does >not provide real benefits, and it seemed more complicated to create. Plus >doing a thread-per-connection server may perform with almost identical >results as prethreaded server (see below). > ... snip the benchmarks (see archives) ... > >Give me more feedback on this. > > >Cheers, > >igor. 
> >-------------------------------------------------------------- >Igor Stojanovski Grub.Org Inc. >Chief Technical Officer 5100 N. Brookline #830 > Oklahoma City, OK 73112 >oz...@gr... Voice: (405) 917-9894 >http://www.grub.org Fax: (405) 848-5477 >-------------------------------------------------------------- |
From: Doug S. <dou...@se...> - 2000-08-26 08:10:19
|
What's going on at SourceForge?!? Look at these times! Igor sent this out around 8:00 pm UTC (1:00 pm his local time). Sourceforge received it almost immediately and then hung onto it for almost 12 hours! And I just checked and it's not in the list's archives yet. Are any of you getting a huge delay in these messages as well? Doug (copied from Igor's post...): Received: from mail1.sourceforge.net (HELO lists.sourceforge.net) (198.186.203.35) by sloth.c3net.net with SMTP; 26 Aug 2000 07:54:24 -0000 Received: from mail1.sourceforge.net (localhost [127.0.0.1]) by lists.sourceforge.net (8.9.3/8.9.3) with ESMTP id NAA22296; Fri, 25 Aug 2000 13:01:01 -0700 Received: from pineapple.theshop.net (pineapple.theshop.net [208.128.7.7]) by lists.sourceforge.net (8.9.3/8.9.3) with ESMTP id MAA22240 for <gru...@li...>; Fri, 25 Aug 2000 12:59:16 -0700 |
From: Igor S. <oz...@gr...> - 2000-08-25 19:59:16
|
Doug has a point about using a prethreaded server rather than creating a thread on a per-connection basis, and I have considered that option myself.

If you have a web site with millions of requests a day, a prethreaded server is definitely necessary. However, we are not going to have NEARLY as many connections as that. If we had 100,000 distributed clients, which may be enough to index the whole internet every single day, and each of them connected to the server once a day, we would have 100,000 connections, which would take up a minute or two of CPU time per day for thread creation/destruction (on an average PC). Creating a new thread is almost 100 times faster than fork()ing.

I abandoned the idea of doing a preforked/prethreaded server because it does not provide real benefits, and it seemed more complicated to create. Plus, a thread-per-connection server may perform almost identically to a prethreaded server (see below).

Here is some testing data on the performance of various types of web servers that I have taken out of a networking book, in seconds of CPU time (this is for 5000 connections requesting 4000 bytes from the server):

                                                         Solaris    DUnix
    ==================================================   =======    =====
    Concurrent server, one thread per client request        18.7      4.7
    Prethreaded with mutex locking to protect accept()        8.6      3.5
    Prethreaded with main thread calling accept()            14.5      5.0
    Concurrent server, one fork per client request          504.2    168.9

Give me more feedback on this.

Cheers,

igor.

--------------------------------------------------------------
Igor Stojanovski              Grub.Org Inc.
Chief Technical Officer       5100 N. Brookline #830
                              Oklahoma City, OK 73112
oz...@gr...                   Voice: (405) 917-9894
http://www.grub.org           Fax: (405) 848-5477
-------------------------------------------------------------- |
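The second row of that table ("prethreaded with mutex locking to protect accept()") is the model the Linux Threads FAQ quote and Doug's messages describe: a fixed pool of worker threads, each taking the listening socket's accept() under a mutex and then servicing the whole connection itself. A minimal pthreads sketch of that pattern follows, assuming an illustrative port and a placeholder handle_connection(); it illustrates the benchmarked design, not the actual Grub CRW code.

    // Hypothetical sketch of a prethreaded server with mutex-protected accept().
    // N_WORKERS threads are created once, up front; there is no per-connection
    // thread creation/destruction, which is the overhead discussed above.
    #include <netinet/in.h>
    #include <pthread.h>
    #include <sys/socket.h>
    #include <unistd.h>
    #include <cstring>

    static const int N_WORKERS = 10;        // the "magic number" of threads
    static int listen_fd;
    static pthread_mutex_t accept_lock = PTHREAD_MUTEX_INITIALIZER;

    static void handle_connection( int conn_fd ) {
        // placeholder: read the client's request, speak the Grub protocol, etc.
        char buf[512];
        while ( read( conn_fd, buf, sizeof( buf ) ) > 0 )
            ;                                // discard, for the sketch
        close( conn_fd );
    }

    static void *worker( void * ) {
        for ( ;; ) {
            pthread_mutex_lock( &accept_lock );       // only one thread blocks
            int conn_fd = accept( listen_fd, 0, 0 );  // in accept() at a time
            pthread_mutex_unlock( &accept_lock );
            if ( conn_fd >= 0 )
                handle_connection( conn_fd );         // the same thread serves
        }                                             // the entire connection
        return 0;
    }

    int main() {
        listen_fd = socket( AF_INET, SOCK_STREAM, 0 );
        sockaddr_in addr;
        std::memset( &addr, 0, sizeof( addr ) );
        addr.sin_family      = AF_INET;
        addr.sin_addr.s_addr = htonl( INADDR_ANY );
        addr.sin_port        = htons( 8080 );         // illustrative port
        bind( listen_fd, (sockaddr *)&addr, sizeof( addr ) );
        listen( listen_fd, 64 );

        pthread_t tid[N_WORKERS];
        for ( int i = 0; i < N_WORKERS; ++i )
            pthread_create( &tid[i], 0, worker, 0 );
        for ( int i = 0; i < N_WORKERS; ++i )         // run until killed
            pthread_join( tid[i], 0 );
        return 0;
    }

In the thread-per-connection model CRW currently uses, main() would instead call accept() itself and spawn a short-lived thread per returned descriptor; the table above suggests the difference only starts to matter at connection rates well beyond the 100,000 per day Igor estimates.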
From: Igor S. <oz...@gr...> - 2000-08-25 18:53:45
|
-----Original Message----- From: Wagner Teixeira [mailto:wa...@wt...] Sent: Friday, August 25, 2000 1:54 PM To: Kord Campbell Cc: Igor Stojanovski Subject: Idea -- I haven't caught the total picture of GRUB yet, but I think you'll like this: as you may know (I've mentioned it before), I'm finishing a textual DB engine, and one problem I've been thinking about is: how large can a database be? How large can a disk be? How large can a file in a filesystem be? The answer: it doesn't matter. WTB/Fullbase is built on a CVirtualStorage class that can be local or client/server and can handle 2^64 volumes (files OR disks), each with 2^64 I/O blocks of a reasonable size (10KB, for instance). With some math, that gives 2^128 blocks of 10KB each, or 10240 * 2^128 bytes (roughly 3.5 * 10^42). CVirtualStorage is implemented as local at this time and works fine. The point is: it works well for my own DB engine, but since it is an API it is not ready to work with other engines. I don't know whether it is a good solution for use with UDMSearch (I don't know UDMSearch). Well... this is my first contribution (attempt) :) Wagner. |
From: Doug S. <dou...@se...> - 2000-08-25 18:12:57
|
I was glancing through the server code. (Compiles and runs as advertised, too!) It looks like CRW will create an IND thread on a per-connection basis, and when the connection ends, the IND thread ends. With a lot of connections, that could be a lot of thread creation/destruction overhead.

I'm just going to toss out the possibility of creating threads on a functional basis instead of a connection basis. This model has very light thread creation/destruction overhead because the threads that process the connections are created up front (or as needed if the data starts to come in too fast for the current thread pool to handle). For example, the program creates three threads to start with (so there would be a total of four initial schedulable units running: the main program and the three threads).

Initial thread #1 (LISTENER) simply listens for connection requests and grants them.

Initial thread #2 (RECEIVER) listens for data on the incoming connections and passes it to the rest of the program via either a queue or a SysV IPC mechanism such as the kernel's message passing feature or shared memory. It also manages the worker threads (described below), creating and removing them as necessary to handle the load, and it is responsible for maintaining overall connection status. Overall, though, this thread is very small and very fast, because almost all it does is get data from the network and queue it for processing.

Initial thread #3 (WORKER) is the basic worker thread that takes and processes the data from thread #2. It is responsible for maintaining the state of the client/server protocol as required for processing. It is not responsible for any single connection...a single connection is actually "managed" through a connection state vector (a data structure) that this thread maintains. That way, there can be more than one of this kind of thread, and they are all interchangeable and identical.

Building a server on a functional model like this could eliminate a lot of complex thread management, and it is very modular. I first saw this model used in an FTP server for Win32. Comments? I suppose I can try to code up a quick skeleton of this over the weekend if you would like to see this model in action.

Doug |
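As a rough illustration of Doug's functional model (and distinct from the mutex-protected accept() pool sketched elsewhere in this thread), here is a minimal sketch of the RECEIVER-to-WORKER hand-off using a mutex/condition-variable work queue. The queue, the work_item fields, and the thread roles are assumptions made up for illustration, not code from the Grub CVS.

    // Hypothetical sketch of the RECEIVER/WORKER hand-off: the receiver enqueues
    // buffers read from the network; identical worker threads dequeue them and
    // process them against a per-connection state record, as described above.
    #include <pthread.h>
    #include <queue>
    #include <string>

    struct work_item {
        int         connection_id;   // index into the connection state vector
        std::string data;            // bytes read from that connection
    };

    class WorkQueue {
        std::queue<work_item> items_;
        pthread_mutex_t       lock_;
        pthread_cond_t        not_empty_;
    public:
        WorkQueue() {
            pthread_mutex_init( &lock_, 0 );
            pthread_cond_init( &not_empty_, 0 );
        }
        void push( const work_item &item ) {       // called by the RECEIVER
            pthread_mutex_lock( &lock_ );
            items_.push( item );
            pthread_cond_signal( &not_empty_ );
            pthread_mutex_unlock( &lock_ );
        }
        work_item pop() {                          // called by each WORKER
            pthread_mutex_lock( &lock_ );
            while ( items_.empty() )
                pthread_cond_wait( &not_empty_, &lock_ );
            work_item item = items_.front();
            items_.pop();
            pthread_mutex_unlock( &lock_ );
            return item;
        }
    };

    static WorkQueue work_queue;

    static void *worker_thread( void * ) {
        for ( ;; ) {
            work_item item = work_queue.pop();
            // look up the protocol state for item.connection_id in the shared
            // connection state vector, feed it item.data, and act on it
        }
        return 0;
    }

Because the workers only ever see (connection_id, data) pairs, any of them can service any connection, which is the interchangeability Doug is after; the cost is that protocol state has to live in the shared state vector rather than on a single thread's stack.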
From: Igor S. <oz...@gr...> - 2000-08-24 21:41:48
|
Doug wrote:
> Oh, good. I'm glad that the "client" stuff in the server repository is
> going to be deleted.

[ozra] Actually, I am not deleting it. I just accidentally imported it under the server directory (wrong spot). It still exists under the client directory. I am guessing you had problems compiling the tcp4u library. The reason we are using it is that it has several features:

- It understands a subset of the HTTP protocol, the GET command in particular.
- It understands HTTP error codes returned from the web server.
- It is able to follow redirects given by the web server.
- It works under both UNIX and Windows.

I thought it would be easier to just use something someone else wrote than to recreate it, and I wasn't able to find anything more appropriate. If anyone has anything better in mind, let me know. It does have a lot of extraneous crap which we will never use, like UDP sockets. We might use whatever we can out of it, and then clean up the rest of the code that we don't need in the future.

> It doesn't compile unless you do several things:

[ozra] As far as compiling problems go, there should be none now except for one warning, which is unimportant:

> * you have to #define NEED_PROTO in both main.cpp and port.h (weird stuff
> if you ask me...)

[ozra] It's fixed now. Download the newest version of the code.

> * you have to cast szURL to a char * in the strcat() call in Kosta's change
> to http4_url.c to get rid of a couple of compiler warnings

[ozra] It's fixed.

> * you have to cast Rc in http4_ex.c to an unsigned int where it says in the
> comments that it will generate a signed/unsigned mismatch (this probably
> would be safe because the bizarre network abstraction library looks
> like it already handles error conditions, so Rc would be positive
> by the time it got here, or zero if nothing was received) to get rid of
> the signed/unsigned mismatch

[ozra] This one is a problem. However, first, we will most likely not use this functionality, that is, writing what we get from the HTTP response straight to a local file. Second, the functions used for writing to a local file are from old 16-bit Windows, which I don't feel comfortable with. This part needs a slight modification so that the "write to a local file" functionality is removed altogether. I had forgotten about it until you mentioned it.

> * and you have to link against wsock32.lib

[ozra] Yes. But if you compile the code using Visual C++ and the .dsw provided in the CVS, you don't need to be aware of that fact, as it is all handled by the project files.

Right now the library tends to crash on Linux due to some bug, but that is being taken care of, too.

Cheers,

igor. |
From: Doug S. <dou...@se...> - 2000-08-24 06:15:58
|
Oh, good. I'm glad that the "client" stuff in the server repository is going to be deleted. It doesn't compile unless you do several things: * you have to #define NEED_PROTO in both main.cpp and port.h (weird stuff if you ask me...) * you have to cast szURL to a char * in the strcat() call in Kosta's change to http4_url.c to get rid of a couple of compiler warnings * you have to cast Rc in http4_ex.c to an unsigned int where it says in the comments that it will generate a signed/unsigned mismatch (this probably would be safe because the bizarre network abstraction library looks like it already handles error conditions, so Rc would be positive by the time it got here, or zero if nothing was received) to get rid of the signed/unsigned mismatch * and you have to link against wsock32.lib That allows it to compile cleanly AND allows it to link. At least it actually works as advertised after all of that. Of course, it's not anything but a little test suite. I'm glad you're getting rid of it. It's distracting. And it's not useful in any way, except perhaps as an exercise for a C Programming 101 class. If the the "client" in the server repository was a test to see who could get it to compile cleanly, I want a gold star for completing the assignment. Doug At 04:31 PM 8/23/00 -0500, Igor Stojanovski wrote: >Grubsters, > >I finally completed the CRW module. As I have already said, this module >work in a way similar to a web server, listening for Clients to connect to >it. Once a client connect, a new thread is started which will deal with the >Client. > >Go to the CVS repository and take a peak at it. It uses a lot of >multithreading stuff, so its prone to bugs. I tested it a little bit, and >it seemed to work fine. > >The server is located in the server directory (big surprise). You will find >a client subdirectory in there. Ignore it completely, I put it there >accidentally. I will remove it soon. > >Cheers, > >igor. > >-------------------------------------------------------------- >Igor Stojanovski Grub.Org Inc. >Chief Technical Officer 5100 N. Brookline #830 > Oklahoma City, OK 73112 >oz...@gr... Voice: (405) 917-9894 >http://www.grub.org Fax: (405) 848-5477 >-------------------------------------------------------------- |
From: Igor S. <oz...@gr...> - 2000-08-23 21:28:57
|
Grubsters, I finally completed the CRW module. As I have already said, this module works in a way similar to a web server, listening for Clients to connect to it. Once a client connects, a new thread is started which deals with that Client. Go to the CVS repository and take a peek at it. It uses a lot of multithreading stuff, so it's prone to bugs. I tested it a little bit, and it seemed to work fine. The server is located in the server directory (big surprise). You will find a client subdirectory in there. Ignore it completely; I put it there accidentally and will remove it soon. Cheers, igor. -------------------------------------------------------------- Igor Stojanovski Grub.Org Inc. Chief Technical Officer 5100 N. Brookline #830 Oklahoma City, OK 73112 oz...@gr... Voice: (405) 917-9894 http://www.grub.org Fax: (405) 848-5477 -------------------------------------------------------------- |
From: Igor S. <oz...@gr...> - 2000-08-21 20:15:25
|
Ramaseshan sent me this algorithm for the crawler:

    Instruction from CORD: Start Crawling
    Talk to CREQ and get the set of URLs to crawl
    Add the URLs to the currently empty list of URLs to search
    Set the state of each URL to 'Ready to Crawl'
    For each URL, start a thread
    While the list of URLs to search is not empty,
    {
        Get the next URL in the list.
        Change the state of the URL to 'Crawling in Progress'
        // Before starting a crawl, each thread reads this state and picks up
        // the URL whose state is 'Ready to Crawl'

        // The following step can be avoided if the server sends only URLs
        // whose protocol is HTTP (FTP?)
        Check the URL to make sure its protocol is HTTP (FTP?)
        If not HTTP (FTP?) protocol
            Break
        Else
            See whether there's a robots.txt file at this site that includes
            a "Disallow" statement.
            If the document is disallowed for indexing
                Break
            Else
                Retrieve that document from the Web
                If unable to retrieve the document (determined by the timeouts)
                    Change the state of the URL to 'Unable to Crawl'
                    Break out of while loop
                End If
                Obey 'Meta Tag = Robots' and get all links in the document if allowed
                Resolve links (if present in the document) to get the absolute new URLs
                Store those new URLs in the database
                Compress the page
                Change the state of the URL to 'Crawling Completed'
            End If
        End If
    }
    If some of the URLs could not be crawled (unable to crawl, server down,
    page not found, etc.), return those URLs to the server for rescheduling

My comments on the algorithm:

> Talk to CREQ and get the set of URLs to crawl

[ozra] For the sake of modularity, I think that CCRW should be unaware of CREQ's existence. It needs to know only the CDBP's interface. This interface will provide operations such as getting a URL to crawl, storing the contents of the pages, storing the newly-found URLs, etc.

> Set the state of each URL to 'Ready to Crawl'

[ozra] When CORD schedules CCRW to run, CREQ would have already gotten new URLs from the Server for crawling, and would have marked them as such (or done whatever is needed to make them ready for crawling).

> For each URL, start a thread

[ozra] Here is my suggestion on how we should handle the crawling threads. When CORD starts CRW, it executes either in its own thread or perhaps not -- let me know what you think about that. I think it is important to have a mechanism that can effectively kill the crawling threads if we need to. We could have a "magic number" for the number of threads that will crawl concurrently. Say that magic number is 10. Prior to any crawling, 10 crawler threads will be created. Then each thread operates in the loop that you described:

> While the list of URLs to search is not empty,
> { ... }

[ozra] Then CRW waits for all threads to finish crawling before either the main CRW thread dies, or the function call that makes CRW run returns.

> Check the URL to make sure its protocol is HTTP (FTP?)
> If not HTTP (FTP?) protocol
>     Break

[ozra] I think we should have a protocol check. For now only HTTP will be implemented, but we should make it modular enough that adding a new protocol would be painless.

[ozra] I am all for using robots.txt and the robots meta tag during crawling. When such a page comes up for crawling, we should mark it "disallowed." But let's leave this task for the future. Don't worry about it right now, unless you think it is necessary.

> If some of the URLs could not be crawled (unable to crawl, server down,
> page not found, etc.), return those URLs to the server for rescheduling

[ozra] Remember that this module will have no contact with the central Server at all. To retrieve the URLs that are due for crawling, or to store the contents of the pages crawled and the URLs found, the CDBP's interface will be used. I know this is kind of abstract to you at this point because that interface does not yet exist, but I think it's OK while you are writing the p-code. I will assign someone to it soon. In fact, a lot of the CDBP's interface will be figured out from your p-code.

Don't worry about how pages are stored or compressed; Mikhail is working on archiving the crawled pages. Just remember to use the existing code for thread, mutex, and HTTP operations -- it can all be found on the CVS. If you need additional functionality in the HTTP library, contact Kosta, as he knows it best. I know we will use time-outs while waiting for server responses; check to see that that functionality is OK.

Cheers,

igor.

--------------------------------------------------------------
Igor Stojanovski              Grub.Org Inc.
Chief Technical Officer       5100 N. Brookline #830
                              Oklahoma City, OK 73112
oz...@gr...                   Voice: (405) 917-9894
http://www.grub.org           Fax: (405) 848-5477
-------------------------------------------------------------- |
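To make the division of labour concrete, here is a minimal sketch of one crawler thread's loop written against a hypothetical CDBP-style interface. The Cdbp method names (next_ready_url, store_page, store_links, set_state) and the UrlJob fields are assumptions made up for illustration -- the real CDBP interface did not exist yet at the time of this message -- and the fetching, robots, parsing, and compression steps are reduced to stubs owned by other modules.

    // Hypothetical sketch of one crawler worker thread, following the p-code and
    // the [ozra] comments above: URLs come from, and results go back to, the
    // local client database module (CDBP), never the central Server.
    #include <string>
    #include <vector>

    enum UrlState { CRAWLING, COMPLETED, UNABLE, DISALLOWED };

    struct UrlJob {
        long        url_id;
        std::string url;
    };

    // Assumed CDBP-style interface; the method names are illustrative only.
    class Cdbp {
    public:
        virtual ~Cdbp() {}
        virtual bool next_ready_url( UrlJob &job ) = 0;  // marks it CRAWLING; false when none left
        virtual void set_state( long url_id, UrlState state ) = 0;
        virtual void store_page( long url_id, const std::string &compressed_page ) = 0;
        virtual void store_links( long url_id, const std::vector<std::string> &links ) = 0;
    };

    // Stubs for the pieces owned by other modules (HTTP library, parser, archiver).
    static bool is_http( const std::string &url )
    { return url.compare( 0, 7, "http://" ) == 0; }
    static bool robots_allow( const std::string & )
    { return true; }                                     // stub: real check reads robots.txt
    static bool http_fetch( const std::string &, std::string &page )
    { page = "<html></html>"; return true; }             // stub: false would mean a timeout
    static std::vector<std::string> extract_absolute_links( const std::string &,
                                                            const std::string & )
    { return std::vector<std::string>(); }               // stub: real parser resolves links
    static std::string compress( const std::string &page )
    { return page; }                                      // stub: archiver does real compression

    // One of the N (say 10) crawler threads runs this loop until no URLs are left.
    void crawl_worker( Cdbp &db )
    {
        UrlJob job;
        while ( db.next_ready_url( job ) ) {
            if ( !is_http( job.url ) || !robots_allow( job.url ) ) {
                db.set_state( job.url_id, DISALLOWED );
                continue;
            }
            std::string page;
            if ( !http_fetch( job.url, page ) ) {        // timeout / server down
                db.set_state( job.url_id, UNABLE );      // rescheduled later
                continue;
            }
            db.store_links( job.url_id, extract_absolute_links( page, job.url ) );
            db.store_page( job.url_id, compress( page ) );
            db.set_state( job.url_id, COMPLETED );
        }
    }

Running ten such workers over a shared Cdbp gives the "magic number" thread pool Igor describes, and shutting them down is just a matter of making next_ready_url() return false.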