Messages per month:

| Year | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2000 |  |  |  |  |  |  |  | 14 | 17 |  |  |  |
| 2001 |  |  |  |  | 1 | 6 |  |  | 2 |  |  |  |
| 2002 |  | 2 |  |  |  |  |  |  |  |  |  | 2 |
| 2003 | 1 |  |  |  | 1 |  |  | 1 |  |  |  |  |
| 2004 | 1 |  | 2 |  |  |  | 1 |  |  |  |  |  |
| 2005 |  |  |  |  |  |  |  |  |  |  |  | 5 |
| 2006 | 4 | 3 | 5 | 1 | 7 | 16 | 2 | 2 | 4 | 20 | 17 | 6 |
| 2007 | 34 | 15 | 1 | 4 |  | 1 | 1 | 26 | 13 | 1 | 3 | 4 |
| 2008 | 6 | 3 | 29 | 19 | 12 | 9 | 23 | 9 | 6 | 10 | 31 | 45 |
| 2009 | 62 | 11 | 42 | 24 | 82 | 80 | 39 | 12 | 28 | 30 | 7 | 4 |
| 2010 | 1 |  | 45 | 57 | 65 | 75 | 31 | 45 | 26 |  |  |  |
| 2011 |  |  |  |  |  | 1 |  |  |  |  |  |  |
From: Booker N. <ewl...@be...> - 2006-01-10 08:03:06
|
This One is Strong UP 0.50 (28.57%) Jan 9th Alone Huge PR Campaign Running for Tuesday Jan 10th We expect explosive growth thru Friday Infinex Ventures Inc. (IFNX) Current Price: 2.25 UP 0.50 (28.57%) OTC: IFNX.OB COMPANY OVERVIEW Aggressive and energetic, Infinex boasts a dynamic and diversified portfolio of operations across North America, with an eye on international expansion. Grounded in natural resource exploration, Inifinex also offers investors access to exciting new developments in the high-tech sector and the booming international real estate market. Our market based experience, tenacious research techniques, and razor sharp analytical skills allow us to leverage opportunities in emerging markets and developing technologies. Identifying these opportunities in the earliest stages allows us to accelerate business development and fully realize the companyЎ¦s true potential. Maximizing overall profitability and in turn enhancing shareholder value. Current Press Release Infinex Ventures Inc. (IFNX - News) is pleased to announce the appointment of Mr. Stefano Masullo, to its Board of Directors. Mr. Michael De Rosa the President says, "Mr. Masullo's varied background in finance, engineering and economics, as well as his experience of over 10 years as a Board member of a vast number of International companies, will make him a valuable addition to the Infinex Board. His appointment will show our commitment to the financial, engineering and business structure of our Company." Mr. Masullo attended the University of Luigi Bocconi, in Milan Italy, where he graduated in industrial, economic and financial sciences. Mr. Masullo first began his well rounded career during one of his years at University (1986-1987), where he assisted the Director of Faculty of Finance in finance and investment. |
From: Timmy C. <wtu...@cn...> - 2006-01-07 22:16:37
|
H0t st0ck for New year time!! Infinex Ventures Inc. (IFNX) Current Price: 1.75 Expected 5 day: 2.30 COMPANY OVERVIEW Aggressive and energetic, Infinex boasts a dynamic and diversified portfolio of operations across North America, with an eye on international expansion. Grounded in natural resource exploration, Inifinex also offers investors access to exciting new developments in the high-tech sector and the booming international real estate market. Our market based experience, tenacious research techniques, and razor sharp analytical skills allow us to leverage opportunities in emerging markets and developing technologies. Identifying these opportunities in the earliest stages allows us to accelerate business development and fully realize the company¡¦s true potential. Maximizing overall profitability and in turn enhancing shareholder value. Current Press Release Infinex Ventures Inc. (IFNX - News) is pleased to announce the appointment of Mr. Stefano Masullo, to its Board of Directors. Mr. Michael De Rosa the President says, "Mr. Masullo's varied background in finance, engineering and economics, as well as his experience of over 10 years as a Board member of a vast number of International companies, will make him a valuable addition to the Infinex Board. His appointment will show our commitment to the financial, engineering and business structure of our Company." Mr. Masullo attended the University of Luigi Bocconi, in Milan Italy, where he graduated in industrial, economic and financial sciences. Mr. Masullo first began his well rounded career during one of his years at University (1986-1987), where he assisted the Director of Faculty of Finance in finance and investment. |
From: Roscoe - 2006-01-04 00:56:44
|
St ock Alert iPackets International, Inc. Global Developer and Provider of a Wide Range of Wireless and Communications Solutions for Selected Enterprises Including Mine-Safety (Source: News 1/3/06) OTC: IPKL Price: .35 Huge PR For Wednesday is Underway on IPKL. Short/Day Trading Opportunity for You? Sometimes it is Bang-Zoom on These Small st ocks..As Many of You may Know Recent News: Go Read the Full Stories Now! 1)iPackets International Receives US$85,000 Down Payment for Its First iPMine Deployment in China 2)iPackets International Attends Several Mining Trade Shows and Receives Tremendous Response for Its iPMine Mine-Safety Product Watch This One Trade on Wednesday! Radar it Right Now.. _______________ Information within this email contains 4rward l00 king sta tements within meaning of Section 27A of the Sec urities Act of nineteen thirty three and Section 21B of the Se curities Exchange Act of nineteen thirty four. Any statements that express or involve discussions with respect to predictions, expectations, beliefs, plans, projections, objectives, goals, assumptions future events or performance are not statements of historical fact and may be 4rward 1o0king statements. 4rward looking statements are based on ex pectations, es timates and pr ojections at the time the statements are that involve a number of risks and uncertainties which could cause actual results or events to differ materially from those presently featured Com pany is not a reporting company under the SEC Act of 1934 and there is limited information available on the company. The Co-mpany has a nominal ca sh position.It is an operating company. The company is going to need financing. If that financing does not occur, the company may not be able to continue as a going concern in which case you could lose your in-vestment. The pu blisher of this new sletter does not represent that the information contained in this mes sage states all material facts or does not omit a material fact necessary to make the statements therein not misleading. All information provided within this e_ mail pertaining to in-vesting, st 0cks, se curities must be understood as information provided and not inv estment advice. Remember a thorough due diligence effort, including a review of a company's filings when available, should be completed prior to in_vesting. The pub lisher of this news letter advises all readers and subscribers to seek advice from a registered professional securities representative before deciding to trade in st0cks featured this e mail. None of the material within this report shall be construed as any kind of in_vestment advice or solicitation. Many of these com panies on the verge of bankruptcy. You can lose all your mony by inv esting in st 0ck. The publisher of this new sletter is not a registered in-vestment advis0r. Subscribers should not view information herein as legal, tax, accounting or inve stment advice. In com pliance with the Securities Act of nineteen thirty three, Section 17(b),The publisher of this newsletter is contracted to receive fifteen th0 usand d0 l1ars from a third party, not an off icer, dir ector or af filiate shar eh0lder for the ci rculation of this report. Be aware of an inherent conflict of interest resulting from such compensation due to the fact that this is a paid advertisement and is not without bias. All factual information in this report was gathered from public sources, including but not limited to Co mpany Press Releases. 
Use of the information in this e mail constitutes your acceptance of these terms. |
From: Anderson H. <bo...@di...> - 2005-12-27 17:23:51
|
KOKO PETROLEUM (KKPT) - THIS STOCK IS UNDISCOVERED S T O C K GEM Current Price: 1.20 Symbol - KKPT Watch out the stock go crazy Tuesday morning KOKO Petroleum, Inc. (KKPT) issued an update on its working interest investment in two wells in the prolific Barnett Shale Play located in northern Texas. Under the terms of the participation agreement with Rife Energy Operating, Inc. (the program's operator), KOKO Petroleum has acquired a minority working interests (approx. 10%) in the drilling and completion of two wells; the Boyd #1 and the Inglish #2 both of which have been drilled but not yet completed. The operator is in the process of setting casing on the Inglish 2 and the Boyd is awaiting a sufficient water supply to start the completion. Due to the heavy influx of major operators in the area (Encana and XTO), scheduling completions and any other types of oil field services has been very difficult. Operators in the area have had to schedule well completions three to four months in advance. This coupled with the fact that Northern Texas has experienced a major drought causing serious shortfalls of local water. Rife, as an alternative, has drilled a water well, which was the source of drilling water for the Inglish 2 and Boyd 1. Rife has five wells that have been drilled and are awaiting completions. The Barnett Shale is the largest natural gas play in Texas. It is presently producing 900 MMCF of gas per day and is considered one of the largest U.S. domestic natural gas plays with sizable, remaining resource potential. The first Barnett Shale wells were drilled and completed in the early 1980s by Mitchell Energy of Houston, Texas. According to an in-depth 2004 sector report on the Barnett Shale, developed by Morgan Stanley (MWD), the Barnett Shale play is estimated to hold reserves in the non-core area that could be as high as 150 BCF per 1,000 acres. The report estimated that because of the amount of gas available in the area, successful wells in the Barnett Shale should be economically viable in almost any gas price environment. "The well logs are very encouraging, as were the wells they offset. Our operator is very resourceful and we should have these wells completed by the end of the year," says Ted Kozub, President of KOKO Petroleum, Inc. On the Corsicana front, KOKO and its Partner have applied for the drilling permits to commence the first 15 Nacatoch wells, casing is being delivered to the site and drilling will commence upon receipt of the permits. |
From: Lenard P. <mw...@ke...> - 2005-12-22 17:58:38
|
Explosive St=ck Alert Doll Technology Group Inc. Global Manufacturer and Marketer of "Clean & Green" Products and Technology Solutions(Source: News 12/6/05) OTC: DTGP Price: .14 Huge PR Campaign Underway For Thursday's Trading **DTGP** Can You Make Some Fast Money On This One? RECENT NEWS: Go Read The Full Stories Right Nowii 1)Doll Technology Group Begins U.S. Trials of AquaBoost(TM) 2)Doll Technology Group Announces Strategic Partnership With Land and Sea Development to Market BlazeTamer(TM) Fire Retardant Product- Initial Purchase Order Valued at Over $1.1 Million RedBrooks Laboratory, a DTGP subsidiary, is a full service independent facility that tests, qualifies and certifies all Doll Technology Group's products and services. The laboratory is one of the few government certified facilities for the testing of fire suppression systems for the aerospace, maritime, and general industries. (Source: News 12/2/05) Watch This One Trade on Thursday Radar it Right Now.. information within this email contains 4rward l00king statements within the m eaning of Sect ion twenty seven A of the Securities Act of nin eteen thirty three and Section twenty oneB of the Secu rities Exch ange Act of nineteen thirty four. Any statements that expr ess or involve discuss ions with respect to predi ctions, exp ectations, belie fs, pl ans, proj ections, objectives, g oals, assumpt ions or future events or perf ormance are not stat ements of his torical fact and may be 4 rward 1o0king statem ents. 4 rward looking stat ements are based on e xpectations, estimates and proj ections at the time the stat ements are made that in volve a nu mber of ri sks and uncer tainties wh ich could cause actual res ults or eve nts to dif fer mate rially from those p resently anticipa ted.Today's fea tured Compa ny is not a repr ting compan y und er the SEC Act of ninteen thirty four and theref ore there is limi ted inform tion availab le on the com pany. As with many micr ocap st=cks, today's company has dis closable material items you need to consider in order to make an informed and intelligent in_vestment decision. These items include: A nominal cash position. it is an operating Company. The company is going to need financing. if that financing does not occur, the company may not be able to continue as a going concern in which case you could lose your entire in-vestment. The publisher of this newsletter does not represent that the informa tion contained in this message states all ma terial facts or does not omit a mat erial fact neces sary to make the state ments therein not misle ading. All in formation provided within this e_ mail perta ining to in- vesting, st=cks, securities must be understood as informat ion provi ded and not in vest ment advice. Remember a tho rough due dilige nce effort, inc luding a review of a comp any's filings when available, should be compl eted prior to in_ vesting. The pu blisher of this newsletter advises all read ers and subs cribers to seek adv ice from a reg istered profe ssional secu rities re presentative before deciding to trade in st=cks featured within this e_ mail. None of the mat erial within this repo rt shall be co nstrued as any kind of in_vestment advice or solicitation. Many of these companies are on the verge of bankruptcy. You can lose all your mony by inv esting in this st=ck. The publisher of this newsletter is not a regis tered in- vestment advis0r. Subscribers should not view information herein as legal, t x, account ing or in vestment advice. 
in comp liance with the Secur ities Act of nineteen thirty three, Section seventeen(b),The pu blisher of this newslet ter is cont racted to receive twel ve th0us and d0l lars from a third party, not an officer, director or affiliate shareh 0lder for the circul ation of this re port. Be aware of an inher ent conf lict of int erest resu lting from su ch compen sation due to the fact that this is a paid a vertisement and is not with out b ias.The pa rty that pa ys us has a pos ition in the st=ck they will sell at any time wi hout notice. This could have a nega tive im pact on the price of the st0ck, causing you to lose mony.Their intent ion is to sell now. All fa ctual inf ormation in this report was gathered from public sources,including but not limited to Company Press Releases. Use of the info rmation in this email cons titutes your accep tance of these terms. |
From: M. D. S. <dt...@dt...> - 2003-08-24 17:46:16
|
It is pretty apparent that the SF lists for grub aren't the currently active ones. Where is the current grub development list, and how can I join it? -drew -- M. Drew Streib <dt...@dt...> Independent Rambler, Software/Standards/Freedom/Law -- http://dtype.org/ |
From: <9h...@ms...> - 2003-05-24 03:39:11
|
From: <tr...@sp...> - 2003-01-09 03:35:52
|
Ok, so if someone were to want to start using the results for a search engine (possibly myself), I think a couple things would have to happen. 1. The client would have to be a bit smarter. Since the point of this is to not have an office building full of computers grinding away at processing these results, the client should do as much processing as possible. For example, process and index a page and only send back things like keyword counts, page position, relevancy, weight, etc. as predefined by some set of rules. Now saying that, this is where people can hack it and send back whatever they want. Time to close the source? Or possibly and hopefully, make it plug-in capable. The plug-ins would be used to do different things with the results and the plug-ins could be closed leaving the client open. This would also allow for extensibility of the client to do other things. Another way to prevent the hack would be to have multiple clients return results from teh same urls and compare. 2. The search engine server can take the client results and store them and use them however it sees fit. Obviously with the goal of having better results than google (impossible?). The search engine server should be able to grab results via web service like the Google API. Then any engine can grab the results and process them and come up with the best scheme to see what's relevant. If nothing else, this would be a very interesting project to actually make grub commercially viable and possibly get a lot of new attention seeing as how it might actually be useful. Travis Reeder ----- Original Message ----- From: "Kord Campbell" <ko...@gr...> To: <gru...@li...> Cc: <gru...@li...> Sent: Monday, December 30, 2002 4:06 PM Subject: [Grub-develop] Indexing plans for the data? Users want to know. > Hi, > > I copied the general list on this email as I thought everyone > might get something out of the explanation that I give in > response to Travis' concerns. > > 1. Is there any indexing happening right now? > > First, and as many of you may know, we do NOT index the results > from the crawls that are done by the clients. However, we do > keep the status info of the URLs and the returned data for the > last 24 hour crawl cycle. > > 1a. What is being done with the client results? > > The URL meta data (update rate, update time, down rate, etc.) > is available through a XML interface with our SQL server, and > the crawl data is available via an ftp site. We have, on > occasion, had people request access to this data. If anyone > wishes access to these resources, we will try to oblige. Of > course people wishing to pull a full feed from us or do 1,000s > of queries to the database (small server here folks) will need > to discuss other options with us. > > Please also keep in mind that we are still in TESTING, and that > the results returned right now are NOT 100% reliable. This means > if someone were using our data, we couldn't guarantee that the > data was good, and that the crawl rate would be stable. > > Time will fix this, of course. ;) > > 2. What database platform are you using? > > MySQL. It's quite fast - seriously. > > 3. What rules you are setting for ranking keywords, ranking pages, etc? > > Again, we are a CRAWLING engine, not a search engine. When the > time comes, we expect other search engines to pull data from > the service. 
This means they don't have to crawl their own set > of URLs, which decreases crawl bandwidth on the net, and increases > the crawl rate of the sites - which also increases the quality and > relevance of a search done on those sites. > > If anyone has any questions or comments about any of this, please > feel free to post to the list! > > Happy holidays! > > Kord > > > > > Message: 1 > > Date: Sun, 29 Dec 2002 16:01:58 -0700 (MST) > > From: tr...@sp... > > To: gru...@li... > > Subject: [Grub-develop] Search page > > > > What's the plans for this area? Is anybody working on indexing and getting the actual search page going? I'm finding it kind of useless to be running the client for no purpose. Like what's the point of running it right now if nobody can reap the benefits? > > > > So here's some questions: > > 1. Is there any indexing happening right now? What is being done with the client results? > > 2. What database platform are you using? > > 3. What rules you are setting for ranking keywords, ranking pages, etc? > > > > Travis Reeder > > Space Program > > http://www.spaceprogram.com > > > > > > > > --__--__-- > > > > _______________________________________________ > > Grub-develop mailing list > > Gru...@li... > > https://lists.sourceforge.net/lists/listinfo/grub-develop > > > > > > End of Grub-develop Digest > > > > -- > -------------------------------------------------------------- > Kord Campbell Grub, Inc. > President 5500 North Western Avenue #101C > Oklahoma City, OK 73118 > ko...@gr... Voice: (405) 848-7000 > http://www.grub.org Fax: (405) 848-5477 > -------------------------------------------------------------- > > > > ------------------------------------------------------- > This sf.net email is sponsored by:ThinkGeek > Welcome to geek heaven. > http://thinkgeek.com/sf > _______________________________________________ > Grub-develop mailing list > Gru...@li... > https://lists.sourceforge.net/lists/listinfo/grub-develop > |
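As a rough sketch of the kind of compact record Travis describes the client sending back (keyword counts, page position, relevancy, weight) in place of raw pages — every field name below is hypothetical, not part of any actual grub protocol:

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical per-page summary a smarter client could return instead of the
// fetched HTML itself. Field names are illustrative only.
struct PageSummary {
    std::string url;                          // page that was crawled
    std::string content_hash;                 // lets the server compare clients' results
    std::map<std::string, uint32_t> keywords; // keyword -> occurrence count
    std::vector<std::string> out_links;       // links discovered on the page
    double relevancy = 0.0;                   // score computed under server-defined rules
    uint32_t fetch_time_ms = 0;               // how long the fetch took
};
```

Carrying a hash of the fetched content in such a record would also make the cross-client comparison Travis mentions cheap to evaluate on the server side.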
From: Kord C. <ko...@gr...> - 2002-12-30 22:03:00
|
Hi, I copied the general list on this email as I thought everyone might get something out of the explanation that I give in response to Travis' concerns. 1. Is there any indexing happening right now? First, and as many of you may know, we do NOT index the results from the crawls that are done by the clients. However, we do keep the status info of the URLs and the returned data for the last 24 hour crawl cycle. 1a. What is being done with the client results? The URL meta data (update rate, update time, down rate, etc.) is available through a XML interface with our SQL server, and the crawl data is available via an ftp site. We have, on occasion, had people request access to this data. If anyone wishes access to these resources, we will try to oblige. Of course people wishing to pull a full feed from us or do 1,000s of queries to the database (small server here folks) will need to discuss other options with us. Please also keep in mind that we are still in TESTING, and that the results returned right now are NOT 100% reliable. This means if someone were using our data, we couldn't guarantee that the data was good, and that the crawl rate would be stable. Time will fix this, of course. ;) 2. What database platform are you using? MySQL. It's quite fast - seriously. 3. What rules you are setting for ranking keywords, ranking pages, etc? Again, we are a CRAWLING engine, not a search engine. When the time comes, we expect other search engines to pull data from the service. This means they don't have to crawl their own set of URLs, which decreases crawl bandwidth on the net, and increases the crawl rate of the sites - which also increases the quality and relevance of a search done on those sites. If anyone has any questions or comments about any of this, please feel free to post to the list! Happy holidays! Kord > > Message: 1 > Date: Sun, 29 Dec 2002 16:01:58 -0700 (MST) > From: tr...@sp... > To: gru...@li... > Subject: [Grub-develop] Search page > > What's the plans for this area? Is anybody working on indexing and getting the actual search page going? I'm finding it kind of useless to be running the client for no purpose. Like what's the point of running it right now if nobody can reap the benefits? > > So here's some questions: > 1. Is there any indexing happening right now? What is being done with the client results? > 2. What database platform are you using? > 3. What rules you are setting for ranking keywords, ranking pages, etc? > > Travis Reeder > Space Program > http://www.spaceprogram.com > > > > --__--__-- > > _______________________________________________ > Grub-develop mailing list > Gru...@li... > https://lists.sourceforge.net/lists/listinfo/grub-develop > > > End of Grub-develop Digest > -- -------------------------------------------------------------- Kord Campbell Grub, Inc. President 5500 North Western Avenue #101C Oklahoma City, OK 73118 ko...@gr... Voice: (405) 848-7000 http://www.grub.org Fax: (405) 848-5477 -------------------------------------------------------------- |
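The URL metadata Kord lists (update rate, update time, down rate) suggests a per-URL record roughly shaped like the one below; this is only a guess for illustration, since the actual SQL schema is never shown in this thread:

```cpp
#include <ctime>
#include <string>

// Assumed shape of the per-URL metadata exposed through the XML interface.
// Field names and types are guesses, not the real grub schema.
struct UrlMetadata {
    std::string url;
    std::time_t last_update  = 0;    // when the content last changed
    std::time_t last_crawled = 0;    // when a client last fetched it
    double      update_rate  = 0.0;  // how frequently the page tends to change
    double      down_rate    = 0.0;  // fraction of fetch attempts that failed
};
```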
From: <tr...@sp...> - 2002-12-29 23:10:09
|
What's the plans for this area? Is anybody working on indexing and getting the actual search page going? I'm finding it kind of useless to be running the client for no purpose. Like what's the point of running it right now if nobody can reap the benefits? So here's some questions: 1. Is there any indexing happening right now? What is being done with the client results? 2. What database platform are you using? 3. What rules you are setting for ranking keywords, ranking pages, etc? Travis Reeder Space Program http://www.spaceprogram.com |
From: Daniel S. <da...@ha...> - 2002-02-08 07:53:34
|
On Thu, 7 Feb 2002, Kord Campbell wrote: > Our crawler uses the cURL libraries, and we've been working on getting it > running in cygwin for the past few weeks. We have (apparently) run into a > problem with the gethostbyname calls that cURL uses. I'll try to reply with information about curl stuff, more genericly. Unfortunately, I don't have any detailed insights in the dungeons of cygwin internals. > When running the crawler with more than one thread, and after a bit of time > passes, the crawler will crash inside the cURL routines, right where cURL > accesses the gethostbyname funtion. > > As we understand it, cygwin does not offer a reentrant version of > gethostbyname (gethostbyname_r coming to mind), and as such may be > susceptible to errors when used with multiple threads. This also apparently > breaks the reentrant capabilities of cURL libraries themselves, when > compiled and used under cygwin. If that is indeed the case, then yes, libcurl will not be working fully re-entrant. > The nut of our question is whether anyone else can confirm or deny any > problems with the gethostbyname function in cygwin, using cURL I would recommend you to take this question to a cygwin forum where people with knowledge about internals like this might be likely to hang out. I am also interested in getting to know if this truly is the case or not. > and if confirmed, what was done to work around this problem? Unregarding of what operating system you use, this could happen. (I here assume that your program is at least somewhat portable.) Not all operating systems provide thread-safe versions of the name resolving functions. What to do? Well, if you can't avoid using libcurl from several simultanous threads you need to protect the name resolving function with a mutex or something, so that only one function call will be used at any given moment. Those thread synchronising mechanisms aren't very portable either though. To make things even worse, it is next to impossible for a configure script or similar to actually find out if a platform has a thread-safe gethostbyname() or not, since several platforms these days actually have a gethostbyname() function (and not gethostbyname_r()) that works in a thread-safe manner! -- Daniel Stenberg -- curl groks URLs -- http://curl.haxx.se/ |
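A minimal sketch of the workaround Daniel describes: serialize every call into the non-reentrant resolver behind a single lock. The wrapper below is illustrative rather than grub or libcurl code, and because libcurl performs its own lookups internally, in practice the same mutex would have to be held around the libcurl transfer calls themselves:

```cpp
#include <netdb.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <pthread.h>
#include <cstring>

// One process-wide lock so only one thread is ever inside gethostbyname(),
// which this thread reports as non-reentrant under cygwin.
static pthread_mutex_t resolver_lock = PTHREAD_MUTEX_INITIALIZER;

// Resolve 'name' to an IPv4 address; returns 0 on success.
int locked_resolve(const char *name, struct in_addr *out)
{
    int rc = -1;
    pthread_mutex_lock(&resolver_lock);
    struct hostent *he = gethostbyname(name);
    if (he != NULL && he->h_addrtype == AF_INET && he->h_addr_list[0] != NULL) {
        // Copy the result while still holding the lock: hostent points to
        // static storage that the next call will overwrite.
        std::memcpy(out, he->h_addr_list[0], sizeof(*out));
        rc = 0;
    }
    pthread_mutex_unlock(&resolver_lock);
    return rc;
}
```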
From: Kord C. <ko...@gr...> - 2002-02-07 19:14:26
|
Our crawler uses the cURL libraries, and we've been working on getting it running in cygwin for the past few weeks. We have (apparently) run into a problem with the gethostbyname calls that cURL uses. When running the crawler with more than one thread, and after a bit of time passes, the crawler will crash inside the cURL routines, right where cURL accesses the gethostbyname function. As we understand it, cygwin does not offer a reentrant version of gethostbyname (gethostbyname_r coming to mind), and as such may be susceptible to errors when used with multiple threads. This also apparently breaks the reentrant capabilities of cURL libraries themselves, when compiled and used under cygwin. The nut of our question is whether anyone else can confirm or deny any problems with the gethostbyname function in cygwin, using cURL, and if confirmed, what was done to work around this problem? Thanks, Kord -------------------------------------------------------------- Kord Campbell Grub.Org Inc. President 6051 N. Brookline #118 Oklahoma City, OK 73112 ko...@gr... Voice: (405) 843-6336 http://www.grub.org Fax: (405) 848-5477 -------------------------------------------------------------- |

From: M. D. S. <dt...@dt...> - 2001-09-16 22:23:48
|
I know that this is a known bug, but not sure of its status: The grub client seems to SIGSEGV about 1 drop in 3, and subsequently restarts and continues. I put a strace on the process when running it and have attached the last 1000 lines before the segfault. System Info: Debian GNU/Linux Potato (Stable) Grub Client 0.1.6 (from source tar.gz) Running 15 concurrent wgets, and b/w limiting to 256KB/sec -drew -- M. Drew Streib <dt...@dt...> | http://dtype.org/ FSG <dt...@fr...> | Linux International <dt...@li...> freedb <dt...@fr...> | SourceForge <dt...@so...> |
From: Martin K. <rai...@ya...> - 2001-09-01 14:29:55
|
Has Grub deceased? ===== http://devzero.ath.cx/ Visit the Systems Information Database Have some interesting information? Put it up on the SID. -Martin Klingensmith __________________________________________________ Do You Yahoo!? Get email alerts & NEW webcam video instant messaging with Yahoo! Messenger http://im.yahoo.com |
From: Jesper J. <ju...@ei...> - 2001-06-23 14:26:31
|
Vaclav Barta wrote: > > P.S.: Am I supposed to see my own messages posted to the > Grub-develop list if I'm subsribed to it? Yup, you should get your own messages back, but you really should be using the grub-general list. As far as I can tell that's what everybody else does (although traffic on that list has been extremely low lately?). Best regards, Jesper Juhl PS. I don't know what everybody else thinks, but I personally prefer unified diff format (diff -u) for patches. Could we get some official statement on this? Ozra? Kord? |
From: Vaclav B. <vb...@co...> - 2001-06-23 11:06:01
|
Hi, since gcc-3.0 is out, I think it would be nice if the grub client compiled with it... It doesn't, but the problem is simple to repair - basically gcc-3.0 starts treating namespaces seriously, so it's necessary to add a bunch of qualifications/using directives. Please find attached a patch (of today's CVS sources) doing that. If you look at it before applying (which I would encourage :-) ), you'll also find a couple of comments I'd like to act on - I may have too strict definition of what a bug is, but code like ClientDB::ClientDB(GrubCLog *log) { try { arch = new archive(); } catch (ArException ex) { cout<< "Caught Exception: " << ex.getErrno() << ": " << ex.getDescription() << endl; } just irritates me every time I see it... Bye Vasek P.S.: Am I supposed to see my own messages posted to the Grub-develop list if I'm subsribed to it? I think I haven't seen my last message coming back and it really would be unfortunate if the mail setup was wrong, I was talking to the wall and didn't even know it... |
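For anyone who has not yet met the gcc 3.0 change Vaclav mentions, the kind of edit such a patch makes is roughly the following generic illustration (not a hunk from the actual patch):

```cpp
#include <iostream>
#include <string>

// gcc 2.x accepted unqualified standard-library names; gcc 3.0 places them
// firmly in namespace std. Either qualify each use explicitly...
void report(const std::string &what)
{
    std::cout << "Caught Exception: " << what << std::endl;
}

// ...or pull the specific names in with using-declarations (narrower and
// safer than a blanket "using namespace std;" in a header).
using std::cout;
using std::endl;

void report_again(const std::string &what)
{
    cout << "Caught Exception: " << what << endl;
}

int main()
{
    report("archive open failed");
    report_again("archive open failed");
    return 0;
}
```

The ClientDB snippet he quotes would also conventionally catch by const reference, catch (const ArException &ex), rather than by value, though that is a style point separate from the gcc 3.0 namespace breakage.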
From: Vaclav B. <vb...@co...> - 2001-06-12 20:25:58
|
Igor Stojanovski wrote: > > sources - but I'm not getting very far... :-( I think perhaps > > if I wrote comments and cleaned things up as I go along, I > > would have a track to keep on - but of course I don't want to > > throw those changes away... Ideally, I'd like somebody else to > I would like to see them. As a very small example of what I mean, please find attached changed Coordinator.h & Coordinator.cpp (from today's CVS). Basically I removed the USE_THREADS condition (per the on-line TODO entry "remove threads on the client" - BTW a TODO list posted, say, monthly on Grub-develop would be nice) and added a few (more-or-less critical :-) ) comments. As I expected, a number of problems presented themselves: :-/ - I shouldn't be posting whole files, only diffs - but I don't know how to work on my changes *and* preserve the CVS version at the same time. What is the normal CVS setup for that? Is copying the whole source tree before changing anything (which I forgot to do :-( ) really the simplest way? - The changed client compiles but that really isn't enough - how do I test? I noticed a number of test files/executables in the package - is there some harness to run them? Bye Vasek |
From: Igor S. <oz...@gr...> - 2001-06-11 23:10:29
|
> -----Original Message----- > From: gru...@li... > [mailto:gru...@li...]On Behalf Of Vaclav > Barta > Sent: Monday, June 11, 2001 2:25 PM > To: Gru...@li... > Subject: [Grub-develop] Incremental changes to grub client - any > interest? > > > Hi, > > in the past few weeks (on and off), I've been trying to read grub > sources - but I'm not getting very far... :-( I think perhaps if > I wrote comments and cleaned things up as I go along, I would have > a track to keep on - but of course I don't want to throw those > changes away... Ideally, I'd like somebody else to review them > and merge them to the official tree - any takers? :-) > > Bye > Vasek I would like to see them. ozra. > > _______________________________________________ > Grub-develop mailing list > Gru...@li... > http://lists.sourceforge.net/lists/listinfo/grub-develop |
From: Jesper J. <ju...@ei...> - 2001-06-11 20:28:50
|
Vaclav Barta wrote: > Hi, > > in the past few weeks (on and off), I've been trying to read grub > sources - but I'm not getting very far... :-( I think perhaps if > I wrote comments and cleaned things up as I go along, I would have > a track to keep on - but of course I don't want to throw those > changes away... Ideally, I'd like somebody else to review them > and merge them to the official tree - any takers? :-) > I wouldn't mind taking a look :) - Jesper Juhl |
From: Vaclav B. <vb...@co...> - 2001-06-11 19:37:22
|
Hi, in the past few weeks (on and off), I've been trying to read grub sources - but I'm not getting very far... :-( I think perhaps if I wrote comments and cleaned things up as I go along, I would have a track to keep on - but of course I don't want to throw those changes away... Ideally, I'd like somebody else to review them and merge them to the official tree - any takers? :-) Bye Vasek |
From: Igor S. <oz...@cr...> - 2001-05-15 23:49:03
|
one-two, one-two. |
From: Igor S. <oz...@gr...> - 2000-09-27 01:27:55
|
-------------------------------------------------------------- Igor Stojanovski Grub.Org Inc. Chief Technical Officer 5100 N. Brookline #830 Oklahoma City, OK 73112 oz...@gr... Voice: (405) 917-9894 http://www.grub.org Fax: (405) 848-5477 -------------------------------------------------------------- -----Original Message----- From: rda...@gi... [mailto:rda...@gi...]On Behalf Of Rodrigo Damazio Sent: Monday, September 25, 2000 9:01 PM To: Igor Stojanovski Subject: Re: FW: Storing and scheduling URLs Igor Stojanovski wrote: > Here is what I was thing about scheduling. It is not too complicated for > implementation. > > Say, these are the fields of the URL table: > > url_id, next_in_list, next_update, other fields... (they are unimportant in > our case) > > ...where next_in_q is a url_id of a next in the list of URLs to be crawled > in the same time interval. next_update is calculated time interval which > tells how much time must elapse before it is scheduled again, and which is > calulated by halving/doubling (or by factor of 1.5) based on how often it > changed. > > For example, let's say the first day grub Clients run we receive four new > URLs (never crawled before), and we stick them in the URL table: > > URL table: > url_id next_in_list next_update in_crawl > U1 U2 8 0 > U2 U3 8 0 > U3 U4 8 0 > U4 (null) 8 0 > > next_update tells that once this URL is crawled; if its contents changed > (which is sure to happen for never-crawled pages), the time will be halved > to 4 for all of them. > > Then in addition we need a table to represent a time line. Let's say that > the smallest time interval is a week, and the week_id is computed as a > number of weeks since the beginning of year 2000: > > Timeline table: > time_id front_url_id > 0 U1 > > (Say that this week is 30th.) When week 30 comes, a queue of URLs is > assembled via the linked lists of URLs that the timeline table points to, > using front_url_id. Here we have time_id = 0, where zero is a special value > meaning new URLs never crawled. Now it is week 30, but since we don't have > an entry for time_id = 30, we go to time_id = 0 to get URLs never crawled > before to attain a crawling queue. It points to U1, and using next_in_list > we can derive the queue: > > QUEUE: U1, U2, U3, U4. > > We also set the in_crawl flag to 1. This may be useful to make a clean-up > of URLs that were never reported back from Clients. Hmmm so far sounds good, I just don't think we need to use the next_id thing - we can just do a query like "give me all URLs with time_id = 0", something like that... > When Clients report back the results, the in_crawl is set back to zero, and > next_update is halved: Note: it's only halved if the content has changed!! [ozra] That's what I am saying, too. > URL table: > url_id next_in_list next_update in_crawl > U1 U2 4 0 > U2 U3 4 0 > U3 U4 4 0 > U4 (null) 4 0 > > We create entry for time_id = 34 (current week plus next_update -- 30 + 4) > and now we have: > > Timeline table: > time_id front_url_id > 0 (null) -- we assume we never got any new URLs > 34 U1 > > ...34 is the week these URLs should be recrawled. > > Now, week 34 comes along, and we derive a queue again, which would look the > same (assuming that we never got any new URLs in meantime): > > QUEUE: U1, U2, U3, U4. > > Say URLs U1 and U3 have been updated, and U2, U4 not. > > Then we would double/half the next_update fields appropriately, and now > there would be two linked lists (U1->U3, and U2->U4). Every URL in the > table belong to a linked list. 
> > URL table: > url_id next_in_list next_update in_crawl > U1 U3 2 0 > U2 U4 8 0 > U3 (null) 2 0 > U4 (null) 8 0 > > Timeline table: > time_id front_url_id > 0 (null) > 36 U1 > 42 U2 > > When week 36 comes, a queue containing U1 and U3 will be crawled, and so on. > I think you got my point. > > When we add a new URLs, it is prepended at (inserted at beginning of the > list of) the list under time_id = 0. > > Deleting URLs from the URL list should not be very difficult. We might add > an extra flag to indicate DELETED, and any time a QUEUE for URLs to be > crawled is created, another queue for URLs to be deleted can be created at > the same time as well by looking at the deleted flag. Which means, that > URLs will not be immediately removed, because they are part of the linked > list. Of course, we could devise a doubly linked list, but this is > absolutely unnecessary. > > At the beginning we will be overwhelmed with new URLs to be crawled, and if > we don't think smart about it, we will not have any updates on previously > crawled URLs, but only crawling new URLs, or the other way around. That is > why all new URLs are put under time_id = 0, so that our algorithm may > combine crawling new URLs and old URLs. It would be great if (for example) > 60% of the crawling is spent on crawling (and finishing) old URLs, and 40% > on newly-found ones. In this way our database will grow and will be > up-to-date. But if crawling old URLs takes around or more then 100%, then > we will have to sacrifice the "up-to-dateness" for crawling new URLs. This is good for the beginning, BUT, say, if we get to have a one-million-URL queue for a certain week, and we only get a small part of that crawled, what happens?? It's a cumulative effect...we have to think of a way to build a queue that doesn't expect a "crawling schedule" to be met by the clients...like my random idea was... [ozra] Well, the queue I am proposing is not schedule-aware, it's just plain and simple. In anyway, you are right, and I think the cumulative effect will occur inevitably. I think I have a solution to this problem. First, let's not forget what our goal is -- the most up-to-date, and later, the most comprehensive search engine on the net (the second is when we get enough Clients). The up-to-date part will be respected from the very beginning. For what I am proposing, first, let's keep in mind three things -- everything I said about the algorithm in my previous email is unchanged (except on how to schedule the URLs for crawling). Second, there will be no list that belongs to a time_id which is in the past, even if it means that when we are backed up we move the whole remaining list to the following day, and third, the Time table in the real workable system is with "resolution" of one day instead of one week (I used week as an example only). We will know (i.e. estimate) the capability our Clients prior to or at the beginning of each day. For this we will use some kind of a prediction function such as moving average (I don't know which is right, I am not a mathematician). We store that value in URLS_PER_DAY variable at beginning of each day. 
Second, we take the average of sizes of each list of URLs, which is same as total number of URLs that do not belong to the new URLs list (for which time_id = 0): Total_number_of_URLs_in_our_database - Number_of_URLs_for_time_id_equals_zero AVG_LIST_SIZE = -------------------------------------------------------------------------- --- Number_of_entries_in_Time_table - 1 ...and store the value in AVG_LIST_SIZE. So we have the two variables -- URLS_PER_DAY and AVG_LIST_SIZE. The goal here is to have these two values as equal as possible. Because this way our database will always be up-to-date (as much as our algorithm permits), and it will grow only when the number of Clients increases (more correctly, the total ability of the Clients to crawl increases, which should be proportional). And we will know that when the value of URLS_PER_DAY increases. When this happens, we will peek into our new (never crawled) URLs, and send them to the Clients to close the gap, i.e., increase the AVG_LIST_SIZE by scheduling more URLs to the queue. Now, here comes a problem -- what if URLS_PER_DAY decreases? Well, perhaps we may randomly pick URLs to delete from our database (and the indexed data relating to them), or something else -- diminishing our database seems a kind of silly to me. But give me some other ideas. Well, in order to avoid such occurrences from happening often, there must be some gap allowed between AVG_LIST_SIZE and URLS_PER_DAY, in that AVG_LIST_SIZE should always be kept certain percentage lower, instead of making them equal. Also, no prediction will be achieved exactly. If the queue has not emptied completely, it will be inserted at the beginning of the list for the next day. If it was emptied earlier, we should take some URLs from the following day and crawl them earlier. The bad cumulative effect should not take place here as we are protected by the AVG_LIST_SIZE / URLS_PER_DAY ratio to understand and to take care of the any extra URLs that may cause that. > To cope with the problem, if it is week 50 and we are backed up to week 35, > our algorithm must be configurable so that we may decrease the amount of new > URLs being crawled. Hmmm it's still cumulative though...we would get to a point where the site might just "stop" by the crawling bottleneck...it's a geometric progression... [ozra] Said above. > Plus if we are so overwhelmed by new coming URLs, we may schedule them in > the future, and not make them due immediately, or even reschedule sublists > of URLs, which is a very efficient and simple operation. Hmmm we really needed to run a simulation on all this , if we use a maldesigned algorythm we can screw up the whole thing... [ozra] That's a good idea. > I am just not sure about one thing -- there will be a lot of simple SQL > statements needed to traverse through the linked lists. I don't know how > much of a burden is this. We will probably need hundreds of thousands to a > million queries to build the URL queue (once grub takes off, of course). That's not too much...it's very possible, depends only on having good enough hardware to run it fast...like good Fibre channel storage with 800Mhz memory and a few Xeon processors... > Besides this, another serious problem with this system may be -- what if we > lose part of the URL database (due to hardware failure, for example)? Then, > chances are, most of the linked lists will be screwed up. Well, that not so > bad of the problem after all (for our system -- losing the URL list by > itself is extremely terrible anyway). 
Why linked lists?? And we have to have a good backup system, we can't risk losing our database...I suggest optical storage systems...those can easily get up to one terabyte... > Give me your thoughts. I like your ideas, we just have to think more about the URL overcrowding...think like this - in average, each new crawled URL will have 5 to 10 new links, and that goes like that in a G.P. again...while our users increase in a A.P.(hey, that sounds like the Malthusian Theory for computing LOL)... > Cheers, > > ozra. Max |
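To make the URLS_PER_DAY / AVG_LIST_SIZE balance concrete — the formula above reads AVG_LIST_SIZE = (Total_number_of_URLs_in_our_database - Number_of_URLs_for_time_id_equals_zero) / (Number_of_entries_in_Time_table - 1) — the daily planning step might be sketched as below. The moving-average window, the 10% headroom, and all names are assumptions for illustration, not existing grub code:

```cpp
#include <algorithm>
#include <cstdint>
#include <deque>
#include <numeric>

// Sketch of the capacity-balancing idea: estimate how many URLs the clients
// can crawl per day, compare it with the average per-day list size, and pull
// new (never-crawled) URLs only to close the gap.
struct CrawlPlanner {
    std::deque<uint64_t> recent_daily_totals;   // URLs actually crawled, last N days
    uint64_t total_urls = 0;                    // all URLs in the database
    uint64_t new_urls = 0;                      // URLs under time_id = 0
    uint64_t timeline_entries = 0;              // entries in the Time table

    // URLS_PER_DAY: simple moving average of recent client throughput.
    double urls_per_day() const {
        if (recent_daily_totals.empty()) return 0.0;
        double sum = std::accumulate(recent_daily_totals.begin(),
                                     recent_daily_totals.end(), 0.0);
        return sum / recent_daily_totals.size();
    }

    // AVG_LIST_SIZE = (total URLs - never-crawled URLs) / (timeline entries - 1)
    double avg_list_size() const {
        if (timeline_entries <= 1) return 0.0;
        return double(total_urls - new_urls) / double(timeline_entries - 1);
    }

    // How many new URLs to take from the time_id = 0 list today, keeping
    // AVG_LIST_SIZE roughly 10% below the estimated daily capacity.
    uint64_t new_urls_to_schedule() const {
        double target = 0.9 * urls_per_day();
        double gap = target - avg_list_size();
        return gap > 0.0 ? uint64_t(std::min<double>(gap, double(new_urls))) : 0;
    }
};
```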
From: Igor S. <oz...@gr...> - 2000-09-25 19:11:16
|
Here is what I was thing about scheduling. It is not too complicated for implementation. Say, these are the fields of the URL table: url_id, next_in_list, next_update, other fields... (they are unimportant in our case) ...where next_in_q is a url_id of a next in the list of URLs to be crawled in the same time interval. next_update is calculated time interval which tells how much time must elapse before it is scheduled again, and which is calulated by halving/doubling (or by factor of 1.5) based on how often it changed. For example, let's say the first day grub Clients run we receive four new URLs (never crawled before), and we stick them in the URL table: URL table: url_id next_in_list next_update in_crawl U1 U2 8 0 U2 U3 8 0 U3 U4 8 0 U4 (null) 8 0 next_update tells that once this URL is crawled; if its contents changed (which is sure to happen for never-crawled pages), the time will be halved to 4 for all of them. Then in addition we need a table to represent a time line. Let's say that the smallest time interval is a week, and the week_id is computed as a number of weeks since the beginning of year 2000: Timeline table: time_id front_url_id 0 U1 (Say that this week is 30th.) When week 30 comes, a queue of URLs is assembled via the linked lists of URLs that the timeline table points to, using front_url_id. Here we have time_id = 0, where zero is a special value meaning new URLs never crawled. Now it is week 30, but since we don't have an entry for time_id = 30, we go to time_id = 0 to get URLs never crawled before to attain a crawling queue. It points to U1, and using next_in_list we can derive the queue: QUEUE: U1, U2, U3, U4. We also set the in_crawl flag to 1. This may be useful to make a clean-up of URLs that were never reported back from Clients. When Clients report back the results, the in_crawl is set back to zero, and next_update is halved: URL table: url_id next_in_list next_update in_crawl U1 U2 4 0 U2 U3 4 0 U3 U4 4 0 U4 (null) 4 0 We create entry for time_id = 34 (current week plus next_update -- 30 + 4) and now we have: Timeline table: time_id front_url_id 0 (null) -- we assume we never got any new URLs 34 U1 ...34 is the week these URLs should be recrawled. Now, week 34 comes along, and we derive a queue again, which would look the same (assuming that we never got any new URLs in meantime): QUEUE: U1, U2, U3, U4. Say URLs U1 and U3 have been updated, and U2, U4 not. Then we would double/half the next_update fields appropriately, and now there would be two linked lists (U1->U3, and U2->U4). Every URL in the table belong to a linked list. URL table: url_id next_in_list next_update in_crawl U1 U3 2 0 U2 U4 8 0 U3 (null) 2 0 U4 (null) 8 0 Timeline table: time_id front_url_id 0 (null) 36 U1 42 U2 When week 36 comes, a queue containing U1 and U3 will be crawled, and so on. I think you got my point. When we add a new URLs, it is prepended at (inserted at beginning of the list of) the list under time_id = 0. Deleting URLs from the URL list should not be very difficult. We might add an extra flag to indicate DELETED, and any time a QUEUE for URLs to be crawled is created, another queue for URLs to be deleted can be created at the same time as well by looking at the deleted flag. Which means, that URLs will not be immediately removed, because they are part of the linked list. Of course, we could devise a doubly linked list, but this is absolutely unnecessary. 
At the beginning we will be overwhelmed with new URLs to be crawled, and if we don't think smart about it, we will not have any updates on previously crawled URLs, but only crawling new URLs, or the other way around. That is why all new URLs are put under time_id = 0, so that our algorithm may combine crawling new URLs and old URLs. It would be great if (for example) 60% of the crawling is spent on crawling (and finishing) old URLs, and 40% on newly-found ones. In this way our database will grow and will be up-to-date. But if crawling old URLs takes around or more then 100%, then we will have to sacrifice the "up-to-dateness" for crawling new URLs. To cope with the problem, if it is week 50 and we are backed up to week 35, our algorithm must be configurable so that we may decrease the amount of new URLs being crawled. Plus if we are so overwhelmed by new coming URLs, we may schedule them in the future, and not make them due immediately, or even reschedule sublists of URLs, which is a very efficient and simple operation. I am just not sure about one thing -- there will be a lot of simple SQL statements needed to traverse through the linked lists. I don't know how much of a burden is this. We will probably need hundreds of thousands to a million queries to build the URL queue (once grub takes off, of course). Besides this, another serious problem with this system may be -- what if we lose part of the URL database (due to hardware failure, for example)? Then, chances are, most of the linked lists will be screwed up. Well, that not so bad of the problem after all (for our system -- losing the URL list by itself is extremely terrible anyway). Give me your thoughts. Cheers, ozra. -------------------------------------------------------------- Igor Stojanovski Grub.Org Inc. Chief Technical Officer 5100 N. Brookline #830 Oklahoma City, OK 73112 oz...@gr... Voice: (405) 917-9894 http://www.grub.org Fax: (405) 848-5477 -------------------------------------------------------------- -----Original Message----- From: rda...@ca... [mailto:rda...@ca...]On Behalf Of Rodrigo Damazio Sent: Friday, September 22, 2000 4:27 PM To: Igor Stojanovski Subject: Re: Storing and scheduling URLs Igor Stojanovski wrote: > To Rodrigo: > > We need to add a table(s) to our database that will store the URLs and some > statistics along with it. We need a mechanism which will use the statistics > for dispatching/scheduling URLs to the Clients to crawl. We have already > talked on this issue. > > When Clients connect to the Server, they make a request. The response is a > list of URLs for crawling. This module should offer an interface that will > return a list, and update this action appropriately to the DB. > > I would like you to work on this module. > > You should devise an algorithm that will figure out how to schedule the > URLs. Use some of the ideas we have already presented via our emails. I > pasted an excerpt form an older email at the bottom of the msg. > > You must take into account several things in you model, though: > 1) At the beginning, we will not have many Clients at our disposal, and our > database will be overwhelmed with new URLs to be crawled. You must design > this algorithm so that even when we have millions of URLs that were never > crawled, our Clients will get back to those old pages, so that our database > will stay up-to-date as much as possible. Remember, our goal is to have the > most up-to-date search engine on the net. 
> 2) In future, we will provide means to measure each Client's crawling > performance (pages crawled/day), so that we can assign appropriate number of > URLs to each one of them. Don't worry about this one for now. > 3) Also, we must think security. We may need to introduce a certain amount > of redundancy in order to check whether we get good data from our Clients. > For example, we may have 10% redundancy in crawling. If data does not match > from two Clients, a third Client may be assigned to crawl the page in > question, and to figure out which Client "cheated". Of course, the page may > have changed in the short amount of time the two Clients crawled, and we may > wrongfully conclude that a Client is rogue. Anyway, I say, don't worry > about security for now. Let's leave this for a later stage. > 4) The URL scheduling algorithm must be highly configurable and modular > enough so that we may add new capabilities to it easily. > 5) Many other things I haven't accounted for. Like for example, taking into > account the proximity of Clients to sites in dispatching the URLs... > > >From an old message: > > About dispatching/scheduling URLs to Clients: > > [ozra] Dispatching (term I borrowed from Robert) is a mechanism for > scheduling URLs > to Clients for crawling. Here is my suggestion on how to schedule the URLs. > Every page that is crawled for the first time by our system is automatically > scheduled to be crawled again in (say) two weeks. If in two weeks a Client > crawls the page and finds that the page has changed, the next crawling time > will be set for one week, or half the previous time; if next week the page > changed again, the time will be halfed again to 3 days, and so on. If on > the other hand, a page didn't change, we might perhaps double the next > scheduled time from two weeks to a month, etc. > > [Rodrigo] Hmmm sounds good to me...just change doubling and halving > to multiplying and dividing by 1.5, I guess that's a more proper > value...also, we have to consider the situation where a client starts > crawling a HUGE site(Geocities for instance)...of course no one client will > crawl all of it, so we have to make it schedule the parts it doesn't...and > develop a good schema so that no two clients will be crawling the same > thing, and no pages will be left uncrawled... > > [ozra] Let's not forget that for each URLs that will be crawled, Client > needs to get "permission" for the Server. No exception. > > ---end msg--- > > Give me your thoughts on this. > > Also, which one of your email addresses should I use now? Use this one only...the old one will bump messages... About the algorythm, I agree, and there's one thing I think we should add - make updates(or perhaps even rating) of most visited pages more frequent...so if a page is only visited once a year by someone(through our search of course), it won't be updated every day or anything...to measure how often a page is visited, just add a redirect script instead of putting direct links to the pages... 
Anyway, you're asking for a high complexity algorythm...I'll try to do it...anyway, we have to organize our ideas on it better...we could always start with a random-picking altorythm..something like this - "take the list of all new URLs to be crawled, add it to the list of websites not recently updated, RANDOMLY mix it all, and start sending it to the clients"...also, we gotta use a cyclic queue reading, in a way that an entry is only removed when a client actually returns the crawled page instead of right when it's sent to a client, yet it won't be sent again to another client for a while(until the queue end has been reached, which means all URLs have been sent to the clients, then ir repeats)...btw what will we do if we have an empy queue?? Start updating everything again?? It's not likely to happen in the future but it'll probably happen in the beginning... Oh, actually, one little addition to the process above - the URLs to be updated will be interpolated with the new ones, so do a sort ONLY on the old ones so that the most recently updated will be the last IN the sequence, that is, rearrange the URLs without changing the positions they got from random mixing... Max |
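A compact in-memory rendering of the two tables and the halving/doubling rule from the message above may help picture the design; the SQL linked lists become a map from time_id to a list of url_ids, the names (next_update, in_crawl) follow the email, and the code is an illustration only, not the grub server:

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <string>
#include <unordered_map>
#include <vector>

struct UrlEntry {
    std::string url;
    uint32_t next_update = 8;   // recrawl interval; 8 (weeks, in the example) for a new URL
    bool in_crawl = false;
};

class Scheduler {
public:
    void add_new_url(uint64_t id, const std::string &url) {
        urls[id] = UrlEntry{url, 8, false};
        timeline[0].push_back(id);              // time_id 0 = never crawled
    }

    // Build this week's queue: everything scheduled for 'week' plus the
    // never-crawled list, marking it all as handed out to clients.
    std::vector<uint64_t> build_queue(uint32_t week) {
        std::vector<uint64_t> queue;
        for (uint32_t key : {week, 0u}) {
            auto it = timeline.find(key);
            if (it == timeline.end()) continue;
            for (uint64_t id : it->second) {
                urls[id].in_crawl = true;
                queue.push_back(id);
            }
            timeline.erase(it);
        }
        return queue;
    }

    // A client reported back: halve the interval if the page changed,
    // double it if it did not, and put the URL back on the timeline.
    void report_crawl(uint64_t id, uint32_t week, bool content_changed) {
        UrlEntry &e = urls[id];
        e.in_crawl = false;
        if (content_changed)
            e.next_update = std::max(1u, e.next_update / 2);
        else
            e.next_update *= 2;
        timeline[week + e.next_update].push_back(id);
    }

private:
    std::unordered_map<uint64_t, UrlEntry> urls;          // the "URL table"
    std::map<uint32_t, std::vector<uint64_t>> timeline;   // the "Timeline table"
};
```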
From: Igor S. <oz...@gr...> - 2000-09-15 17:30:44
|
To Rodrigo:

We need to add a table (or tables) to our database that will store the URLs and some statistics along with them. We need a mechanism which will use those statistics for dispatching/scheduling URLs to the Clients to crawl. We have already talked about this issue. When Clients connect to the Server, they make a request. The response is a list of URLs for crawling. This module should offer an interface that will return such a list and record the action appropriately in the DB. I would like you to work on this module. You should devise an algorithm that will figure out how to schedule the URLs. Use some of the ideas we have already presented via our emails; I pasted an excerpt from an older email at the bottom of this msg. You must take into account several things in your model, though:

1) At the beginning, we will not have many Clients at our disposal, and our database will be overwhelmed with new URLs to be crawled. You must design this algorithm so that even when we have millions of URLs that were never crawled, our Clients will still get back to the old pages, so that our database stays as up-to-date as possible. Remember, our goal is to have the most up-to-date search engine on the net.

2) In future, we will provide means to measure each Client's crawling performance (pages crawled/day), so that we can assign an appropriate number of URLs to each one of them. Don't worry about this one for now.

3) Also, we must think about security. We may need to introduce a certain amount of redundancy in order to check whether we get good data from our Clients. For example, we may have 10% redundancy in crawling. If the data from two Clients does not match, a third Client may be assigned to crawl the page in question, to figure out which Client "cheated". Of course, the page may have changed in the short amount of time between the two Clients' crawls, and we may wrongfully conclude that a Client is rogue. Anyway, I say, don't worry about security for now. Let's leave this for a later stage.

4) The URL scheduling algorithm must be highly configurable and modular enough so that we may add new capabilities to it easily.

5) Many other things I haven't accounted for. Like, for example, taking into account the proximity of Clients to sites when dispatching the URLs...

From an old message:

About dispatching/scheduling URLs to Clients:

[ozra] Dispatching (a term I borrowed from Robert) is a mechanism for scheduling URLs to Clients for crawling. Here is my suggestion on how to schedule the URLs. Every page that is crawled for the first time by our system is automatically scheduled to be crawled again in (say) two weeks. If in two weeks a Client crawls the page and finds that the page has changed, the next crawl will be scheduled in one week, or half the previous interval; if next week the page has changed again, the interval will be halved again to 3 days, and so on. If, on the other hand, the page didn't change, we might perhaps double the next scheduled interval from two weeks to a month, etc.

[Rodrigo] Hmmm, sounds good to me... just change doubling and halving to multiplying and dividing by 1.5, I guess that's a more proper value... also, we have to consider the situation where a client starts crawling a HUGE site (Geocities, for instance)... of course no one client will crawl all of it, so we have to make it schedule the parts it doesn't... and develop a good scheme so that no two clients will be crawling the same thing, and no pages will be left uncrawled...
[ozra] Let's not forget that for each URL that will be crawled, a Client needs to get "permission" from the Server. No exception.

---end msg---

Give me your thoughts on this.

Cheers,
ozra.

--------------------------------------------------------------
Igor Stojanovski            Grub.Org Inc.
Chief Technical Officer     5100 N. Brookline #830
                            Oklahoma City, OK 73112
oz...@gr...                 Voice: (405) 917-9894
http://www.grub.org         Fax: (405) 848-5477
--------------------------------------------------------------
|
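A minimal Python sketch of the interval rule from the pasted excerpt, assuming the two-week starting interval and the factor of 1.5 suggested in the thread; the function name, the min/max clamps and the timedelta arithmetic are illustrative assumptions only.

from datetime import datetime, timedelta

# The two-week starting interval and the 1.5 factor come from the thread;
# the min/max clamps are assumptions so the interval can neither collapse
# to zero nor grow without bound.
INITIAL_INTERVAL = timedelta(days=14)
ADJUST_FACTOR = 1.5
MIN_INTERVAL = timedelta(days=1)
MAX_INTERVAL = timedelta(days=90)

def next_crawl_time(page_changed, current_interval, now=None):
    """Return (new_interval, next_scheduled_time) after a page is re-crawled.

    page_changed is True when the content a Client returned differs from
    what is stored in the database."""
    now = now or datetime.utcnow()
    if page_changed:
        new_interval = max(MIN_INTERVAL, current_interval / ADJUST_FACTOR)
    else:
        new_interval = min(MAX_INTERVAL, current_interval * ADJUST_FACTOR)
    return new_interval, now + new_interval

# A page crawled for the first time simply gets INITIAL_INTERVAL; from the
# second crawl onward, feed the stored interval through next_crawl_time()
# and persist both values in the URL's statistics row.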
From: Igor S. <oz...@gr...> - 2000-09-07 17:14:07
|
To Wagner:

In the context of the email that I sent to the grub-database list titled "Pre-computed ranking vs. ranking on-the-fly": even though storing a pre-computed rank value has a great disadvantage over generating it on-the-fly -- it takes a lot of effort to rebuild the database once you change the ranking parameters -- I think we should go with it for now, as we get a lot of performance gain. Therefore, I think your module should implement the second type -- CUMULATIVE.

However, in order to ensure that your module can still be used in the future, it needs to be modular enough that, if we needed to use it just for getting the words from pages and figuring out their types and positions (and not rank/weigh them), we would be able to do so. Here is why. Initially, we want the Ranker to be located at the Server. The Clients will pass the full contents of pages back to the Server, and the Server will use the Ranker to get the words out and figure out their type, position, and weight/rank. This way, a cumulative rank will be generated, upon which the searches will be done.

In later stages of the project, we may actually move the Ranker (your module) to the Client, but its responsibility will be somewhat limited -- it will NOT rank the pages, but only "preprocess" them. This means it will get the words, associate the appropriate type with them (REGULAR, ANCHOR, META, TITLE, ...), record their positions, and send them to the Server. The Server will then do the ranking on the partially processed data, and hence we utilize more of the Clients' processing power. I have actually included this capability in the Client/Server protocol. Another option would be to have the Clients do the ranking themselves, in which case they would be highly configurable from the Server as to which parameters to use for ranking, and what to rank upon.

But let's not worry too much about the later stages. Let's just keep them in mind so that we won't get into too much trouble rewriting code when we get there.

Cheers,
ozra.

--------------------------------------------------------------
Igor Stojanovski            Grub.Org Inc.
Chief Technical Officer     5100 N. Brookline #830
                            Oklahoma City, OK 73112
oz...@gr...                 Voice: (405) 917-9894
http://www.grub.org         Fax: (405) 848-5477
--------------------------------------------------------------
|
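A minimal Python sketch of the split described above, assuming the Ranker's two roles map onto two functions: preprocess() is the part that could later move to the Client, and cumulative_rank() is the Server-side CUMULATIVE ranking. The type names (REGULAR, ANCHOR, META, TITLE) come from the email; the numeric weights, the position bonus and the function names are assumptions for illustration only.

from collections import defaultdict

# Illustrative per-type weights -- only the type names come from the email.
TYPE_WEIGHTS = {"TITLE": 5.0, "META": 3.0, "ANCHOR": 2.0, "REGULAR": 1.0}

def preprocess(page_tokens):
    """'Preprocess only' mode -- what a Client-side Ranker would eventually do:
    emit (word, type, position) tuples without weighing them."""
    return [(word.lower(), word_type, position)
            for word, word_type, position in page_tokens]

def cumulative_rank(preprocessed):
    """CUMULATIVE mode -- collapse the preprocessed tuples into one stored,
    pre-computed weight per word, which is what searches run against."""
    scores = defaultdict(float)
    for word, word_type, position in preprocessed:
        weight = TYPE_WEIGHTS.get(word_type, 1.0)
        weight *= 1.0 + 1.0 / (1 + position)   # assumed bonus: earlier words count a bit more
        scores[word] += weight
    return dict(scores)

# The same preprocess() output can be ranked on the Server now, or produced on
# the Client and shipped over later -- the modularity asked for above.
tokens = [("Grub", "TITLE", 0), ("crawler", "REGULAR", 12), ("grub", "ANCHOR", 30)]
print(cumulative_rank(preprocess(tokens)))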