| Year | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2000 | | | | | | | | 2 | | | | |
| 2001 | | | | | 61 | 76 | 31 | | | | 1 | |
| 2002 | | | | | | | | | | | 2 | |
| 2003 | | | | | 1 | 1 | | | | | | |
| 2004 | 1 | | 2 | | 1 | 1 | | | | | | |
| 2006 | | | | | | | | | | 1 | | |
| 2007 | 2 | | 1 | | | | | | | | 1 | 1 |
| 2008 | | | | | | | | 1 | | | | 14 |
| 2009 | 30 | 2 | | | | | | | | | | |
| 2010 | | 1 | | | | | | | 1 | 2 | 1 | |
| 2011 | 1 | | | | | | | | | | | |
From: Phil S. <P.R...@IE...> - 2003-06-05 03:22:12
|
I have found the attached script to be quite useful, for instance to make boot floppies that default to different distributions, kernels, or OSs, and hope others may as well. It has evolved and been used successfully on Red Hat releases 7.3 through 9.

Phil Schaffner
From: Kord C. <ko...@gr...> - 2002-11-20 19:15:31
|
Otis,

Good questions. We expect you, our clients, to hold us accountable for what we do with the data, as it is either your data that we are collecting, or your machines that we are using with which to collect the data.

One thing that has been holding us up is the Windows client. Now that we have it done (and hopefully stable today), we should be able to retain more clients for crawling. This also allows us to start marketing the client, without us having to worry about newbies using it, and it crashing on them. Crashing programs tend to turn people off, strangely enough. ;)

Right now we are crawling about 3M URLs a day, with about 30-40 clients running per day. This is an average of about 100,000 URLs per day, per client. We currently have about 30M URLs in the database, so that puts our re-crawl rate at once every 10 days or so. We think that a good goal for re-crawl is about once every 7 days.

The plan is to scale the number of URLs in the database to the number of crawlers currently running. As the number of crawlers running goes up, so does the number of URLs that we can re-crawl each week.

Expect an announcement from us next week concerning our plans for making the returned data more accessible. I think you guys are going to like what we are going to make available to you.

Later,
Kord

--
--------------------------------------------------------------
Kord Campbell                     Grub, Inc.
President                         5500 North Western Avenue #101C
                                  Oklahoma City, OK 73118
ko...@gr...                       Voice: (405) 848-7000
http://www.grub.org               Fax: (405) 848-5477
--------------------------------------------------------------
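Kord's figures determine the re-crawl period directly. A back-of-the-envelope sketch of that arithmetic, using the rounded numbers from his message (illustrative only, not Grub code):

```c
#include <stdio.h>

/* Re-crawl arithmetic from the figures in the message above:
 * ~3M URLs crawled per day by ~30-40 clients, ~30M URLs known. */
int main(void)
{
    const double urls_per_day   = 3e6;   /* total crawl throughput      */
    const double active_clients = 35;    /* midpoint of "30-40 clients" */
    const double urls_in_db     = 30e6;  /* size of the URL database    */

    double per_client   = urls_per_day / active_clients; /* ~100k/client/day */
    double recrawl_days = urls_in_db / urls_per_day;      /* ~10 days         */

    /* To hit the stated 7-day goal at this database size, throughput
     * has to rise accordingly. */
    double needed_per_day = urls_in_db / 7.0;

    printf("per-client rate       : %.0f URLs/day\n", per_client);
    printf("re-crawl period       : %.1f days\n", recrawl_days);
    printf("needed for 7-day cycle: %.0f URLs/day\n", needed_per_day);
    return 0;
}
```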
From: otisg <ot...@iV...> - 2002-11-19 06:36:23
|
Hello,

I've been running the Grub client for a while, and I am curious when some of the things mentioned at http://www.grub.org/investors.php will start happening?

Also, I am curious, what is the number of URLs that Grub has crawled so far, and I'm also wondering whether Grub is capable of re-fetching every page it knows at least once a month? I'm asking this because that's how often Google, Alltheweb, etc. do it, so I assume Grub has to do better if it wants to appear attractive to the big search engines, no?

Thanks,
Otis
From: Robert M. <bo...@io...> - 2001-11-13 17:13:26
|
This is very similar to a project SearchKing has been planning. From initial review, it looks like a lot of work we were expecting to have to do may have already been done and perhaps we would accomplish more good quicker by helping you. Look forward to being on the list. Bob Massa SearchKing, Inc. PS: Totally happy to see an Oklahoma company doing ANYTHING that may get us a little respect!! |
From: Vaclav B. <vb...@co...> - 2001-07-17 19:17:54
|
Lowell Hamilton wrote:
> Rotation in general is bad though unless it is vitally important
> to keep the logs ... a busy crawler can generate 10k/minute in
> log data .. that's 14mb at the end of the day plus the rotation

Well, rotate it every hour... :-) I wouldn't even consider generating 10k/minute until I have a very annoying, hard-to-debug crash - such an information overload isn't good for anything...

> A solution might be log levels defined in the conf with a string
> like dnet pproxy:
> LogLevel: urls info stats server errors
>
> ... or general sets:
> LogLevel: [Minimum|Verbose|Debug]

I would prefer the system log levels: ...|warn|info|debug. At the least, they already have written documentation... :-)

Bye
	Vasek
From: Vaclav B. <vb...@co...> - 2001-07-17 19:17:51
|
Lowell Hamilton wrote:
> Jeff Squyres wrote:
> > That, paired with a minimum-time-before-recrawling metric
> > (say, each URL doesn't need to be re-crawled for at least 2
> > days, or perhaps something more intelligent, such as URLs that
> > don't change for a while get progressively longer periods
> > between re-crawling, etc.), would go a long way to ensuring
> > not to penalize people for running the grub client by
> > getting cease and desist notices.
> The idea of a minimum time for recrawl is another great idea ...
> right now I'm seeing the url list cycle about once a day... if
> the crawlers were busy working on finding new urls instead of
> recrawling unchanged urls for the 2nd time that day it would be
> a lot better. The idea of backing off the crawls based on the
> url unchanging could be bad though ... a url that stays static
> for 3 weeks and is backed off could take a few days extra to
> update .. and that is an example of many sites on the

IMHO first of all, grub should respect the HTTP header (forgot which one, sorry) saying how long the URI should be cached. Then we can see whether it helps - although I'm sceptical about the majority of webmasters specifically setting caching to limit load on their servers, perhaps at least those big sites take care... Even if it practically doesn't help, at least we can respond to the cease & desist with "since your page says it's fresh every minute, we want to see it every minute" (don't try this at home :-) ).

> One thing that would be useful is discovery of dynamically
> generated urls and backing them off. More and more sites,
> especially geocities/yahoo hosted and other dynamic banner and

One alternative would be never index uncacheable content - but I would certainly want to see which percentage of the web content is uncacheable before proposing to skip it all...

> advertising made on each hit. Backing off some of these sites
> or flagging them somehow to not be only monitored, or something
> would free crawlers up a bit. Since the goal of grub is not to
> index these pages, but only to determine of a site has been
> updated, grub could just return these urls to the outgoing feed
> every x hours and keep the crawlers busy doing something else.
> Determining a site like this could be just having a url
> scheduled in 10 different packets. If they all come back with a

Perhaps, but second-guessing stupid or antagonistic webmasters should IMHO come *after* cooperation with those who are willing and able to cooperate...

Bye
	Vasek
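The header Vaclav could not name is Cache-Control (HTTP/1.1) or Expires (HTTP/1.0). Letting the server's own caching hints drive the re-crawl delay is easy to sketch; the snippet below is purely illustrative (not code from the Grub client) and assumes the scheduler already has the raw response headers and a minimum-delay floor:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative sketch: derive a re-crawl delay (in seconds) from the
 * HTTP response headers, as suggested in the thread.  If the page
 * advertises Cache-Control: max-age, never re-fetch before it expires,
 * and never re-fetch more often than the configured floor either. */
static long recrawl_delay_from_headers(const char *headers, long floor_secs)
{
    const char *cc = strstr(headers, "Cache-Control:");
    if (cc) {
        const char *ma = strstr(cc, "max-age=");
        if (ma) {
            long max_age = atol(ma + strlen("max-age="));
            return max_age > floor_secs ? max_age : floor_secs;
        }
    }
    /* No usable caching information: fall back to the scheduler default. */
    return floor_secs;
}

int main(void)
{
    const char *hdrs =
        "HTTP/1.1 200 OK\r\n"
        "Cache-Control: max-age=3600\r\n"
        "Content-Type: text/html\r\n\r\n";

    /* With a 2-day floor (Jeff's number), a 1-hour max-age still yields 2 days. */
    printf("delay: %ld seconds\n",
           recrawl_delay_from_headers(hdrs, 2 * 24 * 3600L));
    return 0;
}
```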
From: Igor S. <oz...@gr...> - 2001-07-17 17:34:27
|
> Is there a reason that a fixed IP address is required? Other than
> "security"? Indeed, what if I'm behind my ISP's NAT and even though I
> might get a "fixed" IP, it would be a private IP like 192.168.something.
>
> > Perhaps a key or password system would be better. Log onto the
> > website and enter a password, which goes into to the grub.conf. Or a
> > key system where each unique client instance must have a
> > server-assigned key put in the conf file, and tracking is done
> > server-side blocking the client if a key-id connects from more than 2
> > ip addresses in a 6-hour period... and that key is used to encrypt the
> > session.
>
> Sure, this would be fine as well.

[ozra] I don't see a good reason to have the whole session encrypted. If we have a user_id/password system, encrypting the password would be nice, though. I agree there are a lot of what-if's with IP address authentication. Currently, I favor user_id/password authentication rather than using IP address to do it.

> Another thing I've seen is that of all the urls we are crawling, almost
> none are new. Is the grubdexer actually working? Several weeks ago I
> submitted my site to grub, and it was crawled. Checking using the url
> searcher, resources below what I submitted have never been seen
> before... which pretty much means new urls aren't being found?!?!? Just
> as a little test I have my own server setup and entered one of the local
> portal sites to crawl and had my big crawler run for an hour. It
> discovered almost 10k urls on that domain alone .. which would show up
> in the real crawler as lists like the cnn.com and other huge lists..

[ozra] The grubdexer works (even though it's slow), and it IS finding new URLs from the pages returned. The top table at http://www.grub.org/stats.php shows the actual increase of new URLs retrieved. However, the newly-found URLs are NOT automatically moved to be crawled/indexed. We manually control the number of URLs to be crawled, and we haven't moved any new URLs in a while (which kind of explains why the second graph is a straight line). This is intentional -- this way we can control best/worst/average update time for the URLs in our database. Remember -- our main goal is to be up-to-date more than trying to crawl every resource out there. We have probably found and inserted the new URLs found in your submitted pages. I can check that for you if you give me the URLs.

> If the project needs help with another server to help index or something
> like that I can help out (I have a BIG VA box idle) ... if the grubdexer
> is just behind, kill the scheduler for a little bit and let it catch up
> or something.

[ozra] You are right. The grubdexer is slow, and can't catch up with grubd when the load is greater than something like 3,000,000 URLs/day crawled. In the course of this and probably the next week I will be testing several different models for the grubdexer to try to get significant improvement.

> Here is another nifty question. It has been a long proven fact that not
> every url has a link to it somewhere. Several of the search engines
> have designed slick ways to go around that limitation to find new urls
> .. like url-catching the newsgroups, retrieving a dump of every tld and
> hitting each url and www.url in there... and tricks like using the
> broken mod_index (http://www.domain.com/?S=M gives you a directory
> listing even if directory listing is disabled for a directory .. google
> is good at that one).... All that has given them a url list several
> times larger than just crawling alone.
>
> Will the grub project be trying slick things like that ... or perhaps
> getting a url list from another engine? At one point I dumped several
> tld's into the urls submission form, but none never made it into
> crawler-land.

[ozra] Sure we would like features like that, but at this point or at any time we have an order of magnitude more newly found URLs than those that are crawled, up until a point of saturation (someone like Google should be experiencing that, definitely not us; not yet).

> The idea of a minimum time for recrawl is another great idea ... right
> now I'm seeing the url list cycle about once a day... if the crawlers
> were busy working on finding new urls instead of recrawling unchanged
> urls for the 2nd time that day it would be a lot better.

[ozra] Actually, such features exist in the current scheduler, but they are not used to their fullest. That's because we crawl way fewer URLs than our capacity allows. And the reason for this is because we are still testing new features on the scheduler, and larger numbers may interfere with what we do. Plus we need better measuring tools to figure out recrawl times, total number of URLs to crawl, etc.

Cheers,
ozra.
From: Lowell H. <lha...@vi...> - 2001-07-16 23:14:30
|
> More importantly, though, it probably needs to limit the number of URLs on
> a given web server crawled by each grub client in a specific time period.

Knowing what "server" a url is on would be difficult because then the grub master would have to keep track of the ip resolutions for each url, which often changes and can consume a lot of resources. Just tracking the hostname should be enough for this application. Any server that has hundreds of domains on it should be beefy enough to handle a few crawls at a time, one to each domain, and that shouldn't be a problem (as an isp admin, I do that myself once every 30 seconds monitoring anyway for >1000 domains)

> So perhaps each grubber can crawl (max(5% of all know URLs on that site,
> 100 URLs)) from a given web server in a 24 hour period. I made up these
> specific numbers, but you get the idea -- use some kind of maximum metric
> that each grub client will crawl in a given period of time.

That would be a good idea (imho at least) .. The number would have to be much higher though .. a site with 10k urls in the database would take too long to complete to be useful.

> This allows an entire web site to be crawled in that period of time -- so
> you can still get fairly accurate, up-to-date stats -- but each URL on the
> site will only be crawled *once* (max) per time period, and by potentially
> many different crawlers so that no one grub client is identified as a DoS
> agent.

One thing that would be ideal is if the server/scheduler provided urls to clients in sets that were specifically generated instead of just spewing out the next 500 in the table (or the table was generated with the client sets in mind). This would allow for some nifty checking to be added for crawl limiting. For example, each packet to send to a client has at most 20 urls for a specific hostname, one every 15 urls.

> That, paired with a minimum-time-before-recrawling metric (say, each URL
> doesn't need to be re-crawled for at least 2 days, or perhaps something
> more intelligent, such as URLs that don't change for a while get
> progressively longer periods between re-crawling, etc.), would go a long
> way to ensuring not to penalize people for running the grub client by
> getting cease and desist notices.

The idea of a minimum time for recrawl is another great idea ... right now I'm seeing the url list cycle about once a day... if the crawlers were busy working on finding new urls instead of recrawling unchanged urls for the 2nd time that day it would be a lot better. The idea of backing off the crawls based on the url unchanging could be bad though ... a url that stays static for 3 weeks and is backed off could take a few days extra to update .. and that is an example of many sites on the net. Unless the threshold was a couple months or something it wouldn't be too useful.

One thing that would be useful is discovery of dynamically generated urls and backing them off. More and more sites, especially geocities/yahoo hosted and other dynamic banner and fluff sites are going to change every time you hit them, because of the change in advertising made on each hit. Backing off some of these sites or flagging them somehow to not be only monitored, or something would free crawlers up a bit. Since the goal of grub is not to index these pages, but only to determine if a site has been updated, grub could just return these urls to the outgoing feed every x hours and keep the crawlers busy doing something else. Determining a site like this could be just having a url scheduled in 10 different packets. If they all come back with a different CRC you've got one. Grubdex it, flag it for crawling only every week (for finding new urls) and there ya go. That would eliminate a large percentage of the urls being crawled every day. There would be a chance that a new unseen link could have been posted on the page between crawls, but until there is a huge crawler base there just won't be time to crawl all those.

Lowell
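Lowell's "schedule the URL in ten different packets and compare the CRCs" heuristic needs only a few lines of logic. The sketch below is illustrative, not Grub code; the checksum values and the enum names are this example's own, and in practice the checksums would be reported back by different crawl clients:

```c
#include <stdio.h>

#define SAMPLES 10   /* "having a url scheduled in 10 different packets" */

/* Sketch of the heuristic from the message above: if every fetch of the
 * same URL returns a different checksum, treat the page as dynamically
 * generated and demote it to a weekly "link discovery only" crawl. */
typedef enum { CRAWL_NORMAL, CRAWL_WEEKLY_DISCOVERY } crawl_class;

static crawl_class classify_url(const unsigned long crc[], int n)
{
    int all_different = 1;
    for (int i = 1; i < n && all_different; i++)
        if (crc[i] == crc[i - 1])
            all_different = 0;   /* two consecutive fetches matched, so the
                                    page is not changing on every hit */

    return all_different ? CRAWL_WEEKLY_DISCOVERY : CRAWL_NORMAL;
}

int main(void)
{
    /* Hypothetical checksums as ten different clients might report them. */
    unsigned long banner_site[SAMPLES] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
    unsigned long static_site[SAMPLES] = { 7, 7, 7, 7, 7, 7, 7, 7, 7, 7 };

    printf("banner site: %s\n",
           classify_url(banner_site, SAMPLES) == CRAWL_WEEKLY_DISCOVERY
               ? "dynamic -> weekly discovery crawl" : "normal schedule");
    printf("static site: %s\n",
           classify_url(static_site, SAMPLES) == CRAWL_WEEKLY_DISCOVERY
               ? "dynamic -> weekly discovery crawl" : "normal schedule");
    return 0;
}
```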
From: Jeff S. <jsq...@ls...> - 2001-07-16 22:37:55
|
On Mon, 16 Jul 2001, Lowell Hamilton wrote:
> Another solution, but difficult solution, would be reorganize the
> tables on the master, seperating out the hostname and path, and setup
> the scheduler to limit the number of each hostname that can be
> scheduled in a certain period. That would eliminate the problems, and
> also allow better results to be returned in the future (i.e. you could
> generate reports like # of urls for a domain, total hostnames, etc).
> There are probably better ways to do it too ... as soon as someone
> gets a full database dump from google we'll know how <smirk>

Hear, hear.

More importantly, though, it probably needs to limit the number of URLs on a given web server crawled by each grub client in a specific time period. So perhaps each grubber can crawl (max(5% of all known URLs on that site, 100 URLs)) from a given web server in a 24 hour period. I made up these specific numbers, but you get the idea -- use some kind of maximum metric that each grub client will crawl in a given period of time.

This allows an entire web site to be crawled in that period of time -- so you can still get fairly accurate, up-to-date stats -- but each URL on the site will only be crawled *once* (max) per time period, and by potentially many different crawlers so that no one grub client is identified as a DoS agent.

That, paired with a minimum-time-before-recrawling metric (say, each URL doesn't need to be re-crawled for at least 2 days, or perhaps something more intelligent, such as URLs that don't change for a while get progressively longer periods between re-crawling, etc.), would go a long way to ensuring not to penalize people for running the grub client by getting cease and desist notices.

{+} Jeff Squyres
{+} sq...@cs...
{+} Perpetual Obsessive Notre Dame Student Craving Utter Madness
{+} "I came to ND for 4 years and ended up staying for a decade"
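Jeff's two proposals translate into a small amount of scheduler logic: a per-site daily budget of max(5% of known URLs, 100), and a recrawl interval that grows while a URL keeps coming back unchanged. The constants below are the made-up numbers from this thread, and the function names are invented for the sketch; none of it is the real Grub scheduler:

```c
#include <stdio.h>
#include <time.h>

/* Per-site daily budget: max(5% of known URLs on the site, 100),
 * exactly as proposed (with admittedly made-up numbers) above. */
static long daily_crawl_budget(long known_urls_on_site)
{
    long five_percent = known_urls_on_site / 20;
    return five_percent > 100 ? five_percent : 100;
}

/* Adaptive recrawl: never sooner than 2 days; each time the URL comes
 * back unchanged the interval doubles, capped at 30 days. */
static time_t next_recrawl(time_t last_crawl, int unchanged_in_a_row)
{
    const time_t day = 24 * 3600;
    time_t interval = 2 * day;

    for (int i = 0; i < unchanged_in_a_row && interval < 30 * day; i++)
        interval *= 2;
    if (interval > 30 * day)
        interval = 30 * day;

    return last_crawl + interval;
}

int main(void)
{
    printf("budget for a 10,000-URL site: %ld URLs/day\n",
           daily_crawl_budget(10000));
    printf("interval after 3 unchanged crawls: %ld days\n",
           (long)(next_recrawl(0, 3) / (24 * 3600)));
    return 0;
}
```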
From: Kord C. <ko...@gr...> - 2001-07-16 20:33:29
|
cURL is both a utility (like wget) and a c/c++ library for accessing URLs. Their site is located at: http://curl.haxx.se/

The documentation is quite nice and is (in my opinion) very easy to use and access. I already have a prototype of the client using cURL and it appears to be a faster implementation than the one currently using wget.

Kord

On Mon, 16 Jul 2001, Vaclav Barta wrote:
> Kord Campbell wrote:
> > be fixed. I'm also working on implementing the cURL library
> > into the client, effectively removing any limitations that are
> > related to using wget for pulling down the data. This should
> Yes, IMHO it would be much better not to use an external program for
> such a central activity - but what is cURL?
>
> Bye
> 	Vasek

--
--------------------------------------------------------------
Kord Campbell                     Grub.Org Inc.
President                         6051 N. Brookline #118
                                  Oklahoma City, OK 73112
ko...@gr...                       Voice: (405) 843-6336
http://www.grub.org               Fax: (405) 848-5477
--------------------------------------------------------------
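For readers who have not used it, the libcurl "easy" interface Kord is describing looks roughly like this. This is a minimal, self-contained sketch of fetching one page into memory, with error handling trimmed; the buffer struct and callback are this example's own, not Grub client code:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <curl/curl.h>

/* Minimal sketch of fetching one page with the libcurl "easy" API. */
struct page { char *data; size_t len; };

static size_t collect(void *chunk, size_t size, size_t nmemb, void *userp)
{
    struct page *p = userp;
    size_t bytes = size * nmemb;

    p->data = realloc(p->data, p->len + bytes + 1);
    memcpy(p->data + p->len, chunk, bytes);
    p->len += bytes;
    p->data[p->len] = '\0';
    return bytes;                       /* tell libcurl we consumed it all */
}

int main(void)
{
    struct page p = { NULL, 0 };

    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL *curl = curl_easy_init();

    curl_easy_setopt(curl, CURLOPT_URL, "http://www.grub.org/");
    curl_easy_setopt(curl, CURLOPT_USERAGENT, "grub-client-sketch/0.1");
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, collect);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &p);

    if (curl_easy_perform(curl) == CURLE_OK)
        printf("fetched %zu bytes\n", p.len);

    curl_easy_cleanup(curl);
    curl_global_cleanup();
    free(p.data);
    return 0;
}
```

Because the transfer runs in-process, the client keeps the connection, headers, and return codes under its own control instead of parsing wget's output, which is the limitation Kord alludes to.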
From: Lowell H. <lha...@vi...> - 2001-07-16 18:20:41
|
Some randomization was supposedly put in, but it's still not enough. At least once a day, there are several domains that get crawled with several thousand crawls. In order to keep crawling but not piss these people off, I added a firewall rule rejecting those ip's ... (news.com, zdnet.com, encyclopedia.com, cnn.com, encarta.msn.com, cnet.com, wired.com, encyklopedia.pl, etc). That temp-fixes the beating up of those sites from my end (since I'm crawling 1.5M urls/day I get most of those urls anyway) ... but the other 40-someodd crawlers are still doing it.

One solution might be to just take the whole database offline occasionally, and set up a perl script to randomly re-fill the tables.

Another solution, but a difficult one, would be to reorganize the tables on the master, separating out the hostname and path, and set up the scheduler to limit the number of each hostname that can be scheduled in a certain period. That would eliminate the problems, and also allow better results to be returned in the future (i.e. you could generate reports like # of urls for a domain, total hostnames, etc). There are probably better ways to do it too ... as soon as someone gets a full database dump from google we'll know how <smirk>

Lowell

Jeff Squyres wrote:
>
> On Sat, 14 Jul 2001, Lowell Hamilton wrote:
>
> > Won't that limit the client base possibilities though? If I were a
> > dialup, dsl, or @home user (all of which are DHCP assigned addresses
> > and you're almost guaranteed not to get the same ip back again) and
> > had to log onto a webpage and enter my ip address for this session,
> > few people would want to run the client that didn't have a static ip
> > (effectively eliminating most of your home userbase). Some cable/dsl
> > providors even time out your ip address after 24 hours so you're
> > constantly being reassigned... Maybe if ranges were allowed
> > (12.34.45.*) or domain names (*.adsl.isp.com) it would be bearable.
>
> I agree.
>
> Is there a reason that a fixed IP address is required? Other than
> "security"? Indeed, what if I'm behind my ISP's NAT and even though I
> might get a "fixed" IP, it would be a private IP like 192.168.something.
>
> > Perhaps a key or password system would be better. Log onto the
> > website and enter a password, which goes into to the grub.conf. Or a
> > key system where each unique client instance must have a
> > server-assigned key put in the conf file, and tracking is done
> > server-side blocking the client if a key-id connects from more than 2
> > ip addresses in a 6-hour period... and that key is used to encrypt the
> > session.
>
> Sure, this would be fine as well.
>
> -----
>
> On a separate issue, has the randomization and/or user-agent issue been
> fixed/implemented yet? I stopped crawling when someone sent a message
> across the list saying that they had gotten cease-and-desist messages. I
> have a DSL line at home, and I have no desire to have C&D messages sent to
> my ISP. Indeed, ISPs are likely to side with C&Ds and just shut off my
> service before even checking with me. I didn't want to take that risk, so
> I stopped crawling until some better kind of system was implemented.
>
> Has it been?
>
> {+} Jeff Squyres
> {+} sq...@cs...
> {+} Perpetual Obsessive Notre Dame Student Craving Utter Madness
> {+} "I came to ND for 4 years and ended up staying for a decade"
>
> _______________________________________________
> Grub-general mailing list
> Gru...@li...
> http://lists.sourceforge.net/lists/listinfo/grub-general
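Splitting the stored URL into hostname and path, as Lowell suggests for the master's tables, only needs a small parsing helper on the scheduler side. A sketch under that assumption (the real Grub schema is never shown in this thread, so the function and buffer sizes here are invented for illustration):

```c
#include <stdio.h>
#include <string.h>

/* Illustrative sketch: split "http://host[:port]/path" into the hostname
 * key the scheduler would rate-limit on, and the remaining path. */
static int split_url(const char *url, char *host, size_t hlen,
                     char *path, size_t plen)
{
    const char *p = strstr(url, "://");
    if (!p)
        return -1;                      /* not an absolute URL */
    p += 3;

    const char *slash = strchr(p, '/');
    size_t n = slash ? (size_t)(slash - p) : strlen(p);
    if (n >= hlen)
        return -1;                      /* hostname too long for the buffer */

    memcpy(host, p, n);
    host[n] = '\0';
    snprintf(path, plen, "%s", slash ? slash : "/");
    return 0;
}

int main(void)
{
    char host[256], path[1024];

    if (split_url("http://www.cnn.com/WORLD/index.html",
                  host, sizeof host, path, sizeof path) == 0)
        printf("host=%s path=%s\n", host, path);
    return 0;
}
```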
From: Lowell H. <lha...@vi...> - 2001-07-16 18:18:40
|
Rotation in general is bad though unless it is vitally important to keep the logs ... a busy crawler can generate 10k/minute in log data .. that's 14mb at the end of the day plus the rotation file... plus the arch folder growing and shrinking .. it all adds up.

A solution might be log levels defined in the conf with a string like dnet pproxy:

LogLevel: urls info stats server errors

... or general sets:

LogLevel: [Minimum|Verbose|Debug]

so one can define the amount of log data generated. Not everyone wants the whole url list that they are crawling or even any debug info, but some general info would be nice for everyone, especially statistics info once that is put in there.

Lowell

Vaclav Barta wrote:
>
> Lowell Hamilton wrote:
> >
> > Yeah... there are ways around the logfile growing ... I actually
> > just linked grublog.log and putlog.log to /dev/null since I
> > would never use it and capture stderr to a file A distributed
> > client should not need maintaince by an outside application or
> > consume a lot of resources.
> Well, it shouldn't, but I would consider the handling/disposal
> of logfiles (with logrotate or otherwise) an integral part of the
> system, rather than an outside application... If you have apache,
> what are you doing with apache logs? Maybe the client should just
> use the system log (and hope it's configured correctly) - but
> perhaps there's a reason so few applications do that and everybody
> keeps their own files...
>
> > The average user probably wouldn't know how to use logrotate or
> > how to write a shell script to rotate the logs ... or even where
> > the logs are located when their hard drive fills up. Plus, only
> > redhat-ish linux distributions even come with logrotate... and
> That's why I'm saying that it should be disabled by default - if
> the client doesn't work, people who are inclined to debug
> the problem may enable logging, and the vast majority will just scrap
> the application, whether they have logs or not...
>
> Bye
> 	Vasek
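A conf-driven setting along the lines Lowell describes (`LogLevel: urls info stats server errors`) is most naturally a bitmask. The sketch below is an assumption about how such a setting could work in the client, not the actual grub.conf syntax; the category names are the ones proposed in the message:

```c
#include <stdio.h>
#include <string.h>

/* Sketch of a bitmask log filter driven by a conf line such as
 *   LogLevel: urls info stats server errors                      */
enum {
    LOG_URLS   = 1 << 0,
    LOG_INFO   = 1 << 1,
    LOG_STATS  = 1 << 2,
    LOG_SERVER = 1 << 3,
    LOG_ERRORS = 1 << 4,
};

static unsigned parse_loglevel(const char *value)
{
    static const struct { const char *name; unsigned bit; } map[] = {
        { "urls", LOG_URLS }, { "info", LOG_INFO }, { "stats", LOG_STATS },
        { "server", LOG_SERVER }, { "errors", LOG_ERRORS },
    };
    unsigned mask = 0;
    char buf[256], *tok;

    snprintf(buf, sizeof buf, "%s", value);
    for (tok = strtok(buf, " "); tok; tok = strtok(NULL, " "))
        for (size_t i = 0; i < sizeof map / sizeof map[0]; i++)
            if (strcmp(tok, map[i].name) == 0)
                mask |= map[i].bit;
    return mask;
}

static void grub_log(unsigned enabled, unsigned category, const char *msg)
{
    if (enabled & category)       /* drop messages the user did not ask for */
        fprintf(stderr, "%s\n", msg);
}

int main(void)
{
    unsigned mask = parse_loglevel("stats errors");

    grub_log(mask, LOG_URLS,   "crawled http://example.com/");  /* suppressed */
    grub_log(mask, LOG_ERRORS, "fetch failed: timeout");        /* printed    */
    return 0;
}
```

A quiet default (errors only) would address both concerns in the thread: no runaway logfiles for ordinary users, and full detail available when someone needs to debug a crash.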
From: Vaclav B. <vb...@co...> - 2001-07-16 17:52:21
|
Kord Campbell wrote:
> be fixed. I'm also working on implementing the cURL library
> into the client, effectively removing any limitations that are
> related to using wget for pulling down the data. This should

Yes, IMHO it would be much better not to use an external program for such a central activity - but what is cURL?

Bye
	Vasek
From: Lowell H. <lha...@vi...> - 2001-07-16 17:36:46
|
Here is another nifty question. It has been a long proven fact that not every url has a link to it somewhere. Several of the search engines have designed slick ways to go around that limitation to find new urls .. like url-catching the newsgroups, retrieving a dump of every tld and hitting each url and www.url in there... and tricks like using the broken mod_index (http://www.domain.com/?S=M gives you a directory listing even if directory listing is disabled for a directory .. google is good at that one).... All that has given them a url list several times larger than just crawling alone.

Will the grub project be trying slick things like that ... or perhaps getting a url list from another engine? At one point I dumped several tld's into the urls submission form, but none ever made it into crawler-land.

Another thing I've seen is that of all the urls we are crawling, almost none are new. Is the grubdexer actually working? Several weeks ago I submitted my site to grub, and it was crawled. Checking using the url searcher, resources below what I submitted have never been seen before... which pretty much means new urls aren't being found?!?!? Just as a little test I have my own server setup and entered one of the local portal sites to crawl and had my big crawler run for an hour. It discovered almost 10k urls on that domain alone .. which would show up in the real crawler as lists like the cnn.com and other huge lists..

If the project needs help with another server to help index or something like that I can help out (I have a BIG VA box idle) ... if the grubdexer is just behind, kill the scheduler for a little bit and let it catch up or something.

Lowell
From: Jeff S. <jsq...@ls...> - 2001-07-16 16:58:58
|
On Sat, 14 Jul 2001, Lowell Hamilton wrote:
> Won't that limit the client base possibilities though? If I were a
> dialup, dsl, or @home user (all of which are DHCP assigned addresses
> and you're almost guaranteed not to get the same ip back again) and
> had to log onto a webpage and enter my ip address for this session,
> few people would want to run the client that didn't have a static ip
> (effectively eliminating most of your home userbase). Some cable/dsl
> providors even time out your ip address after 24 hours so you're
> constantly being reassigned... Maybe if ranges were allowed
> (12.34.45.*) or domain names (*.adsl.isp.com) it would be bearable.

I agree.

Is there a reason that a fixed IP address is required? Other than "security"? Indeed, what if I'm behind my ISP's NAT and even though I might get a "fixed" IP, it would be a private IP like 192.168.something.

> Perhaps a key or password system would be better. Log onto the
> website and enter a password, which goes into to the grub.conf. Or a
> key system where each unique client instance must have a
> server-assigned key put in the conf file, and tracking is done
> server-side blocking the client if a key-id connects from more than 2
> ip addresses in a 6-hour period... and that key is used to encrypt the
> session.

Sure, this would be fine as well.

-----

On a separate issue, has the randomization and/or user-agent issue been fixed/implemented yet? I stopped crawling when someone sent a message across the list saying that they had gotten cease-and-desist messages. I have a DSL line at home, and I have no desire to have C&D messages sent to my ISP. Indeed, ISPs are likely to side with C&Ds and just shut off my service before even checking with me. I didn't want to take that risk, so I stopped crawling until some better kind of system was implemented.

Has it been?

{+} Jeff Squyres
{+} sq...@cs...
{+} Perpetual Obsessive Notre Dame Student Craving Utter Madness
{+} "I came to ND for 4 years and ended up staying for a decade"
From: Lowell H. <lha...@vi...> - 2001-07-16 16:53:17
|
Won't that limit the client base possibilities though? If I were a dialup, dsl, or @home user (all of which are DHCP assigned addresses and you're almost guaranteed not to get the same ip back again) and had to log onto a webpage and enter my ip address for this session, few people would want to run the client that didn't have a static ip (effectively eliminating most of your home userbase). Some cable/dsl providers even time out your ip address after 24 hours so you're constantly being reassigned... Maybe if ranges were allowed (12.34.45.*) or domain names (*.adsl.isp.com) it would be bearable.

Perhaps a key or password system would be better. Log onto the website and enter a password, which goes into the grub.conf. Or a key system where each unique client instance must have a server-assigned key put in the conf file, and tracking is done server-side, blocking the client if a key-id connects from more than 2 ip addresses in a 6-hour period... and that key is used to encrypt the session.

Lowell
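Lowell's key scheme amounts to keeping a short per-key history of (IP, timestamp) pairs on the server and blocking a key seen from a third distinct address within the window. The sketch below is purely illustrative; the struct, limits, and function name are this example's inventions, not anything Grub shipped:

```c
#include <stdio.h>
#include <string.h>
#include <time.h>

#define MAX_IPS   8          /* distinct addresses remembered per key  */
#define IP_LIMIT  2          /* "more than 2 ip addresses" => block    */
#define WINDOW    (6 * 3600) /* "in a 6-hour period"                   */

/* Illustrative sketch of server-side tracking for one client key. */
struct key_state {
    char   ip[MAX_IPS][16];  /* dotted-quad addresses seen recently */
    time_t seen[MAX_IPS];
    int    count;
};

/* Returns 1 if the key should be blocked, 0 if the connection is fine. */
static int check_key(struct key_state *k, const char *ip, time_t now)
{
    int known = 0, live = 0, w = 0;

    /* Expire entries older than the window and look for this IP. */
    for (int i = 0; i < k->count; i++) {
        if (now - k->seen[i] > WINDOW)
            continue;
        if (strcmp(k->ip[i], ip) == 0) {
            known = 1;
            k->seen[i] = now;
        }
        if (w != i) {
            memcpy(k->ip[w], k->ip[i], sizeof k->ip[w]);
            k->seen[w] = k->seen[i];
        }
        live++;
        w++;
    }
    k->count = w;

    if (!known && live >= IP_LIMIT)
        return 1;                    /* third distinct IP inside 6 hours */

    if (!known && k->count < MAX_IPS) {
        snprintf(k->ip[k->count], sizeof k->ip[k->count], "%s", ip);
        k->seen[k->count++] = now;
    }
    return 0;
}

int main(void)
{
    struct key_state k = { .count = 0 };
    time_t t = time(NULL);

    printf("%d\n", check_key(&k, "10.0.0.1", t));        /* 0: first IP  */
    printf("%d\n", check_key(&k, "10.0.0.2", t + 60));   /* 0: second IP */
    printf("%d\n", check_key(&k, "10.0.0.3", t + 120));  /* 1: blocked   */
    return 0;
}
```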