From: Lowell H. <lha...@vi...> - 2001-07-16 18:20:41
Some randomization was supposedly put in, but it's still not enough. At
least once a day, there are several domains that each get hit with several
thousand crawls. In order to keep crawling but not piss these people off, I
added a firewall rule rejecting those ip's ... (news.com, zdnet.com,
encyclopedia.com, cnn.com, encarta.msn.com, cnet.com, wired.com,
encyklopedia.pl, etc). That temporarily stops the pounding those sites take
from my end (since I'm crawling 1.5M urls/day I get most of those urls
anyway) ... but the other 40-some-odd crawlers are still doing it.

One solution might be to just take the whole database offline occasionally
and set up a perl script to randomly re-fill the tables. Another, more
difficult, solution would be to reorganize the tables on the master,
separating out the hostname and path, and set up the scheduler to limit the
number of urls per hostname that can be scheduled in a certain period. That
would eliminate the problem, and also allow better reporting in the future
(i.e. you could generate reports like # of urls for a domain, total
hostnames, etc).

There are probably better ways to do it too ... as soon as someone gets a
full database dump from google we'll know how <smirk>

Lowell

Jeff Squyres wrote:
>
> On Sat, 14 Jul 2001, Lowell Hamilton wrote:
>
> > Won't that limit the client base possibilities though? If I were a
> > dialup, dsl, or @home user (all of which are DHCP assigned addresses
> > and you're almost guaranteed not to get the same ip back again) and
> > had to log onto a webpage and enter my ip address for this session,
> > few people would want to run the client that didn't have a static ip
> > (effectively eliminating most of your home userbase). Some cable/dsl
> > providers even time out your ip address after 24 hours so you're
> > constantly being reassigned... Maybe if ranges were allowed
> > (12.34.45.*) or domain names (*.adsl.isp.com) it would be bearable.
>
> I agree.
>
> Is there a reason that a fixed IP address is required? Other than
> "security"? Indeed, what if I'm behind my ISP's NAT and even though I
> might get a "fixed" IP, it would be a private IP like 192.168.something.
>
> > Perhaps a key or password system would be better. Log onto the
> > website and enter a password, which goes into the grub.conf. Or a
> > key system where each unique client instance must have a
> > server-assigned key put in the conf file, and tracking is done
> > server-side, blocking the client if a key-id connects from more than 2
> > ip addresses in a 6-hour period... and that key is used to encrypt the
> > session.
>
> Sure, this would be fine as well.
>
> -----
>
> On a separate issue, has the randomization and/or user-agent issue been
> fixed/implemented yet? I stopped crawling when someone sent a message
> across the list saying that they had gotten cease-and-desist messages. I
> have a DSL line at home, and I have no desire to have C&D messages sent
> to my ISP. Indeed, ISPs are likely to side with C&Ds and just shut off my
> service before even checking with me. I didn't want to take that risk,
> so I stopped crawling until some better kind of system was implemented.
>
> Has it been?
>
> {+} Jeff Squyres
> {+} sq...@cs...
> {+} Perpetual Obsessive Notre Dame Student Craving Utter Madness
> {+} "I came to ND for 4 years and ended up staying for a decade"
>
> _______________________________________________
> Grub-general mailing list
> Gru...@li...
> http://lists.sourceforge.net/lists/listinfo/grub-general
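
To make the per-hostname scheduling idea above a bit more concrete, here is
a minimal sketch in Python of handing out crawl URLs while capping how many
any single hostname can get within a time window. It is purely illustrative,
not Grub's actual scheduler; the class name and the 50-per-hour defaults are
made up for the example.

    # A minimal sketch (not Grub's real code) of per-hostname scheduling limits.
    # The per-host cap and window below are illustrative assumptions.
    import time
    from collections import defaultdict, deque
    from urllib.parse import urlparse

    class HostLimitedScheduler:
        """Hand out crawl URLs, but never more than per_host_limit
        URLs for any single hostname within window_seconds."""

        def __init__(self, per_host_limit=50, window_seconds=3600):
            self.per_host_limit = per_host_limit
            self.window_seconds = window_seconds
            self.pending = deque()           # URLs waiting to be scheduled
            self.recent = defaultdict(list)  # hostname -> timestamps of recent picks

        def add_url(self, url):
            self.pending.append(url)

        def _host_is_free(self, host, now):
            # Drop timestamps that have aged out of the window.
            cutoff = now - self.window_seconds
            self.recent[host] = [t for t in self.recent[host] if t > cutoff]
            return len(self.recent[host]) < self.per_host_limit

        def next_batch(self, batch_size=100):
            """Return up to batch_size URLs, skipping hosts that hit the cap."""
            now = time.time()
            batch, skipped = [], deque()
            while self.pending and len(batch) < batch_size:
                url = self.pending.popleft()
                host = urlparse(url).hostname or ""
                if self._host_is_free(host, now):
                    self.recent[host].append(now)
                    batch.append(url)
                else:
                    skipped.append(url)  # over the cap; retry in a later batch
            self.pending.extend(skipped)
            return batch

Keying the bookkeeping on the hostname rather than the full url is also what
would make the per-domain reports mentioned above (# of urls per domain,
total hostnames) cheap to generate.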
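
Similarly, for the key-id tracking suggested in the quoted message, the
server-side bookkeeping could look something like the sketch below. Again
this is only an assumption of how it might be done; the 2-ip / 6-hour limits
come straight from the quoted suggestion, and everything else (class and
method names) is invented for the example.

    # A rough sketch (illustrative only) of server-side key tracking: block a
    # client key if it connects from more than two distinct IPs within six hours.
    import time
    from collections import defaultdict

    class KeyTracker:
        def __init__(self, max_ips=2, window_seconds=6 * 3600):
            self.max_ips = max_ips
            self.window_seconds = window_seconds
            self.seen = defaultdict(dict)  # key_id -> {ip: last_seen_timestamp}
            self.blocked = set()

        def connection_allowed(self, key_id, ip):
            """Record a connection attempt; return False if the key is blocked."""
            if key_id in self.blocked:
                return False
            now = time.time()
            ips = self.seen[key_id]
            # Forget IPs not seen within the window.
            for old_ip in [i for i, t in ips.items() if now - t > self.window_seconds]:
                del ips[old_ip]
            ips[ip] = now
            if len(ips) > self.max_ips:
                self.blocked.add(key_id)
                return False
            return True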