From: Igor S. <oz...@gr...> - 2001-07-17 17:34:27
> Is there a reason that a fixed IP address is required? Other than
> "security"? Indeed, what if I'm behind my ISP's NAT and even though I
> might get a "fixed" IP, it would be a private IP like 192.168.something.
>
> > Perhaps a key or password system would be better. Log onto the
> > website and enter a password, which goes into the grub.conf. Or a
> > key system where each unique client instance must have a
> > server-assigned key put in the conf file, and tracking is done
> > server-side, blocking the client if a key-id connects from more than 2
> > IP addresses in a 6-hour period... and that key is used to encrypt the
> > session.
>
> Sure, this would be fine as well.

[ozra] I don't see a good reason to encrypt the whole session. If we adopt a user_id/password system, encrypting the password would be nice, though. I agree there are a lot of what-ifs with IP address authentication. At the moment I favor user_id/password authentication over using the IP address for it.

> Another thing I've seen is that of all the urls we are crawling, almost
> none are new. Is the grubdexer actually working? Several weeks ago I
> submitted my site to grub, and it was crawled. Checking using the url
> searcher, resources below what I submitted have never been seen
> before... which pretty much means new urls aren't being found?!?!? Just
> as a little test I have my own server setup and entered one of the local
> portal sites to crawl and had my big crawler run for an hour. It
> discovered almost 10k urls on that domain alone .. which would show up
> in the real crawler as lists like the cnn.com and other huge lists..

[ozra] The grubdexer works (even though it's slow), and it IS finding new URLs in the pages returned. The top table at http://www.grub.org/stats.php shows the actual increase in new URLs retrieved. However, the newly found URLs are NOT automatically moved into the crawl/index queue. We manually control the number of URLs to be crawled, and we haven't moved any new URLs over in a while (which largely explains why the second graph is a straight line). This is intentional -- it lets us control the best/worst/average update time for the URLs in our database (a rough sketch of this promotion step follows below). Remember -- our main goal is to stay up-to-date rather than to crawl every resource out there. We have probably found and inserted the new URLs from the pages you submitted; I can check that for you if you give me the URLs.

> If the project needs help with another server to help index or something
> like that I can help out (I have a BIG VA box idle) ... if the grubdexer
> is just behind, kill the scheduler for a little bit and let it catch up
> or something.

[ozra] You are right. The grubdexer is slow and can't keep up with grubd once the load exceeds something like 3,000,000 URLs crawled per day. Over the course of this week and probably the next I will be testing several different models for the grubdexer, trying to get a significant improvement.
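To make the "promotion" step above a bit more concrete, here is a rough sketch of the idea in Python. The table and column names (found_urls, crawl_queue, queued) are made up for illustration -- this is not the actual grubdexer code, just the shape of it:

import sqlite3

# Hypothetical illustration only -- not the real grubdexer/scheduler code.
# Newly-discovered URLs sit in a holding table ("found_urls") and only a
# capped batch is ever promoted into the crawl queue ("crawl_queue").

BATCH_SIZE = 50000  # caps how many new URLs enter the crawl rotation at once

def promote_new_urls(db_path, batch_size=BATCH_SIZE):
    """Move up to batch_size never-queued URLs into the crawl queue."""
    conn = sqlite3.connect(db_path)
    cur = conn.cursor()

    # Select a capped batch of URLs that were discovered but never queued.
    cur.execute("SELECT url FROM found_urls WHERE queued = 0 LIMIT ?",
                (batch_size,))
    batch = [row[0] for row in cur.fetchall()]

    # Queue them and mark them so they are not picked up twice.
    cur.executemany("INSERT INTO crawl_queue (url) VALUES (?)",
                    [(u,) for u in batch])
    cur.executemany("UPDATE found_urls SET queued = 1 WHERE url = ?",
                    [(u,) for u in batch])

    conn.commit()
    conn.close()
    return len(batch)

Running something like this by hand, with the batch size chosen from the update times we are aiming for, is essentially what the manual control comes down to.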
> Here is another nifty question. It has been a long proven fact that not
> every url has a link to it somewhere. Several of the search engines
> have designed slick ways to go around that limitation to find new urls
> .. like url-catching the newsgroups, retrieving a dump of every tld and
> hitting each url and www.url in there... and tricks like using the
> broken mod_index (http://www.domain.com/?S=M gives you a directory
> listing even if directory listing is disabled for a directory .. google
> is good at that one).... All that has given them a url list several
> times larger than just crawling alone.
>
> Will the grub project be trying slick things like that ... or perhaps
> getting a url list from another engine? At one point I dumped several
> tld's into the urls submission form, but none ever made it into
> crawler-land.

[ozra] Sure, we would like features like that, but right now (and for the foreseeable future) we have an order of magnitude more newly-found URLs than URLs we actually crawl. That only stops being true once you hit a point of saturation -- someone like Google may be experiencing that, but definitely not us; not yet.

> The idea of a minimum time for recrawl is another great idea ... right
> now I'm seeing the url list cycle about once a day... if the crawlers
> were busy working on finding new urls instead of recrawling unchanged
> urls for the 2nd time that day it would be a lot better.

[ozra] Actually, such a feature exists in the current scheduler, but it isn't used to its fullest. That's because we currently crawl far fewer URLs than our capacity allows. The reason is that we are still testing new features in the scheduler, and larger numbers might interfere with that work. We also need better measuring tools to figure out recrawl times, the total number of URLs to crawl, and so on (see the sketch in the P.S. below).

Cheers,
ozra.
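P.S. Purely for illustration, the minimum-recrawl check mentioned above comes down to something like the sketch below. It is hypothetical Python with made-up names, not what the scheduler actually runs:

import time

# Hypothetical illustration only -- not the scheduler's actual code.
# A URL becomes eligible for recrawl once a minimum interval has passed
# since its last crawl; URLs that were never crawled are always eligible.

MIN_RECRAWL_SECONDS = 24 * 60 * 60  # e.g. at most one recrawl per day

def eligible_for_recrawl(last_crawled_at, now=None,
                         min_interval=MIN_RECRAWL_SECONDS):
    """last_crawled_at is a UNIX timestamp, or None if never crawled."""
    if last_crawled_at is None:
        return True
    if now is None:
        now = time.time()
    return (now - last_crawled_at) >= min_interval

def due_urls(urls, now=None):
    """Filter (url, last_crawled_at) pairs down to those worth scheduling."""
    return [u for u, ts in urls if eligible_for_recrawl(ts, now)]

Whether one day is the right minimum is exactly the kind of question the better measuring tools should answer.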