From: Lowell H. <lha...@vi...> - 2001-07-16 17:36:46
Here is another nifty question. It has long been established that not every url has a link pointing to it from somewhere else. Several of the search engines have designed slick ways around that limitation to find new urls: catching urls posted to the newsgroups, retrieving a dump of every tld and hitting each url and www.url in there, and tricks like exploiting the broken mod_index behavior (http://www.domain.com/?S=M gives you a directory listing even when directory listing is disabled for that directory; google is good at that one; there is a sketch of the probe at the end of this mail). All of that has given them url lists several times larger than crawling alone would. Will the grub project be trying slick things like that, or perhaps getting a url list from another engine? At one point I dumped several tld's into the urls submission form, but none of them ever made it into crawler-land.

Another thing I've noticed is that of all the urls we are crawling, almost none are new. Is the grubdexer actually working? Several weeks ago I submitted my site to grub, and it was crawled. Checking with the url searcher, the resources below the url I submitted have never been seen, which pretty much means new urls aren't being found?!

As a little test I set up my own server, entered one of the local portal sites, and let my big crawler run for an hour (roughly the breadth-first crawl sketched at the end of this mail). It discovered almost 10k urls on that domain alone, which would show up in the real crawler as one of those huge lists, like the cnn.com one. If the project needs help with another server for indexing or anything like that, I can help out (I have a BIG VA box sitting idle). Or, if the grubdexer is just behind, kill the scheduler for a little bit and let it catch up.

Lowell
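
P.S. For anyone who wants to reproduce the ?S=M trick, here is a minimal sketch in Python. The "Index of" / "Parent Directory" strings are just my heuristic for spotting an auto-generated index, and the target url is a placeholder; this is not what google (or grub) actually runs.

#!/usr/bin/env python3
# Probe a directory url with the ?S=M query string and guess whether
# the server handed back an auto-generated index listing.  The marker
# strings below are a heuristic, nothing official.
import urllib.request

def probe_listing(dir_url):
    try:
        with urllib.request.urlopen(dir_url + "?S=M", timeout=10) as resp:
            body = resp.read(65536).decode("latin-1", "replace")
    except Exception:
        return False
    return "Index of" in body or "Parent Directory" in body

if __name__ == "__main__":
    # placeholder target; point it at any directory url you like
    print(probe_listing("http://www.example.com/files/"))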
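
P.P.S. And here is roughly what my one-hour test looked like: a plain breadth-first crawl pinned to one host, collecting every distinct url it sees. It is only a sketch (no robots.txt, no delay between requests, naive href extraction), and the start url is a placeholder, not my actual test target.

#!/usr/bin/env python3
# Breadth-first crawl of a single host, collecting every distinct url.
import urllib.request
import urllib.parse
from collections import deque
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    # collect the href attribute of every <a> tag
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl_domain(start_url, limit=10000):
    host = urllib.parse.urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    while queue and len(seen) < limit:
        url = queue.popleft()
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                # skip non-html responses; we only want pages with links
                if "html" not in resp.headers.get("Content-Type", ""):
                    continue
                page = resp.read().decode("latin-1", "replace")
        except Exception:
            continue
        parser = LinkParser()
        parser.feed(page)
        for href in parser.links:
            # resolve relative links and drop fragments
            absolute = urllib.parse.urljoin(url, href).split("#")[0]
            parsed = urllib.parse.urlparse(absolute)
            if (parsed.scheme in ("http", "https")
                    and parsed.netloc == host
                    and absolute not in seen):
                seen.add(absolute)
                queue.append(absolute)
    return seen

if __name__ == "__main__":
    urls = crawl_domain("http://www.portal.example/")  # placeholder start url
    print(len(urls), "urls discovered")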