From: Lowell H. <lha...@vi...> - 2001-07-16 17:36:46
Here is another nifty question. It has long been established that not every url has a link pointing to it from somewhere else. Several of the search engines have designed slick ways around that limitation to find new urls: catching urls posted to the newsgroups, retrieving a dump of every tld and hitting each url and www.url in there, and tricks like exploiting the broken mod_index behavior (http://www.domain.com/?S=M gives you a directory listing even when directory listing is disabled for that directory; google is good at that one; there is a sketch of the probe at the end of this mail). All of that has given them url lists several times larger than crawling alone would. Will the grub project be trying slick things like that, or perhaps getting a url list from another engine? At one point I dumped several tld's into the urls submission form, but none of them ever made it into crawler-land.

Another thing I've noticed is that of all the urls we are crawling, almost none are new. Is the grubdexer actually working? Several weeks ago I submitted my site to grub, and it was crawled. Checking with the url searcher, the resources below the url I submitted have never been seen, which pretty much means new urls aren't being found?!

As a little test I set up my own server, entered one of the local portal sites, and let my big crawler run for an hour (roughly the breadth-first crawl sketched at the end of this mail). It discovered almost 10k urls on that domain alone, which would show up in the real crawler as one of those huge lists, like the cnn.com one. If the project needs help with another server for indexing or anything like that, I can help out (I have a BIG VA box sitting idle). Or, if the grubdexer is just behind, kill the scheduler for a little bit and let it catch up.

Lowell
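
P.S. For anyone who wants to reproduce the ?S=M trick, here is a minimal sketch in Python. The "Index of" / "Parent Directory" strings are just my heuristic for spotting an auto-generated index, and the target url is a placeholder; this is not what google (or grub) actually runs.

#!/usr/bin/env python3
# Probe a directory url with the ?S=M query string and guess whether
# the server handed back an auto-generated index listing.  The marker
# strings below are a heuristic, nothing official.
import urllib.request

def probe_listing(dir_url):
    try:
        with urllib.request.urlopen(dir_url + "?S=M", timeout=10) as resp:
            body = resp.read(65536).decode("latin-1", "replace")
    except Exception:
        return False
    return "Index of" in body or "Parent Directory" in body

if __name__ == "__main__":
    # placeholder target; point it at any directory url you like
    print(probe_listing("http://www.example.com/files/"))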
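
P.P.S. And here is roughly what my one-hour test looked like: a plain breadth-first crawl pinned to one host, collecting every distinct url it sees. It is only a sketch (no robots.txt, no delay between requests, naive href extraction), and the start url is a placeholder, not my actual test target.

#!/usr/bin/env python3
# Breadth-first crawl of a single host, collecting every distinct url.
import urllib.request
import urllib.parse
from collections import deque
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    # collect the href attribute of every <a> tag
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl_domain(start_url, limit=10000):
    host = urllib.parse.urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    while queue and len(seen) < limit:
        url = queue.popleft()
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                # skip non-html responses; we only want pages with links
                if "html" not in resp.headers.get("Content-Type", ""):
                    continue
                page = resp.read().decode("latin-1", "replace")
        except Exception:
            continue
        parser = LinkParser()
        parser.feed(page)
        for href in parser.links:
            # resolve relative links and drop fragments
            absolute = urllib.parse.urljoin(url, href).split("#")[0]
            parsed = urllib.parse.urlparse(absolute)
            if (parsed.scheme in ("http", "https")
                    and parsed.netloc == host
                    and absolute not in seen):
                seen.add(absolute)
                queue.append(absolute)
    return seen

if __name__ == "__main__":
    urls = crawl_domain("http://www.portal.example/")  # placeholder start url
    print(len(urls), "urls discovered")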