From: Lowell H. <lha...@vi...> - 2001-07-01 18:29:53
I finally started my crawlers back up, but I'm still seeing some invalid URLs that could easily be filtered out. A good way to check URLs as they are entered into the db might be: in the http://www.blah.com/ portion only, skip any containing:

- ".." (usually a sign of going down a directory)
- "//" (other than in the "http://" itself; it shows brokenness and often fails)
- a hostname not in the format of many characters, dot, 2-4 characters (which would filter out URLs like http://specials/lunch.html)
- characters other than a-z, 0-9, or "-"

Next question: there are a few URLs that are on port numbers other than 80, which hit a lot of people's firewalls and die.

One more: I've seen some ftp URLs in there. Are those going to be allowed? They die at the firewall too.

There are still a bunch of hqx, lha, swf, ps, and other files in there, causing a couple of > 900k/sec spikes on my graphs as my crawlers beat up the site they were on. They are starting to randomize around now though, which is nice =))

--- Lowell
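A rough sketch of those checks, in Python, might look like the following. This is my reading of the heuristics above, not an official filter; the function name and the exact hostname regex are my own assumptions, and the port/scheme checks fold in the firewall concerns about non-80 ports and ftp URLs:

```python
import re
from urllib.parse import urlsplit

# Hostname must look like "many characters dot 2-4 characters",
# using only a-z, 0-9, "-", and "." -- this rejects bare names
# like http://specials/lunch.html
HOST_RE = re.compile(r"[a-z0-9-]+(\.[a-z0-9-]+)*\.[a-z]{2,4}")

def url_ok(url: str) -> bool:
    """Heuristic URL filter based on the rules discussed above."""
    parts = urlsplit(url)

    # Only plain http: ftp (and other schemes) tend to die at firewalls.
    if parts.scheme != "http":
        return False

    # Only the default port 80; odd ports hit firewalls too.
    try:
        port = parts.port  # raises ValueError on a malformed port
    except ValueError:
        return False
    if port not in (None, 80):
        return False

    if not HOST_RE.fullmatch(parts.hostname or ""):
        return False

    # ".." usually means a relative path leaked into the db;
    # "//" in the path (beyond the scheme's) is usually broken.
    if ".." in parts.path or "//" in parts.path:
        return False

    return True
```

For example, url_ok("http://www.blah.com/lunch.html") passes, while http://specials/lunch.html, ftp URLs, and URLs on port 8080 are all rejected. Filtering by extension (hqx, lha, swf, ps) could be bolted on the same way with a check on the path's suffix.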