I'm getting a lot of empty source objects when I run my crawl, and I'm not sure why. I've raised both timeouts to 60 seconds. If I pull the URLs myself, the pages come up fine.
Here is my code:
<?php
include("libs/PHPCrawler.class.php");

function showerror()
{
    die("Error " . mysql_errno() . " : " . mysql_error());
}

if (!($connection = @mysql_connect("localhost", "something", "somethingelse")))
    die("Could not connect");

if (!(@mysql_select_db("webcrawls", $connection)))
    showerror();

class MyCrawler extends PHPCrawler
{
    function handleDocumentInfo(PHPCrawlerDocumentInfo $PageInfo)
    {
        // ('$PageInfo->url','$PageInfo->content'");
        $pagecode = $PageInfo->source;
        $pagecode = mysql_escape_string($pagecode);
        $pagecode = addslashes($pagecode);

        if (!(mysql_query("INSERT INTO realtorcrawl (url, pagecode) VALUES ('$PageInfo->url','$pagecode')")))
            showerror();

        echo $PageInfo->url . "\n";
        echo "////////////////////////\n";
        echo "////////////////////////\n";
        echo $PageInfo->content . "\n";
        echo $PageInfo->error_string . "\n";
        echo $PageInfo->received . "|" . $PageInfo->received_completely . "\n";
        echo "########################\n";
    }
}

$crawler = new MyCrawler();
$crawler->setURL("http://www.foo.bar");

$follow_mode = 3;
$crawler->setFollowMode($follow_mode);

$crawler->addContentTypeReceiveRule("#text/html#");
$crawler->enableCookieHandling(true);
$crawler->addURLFollowRule("#directory#");

$crawler->setConnectionTimeout(60);
$crawler->setStreamTimeout(60);

$crawler->addURLFilterRule("#(jpg|jpeg|gif|png|bmp)$# i");
$crawler->addURLFilterRule("#(css|js)$# i");

$crawler->go();
?>
Can anyone spot any errors that would cause blank pages? I'd say 80% of the links it follows come back blank, with no source at all, both in my DB inserts and in the echo output.