Anonymous - 2016-03-03

I'm getting a lot of empty source objects when I run my crawl and I'm not sure why. I've already raised the timeouts to 60 seconds, and if I fetch the URLs myself the pages come up fine.
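
For what it's worth, this is roughly how I spot-checked the URLs by hand, outside the crawler (the URL below is just a placeholder for one of the pages that comes back empty):

<?php
// Plain fetch of one of the "blank" URLs -- no PHPCrawl involved.
// Hypothetical URL; substitute any page the crawler returned empty.
$html = file_get_contents("http://www.foo.bar/directory/somepage.html");
if ($html === false)
    echo "fetch failed\n";
else
    echo strlen($html)." bytes received\n";
?>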

Here is my crawler code:

<?php
include("libs/PHPCrawler.class.php");

function showerror()
{
    die("Error " . mysql_errno() . " : " . mysql_error());
}

// Connect to MySQL and select the crawl database
if (!($connection = @mysql_connect("localhost", "something", "somethingelse")))
    die("Could not connect");

if (!(@mysql_select_db("webcrawls", $connection)))
    showerror();

class MyCrawler extends PHPCrawler
{
  function handleDocumentInfo(PHPCrawlerDocumentInfo $PageInfo)
  {

    // Escape the raw values once before inserting (the page source is full of quotes)
    $url = mysql_real_escape_string($PageInfo->url);
    $pagecode = mysql_real_escape_string($PageInfo->source);
    if (!(mysql_query("INSERT INTO realtorcrawl (url, pagecode) VALUES ('$url', '$pagecode')")))
        showerror();
    echo $PageInfo->url."\n";
    echo "////////////////////////\n";
    echo "////////////////////////\n";
    echo $PageInfo->content."\n";
    echo $PageInfo->error_string."\n";
    echo $PageInfo->received."|".$PageInfo->received_completely."\n";
    echo "########################\n";
  }
} 

$crawler = new MyCrawler();
$crawler->setURL("http://www.foo.bar");
$crawler->setFollowMode(3); // 3 = only follow links within the path of the start URL
$crawler->addContentTypeReceiveRule("#text/html#"); // only receive text/html documents
$crawler->enableCookieHandling(true);
$crawler->addURLFollowRule("#directory#"); // only follow URLs containing "directory"
$crawler->setConnectionTimeout(60);
$crawler->setStreamTimeout(60);
$crawler->addURLFilterRule("#(jpg|jpeg|gif|png|bmp)$# i");
$crawler->addURLFilterRule("#(css|js)$# i");
$crawler->go();
?>
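
After a run, a quick query like this against the table makes it easy to count how many of the stored pages are empty:

<?php
// Count stored pages whose source column is empty -- reuses the connection from above
$result = mysql_query("SELECT COUNT(*) FROM realtorcrawl WHERE pagecode = ''", $connection);
$row = mysql_fetch_row($result);
echo $row[0]." empty pages stored\n";
?>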

Can anyone spot an error that would cause blank pages? Roughly 80% of the links the crawler follows come back blank and codeless, both in my DB inserts and in the echoed output.
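
In case it helps narrow things down, here's the extra logging I'm thinking of adding inside handleDocumentInfo. It assumes the http_status_code, content_type and error_occured fields that the PHPCrawlerDocumentInfo docs list:

// Only log details for documents that arrived without source
if ($PageInfo->source == "")
{
  echo "EMPTY: ".$PageInfo->url."\n";
  echo "  HTTP status: ".$PageInfo->http_status_code."\n"; // 3xx redirects usually have empty bodies
  echo "  Content type: ".$PageInfo->content_type."\n";    // non-matching types aren't received at all
  echo "  Error occured: ".($PageInfo->error_occured ? "yes" : "no")."\n";
}

That should at least tell me whether the blanks are redirects, timeouts, or filtered content types.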