Crawler dies unexpectedly

Help
Anonymous
2012-05-08
2013-04-09
  • Anonymous

    Anonymous - 2012-05-08

    Hi,
    I am running into a problem where the crawler dies for no apparent reason when it attempts to fetch particular text files. An example can be seen when crawling http://xys.org/xys/ebooks/others/misc/ where, in particular, the crawler dies when it hits http://xys.org/xys/ebooks/others/misc/beauty_physics.txt. Does anyone have an idea what might be causing this?
    Here is the code I am using:

    <?php
    set_time_limit(600);
    include("libs/PHPCrawl_080/libs/PHPCrawler.class.php");
    class MyCrawler extends PHPCrawler 
    { 
      function handleDocumentInfo(PHPCrawlerDocumentInfo $PageInfo) 
      { 
        // Store the URL of the document and its contents
        $filetype = explode('.', $PageInfo->url);
        if (end($filetype)=="txt") { 
          $m = new Mongo();
          $db = $m->texts;
          $collection = $db->untranslated_texts;
          $encodings = array("UTF-8", "GB2312", "GBK", "EUC-JP", "HZ");
          //echo mb_detect_encoding(file_get_contents($PageInfo->url), $encodings).PHP_EOL;
          //echo mb_convert_encoding(file_get_contents($PageInfo->url), "UTF-8", $encodings);
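          // Download the file once and convert it to UTF-8, letting mbstring pick among the candidate encodings
          $content = mb_convert_encoding(file_get_contents($PageInfo->url), "UTF-8", $encodings);
          // end() takes its argument by reference, so the exploded path needs its own variable
          $parts = explode('/', $PageInfo->url);
          $obj = array( "source" => $PageInfo->url, "name" => end($parts), "content" => $content );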
          $collection->insert($obj);
          //echo $content,PHP_EOL;
        }
      } 
    }
    $crawler = new MyCrawler();
    // cache in sqlite file for large sites
    $crawler->setUrlCacheType(PHPCrawlerUrlCacheTypes::URLCACHE_SQLITE);
    $crawler->setWorkingDirectory("/tmp/");
    // url to crawl
    $crawler->setURL("http://xys.org/xys/ebooks/others/misc/"); 
    // set what links and paths crawler will take
    $crawler->setFollowMode(3);
    // only receive content of type text/html, so the whole site can be parsed for links
    $crawler->addContentTypeReceiveRule("#text/html#");
    // Ignore links to pictures; don't even request them
    $crawler->addURLFilterRule("#\.(jpg|jpeg|gif|png)$# i"); 
    // Store and send cookie-data like a browser does 
    //$crawler->enableCookieHandling(true);
    $crawler->go();
    $report = $crawler->getProcessReport(); 
         
    echo "Summary:",PHP_EOL; 
    echo "aborted because ",$report->abort_reason,PHP_EOL;
    //echo "Links followed: ".$report->links_followed.PHP_EOL; 
    //echo "Documents received: ".$report->files_received.PHP_EOL; 
    //echo "Bytes received: ".$report->bytes_received." bytes".PHP_EOL; 
    //echo "Peak memory used: ".$report->memory_peak_usage.PHP_EOL;
    //echo "Process runtime: ".$report->process_runtime." sec".PHP_EOL; 
    ?>
    
     
  • Anonymous

    Anonymous - 2012-05-08

    I apologize. I did some more debugging, and the problem lies in converting from the charset of the file to UTF-8. Charset detection is quite a pain, it seems. Feel free to remove this post entirely, as it is not germane to the forum.
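
For anyone who lands here with the same symptom, here is a minimal sketch of one way to make the conversion step fail softly instead of guessing. It assumes the mbstring extension is loaded; the helper name `toUtf8OrNull` and its candidate list are hypothetical and not part of PHPCrawl or the code above. The idea is to detect the encoding in strict mode first and to skip the document when nothing fits, rather than handing a whole list of encodings to `mb_convert_encoding()`.

```php
<?php
// Hypothetical helper (not part of PHPCrawl): convert raw bytes to UTF-8,
// returning null when none of the candidate encodings matches.
function toUtf8OrNull($raw, array $candidates)
{
    // Input that is already valid UTF-8 is returned unchanged.
    if (mb_check_encoding($raw, 'UTF-8')) {
        return $raw;
    }

    // Strict mode (third argument true): an encoding is only reported
    // when the byte sequence is valid in it; otherwise false is returned.
    $detected = mb_detect_encoding($raw, $candidates, true);
    if ($detected === false) {
        return null; // let the caller log and skip this document
    }

    // Convert from the single detected encoding instead of a list.
    return mb_convert_encoding($raw, 'UTF-8', $detected);
}
```

Inside `handleDocumentInfo()` the result could be checked for `null` before the insert, so that one undetectable file is logged and skipped and the crawl carries on. Detection remains a heuristic: several East Asian encodings overlap, so the order of the candidate list still influences the result.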

     
