Hi,
I am running into a problem where the crawler dies with no apparent error when it tries to grab particular text files. For example, when crawling http://xys.org/xys/ebooks/others/misc/, the crawler dies as soon as it hits http://xys.org/xys/ebooks/others/misc/beauty_physics.txt. Does anyone have an idea what might be causing this?
Here is the code I am using:
<?php
set_time_limit(600);
include("libs/PHPCrawl_080/libs/PHPCrawler.class.php");

class MyCrawler extends PHPCrawler
{
    function handleDocumentInfo(PHPCrawlerDocumentInfo $PageInfo)
    {
        // Store the URL of the document and its contents
        $filetype = explode('.', $PageInfo->url);
        if (end($filetype) == "txt")
        {
            $m = new Mongo();
            $db = $m->texts;
            $collection = $db->untranslated_texts;
            $encodings = array("UTF-8", "GB2312", "GBK", "EUC-JP", "HZ");
            //echo mb_detect_encoding(file_get_contents($PageInfo->url), $encodings).PHP_EOL;
            //echo mb_convert_encoding(file_get_contents($PageInfo->url), "UTF-8", $encodings);
            $content = mb_convert_encoding(file_get_contents($PageInfo->url), "UTF-8", $encodings);
            $obj = array(
                "source"  => $PageInfo->url,
                "name"    => end(explode('/', $PageInfo->url)),
                "content" => mb_convert_encoding(file_get_contents($PageInfo->url), "UTF-8", $encodings)
            );
            $collection->insert($obj);
            //echo $content, PHP_EOL;
        }
    }
}

$crawler = new MyCrawler();

// Cache URLs in an SQLite file for large sites
$crawler->setUrlCacheType(PHPCrawlerUrlCacheTypes::URLCACHE_SQLITE);
$crawler->setWorkingDirectory("/tmp/");

// URL to crawl
$crawler->setURL("http://xys.org/xys/ebooks/others/misc/");

// Set what links and paths the crawler will follow
$crawler->setFollowMode(3);

// Receive content that is text/html to parse the whole site
$crawler->addContentTypeReceiveRule("#text/html#");

// Ignore links to pictures, don't even request pictures
$crawler->addURLFilterRule("#\.(jpg|jpeg|gif|png)$# i");

// Store and send cookie-data like a browser does
//$crawler->enableCookieHandling(true);

$crawler->go();

$report = $crawler->getProcessReport();
echo "Summary:", PHP_EOL;
echo "aborted because ", $report->abort_reason, PHP_EOL;
//echo "Links followed: ".$report->links_followed.PHP_EOL;
//echo "Documents received: ".$report->files_received.PHP_EOL;
//echo "Bytes received: ".$report->bytes_received." bytes".PHP_EOL;
//echo "Peak memory used: ".$report->memory_peak_usage.PHP_EOL;
//echo "Process runtime: ".$report->process_runtime." sec".PHP_EOL;
?>
Anonymous - 2012-05-08
I apologize. After some more debugging, I found that the problem lies in converting the file's charset to UTF-8. Charset detection is quite a pain, it seems. Feel free to remove this post entirely, as it is not germane to the forum.
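For anyone hitting the same issue: one common pitfall is passing the whole candidate-encoding array straight to mb_convert_encoding, which lets it guess and mangle (or choke on) files it cannot decode. A minimal sketch of a safer approach, assuming the same candidate list as the crawler above (detectAndConvert is a hypothetical helper name, not part of PHPCrawl):

```php
<?php
// detectAndConvert(): detect the source charset strictly first, then convert.
// Returns null when detection fails, so the caller can skip or log the file
// instead of crashing mid-crawl.
function detectAndConvert($raw)
{
    // Same candidate list as in the crawler code above.
    $candidates = array("UTF-8", "GB2312", "GBK", "EUC-JP", "HZ");

    // Strict mode (third argument true) makes mb_detect_encoding return false
    // rather than guess when no candidate validates the byte sequence.
    $charset = mb_detect_encoding($raw, $candidates, true);

    if ($charset === false) {
        // Unknown charset: let the caller decide (skip, log, or store raw bytes).
        return null;
    }

    // Convert from the single detected charset, not from the whole list.
    return mb_convert_encoding($raw, "UTF-8", $charset);
}
```

In the handleDocumentInfo() callback you would then fetch the file once, run it through a helper like this, and only insert into MongoDB when the result is not null.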