I have spent a lot of time trying to solve this problem, without success. I traced through the code to see what is wrong, and some of it confuses me. Can someone help explain it?
Summary:
In the file PHPCrawlerLinkFinder.class.php, the code in the function findRedirectLinkInHeader shows that redirect links are added to the LinkCache, but phpcrawl does not process them in the function startChildProcessLoop in PHPCrawler.class.php.
Example of a redirect:
http://product.mobile.163.com/mobile/brand/000O00ED.html => http://product.mobile.163.com/Nokia/
Almost all links like "http://product.mobile.163.com/mobile/brand/000O00ED.html" return a 301 HTTP status code.
The link http://product.mobile.163.com/mobile/brand/000O00ED.html is found on http://product.mobile.163.com
This is my code:
$crawler = new Crawler();
$crawler->setURL("http://product.mobile.163.com");
$crawler->addContentTypeReceiveRule("#text/html#");
$crawler->addURLFilterRule("#.(jpg|jpeg|gif|png|js)$# i");
$crawler->addURLFollowRule("#product.mobile.163.com/mobile/brand/\w{8}.html# i");
$crawler->addURLFollowRule("#product.mobile.163.com/Samsung/\w{8}/$# i");
phpcrawl gets all the links that match "#product.mobile.163.com/mobile/brand/\w{8}.html# i", but it does not find any links that match "#product.mobile.163.com/Samsung/\w{8}/$# i".
Can someone tell me why phpcrawl does not follow the 301 redirects and fetch the pages they point to?
This is all my code:
<?php
// Let the script run without a time limit; a crawl can take a while.
set_time_limit(0);
include("libs/PHPCrawler.class.php");

class Crawler extends PHPCrawler
{
    // Called by the crawler for every document it requests.
    function handleDocumentInfo($DocInfo)
    {
        // Line break for output: "\n" on the command line, "<br />" in a browser.
        if (PHP_SAPI == "cli")
        {
            $lb = "\n";
        }
        else
        {
            $lb = "<br />";
        }
        echo "Page requested: " . $DocInfo->url . $lb;
        echo "Http status code: " . $DocInfo->http_status_code . $lb;
        flush();
    }
}

$crawler = new Crawler();
$crawler->setURL("http://product.mobile.163.com");
$crawler->addContentTypeReceiveRule("#text/html#");
$crawler->addURLFilterRule("#.(jpg|jpeg|gif|png|js)$# i");
$crawler->addURLFollowRule("#product.mobile.163.com/mobile/brand/\w{8}.html# i");
$crawler->addURLFollowRule("#product.mobile.163.com/Samsung/\w{8}/$# i");
$crawler->setFollowRedirects(true);
$crawler->setFollowRedirectsTillContent(true);
$crawler->go();

// Print a summary of the finished crawl.
$report = $crawler->getProcessReport();
if (PHP_SAPI == "cli")
{
    $lb = "\n";
}
else
{
    $lb = "<br />";
}
echo "Summary:" . $lb;
echo "Links followed: " . $report->links_followed . $lb;
echo "Documents received: " . $report->files_received . $lb;
echo "Bytes received: " . $report->bytes_received . " bytes" . $lb;
echo "Process runtime: " . $report->process_runtime . " sec" . $lb;
Last edit: Anonymous 2013-11-19
Hi!
By setting these follow rules
$crawler->addURLFollowRule("#product.mobile.163.com/mobile/brand/\w{8}.html# i");
$crawler->addURLFollowRule("#product.mobile.163.com/Samsung/\w{8}/$# i");
you tell the crawler to follow ONLY links that match one of the given expressions.
But the redirect link you are missing (http://product.mobile.163.com/Nokia/) DOES NOT match any of your follow rules, so the crawler doesn't follow it.
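
To see this without running a crawl, here is a minimal sketch that tests the two URLs from the question against the follow rules using plain preg_match. The helper matchesAnyRule and the broader pattern in $extraRule are assumptions made up for illustration; they are not part of phpcrawl, and the sketch only checks the regular expressions, not how the crawler applies them internally.

```php
<?php
// Follow rules copied from the question (dots left unescaped, as posted).
$followRules = [
    '#product.mobile.163.com/mobile/brand/\w{8}.html# i',
    '#product.mobile.163.com/Samsung/\w{8}/$# i',
];

// Hypothetical extra rule (an assumption, not from the thread): matches a
// single top-level directory such as /Nokia/ or /Samsung/.
$extraRule = '#product\.mobile\.163\.com/\w+/$# i';

// Returns true if the URL matches at least one of the given patterns.
function matchesAnyRule(string $url, array $rules): bool
{
    foreach ($rules as $rule) {
        if (preg_match($rule, $url) === 1) {
            return true;
        }
    }
    return false;
}

$brandPage      = "http://product.mobile.163.com/mobile/brand/000O00ED.html";
$redirectTarget = "http://product.mobile.163.com/Nokia/";

// The brand page matches the first rule.
var_dump(matchesAnyRule($brandPage, $followRules));      // bool(true)

// The redirect target matches neither rule.
var_dump(matchesAnyRule($redirectTarget, $followRules)); // bool(false)

// With the hypothetical extra rule, the redirect target matches.
var_dump(matchesAnyRule($redirectTarget, array_merge($followRules, [$extraRule]))); // bool(true)
```

If a rule along the lines of $extraRule fits your site, it could be passed to addURLFollowRule in the same way as the existing two rules. Whether it is too broad or too narrow depends on which pages you actually want crawled.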