
Cannot follow 301 redirect links, can anyone help?

Help
Anonymous
2013-04-25
2013-04-26
  • Anonymous

    Anonymous - 2013-04-25

    I have spent a lot of time trying to solve this problem, without success. I traced through the code to see what is wrong and found some parts that confuse me. Can someone help explain them?

    Summary:
    In PHPCrawlerLinkFinder.class.php, the function findRedirectLinkInHeader shows that redirect links are added to the LinkCache, but phpcrawl does not seem to process them in the startChildProcessLoop function in PHPCrawler.class.php.

    An example of a redirect:
    http://product.mobile.163.com/mobile/brand/000O00ED.html => http://product.mobile.163.com/Nokia/
    Almost all links like "http://product.mobile.163.com/mobile/brand/000O00ED.html" return HTTP status code 301.

    The link http://product.mobile.163.com/mobile/brand/000O00ED.html is found on http://product.mobile.163.com.
    This is my code:

    $crawler = new Crawler();
    $crawler->setURL("http://product.mobile.163.com");
    $crawler->addContentTypeReceiveRule("#text/html#");
    $crawler->addURLFilterRule("#\.(jpg|jpeg|gif|png|js)$# i");
    $crawler->addURLFollowRule("#product.mobile.163.com/mobile/brand/\w{8}.html# i");
    $crawler->addURLFollowRule("#product.mobile.163.com/Samsung/\w{8}/$# i");

    phpcrawl gets all the links matching "#product.mobile.163.com/mobile/brand/\w{8}.html# i", but it does not find any links matching "#product.mobile.163.com/Samsung/\w{8}/$# i".

    Can someone tell me why phpcrawl does not follow the 301 links and fetch the pages they redirect to?

    This is all my code:

    set_time_limit(0);
    include("libs/PHPCrawler.class.php");

    class Crawler extends PHPCrawler
    {
        function handleDocumentInfo($DocInfo)
        {
            // Plain newline on the command line, HTML line break in the browser
            if (PHP_SAPI == "cli")
            {
                $lb = "\n";
            }
            else
            {
                $lb = "<br />";
            }

            echo "Page requested: ". $DocInfo->url. $lb;
            echo "Http status code: ". $DocInfo->http_status_code. $lb;
            flush();
        }
    }

    $crawler = new Crawler();
    $crawler->setURL("http://product.mobile.163.com");
    $crawler->addContentTypeReceiveRule("#text/html#");
    $crawler->addURLFilterRule("#\.(jpg|jpeg|gif|png|js)$# i");
    $crawler->addURLFollowRule("#product.mobile.163.com/mobile/brand/\w{8}.html# i");
    $crawler->addURLFollowRule("#product.mobile.163.com/Samsung/\w{8}/$# i");
    $crawler->setFollowRedirects(true);
    $crawler->setFollowRedirectsTillContent(true);

    $crawler->go();
    $report = $crawler->getProcessReport();

    if (PHP_SAPI == "cli")
    {
        $lb = "\n";
    }
    else
    {
        $lb = "<br />";
    }

    // Print the crawl summary
    echo "Summary:".$lb;
    echo "Links followed: ".$report->links_followed.$lb;
    echo "Documents received: ".$report->files_received.$lb;
    echo "Bytes received: ".$report->bytes_received." bytes".$lb;
    echo "Process runtime: ".$report->process_runtime." sec".$lb;

     

    Last edit: Anonymous 2013-11-19
  • Uwe Hunfeld

    Uwe Hunfeld - 2013-04-26

    Hi!

    By setting these follow rules

    $crawler->addURLFollowRule("#product.mobile.163.com/mobile/brand/\w{8}.html# i");
    $crawler->addURLFollowRule("#product.mobile.163.com/Samsung/\w{8}/$# i");

    you tell the crawler to follow ONLY links that match one of the given expressions.

    But the redirect link you are missing (http://product.mobile.163.com/Nokia/) DOES NOT match either of your follow rules, so the crawler doesn't follow it.
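
    The mismatch can be checked outside the crawler with plain preg_match. This is only a minimal sketch, assuming the follow rules are applied as ordinary PCRE patterns against the full URL (PHPCrawl's internals may differ). matchesAnyRule is a hypothetical helper, not a PHPCrawl function, and the third rule is just one possible pattern that would cover brand pages such as /Nokia/.

```php
<?php
// The two follow rules from the post (modifier written as "#i").
$followRules = array(
    '#product.mobile.163.com/mobile/brand/\w{8}.html#i',
    '#product.mobile.163.com/Samsung/\w{8}/$#i',
);

// Hypothetical helper (not part of PHPCrawl):
// returns true if $url matches at least one of the given patterns.
function matchesAnyRule($url, array $rules)
{
    foreach ($rules as $rule) {
        if (preg_match($rule, $url) === 1) {
            return true;
        }
    }
    return false;
}

$brandPage      = 'http://product.mobile.163.com/mobile/brand/000O00ED.html';
$redirectTarget = 'http://product.mobile.163.com/Nokia/';

// The brand page matches the first rule, so it is requested (and answers with a 301).
$brandMatches = matchesAnyRule($brandPage, $followRules);

// The redirect target matches neither rule, so it would be skipped.
$redirectMatches = matchesAnyRule($redirectTarget, $followRules);

// Assumed extra rule: any single top-level directory such as /Nokia/ or /Samsung/.
$followRules[] = '#product\.mobile\.163\.com/\w+/$#i';
$redirectMatchesAfter = matchesAnyRule($redirectTarget, $followRules);

var_dump($brandMatches, $redirectMatches, $redirectMatchesAfter);
```

    If the check behaves the same way inside the crawler, passing a rule like the third one to addURLFollowRule should let the redirect targets through. Whether that pattern is too broad for the site is something to verify against the actual URLs.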

     
