Avoid crawl loops on huge sites - I would like to crawl only distinct URLs

Help
Anonymous
2016-10-31
2016-11-22
  • Anonymous

    Anonymous - 2016-10-31

    Hi guys,

    I'm writing to you because I want to crawl huge sites, but I see that the crawler repeats URLs, so I would like to crawl only distinct URLs.

    How can I do this?

    Thanks for your help!!!!

     
  • James Shaver

    James Shaver - 2016-11-01

    What have you tried so far, and what exactly were your results? I think that would tell us why you're getting repeated URLs at all.

     
  • Anonymous

    Anonymous - 2016-11-01

    Hi James,

    Thanks for your help.

    I'm crawling sites and writing the URLs I get to MySQL, but when I count the URLs I see that the same URL is repeated several times.

    In the handleDocumentInfo method I wrote this condition:

    function handleDocumentInfo($DocInfo){
        global $url_procesadas; // URLs processed so far
        // Only handle a URL the first time it is seen
        if (!in_array($DocInfo->url, $url_procesadas, true)){
            $url_procesadas[] = $DocInfo->url;
            // ... do all I need with the url
        }
    }
    ...
    $crawler = new MyCrawler();
    $crawler->setURL($url);
    $crawler->addContentTypeReceiveRule("#text/html#");
    // Skip static assets; the dot is escaped so it only matches a literal "."
    $crawler->addURLFilterRule("#\.(css|js|jpg|jpeg|gif|png)$#i");
    $crawler->setUrlCacheType(PHPCrawlerUrlCacheTypes::URLCACHE_SQLITE);
    $crawler->enableCookieHandling(true);
    $crawler->go();

    But I'm not sure this solution is the best, because I don't know whether the crawler still makes the request and then does nothing with the content.

    What do you think?

     
  • James Shaver

    James Shaver - 2016-11-22

    You might want to check that the HTTP status code is OK (200), but otherwise it looks good.

    function handleDocumentInfo($DocInfo){
        global $url_procesadas;
        if (in_array($DocInfo->url, $url_procesadas) == false && $DocInfo->http_status_code == 200){
            array_push($url_procesadas, $DocInfo->url);
            ... do all I need with the url
        }
    }
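
    Since in_array() scans the whole list on every call, the check above can get slow on huge sites. A minimal sketch of the same "only once per URL" idea using array keys is below. The SeenUrls class and its markIfNew() method are hypothetical names, not part of PHPCrawl, and treating URLs that differ only in their #fragment as the same page is an assumption that may not suit every site.

```php
<?php
// Hypothetical helper (not part of PHPCrawl): remembers URLs already handled.
final class SeenUrls
{
    /** @var array URLs stored as keys, so lookups use isset() instead of a scan. */
    private $seen = [];

    /**
     * Returns true the first time a URL is offered, false on every repeat.
     */
    public function markIfNew(string $url): bool
    {
        // Assumption: "page#top" and "page" are the same document.
        $key = explode('#', $url, 2)[0];

        if (isset($this->seen[$key])) {
            return false;
        }
        $this->seen[$key] = true;
        return true;
    }
}

// Usage: call markIfNew() before doing any work with a URL.
$seen = new SeenUrls();
foreach (['http://example.com/a', 'http://example.com/a#top', 'http://example.com/b'] as $url) {
    if ($seen->markIfNew($url)) {
        echo "processing $url\n";
    }
}
```

    Inside handleDocumentInfo() the condition would then become something like $seen->markIfNew($DocInfo->url) combined with the status code check. Note that this only skips the processing step; by the time handleDocumentInfo() is called the page has already been requested.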
    
     
