
Stop Start Resume crawler

Created by Anonymous on 2014-03-06; last updated 2016-04-19
  • Anonymous

    Anonymous - 2014-03-06

    How can I abort a crawl process and then resume it?

    From the documentation (http://phpcrawl.cuab.de/resume_aborted_processes.html), we can use the $crawler->enableResumption() function to resume an aborted process.
    My question is: how do I trigger this abort process? I have tried returning -1 in handleDocumentInfo(), but that seems to stop the crawler entirely, so it cannot be resumed.

    What I'm trying to achieve is:
    - start the crawler
    - crawl 10 URLs
    - pause (and possibly do some other work)
    - crawl the next 10 URLs
    - and so on until completed
    Each batch of 10 URLs would be initiated via AJAX from a browser.

    Any ideas?

    Thanks

     
  • Anonymous

    Anonymous - 2014-03-09

    Hi!

    Sorry for the late answer.

    Right now I don't know how to achieve this directly without modifying the phpcrawl source code. If you let the handleDocumentInfo() method return a negative value, the crawling process stops "regularly", which means a complete cleanup is done and the URL cache gets deleted, so a resumption isn't possible anymore.

    The process resumption only works if a process was aborted "uncleanly" and the cache is still present (like after a system crash or something like that).

    Maybe you could work with a "wait flag" somewhere in a temporary file that gets set by your AJAX script: let the crawler wait (sleep) while the flag is present, and let it go again for the next 10 URLs once the flag isn't present.
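
    For illustration, here is a minimal sketch of that wait-flag idea (the flag file path, the batch size of 10, and the way the flag is toggled are just assumptions for the example, not part of phpcrawl):

        class BatchCrawler extends PHPCrawler
        {
            private $count = 0;                              // documents handled so far
            private $flag_file = "/tmp/crawler_wait.flag";   // example path, cleared by the AJAX script

            function handleDocumentInfo(PHPCrawlerDocumentInfo $info)
            {
                // ... process $info (store the URL, parse the page, etc.) ...

                // After every 10 documents, raise the wait flag and sleep until
                // the AJAX script deletes the file to release the next batch.
                $this->count++;
                if ($this->count % 10 == 0)
                {
                    touch($this->flag_file);
                    while (file_exists($this->flag_file))
                    {
                        sleep(1);
                        clearstatcache();   // make sure file_exists() sees the deletion
                    }
                }
            }
        }

    The AJAX endpoint then only has to delete the flag file (after doing its other work) to let the crawler continue with the next 10 URLs.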

    But you've got a good point there: maybe it would be useful to have a "special" return value for handleDocumentInfo() that lets the crawler stop "uncleanly", so that a resumption is possible?

     

    Last edit: Anonymous 2014-12-14
  • Anonymous

    Anonymous - 2015-01-17

    Has anyone managed to deal with that problem already? I would like to run the process for one hour, stop the crawler, and resume it the next day from the point where it stopped, again for one hour, and so on...

     
  • Anonymous

    Anonymous - 2015-01-18

    Hi, have you tried the die() function? Start the crawler, set up a timer, then kill the script execution, then resume it in 1 hour...
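
    For illustration, a rough sketch of how that could look, combining die() with the resumption mechanism from the documentation page linked above (the one-hour limit and the temp file used to remember the crawler ID are just assumptions for the example):

        class TimedCrawler extends PHPCrawler
        {
            private $started;
            private $limit;

            function __construct($limit = 3600)   // run for one hour by default
            {
                parent::__construct();
                $this->started = time();
                $this->limit = $limit;
            }

            function handleDocumentInfo(PHPCrawlerDocumentInfo $info)
            {
                // ... process $info here ...

                // Kill the script "uncleanly" once the time limit is reached, so the
                // URL cache stays on disk and the crawl can be resumed later.
                if (time() - $this->started >= $this->limit)
                {
                    die("Time limit reached, exiting so the crawl can be resumed.\n");
                }
            }
        }

        $crawler = new TimedCrawler(3600);
        $crawler->setURL("http://www.example.com/");
        $crawler->enableResumption();   // keep the URL cache so the process can be picked up again

        // Remember the crawler ID across runs (file name is just an example).
        $id_file = "/tmp/crawler_id.tmp";
        if (!file_exists($id_file))
        {
            file_put_contents($id_file, $crawler->getCrawlerId());
        }
        else
        {
            $crawler->resume(file_get_contents($id_file));
        }

        $crawler->go();

    Running the same script again an hour later (e.g. from cron) should then pick up where the previous run stopped, since enableResumption() keeps the URL cache when the script dies mid-crawl.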

     
  • Anonymous

    Anonymous - 2015-01-22

    The die() function works like a charm, thanks.

     
  • Anonymous

    Anonymous - 2015-04-20

    How did you implement that, please? I'm interested.
    Would it allow stopping the script, saving a first batch of URLs, and then resuming the process?

     
  • Anonymous

    Anonymous - 2016-04-15

    I also have the same problem. I found some code on the internet for crawling with PHPCrawl and tried to use it. There was no problem with the first few websites, but a problem arose when I tried to crawl certain websites: the code just keeps loading and stops crawling after several links.

    When crawling these websites, it suddenly stops fetching data after 11 links for the first website and 46 links for the second. When I checked Task Manager, I found both CPU and memory stuck at a fixed level (CPU at 25% and memory at 16 MB) but no network exchange, which means the code is still running but no longer loading data from the websites. I suspect the problem lies in PHPCrawl, but I don't know how to check it.

    This is not happening on the other websites, but now I am afraid the same case might arise again, since all my targeted websites have tens of thousands of pages, so the same problem is probably still waiting to be found. Can someone come up with a solution: why does it stop crawling, and how can I solve it? Here is the code:

    // handleDocumentInfo() is called by PHPCrawl for every document it receives.
    // str_get_html() comes from the Simple HTML DOM library; addURL() is my own
    // function for storing the result.
    class Crawler extends PHPCrawler {
        function handleDocumentInfo(PHPCrawlerDocumentInfo $p) {
            $u = $p->url;               // URL of the received document
            $c = $p->http_status_code;  // HTTP status code
            $s = $p->source;            // page source

            // Only parse pages that were fetched successfully and are not empty
            if ($c == 200 && $s != "") {
                $html = str_get_html($s);
                if (is_object($html)) {
                    // Grab the meta description, if present
                    $d = "";
                    $do = $html->find("meta[name=description]", 0);
                    if ($do) {
                        $d = $do->content;
                    }

                    // Grab the title and store title, URL and description
                    $t = $html->find("title", 0);
                    if ($t) {
                        $t = $t->innertext;
                        addURL($t, $u, $d);
                    }

                    // Free the DOM to keep memory usage down
                    $html->clear();
                    unset($html);
                }
            }
        }
    }

    function crawl($u) {
        $C = new Crawler();
        $C->setURL($u);
        $C->addContentTypeReceiveRule("#text/html#");                      // only receive HTML pages
        $C->addURLFilterRule("#(jpg|gif|png|pdf|jpeg|svg|css|js)$# i");    // skip static assets
        $C->setTrafficLimit(0);                                            // no traffic limit
        $C->enableCookieHandling(true);
        $C->obeyRobotsTxt(true);
        $C->obeyNoFollowTags(true);
        $C->setFollowMode(0);                                              // follow links to any host
        $C->go();
    }
    
     

    Last edit: Anonymous 2016-04-15
