
Stop Start Resume crawler

Created by Anonymous on 2014-03-06; last updated 2016-04-19
  • Anonymous

    Anonymous - 2014-03-06

    How can I abort a crawl process and then resume it?

    From the documentation (http://phpcrawl.cuab.de/resume_aborted_processes.html), we can use the $crawler->enableResumption() function to resume an aborted process.
    My question is: how do I trigger this abort process? I have tried returning -1 in handleDocumentInfo(), but that seems to stop the crawler entirely, so it cannot be resumed.

    What I'm trying to achieve is:
    - start the crawler
    - crawl 10 URLs
    - pause (and possibly do some other work)
    - crawl the next 10 URLs
    - and so on until completed
    Each batch of 10 URLs would be initiated via AJAX from a browser.

    Any ideas?

    Thanks

     
  • Anonymous

    Anonymous - 2014-03-09

    Hi!

    Sorry for the late answer.

    Right now I don't know how to achieve this directly without modifying the phpcrawl source code. If you let the handleDocumentInfo() method return a negative value, the crawling process stops "regularly", which means a complete cleanup is done and the URL cache gets deleted, so a resumption isn't possible anymore.

    The process resumption only works if a process was aborted "uncleanly" and the cache is still present (like after a system crash or something like that).

    Maybe you could work with a "wait flag" somewhere in a temporary file that gets set by your AJAX script: let the crawler wait (sleep) while the flag is present, and let it go again for the next 10 URLs once the flag isn't present.
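
    For illustration, here is a minimal sketch of that wait-flag idea (the flag file path, the batch size of 10, and the way the flag is toggled are just assumptions for the example, not part of phpcrawl):

        class BatchCrawler extends PHPCrawler
        {
            private $count = 0;                              // documents handled so far
            private $flag_file = "/tmp/crawler_wait.flag";   // example path, cleared by the AJAX script

            function handleDocumentInfo(PHPCrawlerDocumentInfo $info)
            {
                // ... process $info (store the URL, parse the page, etc.) ...

                // After every 10 documents, raise the wait flag and sleep until
                // the AJAX script deletes the file to release the next batch.
                $this->count++;
                if ($this->count % 10 == 0)
                {
                    touch($this->flag_file);
                    while (file_exists($this->flag_file))
                    {
                        sleep(1);
                        clearstatcache();   // make sure file_exists() sees the deletion
                    }
                }
            }
        }

    The AJAX endpoint then only has to delete the flag file (after doing its other work) to let the crawler continue with the next 10 URLs.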

    But you've got a good point there: maybe it would be useful to have a "special" return value for handleDocumentInfo() that lets the crawler stop "uncleanly", so that a resumption is possible?

     

    Last edit: Anonymous 2014-12-14
  • Anonymous

    Anonymous - 2015-01-17

    Has anyone managed to deal with that problem already? I would like to run the process for one hour, stop the crawler, and resume it the next day from the point where it stopped, again for one hour, and so on...

     
  • Anonymous

    Anonymous - 2015-01-18

    Hi, have you tried the die() function? Start the crawler, set up a timer, then kill the script execution, then resume it in 1 hour...
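
    For illustration, a rough sketch of how that could look, combining die() with the resumption mechanism from the documentation page linked above (the one-hour limit and the temp file used to remember the crawler ID are just assumptions for the example):

        class TimedCrawler extends PHPCrawler
        {
            private $started;
            private $limit;

            function __construct($limit = 3600)   // run for one hour by default
            {
                parent::__construct();
                $this->started = time();
                $this->limit = $limit;
            }

            function handleDocumentInfo(PHPCrawlerDocumentInfo $info)
            {
                // ... process $info here ...

                // Kill the script "uncleanly" once the time limit is reached, so the
                // URL cache stays on disk and the crawl can be resumed later.
                if (time() - $this->started >= $this->limit)
                {
                    die("Time limit reached, exiting so the crawl can be resumed.\n");
                }
            }
        }

        $crawler = new TimedCrawler(3600);
        $crawler->setURL("http://www.example.com/");
        $crawler->enableResumption();   // keep the URL cache so the process can be picked up again

        // Remember the crawler ID across runs (file name is just an example).
        $id_file = "/tmp/crawler_id.tmp";
        if (!file_exists($id_file))
        {
            file_put_contents($id_file, $crawler->getCrawlerId());
        }
        else
        {
            $crawler->resume(file_get_contents($id_file));
        }

        $crawler->go();

    Running the same script again an hour later (e.g. from cron) should then pick up where the previous run stopped, since enableResumption() keeps the URL cache when the script dies mid-crawl.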

     
  • Anonymous

    Anonymous - 2015-01-22

    The die() function works like a charm, thanks.

     
  • Anonymous

    Anonymous - 2015-04-20

    How did you implement that, please? I'm interested.
    Would it allow stopping the script, saving a first batch of URLs, and then resuming the process?

     
  • Anonymous

    Anonymous - 2016-04-15

    I also have the same problem. I found some code on the internet for crawling with PHPCrawl and tried to use it. There was no problem with the first few websites, but a problem arose when I tried to crawl certain websites: the code just keeps loading and stops crawling after several links.

    When crawling these websites, it suddenly stops fetching data after 11 links for the first website and 46 links for the second. When I checked Task Manager, I found both CPU and memory stuck at a fixed level (CPU at 25% and memory at 16 MB) but no network exchange, which means the code is still running but no longer loading data from the websites. I suspect the problem lies in PHPCrawl, but I don't know how to check it.

    This is not happening on the other websites, but now I am afraid the same case might arise again, since all my targeted websites have tens of thousands of pages, so the same problem is probably still waiting to be found. Can someone come up with a solution: why does it stop crawling, and how can I solve it? Here is the code:

    // handleDocumentInfo() is called by PHPCrawl for every document it receives.
    // str_get_html() comes from the Simple HTML DOM library; addURL() is my own
    // function for storing the result.
    class Crawler extends PHPCrawler {
        function handleDocumentInfo(PHPCrawlerDocumentInfo $p) {
            $u = $p->url;               // URL of the received document
            $c = $p->http_status_code;  // HTTP status code
            $s = $p->source;            // page source

            // Only parse pages that were fetched successfully and are not empty
            if ($c == 200 && $s != "") {
                $html = str_get_html($s);
                if (is_object($html)) {
                    // Grab the meta description, if present
                    $d = "";
                    $do = $html->find("meta[name=description]", 0);
                    if ($do) {
                        $d = $do->content;
                    }

                    // Grab the title and store title, URL and description
                    $t = $html->find("title", 0);
                    if ($t) {
                        $t = $t->innertext;
                        addURL($t, $u, $d);
                    }

                    // Free the DOM to keep memory usage down
                    $html->clear();
                    unset($html);
                }
            }
        }
    }

    function crawl($u) {
        $C = new Crawler();
        $C->setURL($u);
        $C->addContentTypeReceiveRule("#text/html#");                      // only receive HTML pages
        $C->addURLFilterRule("#(jpg|gif|png|pdf|jpeg|svg|css|js)$# i");    // skip static assets
        $C->setTrafficLimit(0);                                            // no traffic limit
        $C->enableCookieHandling(true);
        $C->obeyRobotsTxt(true);
        $C->obeyNoFollowTags(true);
        $C->setFollowMode(0);                                              // follow links to any host
        $C->go();
    }
    
     

    Last edit: Anonymous 2016-04-15
