How can I abort a crawl process and then resume it?
From the documentation (http://phpcrawl.cuab.de/resume_aborted_processes.html), we can use the $crawler->enableResumption() function to resume an aborted process.
My question is: how do I trigger the abort itself? I have tried returning -1 from handleDocumentInfo(), but that seems to stop the crawler entirely, and it cannot be resumed afterwards.
What I'm trying to achieve is:
- start the crawler
- crawl 10 URLs
- pause (and possibly do some other work)
- crawl the next 10 URLs
- and so on until completed
Each batch of 10 URLs would be initiated via AJAX from a browser.
Any ideas?
Thanks
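For reference, the resumption pattern on the documentation page linked above looks roughly like the sketch below. The method names (enableResumption(), getCrawlerId(), resume()) follow that page; MyCrawler, the start URL and the temp-file path are placeholders, so treat this as a sketch rather than a drop-in solution.

<?php
// Sketch of the documented resumption pattern.
// MyCrawler is a hypothetical PHPCrawler subclass that overrides handleDocumentInfo().
$crawler = new MyCrawler();
$crawler->setURL("www.example.com");
$crawler->enableResumption();

$id_file = "/tmp/mycrawler_id.tmp"; // placeholder location for the crawler-ID

if (!file_exists($id_file)) {
    // First run: store the crawler-ID so a later run can resume this process.
    file_put_contents($id_file, $crawler->getCrawlerId());
} else {
    // A previous run was aborted "unclean" and its URL-cache is still present: resume it.
    $crawler->resume(file_get_contents($id_file));
}

$crawler->go();
?>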
Hi!
Sorry for the late answer.
Right now I don't know how to achieve this directly without modifying the PHPCrawl source code. If you let the handleDocumentInfo() method return a negative value, the crawling process stops "regularly": a complete cleanup is done and the URL-cache gets deleted, so a resumption isn't possible anymore.
The process resumption only works if a process was aborted "unclean" and the cache is still present (after a system crash or something like that).
Maybe you could work with a "wait-flag" somewhere in a temporary file that gets set by your AJAX script, and let the crawler wait (sleep) as long as the flag is present, then let it go again for the next 10 URLs once the flag is gone.
But you've got a good point there: maybe it would be useful to have a special return value for handleDocumentInfo() that lets the crawler stop "unclean", so that a resumption is possible?
Last edit: Anonymous 2014-12-14
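To make the wait-flag idea above a bit more concrete, here is one possible reading of it as a minimal sketch. The flag-file path, the batch size of 10 and the MyCrawler class are assumptions; the AJAX endpoint would delete the flag file whenever the next batch should start.

<?php
// Sketch of the "wait-flag" workaround. Flag path, batch size and MyCrawler are assumptions.
class MyCrawler extends PHPCrawler
{
    private $count = 0;

    function handleDocumentInfo(PHPCrawlerDocumentInfo $DocInfo)
    {
        echo $DocInfo->url . "\n";
        $this->count++;

        // After every 10 documents, set the wait-flag ...
        if ($this->count % 10 == 0) {
            touch("/tmp/crawler_wait.flag");
        }

        // ... and sleep as long as it is present. The AJAX script removes the
        // file when the crawler should continue with the next 10 URLs.
        while (file_exists("/tmp/crawler_wait.flag")) {
            sleep(1);
            clearstatcache(); // file_exists() results are cached by PHP
        }
    }
}
?>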
Has anyone already found a way to deal with this problem? I would like to run the process for one hour, stop the crawler, and resume it the next day from the point where it previously ended, again for one hour, and so on...
Hi, have you tried the "die" function? Start the crawler, set up a timer, then kill the script execution and resume it in an hour...
The die() function works like a charm, thanks.
How did you implement that, please? I'm interested.
Would it allow stopping the script, saving a first batch of URLs, and then resuming the process?
https://gist.github.com/dawid-z/c0904747280dba937544
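The gist itself isn't reproduced in the thread, but combining the die() suggestion with enableResumption() might look roughly like the sketch below. The one-hour limit, the file paths and the MyCrawler class are assumptions; the point is that killing the script leaves the URL-cache behind (an "unclean" stop), unlike returning a negative value from handleDocumentInfo().

<?php
// Sketch: abort "unclean" via die() after a time limit, then resume on the next
// run (e.g. started by cron the next day). Limit, paths and MyCrawler are assumptions.
class MyCrawler extends PHPCrawler
{
    public $start_time;
    public $time_limit = 3600; // crawl for one hour per run

    function handleDocumentInfo(PHPCrawlerDocumentInfo $DocInfo)
    {
        echo $DocInfo->url . "\n";

        // die() keeps the URL-cache, so the next run can pick up where this one stopped.
        if (time() - $this->start_time > $this->time_limit) {
            die("Time limit reached, resuming on the next run.\n");
        }
    }
}

$crawler = new MyCrawler();
$crawler->start_time = time();
$crawler->setURL("www.example.com");
$crawler->enableResumption();

$id_file = "/tmp/mycrawler_id.tmp";
if (!file_exists($id_file)) {
    file_put_contents($id_file, $crawler->getCrawlerId()); // first run: remember the ID
} else {
    $crawler->resume(file_get_contents($id_file));         // later runs: resume
}

$crawler->go();

// If the crawl completes normally (go() returns), drop the stored ID so the
// next invocation starts a fresh crawl instead of trying to resume.
unlink($id_file);
?>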
I also have the same problem. I found some code on the internet that crawls using PHPCrawl and tried to use it. There was no problem with the first few websites, but a problem arises when I try to crawl certain ones: the code just keeps loading and stops crawling after several links.
When crawling these websites, it suddenly stops fetching data after 11 links for the first website and 46 links for the second. When I checked Task Manager, both CPU and memory were stuck at a fixed level (CPU at 25% and memory at 16 MB) with no network exchange, which means the script is still running but no longer loading data from the websites. I suspect the problem lies in PHPCrawl, but I don't know how to check it.
This is not happening with the other websites, but I'm now afraid the same case will come up again, since all my targeted websites have tens of thousands of pages, so the same problem may simply not have appeared yet. Can someone suggest a solution: why does it stop crawling, and how can I fix it? Here is the code:
Last edit: Anonymous 2016-04-15
How does it work - http://www.lebenslaufmuster.biz/online-lebenslauf-builder.php
PHP + HTML5?