Hello,
First of all, this is the best PHP crawler I've found on the internet; thank you for supporting this project.
My question is about stability when crawling large sites, for example a site with 1 million pages.
Is it possible to resume a crawl session if the crawler was suddenly stopped?
I guess the key is in the "urls_to_crawl" and "url_map" variables of the object:
=> Array
    (
        => 1
        => 1
        => 1
    )
Any suggestions will be appreciated,
Thanks.
Hello!
I'm sorry, but right now it's not possible to resume an aborted crawl session, since version 0.71 of phpcrawl
uses local RAM to cache all URLs (just in an array, as you mentioned).
So after the crawler is stopped or aborted, the URLs are gone.
The upcoming version 0.8 will alternatively use an SQLite database file for caching these URLs, so it should be possible to resume a crawling session, but it will take some more time to get the new version finished.
Best regards!
Thanks for your reply, and for supporting this in the upcoming version. I also hope that keeping the urls_to_crawl data in a file will work better for large sites (with thousands of URLs), since it frees up memory.
Below is my quick implementation. The main idea is to run this script via cron every 10 minutes to check whether the crawl is still alive; if the process has hung, the session is restored.
The user code is below. Of course, it doesn't correctly report the total number of files downloaded and other statistics.
It would also be better to move the checks and helpers such as isBuisySession(), endSession() etc. out of the user code, but here is how I did it:
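(MyCrawler here is my subclass of the phpcrawl crawler class; enableSession(), restoreSession(), isBuisySession(), continueGo() and endSession() are my own session helpers around it, not part of the stock phpcrawl API.)

<?php

/* Prepare MyCrawler() */
$crawl = new MyCrawler();
$crawl->setURL('http://big-site.com/');
$crawl->setPageLimit(10);
$crawl->enableSession(true); // enable session support

if ($crawl->restoreSession())
{
    // A session file from a previous run was found
    if ($crawl->isBuisySession(2))
    {
        // The session file was saved within the last 2 minutes,
        // so the other process is most likely still running
        die('Session is busy<br>');
    }
    else
    {
        // The other process seems to have died:
        // continue crawling with the restored session data
        $crawl->continueGo();
    }
}
else
{
    // No session was found: perform a new crawl (creates a new session)
    $crawl->go();
}

/* End the session, required! */
$crawl->endSession();

?>

The script itself is started by cron every 10 minutes, for example with a crontab entry along the lines of */10 * * * * php /path/to/crawl-script.php (the path is just a placeholder), so a hung crawl gets picked up again automatically.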
I hope that someone will find this helpful when crawling large sites.
Regards.
Hey armab,
thanks for your post and code.
Just added your request to the list of feature requests:
https://sourceforge.net/tracker/?func=detail&aid=3500669&group_id=89439&atid=590149
Thanks and best regards,
huni.
Hi huni,
When are you planning to add this feature and make it available for us to download?
Best Regards
Also, just a friendly question as to whether this is available now? Would be excellent ;)
I guess that this is the solution: http://phpcrawl.cuab.de/spidering_huge_websites.html ?
Sorry, it's not available yet in version 0.80; it's still on the list of feature requests.
The chapter "spidering huge websites" just explains what type of cache to use for spidering big sites.
It has nothing to do with resuming aborted crawling processes.
Best regards!
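(For reference: the cache selection that chapter describes comes down to one call on the crawler. The sketch below assumes the 0.8x API with setUrlCacheType() and the PHPCrawlerUrlCacheTypes constants, so check the class reference of the version you are using.)

<?php

$crawler = new MyCrawler();
$crawler->setURL('http://big-site.com/');

// Default: keep the URL cache in memory (fast, but RAM-hungry on huge sites)
// $crawler->setUrlCacheType(PHPCrawlerUrlCacheTypes::URLCACHE_MEMORY);

// For huge sites: keep the URL cache in an SQLite file on disk instead
$crawler->setUrlCacheType(PHPCrawlerUrlCacheTypes::URLCACHE_SQLITE);

$crawler->go();

?>

With the SQLite cache the list of found URLs lives in a file instead of in RAM, which keeps memory usage flat on million-page sites, but as stated above it does not by itself make an aborted process resumable.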
Ok - bugger, I actually thought it was done when reading the description :)
How close is it to being done?
In that case I would, by the way, suggest also allowing MySQL as a backend, as most PHP implementations I've seen or participated in are PHP/MySQL based. And seeing that you plan on using SQLite, I suppose the add-on would not require much?
:) Thanks again!
Storing the data in an SQLite db is, I think, the best solution for this issue.
An SQLite db is as simple as a text file, while MySQL is more complex to set up.
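(Just to illustrate that point: with PDO, an SQLite database is a single self-contained file and needs no server or credentials, while MySQL needs both. The path and credentials below are made up.)

<?php

// SQLite: the whole cache is one file on disk, no server required
$sqlite = new PDO('sqlite:/tmp/phpcrawl_urlcache.db');

// MySQL: needs a running server, an existing database and credentials
$mysql = new PDO('mysql:host=localhost;dbname=phpcrawl_cache', 'db_user', 'db_password');

?>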
Ok - I agree that SQLite would be first priority :)
I need this feature as soon as possible.
I see that you can store the session in a text file now, but will it resume if the crawler was stopped?
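(For anyone finding this thread later: resumable crawling processes did eventually make it into phpcrawl, as far as I know in the 0.8x releases, via enableResumption(), getCrawlerId() and resume(). The sketch below follows the project's class reference; double-check the method names against the version you are running, and treat the ID file location as an example.)

<?php

$crawler = new MyCrawler();
$crawler->setURL('http://big-site.com/');
$crawler->enableResumption(); // resumption needs the PDO/SQLite extension

// Example location for storing the crawler ID between runs
$id_file = '/tmp/phpcrawl_crawler_id.tmp';

if (!file_exists($id_file))
{
    // First run: remember the crawler ID so the process can be resumed later
    file_put_contents($id_file, $crawler->getCrawlerId());
}
else
{
    // A previous run was aborted: resume it from the stored ID
    $crawler->resume(file_get_contents($id_file));
}

$crawler->go();

?>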