I have set up a cron job that runs example.php every 20 minutes. What I want to do is save the retrieved links to a database after each crawl.
How do I stop the script from re-crawling the URLs that are already in the database on the next run?
Basically, I want to check my database before crawling any URL the script finds.
Last edit: Anonymous 2013-12-23
What I did was add the crawl data to an array before placing it in the database. Then I used in_array to check whether the link was already in the array.
If not, I added the link, so next time it would not be crawled again, or at least no duplicate entries would be placed in the database.
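A minimal sketch of that approach, assuming phpcrawl 0.8's PHPCrawler / handleDocumentInfo API; the include path, the PDO connection details, and the links table (with a UNIQUE index on url) are example assumptions:

```php
<?php
include("libs/PHPCrawler.class.php");

class DedupCrawler extends PHPCrawler
{
    private $seen = array(); // links already collected during this run
    private $pdo;

    public function __construct(PDO $pdo)
    {
        parent::__construct();
        $this->pdo = $pdo;
    }

    public function handleDocumentInfo(PHPCrawlerDocumentInfo $DocInfo)
    {
        // in_array check: skip links we have already stored this run
        if (in_array($DocInfo->url, $this->seen)) {
            return;
        }
        $this->seen[] = $DocInfo->url;

        // INSERT IGNORE together with a UNIQUE index on `url` also
        // prevents duplicates left over from earlier cron runs
        $stmt = $this->pdo->prepare("INSERT IGNORE INTO links (url) VALUES (?)");
        $stmt->execute(array($DocInfo->url));
    }
}

$pdo = new PDO("mysql:host=localhost;dbname=crawler", "user", "pass");
$crawler = new DedupCrawler($pdo);
$crawler->setURL("www.example.com");
$crawler->go();
```

Note that the in_array check only deduplicates within a single run; it is the UNIQUE index that keeps the database clean across cron runs.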
Hi!
I know what your problem is: there is a callback method missing in phpcrawl that gets called just BEFORE a request is made. Inside such a method you could check whether the URL was already requested before (through your MySQL table or something else) and abort/skip the request if so.
This is still on the list of feature requests and hopefully will be implemented in one of the next versions
(http://sourceforge.net/p/phpcrawl/feature-requests/16/)
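Just to illustrate what that could look like once implemented: the handleRequestUrl method below is hypothetical, it does not exist in phpcrawl's API, and its "return false to skip" contract is invented for this sketch.

```php
<?php
// NOTE: purely hypothetical. phpcrawl does not offer this callback yet
// (see feature request #16 above); the method name handleRequestUrl and
// its return-value contract are invented for illustration.
class SkippingCrawler extends PHPCrawler
{
    private $pdo;

    public function __construct(PDO $pdo)
    {
        parent::__construct();
        $this->pdo = $pdo;
    }

    // Imagined hook, called just before each HTTP request is sent
    public function handleRequestUrl($url)
    {
        $stmt = $this->pdo->prepare("SELECT 1 FROM links WHERE url = ?");
        $stmt->execute(array($url));

        // Returning false would tell the crawler to abort/skip this request
        return $stmt->fetchColumn() === false;
    }
}
```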