
PHPCrawl and MySQL Database

Help
Anonymous
2013-12-23
2014-01-02
  • Anonymous

    Anonymous - 2013-12-23

    I have set up a cron job which runs example.php every 20 minutes. What I want to do is, after crawling, save the retrieved links to a database.

    How do I stop the script from crawling the URLs already in the database on the next run?

    Basically, I want to check my database before crawling any URL the script finds.

     

    Last edit: Anonymous 2013-12-23
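One way to do the database check described above is to put a UNIQUE/PRIMARY KEY constraint on the URL column and let an "insert if absent" statement do the deduplication. A minimal sketch, using an in-memory SQLite database so it is self-contained (the table name and function names are illustrative; for MySQL you would swap the DSN and use `INSERT IGNORE` instead of SQLite's `INSERT OR IGNORE`):

```php
<?php
// Sketch: deduplicate crawled URLs via a UNIQUE key, so each cron run
// can skip links that were already stored on an earlier run.

function makeDb(): PDO
{
    // SQLite in-memory DB for a self-contained demo; replace the DSN
    // with your MySQL credentials in a real setup.
    $db = new PDO('sqlite::memory:');
    $db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
    $db->exec('CREATE TABLE crawled_urls (url TEXT PRIMARY KEY)');
    return $db;
}

// Returns true if the URL was new (and is now recorded),
// false if it was already in the table.
function markIfNew(PDO $db, string $url): bool
{
    // For MySQL: INSERT IGNORE INTO crawled_urls (url) VALUES (?)
    $stmt = $db->prepare('INSERT OR IGNORE INTO crawled_urls (url) VALUES (?)');
    $stmt->execute([$url]);
    return $stmt->rowCount() === 1; // 0 rows affected => duplicate
}
```

Checking `markIfNew()` before processing a link means the database itself guarantees no double entries, even if two runs overlap.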
  • Anonymous

    Anonymous - 2013-12-31

    What I did was add the crawl data to an array before placing it in the database, then use in_array to check whether a link was already in the array.
    If not, I added the link, so the next run would not crawl it again, or at least would not place duplicate entries in the database.
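The in-memory approach described above can be sketched as follows (the function name is illustrative, not from PHPCrawl):

```php
<?php
// Collect links in an array and use in_array() to skip duplicates
// before they reach the database.
function addIfNew(array &$seen, string $url): bool
{
    if (in_array($url, $seen, true)) {
        return false; // duplicate: don't crawl/store it again
    }
    $seen[] = $url;
    return true;
}
```

Note that `in_array()` scans the whole array on every call; for large crawls it is faster to use the URL as the array key and test with `isset()`, which is a constant-time lookup.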

     
  • Anonymous

    Anonymous - 2014-01-02

    Hi!

    I know what your problem is: there's a callback-method missing in phpcrawl that would be called just BEFORE a request is made. Inside this method, you could then check whether the request-URL was already requested before (against your MySQL table or something else) and possibly abort/skip the request.

    This is still on the list of feature-requests and hopefully will get implemented in one of the next versions
    (http://sourceforge.net/p/phpcrawl/feature-requests/16/).
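Since that before-request callback doesn't exist yet, here is a self-contained simulation of what such a hook could do: consult a "seen" check before fetching, and skip the request if the URL was handled on an earlier run. The function names are illustrative only, not part of PHPCrawl's API:

```php
<?php
// Simulation of a pre-request filter: $alreadySeen decides whether to
// skip a URL (e.g. by querying your crawled_urls table), and $fetch
// stands in for the real HTTP request.
function crawlWithFilter(array $urls, callable $alreadySeen, callable $fetch): array
{
    $fetched = [];
    foreach ($urls as $url) {
        if ($alreadySeen($url)) {
            continue; // abort/skip this request
        }
        $fetch($url);
        $fetched[] = $url;
    }
    return $fetched;
}
```

With the real callback in place, the `$alreadySeen` check would simply move inside that method, so duplicate URLs never generate HTTP traffic at all.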

     
