
setUrlCacheType (URLCACHE_SQLITE) problem, I need help

Anonymous
2013-11-19
2013-11-21
  • Anonymous

    Anonymous - 2013-11-19

    Hi peeps :)

    I'm trying to activate the SQLite cache in the crawler, but I run into problems.
    The settings are the same as in the example.php that comes with the crawler; I only added the setUrlCacheType call:

    $crawler = new MyCrawler();
    $crawler->setURL("www.php.net");
    $crawler->addContentTypeReceiveRule("#text/html#");
    $crawler->addURLFilterRule("#\.(jpg|jpeg|gif|png)$# i");
    $crawler->enableCookieHandling(true);
    $crawler->setUrlCacheType(PHPCrawlerUrlCacheTypes::URLCACHE_SQLITE);
    $crawler->setTrafficLimit(200 * 1024);
    $crawler->go();

    I get these errors when running from the CLI on Windows 7:
    php.exe -f C:\wamp\www\crawler\classes\external\PHPCrawl\example.php

    Warning: unlink(C:\Users\me\AppData\Local\Temp/phpcrawl_tmp_53321384859087\cookiecache.db3): Permission denied in C:\wamp\www\kierkegaard\classes\external\PHPCrawl\libs\PHPCrawlerUtils.class.php on line 481

    Warning: rmdir(C:\Users\me\AppData\Local\Temp/phpcrawl_tmp_53321384859087): Directory not empty in C:\wamp\www\kierkegaard\classes\external\PHPCrawl\libs\PHPCrawlerUtils.class.php on line 486

    I read on http://cuab.de/spidering_huge_websites.html:
    "Please note that the PHP PDO-extension together with the SQLite-driver (PDO_SQLITE) has to be installed and activated to use this type of caching."
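    As a quick sanity check for that requirement, the PDO extension and its SQLite driver can be verified from PHP itself (just a small standalone sketch, not part of the crawler):

    ```php
    <?php
    // Quick check that the PDO extension and its SQLite driver
    // (pdo_sqlite) are available, as required for URLCACHE_SQLITE.
    var_dump(extension_loaded('pdo'));                        // expect bool(true)
    var_dump(in_array('sqlite', PDO::getAvailableDrivers())); // expect bool(true)
    ```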

    These are my settings in wamp:
    http://picpaste.com/pics/aZc3Se9v.1384859627.jpg

    PHP version 5.3.13

    Can anybody tell me what I am doing wrong? :)

    Lars.

     

    Last edit: Anonymous 2013-11-19
  • Uwe Hunfeld

    Uwe Hunfeld - 2013-11-19

    Hi Lars,

    When do these errors occur? At the end of the crawling process?

    It seems like just the "cleanup" at the end of a crawling process fails. For some reason, the crawler created the sqlite cookie-db file correctly (cookiecache.db3), but then it isn't allowed to delete it anymore at the end. Strange.

    This has nothing to do with PDO-SQLite; it's "just" a permission thing.

    Did you try to change the working-directory?

     
  • Anonymous

    Anonymous - 2013-11-19

    Thank you for the fast reply. I'm doing my bachelor project with the crawler :)

    I did not change the working directory, so I tried the following combinations now:


    Both settings:
    $crawler->setWorkingDirectory("tmp/");
    $crawler->setUrlCacheType(PHPCrawlerUrlCacheTypes::URLCACHE_SQLITE);

    Result in browser:
    Exception: Error creating working directory 'tmp/phpcrawl_tmp_38321384868535\' in C:\wamp\www\kierkegaard\classes\external\PHPCrawl\libs\PHPCrawler.class.php on line 782

    Result in commandline:
    Error creating working directory 'tmp/phpcrawl_tmp_53681384868674\' in C:\wamp\www\kierkegaard\classes\external\PHPCrawl\libs\PHPCrawler.class.php on line 782

    (Error shown at the top, no crawling result.)



    setUrlCacheType only
    $crawler->setUrlCacheType(PHPCrawlerUrlCacheTypes::URLCACHE_SQLITE);

    Result in browser:
    Warning: unlink(C:\Windows\Temp/phpcrawl_tmp_38321384868771\urlcache.db3) [function.unlink]: Permission denied in C:\wamp\www\kierkegaard\classes\external\PHPCrawl\libs\PHPCrawlerUtils.class.php on line 481

    Warning: rmdir(C:\Windows\Temp/phpcrawl_tmp_38321384868771) [function.rmdir]: Directory not empty in C:\wamp\www\kierkegaard\classes\external\PHPCrawl\libs\PHPCrawlerUtils.class.php on line 486

    Result in commandline:
    Warning: unlink(C:\Users\larsmqller\AppData\Local\Temp/phpcrawl_tmp_63721384868967\urlcache.db3): Permission denied in C:\wamp\www\kierkegaard\classes\external\PHPCrawl\libs\PHPCrawlerUtils.class.php on line 481

    Warning: rmdir(C:\Users\larsmqller\AppData\Local\Temp/phpcrawl_tmp_63721384868967): Directory not empty in C:\wamp\www\kierkegaard\classes\external\PHPCrawl\libs\PHPCrawlerUtils.class.php on line 486

    (It crawls; the errors are shown at the bottom.)



    setWorkingDirectory only

    $crawler->setWorkingDirectory("tmp/");

    Result in browser:
    Exception: Error creating working directory 'tmp/phpcrawl_tmp_38321384869100\' in C:\wamp\www\kierkegaard\classes\external\PHPCrawl\libs\PHPCrawler.class.php on line 782

    Result in commandline:
    Exception: Error creating working directory 'tmp/phpcrawl_tmp_31601384869252\' in C:\wamp\www\kierkegaard\classes\external\PHPCrawl\libs\PHPCrawler.class.php on line 782

    (Error shown at the top, no crawling result.)

    I created the tmp dir myself (setWorkingDirectory only):

    Result in browser:
    It runs, and I see temp files in the tmp dir until the crawler is finished
    (run.png)

    Result in commandline:
    Uncaught exception 'Exception' with message 'Error creating working directory '/tmp/phpcrawl_tmp_70681384869580\'' in C:\wamp\www\kierkegaard\classes\external\PHPCrawl\libs\PHPCrawler.class.php on line 782
    (no crawling result.)



    Both settings, with the tmp dir I created:

    Result in browser:
    Exception: Error creating working directory '/tmp/phpcrawl_tmp_38321384869712\' in C:\wamp\www\kierkegaard\classes\external\PHPCrawl\libs\PHPCrawler.class.php on line 782
    (no crawling result.)

    Result in commandline:
    Error creating working directory '/tmp/phpcrawl_tmp_70401384869773\' in C:\wamp\www\kierkegaard\classes\external\PHPCrawl\libs\PHPCrawler.class.php on line 782
    (no crawling result.)

    Lars.

     
  • Uwe Hunfeld

    Uwe Hunfeld - 2013-11-19

    Hi Lars again,

    it can't be a big problem, just permission problems and/or path separators. I'll take a look at it later on.

    I strongly recommend using a Linux OS together with phpcrawl (if possible for your work); that's what it was made for, and it's stable there!

    AND: there you can use multiple processes out of the box for spidering websites - that will speed things up a LOT for you!

    I'll let you know when I know more.

    One more question: which Windows version do you use? (7/8? 32-bit or 64-bit?)

     
  • Anonymous

    Anonymous - 2013-11-19

    Thanks for the advice, I will look into that after the bachelor project :)

    I'm using Windows 7 - 64bit (for now)

    • BTW, I would like to say you made an awesome crawler - big thumbs up!
     
  • Uwe Hunfeld

    Uwe Hunfeld - 2013-11-20

    Hi Lars,

    I finally figured out the problem.

    To fix it:
    In the cleanup() method of PHPCrawler.class.php, insert these two lines at the beginning (line 797):

    $this->CookieCache = null;
    $this->LinkCache = null;

    So it should look like this:

    protected function cleanup()
    {
      $this->CookieCache = null;
      $this->LinkCache = null;
    
      // Delete working-dir
      PHPCrawlerUtils::rmDir($this->working_directory);
    
      // Remove semaphore (if multiprocess-mode)
      if ($this->multiprocess_mode != PHPCrawlerMultiProcessModes::MPMODE_NONE)
      {
        $sem_key = sem_get($this->crawler_uniqid);
        sem_remove($sem_key);
      }
    }
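    For what it's worth, here is a small standalone sketch (with a made-up temp file name) of why those two lines fix it: an open PDO connection keeps the SQLite database file locked on Windows, so unlink() fails with "Permission denied" until the handle is released:

    ```php
    <?php
    // Standalone sketch (hypothetical temp file name): an open PDO
    // connection keeps the SQLite file locked on Windows, so unlink()
    // fails until the handle is released by dropping the reference.
    $file = sys_get_temp_dir() . '/pdo_lock_demo.db3';

    $pdo = new PDO('sqlite:' . $file);
    $pdo->exec('CREATE TABLE demo (id INTEGER)');

    // While $pdo is alive, unlink($file) fails on Windows with the same
    // "Permission denied" warning the crawler printed.

    // Dropping the last reference closes the connection ...
    $pdo = null;

    // ... and now the file can be deleted and its directory removed.
    unlink($file);
    ```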
    

    Please let me know if it worked for you over there too.

    I'm opening a bug report for this; it will get officially fixed in the next version.

    THANKS for the report!

     
  • Anonymous

    Anonymous - 2013-11-20

    When running in the browser it works :)
    When running from the command line I get this error:
    Error creating working directory 'tmp/phpcrawl_tmp_71241384980199\' in C:\wamp\www\me\classes\external\PHPCrawl\libs\PHPCrawler.class.php on line 782

    The settings I added:

    // Set working directory
    $crawler->setWorkingDirectory('tmp/');

    // Cache on the hard disk instead of in memory - for crawling huge websites
    $crawler->setUrlCacheType(PHPCrawlerUrlCacheTypes::URLCACHE_SQLITE);

    For the browser test to work I need to create the "tmp" folder before the crawler starts; I don't know if this is supposed to be done manually.

     
  • Anonymous

    Anonymous - 2013-11-20

    Try it without $crawler->setWorkingDirectory('tmp/') (let it use the system's default temp dir); then it should work in the browser and in the CLI.
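    For reference, the complete combination that worked in the end (just a sketch assembled from the snippets in this thread; it assumes the MyCrawler class from the bundled example.php and the patched cleanup() method in PHPCrawler.class.php):

    ```php
    <?php
    // Working setup from this thread (sketch): SQLite URL-cache enabled,
    // no setWorkingDirectory() call, so the crawler uses the system's
    // default temp dir. Assumes the MyCrawler class from example.php
    // and the patched cleanup() method in PHPCrawler.class.php.
    $crawler = new MyCrawler();
    $crawler->setURL("www.php.net");
    $crawler->addContentTypeReceiveRule("#text/html#");
    $crawler->addURLFilterRule("#\.(jpg|jpeg|gif|png)$# i");
    $crawler->enableCookieHandling(true);

    // Cache URLs on disk instead of in memory (for huge websites).
    $crawler->setUrlCacheType(PHPCrawlerUrlCacheTypes::URLCACHE_SQLITE);

    $crawler->setTrafficLimit(200 * 1024);
    $crawler->go();
    ```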

     
  • Anonymous

    Anonymous - 2013-11-20

    Yep. Working :)
    Awesome.

    Thank you :)

     
  • Anonymous

    Anonymous - 2013-11-21

    Glad I could help.
    Good luck with your work!

     
