Hi you lovely people :)
Is it feasible to run PHPCrawl from the command line? I.e., is it "realistic" that it can crawl 300,000 pages or so if I run it from the command line, or will it stop at some point (as it most surely would via a browser ;-))?
How would I run it from the command line to ensure it stays as stable as possible?
I'm planning to run it in a Linux environment where I have full server control. Would it just be:
$ php my_crawler.php
... or are there any configuration aspects I should bear in mind for optimal results?
Again, thanks & sorry if it's a dumb question :-)
Hi!
Normally you should always run PHPCrawl from the command line (CLI).
And yes, simply run your script/project by executing "$ php my_crawler_script.php".
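In case it helps, here is a bare-bones sketch of such a script, roughly following the quickstart example from the docs (the include path, class name and start URL are just placeholders):

<?php
// Include the PHPCrawl main class (the path depends on where you unpacked the library).
include("libs/PHPCrawler.class.php");

// Extend PHPCrawler and override handleDocumentInfo() to process each received page.
class MyCrawler extends PHPCrawler
{
    function handleDocumentInfo($DocInfo)
    {
        // Print the URL and HTTP status of every page and flush,
        // so the output shows up immediately on the command line.
        echo $DocInfo->url." (".$DocInfo->http_status_code.")\n";
        flush();
    }
}

$crawler = new MyCrawler();
$crawler->setURL("http://www.example.com/");
$crawler->go();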
If you want to spider huge websites (containing 300,000 pages or more), you should switch to the internal SQLite cache.
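The call for that would look roughly like this, assuming $crawler is your PHPCrawler instance (as in the sketch above):

// Keep the URL queue in an SQLite database on disk instead of in RAM,
// so memory usage stays flat even on very large crawls.
$crawler->setUrlCacheType(PHPCrawlerUrlCacheTypes::URLCACHE_SQLITE);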
(as described here: http://phpcrawl.cuab.de/spidering_huge_websites.html).
By running your script from the command line and using this type of caching, it shouldn't be a problem to spider even very large websites.
Best regards!
Perfect - thank you very much for your help! :)
I didn't have SQLite installed, but managed to Google my way to a solution.
It's running like a charm now :-)
I have one last question regarding the CLI: I started the process via Terminal / SSH (I'm on a Mac), but my girlfriend shut down my computer, and that apparently stopped it from running.
How can I get it to run even when I'm not connected?
Thanks :))
Google to the rescue, answering my own question: just put nohup in front of the command and then kill the PID when you're done. :)
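For anyone finding this later, a rough sketch of what that looks like in practice (the log file name is just an example):

$ nohup php my_crawler.php > crawler.log 2>&1 &   # keeps running after the SSH session closes
$ ps aux | grep my_crawler.php                    # find the PID later
$ kill <PID>                                      # stop the crawler when you're done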