Hi, thanks for sharing your script and providing support.
I'm having a bit of a problem crawling some sites, such as http://www.brendansadventures.com, and larger blogs.
While smaller sites give a constant figure for links followed and documents received, when I crawl the site above, for example, the results vary widely.
Below are the results of crawls I submitted, all within one hour of each other:
Documents received: 353
Documents received: 282
Documents received: 1547
Documents received: 329
Documents received: 323
I've been trying to figure out why it does this, but to no avail. I've even tried turning off setPageLimit.
Do you have any idea why it exhibits this behavior?
My crawl settings:
$crawler = new MyCrawler();
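(The rest of the settings were cut off in the post above; a minimal sketch of what a typical phpcrawl 0.8x setup might look like is below. The target URL, content-type rule, and page limit are illustrative assumptions, not the poster's actual values.)

```php
<?php
// Hypothetical minimal setup, assuming phpcrawl 0.8x -- illustrative only.
require("libs/PHPCrawler.class.php");

class MyCrawler extends PHPCrawler
{
    function handleDocumentInfo(PHPCrawlerDocumentInfo $DocumentInfo)
    {
        // Print each page as it is received
        echo $DocumentInfo->url." (".$DocumentInfo->http_status_code.")\n";
    }
}

$crawler = new MyCrawler();
$crawler->setURL("http://www.brendansadventures.com");
$crawler->addContentTypeReceiveRule("#text/html#"); // only receive HTML pages
$crawler->setPageLimit(0);                          // 0 = no limit

$crawler->go();

$report = $crawler->getProcessReport();
echo "Documents received: ".$report->files_received."\n";
```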
I didn't test your setup so far, but did you try increasing the timeout values (connection-timeout and stream-timeout)?
And did you try to lower the number of processes the crawler should use?
Sometimes the hosting webserver is just too "weak" or busy to handle the number of requests the crawler sends and returns some "501"s; in that case, lowering the number of processes should help.
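The suggested knobs look roughly like this; the concrete values here are assumptions to tune against the target server, not recommendations from the library:

```php
// Illustrative values only -- adjust for the server being crawled.
$crawler->setConnectionTimeout(10); // seconds to wait when opening a connection
$crawler->setStreamTimeout(20);     // seconds to wait for data on an open stream
$crawler->goMultiProcessed(2);      // run with fewer parallel processes
```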
Let me know if this works for you.
OK will try this….
OK, I had a try, no luck. Reducing the number of simultaneous crawls caused the script to time out.
I also tried reducing setConnectionTimeout; this didn't have any noticeable effect.
Sorry for the typos
OK, I will take a deeper look later on.
Do you know how many pages the site http://www.brendansadventures.com contains all in all (and how many the crawler
should have received at the end)? At least 1547, I guess.
I've encountered some websites (and servers) that seem to have some kind of "webcrawler protection" and limit
the number of requests within a specified period from the same IP address, but then the number of received documents
shouldn't vary as much as it does in your case …
I'll let you know when I find out more.
What version of PHP and phpcrawl are you using btw?
Thanks for looking into this for me
I'm using PHP Version 5.3.14
I really want to use it to track the growth of a site, but on some occasions the values returned are not consistent.
Also, regarding $report->files_received.$lb; — am I correct in assuming that files_received does not include duplicates?
Hey, sorry for my late answer.
OK, I tested your setup on the site http://www.brendansadventures.com
The site is VERY slow from here, and by default I get a lot of "Socket-stream timed out" errors from it.
But when I increase the stream-timeout to 20 seconds ($crawler->setStreamTimeout(20)), everything works fine over here. I didn't get a single error anymore, and every page was received successfully (so far; the crawler is still running because the site really is slow, as I said. It's at around 1000 pages now).
Did you try to increase this value too? Maybe set it to 100 seconds or even more.
And if you want to know what the problem is when a page couldn't be received, just insert something like this in your handleDocumentInfo-method:
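(The snippet that followed appears to have been lost from the thread; a sketch of what it likely looked like, based on the properties phpcrawl's PHPCrawlerDocumentInfo class provides, is below.)

```php
function handleDocumentInfo(PHPCrawlerDocumentInfo $DocumentInfo)
{
    // If the document couldn't be received, print the URL and the reason
    if ($DocumentInfo->received == false)
    {
        echo "Error receiving ".$DocumentInfo->url.": ".$DocumentInfo->error_string."\n";
    }
}
```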
Hope this helps finding your problem.
… and yes, $report->files_received does NOT include duplicates; the crawler always receives a page/document only once.
thank you…. that is really helpful…. :-)