I am encountering a problem using the crawler where for some sites, I get loads of the 'content not received' error message. Here are a couple of examples and some stats:
Links followed: 4683, Content not received: 3823
Links followed: 2429, Content not received: 2164
I was wondering if anyone had any idea how to stop this happening. I have crawled other sites using the same script/server, and it's been fine. Also, if you could explain what usually triggers the message, i.e. what features of the script / server / website cause it to happen, that would be great too.
Did you try to increase the stream-timeout and connection-timeout? Some slow sites (or servers) don't respond within the default timeout settings; maybe that's the reason.
And did you take a look at the error code ($DocInfo->error)?
Also take a look at the FAQs (http://phpcrawl.cuab.de/faq.html, first point).
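For reference, here's a minimal sketch of raising both timeouts (assuming the PHPCrawl 0.8x API with its setConnectionTimeout() and setStreamTimeout() methods; the values and the MyCrawler subclass are just examples):

```php
<?php
// Sketch only: assumes the PHPCrawl library is available and that
// MyCrawler extends PHPCrawler (see the project examples).
include("libs/PHPCrawler.class.php");

$crawler = new MyCrawler();
$crawler->setURL("http://www.example.com/");

// Raise the timeouts (in seconds) well above the defaults;
// very slow servers may need even larger values.
$crawler->setConnectionTimeout(15); // time allowed to establish the connection
$crawler->setStreamTimeout(30);     // time allowed between incoming data packets

$crawler->go();
?>
```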
… sorry, it's $DocInfo->error_string, not $DocInfo->error.
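To actually see that string, you can log it from your handleDocumentInfo() override. A sketch, assuming the PHPCrawl 0.8x PHPCrawlerDocumentInfo object with its received and error_string properties:

```php
<?php
// Sketch only: handleDocumentInfo() is called once per requested page
// in PHPCrawl; MyCrawler is an example subclass name.
class MyCrawler extends PHPCrawler
{
  function handleDocumentInfo(PHPCrawlerDocumentInfo $DocInfo)
  {
    if ($DocInfo->received == false)
    {
      // error_string describes why the content was not received,
      // e.g. a socket or stream timeout.
      echo "Content not received from " . $DocInfo->url .
           ": " . $DocInfo->error_string . "\n";
    }
  }
}
?>
```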
"Did you try to increase the stream-timeout and connection-timout?"
Thanks, that did the trick.
Good to hear.
Maybe the default stream- and connection-timeouts should be increased in the next version.
Do you remember by how much you increased it?
Where can I find the crawled data on my system once crawling has finished?
I also can't find where those files are once I've finished crawling. Kindly help me with this.