I have been using PHPCrawl for a while now. Love it!
But, I get very strange results when I am crawling http://www.atlascopco.com/portableenergy/ .
The header is read fine, but the content comes back as rubbish, and hence it also does not find any URLs.
I do not get this problem when I crawl the site with e.g. HTTrack or load it in the browser.
Would really appreciate your help.
I will take a look at it soon; I didn't have time so far.
"But the content gives all rubbish" sounds like the content is gzip-encoded.
I will let you know when I know more.
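In the meantime, a quick way to check that theory would be something like the following (just plain PHP, nothing to do with PHPCrawl itself; the URL is the one from your post):

```php
<?php
// Quick check, independent of PHPCrawl, whether the server really
// answers with gzip-encoded content.
$url  = "http://www.atlascopco.com/portableenergy/";
$body = file_get_contents($url);

// file_get_contents() fills $http_response_header with the response headers.
foreach ($http_response_header as $header) {
    if (stripos($header, "Content-Encoding:") === 0) {
        echo $header . "\n";   // e.g. "Content-Encoding: gzip"
    }
}

// gzip streams start with the magic bytes 0x1f 0x8b.
if (substr($body, 0, 2) === "\x1f\x8b") {
    echo "Response body looks gzip-compressed\n";
}
```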
Damn. The monitoring did not work, so I saw this rather late. It still is really important to me.
I checked your suggestion and you're right: the reply is gzip-encoded.
I added "Accept-Encoding: identity", but it still replies gzip encoding.
Do you know of anything else I can try or do?
That would really save me…
Seems like the webserver is not handling requests correctly. If a client doesn't accept gzip-encoded
content (no "Accept-Encoding: gzip" in the request headers), it really shouldn't deliver the pages encoded. It's the first time I've heard of something like this.
And I'm really sorry, right now I don't have a solution for this: PHPCrawl doesn't support gzip-encoded content in version 0.81,
it's still on the list of feature requests for the next release (http://sourceforge.net/tracker/?func=detail&aid=3528545&group_id=89439&atid=590149).
As a temporary solution you could maybe "play around" with different header directives to force the server to answer unencoded (but I guess it always delivers encoded content regardless of the request header).
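Something along these lines could be used to test that outside of PHPCrawl (just a rough sketch with plain cURL; if the server still answers gzip-encoded, the body can be unpacked by hand):

```php
<?php
// Explicitly ask for an unencoded answer and, if the server ignores that,
// decode the gzip body manually. This only shows what the server does,
// it is not something PHPCrawl 0.81 does internally.
$url = "http://www.atlascopco.com/portableenergy/";

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HTTPHEADER, array("Accept-Encoding: identity"));
$body = curl_exec($ch);
curl_close($ch);

if (substr($body, 0, 2) === "\x1f\x8b") {
    // Server delivered gzip anyway; decode it by hand.
    // gzdecode() needs PHP >= 5.4; on older versions
    // gzinflate(substr($body, 10)) usually does the job.
    $body = gzdecode($body);
}

echo substr($body, 0, 200) . "\n"; // should now show readable HTML
```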
Thanks! That clarifies a lot!
Tough one though.
I'll call the owners tomorrow to see if they are willing/able to change the site.
Besides that, you have my vote for getting gzip support into the next release ASAP. I guess it makes sense for more and more websites to serve gzip only, e.g. to limit their server and infrastructure costs… and I guess all browsers can handle it…