Hi,
I have been using PHPCrawl for a while now. Love it!
But I get very strange results when I crawl http://www.atlascopco.com/portableenergy/ .
The header reads fine, but the content is all rubbish, so the crawler doesn't find any URLs either.
I don't have this problem when I crawl the site with e.g. HTTrack or when I load it in a browser.
Would really appreciate your help.
Geert.
Hi!
I will take a look at it soon; I haven't had time so far.
"The content is all rubbish" sounds like the content is gzip-encoded.
I will let you know once I know more.
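One quick way to verify that suspicion (just a sketch; the dump file name is a placeholder for wherever you saved the received body): gzip streams always begin with the two magic bytes 0x1f 0x8b.

<?php
// Sketch: identify gzip-compressed "rubbish" by its two magic bytes.
// 'received_body.bin' is a placeholder for a dump of the raw content
// the crawler received.
$raw = file_get_contents('received_body.bin');
echo (substr($raw, 0, 2) === "\x1f\x8b") ? "gzip-compressed\n" : "plain\n";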
Damn. The monitoring did not work, so I saw this rather late. It is still really important to me.
I checked your suggestion and you're right: the reply is gzip-encoded.
I added an "Accept-Encoding: identity" request header, but the server still replies with gzip-encoded content.
Do you know of anything else I can try or do?
That would really save me…
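For reference, this is roughly how I tested it (a sketch using plain PHP cURL rather than PHPCrawl itself; the URL is the one from my first post):

<?php
// Diagnostic sketch: request the page with "Accept-Encoding: identity"
// and print the Content-Encoding line from the response headers.
$ch = curl_init('http://www.atlascopco.com/portableenergy/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HEADER, true); // keep response headers in the output
curl_setopt($ch, CURLOPT_HTTPHEADER, array('Accept-Encoding: identity'));
$response = curl_exec($ch);
$headerSize = curl_getinfo($ch, CURLINFO_HEADER_SIZE);
curl_close($ch);

// If this still prints "Content-Encoding: gzip", the server ignores
// the Accept-Encoding request header.
foreach (explode("\r\n", substr($response, 0, $headerSize)) as $line) {
    if (stripos($line, 'Content-Encoding:') === 0) {
        echo $line, "\n";
    }
}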
Hi again,
it seems like the webserver is not handling requests correctly. If a client doesn't accept gzip-encoded content (no "Accept-Encoding: gzip" in the request), the server really shouldn't deliver the pages encoded. It's the first time I've heard of something like this.
And I'm really sorry, but right now I don't have a solution for this: phpcrawl doesn't support gzip-encoded content in version 0.81. It's still on the list of feature requests for the next release (http://sourceforge.net/tracker/?func=detail&aid=3528545&group_id=89439&atid=590149).
As a temporary solution you could "play around" with different header directives to try to force the server to answer unencoded (but I guess it always delivers encoded content regardless of the request headers).
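Another stopgap, if changing the server is not an option, would be to inflate the gzipped body yourself before extracting links. A minimal sketch (the helper function is my own, not part of phpcrawl; gzdecode() needs PHP >= 5.4, hence the gzinflate() fallback that skips the 10-byte gzip header):

<?php
// Sketch: detect a gzip body by its magic bytes and decompress it
// manually before any further processing (e.g. link extraction).
function decode_if_gzipped($body)
{
    if (substr($body, 0, 2) === "\x1f\x8b") {
        if (function_exists('gzdecode')) {
            return gzdecode($body); // PHP >= 5.4
        }
        return gzinflate(substr($body, 10)); // skip the gzip header
    }
    return $body; // body was already plain
}

// Usage example with a synthetic gzipped body:
echo decode_if_gzipped(gzencode('<html><body>hello</body></html>'));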
Thanks! That clarifies a lot!
Tough one though.
I'll call the owners tomorrow to see if they are willing/able to change the site.
Besides that, you have my vote for getting gzip support into the next release ASAP. It probably makes sense for more and more websites to serve gzip-encoded content only, e.g. to limit their server and infrastructure costs, and all browsers can handle it anyway…
Thanks.