Hi,
I have been using PHPCrawl for a while now. Love it!
But I get very strange results when I crawl http://www.atlascopco.com/portableenergy/ .
The header reads fine, but the content is all rubbish, so the crawler doesn't find any URLs either.
I don't have this problem when I crawl the site with e.g. HTTrack or when I load it in a browser.
Would really appreciate your help.
Geert.
Hi!
I will take a look at it soon; I haven't had time so far.
"The content is all rubbish" sounds like the content is gzip-encoded.
I will let you know once I know more.
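One quick way to verify that suspicion (just a sketch; the dump file name is a placeholder for wherever you saved the received body): gzip streams always begin with the two magic bytes 0x1f 0x8b.

<?php
// Sketch: identify gzip-compressed "rubbish" by its two magic bytes.
// 'received_body.bin' is a placeholder for a dump of the raw content
// the crawler received.
$raw = file_get_contents('received_body.bin');
echo (substr($raw, 0, 2) === "\x1f\x8b") ? "gzip-compressed\n" : "plain\n";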
Damn. The monitoring did not work, so I saw this rather late. It is still really important to me.
I checked your suggestion and you're right: the reply is gzip-encoded.
I added an "Accept-Encoding: identity" request header, but the server still replies with gzip-encoded content.
Do you know of anything else I can try or do?
That would really save me…
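For reference, this is roughly how I tested it (a sketch using plain PHP cURL rather than PHPCrawl itself; the URL is the one from my first post):

<?php
// Diagnostic sketch: request the page with "Accept-Encoding: identity"
// and print the Content-Encoding line from the response headers.
$ch = curl_init('http://www.atlascopco.com/portableenergy/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HEADER, true); // keep response headers in the output
curl_setopt($ch, CURLOPT_HTTPHEADER, array('Accept-Encoding: identity'));
$response = curl_exec($ch);
$headerSize = curl_getinfo($ch, CURLINFO_HEADER_SIZE);
curl_close($ch);

// If this still prints "Content-Encoding: gzip", the server ignores
// the Accept-Encoding request header.
foreach (explode("\r\n", substr($response, 0, $headerSize)) as $line) {
    if (stripos($line, 'Content-Encoding:') === 0) {
        echo $line, "\n";
    }
}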
Hi again,
it seems like the webserver is not handling requests correctly. If a client doesn't accept gzip-encoded content (no "Accept-Encoding: gzip" in the request), the server really shouldn't deliver the pages encoded. It's the first time I've heard of something like this.
And I'm really sorry, but right now I don't have a solution for this: phpcrawl doesn't support gzip-encoded content in version 0.81. It's still on the list of feature requests for the next release (http://sourceforge.net/tracker/?func=detail&aid=3528545&group_id=89439&atid=590149).
As a temporary solution you could "play around" with different header directives to try to force the server to answer unencoded (but I guess it always delivers encoded content regardless of the request headers).
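Another stopgap, if changing the server is not an option, would be to inflate the gzipped body yourself before extracting links. A minimal sketch (the helper function is my own, not part of phpcrawl; gzdecode() needs PHP >= 5.4, hence the gzinflate() fallback that skips the 10-byte gzip header):

<?php
// Sketch: detect a gzip body by its magic bytes and decompress it
// manually before any further processing (e.g. link extraction).
function decode_if_gzipped($body)
{
    if (substr($body, 0, 2) === "\x1f\x8b") {
        if (function_exists('gzdecode')) {
            return gzdecode($body); // PHP >= 5.4
        }
        return gzinflate(substr($body, 10)); // skip the gzip header
    }
    return $body; // body was already plain
}

// Usage example with a synthetic gzipped body:
echo decode_if_gzipped(gzencode('<html><body>hello</body></html>'));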
Thanks! That clarifies a lot!
Tough one though.
I'll call the owners tomorrow to see if they are willing/able to change the site.
Besides that, you have my vote for getting gzip support into the next release ASAP. It probably makes sense for more and more websites to serve gzip-encoded content only, e.g. to limit their server and infrastructure costs, and all browsers can handle it anyway…
Thanks.