Problem with certain website

Help
GeertVW
2013-02-02
2013-04-09
  • GeertVW

    GeertVW - 2013-02-02

    Hi,

    I have been using PHPCrawl for a while now. Love it!
    But I get very strange results when crawling http://www.atlascopco.com/portableenergy/ .
    The header reads fine, but the content gives all rubbish, and hence the crawler also does not find any URLs.
    I do not have this problem when I crawl that site with e.g. HTTrack or load it in the browser.

    Would really appreciate your help.

    Geert.

     
  • Uwe Hunfeld

    Uwe Hunfeld - 2013-02-05

    Hi!

    I will take a look at it soon; I haven't had time so far.
    "But the content gives all rubbish" sounds like the content is gzip-encoded.

    I will let you know when I know more.
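
One quick way to check the gzip suspicion is to look at the first two bytes of the raw response body: gzip streams start with the magic number 0x1f 0x8b. The following is a minimal sketch in Python, used here only for illustration; it is not PHPCrawl code, and the function name `looks_gzipped` is hypothetical.

```python
import gzip

# Every gzip stream begins with these two bytes (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


def looks_gzipped(body: bytes) -> bool:
    """Return True if the raw body starts with the gzip magic number."""
    return body[:2] == GZIP_MAGIC


if __name__ == "__main__":
    # Stand-in for a compressed response body; a real check would use
    # the bytes actually received from the server.
    sample = gzip.compress(b"<html><a href='/page'>link</a></html>")
    print(looks_gzipped(sample))
    print(looks_gzipped(b"<html></html>"))
```

If the check is positive while the crawler treats the body as plain HTML, that would explain both the "rubbish" content and the missing URLs.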

     
  • GeertVW

    GeertVW - 2013-02-12

    Damn. The monitoring did not work, so I saw this rather late. It still is really important to me.
    I checked your suggestion and you're right: the reply is gzip-encoded.
    I added "Accept-Encoding: identity" to the request, but the server still replies with gzip-encoded content.
    Do you know of anything else I can try or do?
    That would really save me…

     
  • Nobody/Anonymous

    Hi again,

    It seems the webserver is not handling requests correctly. If a client doesn't accept gzip-encoded
    content (Accept-Encoding: gzip …), it really shouldn't deliver the pages encoded. This is the first time I have heard of something like this.

    And I'm really sorry, right now I don't have a solution for this: PHPCrawl doesn't support gzip-encoded content in version 0.81.
    It's still on the list of feature requests for the next release (http://sourceforge.net/tracker/?func=detail&aid=3528545&group_id=89439&atid=590149).

    As a temporary solution, you could maybe "play around" with different header directives to force the server to answer unencoded (but I guess it always delivers encoded content regardless of the request header).
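
If the server really does ignore the request headers, the remaining workaround is to decompress the body on the client side. The sketch below shows the idea in Python for illustration only. It is not part of PHPCrawl 0.81, which per the post above has no gzip support, and the helper name `decode_body` and the header dictionary shape are assumptions.

```python
import gzip
import zlib


def decode_body(headers: dict, body: bytes) -> bytes:
    """Return the body decompressed if it is gzip-encoded, else unchanged.

    `headers` is assumed to map response header names to values.
    """
    encoding = ""
    for name, value in headers.items():
        if name.lower() == "content-encoding":
            encoding = value.strip().lower()

    # Trust the header, but also fall back to the gzip magic number in
    # case the server sends compressed data without declaring it.
    if encoding in ("gzip", "x-gzip") or body[:2] == b"\x1f\x8b":
        try:
            return gzip.decompress(body)
        except (OSError, EOFError, zlib.error):
            # Not valid gzip after all; hand back the raw bytes.
            return body
    return body


if __name__ == "__main__":
    html = b"<html><a href='/page'>link</a></html>"
    compressed = gzip.compress(html)
    print(decode_body({"Content-Encoding": "gzip"}, compressed) == html)
    print(decode_body({}, html) == html)
```

The decoded bytes can then be handed to whatever link extraction the crawler uses, so URL discovery works even when the server insists on gzip.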

     
  • GeertVW

    GeertVW - 2013-02-13

    Thanks! That clarifies a lot!
    Tough one though.
    I'll call the owners tomorrow to see if they are willing/able to change the site.
    Besides that, you have my vote to get it into the next release ASAP. I guess it makes sense for more and more websites to serve only gzip, e.g. to limit their server and infrastructure costs… I guess all browsers can handle this…
    Thanks.

     
