Problem with certain website

  • GeertVW

    GeertVW - 2013-02-02


    I have been using PHPCrawl for a while now. Love it!
    But, I get very strange results when I am crawling .
    The header reads fine, but the content gives all rubbish, and hence also does not find any URLs.
    I do not get a problem when I crawl that site with e.g. HTTrack or load it in the browser.

    Would really appreciate your help.


  • Uwe Hunfeld

    Uwe Hunfeld - 2013-02-05


    I will take a look at it soon, didn't have time so far.
    "But the content gives all rubbish" sounds like the content is gzip-encoded.

    I will let you know once I know more.

  • GeertVW

    GeertVW - 2013-02-12

    Damn. The monitoring did not work, so I saw this rather late. It still is really important to me.
    I checked your suggestion and you're right: the reply is gzip-encoded.
    I added "Accept-Encoding: identity", but the server still replies with gzip-encoded content.
    Do you know of anything else I can try or do?
    That would really save me…

  • Nobody/Anonymous

    Hi again,

    It seems like the web server is not handling requests correctly. If a client doesn't accept gzip-encoded
    content (Accept-Encoding: gzip …), it really shouldn't deliver the pages encoded. This is the first time I've heard of something like this.

    And I'm really sorry, right now I don't have a solution for this: phpcrawl doesn't support gzip-encoded content in version 0.81,
    it's still on the list of feature-requests for the next release (

    As a temporary workaround, you could try "playing around" with different request headers to force the server to answer unencoded (but I guess it always delivers encoded content regardless of the request header).
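
    If the server keeps sending gzip no matter what the request says, another option is to decode the body yourself before parsing it. Below is a minimal sketch of that idea, written in Python purely for illustration; it is not part of phpcrawl and says nothing about phpcrawl's internals. The function name decode_body and its parameters are made up for this example. It treats a body as gzip-encoded if the Content-Encoding header says so or if the data starts with the gzip magic bytes (0x1f 0x8b), and returns the data unchanged if decompression fails.

```python
import gzip
import zlib

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def decode_body(body: bytes, content_encoding: str = "") -> bytes:
    """Return the decompressed body if it is gzip-encoded, else the body unchanged.

    `decode_body` is a hypothetical helper for this example only.
    `content_encoding` is the value of the response's Content-Encoding header,
    or an empty string if the header is missing.
    """
    encoding = content_encoding.strip().lower()
    looks_gzipped = body[:2] == GZIP_MAGIC

    if encoding in ("gzip", "x-gzip") or looks_gzipped:
        try:
            return gzip.decompress(body)
        except (OSError, EOFError, zlib.error):
            # Header claimed gzip but the data is not valid gzip:
            # fall back to the raw bytes instead of failing.
            return body
    return body


if __name__ == "__main__":
    page = b"<html><a href='/next'>next</a></html>"
    compressed = gzip.compress(page)
    # Both calls give back the original HTML, ready for link extraction.
    print(decode_body(compressed, "gzip") == page)
    print(decode_body(page) == page)
```

    In PHP, gzdecode() (available since PHP 5.4) does the same kind of decompression, so a similar check could be applied to a response fetched outside the crawler.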

  • GeertVW

    GeertVW - 2013-02-13

    Thanks! That clarifies a lot!
    Tough one though.
    I'll call the owners tomorrow to see if they are willing/able to change the site.
    Besides that, you have my vote to have it in the release ASAP. I guess it makes sense for more and more websites to serve only gzip, e.g. to limit their server and infrastructure costs… I guess all browsers can handle this…


