#27 Endless loop on calendar


PHPCrawl keeps going through a calendar without end.
It seems that this app doesn't handle codes like 301, 302, 303 or a max-redirects setting, which makes it quite useless.



  • Uwe Hunfeld


    Thanks for the report!
    phpcrawl DOES handle 30x codes. What do you mean by "max redirect settings"?

    Could you please explain the problem in a little more detail?

    Thanks a lot!

  • By "max redirects" I meant a counter: if the crawler gets a 30x code x times, it should remove the current URL from the queue (because of a possible loop) and go on with the other URLs.
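
The counter described above can be sketched roughly as follows. This is a generic Python illustration, not PHPCrawl code: the `resolve` function, the `fetch` callable and the default of 5 are assumptions made for the example.

```python
from typing import Callable, Optional, Tuple

# Hypothetical fetcher: takes a URL, returns (status code, Location header or None).
Fetch = Callable[[str], Tuple[int, Optional[str]]]

REDIRECT_CODES = {301, 302, 303, 307, 308}


def resolve(url: str, fetch: Fetch, max_redirects: int = 5) -> Optional[str]:
    """Follow redirects starting at url.

    Returns the final URL, or None when the chain needs more than
    max_redirects hops, in which case the caller drops the URL from
    the queue and moves on.
    """
    for _ in range(max_redirects + 1):
        status, location = fetch(url)
        if status not in REDIRECT_CODES or location is None:
            return url  # not a redirect: this is the final URL
        url = location  # follow one hop
    return None  # redirect budget used up, possible loop


# Usage with a fake server that redirects in a circle.
loop = {"/x": (303, "/y"), "/y": (303, "/x")}
print(resolve("/x", lambda u: loop[u]))  # None
```

Note that this only bounds a single redirect chain; it does nothing about a page that keeps generating new, distinct links.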

  • Uwe Hunfeld

    Hi again,

    the crawler doesn't visit the (exact) same URL twice, so this should not be the problem.
    There must be another reason why the crawler hangs "in a loop".
    Could you send me the direct URL that's causing the problem (or parts of a logfile or something else that shows the problem)?
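
To illustrate why skipping exact duplicates does not stop a calendar from producing work, here is a small Python sketch. It is not PHPCrawl's implementation; the `new_urls` helper and the example.com calendar URLs are made up for the illustration.

```python
from urllib.parse import urlsplit


def new_urls(candidates, visited):
    """Return the candidates not seen before (exact string match) and record them."""
    fresh = []
    for url in candidates:
        if url not in visited:
            visited.add(url)
            fresh.append(url)
    return fresh


visited = set()
# Hypothetical calendar links: same script, different month in the query string.
links = [f"http://example.com/calendar.php?month={m}&year=2031" for m in range(1, 13)]

first = new_urls(links, visited)   # all 12 are new
second = new_urls(links, visited)  # nothing new the second time
paths = {urlsplit(u).path for u in first}

print(len(first), len(second), paths)  # 12 0 {'/calendar.php'}
```

Every month link is a different string, so each one passes the duplicate check even though all of them point at the same script. A calendar that always links to a "next month" therefore never runs out of unseen URLs.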


  • No, I did not hit a real loop. The spider worked as in your test.
    OK, in that case this is not a bug but an implementation error that wastes resources on both sides (the crawler and the crawled web page), which in the end is some kind of bug :)

    Imagine spidering a very small web page (100 URLs) with a "buggy" calendar. Instead of requesting 100 links, the spider requests 100000. Is that normal?

    OK, how would you fix this?

  • Uwe Hunfeld

    Maybe you could exclude the whole calendar (or parts of it) from the crawling process, if you don't need its content.
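
Excluding a section of a site usually comes down to a pattern check before a URL is queued. The Python sketch below shows the idea only; it does not use PHPCrawl's own filter API, and the pattern, the `allowed` helper and the URLs are assumptions for the example.

```python
import re

# Hypothetical exclusion patterns; adjust them to the site being crawled.
EXCLUDE = [re.compile(r"/calendar(/|\.php|\?|$)", re.IGNORECASE)]


def allowed(url: str) -> bool:
    """True if no exclusion pattern matches anywhere in the URL."""
    return not any(pattern.search(url) for pattern in EXCLUDE)


queue = [
    "http://example.com/news/1",
    "http://example.com/calendar.php?month=3",
    "http://example.com/calendar/2031/04",
]
to_crawl = [u for u in queue if allowed(u)]
print(to_crawl)  # ['http://example.com/news/1']
```

This is a per-site fix: the pattern has to be written by someone who already knows where the calendar lives.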

  • I'm interested in a global solution, not one only for this page or this type of calendar.

    Thanks for your time.
    This bug can be closed as it is not considered a bug.

  • Uwe Hunfeld

    I have no idea for a solution right now.
    How should the crawler know that the page contains an endless number of links?
    It could just be a very huge website.

    Do you have an idea in mind?

  • Hi,

    In this case there were a lot of 303 responses from the server, so I thought about a counter that counts such responses. Once it reaches a value specified in the config file, the crawler skips all remaining URLs on the current page, jumps one level up and continues working.
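
One possible reading of this proposal, as a Python sketch: count the redirect responses received for the links of one page and abandon the rest of that page once a limit is hit. The function name, the `status_of` callable and the default limit of 3 are hypothetical; nothing here is an existing PHPCrawl setting.

```python
from typing import Callable, Iterable, List

REDIRECT_CODES = {301, 302, 303, 307, 308}


def crawl_page_links(links: Iterable[str],
                     status_of: Callable[[str], int],
                     max_redirect_responses: int = 3) -> List[str]:
    """Request the links found on one page, stopping early once too many
    of them answered with a redirect. Returns the links actually requested."""
    requested = []
    redirects = 0
    for url in links:
        if redirects >= max_redirect_responses:
            break  # give up on this page; the caller continues one level up
        requested.append(url)
        if status_of(url) in REDIRECT_CODES:
            redirects += 1
    return requested


# Usage: a page whose links all answer with 303.
links = [f"/cal?day={d}" for d in range(1, 8)]
print(crawl_page_links(links, lambda u: 303))  # ['/cal?day=1', '/cal?day=2', '/cal?day=3']
```

The limit is a heuristic. As noted earlier in the thread, a large legitimate site with many redirects would be cut short by the same rule, so the value would need to be configurable.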



