#27 Endless loop on calendar

Status: closed
Owner: nobody
Labels: None
Priority: 5
Updated: 2012-09-11
Created: 2012-09-10
Creator: Anonymous
Private: No

PHPCrawl keeps crawling through the calendar without ever finishing.
It seems that this app doesn't handle codes like 301, 302 and 303 or a max-redirects setting, which makes it quite useless.

$crawler->setURL("http://www.ul.edu.pl/calendar/view.php?view=month&cal_d=1&cal_m=10&cal_y=2012");

Discussion

  • Uwe Hunfeld
    2012-09-10

    Hi!

    Thanks for the report!
    phpcrawl DOES handle 30x codes, and what do you mean by "max redirect settings"?

    Could you please explain the problem in a little more detail?

    Thanks a lot!

     
  • By "max redirects" I meant a counter: if the crawler gets a 30x code x times, it should remove the current URL from the queue (because of a possible loop) and go on with the other URLs.

     
  • Uwe Hunfeld
    2012-09-10

    Hi again,

    The crawler doesn't visit the (exact) same URL twice, so this should not be the problem.
    There must be another reason why the crawler hangs "in a loop".
    Could you send me the direct URL that's causing the problem (or parts of a logfile or something else that shows the problem)?

    Thanks!

     
  • No, I did not hit a real loop. The spider worked as in your test.
    OK, in that case this is not a bug but an implementation error that wastes resources on both sides (the crawler and the crawled web page), which in the end is some kind of a bug :)

    Imagine spidering a very small website (100 URLs) with a "buggy" calendar. Instead of requesting 100 links, the spider requests 100000. Is that normal?

    OK, how would you fix this?

     
  • Uwe Hunfeld
    2012-09-10

    Maybe you could exclude the whole calendar (or parts of it) from the crawling process, if you don't need its content.
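
The exclusion idea can be illustrated with a small, library-independent sketch. It is written in Python and does not use PHPCrawl's own API; the names `EXCLUDE_PATTERNS`, `should_crawl` and `filter_links` are hypothetical, and the pattern is only an assumption derived from the calendar URL in the report.

```python
import re

# Hypothetical exclusion rules. The calendar path is taken from the URL in
# the report; a real site may need different or additional patterns.
EXCLUDE_PATTERNS = [re.compile(r"/calendar/view\.php")]


def should_crawl(url, exclude_patterns=EXCLUDE_PATTERNS):
    """Return False if the URL matches any of the exclusion patterns."""
    return not any(pattern.search(url) for pattern in exclude_patterns)


def filter_links(links, exclude_patterns=EXCLUDE_PATTERNS):
    """Keep only the links that the crawler is still allowed to follow."""
    return [url for url in links if should_crawl(url, exclude_patterns)]
```

Applied to every list of links found on a page, such a filter would keep the calendar views out of the queue entirely, at the cost of never indexing any calendar content.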

     
  • I'm interested in a global solution, not one only for this page or this type of calendar.

    Thanks for your time.
    This bug can be closed, as it is not considered a bug.

     
  • Uwe Hunfeld
    2012-09-11

    I have no idea for a solution right now.
    How should the crawler know that the page contains an endless number of links?
    It could just be a very large website.

    Do you have an idea in mind?

     
  • Hi,

    In this case there were a lot of 303 responses from the server, so I thought about a counter that counts these responses. Once it reaches a value specified in the config file, the crawler skips all URLs on the current page and jumps one level up to continue its work.
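
A minimal sketch of such a counter, written in Python and not tied to PHPCrawl's internals. Everything here is an assumption made for illustration: `fetch` is a hypothetical callable returning `(status_code, links)`, `max_redirects_per_page` stands in for the config value, and redirect targets are counted against the page on which the redirecting link was originally found.

```python
from collections import defaultdict, deque

REDIRECT_CODES = {301, 302, 303}


def crawl(start_url, fetch, max_redirects_per_page=3):
    """Breadth-first crawl with a redirect budget per source page.

    fetch(url) is a hypothetical callable returning (status_code, links).
    Once the links found on one page have produced max_redirects_per_page
    redirect responses, the remaining links from that page are skipped.
    """
    # Each queue entry is (url, page whose redirect budget it counts against).
    queue = deque([(start_url, start_url)])
    seen = {start_url}
    redirects = defaultdict(int)
    visited = []
    while queue:
        url, source = queue.popleft()
        if redirects[source] >= max_redirects_per_page:
            continue  # budget used up: drop the rest of this page's links
        status, links = fetch(url)
        visited.append(url)
        if status in REDIRECT_CODES:
            redirects[source] += 1
            owner = source  # redirect targets count against the same page
        else:
            owner = url  # links on a normal page get that page's own budget
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append((link, owner))
    return visited
```

One trade-off of this design is that, once the budget is exhausted, ordinary links from the same page that are still waiting in the queue are dropped as well, which matches the "skip all URLs on the current page" idea but can lose legitimate content.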

     

