#27 Endless loop on calendar

Status: closed
Owner: nobody
Labels: None
Priority: 5
Updated: 2012-09-11
Created: 2012-09-10
Creator: Anonymous
Private: No

PHPCrawl keeps crawling through the calendar.
It seems that this app doesn't handle status codes like 301, 302 and 303, or a max-redirects setting, which makes it quite useless.

$crawler->setURL("http://www.ul.edu.pl/calendar/view.php?view=month&cal_d=1&cal_m=10&cal_y=2012");

Discussion

  • Uwe Hunfeld

    Uwe Hunfeld - 2012-09-10

    Hi!

    Thanks for the report!
    phpcrawl DOES handle 30x codes. What do you mean by "max redirect settings"?

    Could you please explain the problem in a little more detail?

    Thanks a lot!

     
  • Nobody/Anonymous

    By "max redirects" I meant a counter: if the crawler gets a 30x code x times, it should remove the current URL from the queue (because of a possible loop) and go on with the other URLs.

     
  • Uwe Hunfeld

    Uwe Hunfeld - 2012-09-10

    Hi again,

    the crawler doesn't visit the (exact) same URL twice, so this should not be the problem.
    There must be another reason why the crawler hangs "in a loop".
    Could you send me the direct URL that's causing the problem (or parts of a logfile, or something else that shows the problem)?

    Thanks!

     
  • Nobody/Anonymous

    This one seems to be better:
    http://gminalochow.pl/szkolenia/

     
  • Nobody/Anonymous

    No, I did not hit a real loop. The spider worked as in your test.
    OK, in that case this is not a bug but an implementation error that wastes resources on both sides (the crawler and the crawled web page), which in the end is some kind of bug :)

    Imagine spidering a very small website (100 URLs) with a "buggy" calendar. Instead of requesting 100 links, the spider requests 100000. Is that normal?

    OK, how would you fix this?

     
  • Uwe Hunfeld

    Uwe Hunfeld - 2012-09-10

    Maybe you could exclude the whole calendar, or parts of it, from the crawling process (if you don't need its content).
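
The exclusion idea above can be sketched in plain PHP. This is only an illustration, not PHPCrawl code: `isExcludedUrl()`, the `/calendar/` pattern and the second sample URL are assumptions made up for the example. PHPCrawl's own documentation describes a URL filter rule method (`addURLFilterRule()`) for the same purpose; check the version you run before relying on it.

```php
<?php
// Hypothetical helper, not part of PHPCrawl: returns true when the URL
// matches at least one of the given PCRE exclusion patterns.
function isExcludedUrl(string $url, array $excludePatterns): bool
{
    foreach ($excludePatterns as $pattern) {
        if (preg_match($pattern, $url) === 1) {
            return true;
        }
    }
    return false;
}

// Assumed rule: skip everything below a /calendar/ path segment.
$excludePatterns = ['#/calendar/#i'];

// Links "found" on a page; the second URL is a made-up non-calendar page.
$found = [
    'http://www.ul.edu.pl/calendar/view.php?view=month&cal_d=1&cal_m=10&cal_y=2012',
    'http://www.ul.edu.pl/',
];

// Keep only the URLs that no exclusion pattern matches.
$queue = array_values(array_filter(
    $found,
    function (string $url) use ($excludePatterns): bool {
        return !isExcludedUrl($url, $excludePatterns);
    }
));

print_r($queue);
```

A rule like this only helps for calendars whose URLs you already know, which is why it is a per-site workaround rather than a general fix.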

     
  • Nobody/Anonymous

    I'm interested in a global solution, not only one for this page or this type of calendar.

    Thanks for your time.
    This ticket can be closed, as it is not considered a bug.

     
  • Uwe Hunfeld

    Uwe Hunfeld - 2012-09-11
    • status: open --> closed
     
  • Uwe Hunfeld

    Uwe Hunfeld - 2012-09-11

    I have no idea for a solution right now.
    How should the crawler know that the page contains an endless number of links?
    It could just be a very large website.

    Do you have an idea in mind?

     
  • Nobody/Anonymous

    Hi,

    In this case there were a lot of 303 responses from the server, so I thought about a counter that counts such responses. Once it reaches a value specified in the config file, the crawler skips all URLs on the current page, jumps one level up, and continues working.
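
A minimal sketch of the counter proposed above, in plain PHP. Nothing here is PHPCrawl API: the class name `RedirectBudget`, its methods, the set of status codes counted as redirects, and the sample status sequence are all assumptions for illustration. It counts redirect responses per referring page and reports when the remaining links of that page should be skipped.

```php
<?php
// Hypothetical sketch, not PHPCrawl code: tracks how many redirect
// responses the links of one referring page have produced.
class RedirectBudget
{
    /** @var int maximum number of redirects tolerated per referring page */
    private $maxRedirects;

    /** @var array<string,int> redirect count keyed by referring page URL */
    private $counts = [];

    public function __construct(int $maxRedirects)
    {
        $this->maxRedirects = $maxRedirects;
    }

    // Record one response for a link found on $referer.
    public function record(string $referer, int $statusCode): void
    {
        // Assumed set of status codes that count as redirects.
        if (in_array($statusCode, [301, 302, 303, 307, 308], true)) {
            $this->counts[$referer] = ($this->counts[$referer] ?? 0) + 1;
        }
    }

    // True once the page has used up its redirect budget.
    public function isExhausted(string $referer): bool
    {
        return ($this->counts[$referer] ?? 0) >= $this->maxRedirects;
    }
}

// Usage with made-up response codes for links found on one page.
$budget = new RedirectBudget(3);
$page = 'http://gminalochow.pl/szkolenia/';
$statuses = [200, 303, 303, 303, 200];

$followed = 0;
foreach ($statuses as $status) {
    if ($budget->isExhausted($page)) {
        break; // skip the remaining links of this page
    }
    $budget->record($page, $status);
    $followed++;
}

echo $followed, "\n";
```

Counting per referring page rather than globally is one possible reading of "jumps one level up"; a real crawler would also have to decide whether a large but legitimate site with many redirects should be cut off the same way.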

     
     