Endless loop on calendar
Status: Beta
Brought to you by:
huni
PHPCrawl keeps crawling through a calendar without ever finishing.
It seems that this app doesn't handle codes like 301, 302, 303 or a max-redirects setting, which makes it quite useless.
$crawler->setURL("http://www.ul.edu.pl/calendar/view.php?view=month&cal_d=1&cal_m=10&cal_y=2012");
Anonymous
Hi!
Thanks for the report!
phpcrawl DOES handle 30x codes, and what do you mean by "max redirect settings"?
Could you please explain the problem in a little more detail?
Thanks a lot!
By "max redirects" I meant a counter: if the crawler gets a 30x code x times, it should remove the current URL from the queue (because of a possible loop) and go on with the other URLs.
Hi again,
the crawler doesn't visit the (exact) same URL twice, so this should not be the problem.
There must be another reason why the crawler hangs "in a loop".
Could you send me the direct URL that's causing the problem (or parts of a logfile, or something else that shows the problem)?
Thanks!
I sent the URL in the bug report:
http://www.ul.edu.pl/calendar/view.php?view=month&cal_d=1&cal_m=10&cal_y=2012
This one seems to be better:
http://gminalochow.pl/szkolenia/
OK, I see.
The crawler is following a lot of URLs like:
http://gminalochow.pl/szkolenia/calendar/set.php?return=aHR0cDovL2dtaW5hbG9jaG93LnBsL3N6a29sZW5pYS9jYWxlbmRhci92aWV3LnBocD92aWV3PW1vbnRoJmNhbF9kPTEmY2FsX209MSZjYWxfeT0yMDE5JmNvdXJzZT0x&sesskey=daFplObkEG&var=showcourses
But all these links are really present on the website.
It doesn't look like a loop to me. It seems that the website (the calendar) just contains an endless number of different links because it reaches too far into the future.
I just did a short test and let the crawler run for a few minutes on that site, and I reached the year 2019 (links like http://gminalochow.pl/szkolenia/calendar/view.php?view=month&course=1&cal_d=1&cal_m=2&cal_y=2019).
So if the site contains an endless number of links, the crawler will spider it forever.
I think this is not a bug.
Or did you really get trapped in a "real" loop after crawling the site for a longer time?
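To make the point about "exact same URL" concrete, here is a minimal sketch. It does not use PHPCrawl itself; `calendarUrl()` and `markVisited()` are hypothetical helpers written only for this illustration, and the URL pattern is copied from the links seen in the test above. It shows that exact-match deduplication cannot stop such a calendar, because every month produces a new, distinct URL string.

```php
<?php
// Hypothetical helper: builds a month-view link in the form seen on the site.
function calendarUrl(int $month, int $year): string
{
    return "http://gminalochow.pl/szkolenia/calendar/view.php"
         . "?view=month&course=1&cal_d=1&cal_m={$month}&cal_y={$year}";
}

// Hypothetical helper: returns true if the URL is new, false if this exact
// string has been seen before (the usual "visited set" of a crawler).
function markVisited(array &$visited, string $url): bool
{
    if (isset($visited[$url])) {
        return false;
    }
    $visited[$url] = true;
    return true;
}

$visited = [];
$newUrls = 0;

// Simulate following the "next month" link for two years.
for ($i = 0; $i < 24; $i++) {
    $month = ($i % 12) + 1;
    $year  = 2012 + intdiv($i, 12);
    if (markVisited($visited, calendarUrl($month, $year))) {
        $newUrls++;
    }
}

// Revisiting an already seen month is correctly rejected.
$repeatIsNew = markVisited($visited, calendarUrl(1, 2012));

echo $newUrls, "\n";      // 24: every month was a new URL
var_dump($repeatIsNew);   // bool(false)
```

The visited set works as intended; the site simply never runs out of new months to link to.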
No, I did not hit a real loop. The spider worked as in your test.
OK, in that case this is not a bug but an implementation problem that wastes resources on both sides (the crawler and the crawled web page), which in the end is some kind of a bug :)
Imagine spidering a very small web page (100 URLs) with a "buggy" calendar. Instead of requesting 100 links, the spider requests 100000. Is that normal?
OK, how would you fix this?
You could maybe exclude the whole calendar (or parts of it) from the crawling process if you don't need its content.
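A minimal sketch of such an exclusion, written as a standalone function so it runs without the library. `shouldCrawl()` and the `/calendar/` pattern are assumptions made for this example. PHPCrawl 0.8x documents an `addURLFilterRule()` method that takes a regular expression of this kind, but please check the class reference of your version before relying on it.

```php
<?php
// Assumed pattern: any URL whose path contains "/calendar/" is skipped.
const CALENDAR_PATTERN = '#/calendar/#i';

// Hypothetical helper: returns false if the URL matches any exclude pattern.
function shouldCrawl(string $url, array $excludePatterns): bool
{
    foreach ($excludePatterns as $pattern) {
        if (preg_match($pattern, $url) === 1) {
            return false;
        }
    }
    return true;
}

$rules = [CALENDAR_PATTERN];

var_dump(shouldCrawl("http://gminalochow.pl/szkolenia/", $rules));
// bool(true)
var_dump(shouldCrawl(
    "http://gminalochow.pl/szkolenia/calendar/view.php?view=month&cal_d=1&cal_m=2&cal_y=2019",
    $rules
));
// bool(false)
```

This only helps for sites where you know the calendar path in advance, so it is a per-site workaround rather than a general fix.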
I'm interested in a global solution, not one that works only for this page or this type of calendar.
Thanks for your time.
This bug can be closed as it is not considered a bug.
I have no idea for a solution right now.
How should the crawler know that the page contains an endless number of links?
It could just be a very large website.
Do you have an idea in mind?
Hi,
In this case there were a lot of 303 responses from the server, so I thought about a counter that counts those responses. Once it reaches a value specified in the config file, the crawler skips all remaining URLs on the current page and jumps one level up to continue its work.
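A rough sketch of that counter idea, only to show the logic. `RedirectBudget` is a hypothetical class, not something PHPCrawl provides, and counting per referring page is my assumption about what "skips all URLs on the current page" should mean.

```php
<?php
// Hypothetical class: tracks how many 3xx responses the links found on one
// page have produced, and tells the caller when that page's budget is used up.
final class RedirectBudget
{
    /** @var array<string,int> redirect count per referring page */
    private $counts = [];

    /** @var int configured maximum number of redirects per page */
    private $maxRedirects;

    public function __construct(int $maxRedirects)
    {
        $this->maxRedirects = $maxRedirects;
    }

    // Record the response to a link that was found on $referrer.
    // Returns false once the remaining links of that page should be skipped.
    public function record(string $referrer, int $statusCode): bool
    {
        if ($statusCode >= 300 && $statusCode < 400) {
            $this->counts[$referrer] = ($this->counts[$referrer] ?? 0) + 1;
        }
        return $this->allows($referrer);
    }

    // True while the page is still below its redirect limit.
    public function allows(string $referrer): bool
    {
        return ($this->counts[$referrer] ?? 0) < $this->maxRedirects;
    }
}

$budget = new RedirectBudget(3);
$page   = "http://gminalochow.pl/szkolenia/";

var_dump($budget->record($page, 200)); // bool(true), not a redirect
var_dump($budget->record($page, 303)); // bool(true), 1 of 3
var_dump($budget->record($page, 303)); // bool(true), 2 of 3
var_dump($budget->record($page, 303)); // bool(false), limit reached
```

A limit like this would also cut off legitimate pages that redirect a lot, so the threshold would have to be configurable and chosen with care.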