Hello!
Merry Christmas and a Happy New Year! Thanks for your work!
I have a question: what would you recommend?
Situation:
The HTML has malformed links in a subfolder, e.g. both "/news/" and "news" on the same page.
The server has a special config, so it returns code 200 for both of these links. So the crawler finds domain.com/news/, fetches it, finds the same malformed "/news/" and "news" links on that page, sends a request to domain.com/news/news/, gets a 200 and the same malformed links again. So the URL can grow like domain.com/news/news/news/news/news/... and it never ends.
I hope that's clear. The question is how to avoid this trap.
Example: http://bfit.kiev.ua/news/sportzal.php
If you click the button under the info block a few times, you'll end up on a /news/news/news/ page.
Hi, and a happy and healthy New Year to you too!
I understand the problem you described; there was a very similar problem here in the forum before.
The problem is: how should the crawler know that all these links are exactly the same page?
It could know, but only AFTER it has requested the page (e.g. via an MD5 hash of the content), and by then it's too late.
So right now I don't have a solution for this, I'm sorry.
Do you have any idea or approach?
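Just to illustrate the "only AFTER it requested the page" point, here is a minimal sketch of content-hash deduplication (the names `seen_hashes` and `is_duplicate` are my own, not anything from the crawler's actual code). It can stop the loop from growing further, but only once the duplicate page has already been fetched:

```python
import hashlib

# Hypothetical sketch: remember an MD5 digest of every fetched body and
# skip pages whose content we have already seen. Note this only helps
# AFTER the (possibly redundant) request has been made.
seen_hashes = set()

def is_duplicate(body: bytes) -> bool:
    digest = hashlib.md5(body).hexdigest()
    if digest in seen_hashes:
        return True          # same content as a previously fetched page
    seen_hashes.add(digest)
    return False             # first time we see this content
```

So domain.com/news/news/ would still be requested once, but its links would not be followed after the content is recognized as a duplicate of domain.com/news/.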
Thanks. Well, I think it could be done in a few steps.
The first step is analyzing the URL: if the analyzer catches looping URL parts, it can run a check on the page after fetching its content. If the content is the same, further looping URLs should be filtered out.
Something like that, but I think it's quite difficult.
There may also be difficulties with the content comparison: looping pages may have different content because of different relative paths (CSS, JS, images), or the depth of the looping <a href> links, etc.
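The first step (catching looping URL parts before fetching) could look something like this. This is only a heuristic sketch with made-up names (`looks_like_loop`, `max_repeats`), not code from the crawler, and a legitimate site could in principle have a repeated path segment, so the threshold matters:

```python
from urllib.parse import urlparse

def looks_like_loop(url: str, max_repeats: int = 2) -> bool:
    """Heuristic: flag URLs whose path repeats the same segment more
    than max_repeats times in a row, e.g. /news/news/news/."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    run = 1
    for prev, cur in zip(segments, segments[1:]):
        if cur == prev:
            run += 1
            if run > max_repeats:
                return True   # suspicious repetition, skip without fetching
        else:
            run = 1
    return False
```

A URL that this flags could then get the second-step content check you describe, instead of being crawled unconditionally.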
Thanks for your answer, good luck!