
Infinity link loop

Help
Anonymous
2015-01-05
2015-01-07
  • Anonymous

    Anonymous - 2015-01-05

    Hello!
    Merry Christmas and Happy New Year! Thanks for your work!
    I have a question: what would you recommend?

    Situation:
    The HTML has inconsistently formatted links to a subfolder, such as "/news/" and "news", on the same page.
    The server has a special config, so it returns code 200 for both of these links. So the crawler finds domain.com/news/, processes it, and finds the same bad links ("/news/" and "news") on that page. It then sends a request to domain.com/news/news/ and gets a 200 code and the same bad links again. So the link can become domain.com/news/news/news/news/news/... and it never ends.

    Hope it's clear. The question is how to avoid this trap.
    Example:
    http://bfit.kiev.ua/news/sportzal.php
    If you click the button under the info block a few times, you'll get a /news/news/news/ page.
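
To make the trap concrete, here is a minimal Python sketch of how the two link styles resolve. It is not the crawler's own code. It assumes the relative href is effectively "news/" (with a trailing slash, or that the server redirects to the slash form), since that is what produces domain.com/news/news/. The function name `follow_relative_link` is made up for illustration.

```python
from urllib.parse import urljoin


def follow_relative_link(start_url, href, hops):
    """Resolve the same href against each resulting page, as a crawler would."""
    urls = [start_url]
    for _ in range(hops):
        # Each new page contains the same href, resolved against its own URL.
        urls.append(urljoin(urls[-1], href))
    return urls


# Root-relative link: every hop resolves to the same URL, so no loop.
print(follow_relative_link("http://domain.com/news/", "/news/", 3))

# Directory-relative link: every hop adds one more path segment.
print(follow_relative_link("http://domain.com/news/", "news/", 3))
```

The second call yields a new, never-seen URL on every hop, which is why an ordinary "already visited" check does not stop the crawl.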

     
  • Uwe Hunfeld

    Uwe Hunfeld - 2015-01-06

    Hi, and a happy and healthy new year to you too!

    I understand the problem you described; there was a very similar problem here in the forum before.

    The problem is: how should the crawler know that all these links point to exactly the same page?
    It could know, but only AFTER it has requested the page (e.g. from an MD5 hash of the content), and by then it's too late.
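
As a rough illustration of that MD5 idea, the sketch below remembers a hash of every page body after it has been fetched. It is only a sketch in Python, not part of the crawler, and the class name `ContentDeduplicator` is hypothetical. The check can only run once the response is in hand, so the request itself is already spent.

```python
import hashlib


class ContentDeduplicator:
    """Remembers an MD5 digest for every page body seen so far."""

    def __init__(self):
        self._first_url_by_digest = {}

    def check(self, url, body):
        """Return the URL that first served this exact body, or None if new.

        `body` is the raw response as bytes, so this runs after the request.
        """
        digest = hashlib.md5(body).hexdigest()
        first_url = self._first_url_by_digest.setdefault(digest, url)
        return None if first_url == url else first_url


dedup = ContentDeduplicator()
print(dedup.check("http://domain.com/news/", b"<html>same page</html>"))
print(dedup.check("http://domain.com/news/news/", b"<html>same page</html>"))
```

If a crawler skipped link extraction for pages flagged this way, the loop might end after one wasted request, but only when the looping pages are byte-for-byte identical.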

    So, right now I don't have a solution for this, I'm sorry.

    Do you have any idea or approach?

     
  • Anonymous

    Anonymous - 2015-01-07

    Thanks.

    Well, I think it can be done in a few steps.
    The first step is analyzing the URL. If the analyzer catches repeating URL parts, it can run a check on the page after fetching the content. If the content is the same, further looping URLs should be filtered out.
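
The URL-analysis step could look roughly like the Python sketch below. It only flags suspicious URLs; the content check described above would still have to follow. The function name `has_repeating_path` and the threshold `max_repeats=2` are assumptions chosen for illustration, not anything the crawler provides.

```python
from urllib.parse import urlsplit


def has_repeating_path(url, max_repeats=2):
    """Return True if a run of path segments repeats back to back
    more than `max_repeats` times, e.g. /news/news/news/."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    n = len(segments)
    # Try every unit length and every starting position.
    for size in range(1, n // 2 + 1):
        for start in range(n - size + 1):
            unit = segments[start:start + size]
            repeats = 1
            pos = start + size
            # Count how many times the unit follows itself directly.
            while segments[pos:pos + size] == unit:
                repeats += 1
                pos += size
            if repeats > max_repeats:
                return True
    return False


print(has_repeating_path("http://domain.com/news/news/news/"))
print(has_repeating_path("http://bfit.kiev.ua/news/sportzal.php"))
```

A low threshold catches the loop early but risks flagging legitimate URLs that happen to repeat a segment, which is one reason to treat a match as a trigger for the content check rather than as proof.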

    Something like that, but I think it's quite difficult.

    There may also be difficulties with comparing content. Looping pages may have different content because of different paths (CSS, JS, images), the depth of the looping a href links, etc.
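
One possible way around the path differences, sketched in Python under heavy assumptions: blank out the values of URL-carrying attributes before hashing, so that pages differing only in "../" depth compare equal. The function name `content_fingerprint` is hypothetical, the regular expression only handles quoted href, src and action attributes, and a real page may differ in other ways this does not cover.

```python
import hashlib
import re

# Matches href="...", src='...' and action="..." with quoted values only.
_URL_ATTR = re.compile(
    r'''\b(href|src|action)\s*=\s*(["']).*?\2''',
    re.IGNORECASE | re.DOTALL,
)


def content_fingerprint(html):
    """Hash the page after removing URL attribute values and extra whitespace."""
    without_urls = _URL_ATTR.sub(r'\1=""', html)
    collapsed = re.sub(r"\s+", " ", without_urls).strip()
    return hashlib.md5(collapsed.encode("utf-8")).hexdigest()


level_1 = '<link href="../style.css"><a href="news/">News</a>'
level_2 = '<link href="../../style.css"><a href="news/">News</a>'
print(content_fingerprint(level_1) == content_fingerprint(level_2))
```

The trade-off is that two genuinely different pages whose only difference is their link targets would also compare equal, so this is best combined with the URL check rather than used alone.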

    Thanks for your answer, good luck!

     
