Hello!
Merry Christmas and a Happy New Year! Thanks for your work!
I have a question: what would you recommend?
Situation:
The HTML has malformed links in a subfolder, e.g. both "/news/" and "news" on the same page.
The server has a special config, so it returns code 200 for both of these links. So the crawler finds domain.com/news/, fetches it, finds the same malformed "/news/" and "news" links on that page, sends a request to domain.com/news/news/, gets a 200 and the same malformed links again. So the URL can grow like domain.com/news/news/news/news/news/... and it never ends.
I hope that's clear. The question is how to avoid this trap.
Example: http://bfit.kiev.ua/news/sportzal.php
If you click the button under the info block a few times, you'll end up on a /news/news/news/ page.
Hi, and a happy and healthy New Year to you too!
I understand the problem you described; there was a very similar problem here in the forum before.
The problem is: how should the crawler know that all these links are exactly the same page?
It could know, but only AFTER it has requested the page (e.g. via an MD5 hash of the content), and by then it's too late.
So right now I don't have a solution for this, I'm sorry.
Do you have any idea or approach?
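Just to illustrate the "only AFTER it requested the page" point, here is a minimal sketch of content-hash deduplication (the names `seen_hashes` and `is_duplicate` are my own, not anything from the crawler's actual code). It can stop the loop from growing further, but only once the duplicate page has already been fetched:

```python
import hashlib

# Hypothetical sketch: remember an MD5 digest of every fetched body and
# skip pages whose content we have already seen. Note this only helps
# AFTER the (possibly redundant) request has been made.
seen_hashes = set()

def is_duplicate(body: bytes) -> bool:
    digest = hashlib.md5(body).hexdigest()
    if digest in seen_hashes:
        return True          # same content as a previously fetched page
    seen_hashes.add(digest)
    return False             # first time we see this content
```

So domain.com/news/news/ would still be requested once, but its links would not be followed after the content is recognized as a duplicate of domain.com/news/.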
Thanks. Well, I think it could be done in a few steps.
The first step is analyzing the URL: if the analyzer catches looping URL parts, it can run a check on the page after fetching its content. If the content is the same, further looping URLs should be filtered out.
Something like that, but I think it's quite difficult.
There may also be difficulties with the content comparison: looping pages may have different content because of different relative paths (CSS, JS, images), or the depth of the looping <a href> links, etc.
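The first step (catching looping URL parts before fetching) could look something like this. This is only a heuristic sketch with made-up names (`looks_like_loop`, `max_repeats`), not code from the crawler, and a legitimate site could in principle have a repeated path segment, so the threshold matters:

```python
from urllib.parse import urlparse

def looks_like_loop(url: str, max_repeats: int = 2) -> bool:
    """Heuristic: flag URLs whose path repeats the same segment more
    than max_repeats times in a row, e.g. /news/news/news/."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    run = 1
    for prev, cur in zip(segments, segments[1:]):
        if cur == prev:
            run += 1
            if run > max_repeats:
                return True   # suspicious repetition, skip without fetching
        else:
            run = 1
    return False
```

A URL that this flags could then get the second-step content check you describe, instead of being crawled unconditionally.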
Thanks for your answer, good luck!