Link that redirects to another site is not captured

Status: Beta

Brought to you by: huni

Forum: Help

Creator: Nobody/Anonymous

Created: 2012-02-05

Updated: 2013-04-09

Nobody/Anonymous - 2012-02-05

Hi!

Did you change the follow-mode of the crawler?
By default, the crawler stays within the host of the page set in setURL().

Try setting $crawler->setFollowMode(0);
(http://phpcrawl.cuab.de/classreference.html#setfollowmode)

Hope this helps!

Nobody/Anonymous - 2012-02-05

Hi,

Thanks for the reply. I tried all of the follow modes, but none of them seemed to work. I don't actually want the crawler to crawl outside of the domain. I am trying to capture all the internal links that belong to the website being crawled, but it won't capture a link that redirects to an outside domain.

I've created an array of strings and, inside the handlePageData function, I add each new URL found to the array. For example:

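class MyCrawler extends PHPCrawler
{
    // Collected URLs, one entry per page passed to handlePageData().
    public $urlsFound = array();

    function handlePageData(&$page_data)
    {
        // Store only the URL string; assumes $page_data["url"] holds the page's URL.
        $this->urlsFound[] = $page_data["url"];
    }
}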

However, if a link on the site with a relative path redirects to an outside domain, it does not get added to my list. You can see my site at http://webvulscanner.net84.net/testsitewithvulns/vulnerabilities.php if you like. If you click on "Unvalidated Redirect 2", it redirects to a different website, and the crawler won't log that URL as a found URL in my array.

If you have any more suggestions, please let me know.

Many Thanks

Nobody/Anonymous - 2012-02-05

Sorry, I forgot to say that the URL I need to capture is the link itself, not the domain that it redirects to, i.e. this one needs to be captured: "unvalidated_redirect2.php?redirect=http://www.limerickleader.ie&message=fakemsg"

Thanks
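
For readers who only need the raw link targets as they appear in the markup, here is a minimal sketch. It is independent of PHPCrawl and relies on PHP's built-in DOMDocument (PHP 7 or later assumed); the function name extractHrefs is made up for illustration. It returns href values exactly as written in the page, without following or resolving them.

```php
<?php
// Hypothetical helper, not part of PHPCrawl: list the distinct href values
// of all <a> elements in an HTML string, exactly as they are written.
function extractHrefs(string $html): array
{
    $doc = new DOMDocument();

    // Real-world HTML is rarely valid; collect parser warnings internally
    // instead of printing them, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $hrefs = array();
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        if ($anchor->hasAttribute('href')) {
            $hrefs[] = $anchor->getAttribute('href');
        }
    }

    // Drop duplicates and reindex from 0.
    return array_values(array_unique($hrefs));
}

// Usage: prints the two distinct link targets found in the snippet.
print_r(extractHrefs(
    '<a href="a.php">one</a> <a href="a.php">again</a> <a href="b.php?q=1">two</a>'
));
```

Since nothing is requested over HTTP here, a link that redirects elsewhere is recorded like any other link.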

Uwe Hunfeld - 2012-02-06

I just checked your report, and the crawler seems to have difficulty rebuilding the fully qualified URL from the link "unvalidated_redirect2.php?redirect=http://www.limerickleader.ie&message=fakemsg".

array(5) {
  ["link_raw"]=>
  string(79) "unvalidated_redirect2.php?redirect=http://www.limerickleader.ie&message=fakemsg"
  ["linktext"]=>
  string(0) ""
  ["linkcode"]=>
  string(90) "<a href="unvalidated_redirect2.php?redirect=http://www.limerickleader.ie&message=fakemsg">"
  ["referer_url"]=>
  string(68) "http://webvulscanner.net84.net/testsitewithvulns/vulnerabilities.php"
  ["url_rebuild"]=>
  string(8) "http:///"
}

This is definitely a bug, sorry!

It will get fixed in the next version.
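
The thread does not say what goes wrong inside the crawler, so the following is only a sketch of one way to rebuild such a link against its referer URL, not PHPCrawl's code. The function name resolveLink is hypothetical, and PHP 7 or later is assumed. The point it illustrates is that a link counts as absolute only when it starts with a scheme, so a "://" inside the query string is left alone. It deliberately ignores "../" segments, protocol-relative links ("//host/...") and links that consist only of a query or fragment.

```php
<?php
// Hypothetical helper, not part of PHPCrawl: resolve a link against the
// URL of the page it was found on (simplified, see the limits noted above).
function resolveLink(string $link, string $refererUrl): string
{
    // Absolute only if the link *starts* with "scheme://"; a "://" that
    // appears later (e.g. in a query parameter) must not count.
    if (preg_match('#^[a-z][a-z0-9+.-]*://#i', $link) === 1) {
        return $link;
    }

    // The referer is assumed to be a well-formed absolute URL.
    $base = parse_url($refererUrl);
    $root = $base['scheme'] . '://' . $base['host']
          . (isset($base['port']) ? ':' . $base['port'] : '');

    // Host-relative link ("/path/file.php"): attach directly to the root.
    if (strpos($link, '/') === 0) {
        return $root . $link;
    }

    // Path-relative link: attach to the directory of the referring page.
    $basePath = $base['path'] ?? '/';
    $dir = substr($basePath, 0, strrpos($basePath, '/') + 1);

    return $root . $dir . $link;
}

// Usage with the link and referer from this thread.
echo resolveLink(
    'unvalidated_redirect2.php?redirect=http://www.limerickleader.ie&message=fakemsg',
    'http://webvulscanner.net84.net/testsitewithvulns/vulnerabilities.php'
), "\n";
```

With the values from the dump above, this yields a URL under http://webvulscanner.net84.net/testsitewithvulns/ with the query string carried over unchanged.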

Uwe Hunfeld - 2012-02-06

I just added the bug to the bug list:
https://sourceforge.net/tracker/?func=detail&aid=3485155&group_id=89439&atid=590146

Dermot Blair - 2012-02-07

No worries, thanks for the reply! Do you have any idea when the next version is due to be released? It's not urgent or anything; I was just curious.

Many Thanks!
