Hi!
Did you change the follow mode of the crawler?
By default, the crawler stays within the host of the URL set in setURL().
Try setting $crawler->setFollowMode(0);
(http://phpcrawl.cuab.de/classreference.html#setfollowmode)
Hope this helps!
Hi,
Thanks for the reply. I tried all of the follow modes, but none of them seemed to work. I don't actually want the crawler to crawl outside of the domain. I am trying to capture all the internal links that belong to the website being crawled, but it won't capture a link that redirects to an outside domain.
I've created an array of strings and, inside the handlePageData function, I add each new URL found to the array. For example:
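class MyCrawler extends PHPCrawler
{
    // URLs of all pages passed to handlePageData()
    public $urlsFound = array();

    function handlePageData(&$page_data)
    {
        // $page_data is an array describing the page; store only its URL
        $this->urlsFound[] = $page_data["url"];
    }
}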
However, if a link on the site with a relative path redirects to an outside domain, it does not get added to my list. If you want to see my site, it is at http://webvulscanner.net84.net/testsitewithvulns/vulnerabilities.php. If you click on "Unvalidated Redirect 2", it redirects to a different website, and the crawler won't log the URL as a found URL in my array.
If you have any more suggestions, please let me know.
Many thanks
Sorry, I forgot to say that the URL I need to capture is the link itself, not the domain that it redirects to, i.e. this one needs to be captured: "unvalidated_redirect2.php?redirect=http://www.limerickleader.ie&message=fakemsg"
Thanks
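To illustrate what "internal" means here, the following is a minimal sketch that decides whether a fully qualified URL belongs to the crawled site by comparing only its host, so an external address inside the query string does not matter. The function name isInternalUrl() is hypothetical and not part of PHPCrawl; it relies only on PHP's built-in parse_url().

```php
<?php
// Hypothetical helper, not part of PHPCrawl: returns true when the host
// part of $url equals $siteHost. The query string is ignored, so a
// "redirect=http://..." parameter does not make the link external.
function isInternalUrl(string $url, string $siteHost): bool
{
    // parse_url() returns null or false when no host can be extracted.
    $host = parse_url($url, PHP_URL_HOST);
    return is_string($host) && strcasecmp($host, $siteHost) === 0;
}

$link = 'http://webvulscanner.net84.net/testsitewithvulns/'
      . 'unvalidated_redirect2.php?redirect=http://www.limerickleader.ie&message=fakemsg';

echo isInternalUrl($link, 'webvulscanner.net84.net') ? "internal\n" : "external\n";
echo isInternalUrl('http://www.limerickleader.ie', 'webvulscanner.net84.net') ? "internal\n" : "external\n";
```

The first call reports the redirect link as internal and the second reports the redirect target as external, which matches the distinction described above.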
I just checked your report, and the crawler seems to have difficulty rebuilding the fully qualified URL from the link "unvalidated_redirect2.php?redirect=http://www.limerickleader.ie&message=fakemsg".
This is definitely a bug, sorry!
It will get fixed in the next version.
Just added the bug to the buglist:
https://sourceforge.net/tracker/?func=detail&aid=3485155&group_id=89439&atid=590146
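For readers who hit the same problem, here is a rough sketch of how a relative link that carries an absolute URL in its query string can be rebuilt against the page it was found on. This is not PHPCrawl's actual code or the fix that went into the library; resolveLink() is a hypothetical helper, and it is deliberately simplified (no "../" normalisation, no special handling of links that consist only of a query or fragment). The key assumption is that a link counts as absolute only when it starts with a scheme, so an "http://" appearing later in the query string is ignored.

```php
<?php
// Hypothetical helper, not part of PHPCrawl: rebuilds a fully qualified
// URL from a link found on the page at $baseUrl (an absolute URL).
function resolveLink(string $baseUrl, string $link): string
{
    // Absolute only if the link STARTS with a scheme such as "http:".
    // A scheme that appears later, e.g. in the query string, is ignored.
    if (preg_match('~^[a-z][a-z0-9+.-]*:~i', $link)) {
        return $link;
    }

    $base   = parse_url($baseUrl);
    $origin = $base['scheme'] . '://' . $base['host']
            . (isset($base['port']) ? ':' . $base['port'] : '');

    // Protocol-relative link ("//host/path"): reuse the base scheme.
    if (strncmp($link, '//', 2) === 0) {
        return $base['scheme'] . ':' . $link;
    }

    // Root-relative link ("/path"): append to scheme and host.
    if (strncmp($link, '/', 1) === 0) {
        return $origin . $link;
    }

    // Path-relative link: keep the base path up to its last "/".
    $path = $base['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $link;
}

echo resolveLink(
    'http://webvulscanner.net84.net/testsitewithvulns/vulnerabilities.php',
    'unvalidated_redirect2.php?redirect=http://www.limerickleader.ie&message=fakemsg'
), "\n";
// prints http://webvulscanner.net84.net/testsitewithvulns/unvalidated_redirect2.php?redirect=http://www.limerickleader.ie&message=fakemsg
```

Checking for the scheme only at the start of the link is what keeps the embedded "http://www.limerickleader.ie" from being mistaken for the link's own host.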
No worries, thanks for the reply! Do you have any idea when the next version is due to be released? It's not urgent or anything, I was just curious.
Many Thanks!