I'm using PHPCrawl to crawl various company websites. However, some of these websites have multiple URLs (e.g. .com and .nl), and on one of those they include the other using an iframe or frame element. Because I have disabled cross-domain crawling, no useful content is crawled at all.
Cross-domain crawling should stay disabled in my situation, since I usually don't want it. However, if the root URL contains a single frame or iframe that includes a different URL, I would like that URL to be crawled instead, much like the handling of redirects.
Is there any way to do this currently?
Hi!
The only solution that comes to mind right now is to ENable cross-domain crawling and then set a follow rule like this:
$crawler->setFollowMode(0);
$crawler->addURLFollowRule("#domainname\.(com|nl)#");
Could this work for you?
Last edit: Uwe Hunfeld 2013-05-29
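A quick way to sanity-check a follow rule like the one above, before handing it to the crawler, is to run the pattern through preg_match against a few sample URLs. This is only a sketch and assumes the rule is evaluated as an ordinary PCRE pattern; the `$samples` URLs are made up for illustration, and `domainname` stands in for the real site name.

```php
<?php
// Follow-rule pattern from the post, with balanced parentheses.
$rule = "#domainname\.(com|nl)#";

// Hypothetical sample URLs, for illustration only.
$samples = array(
    "http://www.domainname.com/about",
    "http://www.domainname.nl/",
    "http://www.otherdomain.com/",
);

foreach ($samples as $url) {
    // preg_match returns 1 on a match, 0 on no match, false on a pattern error.
    $followed = preg_match($rule, $url) === 1;
    echo $url . " => " . ($followed ? "follow" : "skip") . "\n";
}
```

Because the pattern is unanchored, it also matches URLs that merely contain domainname.com somewhere in a path or query string; anchoring it to the host part would be stricter.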
Thanks for the suggestion. It will do for this specific case. I changed it a bit: I remove the TLD from the domain name, and use this regexp:
$crawler->addURLFollowRule("#" . preg_quote($url_domain, "#") . "\.([a-z]{2,4})#");
For a lot of situations this will work, assuming that the domain name does not change.
However, in a more general case, it would be great to have something like setFollowRedirectsTillContent that, instead of following redirects, follows a URL included through a single frame or iframe on a page, or more specifically, on the first content the crawler receives when starting a crawl.
Thanks for the suggestion, it does fix the problem for now.
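The TLD-stripping approach described above could be sketched roughly as follows. `buildTldAgnosticRule` is a hypothetical helper, not part of PHPCrawl, and the trailing `(/|:|$)` check is an added assumption so that a longer host label is not mistaken for a TLD.

```php
<?php
/**
 * Build a follow rule that matches a host name under any 2-4 letter TLD.
 * Hypothetical helper for illustration, not part of PHPCrawl.
 */
function buildTldAgnosticRule($host)
{
    // Drop the last label (the TLD), e.g. "www.example.com" -> "www.example".
    $pos = strrpos($host, ".");
    $name = ($pos === false) ? $host : substr($host, 0, $pos);

    // preg_quote escapes the dots and the "#" delimiter inside the name.
    return "#" . preg_quote($name, "#") . '\.[a-z]{2,4}(/|:|$)#i';
}

$rule = buildTldAgnosticRule("www.example.com");

// Same name under a different TLD, and a different name that shares a prefix.
var_dump(preg_match($rule, "http://www.example.nl/contact") === 1);
var_dump(preg_match($rule, "http://www.example-shop.nl/") === 1);
```

Multi-label suffixes such as .co.uk are not covered by this pattern, which is one reason the approach only works "for a lot of situations".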
I see what you mean.
It could be done by checking whether a page contains exactly one URL that comes from an iframe or frame tag.
Would you like to open a feature request for this?
Thanks!
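The check described above (exactly one URL coming from an iframe or frame tag) could look roughly like this outside the crawler. It is a sketch using PHP's DOM extension, not how PHPCrawl does or would implement it; `findSingleFrameUrl` is a hypothetical name.

```php
<?php
/**
 * Return the src of the only frame/iframe in $html, or null when the page
 * has none or more than one. Hypothetical helper, not part of PHPCrawl.
 */
function findSingleFrameUrl($html)
{
    if (trim($html) === "") {
        return null;
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Gather every non-empty src attribute from frame and iframe elements.
    $sources = array();
    foreach (array("frame", "iframe") as $tag) {
        foreach ($doc->getElementsByTagName($tag) as $node) {
            $src = trim($node->getAttribute("src"));
            if ($src !== "") {
                $sources[] = $src;
            }
        }
    }

    return count($sources) === 1 ? $sources[0] : null;
}

// Made-up root page that only wraps another site in an iframe.
$html = '<html><body><iframe src="http://www.example.com/"></iframe></body></html>';
$target = findSingleFrameUrl($html);
echo $target === null ? "no single frame found\n" : "crawl instead: " . $target . "\n";
```

A relative src would still need to be resolved against the page URL before it could be used as a new starting point for the crawl.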
Done, see: https://sourceforge.net/p/phpcrawl/feature-requests/20/
Thanks!