Crawl pages using iframe / frames to change URL

Help
Anonymous
2013-05-29
2013-05-30
  • Anonymous

    Anonymous - 2013-05-29

    I'm using PHPCrawl to crawl various company websites. However, some of these websites have multiple URLs (e.g. .com and .nl), and on one of those, they include the other using an iframe or frame element. Because I have disabled cross-domain crawling, this means that no (useful) content is crawled at all.

    Cross-domain crawling should not be enabled in my situation, since usually I don't want that. However, if the root URL contains a single frame or iframe including a different URL, I would like that URL to be crawled instead, much like the handling of redirects.

    Is there any way to do this currently?

     
  • Uwe Hunfeld

    Uwe Hunfeld - 2013-05-29

    Hi!

    The only solution that comes to mind right now is to ENABLE cross-domain crawling and then set a follow rule like this:

    $crawler->setFollowMode(0);
    $crawler->addURLFollowRule("#domainname\.(com|nl)#");

    Could this work for you?

     

    Last edit: Uwe Hunfeld 2013-05-29
  • Anonymous

    Anonymous - 2013-05-29

    Thanks for the suggestion. It will do for this specific case. I changed it a bit: I remove the TLD from the domain name, and use this regexp:

    $crawler->addURLFollowRule("#" . $url_domain . "\.([a-z]{2,4})#");

    This will work in a lot of situations, assuming that the domain name stays the same across TLDs.

    However, in a more general case, it would be great to have something like setFollowRedirectsTillContent that, instead of following redirects, follows a URL included through a single frame or iframe on a page, or more specifically, in the first content it receives when starting a crawl.

    Thanks for the suggestion, it does fix the problem for now.
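
The TLD-agnostic rule described in this post could be built along the following lines. This is only a sketch: buildFollowRule() is a hypothetical helper, not part of PHPCrawl, and it assumes that the rule is matched against the full URL and that the TLD is the last dot-separated label of the host (so hosts such as example.co.uk are not handled properly).

```php
<?php
// Hypothetical helper (not part of PHPCrawl): builds a follow-rule regex that
// matches the same domain name under any TLD of 2 to 4 letters.
function buildFollowRule(string $host): string
{
    // Normalise the host and drop a leading "www." if present.
    $host = preg_replace('#^www\.#i', '', strtolower($host));

    // Assumption: the TLD is the last dot-separated label; remove it.
    $labels = explode('.', $host);
    if (count($labels) > 1) {
        array_pop($labels);
    }
    $name = implode('.', $labels);

    // preg_quote() escapes regex metacharacters in the name, including the
    // "#" delimiter. The pattern is anchored to the host part of the URL so
    // the domain name cannot match somewhere inside a path.
    return '#^https?://([a-z0-9-]+\.)*' . preg_quote($name, '#') . '\.[a-z]{2,4}(/|$)#i';
}
```

The resulting string would then be passed to the crawler in the same way as above, i.e. $crawler->addURLFollowRule(buildFollowRule("www.example.com")); together with setFollowMode(0).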

     
  • Uwe Hunfeld

    Uwe Hunfeld - 2013-05-29

    I see what you mean.
    It could be done by checking whether there's only one URL on a page coming from an iframe or frame-tag.

    Would you like to open a feature-request for this?

    Thanks!
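
The check described in this post could look roughly like the sketch below. It is not an existing PHPCrawl feature: findSingleFrameUrl() is a hypothetical helper that uses PHP's DOM extension, and it only looks at frame and iframe elements, ignoring ordinary links on the page.

```php
<?php
// Hypothetical helper (not part of PHPCrawl): returns the src of the only
// frame/iframe found in the given HTML, or null if there is none or more
// than one.
function findSingleFrameUrl(string $html): ?string
{
    // DOMDocument::loadHTML() rejects an empty string, so bail out early.
    if (trim($html) === '') {
        return null;
    }

    $doc = new DOMDocument();

    // Real-world markup is rarely valid; collect parser warnings internally
    // instead of emitting them, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Gather every non-empty src attribute from frame and iframe elements.
    $sources = [];
    foreach (['frame', 'iframe'] as $tag) {
        foreach ($doc->getElementsByTagName($tag) as $node) {
            $src = trim($node->getAttribute('src'));
            if ($src !== '') {
                $sources[] = $src;
            }
        }
    }

    // Only treat the page as a "wrapper" when exactly one frame URL exists.
    return count($sources) === 1 ? $sources[0] : null;
}
```

A caller could run this on the first document received and, if it returns a URL, restart the crawl from that URL instead of enabling cross-domain crawling in general.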

     
  • MadEgg

    MadEgg - 2013-05-29
     
  • Uwe Hunfeld

    Uwe Hunfeld - 2013-05-30

    Thanks!

     
