
Does Crawler follow all the types of links?

Help
Anonymous
2014-02-21
2014-11-27
  • Anonymous

    Anonymous - 2014-02-21

    Dear Colleagues,

    I have a domain like "http://www.foo.de", and this website has a link like "http://www.hamburg.foo.de/", but the crawler cannot follow it. It can only follow links ending in pdf or html, like "http://www.foo.de/file.html".
    My settings are:
    $crawler = flx_Crawler::getInstance();
    $crawler->setFollowMode(1);
    $crawler->setFollowRedirects(TRUE);
    $crawler->setFollowRedirectsTillContent(TRUE);
    $crawler->addContentTypeReceiveRule("#text/html#");
    $crawler->addContentTypeReceiveRule("#application/pdf#");
    $crawler->addURLFilterRule("#\.(jpg|jpeg|gif|png)$# i");
    $crawler->addURLFilterRule("#\.(css|js)$# i");
    $crawler->setUrlCacheType(PHPCrawlerUrlCacheTypes::URLCACHE_SQLITE);
    $crawler->enableCookieHandling(true);
    $crawler->setURL($url);

    Any suggestions?

    Thanks a lot in advance.

    Regards

    Jorge von Rudno

     
  • Anonymous

    Anonymous - 2014-02-22

    Hi!

    Your setup looks good; as far as I can see, it should work as expected.

    To give it a test:
    Could you try it again with follow-mode 0? Does it follow "http://www.hamburg.foo.de/" with that? If so, the crawler seems to handle "http://www.hamburg.foo.de/" as a different host than foo.de.

    I'm not sure; that might be a bug then. Or is it actually a different "host"?
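
To make the host/domain distinction above concrete, here is a small sketch in plain PHP. The helpers `hostOf` and `naiveDomainOf` are hypothetical names made up for this example, and taking the last two labels as the "domain" is a deliberately naive assumption (it fails for suffixes such as co.uk). It is not a description of how the crawler compares URLs internally.

```php
<?php
// Hypothetical helper: returns the lower-cased host part of a URL.
function hostOf(string $url): string
{
    return strtolower((string) parse_url($url, PHP_URL_HOST));
}

// Hypothetical helper: naive "domain" = last two labels of the host.
// Assumption for illustration only; breaks for suffixes like co.uk.
function naiveDomainOf(string $url): string
{
    $labels = explode('.', hostOf($url));
    return implode('.', array_slice($labels, -2));
}

$root = 'http://www.foo.de';
$link = 'http://www.hamburg.foo.de/';

var_dump(hostOf($root) === hostOf($link));               // bool(false): different hosts
var_dump(naiveDomainOf($root) === naiveDomainOf($link)); // bool(true): same domain
```

So the two URLs differ as hosts but share a domain, which is why the behaviour may depend on whether the follow-mode compares hosts or domains.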

     
  • Anonymous

    Anonymous - 2014-02-24

    Hi, thanks a lot for your answer.

    I tried your suggestion (setFollowMode(0)), and the crawler does in fact follow the link. I think the most likely explanation is that the link "http://www.hamburg.foo.de/" goes to a different host. Is there a way to solve this situation?
    Best regards.

    Jorge von Rudno

     
  • Anonymous

    Anonymous - 2014-02-24

    OK, I think this is a bug: "www.hamburg.foo.de" belongs to the same domain as "www.foo.de",
    so I think the crawler should definitely follow links like that with follow-mode 1!

    I'll give it a test soon.

    As a workaround, you could simply set the follow rules yourself without relying on the follow-mode.

    Try something like this:
    $crawler->setFollowMode(0);
    $crawler->addURLFollowRule("#foo\.de#");
    ...

    This lets the crawler follow every URL that contains the string foo.de. Of course, you could refine the rule so that it won't follow URLs like "www.bla.com/dabadafoo.de" if you want.
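
One possible refinement, as a rough sketch: anchor the pattern to the host part of the URL so that only foo.de (the placeholder domain from the question) and its subdomains match. The pattern and the helper name `matchesFooDomain` are assumptions made up for this example; the check uses plain `preg_match` so it can be tried without the crawler. The `~` delimiter is used only so that `#` can appear inside the pattern unescaped.

```php
<?php
// Hypothetical rule: scheme, any number of subdomain labels, then foo.de,
// an optional port, and finally "/", "?", "#" or the end of the URL.
$followRule = '~^https?://([a-z0-9-]+\.)*foo\.de(:\d+)?([/?#]|$)~i';

// Illustrative helper (not part of the crawler): tests a URL against a rule.
function matchesFooDomain(string $url, string $rule): bool
{
    return preg_match($rule, $url) === 1;
}

var_dump(matchesFooDomain('http://www.hamburg.foo.de/', $followRule));      // bool(true)
var_dump(matchesFooDomain('http://www.foo.de/file.html', $followRule));     // bool(true)
var_dump(matchesFooDomain('http://www.bla.com/dabadafoo.de', $followRule)); // bool(false)
var_dump(matchesFooDomain('http://foo.de.evil.com/', $followRule));         // bool(false)
```

If the rule behaves as wanted, it could presumably be passed to addURLFollowRule() in place of the simpler pattern, assuming the crawler matches rules against full URLs that include the scheme.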

     

    Last edit: Anonymous 2014-02-24
  • Anonymous

    Anonymous - 2014-02-25

    Hi,

    I have implemented your suggestion and it has solved my problem. Judging by your knowledge and your comments, I think you are part of the PHPCrawl development team, so please tell me if I should report this case as a bug.

     
  • Anonymous

    Anonymous - 2014-02-25

    Sorry, I forgot to say thanks a lot for your help.

    Regards.

    Jorge von Rudno

     
  • Uwe Hunfeld

    Uwe Hunfeld - 2014-02-28

    Hi!

    No problem.

    I just opened a bug report for this:
    https://sourceforge.net/p/phpcrawl/bugs/67/

     
  • Uwe Hunfeld

    Uwe Hunfeld - 2014-11-27

    Hi Jorge,

    Do you have an actual example (page) for this problem that I could use for testing?

     
