
#67 Follow-mode 1, crawler doesn't follow some subdomains

Status: open
Owner: nobody
Labels: None
Priority: 5
Updated: 2015-03-18
Created: 2014-02-28
Creator: Uwe Hunfeld
Private: No

A user reported that the crawler doesn't follow some subdomains when the follow-mode is set to 1 (setFollowMode(1)).

Example:
Root-URL: www.foo.com, follow-mode 1.
Links to e.g. www.hamburg.foo.com don't get followed.
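
What follow-mode 1 is expected to do in this example can be sketched in isolation. The snippet below is only an illustration, not PHPCrawl's actual code: naiveDomain() and shouldFollow() are hypothetical helpers, and the "last two labels" rule is a simplifying assumption that breaks for suffixes such as co.uk.

```php
<?php
// Hypothetical helper (not part of PHPCrawl): take the last two labels of a
// host as its domain. This is a naive assumption and is wrong for multi-part
// suffixes such as "co.uk".
function naiveDomain(string $host): string
{
    $labels = explode('.', strtolower($host));
    return implode('.', array_slice($labels, -2));
}

// Hypothetical helper: a link should be followed in follow-mode 1 when its
// host belongs to the same domain as the root URL's host.
function shouldFollow(string $rootHost, string $linkHost): bool
{
    return naiveDomain($rootHost) === naiveDomain($linkHost);
}

var_dump(shouldFollow('www.foo.com', 'www.hamburg.foo.com')); // bool(true)
var_dump(shouldFollow('www.foo.com', 'www.bar.com'));         // bool(false)
```

Under this reading, a link to www.hamburg.foo.com shares the domain foo.com with the root URL and should be followed.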

See also the original forum post:
https://sourceforge.net/p/phpcrawl/discussion/307696/thread/85b8d294/

Discussion

  • Uwe Hunfeld

    Uwe Hunfeld - 2014-02-28
    • summary: Fololow-mode 1, crawler doesn't follow some subdomains --> Follow-mode 1, crawler doesn't follow some subdomains
     
  • Anonymous

    Anonymous - 2015-03-18

    Yes, I managed to work around this by changing the domain filter as follows:

    // Filter URLs to other domains if wanted
    if ($this->general_follow_mode >= 1)
    {
        $should_remove = !(
            $url_parts["domain"] == $this->starting_url_parts["host"]
            || $url_parts["host"] == $this->starting_url_parts["host"]
        );
        // if ($url_parts["domain"] != $this->starting_url_parts["domain"]) return false;
        if ($should_remove) {
            return false;
        }
    }

    This change is in the file PHPCrawlerURLFilter.class.php (line 174).

     
