
#67 Follow-mode 1, crawler doesn't follow some subdomains

Status: open
Owner: nobody
Labels: None
Priority: 5
Updated: 2015-03-18
Created: 2014-02-28
Creator: Uwe Hunfeld
Private: No

A user reported that the crawler doesn't follow some subdomains when follow-mode is set to 1 (setFollowMode(1)).

Example:
Root-URL: www.foo.com, follow-mode 1.
Links to e.g. www.hamburg.foo.com don't get followed.

See also the original forum post:
https://sourceforge.net/p/phpcrawl/discussion/307696/thread/85b8d294/

Discussion

  • Uwe Hunfeld

    Uwe Hunfeld - 2014-02-28
    • summary: Fololow-mode 1, crawler doesn't follow some subdomains --> Follow-mode 1, crawler doesn't follow some subdomains
     
  • Anonymous

    Anonymous - 2015-03-18

    Yes, I managed to work around this by changing the domain filter to:

    // Filter URLs to other domains if wanted
    if ($this->general_follow_mode >= 1)
    {
        $should_remove = !(
            $url_parts["domain"] == $this->starting_url_parts["host"]
            || $url_parts["host"] == $this->starting_url_parts["host"]
        );
        // if ($url_parts["domain"] != $this->starting_url_parts["domain"]) return false;
        if ($should_remove) {
            return false;
        }
    }

    in the file PHPCrawlerURLFilter.class.php (line 174).
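
For comparison, here is a minimal standalone sketch of a suffix-based check that would accept the example from the ticket (www.hamburg.foo.com when starting from www.foo.com). It is not PHPCrawl code: the helper names guessBaseDomain() and isSameDomainOrSubdomain() are hypothetical, and the base domain is guessed by keeping the last two labels of the host, which is an assumption that fails for suffixes such as "co.uk" and for IP addresses.

```php
<?php
// Hypothetical helper, not part of PHPCrawl: guesses the base domain of a
// host by keeping its last two labels.
// ASSUMPTION: wrong for multi-label suffixes like "co.uk" and for IP
// addresses; a public-suffix list would be needed to handle those.
function guessBaseDomain(string $host): string
{
    $labels = explode(".", strtolower($host));
    if (count($labels) <= 2) {
        return implode(".", $labels);
    }
    return implode(".", array_slice($labels, -2));
}

// Hypothetical helper: true if $linkHost is the guessed base domain of
// $startHost, or any subdomain of that base domain.
function isSameDomainOrSubdomain(string $linkHost, string $startHost): bool
{
    $base = guessBaseDomain($startHost);
    $linkHost = strtolower($linkHost);
    if ($linkHost === $base) {
        return true;
    }
    // Compare against ".foo.com" rather than "foo.com" so that a host
    // like "notfoo.com" is not treated as a subdomain.
    $suffix = "." . $base;
    return substr($linkHost, -strlen($suffix)) === $suffix;
}

var_dump(isSameDomainOrSubdomain("www.hamburg.foo.com", "www.foo.com")); // bool(true)
var_dump(isSameDomainOrSubdomain("www.bar.com", "www.foo.com"));         // bool(false)
```

A check of this shape could in principle replace the equality comparison in the filter above, but whether that matches the intended meaning of follow-mode 1 is for the maintainer to decide.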

     
