
#35 Restrict the crawl to either foo.com or www.foo.com

Status: open
Owner: nobody
Labels: None
Priority: 5
Updated: 2015-09-27
Created: 2015-09-27
Creator: Anonymous
Private: No

Since version 0.7, the crawler considers foo.com and www.foo.com to be the same host. From the changelog: "Fixed problem that the crawler handled i.e. "foo.com" and "www.foo.com" as different hosts."

But on certain sites (with the default $follow_mode = 2), the crawler crawls both http://foo.com/bar.html and http://www.foo.com/bar.html, so by the end of the crawl every URL has been parsed "twice" (once with and once without the www prefix).

I know this can be worked around with addURLFollowRule() on the sites that have this problem.
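For reference, that per-site workaround might look like the following PHP sketch. The class and method names (PHPCrawler, setURL(), setFollowMode(), addURLFollowRule(), go()) are PHPCrawl's real API; the include path and the exact follow-rule regex are assumptions for illustration:

```php
<?php
// Sketch of the workaround mentioned above: pin the crawl to the
// "www.foo.com" host explicitly, so links pointing at the bare
// "foo.com" host are not followed. (Include path is an assumption.)
include("libs/PHPCrawler.class.php");

$crawler = new PHPCrawler();
$crawler->setURL("http://www.foo.com/");
$crawler->setFollowMode(2); // default: stay on the same host

// Only follow URLs whose host is exactly "www.foo.com".
$crawler->addURLFollowRule("#^http://www\.foo\.com/#i");

$crawler->go();
?>
```

The drawback, as noted above, is that this rule has to be added manually for every affected site.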
But I think a dedicated option such as enableWithOrWithoutWww() or disableWithOrWithoutWww() would be better.
Of course, this option would only apply if the starting URL is http://foo.com/ or http://www.foo.com/, not if it is http://subdomain.foo.com/.
If set to false (with the default $follow_mode = 2):
- With the starting URL http://foo.com/, the crawl would stay on foo.com and ignore any URL on www.foo.com
- With the starting URL http://www.foo.com/, the crawl would stay on www.foo.com and ignore any URL on foo.com
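If such an option existed, usage might look like the sketch below. To be clear, the method is purely hypothetical: it does not exist in PHPCrawl, and the name is the one proposed in this ticket, rendered as a boolean setter:

```php
<?php
// Hypothetical API: enableWithOrWithoutWww() is the option proposed in
// this ticket and is NOT part of PHPCrawl. (Include path is an assumption.)
include("libs/PHPCrawler.class.php");

$crawler = new PHPCrawler();
$crawler->setURL("http://foo.com/");
$crawler->setFollowMode(2);

// Proposed behavior: when false, stay strictly on "foo.com" and
// ignore any URL on "www.foo.com".
$crawler->enableWithOrWithoutWww(false);

$crawler->go();
?>
```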

Discussion

