Hello,
for some reason my PHPCrawl script crawls some domains fully but not others.
With some domains it crawls only the first page and does not follow the links to the subpages, so it finds just 1 page. I tried crawling the same domains with a desktop crawler (Screaming Frog) and it finds many subpages.
For example, if I crawl this page it returns only 1 page and stops: http://chic.clinic.si/
Any idea why?
Hi,
did you try changing the basic followMode to 1?
By default the crawler only follows links to the same HOST, and the site you mentioned seems to contain only links to a different host.
(Entry host is "chic.clinic.si", the links lead to "www.chic.clinic.si")
http://phpcrawl.cuab.de/classreferences/PHPCrawler/method_detail_tpl_method_setFollowMode.htm
You may also simply change the root URL to "www.chic.clinic.si" (same site) without changing the follow mode; then the links lead to the same host as the one from the entry URL.
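To illustrate the host comparison described above, here is a minimal sketch in Python. It is not PHPCrawl's code; the helper names (`host_of`, `naive_domain_of`, `same_host`, `same_domain`) and the link path are made up for illustration, and the "domain" check is a deliberately naive assumption (last two labels of the host), which is wrong for suffixes such as "co.uk".

```python
from urllib.parse import urlsplit


def host_of(url: str) -> str:
    """Return the lowercased host part of a URL (empty string if none)."""
    return (urlsplit(url).hostname or "").lower()


def naive_domain_of(url: str) -> str:
    """Naive 'domain': the last two labels of the host.

    Assumption for illustration only; a real check would need a
    public-suffix list to handle suffixes like 'co.uk'.
    """
    labels = host_of(url).split(".")
    return ".".join(labels[-2:])


def same_host(root_url: str, link_url: str) -> bool:
    """Strict comparison: the hosts must be identical."""
    return host_of(root_url) == host_of(link_url)


def same_domain(root_url: str, link_url: str) -> bool:
    """Looser comparison: only the (naive) domain must match."""
    return naive_domain_of(root_url) == naive_domain_of(link_url)


root = "http://chic.clinic.si/"
link = "http://www.chic.clinic.si/some-page"  # hypothetical link path

print(same_host(root, link))    # False: "chic.clinic.si" != "www.chic.clinic.si"
print(same_domain(root, link))  # True: both end in "clinic.si"
```

With a strict same-host rule, every link on the entry page is rejected, which matches the "only 1 page" result. Starting from "http://www.chic.clinic.si/" makes the hosts identical, so the strict rule accepts the links.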
Hi,
it turns out I have to set the followMode to 0 to make it crawl the site. Unfortunately, that also makes the crawler follow external domains.
Any idea why followMode 1 is not working in this case?
Hi,
OK, it seems you are hitting the bug reported here:
https://sourceforge.net/p/phpcrawl/bugs/67/
I haven't addressed this bug so far, but you may try the workaround posted by a user there.
One question: if you set the root URL to "www.chic.clinic.si", does it not work either?
Hi,
if I set the www.* host, then it works.
I don't get the domains with www., but now that I think about it, maybe I should add it to the domains; I assume this is the preferred form for websites to be indexed under.
I will leave the followMode on 1, because I actually want to crawl subdomains.
Thanks Uwe