I'd like to find all invalid domains (i.e. those that return no status code at all) while crawling; I don't care about the actual linked pages. Example: the crawler finds a link to www.microsoft.com/downloads/ie/index.html, but I want it to try to reach only www.microsoft.com. I also don't want to crawl multiple links from the same domain. How would I go about doing this with PHPCrawl?
Hi!
I think there is no way to achieve this directly with PHPCrawl.
But here is what you could do:
Let PHPCrawl collect all external links (or just the domains of the external links) from a page without following them, and store them somewhere (in a file or a database).
Afterwards, check each collected domain for existence, e.g. with PHPCrawl again and a page limit of 1, or simply with file_get_contents().
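In case it helps, here is a rough, untested sketch of that two-step approach. It assumes the PHPCrawl 0.8x API (the PHPCrawler base class, the handleDocumentInfo() override and the "url_rebuild" key in $DocInfo->links_found); the start URL and the include path are placeholders:

```php
<?php
// Rough sketch, not tested - adjust the include path to your setup.
require_once("libs/PHPCrawler.class.php");

class DomainCollector extends PHPCrawler
{
    public $domains = array(); // unique hostnames seen in links

    function handleDocumentInfo($DocInfo)
    {
        // Record the host of every link found on the page. External links
        // are only collected here, never requested (see follow mode below).
        foreach ($DocInfo->links_found as $link)
        {
            $host = parse_url($link["url_rebuild"], PHP_URL_HOST);
            if (!empty($host))
                $this->domains[$host] = true; // array key => de-duplicated
        }
    }
}

$crawler = new DomainCollector();
$crawler->setURL("http://www.example.com/");
$crawler->setFollowMode(1); // stay inside the start domain
$crawler->go();

// Second pass: probe each collected domain's root URL exactly once.
foreach (array_keys($crawler->domains) as $domain)
{
    // get_headers() returns FALSE when no HTTP status comes back at all,
    // which is exactly the "invalid domain" case you asked about.
    if (@get_headers("http://" . $domain . "/") === false)
        echo "Invalid domain (no status code): " . $domain . "\n";
}
?>
```

Storing the hosts as array keys de-duplicates them for free, so no domain gets probed more than once in the second pass.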
Hi, thanks for the quick reply. How do I prevent the links from being followed? With addURLFilterRule()?
If you want the crawler to stay within a domain, set the follow mode to 1 with setFollowMode(); see http://phpcrawl.cuab.de/classreferences/index.html
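For example (mode numbers as documented in the class reference; under mode 1 external links should still be reported in $DocInfo->links_found, they just aren't requested):

```php
// Follow modes: 0 = follow everything, 1 = stay in the same domain,
// 2 = stay on the same host, 3 = stay in the same path.
$crawler->setFollowMode(1);
```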