How to only check for 404 domains when crawling?

  • Anonymous - 2015-01-06

    I'd like to find all invalid domains (i.e. domains that return no status code at all) while crawling; I don't care about the actual linked pages. Example: if the crawler finds a link to www.microsoft.com/downloads/ie/index.html, I want it to try to reach only www.microsoft.com. I also don't want to crawl multiple links from the same domain. How would I go about doing this with PHPCrawl?
  • Uwe Hunfeld - 2015-01-06

    Hi!

    I think there is no way to achieve this directly with phpcrawl.

    But what you could do:
    Just let phpcrawl collect all external links (or just the domains of the external links) from the pages it visits, without letting it follow them, and store them somewhere (a file or a database) - see the sketch below.
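
    A minimal sketch of such a collector, assuming PHPCrawl 0.8x (the class name DomainCollector and the include path are just made up for illustration):

    ```php
    <?php
    // Adjust the include path to wherever your PHPCrawl copy lives
    include("libs/PHPCrawler.class.php");

    class DomainCollector extends PHPCrawler
    {
        public $external_hosts = array();

        // phpcrawl calls this for every document it fetches
        function handleDocumentInfo($DocInfo)
        {
            $own_host = parse_url($DocInfo->url, PHP_URL_HOST);

            // links_found contains every link parsed from the page,
            // including external ones the crawler won't follow itself
            foreach ($DocInfo->links_found as $link)
            {
                $host = parse_url($link["url_rebuild"], PHP_URL_HOST);

                // keep each external host only once (no duplicate domains)
                if (!empty($host) && $host != $own_host)
                    $this->external_hosts[$host] = true;
            }
        }
    }
    ```

    (The setup and run part, including how to keep the crawler inside the start domain, is shown after the last reply below.)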

    Afterwards, just check these collected domains for existence (e.g. again with phpcrawl and a page limit set to 1, or simply with file_get_contents()).
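
    A sketch for that second step, assuming the collected hosts were stored one per line in a file called external_hosts.txt (a made-up name, see the run sketch below); a plain DNS lookup serves as a cheap first test before the HTTP request:

    ```php
    <?php
    $hosts = file("external_hosts.txt", FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

    foreach ($hosts as $host)
    {
        // gethostbyname() returns its argument unchanged if the lookup fails
        if (gethostbyname($host) === $host && filter_var($host, FILTER_VALIDATE_IP) === false)
        {
            echo "Domain does not resolve: $host\n";
            continue;
        }

        // the domain resolves; optionally also check if it answers HTTP at all
        if (@file_get_contents("http://" . $host) === false)
            echo "No HTTP response from: $host\n";
    }
    ```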

     
  • Anonymous - 2015-01-08

    Hi, thanks for the quick reply. How do I prevent the links from being followed? With addURLFilterRule?

  • Anonymous - 2015-01-08

    If you want the crawler to stay within a domain, set the follow mode to 1 with setFollowMode(); see http://phpcrawl.cuab.de/classreferences/index.html
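
    For completeness, a sketch of the setup with that follow mode, reusing the made-up DomainCollector class from the first reply (the mode values are documented in the linked class reference):

    ```php
    <?php
    $crawler = new DomainCollector();    // subclass from the collecting sketch above
    $crawler->setURL("www.example.com");

    // follow modes: 0 = follow every link, 1 = stay in the same domain,
    // 2 = stay on the same host, 3 = stay in the same path
    $crawler->setFollowMode(1);

    $crawler->go();

    // persist the collected external hosts for the existence check
    file_put_contents("external_hosts.txt",
                      implode("\n", array_keys($crawler->external_hosts)));
    ```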

     
