How to only check for 404 domains when crawling?

  • Anonymous - 2015-01-06

    I'd like to find all invalid domains (i.e. domains that return no status code at all) while crawling; I don't care about the actual linked pages. Example: if the crawler finds a link to www.microsoft.com/downloads/ie/index.html, I want it to try to reach only www.microsoft.com. I also don't want to crawl multiple links from the same domain. How would I go about doing this with PHPCrawl?
  • Uwe Hunfeld - 2015-01-06

    Hi!

    I think there is no way to achieve this directly with phpcrawl.

    But what you could do:
    Just let phpcrawl collect all external links (or just the domains of the external links) from the pages it visits, without letting it follow them, and store them somewhere (a file or a database) - see the sketch below.
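
    A minimal sketch of such a collector, assuming PHPCrawl 0.8x (the class name DomainCollector and the include path are just made up for illustration):

    ```php
    <?php
    // Adjust the include path to wherever your PHPCrawl copy lives
    include("libs/PHPCrawler.class.php");

    class DomainCollector extends PHPCrawler
    {
        public $external_hosts = array();

        // phpcrawl calls this for every document it fetches
        function handleDocumentInfo($DocInfo)
        {
            $own_host = parse_url($DocInfo->url, PHP_URL_HOST);

            // links_found contains every link parsed from the page,
            // including external ones the crawler won't follow itself
            foreach ($DocInfo->links_found as $link)
            {
                $host = parse_url($link["url_rebuild"], PHP_URL_HOST);

                // keep each external host only once (no duplicate domains)
                if (!empty($host) && $host != $own_host)
                    $this->external_hosts[$host] = true;
            }
        }
    }
    ```

    (The setup and run part, including how to keep the crawler inside the start domain, is shown after the last reply below.)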

    Afterwards, just check these collected domains for existence (e.g. again with phpcrawl and a page limit set to 1, or simply with file_get_contents()).
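
    A sketch for that second step, assuming the collected hosts were stored one per line in a file called external_hosts.txt (a made-up name, see the run sketch below); a plain DNS lookup serves as a cheap first test before the HTTP request:

    ```php
    <?php
    $hosts = file("external_hosts.txt", FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

    foreach ($hosts as $host)
    {
        // gethostbyname() returns its argument unchanged if the lookup fails
        if (gethostbyname($host) === $host && filter_var($host, FILTER_VALIDATE_IP) === false)
        {
            echo "Domain does not resolve: $host\n";
            continue;
        }

        // the domain resolves; optionally also check if it answers HTTP at all
        if (@file_get_contents("http://" . $host) === false)
            echo "No HTTP response from: $host\n";
    }
    ```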

     
  • Anonymous - 2015-01-08

    Hi, thanks for the quick reply. How do I prevent the links from being followed? With addURLFilterRule?

  • Anonymous - 2015-01-08

    If you want the crawler to stay within a domain, set the follow mode to 1 with setFollowMode(); see http://phpcrawl.cuab.de/classreferences/index.html
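
    For completeness, a sketch of the setup with that follow mode, reusing the made-up DomainCollector class from the first reply (the mode values are documented in the linked class reference):

    ```php
    <?php
    $crawler = new DomainCollector();    // subclass from the collecting sketch above
    $crawler->setURL("www.example.com");

    // follow modes: 0 = follow every link, 1 = stay in the same domain,
    // 2 = stay on the same host, 3 = stay in the same path
    $crawler->setFollowMode(1);

    $crawler->go();

    // persist the collected external hosts for the existence check
    file_put_contents("external_hosts.txt",
                      implode("\n", array_keys($crawler->external_hosts)));
    ```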

     
