I am having great success with my crawler based on your code… the only problem I am having right now is that sometimes I see URLs like this which are invalid (404s):
http://www.churchbuzz.org/Church-Content-Management/(
I have looked at the source of the page where this link is being found:
http://www.churchbuzz.org/Church-Content-Management/Church-Website-Maintenance.htm
and don't see any link like this…
Any ideas?
Thanks!
Patrick
Here is some additional debug I output while processing this url:
Thanks!
Patrick
Hi Patrick,
This is because phpcrawl looks for links in the entire content of a document/page, even in the <script> parts of a website. So it sometimes finds links that are not really links (e.g. from JavaScript code).
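To illustrate the effect, here is a minimal, self-contained sketch. It is not phpcrawl's actual link finder: the page content, the deliberately loose pattern, and the helper findLinks() are all made up for this example. It only shows how a search over the whole document can pick up a "link" from script code, while a search that skips <script> blocks does not.

```php
<?php
// Hypothetical page content; not taken from the site discussed in this thread.
$html = <<<'HTML'
<a href="Church-Website-Maintenance.htm">Maintenance</a>
<script>
var link = document.createElement("a");
link.href=( base + page );
</script>
HTML;

// Deliberately loose pattern (an assumption, not phpcrawl's real regex).
$pattern = '#href\s*=\s*["\']?([^"\'\s>]+)#i';

// Hypothetical helper: return every captured href value in $content.
function findLinks(string $content, string $pattern): array
{
    preg_match_all($pattern, $content, $matches);
    return $matches[1];
}

// Searching the whole document also matches inside the <script> block.
$everywhere = findLinks($html, $pattern);

// Dropping <script> blocks first leaves only the real anchor.
$withoutScripts = preg_replace('#<script\b[^>]*>.*?</script>#is', '', $html);
$markupOnly = findLinks($withoutScripts, $pattern);

print_r($everywhere);
print_r($markupOnly);
```

Here $everywhere contains both "Church-Website-Maintenance.htm" and a stray "(", while $markupOnly contains only the real link. A stray "(" resolved against the page's directory would give a URL of the same shape as the 404 reported above, although the real cause on that page may differ.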
You could try calling $crawler->enableAggressiveLinkSearch(false) (see the class reference: http://phpcrawl.cuab.de/classreferences/index.html).
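In a crawler script this might look roughly like the sketch below. The include path and the class name MyCrawler are assumptions (they follow the layout of the phpcrawl 0.8x example script), so adjust them to your own setup; the only line that matters here is the enableAggressiveLinkSearch(false) call.

```php
<?php
// Assumed include path (phpcrawl 0.8x layout); adjust to your installation.
include("libs/PHPCrawler.class.php");

// Hypothetical subclass name; your existing crawler class goes here.
class MyCrawler extends PHPCrawler
{
    // Called by phpcrawl for every document it receives.
    function handleDocumentInfo($DocInfo)
    {
        // Print status, URL and referring page so 404s can be traced back.
        echo $DocInfo->http_status_code . " " . $DocInfo->url
            . " (found on " . $DocInfo->referer_url . ")\n";
    }
}

$crawler = new MyCrawler();
$crawler->setURL("http://www.churchbuzz.org/Church-Content-Management/Church-Website-Maintenance.htm");
$crawler->addContentTypeReceiveRule("#text/html#");

// Turn off the aggressive link search before starting the crawl.
$crawler->enableAggressiveLinkSearch(false);

$crawler->go();
```

Printing the referer_url next to each status code should make it easier to check whether the invalid URLs still show up after the change.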
This is a known problem, already on the list of known bugs (http://sourceforge.net/tracker/?func=detail&aid=3555300&group_id=89439&atid=590146), and will (hopefully) get fixed in the next version.