Random URLs contain bad chars

Help
pmsteil
2013-02-27
2013-04-09
  • pmsteil

    pmsteil - 2013-02-27

    I am having great success with my crawler based on your code. The only problem right now is that I sometimes see invalid URLs like this one (they return a 404):

    http://www.churchbuzz.org/Church-Content-Management/(

    I have looked at the source of the page where this link is being found

    http://www.churchbuzz.org/Church-Content-Management/Church-Website-Maintenance.htm

    and I don't see any link like this in it.

    Any ideas?

    Thanks!
    Patrick

     
  • pmsteil

    pmsteil - 2013-02-27

    Here is some additional debug output I captured while processing this URL:

    http://www.churchbuzz.org/Church-Content-Management/(
    pageinfo->file:  (
    pageinfo->query:     
    pageinfo->url: http://www.churchbuzz.org/Church-Content-Management/(
    pageinfo->content_type: text/html
    pageinfo->path:  /Church-Content-Management/
    pageinfo->refering_link_raw:    (
    pageinfo->http_status_code: [404]
    pageinfo->referer_url: http://www.churchbuzz.org/Church-Content-Management/Church-Website-Maintenance.htm
    pageinfo->refering_link_raw: (
    

    Thanks!
    Patrick

     
  • Nobody/Anonymous

    Hi Patrick,

    This happens because phpcrawl looks for links in the entire content of a document/page, even inside the <script> sections of a website. So it sometimes finds "links" that are not really links (fragments of JavaScript code, for example).

    You can try calling $crawler->enableAggressiveLinkSearch(false), which should restrict the link search to HTML tags (see the class reference: http://phpcrawl.cuab.de/classreferences/index.html).

    This is a known problem that is already on the list of known bugs (http://sourceforge.net/tracker/?func=detail&aid=3555300&group_id=89439&atid=590146) and will (hopefully) get fixed in the next version.
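
To illustrate the behaviour described above, here is a minimal sketch in Python. It is not phpcrawl's actual code: the sample page, the regular expression, and the helper names `aggressive_links` and `tag_only_links` are all assumptions made up for this example. It shows how a pattern applied to the whole document can pick up a stray "(" from JavaScript, how that resolves to the 404 URL reported in this thread, and how looking only at real anchor tags avoids it.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Page URL taken from the thread; the HTML below is a hypothetical stand-in.
PAGE_URL = ("http://www.churchbuzz.org/Church-Content-Management/"
            "Church-Website-Maintenance.htm")
HTML = """
<html><body>
<a href="index.htm">Home</a>
<script>
  var link = document.createElement("a");
  link.href = ( useSsl ? "https" : "http" ) + "://example.org/";
</script>
</body></html>
"""

# Assumed "aggressive" pattern: anything that follows href= anywhere in the text.
HREF_ANYWHERE = re.compile(r"""href\s*=\s*["']?([^"'\s>]+)""", re.IGNORECASE)


def aggressive_links(html):
    """Search the whole document, including the contents of <script> blocks."""
    return HREF_ANYWHERE.findall(html)


class AnchorCollector(HTMLParser):
    """Collects href values from <a> start tags only."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def tag_only_links(html):
    """Parse the markup; script bodies are treated as plain data, not tags."""
    collector = AnchorCollector()
    collector.feed(html)
    return collector.links


if __name__ == "__main__":
    print(aggressive_links(HTML))   # ['index.htm', '(']
    print(tag_only_links(HTML))     # ['index.htm']
    # The bogus "(" resolves relative to the page, giving the 404 URL.
    print(urljoin(PAGE_URL, "("))
    # http://www.churchbuzz.org/Church-Content-Management/(
```

The trade-off is the usual one: a tag-only search will miss links that really are built by scripts, so whether to turn off the aggressive search depends on the site being crawled.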

     
