Random URLs contain bad chars

Status: Beta

Brought to you by: huni

Forum: Help

Creator: pmsteil

Created: 2013-02-27

Updated: 2013-04-09

pmsteil - 2013-02-27

I am having great success with my crawler based on your code… the only problem I am having right now is that sometimes I see URLs like this which are invalid (404s):

http://www.churchbuzz.org/Church-Content-Management/(

I have looked at the source of the page where this link is being found:

http://www.churchbuzz.org/Church-Content-Management/Church-Website-Maintenance.htm

and don't see any link like this…

Any ideas?

Thanks!
Patrick

pmsteil - 2013-02-27

Here is some additional debug output from processing this URL:

http://www.churchbuzz.org/Church-Content-Management/(
pageinfo->file: (
pageinfo->query:
pageinfo->url: http://www.churchbuzz.org/Church-Content-Management/(
pageinfo->content_type: text/html
pageinfo->path: /Church-Content-Management/
pageinfo->refering_link_raw: (
pageinfo->http_status_code: [404]
pageinfo->referer_url: http://www.churchbuzz.org/Church-Content-Management/Church-Website-Maintenance.htm
pageinfo->refering_link_raw: (

Thanks!
Patrick

Nobody/Anonymous - 2013-02-28

Hi Patrick,

This happens because phpcrawl looks for links in the entire content of a document/page, even inside the <script> sections of a website. So it sometimes finds links that are not really links (from JavaScript code, for example).
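
To illustrate the effect, here is a minimal sketch in plain PHP. The page fragment and both regular expressions are made up for this example (they are not phpcrawl's actual patterns); it only shows how a search over the whole document can pick up "(" from a JavaScript assignment, while a search limited to HTML tags does not.

```php
<?php
// Hypothetical page fragment: one real anchor plus JavaScript that assigns to .href.
$html = <<<'HTML'
<a href="Church-Website-Maintenance.htm">Maintenance</a>
<script>
  link.href = ( base + "/contact.htm" );
</script>
HTML;

// Illustrative "aggressive" pattern (NOT phpcrawl's real regex): matches any
// href/src assignment anywhere in the document, including inside <script>.
$aggressive = '/\b(?:href|src)\s*=\s*["\']?([^"\'\s>]+)/i';

// Illustrative stricter pattern: href/src only when it appears inside an HTML tag.
$tagsOnly = '/<[a-z][^>]*\s(?:href|src)\s*=\s*["\']?([^"\'\s>]+)/i';

preg_match_all($aggressive, $html, $m1);
preg_match_all($tagsOnly, $html, $m2);

print_r($m1[1]); // the real anchor target plus "(" taken from the script
print_r($m2[1]); // only the real anchor target
```

Resolved against the referring page, a raw link of "(" becomes .../Church-Content-Management/(, which matches the 404 URL reported above.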

You may try to set $crawler->enableAggressiveLinkSearch(false) (http://phpcrawl.cuab.de/classreferences/index.html).
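
A minimal sketch of where that call could go, assuming a phpcrawl version that provides enableAggressiveLinkSearch(). The include path, the class name MyCrawler and the start URL are assumptions for illustration; adjust them to your setup.

```php
<?php
// Assumption: phpcrawl is unpacked so that this include path is valid.
include "libs/PHPCrawler.class.php";

// MyCrawler is a hypothetical name; it only prints each URL with its status code.
class MyCrawler extends PHPCrawler
{
    function handleDocumentInfo($DocInfo)
    {
        echo $DocInfo->url . " [" . $DocInfo->http_status_code . "]\n";
    }
}

$crawler = new MyCrawler();
$crawler->setURL("http://www.churchbuzz.org/");
$crawler->addContentTypeReceiveRule("#text/html#");

// Turn off the aggressive search so links are not collected from script code.
$crawler->enableAggressiveLinkSearch(false);

$crawler->go();
```

With the aggressive search disabled, some links that are only generated by JavaScript may no longer be found, so it is worth comparing the crawl results before and after.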

This is a known problem that is already on the list of known bugs (http://sourceforge.net/tracker/?func=detail&aid=3555300&group_id=89439&atid=590146) and will (hopefully) get fixed in the next version.
