Page requested: http://www.testwebsite/test-page/( (404)
Note that the above isn't an actual working link, just an example. I've provided only one instance here, but this happens in multiple instances.
The crawler is appending an opening parenthesis "(" to the end of some URLs, and those requests return 404.
Any idea what's causing this?
Thanks,
Mark
Last edit: Mark 2013-10-15
Hi Mark,
The crawler sometimes finds "links" in pages that are not links at all, for example in JavaScript code (something like image.src = "("+xyz....). This is a known issue, see this bug: http://sourceforge.net/p/phpcrawl/bugs/25/
You could try setting enableAggressiveLinkSearch() to FALSE; it may help in your case.
Otherwise you can just ignore these links, as they don't exist.
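For reference, a minimal sketch of where that setting would go in a typical crawler setup. The include path, the class name MyCrawler, and the start URL are assumptions for illustration only; adjust them to your own installation.

```php
<?php
// Assumed include path; point this at wherever PHPCrawl lives in your project.
include("libs/PHPCrawler.class.php");

// Hypothetical subclass name, used only for this sketch.
class MyCrawler extends PHPCrawler
{
    function handleDocumentInfo($DocInfo)
    {
        // Print each requested URL together with its HTTP status code.
        echo "Page requested: " . $DocInfo->url . " (" . $DocInfo->http_status_code . ")\n";
    }
}

$crawler = new MyCrawler();
$crawler->setURL("http://www.testwebsite/"); // placeholder URL from the example above
$crawler->enableAggressiveLinkSearch(false); // only follow links found in regular link tags
$crawler->go();
```

As noted above, this may not remove every bogus link, so it is worth checking the output after changing it.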
Thanks.
I tried setting enableAggressiveLinkSearch() to FALSE, but I'm still getting the same issue.
I was taking a look at the class files to locate the function that parses the URLs. My idea was to add a regular expression that catches any parenthesis or backslash at the end of a URL and returns false.
Would you be able to give me a starting point on which part of the class to look at?
Thanks,
Mark
Ah, never mind, I found $DocInfo->url. I can go ahead and parse that.
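In case it helps anyone else, here is a rough sketch of the kind of check I mean. Both function names are my own (hypothetical, not part of PHPCrawl), and it assumes the only unwanted trailing characters are parentheses and backslashes.

```php
<?php
// Hypothetical helper (not part of PHPCrawl): returns true when the URL
// ends in a parenthesis or a backslash, which suggests a bogus link.
function hasStrayTrailingChar(string $url): bool
{
    // [()\\] matches "(", ")" or a literal backslash; \z anchors at the very end.
    return preg_match('/[()\\\\]\z/', $url) === 1;
}

// Hypothetical helper: removes any run of trailing parentheses or
// backslashes and returns the cleaned URL.
function stripStrayTrailingChars(string $url): string
{
    return preg_replace('/[()\\\\]+\z/', '', $url);
}
```

Inside handleDocumentInfo() this could be applied to $DocInfo->url, either to skip the entry or to log the cleaned URL. One caveat: some legitimate URLs end in ")", so checking only for "(" and the backslash may be the safer choice.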