BACKGROUND
I've made a testing bot with PHPCrawl (PHP, Apache, Windows). I am using it to visit a few hundred pages on my site, executing each page to trigger possible errors. If a syntax error exists in any visited file, my error log will contain info about it.
THE PROBLEM
If I have a thousand database posts viewed by the same file, like: post.php?id=1 ... post.php?id=n
It could be enough to test one, or maybe ten, different ids for a certain file. If one post works, it's likely that all posts are working (in my case, it is).
I HAVE FILTERS, BUT OF OTHER KINDS
I have filters that make the bot avoid URLs containing words like "delete" and "remove", which saves me from having a bot that deletes all my data, but those filters are defined by how the URL is formatted, not by a count or limit.
I don't understand how to ignore just some of the URLs. I have overridden handleDocumentInfo(), which is called after the link has already been processed. There "must be" a function that is called when a URL is found, before it is visited, where the URL could be kept or ignored, or am I wrong?
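A minimal sketch of the kind of setup described above; the class name, start URL, filter patterns and the echo logging are placeholders:
include("libs/PHPCrawler.class.php");

class TestingBot extends PHPCrawler
{
    // PHPCrawl calls this after each page has already been requested,
    // so by this point the URL has been visited.
    function handleDocumentInfo($DocInfo)
    {
        // Log which URL was hit and the HTTP status it returned.
        echo $DocInfo->url . " -> " . $DocInfo->http_status_code . "\n";
    }
}

$crawler = new TestingBot();
$crawler->setURL("http://www.example.com/");

// Never visit URLs that could modify or destroy data.
$crawler->addURLFilterRule("#delete#i");
$crawler->addURLFilterRule("#remove#i");

$crawler->go();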
Example of how I want to control it
I would like to make code similar to this example:
Any good ideas? Thanks!
Have you tried obeyNoFollowTags?
Thanks, I have not.
I have an idea of how to use that solution, but it forces me to edit the code on my site for the crawler to work properly, which is something I'd like to avoid.
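For context, using that suggestion would look roughly like this (a sketch; the rel="nofollow" markup below is just an illustration):
// obeyNoFollowTags() is the crawler-side switch, assuming $crawler is the
// PHPCrawler subclass set up for the testing bot.
$crawler->obeyNoFollowTags(true);
// ...but it only helps if the links on the site itself are marked up, e.g.:
//   <a href="post.php?id=123" rel="nofollow">Post 123</a>
// and adding that markup is exactly the kind of site-side editing I'd rather avoid.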
Isn't there a way to filter the links in the crawler before they are crawled?
like:
foundLinkCallback(foundLink);

function foundLink($url)
{
    if (iWantToUseURL)   // Insert code in original post
        return true;     // The URL will be crawled
    else
        return false;    // The URL will not be crawled
}
To answer myself: I have read the documentation (local: classreferences/index.html) and none of the 12 lines from "Overriding" to "Link finding" includes this, if I'm getting it right. There are lots of possibilities for regex rules, but I guess regex is not a solution for this need, where counting is the problem.
There is this function:
addURLFollowRule(), which is totally useful for other purposes, but I'd like this one to exist:
addURLFollowCallback(); - NOT EXISTING
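For comparison, addURLFollowRule() takes a regex, not a callback; as far as I understand the class reference, once any follow-rule has been added, only URLs that match one of the rules get followed. A rough, untested sketch:
// Only follow post.php URLs with a single-digit id. Every other URL found on
// the pages would then be ignored as well, which is why a callback would be
// nicer for this "test just a few ids" case.
$crawler->addURLFollowRule('#post\.php\?id=[0-9]$#');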
Last edit: Anonymous 2016-09-10
Ok, well I'm using a textarea/db record to skip over URLs... maybe this will help you. It wouldn't be pretty including all of the pagination IDs, but it's something:
// Parse sitemap_skip_urls textarea to an array
$skipurls = explode("\n", str_replace("\r", "", $params->get('sitemap_skip_urls')));
foreach ($skipurls as $skip) {
    // preg_quote escapes the dot, question mark and equals sign in the URL (by
    // default) as well as all the forward slashes (because we pass '/' as the
    // $delimiter argument).
    $escapedUrl = preg_quote($skip, '/');
    // We enclose our regex in '/' characters here - the same delimiter we passed
    // to preg_quote
    $regex = '/\s' . $escapedUrl . '\s/';
    // Add the $regex to the filters
    $crawler->addURLFilterRule($regex);
}
Last edit: James Shaver 2016-09-10
aha, I see, sweet hack. :)
This will solve it, but the regex string will be extremely long. I hope there is no max length, and I guess the crawler will be a bit slower parsing such a long regex for each link, but maybe not noticeably.
My content is user-generated and increases by a few pages every day, so with thousands of skipped pages this string will be... something special... :)
How many skips do you have in it? Hundreds?
Thanks!
I'm not sure what you'd need, but if it's just a bunch of page IDs that you don't want to follow, it shouldn't be too bad with a couple of regexes:
(untested)
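Presumably something along these lines (an untested sketch; the id and page parameter names are just taken from the example URLs above):
// Filter out most pagination/post ids so only a representative one or two
// per file get crawled. The leading [2-9] keeps id=1 and page=1 crawlable
// (ids that start with a 1, like 10 or 150, would also still be crawled,
// but for a smoke test those are just a few extra pages).
$crawler->addURLFilterRule('#[?&]id=[2-9][0-9]*#');
$crawler->addURLFilterRule('#[?&]page=[2-9][0-9]*#');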
Notice the first digit set is [2-9], so it won't match the first id=1 or page=1
Last edit: James Shaver 2016-09-11
Great idea, this is a solution I think I am fully comfortable with!
Thanks!