
Possible to control how many times phpcrawl visits the same file (but different id)?

Help
Anonymous
2016-09-08
2016-09-15
  • Anonymous

    Anonymous - 2016-09-08

    BACKGROUND

    I've made a testing bot with PHPCrawl (PHP, Apache, Windows). I am using it to visit a few hundred pages on my site, executing each page to trigger possible errors. If a syntax error exists in any visited file, my error log will contain info about it.

    THE PROBLEM

    If I have a thousand database posts viewed by the same file, like post.php?id=1 up to post.php?id=n:

    It could be enough to test one, or maybe ten, different ids for a certain file. If one post works, it's likely all posts are working (in my case, they are).

    I HAVE FILTERS, BUT OTHER KINDS

    I have filters that make the bot avoid URLs with words like "delete" and "remove", which saves me from having a bot that deletes all my data, but those filters are defined by how the URL is formatted, not by a count or limit.
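
    Just to show what I mean, a simplified version of those filters (the real patterns on my site are a bit different):

    // Skip any URL that contains "delete" or "remove" (case-insensitive),
    // so the bot can't trigger destructive pages.
    $crawler->addURLFilterRule("#(delete|remove)#i");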

    I don't understand how to ignore just some of the URLs. I have overloaded handleDocumentInfo(), but it is called after the link has already been processed. There "must be" a function that is called when a URL is found, before it is visited, where the URL could be kept or ignored, or am I wrong?

    EXAMPLE OF HOW I WANT TO CONTROL IT

    I would like to make code similar to this example:

    if(substr($foundURL, 0, 8) == "post.php")
    {
        static $counter = 0; // how many post.php URLs have been accepted so far

        $counter++;
        if($counter > 10)
        {
            return false; // enough ids tested for this file, skip the rest
        }
        return true; // crawl this URL
    }
    

    Any good ideas? Thanks!

     
  • James Shaver

    James Shaver - 2016-09-10

    Have you tried obeyNoFollowTags?
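
    If that would work for your site, a minimal sketch (assuming you can add rel="nofollow" to the post links in your templates):

    // Tell the crawler to respect rel="nofollow" on the links it finds.
    // This only helps if the site's markup actually carries the attribute.
    $crawler->obeyNoFollowTags(true);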

     
  • Anonymous

    Anonymous - 2016-09-10

    Thanks, I have not.

    I have an idea of how to use that solution, but it would force me to edit the code on my site just to make the crawler work properly, which I'd like to avoid.

    Isn't there a way to filter the links in the crawler before they are crawled?

    like:

    // Hypothetical API - something like this does not exist in PHPCrawl:
    $crawler->foundLinkCallback("foundLink");

    function foundLink($url)
    {
        if(iWantToUseURL($url)) // insert the counting code from my first post
            return true;  // the URL will be crawled
        else
            return false; // the URL will not be crawled
    }
    

    To answer myself: I have read the documentation (local: classreferences/index.html) and none of the 12 lines from "Overriding" to "Link finding" includes this, if I'm getting it right. There are lots of possibilities for regex rules, but I guess regex is not a solution for this need, where counting is the problem.

    There is this function:
    addURLFollowRule(), which is totally useful for other purposes, but I'd like this one to exist:
    addURLFollowCallback(); - NOT EXISTING
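
    The closest thing I can see in the class reference might be the overridable handleHeaderInfo() method, which (if I read it right) is called after a document's header was received and before its content is downloaded, and skips the content when it returns a negative value. It wouldn't stop the request itself, so it's only a partial workaround, but roughly (untested, please double-check against the class reference):

    class MyCrawler extends PHPCrawler
    {
        private $postCount = 0;

        // Untested sketch: stop receiving post.php documents after the tenth one.
        // The HTTP request has already been sent at this point, so post.php still
        // runs on the server; only the download and link extraction are skipped.
        function handleHeaderInfo(PHPCrawlerResponseHeader $header)
        {
            if (strpos($header->url, "post.php") !== false)
            {
                $this->postCount++;
                if ($this->postCount > 10) return -1; // skip this document's content
            }
            return 1; // receive the content as usual
        }

        function handleDocumentInfo($DocInfo)
        {
            // ... my existing error-log checking code ...
        }
    }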

     

    Last edit: Anonymous 2016-09-10
  • James Shaver

    James Shaver - 2016-09-10

    Ok, well I'm using a textarea/db record to skip over URLs... maybe this will help you. It wouldn't be pretty including all of the pagination IDs, but it's something:

    // Parse sitemap_skip_urls textarea to an array
    $skipurls = explode("\n", str_replace("\r", "", $params->get('sitemap_skip_urls')));
    foreach($skipurls as $skip) {
        // preg_quote escapes the dot, question mark and equals sign in the URL (by
        // default) as well as all the forward slashes (because we pass '/' as the
        // $delimiter argument).
        $escapedUrl = preg_quote($skip, '/');
    
        // We enclose our regex in '/' characters here - the same delimiter we passed
        // to preg_quote
        $regex = '/\s' . $escapedUrl . '\s/';
    
        // Add the $regex to the filters
        $crawler->addURLFilterRule($regex);
    }
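
    If the skip list is really just post ids, another thought (untested, the table and column names are made up) would be to generate the rules straight from the database instead of a textarea:

    // Hypothetical sketch: add one filter rule per post id above the first ten.
    // $pdo is assumed to be an existing PDO connection to your site's database.
    $ids = $pdo->query("SELECT id FROM posts WHERE id > 10")->fetchAll(PDO::FETCH_COLUMN);
    foreach ($ids as $id)
    {
        // The filter rules are matched against the full URL, hence the leading slash.
        $crawler->addURLFilterRule("#/post\.php\?id=" . (int)$id . "$#");
    }

    That would mean thousands of rules as the site grows, though.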
    
     

    Last edit: James Shaver 2016-09-10
  • Anonymous

    Anonymous - 2016-09-11

    aha, I see, sweet hack. :)
    This will solve it, but the list of regex rules will get extremely long. I hope there is no maximum length, and I guess the crawler will be a bit slower checking such a long list for each link, but maybe it won't be noticeable.

    My content is user generated and grows by a few pages every day, so with thousands of skipped pages that list will be... something special... :)

    How many skips do you have in it? Hundreds?

    Thanks!

     
  • James Shaver

    James Shaver - 2016-09-11

    I'm not sure what you'd need, but if it's just a bunch of page IDs that you don't want to follow, it shouldn't be too bad with a couple of regexes (untested):

    // post.php?id=100000000
    // post.php?page=2&id=100000000
    // Note: the filter rules are preg patterns, so they need delimiters, and they
    // are matched against the full URL, so anchor on "/post.php" rather than "^".
    $crawler->addURLFilterRule("#/post\.php\?id=([2-9][0-9]{0,99})$#");
    $crawler->addURLFilterRule("#/post\.php\?page=([2-9][0-9]{0,9})&id=([2-9][0-9]{0,99})$#");
    

    Notice the first digit set is [2-9], so it won't match id=1 or page=1 (and anything starting with a 1, like id=10 or id=150, will also still be crawled).
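
    A quick way to sanity-check the patterns outside the crawler, with made-up URLs:

    // The filter rules are matched against the full URL, so test them that way.
    $pattern = "#/post\.php\?id=([2-9][0-9]{0,99})$#";
    var_dump(preg_match($pattern, "http://www.example.com/post.php?id=7")); // int(1) -> filtered out
    var_dump(preg_match($pattern, "http://www.example.com/post.php?id=1")); // int(0) -> still crawled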

     

    Last edit: James Shaver 2016-09-11
  • Anonymous

    Anonymous - 2016-09-15

    Great idea, this is a solution I think I am fully comfortable with!
    Thanks!

     
