Possible to control how many times phpcrawl visits the same file (but different id)?

Forum: Help
Created by Anonymous, 2016-09-08; last updated 2016-09-15
  • Anonymous

    Anonymous - 2016-09-08

    BACKGROUND

    I've made a testing bot with PHPCrawl (PHP, Apache, Windows). I am using it to visit a few hundred pages on my site, executing them to trigger possible errors. If a syntax error exists in any visited file, my error log will contain info about it.

    THE PROBLEM

    If I have a thousand database posts viewed by the same file, like: post.php?id=1 through post.php?id=n

    It could be enough to test one, or maybe ten, different ids for a certain file. If one post works, it's likely that all posts are working (in my case, it is).

    I HAVE FILTERS, BUT OTHER KINDS

    I have filters that make the bot avoid URLs containing words like "delete" and "remove", which keeps the bot from deleting all my data, but those filters are defined by how the URL is formatted, not by any count or limit.

    I don't understand how to ignore just some of the URLs. I have overloaded handleDocumentInfo(), but it is called after the link has already been processed. There "must be" a function that is called when a URL is found, before it is visited, so that the URL can be kept or ignored at that point, or am I wrong?
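
    For reference, my handleDocumentInfo() override is basically just quickstart-style logging (simplified sketch; the class name MyTestCrawler and the exact logging are placeholders):

    // Rough sketch of my current setup. handleDocumentInfo() is only called
    // AFTER PHPCrawl has already requested the page, so it is too late to
    // skip the URL here.
    // (assuming PHPCrawl is loaded, e.g. include("libs/PHPCrawler.class.php");)
    class MyTestCrawler extends PHPCrawler
    {
        function handleDocumentInfo($DocInfo)
        {
            // $DocInfo->url and $DocInfo->http_status_code are provided by PHPCrawl
            echo $DocInfo->url . " (" . $DocInfo->http_status_code . ")\n";
        }
    }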

    Example of how I want to control it

    I would like to make code similar to this example:

    if(substr($foundURL, 0, 8) == "post.php")
    {
        static $counter = 0; // count how many post.php URLs have been accepted
        $counter++;
        if($counter > 10)
        {
            return false; // skip: we've already tested enough post.php ids
        }
        return true; // crawl this URL
    }
    

    Any good ideas? Thanks!

     
  • James Shaver

    James Shaver - 2016-09-10

    Have you tried obeyNoFollowTags?
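
    If it helps, it's just a one-line crawler setting, but as far as I know it only works when the site itself marks the links it wants skipped as nofollow, so it means touching your markup:

    // Assuming $crawler is your PHPCrawler subclass instance:
    // tell the crawler to respect "nofollow" hints on links / in the robots meta tag.
    $crawler->obeyNoFollowTags(true);

    // ...and on the site side, the links to skip would need something like:
    // <a href="post.php?id=123" rel="nofollow">Post 123</a>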

     
  • Anonymous

    Anonymous - 2016-09-10

    Thanks, I have not.

    I have an idea of how I could use that solution, but it would force me to edit the code on my site just to make the crawler work properly, which is something I'd like to avoid.

    Isn't there a way to filter the links in the crawler before they are crawled?

    like:

    foundLinkCallback(foundLink);
    
    function foundLink($url)
    {
        if(iWantToUseURL) // Insert code in original post
            return true; // The URL will be crawled
        else
            return false; // the URL will not be crawled
    }
    

    To answer myself: I have read the documentation (local: classreferences/index.html), and none of the 12 lines from "Overriding" to "Link finding" covers this, if I'm getting it right. There are lots of possibilities for regex rules, but I guess regex is not a solution for this need, where counting is the problem.

    There is this function:
    addURLFollowRule(), which is totally useful for other purposes, but I'd like this one to exist:
    addURLFollowCallback(); - NOT EXISTING

     

    Last edit: Anonymous 2016-09-10
  • James Shaver

    James Shaver - 2016-09-10

    Ok, well I'm using a textarea/db record to skip over URLs... maybe this will help you. It wouldn't be pretty to include all of the pagination IDs, but it's something:

    // Parse sitemap_skip_urls textarea to an array
    $skipurls = explode("\n", str_replace("\r", "", $params->get('sitemap_skip_urls')));
    foreach($skipurls as $skip) {
        // preg_quote escapes the dot, question mark and equals sign in the URL (by
        // default) as well as all the forward slashes (because we pass '/' as the
        // $delimiter argument).
        $escapedUrl = preg_quote($skip, '/');
    
        // We enclose our regex in '/' characters here - the same delimiter we passed
        // to preg_quote
        $regex = '/\s' . $escapedUrl . '\s/';
    
        // Add the $regex to the filters
        $crawler->addURLFilterRule($regex);
    }
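
    To make that concrete, here's what a single entry turns into (the entry post.php?id=123 is just made up):

    $skip = "post.php?id=123";             // one line from the textarea
    $escapedUrl = preg_quote($skip, '/');  // "post\.php\?id\=123"
    $regex = '/\s' . $escapedUrl . '\s/';  // "/\spost\.php\?id\=123\s/"
    echo $regex . "\n";                    // the rule passed to addURLFilterRule()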
    
     

    Last edit: James Shaver 2016-09-10
  • Anonymous

    Anonymous - 2016-09-11

    Aha, I see, sweet hack. :)
    This will solve it, but the list of regex filters will be extremely long. I hope there is no maximum length, and I guess the crawler will be a bit slower checking such a long list for each link, but maybe it won't be noticeable.

    My content is user generated and grows by a few pages every day, so with thousands of skipped pages this list will be.... something special... :)

    How many skips do you have in it? Hundreds?

    Thanks!

     
  • James Shaver

    James Shaver - 2016-09-11

    I'm not sure what you'd need, but if it's just a bunch of page IDs that you don't want to follow, it shouldn't be too bad with a couple of regexes:
    (untested)

    // Filters e.g. post.php?id=100000000 (any id other than 1)
    // and post.php?page=2&id=100000000 (any page other than 1)
    $crawler->addURLFilterRule('/post\.php\?id=([2-9]|[1-9][0-9]+)$/');
    $crawler->addURLFilterRule('/post\.php\?page=([2-9]|[1-9][0-9]+)&id=[0-9]+$/');
    

    Notice the id/page group only matches values of 2 or higher ([2-9] on its own, or a multi-digit number), so post.php?id=1 and page=1 will still be crawled.
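
    A quick way to sanity-check them before handing them to addURLFilterRule() is to run preg_match() over a few sample URLs (the example.com URLs below are made up):

    $rules = array(
        '/post\.php\?id=([2-9]|[1-9][0-9]+)$/',
        '/post\.php\?page=([2-9]|[1-9][0-9]+)&id=[0-9]+$/',
    );

    $samples = array(
        "http://www.example.com/post.php?id=1",            // should stay crawlable
        "http://www.example.com/post.php?id=100000000",    // should be filtered
        "http://www.example.com/post.php?page=2&id=12345", // should be filtered
    );

    foreach ($samples as $url) {
        $filtered = false;
        foreach ($rules as $rule) {
            if (preg_match($rule, $url)) {
                $filtered = true;
                break;
            }
        }
        echo $url . " => " . ($filtered ? "filtered" : "crawled") . "\n";
    }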

     

    Last edit: James Shaver 2016-09-11
  • Anonymous

    Anonymous - 2016-09-15

    Great idea! This is a solution I think I am fully comfortable with.
    Thanks!

     
