
Possible to control how many times phpcrawl visits the same file (but different id)?

Help
Anonymous
2016-09-08
2016-09-15
  • Anonymous

    Anonymous - 2016-09-08

    BACKGROUND

    I've made a testing bot with PHPCrawl (PHP, Apache, Windows). I am using it to visit a few hundred pages on my site, executing each page to trigger possible errors. If a syntax error exists in any visited file, my error log will contain info about it.

    THE PROBLEM

    If I have a thousand database posts viewed by the same file, like post.php?id=1 up to post.php?id=n:

    It could be enough to test one, or maybe ten, different ids for a certain file. If one post works, it's likely all posts are working (in my case, they are).

    I HAVE FILTERS, BUT OTHER KINDS

    I have filters that make the bot avoid URLs with words like "delete" and "remove", which saves me from having a bot that deletes all my data, but those filters are defined by how the URL is formatted, not by a count or limit.
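
    Just to show what I mean, a simplified version of those filters (the real patterns on my site are a bit different):

    // Skip any URL that contains "delete" or "remove" (case-insensitive),
    // so the bot can't trigger destructive pages.
    $crawler->addURLFilterRule("#(delete|remove)#i");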

    I don't understand how to ignore just some of the URLs. I have overloaded handleDocumentInfo(), but it is called after the link has already been processed. There "must be" a function that is called when a URL is found, before it is visited, where the URL could be kept or ignored, or am I wrong?

    EXAMPLE OF HOW I WANT TO CONTROL IT

    I would like to make code similar to this example:

    if(substr($foundURL, 0, 8) == "post.php")
    {
        static $counter = 0; // how many post.php URLs have been accepted so far

        $counter++;
        if($counter > 10)
        {
            return false; // enough ids tested for this file, skip the rest
        }
        return true; // crawl this URL
    }
    

    Any good ideas? Thanks!

     
  • James Shaver

    James Shaver - 2016-09-10

    Have you tried obeyNoFollowTags?
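
    If that would work for your site, a minimal sketch (assuming you can add rel="nofollow" to the post links in your templates):

    // Tell the crawler to respect rel="nofollow" on the links it finds.
    // This only helps if the site's markup actually carries the attribute.
    $crawler->obeyNoFollowTags(true);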

     
  • Anonymous

    Anonymous - 2016-09-10

    Thanks, I have not.

    I have an idea of how to use that solution, but it would force me to edit the code on my site just to make the crawler work properly, which I'd like to avoid.

    Isn't there a way to filter the links in the crawler before they are crawled?

    like:

    // Hypothetical API - something like this does not exist in PHPCrawl:
    $crawler->foundLinkCallback("foundLink");

    function foundLink($url)
    {
        if(iWantToUseURL($url)) // insert the counting code from my first post
            return true;  // the URL will be crawled
        else
            return false; // the URL will not be crawled
    }
    

    To answer myself: I have read the documentation (local: classreferences/index.html) and none of the 12 lines from "Overriding" to "Link finding" includes this, if I'm getting it right. There are lots of possibilities for regex rules, but I guess regex is not a solution for this need, where counting is the problem.

    There is this function:
    addURLFollowRule(), which is totally useful for other purposes, but I'd like this one to exist:
    addURLFollowCallback(); - NOT EXISTING
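
    The closest thing I can see in the class reference might be the overridable handleHeaderInfo() method, which (if I read it right) is called after a document's header was received and before its content is downloaded, and skips the content when it returns a negative value. It wouldn't stop the request itself, so it's only a partial workaround, but roughly (untested, please double-check against the class reference):

    class MyCrawler extends PHPCrawler
    {
        private $postCount = 0;

        // Untested sketch: stop receiving post.php documents after the tenth one.
        // The HTTP request has already been sent at this point, so post.php still
        // runs on the server; only the download and link extraction are skipped.
        function handleHeaderInfo(PHPCrawlerResponseHeader $header)
        {
            if (strpos($header->url, "post.php") !== false)
            {
                $this->postCount++;
                if ($this->postCount > 10) return -1; // skip this document's content
            }
            return 1; // receive the content as usual
        }

        function handleDocumentInfo($DocInfo)
        {
            // ... my existing error-log checking code ...
        }
    }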

     

    Last edit: Anonymous 2016-09-10
  • James Shaver

    James Shaver - 2016-09-10

    Ok, well I'm using a textarea/db record to skip over URLs... maybe this will help you. It wouldn't be pretty including all of the pagination IDs, but it's something:

    // Parse sitemap_skip_urls textarea to an array
    $skipurls = explode("\n", str_replace("\r", "", $params->get('sitemap_skip_urls')));
    foreach($skipurls as $skip) {
        // preg_quote escapes the dot, question mark and equals sign in the URL (by
        // default) as well as all the forward slashes (because we pass '/' as the
        // $delimiter argument).
        $escapedUrl = preg_quote($skip, '/');
    
        // We enclose our regex in '/' characters here - the same delimiter we passed
        // to preg_quote
        $regex = '/\s' . $escapedUrl . '\s/';
    
        // Add the $regex to the filters
        $crawler->addURLFilterRule($regex);
    }
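
    If the skip list is really just post ids, another thought (untested, the table and column names are made up) would be to generate the rules straight from the database instead of a textarea:

    // Hypothetical sketch: add one filter rule per post id above the first ten.
    // $pdo is assumed to be an existing PDO connection to your site's database.
    $ids = $pdo->query("SELECT id FROM posts WHERE id > 10")->fetchAll(PDO::FETCH_COLUMN);
    foreach ($ids as $id)
    {
        // The filter rules are matched against the full URL, hence the leading slash.
        $crawler->addURLFilterRule("#/post\.php\?id=" . (int)$id . "$#");
    }

    That would mean thousands of rules as the site grows, though.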
    
     

    Last edit: James Shaver 2016-09-10
  • Anonymous

    Anonymous - 2016-09-11

    aha, I see, sweet hack. :)
    This will solve it, but the list of regex rules will get extremely long. I hope there is no maximum length, and I guess the crawler will be a bit slower checking such a long list for each link, but maybe it won't be noticeable.

    My content is user generated and grows by a few pages every day, so with thousands of skipped pages that list will be... something special... :)

    How many skips do you have in it? Hundreds?

    Thanks!

     
  • James Shaver

    James Shaver - 2016-09-11

    I'm not sure what you'd need, but if it's just a bunch of page IDs that you don't want to follow, it shouldn't be too bad with a couple of regexes (untested):

    // post.php?id=100000000
    // post.php?page=2&id=100000000
    // Note: the filter rules are preg patterns, so they need delimiters, and they
    // are matched against the full URL, so anchor on "/post.php" rather than "^".
    $crawler->addURLFilterRule("#/post\.php\?id=([2-9][0-9]{0,99})$#");
    $crawler->addURLFilterRule("#/post\.php\?page=([2-9][0-9]{0,9})&id=([2-9][0-9]{0,99})$#");
    

    Notice the first digit set is [2-9], so it won't match id=1 or page=1 (and anything starting with a 1, like id=10 or id=150, will also still be crawled).
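
    A quick way to sanity-check the patterns outside the crawler, with made-up URLs:

    // The filter rules are matched against the full URL, so test them that way.
    $pattern = "#/post\.php\?id=([2-9][0-9]{0,99})$#";
    var_dump(preg_match($pattern, "http://www.example.com/post.php?id=7")); // int(1) -> filtered out
    var_dump(preg_match($pattern, "http://www.example.com/post.php?id=1")); // int(0) -> still crawled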

     

    Last edit: James Shaver 2016-09-11
  • Anonymous

    Anonymous - 2016-09-15

    Great idea, this is a solution I think I am fully comfortable with!
    Thanks!

     
