Any way to not crawl a URL with ? twice

Help
2012-04-16
2014-11-25
  • I have an events calendar with URLs like mysite.com/event.html?day=1 and mysite.com/event.html?day=2, and the page is getting crawled once for each ?GET query. How do I set it so it does not crawl the page for each query? Thanks.

     
  • Still curious how to prevent this.
    Thanks

     
  • Hi!

    There's no setting that makes phpcrawl crawl a URL only once when it appears with different GET queries (?…).

    But in your case you may add a non-follow-rule like

    $crawler->addNonFollowMatch('#mysite\.com/event\.html\?day=(?!1(&|$))#')

    (so only mysite.com/event.html?day=1 will be followed).
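
    For reference, a complete minimal setup along these lines (just a sketch; the class name, include path and site URL are placeholders) could look like:

        include("libs/PHPCrawler.class.php");

        class MyCrawler extends PHPCrawler
        {
            function handleDocumentInfo(PHPCrawlerDocumentInfo $DocInfo)
            {
                // Handle each received page here
                echo $DocInfo->url."\n";
            }
        }

        $crawler = new MyCrawler();
        $crawler->setURL("www.mysite.com");
        // Don't follow event.html links whose day-parameter is anything other than 1
        $crawler->addNonFollowMatch('#mysite\.com/event\.html\?day=(?!1(&|$))#');
        $crawler->go();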

     
  • Thanks!

     
  • Any way to add this functionality? I am crawling sites that use URL parameters like ?sortOrder=1&PID=123&keywords=test

    I'd love to treat all URLs that have PID=123 as the same, regardless of the rest. Any suggestions would be MUCH appreciated!

     
  • Martin Larsen
    2012-11-20

    A nice solution would be a regex filter to apply to the url before phpcrawl decides that it has been crawled before. Using that you can strip any part of the url you want to ignore.

     
  • Uwe Hunfeld
    2012-11-20

    That's a nice idea.

    Let's say there are URLs like http://mysite.com/event.html?day=2 (day=3, day=4 and so on).

    So something like $crawler->setUrlDistinctFilter("#mysite.com/event.html#", "#day=\d+#") would have the effect that
    the crawler ignores the "day=..." part in every URL containing "mysite.com/event.html" when it comes
    to deciding whether the URL was already crawled?

    Do you know a better name for this method? ;)

     
  • setUrlDistinctFilter is a good name as it pinpoints what it is used for :)

    However, to be more versatile, I actually think the filter should take 3 arguments. By adding a replacement argument you could normalize the URL in many ways. And if you need to delete a part, just give an empty string as the replacement (or let it be the default value).

    Perhaps even better, we could simply imitate preg_replace(). For example, this will remove the day part:

    $crawler->setUrlDistinctFilter("#(mysite.com/event.html)\?day=\d+#", '$1')

    It works by replacing the url with the part before the question mark if the regex matches.

    The first version is probably easier to understand while the preg_replace imitation is more powerful.
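
    To illustrate, this preg_replace-style variant would behave like a plain preg_replace() applied to every found URL before the "already crawled?" check (setUrlDistinctFilter is only a proposal at this point, so the snippet just sketches the intended rewriting):

        $url = "http://mysite.com/event.html?day=7";
        $normalized = preg_replace('#(mysite\.com/event\.html)\?day=\d+#', '$1', $url);
        // $normalized is now "http://mysite.com/event.html", so day=1, day=2, ...
        // would all collapse to the same entry in the crawler's URL cache.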

     
  • Martin Larsen
    2012-11-20

    Oh, I didn't know I was logged out, so I'm writing this to make sure I will get a notification on reply :)

     
  • Martin Larsen
    2012-11-20

    Thinking more about it, I think the best solution is using your version but with a replacement parameter and allowing for regex substitution. Thus, let's say we want to change day to month, we could use:

    $crawler->setUrlDistinctFilter("#mysite.com/event.html#", "#day=(\d+)#", 'month=$1')

    This is very versatile but still easy to understand as the user does not need to use substitution. In this simple example it could be written as:

    $crawler->setUrlDistinctFilter("#mysite.com/event.html#", "#day#", "month")

    The substitution allows for advanced uses though.
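
    Emulated in plain PHP, the three-argument form would amount to something like this (again only a sketch of the proposed behaviour, not an existing phpcrawl method):

        $url = "http://mysite.com/event.html?day=5";

        // Apply the substitution only to URLs matching the first pattern
        if (preg_match('#mysite\.com/event\.html#', $url))
        {
            $url = preg_replace('#day=(\d+)#', 'month=$1', $url);
        }
        // $url is now "http://mysite.com/event.html?month=5"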

     
  • Uwe Hunfeld
    2012-11-20

    OK, I like both of your solutions!

    The first one (replacement) works exactly like Apache's mod_rewrite, for example, so people who are familiar
    with it will know right away how to use it.
    We could even adopt that naming for the method, like "addUrlRewriteRule".

    Your second solution is a little easier to understand in general, I think.

    I'll open a feature-request for this.

     
  • Martin Larsen
    2012-11-20

    addUrlRewriteRule is nice :)

    In a perfect world we could have both versions with the two names suggested.

    But any of the solutions is fine for me.

     

  • Anonymous
    2012-12-13

    Hey there. I need this functionality for my current project. Currently I am 'stripping' / rewriting the URL in the handleDocumentInfo() function, and checking if it's not the same as a previous URL, but that's too late, because the crawler will already have followed and downloaded all of the permutations of similar links.

    So I was thinking about the kind of implementation I would need, and I'd opt to extend the PHPCrawler::processUrl function to not only call PHPCrawlerURLFilter::filterUrls, but also a new abstract function stripUrls, so that all found URLs are stripped before they are added to the list of to-be-followed links. That function defaults to "keep the URL the same", but can be overridden by the user in any kind of complex way. E.g. I need a pretty complex procedure like the one below to remove certain parts of the URL, keep only selected GET parameters and sort them, so that only the sorted permutation is regarded as unique. I would feel limited if I could only use a regex for this, since a function like parse_url() is pretty useful in my case.

    What do you guys think? It wouldn't be that hard to implement and pretty customizable IMO, but maybe I'm missing something.

        private function getStrippedUrl($url)
        {
            // Split the URL into its components (scheme, host, path, query, ...)
            $pieces = parse_url($url);

            // Default values for components that may be missing from the URL
            $scheme = 'http';
            $host   = '';
            $path   = '/';
            $query  = '';
            extract($pieces);

            $queryfields = preg_split('#[;&]#', $query, -1, PREG_SPLIT_NO_EMPTY);
            foreach ($queryfields as $key => $val)
            {
                // Keep only the 'page=' and 's=' query variables
                if (!preg_match('#^page=#', $val) && !preg_match('#^s=#', $val))
                    unset($queryfields[$key]);
                // Remove a 'page=1' query parameter when it exists
                if (preg_match('#^page=1$#', $val))
                    unset($queryfields[$key]);
            }

            // Sort the remaining parameters so that different orderings
            // of the same parameters result in the same URL
            sort($queryfields);
            $query = join('&', $queryfields);
            $query = ($query != '') ? "?$query" : '';

            // user, pass, port and fragment (#...) are always ignored
            return $scheme . '://' . $host . $path . $query;
        }
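
    For illustration (the URLs are made up), calling the method from inside the crawler subclass, the normalization maps different orderings of the same parameters to one stripped URL:

        $this->getStrippedUrl("http://example.com/list.php?s=shoes&PID=55&page=1");
        // -> "http://example.com/list.php?s=shoes"   (PID dropped, page=1 dropped)

        $this->getStrippedUrl("http://example.com/list.php?page=1&s=shoes");
        // -> "http://example.com/list.php?s=shoes"   (page=1 dropped, parameters sorted)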
    
     
  • Martin Larsen
    2012-12-13

    I have thought more about it, and I think the most versatile way would be to have a callback function (or method override) that is called before the crawler does anything with the url.

    Using this function one could do a multitude of things, like normalizing the URL with a regex as discussed before, or disqualifying the link altogether by returning false instead of the URL. One could even fetch the HTTP headers and look at the timestamp or file size when deciding what to do.

    Using such a callback / method override would be the best of all worlds.
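
    Sketched as a method override (the hook name rewriteUrl() here is purely made up; nothing like it exists in phpcrawl at this point), it could look like this:

        class MyCrawler extends PHPCrawler
        {
            // Hypothetical hook: called for every found URL before it is queued
            // or checked against the list of already-crawled URLs.
            protected function rewriteUrl($url)
            {
                // Normalize: strip a trailing day-parameter so that all "day=" variants
                // are treated as the same URL in the duplicate check
                $url = preg_replace('#\?day=\d+$#', '', $url);

                // Or disqualify a link entirely by returning false
                if (preg_match('#/calendar/print/#', $url))
                    return false;

                return $url;
            }
        }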

     
  • Hmm, that would work, but it would be even better if the URLs could be disqualified before the header is fetched at all.

     
  • Yes, you are absolutely right!

    Will add it to the list of feature requests.
    Any ideas of a good name for the callback-function?
    Maybe simply "prepareRequest" or something?

    The crawler could pass everything regarding the next request to that function (URL, request-header) so
    users could "manipulate" it as they want or even abort the entire request.

    What would be a good name?

    Thanks!

     
  • pmsteil
    2013-02-24

    Hello,

    I am trying to use the handleHeaderInfo() callback function… I just did this:

    function handleHeaderInfo(PHPCrawlerResponseHeader $header)
      {
        echo 'header '.$header->http_status_code.' url: '.$header->source_url;
        return -1;
      }
    

    but when I run my script, it actually calls my handleDocumentInfo() function ONCE and THEN aborts all further processing of pages…   I think it should instead print my echo statement for every page it found, but never call my handleDocumentInfo().  I am just doing this to test how to use the handleHeaderInfo() function.

    If I return TRUE, it correctly processes all 50 of the pages I wanted it to process with setPageLimit(50);

    weird!

    I looked at the PHPCrawler code (version 0.81) and found this:

        // Call header-check-callback
        $ret = 0;
        if ($this->header_check_callback_function != null)
          $ret = call_user_func($this->header_check_callback_function, $this->lastResponseHeader);
        
        // Check if content should be received
        $receive = $this->decideRecevieContent($this->lastResponseHeader);
        
        if ($ret < 0 || $receive == false)
        {
          @fclose($this->socket);
          $PageInfo->received = false;
          $PageInfo->links_found_url_descriptors = $this->LinkFinder->getAllURLs(); // Maybe found a link/redirect in the header
          $PageInfo->meta_attributes = $this->LinkFinder->getAllMetaAttributes();
          return $PageInfo;
        }
        else
        {
          $PageInfo->received = true;
        }
    

    So I don't see anything wrong in the PHPCrawler code…

    So, what am I doing wrong?

    TIA,
    Patrick

     
  • Martin Larsen
    2013-02-25

    "What would be a good name?"

    prepareRequest would be fine for me.

     
  • Uwe Hunfeld
    2013-02-27

    Hi Patrick,

    looks like everything works as expected over there.

    It works like this:
    If you return -1 from handleHeaderInfo(), the particular document will NOT be received, and you always return -1 in your script.
    So even on the first entry page you tell the crawler NOT to receive the content, and so the crawler is not able to find ANY links (no content -> no links).

    That's why the process stops immediately for you after one link; there's simply nothing more to do for the crawler
    (no links in the queue).

    handleDocumentInfo() will always be called, even if you abort the request with the handleHeaderInfo() method.
    But the flag PHPCrawlerDocumentInfo::received will be false in that case.

    Hope that helps and that it's clear now for you!
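
    A header callback that receives only HTML content (so the crawler can still extract links from pages, while images, PDFs and other files are never downloaded) might look roughly like this; note it assumes the response header exposes the content type as $header->content_type:

        function handleHeaderInfo(PHPCrawlerResponseHeader $header)
        {
            // Log the status of every requested URL, e.g. to spot 404s
            echo $header->http_status_code." ".$header->source_url."\n";

            // Skip downloading images, PDFs etc., but receive HTML pages
            // so the crawler can still extract their links.
            if ($header->content_type != "text/html")
                return -1;

            return 1;
        }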

     
  • pmsteil
    2013-02-27

    @huni, Thanks!  LOL I feel dumb… :)  That makes sense…

    What I was trying to do was to make a 404 checker that would be as efficient as possible and avoid downloading every page, but I guess it has to in order to get the next set of links to process… thanks for the help!

    Great product here… I will be sending a donation!

    Patrick

     

