Retrieve backlinks for every URL processed

Kurtubba
2013-01-09
2013-04-09
  • Kurtubba

    Kurtubba - 2013-01-09

    Great work!
    I have created a similar crawler for my Sitemap Creator (http://sitemapcreator.org/), but I would like to replace it with yours in upcoming versions. So I wonder if I can get the total number of backlinks for every URL processed, along with their positions in the crawled pages. I think that would be very handy for my sitemaps, SEO reports, or site search. Or just let me know your suggestions!

     
  • Uwe Hunfeld

    Uwe Hunfeld - 2013-01-11

    Hi!

    To get the backlinks of every URL, you could simply look at all the links found on the other pages of that website and keep a counter for every backlink found there.

    You can access all links found in a page through the properties PHPCrawlerDocumentInfo::links_found and PHPCrawlerDocumentInfo::links_found_url_descriptors (see the class reference at http://phpcrawl.cuab.de/classreferences/index.html).

    I haven't tried something like that so far, but it shouldn't be a problem (I think).
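    That counting can be sketched without the library itself. The snippet below (plain PHP, runnable standalone) tallies a backlink counter over links_found-shaped sample data; in a real crawler the inner loop would sit in an overridden handleDocumentInfo(). The sample URLs and the "url_rebuild" key layout are assumptions based on the phpcrawl 0.8 class reference.

```php
<?php
// Sketch of the backlink counter described above, run over
// links_found-shaped sample data (each entry is an array with a
// "url_rebuild" key, as in phpcrawl 0.8). In a real crawler the
// inner loop would run inside handleDocumentInfo() for each page.
function tallyBacklinks(array $pages)
{
  $backlinks = array(); // target URL => number of links pointing to it
  foreach ($pages as $links_found)
  {
    foreach ($links_found as $link)
    {
      $target = $link["url_rebuild"];
      if (!isset($backlinks[$target])) $backlinks[$target] = 0;
      $backlinks[$target]++;
    }
  }
  return $backlinks;
}

// Made-up sample: the links_found arrays of two crawled pages
$pages = array(
  array(array("url_rebuild" => "http://example.com/a.html"),
        array("url_rebuild" => "http://example.com/b.html")),
  array(array("url_rebuild" => "http://example.com/a.html")),
);

// a.html is linked from both pages, so it ends up with count 2
print_r(tallyBacklinks($pages));
```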

     
  • Kurtubba

    Kurtubba - 2013-01-11

    Thank you for your support.

    I have tried many times to add the counter to PHPCrawlerDocumentInfo::links_found_url_descriptors by modifying PHPCrawlerURLCacheBase::addURL() as follows:

      /**
       * Adds an URL to the url-cache
       *
       * @param PHPCrawlerURLDescriptor $UrlDescriptor      
       */
      public function addURL(PHPCrawlerURLDescriptor $UrlDescriptor)
      { 
        if ($UrlDescriptor == null) return;
        // Hash of the URL
        $map_key = $this->getDistinctURLHash($UrlDescriptor);
        
        // If URL already in cache -> abort
    //    if($map_key != null && isset($this->url_map[$map_key])) return;
        
        // Retrieve priority-level
        $priority_level = $this->getUrlPriority($UrlDescriptor->url_rebuild);
        //@modified
        // If URL already in cache -> add backlinks
        if($map_key != null && isset($this->url_map[$map_key])){ 
            $UrlDescriptor = & $this->urls[$priority_level][$this->url_map[$map_key]];
            // note: "$x = $x++" would leave the value unchanged, so add 1 explicitly
            $UrlDescriptor->backlinks = empty($UrlDescriptor->backlinks) ? 2 : $UrlDescriptor->backlinks + 1;
            return;
        }
        
        // Add URL to URL-Array
        $this->urls[$priority_level][] = $UrlDescriptor;
        //@modified
        //add a reference to $UrlDescriptor key in $this->urls[$priority_level] to url_maps array()
        $this->url_map[$map_key] = count($this->urls[$priority_level]) -1;
       
        // Add URL to URL-Map
    //    if ($this->url_distinct_property != self::URLHASH_NONE)
    //      $this->url_map[$map_key] = true;
      }
    

    The problem is that the PHPCrawlerURLCacheBase arrays are cleared frequently. I also tried saving the backlinks to url_map and disabling the clear() function for that array, but I ended up with only one backlink, which means there must be another variable that stops duplicates from being sent to the cache. Here is the other code:

      /**
       * Adds an URL to the url-cache
       *
       * @param PHPCrawlerURLDescriptor $UrlDescriptor      
       */
      public function addURL(PHPCrawlerURLDescriptor $UrlDescriptor)
      { 
        if ($UrlDescriptor == null) return;
    //    var_dump($this->url_map);
        // Hash of the URL
        $map_key = $this->getDistinctURLHash($UrlDescriptor);
        
        // If URL already in cache -> abort
    //    if($map_key != null && isset($this->url_map[$map_key])) return;
        // If URL already in cache -> add backlinks
        if($map_key != null && isset($this->url_map[$map_key])){ 
            // note: "$x = $x++" leaves the value unchanged, so increment directly
            $this->url_map[$map_key]++;
            return;
        }
        
        // Retrieve priority-level
        $priority_level = $this->getUrlPriority($UrlDescriptor->url_rebuild);
        
        
        // Add URL to URL-Array
    //    $this->urls[$priority_level][] = $UrlDescriptor;
        $this->urls[$priority_level][] = $UrlDescriptor;
        
        //add $UrlDescriptor key in $this->urls[$priority_level]
        $this->url_map[$map_key] = 1;
       
        // Add URL to URL-Map
    //    if ($this->url_distinct_property != self::URLHASH_NONE)
    //      $this->url_map[$map_key] = true;
      }
    
     
  • Uwe Hunfeld

    Uwe Hunfeld - 2013-01-11

    Hi again,

    I really would recommend NOT modifying the phpcrawl library itself; you won't be able to update it in a straightforward way anymore, and you will run into other problems (as you mentioned).

    Why don't you just implement the backlink counter in your own crawler class that extends the phpcrawl class?

    You could simply use an array as a property in your class that contains a counter for every URL of the domain you are crawling (or maybe a more complex data structure to store backlink positions as well).

    That would be the best (and cleanest) way, I think.
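    A minimal sketch of that approach (the class name and API below are made up for illustration, not part of phpcrawl): a plain helper object that a crawler subclass would feed from handleDocumentInfo(), counting backlinks per target URL and remembering which pages they were found on.

```php
<?php
// Sketch of the suggestion above: keep the counting in your own
// code instead of patching the library. BacklinkCounter is a
// made-up helper class; a crawler subclass would feed it like
// this (phpcrawl 0.8 API assumed):
//
//   class MyCrawler extends PHPCrawler
//   {
//     public $counter; // a BacklinkCounter instance
//     public function handleDocumentInfo(PHPCrawlerDocumentInfo $DocInfo)
//     {
//       foreach ($DocInfo->links_found as $link)
//         $this->counter->addLink($DocInfo->url, $link["url_rebuild"]);
//     }
//   }

class BacklinkCounter
{
  private $counts = array();  // target URL => number of backlinks
  private $sources = array(); // target URL => pages the links were found on

  public function addLink($source_url, $target_url)
  {
    if (!isset($this->counts[$target_url]))
    {
      $this->counts[$target_url] = 0;
      $this->sources[$target_url] = array();
    }
    $this->counts[$target_url]++;
    $this->sources[$target_url][] = $source_url;
  }

  public function getBacklinkCount($target_url)
  {
    return isset($this->counts[$target_url]) ? $this->counts[$target_url] : 0;
  }

  public function getBacklinkSources($target_url)
  {
    return isset($this->sources[$target_url]) ? $this->sources[$target_url] : array();
  }
}
```

    This keeps the library untouched, so updating it stays painless, and the sources list also covers the "position in the crawled pages" part of the original question (which page each backlink sits on).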

     
  • Kurtubba

    Kurtubba - 2013-01-11

    Thanks again :)

    Yes, actually I tried to extract the backlinks from the links_found_url_descriptors array, but there are no duplicated URLs in that array, so it does not give an accurate number of backlinks for every single page.

     
  • Nobody/Anonymous

    Yes, you are right, I didn't think about that.

    Maybe a counter inside the PHPCrawlerURLDescriptor class would do the trick, something like a property named "occurance".
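    For illustration, such a property could look like the standalone sketch below. This re-declares a stripped-down PHPCrawlerURLDescriptor (the real class lives in the library and has many more properties); the url-cache would then bump the counter instead of discarding a duplicate URL.

```php
<?php
// Standalone sketch of the suggested "occurance" counter. This is
// a stripped-down stand-in for phpcrawl's PHPCrawlerURLDescriptor,
// for illustration only -- the real class has more properties and
// a richer constructor.
class PHPCrawlerURLDescriptor
{
  public $url_rebuild;

  // Number of times this URL has occurred as a link target so far
  public $occurance = 1;

  public function __construct($url)
  {
    $this->url_rebuild = $url;
  }
}

// What the url-cache would do when it sees a URL it already holds:
$descriptor = new PHPCrawlerURLDescriptor("http://example.com/a.html");
$descriptor->occurance++; // duplicate seen -> counter is now 2
```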

     
