Great work!
I have created a similar crawler for my Sitemap Creator (http://sitemapcreator.org/), but I would like to replace it with yours in future versions. So I wonder if I can get the total number of backlinks for every processed URL, along with their positions in the crawled pages. I think that would be very handy for my sitemaps, SEO reports and site search. Or just let me know your suggestions!
To get the backlinks of every URL, you could simply look at all the links found on the other pages of that website and increment a counter for every backlink found there.
You can access all links found in a page through the properties PHPCrawlerDocumentInfo::links_found or PHPCrawlerDocumentInfo::links_found_url_descriptors (see http://phpcrawl.cuab.de/classreferences/index.html).
Didn't try something like that so far, but it shouldn't be a problem (I think).
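The counting idea above can be sketched without touching the library at all. This is a minimal, hypothetical helper: the `$pages` input stands in for the per-page link lists that PHPCrawl would report (e.g. via PHPCrawlerDocumentInfo::links_found), and the function name is made up for illustration.

```php
<?php
// Minimal sketch of the backlink-counter idea, independent of phpcrawl.
// $pages maps each crawled page's URL to the list of link targets found
// on that page.
function tallyBacklinks(array $pages): array
{
    $backlinks = array();
    foreach ($pages as $sourceUrl => $linksFound) {
        foreach ($linksFound as $targetUrl) {
            if ($targetUrl === $sourceUrl)
                continue; // a page linking to itself is not a backlink
            $backlinks[$targetUrl] = ($backlinks[$targetUrl] ?? 0) + 1;
        }
    }
    return $backlinks;
}
```

Because the outer loop runs once per crawled page, the same structure also lets you record link positions per page if you carry the source URL along with each count.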
I have tried many times to add the counter to PHPCrawlerDocumentInfo::links_found_url_descriptors by modifying PHPCrawlerURLCacheBase::addURL() as follows:
/**
 * Adds an URL to the url-cache
 *
 * @param PHPCrawlerURLDescriptor $UrlDescriptor
 */
public function addURL(PHPCrawlerURLDescriptor $UrlDescriptor)
{
    if ($UrlDescriptor == null) return;

    // Hash of the URL
    $map_key = $this->getDistinctURLHash($UrlDescriptor);

    // If URL already in cache -> abort
    // if ($map_key != null && isset($this->url_map[$map_key])) return;

    // Retrieve priority-level
    $priority_level = $this->getUrlPriority($UrlDescriptor->url_rebuild);

    // @modified
    // If URL already in cache -> increment its backlink counter
    if ($map_key != null && isset($this->url_map[$map_key]))
    {
        $UrlDescriptor = &$this->urls[$priority_level][$this->url_map[$map_key]];
        // Note: "$x = $x++" would leave the value unchanged in PHP,
        // so increment with "+ 1" instead
        $UrlDescriptor->backlinks = empty($UrlDescriptor->backlinks) ? 2 : $UrlDescriptor->backlinks + 1;
        return;
    }

    // Add URL to URL-Array
    $this->urls[$priority_level][] = $UrlDescriptor;

    // @modified
    // Map the URL-hash to the descriptor's key in $this->urls[$priority_level]
    $this->url_map[$map_key] = count($this->urls[$priority_level]) - 1;

    // Add URL to URL-Map
    // if ($this->url_distinct_property != self::URLHASH_NONE)
    //     $this->url_map[$map_key] = true;
}
The problem is that the PHPCrawlerURLCacheBase arrays are cleared frequently. I also tried saving the backlink counts to url_map and disabling the clear() function for that array, but I ended up with only one backlink per URL, which means there's another mechanism that prevents duplicates from being sent to the cache. Here is the other code:
/**
 * Adds an URL to the url-cache
 *
 * @param PHPCrawlerURLDescriptor $UrlDescriptor
 */
public function addURL(PHPCrawlerURLDescriptor $UrlDescriptor)
{
    if ($UrlDescriptor == null) return;

    // var_dump($this->url_map);

    // Hash of the URL
    $map_key = $this->getDistinctURLHash($UrlDescriptor);

    // If URL already in cache -> abort
    // if ($map_key != null && isset($this->url_map[$map_key])) return;

    // If URL already in cache -> increment its backlink counter
    if ($map_key != null && isset($this->url_map[$map_key]))
    {
        // Note: "$x = $x++" would leave the counter unchanged in PHP
        $this->url_map[$map_key]++;
        return;
    }

    // Retrieve priority-level
    $priority_level = $this->getUrlPriority($UrlDescriptor->url_rebuild);

    // Add URL to URL-Array
    $this->urls[$priority_level][] = $UrlDescriptor;

    // First occurrence of this URL -> start its backlink counter at 1
    $this->url_map[$map_key] = 1;

    // Add URL to URL-Map
    // if ($this->url_distinct_property != self::URLHASH_NONE)
    //     $this->url_map[$map_key] = true;
}
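One detail worth flagging in both attempts: the original increments were written as `$x = $x++;`, which is a classic PHP pitfall. Post-increment returns the value *before* incrementing, and that old value is then assigned back, so the variable never changes:

```php
<?php
// Post-increment returns the old value, which the assignment then writes
// back, so the increment is lost.
$n = 1;
$n = $n++;   // $n is still 1

// Increment without re-assigning, and the counter actually advances.
$m = 1;
$m++;        // $m is now 2
```

So even without the cache-clearing issue, those counters would never have advanced past their initial value.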
I really would recommend NOT to modify the phpcrawl library itself; you won't be able to update it in a straightforward way anymore, and you will run into other problems (as you noticed).
Why don't you just implement the backlink counter in your own crawler class that extends the PHPCrawler class?
You could simply use an array property in your class that holds a counter for every URL of the domain you are crawling (or maybe a more complex data structure to store backlink positions as well).
That would be the best (and clean and proper) way, I think.
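The recommended subclass could be sketched roughly like this. PHPCrawler, handleDocumentInfo() and links_found are real phpcrawl names, but the exact shape of the links_found entries (an array with an "url_rebuild" key) is assumed from the class reference linked above; the stub base class is only there so the sketch runs without the library installed.

```php
<?php
// Tiny stand-in so this sketch runs without phpcrawl installed; in a
// real project, require the library and delete this stub.
if (!class_exists('PHPCrawler')) {
    class PHPCrawler {}
}

// Sketch of the recommended approach: count backlinks in a subclass,
// leaving the library itself untouched.
class BacklinkCountingCrawler extends PHPCrawler
{
    /** @var array<string,int> backlink count per target URL */
    public $backlinks = array();

    // phpcrawl calls this override once per crawled document.
    public function handleDocumentInfo($DocInfo)
    {
        foreach ($DocInfo->links_found as $link) {
            $target = $link['url_rebuild'];
            $this->backlinks[$target] = ($this->backlinks[$target] ?? 0) + 1;
        }
    }
}
```

Note that if links_found turns out to be deduplicated per page, the counts reflect the number of *linking pages* rather than individual link occurrences.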
Yes, I actually tried to extract the backlinks from the links_found_url_descriptors array; however, there are no duplicate URLs in that array, so it doesn't give an accurate number of backlinks for every single page.
Yes, you are right, I didn't think about that.
Maybe a counter inside the PHPCrawlerURLDescriptor class would do the trick, something like a property named "occurrence".
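That occurrence-counter idea could look roughly like this. These are simplified stand-in classes, not the real phpcrawl ones; they only show the shape of the change: keep the first descriptor in the cache and bump its counter on duplicates instead of discarding them.

```php
<?php
// Simplified stand-ins (not the real phpcrawl classes) sketching the
// proposed occurrence counter on the URL descriptor.
class UrlDescriptor
{
    public $url;
    public $occurrence = 1; // how many times this URL has been seen so far

    public function __construct($url)
    {
        $this->url = $url;
    }
}

class UrlCache
{
    /** @var array<string,UrlDescriptor> descriptors keyed by URL hash */
    private $urls = array();

    public function addURL(UrlDescriptor $descriptor)
    {
        $key = md5($descriptor->url); // distinct-URL hash

        // Duplicate URL -> count the extra occurrence instead of dropping it.
        if (isset($this->urls[$key])) {
            $this->urls[$key]->occurrence++;
            return;
        }

        $this->urls[$key] = $descriptor;
    }

    public function getDescriptor($url)
    {
        return isset($this->urls[md5($url)]) ? $this->urls[md5($url)] : null;
    }
}
```

Unlike the url_map patches above, the count lives on the descriptor itself, so it would survive as long as the descriptor does and could be exposed through links_found_url_descriptors without changing that array's deduplicated shape.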