I have an events calendar at mysite.com/event.html?day=1, mysite.com/event.html?day=2 and so on, which is getting crawled for each GET query. How do I set it up so it does not crawl every single query - thanks.
Still curious how to prevent this.
Thanks
Hi!
There's no setting that makes phpcrawl crawl the same URL with different GET-queries (?…) only once.
But in your case you may add a non-follow-rule like
$crawler->addNonFollowMatch("#mysite\.com/event\.html\?day=(?!1(&|$))#")
(so only mysite.com/event.html?day=1 will be followed).
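For example, a minimal setup could look something like this (just a sketch - the entry URL, the include path and the class name are placeholders, and the regex assumes you still want exactly one representative day page, day=1, to be followed):

<?php
require_once("libs/PHPCrawler.class.php"); // adjust to wherever phpcrawl is installed

class MyCrawler extends PHPCrawler
{
  function handleDocumentInfo($DocInfo)
  {
    echo $DocInfo->url . "\n"; // just report what actually got crawled
  }
}

$crawler = new MyCrawler();
$crawler->setURL("http://mysite.com/");

// Don't follow event.html?day=... links, except day=1
$crawler->addNonFollowMatch("#event\.html\?day=(?!1(&|$))#");

$crawler->go();
?>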
Thanks!
Any way to add this functionality? I am crawling sites that use URL parameters like ?sortOrder=1&PID=123&keywords=test.
I'd love to treat all URLs that have PID=123 as the same, regardless of the rest. Any suggestions would be MUCH appreciated!
A nice solution would be a regex filter applied to the URL before phpcrawl decides whether it has been crawled before. Using that you could strip any part of the URL you want to ignore.
That's a nice idea.
Let's say there are URLs like http://mysite.com/event.html?day=2 (day=3, day=4 and so on).
So something like $crawler->setUrlDistinctFilter("#mysite\.com/event\.html#", "#day=\d+#") would have the effect that the crawler ignores "day=\d+" in every URL containing "mysite.com/event.html" when deciding whether the URL was already crawled?
Do you know a better name for this method? ;)
setUrlDistinctFilter is a good name, as it pinpoints what it is used for :)
However, to be more versatile I actually think the filter should take 3 arguments. By adding a replacement argument you could normalize the URL in many ways. And if you need to delete a part, just pass an empty string as the replacement (or let that be the default value).
Perhaps even better, we could simply imitate preg_replace(). For example, this would remove the day part:
$crawler->setUrlDistinctFilter("#(mysite\.com/event\.html)\?day=\d+#", "$1")
It works by replacing the URL with the part before the question mark whenever the regex matches.
The first version is probably easier to understand, while the preg_replace imitation is more powerful.
Oh, I didn't notice I was logged out, so I'm writing this to make sure I will get a notification on reply :)
Thinking more about it, I think the best solution is using your version but with a replacement parameter and allowing for regex substitution. So, let's say we want to change day to month, we could use:
$crawler->setUrlDistinctFilter("#mysite\.com/event\.html#", "#day=(\d+)#", "month=$1")
This is very versatile but still easy to understand, as the user does not need to use substitution. In this simple example it could also be written as:
$crawler->setUrlDistinctFilter("#mysite\.com/event\.html#", "#day#", "month")
The substitution allows for advanced uses though.
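Just to illustrate: setUrlDistinctFilter() / addUrlRewriteRule() don't exist in phpcrawl yet, so the following is only a plain-PHP sketch of what the proposed "rewrite before the distinct check" would boil down to, here with an empty replacement that strips the day part:

<?php
// Sketch: normalize found URLs with preg_replace() and use the result as the
// "already crawled?" key, which is what the proposed filter would do internally.
$found_urls = array(
  "http://mysite.com/event.html?day=1",
  "http://mysite.com/event.html?day=2",
  "http://mysite.com/event.html?day=3",
);

$already_crawled = array();

foreach ($found_urls as $url)
{
  // First argument: only URLs matching this pattern get rewritten at all
  if (preg_match('#mysite\.com/event\.html#', $url))
    $distinct_key = preg_replace('#\?day=\d+#', '', $url); // second/third argument
  else
    $distinct_key = $url;

  if (isset($already_crawled[$distinct_key]))
    continue; // would be treated as "already crawled" and skipped

  $already_crawled[$distinct_key] = true;
  echo "Would crawl: " . $url . "\n"; // only ?day=1 survives here
}
?>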
OK, I like both of your solutions!
The first one (replacement) works exactly like Apache's mod_rewrite, for example, so people who are familiar with that will know right away how to use it.
We could even adopt a matching name for the method, like "addUrlRewriteRule".
Your second solution is a little easier to understand in general, I think.
I'll open a feature-request for this.
addUrlRewriteRule is nice :)
In a perfect world we could have both versions, with the two names suggested.
But any of the solutions is fine for me.
Opened the request: https://sourceforge.net/tracker/?func=detail&aid=3588762&group_id=89439&atid=590149
I like both solutions too.
Thanks for your suggestions!
Hey there. I need this functionality for my current project. Currently I am 'stripping' / rewriting the URL in the handleDocumentInfo() function and checking that it's not the same as a previous URL, but that's too late, because the crawler will already have followed and downloaded all of the permutations of similar links.
So I was thinking about the kind of implementation that I would need, and I'd opt to extend the PHPCrawler::processUrl function to not only call PHPCrawlerURLFilter::filterUrls, but also a new abstract function stripUrls, so that all found URLs are stripped before they are added to the list of to-be-followed links. That function defaults to "keep the URL the same", but can be overridden by the user in any kind of complex way. E.g. I need a pretty complex procedure like the one below, which removes certain parts of the URL but keeps certain GET parameters and sorts them, so that only the sorted permutation is regarded as unique. I would feel limited if I could only use regex for this, since a function like parse_url() is pretty useful in my case.
What do you guys think? It wouldn't be that hard to implement and it's pretty customizable IMO, but maybe I'm missing something.

private function getStrippedUrl($url)
{
    $pieces = parse_url($url);

    $query = ''; // default value
    extract($pieces);

    $queryfields = preg_split('#[\;\&]#', $query);

    foreach ($queryfields as $key => $val)
    {
        // keep only the 'page=' and 's=' query variables
        if (!preg_match('#^page\=#', $val) && !preg_match('#^s\=#', $val))
            unset($queryfields[$key]);

        // remove a page=1 query parameter when it exists
        if (preg_match('#^page\=1$#', $val))
            unset($queryfields[$key]);
    }

    sort($queryfields);

    $query = join('&', $queryfields);
    $query = $query ? "?$query" : "";

    // user, pass, port and fragment (#...) are always ignored
    $surl = $scheme . '://' . $host . $path . $query;
    return $surl;
}
I have thought more about it, and I think the most versatile way would be to have a callback function (or method override) that is called before the crawler does anything with the URL.
Using this function one could do a multitude of things, like normalizing the URL with a regex as discussed before, or disqualifying the link altogether by returning false instead of the URL, etc. One could even fetch the HTTP headers and look at the time stamp or file size when deciding what to do.
Such a callback / method override would be the best of all worlds.
Hi!
Such a callback has already been present since 0.8, it's called "handleHeaderInfo()" (if I understand you correctly).
http://phpcrawl.cuab.de/classreferences/PHPCrawler/method_detail_tpl_method_handleHeaderInfo.htm
If you let it return any negative integer, the crawler will NOT continue with that URL.
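So in this particular case something like the following override should already do it (a sketch; it assumes the content_type property of the PHPCrawlerResponseHeader object that gets passed in, as listed in the class reference):

class MyCrawler extends PHPCrawler
{
  function handleHeaderInfo($header)
  {
    // $header is a PHPCrawlerResponseHeader. Return a negative value to abort
    // the request, so the body of non-HTML documents never gets downloaded.
    if (strpos($header->content_type, "text/html") === false)
      return -1;

    return 1; // non-negative: receive the document as usual
  }
}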
Do you guys read the docs from time to time? ;) (no offense)
Hmm, that would work, but it would be even better if the URLs could be disqualified before the header is even fetched.
Yes, you are absolutely right!
Will add it to the list of feature-requests.
Any ideas of a good name for the callback-function?
Maybe simply "prepareRequest" or something?
The crawler could pass everything regarding the next request to that function (URL, request-header) so
users could "manipulate" it as they want or even abort the entire request.
What would be a good name?
Thanks!
Hello,
I am trying to use the handleHeaderInfo() callback function… I just did this:
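(The snippet itself didn't make it into this thread; judging from the replies below it was presumably something along these lines - just a guess, with the echo only there to see the callback fire:)

function handleHeaderInfo($header)
{
  echo "handleHeaderInfo() was called\n"; // the echo statement mentioned below
  return -1;                              // tell the crawler NOT to receive the page
}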
but when I run my script, it actually calls my handleDocumentInfo() function ONCE and THEN aborts all further processing of pages… I think it should instead print my echo statement for every page it finds, but never call my handleDocumentInfo(). I am just doing this to test how to use the handleHeaderInfo() function.
If I return TRUE, it correctly processes all 50 of the pages I wanted it to process with setPageLimit(50).
weird!
I looked at the PHPCrawler code (version 0.81) and found this:

// Call header-check-callback
$ret = 0;
if ($this->header_check_callback_function != null)
    $ret = call_user_func($this->header_check_callback_function, $this->lastResponseHeader);

// Check if content should be received
$receive = $this->decideRecevieContent($this->lastResponseHeader);

if ($ret < 0 || $receive == false)
{
    @fclose($this->socket);
    $PageInfo->received = false;
    $PageInfo->links_found_url_descriptors = $this->LinkFinder->getAllURLs(); // Maybe found a link/redirect in the header
    $PageInfo->meta_attributes = $this->LinkFinder->getAllMetaAttributes();
    return $PageInfo;
}
else
{
    $PageInfo->received = true;
}
So I don't see anything wrong in the PHPCrawler code…
So, what am I doing wrong?
TIA,
Patrick
What would be a good name?
prepareRequest would be fine for me.
Hi Patrick,
looks like everything works as expected over there.
It works like this:
If you return a -1 from handleHeaderInfo, the particular document will NOT be received, and you always return a -1 in your script.
So even at the first entry page you tell the crawler NOT to receive the content, and so the crawler is not able to find ANY links (no content -> no links).
That's why the process stops immediately for you after one link, there's simply nothing more to do for the crawler (no links in the queue).
handleDocumentInfo will always be called, even if you abort the request with the handleHeaderInfo-method, but the flag PHPCrawlerDocumentInfo::received will be false in that case.
Hope I could help and that it's all clear for you now!
@huni, Thanks! LOL I feel dumb… :) That makes sense…
What I was trying to do was to make a 404 checker that would be as efficient as possible and avoid downloading every page, but I guess it has to in order to get the next set of links to process… thanks for the help!
Great product here… I will be sending a donation!
Patrick
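For what it's worth, a rough sketch of that kind of link checker under the constraint explained above (HTML pages still have to be downloaded so the crawler can find links). The http_status_code, received, url and referer_url fields follow the PHPCrawlerDocumentInfo class reference and content_type the PHPCrawlerResponseHeader one - adjust if your version differs:

<?php
require_once("libs/PHPCrawler.class.php"); // adjust to your install path

class LinkChecker extends PHPCrawler
{
  function handleHeaderInfo($header)
  {
    // The status code is already known at this point, so skip downloading
    // the body of non-HTML resources (images, PDFs, ...).
    if (strpos($header->content_type, "text/html") === false)
      return -1; // handleDocumentInfo() still gets called, with received = false

    return 1;
  }

  function handleDocumentInfo($DocInfo)
  {
    if ($DocInfo->http_status_code == 404)
      echo "Broken link: " . $DocInfo->url .
           " (linked from " . $DocInfo->referer_url . ")\n";
  }
}

$checker = new LinkChecker();
$checker->setURL("http://www.example.com/");
$checker->go();
?>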