The same domain was crawled twice (same parameters) with majorly different results.
The first run followed 77K links and the second 57K (a difference of around 20K URLs).
Google crawled 872K links (249K with HTTP code 200) in December 2013, which is a much bigger difference.
Is this a bug in phpcrawl?
Hi!
It's difficult to say without some more information.
First of all: did the crawling process finish completely both times, or did it abort for some reason?
Otherwise there could be different reasons for the different number of crawled links, like a blocking tool/firewall on the webserver (the server stops responding because of too many requests from your IP), or the server simply not responding within the given connection or stream timeout.
Did you try to increase the default connection and stream timeouts?
It's always a good idea to log failed requests when crawling larger websites (check $DocInfo->error_occured and $DocInfo->error_string); most of the time, failed requests are the reason for an occurrence like the one you describe (remember that all the potential links and sub-links of a page that couldn't be requested will be missing later on!).
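For illustration, a minimal sketch of such failed-request logging (the class name and log-file path are just placeholders, not something from this thread):

require_once("libs/PHPCrawler.class.php"); // adjust to wherever phpcrawl is installed

class LoggingCrawler extends PHPCrawler
{
    function handleDocumentInfo(PHPCrawlerDocumentInfo $DocInfo)
    {
        // Log every request that failed, so missing sub-links can be traced later
        if ($DocInfo->error_occured)
        {
            file_put_contents("failed_requests.log",
                              $DocInfo->url." -> ".$DocInfo->error_string."\n",
                              FILE_APPEND);
        }
    }
}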
Hi
Thank you for your answer.
I crawled again with these updated settings:
$this->crawler->setStreamTimeout(5);
$this->crawler->setConnectionTimeout(10);
$this->crawler->enableAggressiveLinkSearch(false);
$this->crawler->setLinkExtractionTags(array('href'));
The result report:
Links followed: 50434
Files received: 50431
Process runtime: 18234.844
Data throughput: 71355.047
Abort reason: 1
User abort: 0
I store all URLs found in MySQL. Only 3 rows have "received" or "received_completely" set to false.
If an error occurs, I suppose the "received" and "received_completely" flags are set to false.
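Just to illustrate how those rows could be pulled back out afterwards (the table and column names here are only assumptions based on the flags mentioned above):

// List the documents that were not (completely) received,
// i.e. the pages whose outgoing links may be missing from the crawl.
$pdo = new PDO("mysql:host=localhost;dbname=crawl", "user", "pass");
$result = $pdo->query("SELECT url FROM crawled_urls WHERE received = 0 OR received_completely = 0");
foreach ($result as $row)
{
    echo $row["url"]."\n";
}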
Yes, "received" will be false if an error occurred.
And the process itself finished fine.
So, did you take a look at the 3 URLs that weren't received?
Do they contain a lot of links, sub-links and sub-sub-links that are now missing from the crawling process?
I'm almost sure this is not a bug in phpcrawl, because it always works the same way; it's not like "today I work this way, and tomorrow another way with a different result", you know?
And how did you find out the number of URLs of that page in Google's cache?
Yes.
These 3 links are from a listing, page number 300 of 546 pages.
I saw that all pages from this listing were followed by the crawler,
so the small number of crawled pages doesn't seem to come from there.
Number of pages crawled by the Google bot: 872K.
I got these unique URLs from an access-log file (December 2013).
OK, another attempt (I can only grope in the dark here):
Does the Google bot maybe visit some URLs more than once?
Anchor links, for instance
(www.page.com/file.html#bla and www.page.com/file.html#bli)?
phpcrawl doesn't visit URLs like that twice.
It's really difficult to find out without knowing the actual website
and without having the list of URLs Google visited.
As far as I know you are the first one reporting such a problem, so there has to be
something "special" about the website, I guess.
Big difference: Google - 872K links (249K with HTTP code 200), phpcrawl - around 55K.
All URLs are without '#'.
The crawl settings/rules:
Content type receive: #text/html#
Url filter: #[(.(jpg|jpeg|gif|png|js|swf|xml|ico))|!(\/upload_data\/)]$#i
Url follow: --
Obey nofollow tags: yes
Obey robots.txt: yes
I hope the URL filter is well defined: I want it to filter out all URLs with the extensions from (...) and every URL that starts with the folder /upload_data/.
Maybe your URL filter isn't correct - I don't know, I didn't test it.
Why don't you just set two filters? That's easier to read ;)
$crawler->addURLFilterRule("#\.(jpg|jpeg|gif|png|js|swf|xml|ico)$#i");
$crawler->addURLFilterRule("#/upload_data/#");
Could you find any URL that Google visited and phpcrawl didn't?
Then you could test that URL against your rules and see if that's
maybe the problem.
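For example, a quick stand-alone check (the URL is just a placeholder; use one of the URLs Google visited) to see whether it would be caught by one of the two filter rules above:

// Test a candidate URL against the two filter patterns without running the crawler
$url = "http://www.example.com/some/page.html";
$filters = array("#\.(jpg|jpeg|gif|png|js|swf|xml|ico)$#i",
                 "#/upload_data/#");
foreach ($filters as $regex)
{
    if (preg_match($regex, $url))
    {
        echo "Would be filtered out by rule: ".$regex."\n";
    }
}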
I have a question regarding the URL filter.
If I have:
$crawler->addURLFilterRule("#\.(jpg|jpeg|gif|png|js|swf|xml|ico)$#i");
$crawler->addURLFilterRule("#\/upload_data\/#");
and I also add:
$crawler->obeyNoFollowTags(true);
$crawler->obeyNoFollowTags(true);
I still get text/css returned in the results.
I read this in the documentation:
Please note that the directives found in a robots.txt file have a higher priority than other settings made by the user. If e.g. addFollowMatch("#http://foo.com/path/file.html#") was set, but a directive in the robots.txt file of the host foo.com says "Disallow: /path/", the URL http://foo.com/path/file.html will be ignored by the crawler anyway.
My question: how can I exclude these extensions (jpg|jpeg|gif|png|js|swf|xml|ico) and still obey robots.txt?
Hi!
I think I don't understand your problem.
Simply set obeyRobotsTxt(true) and set your follow rules, that's it.
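Putting the pieces from this thread together, a rough sketch of such a setup (the host, include path and class name are placeholders; untested):

require_once("libs/PHPCrawler.class.php"); // adjust to your phpcrawl location

class MyCrawler extends PHPCrawler
{
    function handleDocumentInfo(PHPCrawlerDocumentInfo $DocInfo)
    {
        // process/store the received document here
    }
}

$crawler = new MyCrawler();
$crawler->setURL("http://www.example.com/");
$crawler->addContentTypeReceiveRule("#text/html#"); // receive content only for text/html documents (as in the settings posted earlier)
$crawler->addURLFilterRule("#\.(jpg|jpeg|gif|png|js|swf|xml|ico)$#i"); // skip these extensions
$crawler->addURLFilterRule("#/upload_data/#"); // skip everything below /upload_data/
$crawler->obeyNoFollowTags(true);
$crawler->obeyRobotsTxt(true); // robots.txt directives still take priority over the rules above
$crawler->go();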
Sorry, I repeated obeyNoFollowTags twice - that was a writing mistake; I just have it once.
Still, sometimes I get strange results.
An example: my robots.txt contains a Disallow directive that uses wildcards (*).
With obeyRobotsTxt(true) set, I still receive links like /index.php?obj=...
Maybe the obeyRobotsTxt() method doesn't know how to interpret the stars (*).
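If the wildcard really isn't interpreted, one possible workaround (just a sketch; the pattern is only a guess at what the Disallow rule is meant to block) is to mirror the directive with an explicit filter rule, as already used earlier in this thread:

// Exclude the /index.php?obj=... URLs directly,
// independent of how the robots.txt wildcard is parsed.
$crawler->addURLFilterRule("#/index\.php\?obj=#");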
phpcrawl does not find links from dropdowns.
How do I set up phpcrawl to get the <select> options?
By default it works fine for me, with more than half of my links coming from dropdowns.
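One thing that may be worth checking here (a guess based on the settings posted earlier in this thread, not a confirmed cause): links that only appear inside <option value="..."> attributes are not covered when link extraction is restricted to href, so the aggressive link search is likely what finds them. A sketch of the two settings involved:

// Let phpcrawl scan the whole page source for anything that looks like a link,
// which can also pick up URLs placed inside <option value="..."> attributes.
// (This was explicitly disabled earlier in this thread.)
$crawler->enableAggressiveLinkSearch(true);

// Alternatively (an untested assumption, not documented behaviour):
// include "value" in the list of attributes searched for links.
$crawler->setLinkExtractionTags(array("href", "src", "value"));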