Dear Greg,
I did some more research. At the point where the script is aborted with the 403 error, I placed some debug printing of the variables related to this function.
session_spider.php mod:
if ($SEARCH_SPIDER && !in_array(PGV_SCRIPT_NAME, $bots_not_allowed)) {
	header("HTTP/1.0 403 Forbidden");
	print "Sorry spider, this page is not available for search engine bots.";
	print $SEARCH_SPIDER;
	print $bot_session;
	print PGV_SCRIPT_NAME;
The result is as follows when simulating the Google spider:
Sorry spider, this page is not available for search engine bots.
Googlebot/ http://www.google.com/bot.html
xxGOOGLEBOTfsHTTPcffWWWdGOOGLxx
index.php
The last value is the PGV script name, and it is not included in the 'bots_not_allowed' array.
I do not understand why PhpGedView is blocking this spider.
Rick.
Maybe ask Gerry. This was his area. He seems to be online now.
Hi All,
I continued testing to find the problem.
The system seems to work now with the following modifications:
In session_spider.php:
if ($SEARCH_SPIDER && !in_array(PGV_SCRIPT_NAME, $bots_not_allowed)) {
	header("HTTP/1.0 403 Forbidden");
	print "Sorry, this page is not available for search engine bots.";
	exit;
}
into:
if ($SEARCH_SPIDER && in_array(PGV_SCRIPT_NAME, $bots_not_allowed)) {
	header("HTTP/1.0 403 Forbidden");
	print "Sorry, this page is not available for search engine bots.";
	exit;
}
Note: the ! sign before in_array has been removed.
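For clarity: in_array() returns true only when the value is present in the array, so with the extra ! the 403 fired for every script that was NOT in $bots_not_allowed, including ordinary pages such as index.php. A minimal sketch of the two conditions (the array contents below are a hypothetical example, not PGV's real list):

```php
<?php
// Hypothetical disallow list; the real array lives in session_spider.php.
$bots_not_allowed = array('calendar.php', 'clippings.php');
$script = 'index.php'; // what PGV_SCRIPT_NAME resolves to for the home page

// Original (buggy) test: true for every script NOT on the list,
// so the spider was blocked from index.php and almost everything else.
var_dump(!in_array($script, $bots_not_allowed)); // bool(true)  -> 403 sent

// Corrected test: true only for scripts that ARE on the list.
var_dump(in_array($script, $bots_not_allowed));  // bool(false) -> allowed
```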
In functions_print.php:
if ($SEARCH_SPIDER) {
	if (
		!(PGV_SCRIPT_NAME=='/individual.php' ||
		PGV_SCRIPT_NAME=='/indilist.php' ||
		PGV_SCRIPT_NAME=='/login.php' ||
		PGV_SCRIPT_NAME=='/family.php' ||
		PGV_SCRIPT_NAME=='/famlist.php' ||
		PGV_SCRIPT_NAME=='/help_text.php' ||
		PGV_SCRIPT_NAME=='/source.php' ||
		PGV_SCRIPT_NAME=='/search_engine.php' ||
		PGV_SCRIPT_NAME=='/index.php')
	) {
		header("Location: search_engine.php");
		exit;
	}
}
into:
if ($SEARCH_SPIDER) {
	if (
		!(PGV_SCRIPT_NAME=='individual.php' ||
		PGV_SCRIPT_NAME=='indilist.php' ||
		PGV_SCRIPT_NAME=='login.php' ||
		PGV_SCRIPT_NAME=='family.php' ||
		PGV_SCRIPT_NAME=='famlist.php' ||
		PGV_SCRIPT_NAME=='help_text.php' ||
		PGV_SCRIPT_NAME=='source.php' ||
		PGV_SCRIPT_NAME=='search_engine.php' ||
		PGV_SCRIPT_NAME=='index.php')
	) {
		header("Location: search_engine.php");
		exit;
	}
}
Note: the leading / before each script name has been removed.
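This second change matters because PGV_SCRIPT_NAME holds a bare file name with no leading slash, so a comparison against a slash-prefixed string can never match, and every spider request was redirected to search_engine.php. A minimal illustration (the value of $script is a hypothetical stand-in for the constant):

```php
<?php
// PGV_SCRIPT_NAME holds a bare file name, not a path.
$script = 'individual.php';

var_dump($script == '/individual.php'); // bool(false) -> spider redirected away
var_dump($script == 'individual.php');  // bool(true)  -> spider allowed through
```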
I do not know whether this is an issue only at my site; otherwise it might be necessary to raise a bug report.
Thanks for the support so far.
Rick.
Rick,
Both these changes are correct, and both are my own coding errors (in SVN 6879), so apologies.
Greg
Hi Greg,
No apologies needed! I know how it works, and you know, having these challenges makes it interesting for me to search the source code. Who knows, there may come a time when I can assist in development ;-)
Glad that this is solved and the website can be spidered by Google again.
I kind of skimmed through quickly, so forgive me if this has already been addressed. One of the posts, perhaps without meaning to, gave an incorrect impression. A 403 is NOT caused by anything in robots.txt. robots.txt is essentially instructions to robots equivalent to "if you are this, then don't do that." Good robots obey, bad robots don't.
Because bad robots exist, PGV also includes code that attempts to detect "they are this" and then prevent them from doing that. Or rather, prevent them from doing anything: since they are (believed to be) uncivilized, we slap them silly with a 403.
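By way of illustration, this kind of detection typically works by matching the User-Agent header against known spider signatures. The sketch below is a hypothetical simplification, not PGV's actual code (its real logic lives in session_spider.php and is more involved):

```php
<?php
// Hypothetical sketch of User-Agent-based spider detection.
function detect_spider($user_agent) {
    // Example signatures; a real list would be much longer.
    $known_spiders = array('Googlebot', 'bingbot', 'Slurp');
    foreach ($known_spiders as $spider) {
        // Case-insensitive substring match: "they are this".
        if (stripos($user_agent, $spider) !== false) {
            return $spider;
        }
    }
    return false; // not a recognized spider
}

$ua = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)';
var_dump(detect_spider($ua)); // string(9) "Googlebot"
```

A site could then decide, per script, whether a detected spider gets the page, a redirect to search_engine.php, or a 403.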
I checked your patch into SVN (6953).
Thanks