Logged In: YES
user_id=634811
Thanks for the patch.
Is there any reason anyone can think of not to apply it? Should it be controlled by a config option?
Thanks
Logged In: YES
user_id=1278885
Most other CMS packages do something similar. Session IDs just confuse spiders and hurt our site's ranking in the search engines. I don't see any reason to ever disable this, but I guess you could add a new configuration variable.
Logged In: YES
user_id=300048
I also don't see a reason to make it a config option.
But maybe we can expand on this. I wonder if we can do some other things to improve the site just for search engines... for example, remove the printer-friendly link if the visitor is a search engine... remove the option to switch language if they are a spider... redirect all spider attempts to access the login page back to the index page.
--John
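A minimal sketch of what that could look like, assuming a $spider variable that is non-empty whenever the visitor has been detected as a bot (the file names here are illustrative, not necessarily PGV's real ones):
<?php
// Assumption: $spider holds the detected bot's name, or "" for humans.
$is_spider = !empty($spider);
// Hide the printer-friendly link from spiders.
if (!$is_spider) {
    echo '<a href="printer_friendly.php">Printer friendly</a>';
}
// Send spiders that request the login page back to the index page.
// Nothing has been echoed in this branch, so header() is still allowed.
if ($is_spider && basename($_SERVER['PHP_SELF']) == 'login.php') {
    header('Location: index.php');
    exit;
}
?>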
Logged In: YES
user_id=634811
John,
Making PGV more spider friendly has a lot of merit, especially since spiders seem to take up a good portion of all bandwidth to the site. Theme changes can also be removed, as can the contact links, the help links (we don't need 100,000 references to our help files in Google), and the clipping cart. I think the reports menu can probably also be removed, as can the portal link (index.php?command=user).
Logged In: YES
user_id=634811
I looked over the code and noticed that the session.php file already has some code, right at the start, to ban some bad user agents (it uses slightly different user-agent sniffing code). We should probably unify the two into two arrays: one of banned user agents and one of known spiders. By the way, the Yahoo bot in my logs looks very different from the one you listed. I think using eregi() is probably preferable. Have a look at the bot-detecting code at http://nes-emulator.com/x_bot.php and http://danzcontrib2.free.fr/en/activiterobots.php .
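To make the suggestion concrete, here is a rough sketch of the two-array approach with eregi(). The pattern lists are illustrative placeholders; real lists would come from actual log entries:
<?php
// Illustrative lists; populate these from real server logs.
$banned_agents = array('emailsiphon', 'webbandit', 'grub-client');
$known_spiders = array('googlebot', 'slurp', 'msnbot', 'gigabot');
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$spider = '';
// Banned agents get a 403 and nothing else.
foreach ($banned_agents as $pattern) {
    if (eregi($pattern, $ua)) {
        header('HTTP/1.0 403 Forbidden');
        exit;
    }
}
// Known spiders are remembered by name for later special-casing.
foreach ($known_spiders as $pattern) {
    if (eregi($pattern, $ua)) {
        $spider = $pattern;
        break;
    }
}
?>
Because eregi() treats each entry as a case-insensitive regular expression, a plain entry like 'googlebot' matches anywhere inside the user agent string, which is exactly what an exact-match test cannot do.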
Logged In: YES
user_id=634811
One other point is that Drupal uses the following for
session issues with spiders (I am not sure if the first line
'save_handler' is part of this):
ini_set('session.save_handler', 'user');
ini_set('session.use_only_cookies', 1);
ini_set('session.use_trans_sid', 0);
ini_set('url_rewriter.tags', '');
See:
http://drupal.org/node/42186
and
http://baheyeldin.com/drupal/how-to-get-rid-of-phpsessid-in-drupal-and-other-php-applications.html
Adding spider detection might still be something worth doing.
Logged In: YES
user_id=1410859
Dear kosherjava,
Thank you for your suggestion about altering session.php. I'm afraid, though, that your solution doesn't work for me. I isolated your code and added some text I know to be included in my own user-agent string, which resulted in
test1.php:
<?php
echo "Test for string in Useragent <br />";
$spiders = array(
    'compatible',
    'molvis',
    'zyborg',
    'fast-webcrawler',
    'gigabot',
    'scrubby',
    'msnbot',
    'yahooseeker',
    'mozilla'
);
$useragent = (isset($_SERVER['HTTP_USER_AGENT'])) ? $_SERVER['HTTP_USER_AGENT'] : "";
// in_array() only does an exact, whole-string comparison, so none of the
// substrings above will ever match a full user agent string.
$spider = in_array(strtolower($useragent), $spiders);
$spider |= stristr($useragent, 'OmniExplorer_Bot/');
if ($spider) {
    echo "Congratulations, string has been found <br />";
}
?>
This does not return a congratulations message, because in_array() compares the whole user agent string against each entry for an exact match instead of searching for substrings. I then altered your script somewhat, to test2.php:
<?php
echo "Test for string in Useragent <br />";
$spidera = array(
    'compatible',
    'molvis',
    'zyborg',
    'fast-webcrawler',
    'gigabot',
    'scrubby',
    'msnbot',
    'yahooseeker',
    'mozilla'
);
$useragent = (isset($_SERVER['HTTP_USER_AGENT'])) ? $_SERVER['HTTP_USER_AGENT'] : "";
foreach ($spidera as $spiderv) {
    if ((stristr($useragent, $spiderv)) != (stristr($useragent, "OmniExplorer_Bot/"))) {
        echo "Congratulations, string has been found <br />";
    }
}
?>
Which, in my case, returns three congratulations messages.
Would someone please acknowledge what I found?
I implemented the alteration at http://genealogie.molvis.nl
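For what it's worth, here is a cleaner version of that substring test, which stops at the first match and remembers which entry matched instead of comparing two stristr() results. The list reuses the entries from the tests above; note that 'mozilla' and 'compatible' appear in ordinary browser user agents too, which is why the loop above fires several times for a human visitor:
<?php
echo "Test for string in Useragent <br />";
$spiders = array(
    'molvis',
    'zyborg',
    'fast-webcrawler',
    'gigabot',
    'scrubby',
    'msnbot',
    'yahooseeker',
    'omniexplorer_bot'
);
$useragent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : "";
$spider = '';
foreach ($spiders as $name) {
    if (stristr($useragent, $name) !== false) {  // case-insensitive substring search
        $spider = $name;
        break;  // one match is enough
    }
}
if ($spider != '') {
    echo "Congratulations, string has been found: $spider <br />";
}
?>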
Improved includes/session.php
Logged In: YES
user_id=1278885
I've checked out the links posted by kosherjava and found them quite useful. Based on that code, I've rewritten session.php to use three arrays: $worms, $bots, and $bots_not_allowed. $worms and $bots are matched against the user agent string; $bots_not_allowed is a list of places that bots shouldn't go. When a bot requests one of those pages, the script outputs a 403 (Forbidden) status and exits.
I've also made it so that $spider contains the name of the bot that was found, in case you want to treat Googlebot differently from Yahoo! Slurp.
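For readers following along, a skeleton of that three-array layout. The array contents here are placeholders, not the real lists from the rewritten session.php:
<?php
// Placeholder lists; the real session.php carries much longer ones.
$worms = array('nsiislog.dll', 'default.ida');            // worm/exploit probes
$bots = array('googlebot', 'slurp', 'msnbot');            // known spiders
$bots_not_allowed = array('login.php', 'clippings.php');  // pages bots may not fetch
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$spider = '';
foreach ($worms as $worm) {
    if (stristr($ua, $worm) !== false) {
        header('HTTP/1.0 403 Forbidden');  // worms are refused outright
        exit;
    }
}
foreach ($bots as $bot) {
    if (stristr($ua, $bot) !== false) {
        $spider = $bot;  // e.g. treat Googlebot differently from Yahoo! Slurp
        break;
    }
}
if ($spider != '' && in_array(basename($_SERVER['PHP_SELF']), $bots_not_allowed)) {
    header('HTTP/1.0 403 Forbidden');  // spiders may not fetch these pages
    exit;
}
?>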
Logged In: YES
user_id=634811
Any ideas on the ini_set part? This would hopefully mean never presenting any bot with a session ID, mitigating a lot of the duplicate hits on the same page by bots.
Logged In: YES
user_id=634811
See http://www.articlecore.com/article/152.htm for a bit of an explanation of my 2006-02-06 17:45 post.
Logged In: YES
user_id=1278885
session.use_trans_sid and session.use_only_cookies should probably be set within session.php.
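A minimal sketch of how that could look at the top of session.php, before session_start() is called. These are the same settings quoted from Drupal above; together they stop PHP from ever putting the session ID into URLs, which is what feeds the duplicate spider hits discussed in this thread:
<?php
// Must run before session_start() to take effect for this request.
ini_set('session.use_only_cookies', 1);  // ignore session IDs passed in URLs
ini_set('session.use_trans_sid', 0);     // never rewrite URLs to carry the ID
ini_set('url_rewriter.tags', '');        // nothing left for the URL rewriter to touch
session_start();
?>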