#315 Spider aware session.php (PGV 3.3.8)

Status: open
Owner: nobody
Priority: 5
Updated: 2006-01-25
Created: 2006-01-25
Creator: Laie Techie
Private: No

This patch assigns a session id of 0 for common spiders.

Discussion

(Page 1 of 2)
  • Laie Techie
    2006-01-25

     
    Attachments
  • KosherJava
    2006-01-25

    Thanks for the patch.
    Is there any reason anyone can think of to not apply this
    patch? Should the setting to use this be a config option?
    Thanks

     
  • Laie Techie
    2006-01-26

    Most other CMS packages do something similar. Session IDs
    just confuse spiders and so hurt our site's ranking on that
    search engine. I don't see any reason to ever disable this,
    but I guess you could add a new configuration variable.
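
A minimal sketch of the idea, assuming the patch works by matching the user agent against a list of spider names and fixing the session ID before the session starts. The attached patch is not reproduced in this thread, so the `$knownSpiders` list and the `spiderSessionId()` helper are hypothetical names used only for illustration.

```php
<?php
// Hypothetical list of lowercase substrings that identify crawlers.
$knownSpiders = array('googlebot', 'msnbot', 'slurp', 'gigabot', 'zyborg');

/**
 * Return the fixed session ID '0' when the user agent contains a known
 * spider name, or null to let PHP generate a normal session ID.
 */
function spiderSessionId($userAgent, array $spiders)
{
    foreach ($spiders as $needle) {
        // stripos() is a case-insensitive substring search.
        if (stripos($userAgent, $needle) !== false) {
            return '0';
        }
    }
    return null;
}

// Usage sketch (kept as a comment because it must run before any output):
// $ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
// $id = spiderSessionId($ua, $knownSpiders);
// if ($id !== null) {
//     session_id($id);   // must be called before session_start()
// }
// session_start();
```

Because every spider request shares the same ID, a crawler no longer sees a different session ID on each visit.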

     
  • John Finlay
    2006-01-26

    I also don't see a reason to make it a config option.

    But maybe we can expand this. I wonder if we can do some
    other things to improve the site just for search engines...
    for example remove the printer friendly link if they are a
    search engine... remove the option to switch language if
    they are a spider... redirect all spider attempts to access
    the login page back to the index page.

    --John

     
  • KosherJava
    2006-01-26

    John,
    Making PGV more spider friendly has a lot of merit,
    especially since spiders seem to take up a good portion of
    all bandwidth to the site. Theme changes can also be
    removed, as can the contact links, help links (we don't need
    100,000 references to our help files in google) and the
    clipping cart. I think that the reports menu can probably
    also be removed, as can the portal link
    (index.php?command=user).
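
One way the link trimming suggested in the last two posts could look is a per-link flag that says whether spiders should see it. This is only a sketch: the `$menu` array, its labels, and every URL except `index.php?command=user` are made up for illustration and do not come from PGV.

```php
<?php
// Hypothetical menu: label => array(url, show this link to spiders?)
$menu = array(
    'Home'                     => array('index.php', true),
    'Individuals'              => array('indilist.php', true),
    'Printer-friendly version' => array('index.php?view=preview', false),
    'Change language'          => array('index.php?changelanguage=yes', false),
    'Login'                    => array('login.php', false),
    'My portal'                => array('index.php?command=user', false),
);

/**
 * Return label => url pairs for the links this visitor should see.
 * Human visitors get everything; spiders only get the flagged links.
 */
function visibleLinks(array $menu, $isSpider)
{
    $visible = array();
    foreach ($menu as $label => $item) {
        list($url, $showToSpiders) = $item;
        if (!$isSpider || $showToSpiders) {
            $visible[$label] = $url;
        }
    }
    return $visible;
}
```

Calling `visibleLinks($menu, true)` leaves only the content pages, which is the bandwidth saving discussed above.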

     
  • KosherJava
    2006-02-06

    I looked over the code and noticed that the session.php file
    has some code to ban some bad user agents (and it uses a
    slightly different user agent sniffing code). You can find
    this right at the start of the file. We should probably
    unify the code to use two arrays, one of banned user agents
    and one of known spiders. By the way, the Yahoo bot in my
    logs looks very different from the one you listed. I think
    that using eregi is probably preferable. Also, have a look
    at bot detecting code at http://nes-emulator.com/x_bot.php
    and http://danzcontrib2.free.fr/en/activiterobots.php .
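
The two-array idea could be sketched roughly as follows. Both lists and the `classifyUserAgent()` helper are hypothetical and are not the contents of session.php. The sketch uses `stripos()` for case-insensitive matching rather than `eregi()`, because the `ereg` family was removed in PHP 7.

```php
<?php
// Hypothetical example lists, lowercase substrings of user agent strings.
$bannedAgents = array('emailsiphon', 'webzip', 'httrack');
$knownSpiders = array('googlebot', 'msnbot', 'yahoo! slurp', 'gigabot');

/**
 * Classify a user agent as 'banned', 'spider' or 'visitor'.
 * Banned agents are checked first so they can never pass as spiders.
 */
function classifyUserAgent($userAgent, array $banned, array $spiders)
{
    foreach ($banned as $needle) {
        if (stripos($userAgent, $needle) !== false) {
            return 'banned';
        }
    }
    foreach ($spiders as $needle) {
        if (stripos($userAgent, $needle) !== false) {
            return 'spider';
        }
    }
    return 'visitor';
}
```

A single function like this gives the ban check and the spider check the same matching rules, which is the point of unifying them.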

     
  • Kees Mollema
    2006-02-09

    Dear kosherjava,

    Thank you for your suggestion about altering session.php.
    I'm afraid your solution doesn't work for me, though. I
    isolated your code and added some text that I know is
    included in my own user agent string, which resulted in
    test1.php:

    <?php
    echo "Test for string in Useragent <br />";
    // Keywords to look for in the user agent
    $spiders = array(
        'compatible',
        'molvis',
        'zyborg',
        'fast-webcrawler',
        'gigabot',
        'scrubby',
        'msnbot',
        'yahooseeker',
        'mozilla'
    );
    $useragent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : "";
    // in_array() compares the whole user agent string against each entry
    $spider = in_array(strtolower($useragent), $spiders);
    $spider |= (stristr($useragent, 'OmniExplorer_Bot/') !== false);
    if ($spider) {
        echo "Congratulations, string has been found <br />";
    }
    ?>

    This does not print the congratulations message. I then
    altered your script somewhat, giving test2.php:

    <?php
    echo "Test for string in Useragent <br />";
    $spidera = array(
        'compatible',
        'molvis',
        'zyborg',
        'fast-webcrawler',
        'gigabot',
        'scrubby',
        'msnbot',
        'yahooseeker',
        'mozilla'
    );
    $useragent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : "";
    foreach ($spidera as $spiderv) {
        if (stristr($useragent, $spiderv) != stristr($useragent, "OmniExplorer_Bot/")) {
            echo "Congratulations, string has been found <br />";
        }
    }
    ?>

    In my case this prints 3 congratulations messages.

    Could someone please confirm what I found?

    I implemented the alteration at http://genealogie.molvis.nl
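
The difference between the two tests comes down to how `in_array()` and `stristr()` match. `in_array()` succeeds only when the entire user agent equals one list entry, while `stristr()` looks for each entry as a substring, so every keyword found in the user agent prints one message. A small demonstration, using a sample user agent and a shortened keyword list that are assumptions chosen for illustration:

```php
<?php
// Sample user agent, assumed for illustration only.
$useragent = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)';
$spiders = array('compatible', 'googlebot', 'mozilla');

// in_array() asks: is the ENTIRE lowercased user agent equal to one entry?
$exact = in_array(strtolower($useragent), $spiders);

// A substring test per entry collects every keyword the user agent contains.
$matches = array();
foreach ($spiders as $keyword) {
    if (stristr($useragent, $keyword) !== false) {
        $matches[] = $keyword;
    }
}

var_dump($exact);   // bool(false)
print_r($matches);  // lists compatible, googlebot, mozilla
```

Generic words such as 'compatible' and 'mozilla' appear in most browser user agents, so a substring test against them would also flag ordinary visitors.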

     
  • Laie Techie
    2006-02-09

    Improved includes/session.php

     
    Attachments
  • Laie Techie
    2006-02-09

    I've checked out the links posted by kosherjava and found
    them quite useful. Based on that code, I've rewritten
    session.php to use 3 arrays: $worms, $bots, and
    $bots_not_allowed. $worms and $bots are used to match the
    user agent string. $bots_not_allowed is a list of places
    that bots shouldn't go. It outputs a 403 (Forbidden) status
    and exits.

    I've also made it so that $spider contains the name of the
    bot that is found, in case you want to treat Googlebot
    differently from Yahoo! Slurp.
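
A rough sketch of how the bot name and the forbidden places described above might fit together. The array contents and the helpers `findAgentName()` and `botMayVisit()` are hypothetical, since the attached session.php is not shown in the thread. `$worms` is left out because the post does not say how matches on it are handled.

```php
<?php
// Hypothetical contents; the real arrays are in the attachment.
$bots = array('Googlebot', 'Yahoo! Slurp', 'msnbot');
$bots_not_allowed = array('login.php', 'admin.php');

/**
 * Return the first listed name found in the user agent (case-insensitive),
 * or an empty string when nothing matches.
 */
function findAgentName($userAgent, array $names)
{
    foreach ($names as $name) {
        if (stripos($userAgent, $name) !== false) {
            return $name;
        }
    }
    return '';
}

/**
 * Decide whether the request may proceed: ordinary visitors always may,
 * bots may not request scripts on the not-allowed list.
 */
function botMayVisit($spider, $script, array $notAllowed)
{
    if ($spider === '') {
        return true;
    }
    return !in_array(basename($script), $notAllowed);
}

// Usage sketch (commented out because it sends headers and exits):
// $ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
// $spider = findAgentName($ua, $bots);
// if (!botMayVisit($spider, $_SERVER['SCRIPT_NAME'], $bots_not_allowed)) {
//     header('HTTP/1.0 403 Forbidden');
//     exit;
// }
```

Keeping the matched name in `$spider` rather than a plain boolean is what allows later code to branch per bot.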

     