#1 Obey robots.txt

open
nobody
None
5
2005-06-03
2005-06-03
cybersaga
No

As of this release, there are three ways to generate a
sitemap: specifying urls, specifying paths, or using logs.

However, I would imagine that many administrators will
use these methods to mimic exactly what their
robots.txt file specifies. Why not make this easier?

Something like this:
<robots url="http://www.example.com/robots.txt"
path="/var/www/html" bot="googlebot" />

url: Address of robots.txt.

path: Root path of the site.

bot: There has been interest in other search engines using
this software. This attribute would let sitemap generation
follow the rules for a specific bot. Values would be a bot
name, or "*" to follow all rules, regardless of the bot they
are meant for.

This would essentially be equivalent to a directory element
plus a few filter elements derived from the rules in robots.txt.

Details will have to be ironed out, taking into account
aliased directories that a bot would see but that are not
visible on the file system.

Thus, creation of the sitemap will follow the same
rules a bot would.
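
To make the idea concrete, here is a minimal sketch in Python of what such a robots element might do internally. It is not part of the sitemap generator; the function name allowed_urls and its parameters are hypothetical and simply mirror the url, path, and bot attributes proposed above. It relies on the standard library's urllib.robotparser, which treats "*" as the default rule group, so it does not reproduce the "follow all rules" meaning suggested for bot="*". Aliased directories are not handled.

```python
import os
from urllib.robotparser import RobotFileParser


def allowed_urls(robots_lines, base_url, root_path, bot="*"):
    """Return URLs for files under root_path that robots.txt lets `bot` fetch.

    robots_lines: the lines of a robots.txt file (assumed already loaded).
    base_url:     site address that root_path is served from (assumption).
    root_path:    root directory of the site on the file system.
    bot:          user-agent name whose rules should be applied.
    """
    parser = RobotFileParser()
    parser.parse(robots_lines)

    urls = []
    for dirpath, _dirnames, filenames in os.walk(root_path):
        for name in filenames:
            # Map the file system path to the URL a bot would request.
            rel = os.path.relpath(os.path.join(dirpath, name), root_path)
            url = base_url.rstrip("/") + "/" + rel.replace(os.sep, "/")
            if parser.can_fetch(bot, url):
                urls.append(url)
    return sorted(urls)
```

With a robots.txt that disallows /private/ for all bots, files under /var/www/html/private would be left out of the returned list. To load the rules from the live site instead of passing lines in, RobotFileParser also offers set_url() and read().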

Discussion

  • gcb
    2006-01-23

    Logged In: YES
    user_id=638018

    What's the point? Right now it's not a substitute for
    robots.txt.

    So, if it's already in your robots.txt, you're only making
    extra work for yourself.

    Sorry if I failed to see the point. If so, please
    correct me.

     
  • Logged In: YES
    user_id=365576

    I would not recommend restricting results to a particular
    bot because the sitemap file might be used by other bots as
    well (hopefully not every bot will come up with its own format).

    Honoring the default exclusion rules looks like a good idea,
    though, so that resources which are not supposed to be indexed
    will not be included in the sitemap either.
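
As a rough illustration of "default exclusion rules", the sketch below checks a URL against only the "User-agent: *" group of a robots.txt and ignores bot-specific groups. The sample robots.txt content and the helper name excluded_by_default are made up for this example; only urllib.robotparser from the Python standard library is assumed.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt with one default group and one bot-specific group.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/

User-agent: googlebot
Disallow: /drafts/
"""


def excluded_by_default(robots_text, url):
    """Return True if the default ("*") rules disallow fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    # Asking as "*" matches no named group, so only the default group applies.
    return not parser.can_fetch("*", url)
```

Under these assumptions, a URL under /cgi-bin/ would be dropped from the sitemap, while one under /drafts/ would be kept, because that rule is aimed at a single bot only.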

     
  • gcb
    2006-02-20

    Logged In: YES
    user_id=638018

    > hopefully not every bot will come up with its own format

    But this is already happening. Yahoo has its inclusion
    program, and Google has this 'site map', which should also be
    called an inclusion program, IMHO.

    Again, I may be failing to see the point, but Google states
    that it will not ignore robots.txt.
    Robots.txt is used mainly to deny access, so even if you
    put a URL in your Google sitemap, it will obey robots.txt and
    not index it. (I don't know whether it will mess up your
    relevancy count, though.)