#1 Obey robots.txt


As of this release, there are three ways to generate a
sitemap: specifying URLs, specifying paths, or using logs.

However, I would imagine that many administrators will
use these methods to mimic exactly what their
robots.txt file specifies. Why not make this easier?

Something like this:
<robots url="http://www.example.com/robots.txt"
path="/var/www/html" bot="googlebot" />

url: Address of robots.txt.

path: Root path of the site.

bot: Interest was shown in other search engines using
this software. This attribute will allow the sitemap
generation to follow the rules for a certain bot.
Values would include a bot name, or "*" to follow all
rules, regardless of the bot they are meant for.

This would essentially mimic a directory element, and a
few filter elements based on the rules within robots.txt.
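
To illustrate the filtering half of this idea, here is a minimal sketch using Python's standard urllib.robotparser. The robots.txt content, the URL list, and the helper name allowed_urls are all made up for the example; nothing here is part of the sitemap generator. Note that the standard parser treats "*" as the default rule group, so the proposed bot="*" meaning (apply every rule, whichever bot it targets) would need extra handling that is not shown.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: googlebot
Disallow: /private/

User-agent: *
Disallow: /tmp/
"""


def allowed_urls(robots_lines, bot, urls):
    """Return the URLs that `bot` may fetch under the given robots.txt lines."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse in-memory lines, no network access
    return [u for u in urls if parser.can_fetch(bot, u)]


# Hypothetical candidate URLs for the sitemap.
urls = [
    "http://www.example.com/index.html",
    "http://www.example.com/private/a.html",
    "http://www.example.com/tmp/b.html",
]

# googlebot has its own rule group, so only /private/ is excluded for it.
kept = allowed_urls(ROBOTS_TXT.splitlines(), "googlebot", urls)
```

A bot with no rule group of its own falls back to the "*" group, so under these rules it would lose /tmp/ rather than /private/.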

Details will have to be ironed out, taking into account
aliased directories that a bot would see but that are
not visible on the file system.

Thus, creation of the sitemap will follow the same
rules a bot would.
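
As a rough sketch of how the path and bot attributes might combine, the following walks a root directory, maps each file to a URL, and keeps only the URLs the bot may fetch. The function name sitemap_urls, the base URL, and the temporary directory layout are assumptions for the example. Because it only sees what is on the file system, it would miss the aliased directories mentioned above.

```python
import os
import tempfile
from urllib.parse import quote
from urllib.robotparser import RobotFileParser


def sitemap_urls(root_path, base_url, robots_lines, bot):
    """Yield URLs for files under root_path that `bot` may fetch."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    base = base_url.rstrip("/")
    for dirpath, _dirnames, filenames in os.walk(root_path):
        for name in sorted(filenames):
            # Turn the file system path into a site-relative URL path.
            rel = os.path.relpath(os.path.join(dirpath, name), root_path)
            url = base + "/" + quote(rel.replace(os.sep, "/"))
            if parser.can_fetch(bot, url):
                yield url


# Hypothetical site layout built in a temporary directory.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "private"))
    for rel in ("index.html", os.path.join("private", "a.html")):
        open(os.path.join(root, rel), "w").close()
    rules = ["User-agent: *", "Disallow: /private/"]
    found = list(sitemap_urls(root, "http://www.example.com", rules, "googlebot"))
```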


  • gcb

    gcb - 2006-01-23

    Logged In: YES

    What's the point? right now it's not a substitute for

    So, if it's already in your robots.txt, you're only making
    extra work for yourself.

    Sorry if I failed to see the point. If so, please
    correct me.

  • Klaus Johannes Rusch

    Logged In: YES

    I would not recommend restricting results to a particular
    bot because the sitemap file might be used by other bots as
    well (hopefully not every bot will come up with its own format).

    Honoring the default exclusion rules looks like a good idea
    though so resources which are not supposed to be indexed
    will not be included in the sitemap either.

  • gcb

    gcb - 2006-02-20

    Logged In: YES

    > hopefully not every bot will come up with its own

    But this is already happening. Yahoo has its inclusion
    program, and Google has this 'site map', which should also
    be called an inclusion program, IMHO.

    Again, I may fail to see the point. But Google states that
    it will not ignore robots.txt.
    Robots.txt is used mainly to deny access, so even if you
    put a URL in your Google sitemap, it will obey robots.txt
    and not index it. (Don't know if it will mess up your
    relevancy count, though.)

