#286 robots.txt "allow" lines ignored

feature-requests
open
htdig (103)
5
2007-11-09
2007-11-08
Dan Muller
No

Noticed this problem while using 3.16. Verified the cause by examining the code. The code for the latest beta release, 3.2.0b6, appears to still have the problem.

According to my reading of the relevant RFC, robots.txt files are supposed to be processed by applying the first allow or disallow line that applies to a given page. Order matters! For instance, I was trying to use lines like this:

Allow: /index.pl?action=index
Disallow: /index.pl?action=

This should have disallowed action= queries in general, but allowed action=index queries. ht://Dig ignores the allow line, and consequently does not index the action=index page.
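
The first-match rule described above can be sketched as follows. This is only a minimal illustration in Python, not ht://Dig's code: the helper names `parse_rules` and `is_allowed` are hypothetical, User-agent grouping is ignored, and matching is assumed to be a plain prefix comparison.

```python
from typing import Iterable, List, Tuple

# A rule is (directive, path_prefix), with directive "allow" or "disallow".
Rule = Tuple[str, str]


def parse_rules(lines: Iterable[str]) -> List[Rule]:
    """Collect Allow/Disallow lines in file order (hypothetical helper).

    Assumption: all lines belong to one record; User-agent lines are skipped.
    """
    rules: List[Rule] = []
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        field, value = field.strip().lower(), value.strip()
        # An empty value is skipped: an empty Disallow imposes no restriction.
        if field in ("allow", "disallow") and value:
            rules.append((field, value))
    return rules


def is_allowed(rules: List[Rule], path: str) -> bool:
    """Apply the first rule whose prefix matches; allow if none matches."""
    for directive, prefix in rules:
        if path.startswith(prefix):
            return directive == "allow"
    return True


rules = parse_rules([
    "Allow: /index.pl?action=index",
    "Disallow: /index.pl?action=",
])
print(is_allowed(rules, "/index.pl?action=index"))  # first rule matches
print(is_allowed(rules, "/index.pl?action=edit"))   # second rule matches
```

Under these assumptions, swapping the two lines would block action=index as well, because the broader Disallow prefix would match first.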

Looks like a fairly extensive rewrite of robots.txt processing would be necessary to get the correct behavior.

Discussion

    • milestone: --> feature-requests
    • assigned_to: nobody --> grdetil
    • status: open --> pending
     
  • Logged In: YES
    user_id=149687
    Originator: NO

    Hi, Dan. Could you please provide a reference to the relevant RFC? In my searching, I was unable to find it.

    The ietf.org site seems only to have this draft, which expired in 1997, and as far as I can tell, never obtained official RFC status:

    http://tools.ietf.org/html/draft-giudici-web-robots-cntrl-00

    The standard we follow is the one published at http://www.robotstxt.org/ which, as far as I can tell, is the closest thing to an established or official standard. It's the one that's been around the longest and seems to be the most widely followed. Elsewhere, I've found a few examples of extensions to the robots.txt standard, such as this one:

    http://www.conman.org/people/spc/robots2.html

    It and other pages make reference to an RFC draft on the robotstxt.org site, but these links don't seem to work anymore. I'm guessing that this site once had a copy of the proposed draft from IETF, but no longer has it. Google and Yahoo also seem to have made their own extensions to the robots.txt standard, with mixed results. This posting -- http://www.jangro.com/a/2006/12/08/is-google-misreading-robotstxt/ -- suggests that Google, at least as of last December, was misparsing valid robots.txt files in its attempts to extend the standard. As you say, a fairly extensive rewrite of robots.txt processing is needed to get correct behavior, and it seems even Google has had trouble getting it right.

    The author of the existing standard has stated his objections to extensions to the standard here:

    http://www.robotstxt.org/eval.html

    and I'm inclined to side with those objections. Unless you can convince us otherwise, it seems the Allow directive is not part of an official or established standard, and not likely to be implemented well or at all by many or most indexing robots. So, relying on its use in a robots.txt file would be unwise. It would therefore also seem unwise to implement such an extension in ht://Dig, given the complexity involved and the lack of standards surrounding it.

    In any case, this would seem not to be an actual bug, but a feature request, as the code seems to correctly implement the established standard.

     
  • Dan Muller
    2007-11-09

    • status: pending --> open
     
  • Dan Muller
    2007-11-09

    Logged In: YES
    user_id=358502
    Originator: YES

    My bad. I ran across a copy of one of the two old Internet Drafts and misinterpreted it as an RFC, in part because of misinformation on the site that linked to it. Reading too fast, I guess.

    However, I did run across the description of Allow: on numerous sites about robots.txt -- try googling for "robots.txt" and "allow:" together. Some sites treat it as a generally accepted part of the syntax, others mention it as an extension. I was not previously aware of robotstxt.org.

    Given that there is no RFC, I guess you can pick whatever standard you like. :) Your position of following the lead of robotstxt.org is reasonable. However, the "standard" they promulgate is quite limiting. How, for instance, can it accommodate my relatively simple situation?

    In any case, thanks for looking at this, and sorry for the misinformation in my bug report.