Noticed this problem while using 3.16, and verified the cause by examining the code. The code for the latest beta release, 3.2.0b6, appears to still have the problem.
According to my reading of the relevant RFC, robots.txt files are supposed to be processed by applying the first allow or disallow line that matches a given page. Order matters! For instance, I was trying to use lines like this:
This should have disallowed action= queries in general, but allowed action=index queries. htDig ignores the allow line, and consequently does not index the action=index page.
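The first-match behavior described above can be sketched roughly as follows. This is only an illustration, assuming plain prefix matching: the rules, paths, and the `is_allowed` helper are hypothetical, they are not the lines from the original robots.txt, and this is not htDig's code.

```python
from typing import List, Tuple

# Hypothetical rules for illustration only (not the lines from the report).
# Each entry is (directive, path_prefix), kept in the order it appears in the file.
RULES: List[Tuple[str, str]] = [
    ("allow", "/page?action=index"),
    ("disallow", "/page?action="),
]


def is_allowed(path: str, rules: List[Tuple[str, str]] = RULES) -> bool:
    """Return True if `path` may be fetched under first-match semantics."""
    for directive, prefix in rules:
        # The first rule whose prefix matches decides the outcome.
        if path.startswith(prefix):
            return directive == "allow"
    # No rule matched: the page is allowed by default.
    return True


print(is_allowed("/page?action=index"))  # matched by the allow line first
print(is_allowed("/page?action=edit"))   # falls through to the disallow line
print(is_allowed("/page"))               # matches no rule
```

With the two rules in the opposite order, the disallow line would match first and the action=index page would be blocked, which is why order matters under this reading.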
Looks like a fairly extensive rewrite of robots.txt processing would be necessary to get the correct behavior.