
#289 user_agent not used for parsing robots.txt

resolved
closed-works-for-me
htdig (103)
5
2010-10-17
2010-10-17
mikes123
No

If you add a user_agent: setting to the config, that string is used for HTTP GETs. However, it is NOT used when parsing the returned robots.txt.

If I set "user_agent: agent-foo" in htdig.conf, and robots.txt to:
User-agent: agent-foo
Disallow:
User-agent: *
Disallow: /

...then htdig will fetch robots.txt using the user-agent string "agent-foo". However, it parses robots.txt using the name "htdig", and therefore concludes it is not allowed to dig.

I realize that robots.txt doesn't provide real security. But it does provide some means of restricting indexing to a local crawler, since a unique user-agent string can be created for it.
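The matching behavior described above can be illustrated with Python's urllib.robotparser (a stand-in for htdig's own parser, used here only to show that the name supplied at parse time, not the one used for the HTTP request, decides the outcome):

```python
from urllib import robotparser

# The robots.txt from the report above
ROBOTS_TXT = """\
User-agent: agent-foo
Disallow:

User-agent: *
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Parsing "as agent-foo" permits everything; parsing "as htdig"
# falls through to the "User-agent: *" record and is denied.
print(rp.can_fetch("agent-foo", "/"))  # True
print(rp.can_fetch("htdig", "/"))      # False
```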

Discussion

  • If you set...

    robotstxt_name: agent-foo

    in your htdig.conf it should work as you want it to. htdig uses the robotstxt_name attribute rather than the user_agent attribute for determining what its own name is for purposes of parsing the robots.txt file. The user_agent attribute is only what it reports to HTTP servers, which can be different if you want it to be. See http://www.htdig.org/attrs.html#robotstxt_name and http://www.htdig.org/attrs.html#user_agent for details on both of these.

     
    • milestone: --> resolved
    • assigned_to: nobody --> grdetil
    • status: open --> closed-works-for-me
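Putting the two attributes together, a minimal htdig.conf sketch for the scenario in the report (the agent string "agent-foo" is just the example name from above) would be:

```
# Name sent in the User-agent header of HTTP requests
user_agent:       agent-foo

# Name matched against User-agent records when parsing robots.txt
robotstxt_name:   agent-foo
```

With both attributes set to the same string, the name htdig announces to the server is also the one it uses to find its record in robots.txt.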