#289 user_agent not used for parsing robots.txt

Milestone: resolved
Status: closed-works-for-me
Labels: htdig (103)
Priority: 5
Updated: 2010-10-17
Created: 2010-10-17
Creator: mikes123
Private: No

If you add a user_agent: setting to the config, that string is used for HTTP GET requests. However, it is NOT used when parsing the returned robots.txt.

If I set "user_agent: agent-foo" in htdig.conf, and robots.txt to:
User-agent: agent-foo
Disallow:
User-agent: *
Disallow: /

...then htdig will fetch robots.txt over HTTP using the user-agent string "agent-foo". However, it will parse robots.txt "using myname = htdig". It therefore thinks it is not allowed to dig.

I realize that robots.txt doesn't provide real security. But, this provides some means of allowing indexing only locally, since a unique user-agent string can be created.
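
The mismatch described in this report can be illustrated outside htdig. The following is a minimal sketch in Python, using the standard library's urllib.robotparser as a stand-in for htdig's own robots.txt parser, which may differ in detail. The helper name allowed_for is hypothetical and exists only for this example. It shows that the outcome depends entirely on which name the crawler uses to match itself against the User-agent lines, regardless of the string it sends in the HTTP request.

```python
from urllib.robotparser import RobotFileParser

# robots.txt content taken from the report above.
ROBOTS_TXT = """\
User-agent: agent-foo
Disallow:
User-agent: *
Disallow: /
"""


def allowed_for(name: str, path: str = "/") -> bool:
    """Return whether a crawler that identifies itself as `name` when
    matching robots.txt records may fetch `path`.

    Hypothetical helper for illustration; this is not htdig code.
    """
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(name, path)


if __name__ == "__main__":
    # Compare the custom name with htdig's default name.
    for name in ("agent-foo", "htdig"):
        print(name, allowed_for(name))
```

With this robots.txt, matching as "agent-foo" selects the first record, whose empty Disallow permits everything, while matching as "htdig" falls through to the "*" record, which disallows the whole site.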

Discussion

  • Gilles Detillieux

    If you set...

    robotstxt_name: agent-foo

    in your htdig.conf it should work as you want it to. htdig uses the robotstxt_name attribute rather than the user_agent attribute for determining what its own name is for purposes of parsing the robots.txt file. The user_agent attribute is for what it reports to the HTTP servers, which can be different if you want it to be. See http://www.htdig.org/attrs.html#robotstxt_name and http://www.htdig.org/attrs.html#user_agent for details on both of these.

     
  • Gilles Detillieux

    • milestone: --> resolved
    • assigned_to: nobody --> grdetil
    • status: open --> closed-works-for-me
     
