The WebCrawler should honor robots.txt when crawling websites.
I looked a little into Nutch's robots code; it seems like it would be fairly easy to pull out and reuse, so there's probably no need to implement this from scratch.
Could you tell me which classes you mean? The robot-handling code seems to have undergone considerable refactoring between the last release and the current trunk, so I can't figure out where to look. I also don't know how to convert a stream from the robots.txt into some object that would have an isAllowed method.
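For anyone hitting the same question: the code below is only an illustration of the general shape such an object could have, not the Nutch classes; all names here are invented for the example. It collects the Disallow prefixes from the user-agent group that matches, and exposes them through an isAllowed method.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only, not the Nutch parser.
public class SimpleRobotRules {

    private final List<String> disallowedPrefixes = new ArrayList<String>();

    // Parses the stream and keeps the Disallow lines from the group that
    // matches the given user agent (or from the "*" group).
    public static SimpleRobotRules parse(InputStream in, String userAgent)
            throws IOException {
        SimpleRobotRules rules = new SimpleRobotRules();
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, "UTF-8"));
        boolean groupApplies = false;
        String line;
        while ((line = reader.readLine()) != null) {
            line = line.trim();
            if (line.length() == 0 || line.startsWith("#")) {
                continue;
            }
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue;
            }
            String field = line.substring(0, colon).trim().toLowerCase();
            String value = line.substring(colon + 1).trim();
            if ("user-agent".equals(field)) {
                groupApplies = "*".equals(value)
                        || userAgent.toLowerCase().indexOf(value.toLowerCase()) >= 0;
            } else if ("disallow".equals(field) && groupApplies && value.length() > 0) {
                rules.disallowedPrefixes.add(value);
            }
        }
        return rules;
    }

    // A path is disallowed if it starts with any recorded Disallow prefix.
    public boolean isAllowed(String path) {
        for (String prefix : disallowedPrefixes) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}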
I started adapting the RobotRulesParser from Nutch for Aperture; I believe I got it from the trunk. I asked a couple of questions on the list, but I'm not sure how to proceed. Once I figure them out, I can put up a patch.
Start of robots.txt-honoring support, using Nutch's parser
File Added: robots.patch
I attached a _draft_ _START_ of an attempt at hooking this in. It is by no means complete; I'm mostly posting it to get it out of my source tree and maybe give someone else a starting point. It uses Nutch's robots code, modified for Aperture, so the appropriate NOTICE/credit would need to be given per the Apache license in order to include this in Aperture.
The code does not hook into the WebCrawler yet.
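As a rough idea of what that hook could eventually look like: the sketch below fetches a site's robots.txt and checks a URL against it before downloading. Aperture's actual WebCrawler and the classes in the patch may differ; SimpleRobotRules is the illustrative parser sketched earlier, and a real implementation would want to cache the rules per host instead of re-fetching robots.txt for every URL.

import java.io.InputStream;
import java.net.URL;

// Hypothetical hook, not the code in robots.patch.
public class RobotsGate {

    private final String userAgent;

    public RobotsGate(String userAgent) {
        this.userAgent = userAgent;
    }

    // Downloads <scheme>://<host>/robots.txt and tests whether 'url' may be crawled.
    public boolean mayFetch(URL url) {
        try {
            URL robotsUrl = new URL(url.getProtocol(), url.getHost(),
                    url.getPort(), "/robots.txt");
            InputStream in = robotsUrl.openStream();
            try {
                SimpleRobotRules rules = SimpleRobotRules.parse(in, userAgent);
                return rules.isAllowed(url.getPath());
            } finally {
                in.close();
            }
        } catch (Exception e) {
            // A missing or unreadable robots.txt is commonly treated as "allow all".
            return true;
        }
    }
}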
Had a look at the patch. It's a little barebones. What would really help is an idea of how to test this setup: a suite of tests for the RobotRulesParser itself, and a way to test whether or not the HTTP accessor actually uses the rules correctly (via some mock socket factory or the HttpClient library).
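To illustrate the parser-level tests I mean, here is a sketch in JUnit 3 style. It exercises the illustrative SimpleRobotRules class from above; the real test class would of course target the adapted Nutch parser in the patch, and the accessor-level test would wrap it in a mocked HTTP layer.

import java.io.ByteArrayInputStream;
import junit.framework.TestCase;

// Sketch of a parser test; names are illustrative only.
public class SimpleRobotRulesTest extends TestCase {

    public void testDisallowedPath() throws Exception {
        String robotsTxt =
                "User-agent: *\n" +
                "Disallow: /private/\n";
        SimpleRobotRules rules = SimpleRobotRules.parse(
                new ByteArrayInputStream(robotsTxt.getBytes("UTF-8")),
                "ApertureCrawler");
        assertFalse(rules.isAllowed("/private/secret.html"));
        assertTrue(rules.isAllowed("/public/index.html"));
    }
}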