From: Christiaan F. <chr...@ad...> - 2008-04-17 10:45:43
Grant Ingersoll wrote:
> I know the WebCrawler isn't really designed for industrial strength
> crawling, but I do have a few questions:
>
> Does the WebCrawler honor robots.txt?

At the moment robots.txt is ignored, but adding an option to adhere to it makes sense. I wonder how much of the contents of a robots.txt file can be mapped onto a DomainBoundaries. The strategy would then be to first download robots.txt, modify the current DomainBoundaries accordingly and continue crawling with the root URLs (see the sketch after this message). Grant: could you make a Feature Request ticket for this in Aperture's tracker?

> How about general crawling etiquette like not hitting a site too often?

I think this is the responsibility of the integrator using the WebCrawler. It would be an interesting thing to discuss, though: what exactly do we see as the scope of Aperture? I would instinctively expect such features in a project like Nutch, which delivers a ready-to-use crawler aimed at large-scale web crawling, but much less so in Aperture, whose focus is more on providing crawling and extraction building blocks that support a wide range of use cases.

Kind regards,
Chris
--
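For illustration, here is a minimal sketch of the strategy described above: fetch a site's robots.txt, collect the Disallow rules for the generic "*" user-agent, and turn them into URL prefixes that could then be registered as exclude patterns on the crawl's DomainBoundaries. The parsing is deliberately simplistic (no wildcard or Allow handling), and the actual DomainBoundaries call for adding exclude patterns is only mentioned in a comment, because its exact method name is not confirmed here.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch only: translate a site's robots.txt Disallow rules (for the "*"
 * user-agent) into URL prefixes that a WebCrawler could exclude.
 */
public class RobotsTxtSketch {

    /** Returns the disallowed URL prefixes for the generic "*" user-agent. */
    public static List<String> disallowedPrefixes(String rootUrl) throws Exception {
        URL robots = new URL(new URL(rootUrl), "/robots.txt");
        List<String> prefixes = new ArrayList<String>();
        boolean appliesToUs = false;

        BufferedReader in = new BufferedReader(
                new InputStreamReader(robots.openStream(), "UTF-8"));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                // strip comments and surrounding whitespace
                int hash = line.indexOf('#');
                if (hash >= 0) {
                    line = line.substring(0, hash);
                }
                line = line.trim();
                if (line.length() == 0) {
                    continue;
                }
                String lower = line.toLowerCase();
                if (lower.startsWith("user-agent:")) {
                    String agent = line.substring("user-agent:".length()).trim();
                    appliesToUs = agent.equals("*");
                } else if (appliesToUs && lower.startsWith("disallow:")) {
                    String path = line.substring("disallow:".length()).trim();
                    if (path.length() > 0) {
                        // e.g. root "http://example.org" + "/private/" -> exclude prefix
                        prefixes.add(new URL(new URL(rootUrl), path).toString());
                    }
                }
            }
        } finally {
            in.close();
        }
        return prefixes;
    }

    public static void main(String[] args) throws Exception {
        // Each prefix would then be added to the crawl's DomainBoundaries as an
        // exclude pattern (exact Aperture call omitted; not confirmed here)
        // before crawling continues with the root URLs.
        for (String prefix : disallowedPrefixes("http://example.org/")) {
            System.out.println("exclude: " + prefix);
        }
    }
}
```

The reason the mapping seems plausible is that both sides express crawl scope the same way: robots.txt Disallow rules and DomainBoundaries patterns are both include/exclude constraints over URLs, so the translation is mostly a matter of prefixing each Disallow path with the site root.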