From: Christiaan F. <chr...@ad...> - 2008-04-17 10:45:43
Grant Ingersoll wrote:
> I know the WebCrawler isn't really designed for industrial strength
> crawling, but I do have a few questions:
>
> Does the WebCrawler honor robots.txt?

At the moment robots.txt is ignored, but adding an option to adhere to it makes sense. I wonder how much of the contents of a robots.txt file can be mapped onto a DomainBoundaries. The strategy would then be to first download robots.txt, modify the current DomainBoundaries accordingly and continue crawling with the root URLs (see the sketch after this message). Grant: could you make a Feature Request ticket for this in Aperture's tracker?

> How about general crawling etiquette like not hitting a site too often?

I think this is the responsibility of the integrator using the WebCrawler. It would be an interesting thing to discuss, though: what exactly do we see as the scope of Aperture? I would instinctively expect such features in a project like Nutch, which delivers a ready-to-use crawler aimed at large-scale web crawling, but much less so in Aperture, whose focus is more on providing crawling and extraction building blocks that support a wide range of use cases.

Kind regards,
Chris
--
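For illustration, here is a minimal sketch of the strategy described above: fetch a site's robots.txt, collect the Disallow rules for the generic "*" user-agent, and turn them into URL prefixes that could then be registered as exclude patterns on the crawl's DomainBoundaries. The parsing is deliberately simplistic (no wildcard or Allow handling), and the actual DomainBoundaries call for adding exclude patterns is only mentioned in a comment, because its exact method name is not confirmed here.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch only: translate a site's robots.txt Disallow rules (for the "*"
 * user-agent) into URL prefixes that a WebCrawler could exclude.
 */
public class RobotsTxtSketch {

    /** Returns the disallowed URL prefixes for the generic "*" user-agent. */
    public static List<String> disallowedPrefixes(String rootUrl) throws Exception {
        URL robots = new URL(new URL(rootUrl), "/robots.txt");
        List<String> prefixes = new ArrayList<String>();
        boolean appliesToUs = false;

        BufferedReader in = new BufferedReader(
                new InputStreamReader(robots.openStream(), "UTF-8"));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                // strip comments and surrounding whitespace
                int hash = line.indexOf('#');
                if (hash >= 0) {
                    line = line.substring(0, hash);
                }
                line = line.trim();
                if (line.length() == 0) {
                    continue;
                }
                String lower = line.toLowerCase();
                if (lower.startsWith("user-agent:")) {
                    String agent = line.substring("user-agent:".length()).trim();
                    appliesToUs = agent.equals("*");
                } else if (appliesToUs && lower.startsWith("disallow:")) {
                    String path = line.substring("disallow:".length()).trim();
                    if (path.length() > 0) {
                        // e.g. root "http://example.org" + "/private/" -> exclude prefix
                        prefixes.add(new URL(new URL(rootUrl), path).toString());
                    }
                }
            }
        } finally {
            in.close();
        }
        return prefixes;
    }

    public static void main(String[] args) throws Exception {
        // Each prefix would then be added to the crawl's DomainBoundaries as an
        // exclude pattern (exact Aperture call omitted; not confirmed here)
        // before crawling continues with the root URLs.
        for (String prefix : disallowedPrefixes("http://example.org/")) {
            System.out.println("exclude: " + prefix);
        }
    }
}
```

The reason the mapping seems plausible is that both sides express crawl scope the same way: robots.txt Disallow rules and DomainBoundaries patterns are both include/exclude constraints over URLs, so the translation is mostly a matter of prefixing each Disallow path with the site root.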