#26 WebCrawler should follow robots.txt

Status: open
Owner: Antoni Mylka
Labels: general (27)
Priority: 5
Updated: 2008-09-28
Created: 2008-04-17
Creator: Grant Ingersoll
Private: No

The WebCrawler should honor robots.txt when crawling websites.

Discussion

    • milestone: --> 533943
     
  • Logged In: YES
    user_id=1917080
    Originator: YES

    I looked a little into Nutch's robots code. It seems like it would be pretty easy to pull out and use, so there is probably no need to implement this from scratch.

     
  • Antoni Mylka
    2008-08-12

    • assigned_to: nobody --> mylka
     
  • Antoni Mylka
    2008-08-12

    Logged In: YES
    user_id=1613065
    Originator: NO

    Could you tell me which classes you mean? The robot-handling code seems to have undergone considerable refactoring between the last release and the current trunk, and I can't figure out where to look. How do I convert a stream from robots.txt into an object that has an isAllowed method?
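
To make the question concrete, here is a minimal sketch of turning a robots.txt stream into an object with an isAllowed method. It is not Nutch's RobotRulesParser and not Aperture code: the class name SimpleRobotRules, the parse signature, and the matching policy (longest matching path prefix wins, with the "*" group used only when no group names the agent) are assumptions made for illustration. Wildcards inside paths and Crawl-delay are ignored.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical holder for the Allow/Disallow rules that apply to one agent. */
final class SimpleRobotRules {
    /** A path prefix plus whether it allows or disallows matching paths. */
    private static final class Rule {
        final String prefix;
        final boolean allow;
        Rule(String prefix, boolean allow) { this.prefix = prefix; this.allow = allow; }
    }

    private final List<Rule> rules = new ArrayList<>();

    /** Longest matching prefix wins; a path with no matching rule is allowed. */
    boolean isAllowed(String path) {
        Rule best = null;
        for (Rule r : rules) {
            if (path.startsWith(r.prefix)
                    && (best == null || r.prefix.length() > best.prefix.length())) {
                best = r;
            }
        }
        return best == null || best.allow;
    }

    /** Reads robots.txt and keeps the group naming agentName, else the "*" group. */
    static SimpleRobotRules parse(InputStream in, String agentName) {
        SimpleRobotRules specific = new SimpleRobotRules();
        SimpleRobotRules wildcard = new SimpleRobotRules();
        boolean sawSpecific = false;
        boolean matchesSpecific = false;
        boolean matchesWildcard = false;
        boolean lastWasAgentLine = false;
        String agent = agentName.toLowerCase(Locale.ROOT);

        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                int hash = line.indexOf('#');          // strip trailing comments
                if (hash >= 0) line = line.substring(0, hash);
                int colon = line.indexOf(':');
                if (colon < 0) continue;               // blank or malformed line
                String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
                String value = line.substring(colon + 1).trim();

                if (field.equals("user-agent")) {
                    // A User-agent line after rule lines starts a new group.
                    if (!lastWasAgentLine) {
                        matchesSpecific = false;
                        matchesWildcard = false;
                    }
                    String token = value.toLowerCase(Locale.ROOT);
                    if (token.equals("*")) {
                        matchesWildcard = true;
                    } else if (!token.isEmpty() && agent.contains(token)) {
                        matchesSpecific = true;
                        sawSpecific = true;
                    }
                    lastWasAgentLine = true;
                } else if (field.equals("disallow") || field.equals("allow")) {
                    lastWasAgentLine = false;
                    if (value.isEmpty()) continue;     // empty value restricts nothing
                    Rule rule = new Rule(value, field.equals("allow"));
                    if (matchesSpecific) specific.rules.add(rule);
                    if (matchesWildcard) wildcard.rules.add(rule);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sawSpecific ? specific : wildcard;
    }
}
```

A crawler would call SimpleRobotRules.parse once per host with the body of /robots.txt, cache the result, and ask isAllowed(path) before each request.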

     
  • Logged In: YES
    user_id=1917080
    Originator: YES

    I started adapting the RobotRulesParser from Nutch for Aperture. I believe I got it from the trunk. I asked a couple of questions on the list and am not yet sure how to proceed. Once I figure them out, I think I can put up a patch.

     
  • Start of robots honoring stuff, using Nutch's parser

     
    Attachments
  • Logged In: YES
    user_id=1917080
    Originator: YES

    File Added: robots.patch

     
  • Logged In: YES
    user_id=1917080
    Originator: YES

    I attached a _draft_ _START_ of an attempt at hooking this in. It is by no means complete; I'm mostly posting it to get it out of my source tree and maybe give someone else a starting point. It uses Nutch's robots code, modified for Aperture. Including it in Aperture would require the appropriate NOTICE/credit, per the Apache license.

    The code does not hook into the WebCrawler yet.

     
  • Antoni Mylka
    2008-09-11

    Had a look at the patch. It's a little barebones. What would really help is an idea of how to test this setup: a suite of tests for the RulesParser itself, and a way to test whether the http accessor actually uses the rules correctly (via some mock socket factory, or the httpclient library).
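
One possible shape for the second kind of test, sketched under assumptions: instead of a mock socket factory, the network layer sits behind a small interface and is replaced by a recording fake. None of these types exist in Aperture or in the patch; PathRules, PageFetcher, PoliteAccessor, and RecordingFetcher are hypothetical names used only to show the idea.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical rule check, for example backed by a parsed robots.txt. */
interface PathRules {
    boolean isAllowed(String path);
}

/** Hypothetical stand-in for the network layer, so tests need no sockets. */
interface PageFetcher {
    String fetch(String path);
}

/** Hypothetical accessor that consults the rules before touching the network. */
final class PoliteAccessor {
    private final PathRules rules;
    private final PageFetcher fetcher;

    PoliteAccessor(PathRules rules, PageFetcher fetcher) {
        this.rules = rules;
        this.fetcher = fetcher;
    }

    /** Returns the page body, or null when the rules forbid the path. */
    String get(String path) {
        if (!rules.isAllowed(path)) {
            return null;
        }
        return fetcher.fetch(path);
    }
}

/** Test double that records every path it was asked to fetch. */
final class RecordingFetcher implements PageFetcher {
    final List<String> requested = new ArrayList<>();

    @Override
    public String fetch(String path) {
        requested.add(path);
        return "body of " + path;
    }
}
```

A test would build a PoliteAccessor with a fixed rule set and a RecordingFetcher, request one forbidden and one permitted path, and then check that the fake's requested list contains only the permitted one. That verifies the accessor never issues a request for a disallowed path, which is the behavior robots.txt support has to guarantee.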

     
  • Antoni Mylka
    2008-09-28

    • milestone: 533943 -->