From: Christiaan F. <chr...@ad...> - 2007-01-29 10:49:18
Leo Sauermann wrote:
> well, sftp is different from http, because of the clear user/password
> you need to login, and also crawling is trivial, just traverse all
> directories. Whereas the web crawler does interpret the HTML files
> and extracts links to find more for crawling
>
> I would look into the filesystem crawler, as a reference, and start
> from scratch.

I agree with this. Let me provide some more details to help you get the big picture.

First, you will probably need to write an implementation of the DataAccessor interface. A DataAccessor is responsible for retrieving a single object, such as a file or a web page. It is protocol-specific, so you'd have to write one for (s)ftp. The DataAccessor also embeds all the logic needed for incremental crawling, i.e. detecting whether a resource has been modified since the last crawl. For this purpose it is allowed to store any information it needs in the AccessData that it is (optionally) provided with.

From there on, it depends on the nature of your source whether to use a WebCrawler- or FileSystemCrawler-like crawler. If your source contains a hypertext graph, I expect that the current WebCrawler will do just fine, as it is completely protocol-independent: it only contains logic for traversing a hypertext graph and lets the DataAccessor(s) handle the messy protocol-specific details. If it turns out that this is not entirely the case, I'd certainly like to hear about it ;)

When you crawl a hierarchy of FTP folders, the FileSystemCrawler is the class to look at. Unfortunately for you, the FileSystemCrawler does not completely abstract from the protocol it uses: it still relies on java.io.File as an abstraction of the files to crawl, which (to the best of my knowledge) cannot be used to represent FTP resources. Also it contains logic for e.g.
handling MacOS X file systems in a more natural way, and it lacks code for dealing with unreliable connections (although that would be a welcome addition, given our experience with certain network file systems). You can still copy the FileSystemCrawler source, though, and adapt it to work for FTP. That would even be a welcome contribution to the Aperture project.

This reminds me that we should take a look at the Commons VFS project; perhaps it can help us improve the FileSystemCrawler so that it becomes able to represent other types of file system hierarchies.

Regards,

Chris

--
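To make the incremental-crawling part concrete, here is a small sketch of the timestamp check a DataAccessor can perform. Note that this is not Aperture's actual AccessData API: the plain Map and the method name isNewOrModified are illustrative stand-ins only.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the incremental-crawling check a DataAccessor performs.
// NOTE: not Aperture's real AccessData API; the plain Map and the
// method name isNewOrModified are illustrative stand-ins.
public class IncrementalCheck {

    // Stand-in for the AccessData the crawler (optionally) provides:
    // maps a resource URL to the last-modified time seen at the last crawl.
    private final Map<String, Long> accessData = new HashMap<>();

    /**
     * Returns true when the resource is new or its timestamp has changed
     * since the previous crawl, and records the new timestamp.
     */
    public boolean isNewOrModified(String url, long lastModified) {
        Long previous = accessData.get(url);
        if (previous != null && previous == lastModified) {
            return false; // unchanged since the last crawl: skip it
        }
        accessData.put(url, lastModified);
        return true;
    }
}
```

An FTP accessor could obtain the last-modified time from the server's directory listing (e.g. FTPFile.getTimestamp() in Commons Net) and fall back to always re-fetching when the server does not report timestamps.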
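As for decoupling the traversal from java.io.File, one option is to hide the hierarchy behind a small interface, which is roughly the role Commons VFS's FileObject plays. The CrawlEntry interface and the entry() helper below are hypothetical, a sketch of the idea rather than Aperture code:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of a FileSystemCrawler-style traversal decoupled from
// java.io.File. CrawlEntry is a hypothetical abstraction (Commons VFS's
// FileObject would be a real-world candidate for this role).
public class HierarchyCrawler {

    interface CrawlEntry {
        String path();
        boolean isFolder();
        List<CrawlEntry> children(); // empty for plain files
    }

    /** Depth-first traversal collecting the paths of all plain files. */
    static List<String> crawl(CrawlEntry root) {
        List<String> files = new ArrayList<>();
        if (root.isFolder()) {
            for (CrawlEntry child : root.children()) {
                files.addAll(crawl(child));
            }
        } else {
            files.add(root.path());
        }
        return files;
    }

    /** In-memory CrawlEntry, so the traversal can be tried without a server. */
    static CrawlEntry entry(String path, boolean folder, CrawlEntry... children) {
        List<CrawlEntry> kids = Arrays.asList(children);
        return new CrawlEntry() {
            public String path() { return path; }
            public boolean isFolder() { return folder; }
            public List<CrawlEntry> children() { return kids; }
        };
    }
}
```

With such an abstraction, the same crawl loop would work whether the entries are backed by java.io.File, an FTP directory listing, or something else entirely.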