From: Christiaan F. <chr...@ad...> - 2007-01-29 10:49:18
Leo Sauermann wrote:
> well, sftp is different from http, because of the clear user/password
> you need to login, and also crawling is trivial, just traverse all
> directories. Whereas the web crawler does interpret the HTML files
> and extracts links to find more for crawling
>
> I would look into the filesystem crawler, as a reference, and start
> from scratch.

I agree with this. Let me provide some more details to help you get the big picture.

First, you will probably need to write an implementation of the DataAccessor interface. A DataAccessor is responsible for retrieving a single object, such as a file or a web page. It is protocol-specific, so you'd have to write one for (s)ftp. The DataAccessor also embeds all the logic needed for incremental crawling, i.e. detecting whether a resource has been modified since the last crawl. For this purpose it is allowed to store any information it needs in the AccessData that it is (optionally) provided with.

From there on, it depends on the nature of your source whether to use a WebCrawler- or FileSystemCrawler-like crawler. If your source contains a hypertext graph, I expect that the current WebCrawler will do just fine, as it is completely protocol-independent: it only contains logic for traversing a hypertext graph and lets the DataAccessor(s) handle the messy protocol-specific details. If it turns out that this is not entirely the case, I'd certainly like to hear about it ;)

When you crawl a hierarchy of FTP folders, the FileSystemCrawler is the class to look at. Unfortunately for you, the FileSystemCrawler does not completely abstract from the protocol it uses: it still relies on java.io.File as an abstraction of the files to crawl, which (to the best of my knowledge) cannot be used to represent FTP resources. Also it contains logic for e.g.
handling MacOS X file systems in a more natural way, and it lacks code for dealing with unreliable connections (although that would be a welcome addition, given our experience with certain network file systems). You can still copy the FileSystemCrawler source, though, and adapt it to work for FTP. That would even be a welcome contribution to the Aperture project.

This reminds me that we should take a look at the Commons VFS project; perhaps it can help us improve the FileSystemCrawler so that it becomes able to represent other types of file system hierarchies.

Regards,

Chris

--
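To make the incremental-crawling part concrete, here is a small sketch of the timestamp check a DataAccessor can perform. Note that this is not Aperture's actual AccessData API: the plain Map and the method name isNewOrModified are illustrative stand-ins only.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the incremental-crawling check a DataAccessor performs.
// NOTE: not Aperture's real AccessData API; the plain Map and the
// method name isNewOrModified are illustrative stand-ins.
public class IncrementalCheck {

    // Stand-in for the AccessData the crawler (optionally) provides:
    // maps a resource URL to the last-modified time seen at the last crawl.
    private final Map<String, Long> accessData = new HashMap<>();

    /**
     * Returns true when the resource is new or its timestamp has changed
     * since the previous crawl, and records the new timestamp.
     */
    public boolean isNewOrModified(String url, long lastModified) {
        Long previous = accessData.get(url);
        if (previous != null && previous == lastModified) {
            return false; // unchanged since the last crawl: skip it
        }
        accessData.put(url, lastModified);
        return true;
    }
}
```

An FTP accessor could obtain the last-modified time from the server's directory listing (e.g. FTPFile.getTimestamp() in Commons Net) and fall back to always re-fetching when the server does not report timestamps.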
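As for decoupling the traversal from java.io.File, one option is to hide the hierarchy behind a small interface, which is roughly the role Commons VFS's FileObject plays. The CrawlEntry interface and the entry() helper below are hypothetical, a sketch of the idea rather than Aperture code:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of a FileSystemCrawler-style traversal decoupled from
// java.io.File. CrawlEntry is a hypothetical abstraction (Commons VFS's
// FileObject would be a real-world candidate for this role).
public class HierarchyCrawler {

    interface CrawlEntry {
        String path();
        boolean isFolder();
        List<CrawlEntry> children(); // empty for plain files
    }

    /** Depth-first traversal collecting the paths of all plain files. */
    static List<String> crawl(CrawlEntry root) {
        List<String> files = new ArrayList<>();
        if (root.isFolder()) {
            for (CrawlEntry child : root.children()) {
                files.addAll(crawl(child));
            }
        } else {
            files.add(root.path());
        }
        return files;
    }

    /** In-memory CrawlEntry, so the traversal can be tried without a server. */
    static CrawlEntry entry(String path, boolean folder, CrawlEntry... children) {
        List<CrawlEntry> kids = Arrays.asList(children);
        return new CrawlEntry() {
            public String path() { return path; }
            public boolean isFolder() { return folder; }
            public List<CrawlEntry> children() { return kids; }
        };
    }
}
```

With such an abstraction, the same crawl loop would work whether the entries are backed by java.io.File, an FTP directory listing, or something else entirely.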