
SubCrawlers

A SubCrawler reads an InputStream and produces a stream of DataObjects representing the resources found inside it.
An AccessData instance can optionally be passed to a SubCrawler, enabling incremental crawling, i.e. scanning the stream and reporting only the differences since the last crawl.
SubCrawler JavaDoc
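
The idea behind incremental crawling can be illustrated with a small, self-contained sketch. It is not Aperture's AccessData API: the class IncrementalScanSketch, its scan() method and the string "fingerprints" (which could be a date or checksum) are hypothetical names chosen for illustration. The sketch remembers what was seen in the previous crawl and labels each entry of the current crawl accordingly.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the bookkeeping that AccessData makes possible.
final class IncrementalScanSketch {
    // URI -> fingerprint (e.g. a checksum) remembered from the previous crawl.
    private final Map<String, String> previous = new HashMap<>();

    /** Compares the current entries with the previous crawl; returns one label per URI. */
    Map<String, String> scan(Map<String, String> current) {
        Map<String, String> report = new TreeMap<>();
        for (Map.Entry<String, String> entry : current.entrySet()) {
            String old = previous.get(entry.getKey());
            if (old == null) {
                report.put(entry.getKey(), "new");
            } else if (!old.equals(entry.getValue())) {
                report.put(entry.getKey(), "changed");
            } else {
                report.put(entry.getKey(), "unchanged");
            }
        }
        // Anything remembered from last time but missing now has been removed.
        for (String uri : previous.keySet()) {
            if (!current.containsKey(uri)) {
                report.put(uri, "removed");
            }
        }
        // Remember the current state for the next crawl.
        previous.clear();
        previous.putAll(current);
        return report;
    }
}
```

Calling scan() twice on the same instance, first with the entries of the initial crawl and then with those of a later crawl, yields "new" for everything on the first call and only the actual differences (plus "unchanged" markers) on the second.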

Available SubCrawlers

[keep in alphabetical order]

SubCrawler        Dependencies                    Remarks
BZip2SubCrawler   org.apache.tools.bzip2          A SubCrawler implementation for BZIP2 archives.
GZipSubCrawler    java.util.zip.GZIPInputStream   A SubCrawler implementation for GZIP archives.
MimeSubCrawler    javax.mail                      A SubCrawler implementation for message/rfc822-style messages.
TarSubCrawler     org.apache.tools.tar            A SubCrawler implementation for tar archives.
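
To make the "stream in, contained resource out" idea concrete, here is a minimal sketch of the kind of work a GZIP-oriented SubCrawler has to do, using only java.util.zip from the JDK. It is not the GZipSubCrawler implementation: the class GzipUnwrapSketch and its methods unwrap() and gzip() are hypothetical, and no Aperture types are involved.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration; not part of Aperture.
final class GzipUnwrapSketch {
    /** Reads the decompressed content found "inside" a GZIP stream. */
    static byte[] unwrap(InputStream compressed) throws IOException {
        try (GZIPInputStream in = new GZIPInputStream(compressed);
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buffer = new byte[4096];
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
            return out.toByteArray();
        }
    }

    /** Helper for the demo: compresses bytes so there is something to unwrap. */
    static byte[] gzip(byte[] plain) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(plain);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] packed = gzip("hello".getBytes(StandardCharsets.UTF_8));
        byte[] inner = unwrap(new ByteArrayInputStream(packed));
        System.out.println(new String(inner, StandardCharsets.UTF_8));
    }
}
```

A real SubCrawler would additionally wrap the decompressed content in a DataObject and report it to the handler, which this sketch leaves out.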

URIs generated by SubCrawlers

The URIs of data objects found inside other data objects have a fixed form consisting of three parts:

 <prefix>:<parent-object-uri>!/<path>
  • <prefix> - the URI prefix, specific to a particular SubCrawler and returned by the SubCrawlerFactory.getUriPrefix() method
  • <parent-object-uri> - the URI of the parent data object, obtained by calling RDFContainer.getDescribedUri() on the parentMetadata parameter of the subCrawl() method
  • <path> - the internal path of the 'child' data object inside the 'parent' data object
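
As a rough illustration of the three-part form above, the following sketch assembles such a URI from plain strings. The class SubCrawlerUris and the method childUri() are hypothetical helpers, not part of Aperture, and the prefix values "zip", "tar" and "gzip" are assumed for the example; the real prefix is whatever SubCrawlerFactory.getUriPrefix() returns.

```java
// Hypothetical helper; not part of Aperture.
final class SubCrawlerUris {
    private SubCrawlerUris() {}

    /** Builds <prefix>:<parent-object-uri>!/<path> from its three parts. */
    static String childUri(String prefix, String parentUri, String path) {
        // Drop a leading slash so the "!/" separator is not followed by a second slash.
        String relative = path.startsWith("/") ? path.substring(1) : path;
        return prefix + ":" + parentUri + "!/" + relative;
    }

    public static void main(String[] args) {
        // Prefix "zip" is an assumed value for illustration.
        System.out.println(childUri("zip", "file:///tmp/docs.zip", "reports/q1.txt"));
        // prints: zip:file:///tmp/docs.zip!/reports/q1.txt

        // Nesting follows naturally: the parent URI may itself be a SubCrawler URI.
        String tarUri = childUri("gzip", "file:///tmp/a.tar.gz", "a.tar");
        System.out.println(childUri("tar", tarUri, "readme.txt"));
        // prints: tar:gzip:file:///tmp/a.tar.gz!/a.tar!/readme.txt
    }
}
```

Because the parent URI is embedded verbatim, a child URI always reveals the full chain of containers it was found in.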

This scheme was inspired by the Apache Commons VFS project, hosted at http://commons.apache.org/vfs

Example

// Minimal code showing how subcrawling may work.
// Subcrawlers are usually invoked from CrawlerHandlers
ZipSubCrawler subCrawler = new ZipSubCrawler();
InputStream stream = new FileInputStream(filename);
RDFContainer parentMetadata = rdfcontainerFactory.newInstance("myuri");
subCrawler.subCrawl(null, stream, handler, null, null, null, null, parentMetadata);

Usually, SubCrawlers are invoked by a CrawlerHandler. See BaseCrawlerHandler for the correct way to do this:

SubCrawlerFactory subcrawlerfactory = (SubCrawlerFactory)sub;
SubCrawler subcrawler = subcrawlerfactory.get();
// Hand over control to the crawler again - the thread will return after the subcrawler is finished.
crawler.runSubCrawler(subcrawler, dataObject, bufferedStream, null, mimeType);