SubCrawlers
A SubCrawler accesses an InputStream and produces a stream of other DataObjects representing the resources found "inside".
An AccessData instance can optionally be passed to a SubCrawler, allowing it to crawl incrementally, i.e. to scan the stream and report only the differences since the last crawl.
SubCrawler JavaDoc
Available SubCrawlers
[keep in alphabetical order]
| SubCrawler | Dependencies | Remarks |
| BZip2SubCrawler | org.apache.tools.bzip2 | A SubCrawler implementation working with BZIP2 archives. |
| GZipSubCrawler | java.util.zip.GZIPInputStream | A SubCrawler implementation working with GZIP archives. |
| MimeSubCrawler | javax.mail | A SubCrawler implementation for message/rfc822-style messages. |
| TarSubCrawler | org.apache.tools.tar | A SubCrawler implementation working with tar archives. |
URIs generated by SubCrawlers
The URIs of data objects found inside other data objects have a fixed form consisting of three basic parts:
<prefix>:<parent-object-uri>!/<path>
- <prefix> - the URI prefix characteristic of a particular SubCrawler, returned by the SubCrawlerFactory.getUriPrefix() method
- <parent-object-uri> - the URI of the parent data object, obtained from the parentMetadata parameter of the subCrawl() method by calling RDFContainer.getDescribedUri()
- <path> - the internal path of the 'child' data object inside the 'parent' data object
This scheme was inspired by the Apache Commons VFS project, hosted at http://commons.apache.org/vfs
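The three-part form described above can be illustrated with a small, self-contained sketch. It only does string handling and does not call any Aperture class. The class name, the method names and the "gzip" and "tar" prefixes are hypothetical; in real code the prefix would come from SubCrawlerFactory.getUriPrefix(). Dropping a leading slash from the path, and splitting at the last "!/", are assumptions of this sketch rather than documented behaviour.

```java
// Sketch only: composes and splits URIs of the form <prefix>:<parent-object-uri>!/<path>.
// The class and method names are hypothetical and not part of the Aperture API.
class SubCrawlerUriSketch {

    /** Joins the three parts; a leading '/' on the path is dropped (an assumption). */
    static String buildChildUri(String prefix, String parentUri, String path) {
        String relativePath = path.startsWith("/") ? path.substring(1) : path;
        return prefix + ":" + parentUri + "!/" + relativePath;
    }

    /** Splits a child URI into { prefix, parentUri, path } for the outermost nesting level. */
    static String[] splitChildUri(String childUri) {
        int colon = childUri.indexOf(':');
        int separator = childUri.lastIndexOf("!/");
        if (colon < 0 || separator < colon) {
            throw new IllegalArgumentException("Not a sub-crawled URI: " + childUri);
        }
        return new String[] {
            childUri.substring(0, colon),
            childUri.substring(colon + 1, separator),
            childUri.substring(separator + 2)
        };
    }

    public static void main(String[] args) {
        // A file inside a tar archive that is itself inside a gzip archive.
        String inner = buildChildUri("gzip", "file:///tmp/a.tar.gz", "a.tar");
        String outer = buildChildUri("tar", inner, "/docs/readme.txt");
        System.out.println(outer);
        System.out.println(String.join(" | ", splitChildUri(outer)));
    }
}
```

Because the parent URI is embedded unchanged, nesting simply stacks prefixes on the left and paths on the right. Splitting at the last "!/" recovers the outermost level only as long as the child path itself does not contain "!/".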
Example
// Minimal code showing how subcrawling may work.
// SubCrawlers are usually invoked from CrawlerHandlers.
// filename, handler and rdfcontainerFactory are assumed to be defined by the caller.
ZipSubCrawler subCrawler = new ZipSubCrawler();
InputStream stream = new FileInputStream(filename);
// The URI passed here becomes the <parent-object-uri> part of the child URIs.
RDFContainer parentMetadata = rdfcontainerFactory.newInstance("myuri");
subCrawler.subCrawl(null, stream, handler, null, null, null, null, parentMetadata);
Usually, SubCrawlers are invoked by a CrawlerHandler. See BaseCrawlerHandler for the correct code:
SubCrawlerFactory subcrawlerfactory = (SubCrawlerFactory)sub;
SubCrawler subcrawler = subcrawlerfactory.get();
// Hand over control to the crawler again - the thread will return after the subcrawler is finished.
crawler.runSubCrawler(subcrawler, dataObject, bufferedStream, null, mimeType);