The SubCrawler API needs DataAccessor-like functionality: given an InputStream, a SubCrawler able to process that InputStream, and the internal path of a child DataObject created earlier, you should be able to reconstruct that DataObject.
Attached is the first version of the patch that solves the issue... kind of. It provides a trivial default implementation in the AbstractSubCrawler class; subclasses are free to apply any optimizations they deem appropriate. It's basically commit-ready, and there are some JUnit tests that pass.
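The trivial default approach could be sketched roughly as follows. This is a minimal sketch with stand-in types: the class and method names mirror the discussion above, but the signatures and the toy "archive" format are illustrative assumptions, not the actual Aperture API or the code in the patch. The idea is simply to sub-crawl the whole stream and return the child whose internal path matches, leaving subclasses free to seek directly to the requested entry instead.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch with stand-in types; signatures are illustrative
// assumptions, not the actual Aperture API.
public class DefaultAccessorSketch {

    /** Stand-in for a crawled child resource. */
    public record DataObject(String internalPath, String content) {}

    /** Stand-in for a SubCrawler that can enumerate all children of a stream. */
    public interface SubCrawler {
        List<DataObject> subCrawl(String streamContent);
    }

    /**
     * Trivial default accessor: sub-crawl everything and return the child
     * whose internal path matches. Subclasses can override this with a
     * smarter implementation that seeks directly to the requested entry.
     */
    public static DataObject getDataObject(SubCrawler crawler,
                                           String streamContent,
                                           String internalPath) {
        for (DataObject child : crawler.subCrawl(streamContent)) {
            if (child.internalPath().equals(internalPath)) {
                return child;
            }
        }
        return null; // no child with that path
    }

    public static void main(String[] args) {
        // A toy "archive" format: one path=content pair per line.
        SubCrawler toyCrawler = stream -> {
            List<DataObject> children = new ArrayList<>();
            for (String line : stream.split("\n")) {
                String[] parts = line.split("=", 2);
                children.add(new DataObject(parts[0], parts[1]));
            }
            return children;
        };
        DataObject obj = getDataObject(toyCrawler, "a.txt=hello\nb.txt=world", "b.txt");
        System.out.println(obj.content()); // prints "world"
    }
}
```

The inefficiency is obvious (the whole stream is crawled even when only one child is needed), which is exactly why subclasses would want to optimize.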
The most important things missing from this patch are the utility methods for dealing with sub-resource URIs. They will come in due course.
File Added: aperture-sf2154832-ver1.patch
This should be within the 1.2.0.beta group.
A patch + two test docs that should go into the docs folder.
Attached is the second version of the patch. The highlights include:
* The SubCrawlerUtil class, which contains utility methods for dealing with sub-crawled URIs.
* The most important of these is getDataObject, a generic accessor method that works with URIs of arbitrary nesting.
* More intelligent getDataObject implementations for the archivers (tar, zip) that skip irrelevant archive entries, choose the right one, and ensure that the archive stream is closed when the data object is disposed (which wasn't the case before).
* The compressors now also close the stream only when the data object is disposed.
* Some unit tests for SubCrawlerUtil.
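The generic accessor idea — peel one level of nesting at a time and keep the underlying stream alive until the returned object is disposed — could be sketched as below. Everything here is an illustrative assumption rather than the code in the patch: the `!/` path separator, the toy container type standing in for an archive, and the dispose flag are all made up for the sketch.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Map;

// Illustrative sketch only: toy types standing in for the real
// SubCrawlerUtil.getDataObject; names and the "!/" separator are assumptions.
public class NestedAccessorSketch {

    /** Stand-in DataObject whose dispose() closes the backing stream. */
    public static class DataObject {
        private final InputStream stream;
        public boolean disposed = false;

        public DataObject(InputStream stream) { this.stream = stream; }

        /** Reads the whole content; convenience for the demo. */
        public String readContent() {
            try {
                return new String(stream.readAllBytes());
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }

        /** The stream is closed only here, when the object is disposed. */
        public void dispose() {
            try {
                stream.close();
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
            disposed = true;
        }
    }

    /**
     * Generic accessor over a toy nested container (entry name -> byte[] leaf
     * or another container), resolving paths of arbitrary nesting depth such
     * as "outer.zip!/inner.tar!/file.txt" one level at a time.
     */
    @SuppressWarnings("unchecked")
    public static DataObject getDataObject(Map<String, Object> container, String nestedPath) {
        Object current = container;
        for (String segment : nestedPath.split("!/")) {
            if (!(current instanceof Map)) {
                return null; // tried to descend into a leaf
            }
            current = ((Map<String, Object>) current).get(segment);
            if (current == null) {
                return null; // no such entry at this level
            }
        }
        // Hand the stream to the DataObject; it stays open until dispose().
        return new DataObject(new ByteArrayInputStream((byte[]) current));
    }

    public static void main(String[] args) {
        Map<String, Object> inner = Map.of("file.txt", (Object) "hello".getBytes());
        Map<String, Object> outer = Map.of("nested.tar", (Object) inner);

        DataObject obj = getDataObject(outer, "nested.tar!/file.txt");
        System.out.println(obj.readContent()); // prints "hello"
        obj.dispose();                         // only now is the stream closed
    }
}
```

Deferring the close to dispose() is the point of the second and third bullets: an archive entry's stream is only a view on the archive stream, so closing the archive eagerly would invalidate a DataObject that is still in use.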
The integration test for the SubCrawlers seems to work. I didn't do any extensive testing; I submitted the patch as soon as the bar turned green.
The attached file contains the patch and two files that should go into the docs folder.
File Added: aperture-sf2154832-ver2.zip
The reviewers didn't make it in time for the 1.2.0 release, so I'm postponing this issue.