File / zip file extractor

  • Chris Bamford

    Chris Bamford - 2010-06-25


    I was wondering what the most efficient way of doing the following would be with Aperture:

    My app is handed a file and is asked to extract just the text from it.  From what I have seen, this is quite straightforward.  However, if that file contains other files (zip, tar, etc), it needs to correctly handle that, too - i.e. recursively unpack and extract.
    Those are my only 2 use cases - plain or container files.

    I have looked at the example filesystem crawler, but it seems too heavy for what I want - is that correct?  If so, could I please have some simple guidelines on the API calls I need to make?


    - Chris

  • Antoni Mylka

    Antoni Mylka - 2010-06-25

    Well, it is a little heavy indeed.

    Assuming you have an InputStream, you'd need to write code that detects the MIME type, applies the zip subcrawler, and passes in an appropriate instance of SubCrawlerHandler that does the recursive invocation for every DataObject supplied.
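
    As a rough illustration of the detection step only, here is a minimal sketch that sniffs the zip signature using plain JDK classes. It is not Aperture code: ContainerSniffer and looksLikeZip are hypothetical names, and a real MIME type identifier would look at far more than four bytes.

```java
import java.io.BufferedInputStream;
import java.io.IOException;

// Hypothetical helper, not part of Aperture.
class ContainerSniffer {
    // Local file header signature that starts a non-empty zip archive: 'P' 'K' 0x03 0x04.
    // (An empty archive starts with a different signature and is not recognised here.)
    private static final int[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    /**
     * Returns true if the stream begins with the zip signature.
     * The stream is rewound afterwards, so the caller can still read it from the start.
     */
    static boolean looksLikeZip(BufferedInputStream in) throws IOException {
        in.mark(ZIP_MAGIC.length);
        try {
            for (int expected : ZIP_MAGIC) {
                if (in.read() != expected) {
                    return false;
                }
            }
            return true;
        } finally {
            in.reset();
        }
    }
}
```

    Because the stream is marked and reset, whatever handles it next (an extractor or a subcrawler) sees the full content.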

    We haven't included this in ApertureRuntime class for two reasons:

    1. ApertureRuntime was needed fast :)
    2. Such a "deep" extractFrom(InputStream stream) method would still require some sort of callback interface to let you get at the fulltext and/or metadata. People have vastly different needs regarding what to do with them, e.g. to process zips, but not bz2 files, but only if their names begin with "z", etc.

    What exactly do you need to do with the fulltext?

    Would a callback mechanism like

    public interface SimpleTextProcessor {
        public void process(URI id, String text);
    }

    be sufficient for your needs?

    - no metadata
    - no incremental crawling
    - no fine-tuning of what is to be crawled and what isn't
    - no additional processing of the byte streams (e.g. MD5 hashes, or file cache creation)
    - no way to add additional Aperture components to the mix (your own custom extractors/subcrawlers).

    If you don't need all of this, I may try to whip up something along the lines of

    SomeApertureUtilClass.crawlStream(InputStream stream, SimpleTextProcessor processor)

    that would do the same as ApertureRuntime.extractFrom but include subcrawlers in the mix and just pass the fulltext to the supplied callback instance.
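
    To make the shape of such a helper concrete, here is a minimal sketch built on java.util.zip alone. It is not the Aperture implementation: StreamCrawlSketch is a hypothetical name, the extra id parameter is an addition so that the callback has something to report, only zip containers are unpacked, and anything else is assumed to be UTF-8 text where a real extractor would be chosen by MIME type.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URISyntaxException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Callback shaped like the one proposed in this thread.
interface SimpleTextProcessor {
    void process(URI id, String text);
}

// Hypothetical sketch, not part of Aperture.
class StreamCrawlSketch {

    /** Passes the text of a plain file to the callback, or recurses into a zip. */
    static void crawlStream(URI id, InputStream stream, SimpleTextProcessor processor)
            throws IOException {
        byte[] data = readFully(stream);
        if (!startsWithZipSignature(data)) {
            // Assumption: non-container content is UTF-8 text, passed through unchanged.
            processor.process(id, new String(data, StandardCharsets.UTF_8));
            return;
        }
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(data))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                // Reading from 'zip' stops at the end of the current entry,
                // so the same method handles nested archives.
                crawlStream(childId(id, entry.getName()), zip, processor);
            }
        }
    }

    // Builds an illustrative "zip:<parent>!/<entry>" identifier; the scheme is made up.
    private static URI childId(URI parent, String entryName) throws IOException {
        try {
            return new URI("zip", parent + "!/" + entryName, null);
        } catch (URISyntaxException e) {
            throw new IOException("Cannot build an id for entry " + entryName, e);
        }
    }

    private static boolean startsWithZipSignature(byte[] data) {
        return data.length >= 4
                && data[0] == 'P' && data[1] == 'K' && data[2] == 3 && data[3] == 4;
    }

    private static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }
}
```

    The sketch buffers each file in memory, which keeps the recursion simple but would not suit very large archives; the caller remains responsible for closing the stream it passes in.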

    I'd rather not include it in aperture-core though, at least not until we get more feedback from the community. Maybe such a simple tool would be enough for more people though…

  • Chris Bamford

    Chris Bamford - 2010-06-28

    Hi Mylka,

    What you are proposing sounds ideal for our purposes. We intend to use the full text for both indexing and document previewing, so if the whitespace could be preserved, all the better.
    From a functionality point-of-view, the more filetypes (including container types) it can handle, the better.  I was impressed by in this regard.


    - Chris

  • Chris Bamford

    Chris Bamford - 2010-07-01

    Hi again Mylka,

    I was wondering whether you have agreed to look at the crawlStream code and, if so, what a likely timescale might be?  I'm not hassling; I would just like to understand so I can plan my own work schedule  :-)

    Thanks again

    - Chris

