From: Antoni M. <ant...@gm...> - 2011-06-14 13:55:59
On 2011-06-14 15:02, Nick Burch wrote:
> On Tue, 14 Jun 2011, Antoni Mylka wrote:
>> You are right. There is still room for improvement. ZipContainerDetector
>> creates a temp file, which I'd rather avoid
>
> We'll need to buffer the whole file for zip either way. The current way
> will create a temp file if you start with an input stream (not if you
> have a file already), and will scan through the file looking for entries
> that'll identify the file. The parser needs the whole file, so if we did
> a streaming parse of the file for detection we'd need to have buffered
> it so we can rewind for the parser.

Why? Doesn't "we'll need to buffer the whole file for zip anyway" boil down to the question of using the commons-compress ZipFile vs. ZipArchiveInputStream?

I know that in the general case the zip file format isn't well suited to streaming processing, which makes ZipArchiveInputStream less reliable. The stream can contain entries which aren't supposed to appear in the zip, or multiple entries with the same name. Yet if I accept that, I can crawl 50M zips in email attachments without copying them.

You are already committed to using ZipFile in the zip-processing code, so using TikaInputStream.getFile() in ZipContainerDetector is not a problem. We stay with ZipArchiveInputStream (for the time being) and would therefore be interested in a stream-based ZipContainerDetector that consumes just a few kilobytes, accepting that in certain cases the accuracy may drop, because the entries in a zip are in general unordered. It's a reliability vs. performance tradeoff. Or am I missing something?

>> and with the POI detector, the entire stream is parsed once in the
>> detector, and a second time in the extractor/parser, which is bad for
>> performance
>
> Pass in a TikaInputStream. That supports attaching the opened (and
> processed) container to the stream, so the parser can re-use it.

I know.
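To make the tradeoff concrete, a stream-based detector of the kind described above could be sketched roughly like this. This is illustrative only: it uses java.util.zip.ZipInputStream from the JDK rather than commons-compress, and the class name, entry-count cap, and name prefixes are made up for the example. It reads only the first few entry headers, so it consumes a few KB instead of the whole file, and it can misdetect precisely when the identifying entries happen to come late in the archive.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class StreamingZipDetector {
    /**
     * Guess an OOXML media type from the first few entry names only.
     * Stops after maxEntries, so only the beginning of the stream is
     * consumed. Returns the generic "application/zip" when nothing
     * conclusive was seen -- which can happen for a valid OOXML file
     * whose identifying entries are stored late, since zip entries
     * are in general unordered.
     */
    public static String detect(InputStream in, int maxEntries) throws IOException {
        ZipInputStream zip = new ZipInputStream(in);
        ZipEntry entry;
        int seen = 0;
        while ((entry = zip.getNextEntry()) != null && seen++ < maxEntries) {
            String name = entry.getName();
            if (name.startsWith("word/")) {
                return "application/vnd.openxmlformats-officedocument.wordprocessingml.document";
            }
            if (name.startsWith("xl/")) {
                return "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet";
            }
            if (name.startsWith("ppt/")) {
                return "application/vnd.openxmlformats-officedocument.presentationml.presentation";
            }
        }
        return "application/zip"; // inconclusive
    }
}
```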
I was referring to making the Aperture extractors aware of the fact that they can reuse the NPOIFSFileSystem, which is something I want to implement before we fully migrate. For us the problem is that we give only the first few KB of a file to the mime type identifier, so for larger files the PoiContainerDetector can NOT build a proper POI filesystem for the extractors to reuse. That's why I'm building a generic POI extractor which will get the entire stream, build a proper filesystem, and perform both detection and extraction directly from it.

Antoni Myłka
ant...@gm...
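For readers unfamiliar with the pattern Nick mentions: the idea is that the stream itself carries the already-opened container, so the detector opens it once and the parser reuses it instead of reparsing. The following is a minimal stand-in sketch, not Tika's actual class; TikaInputStream exposes a similar pair of methods for attaching and retrieving the open container, but ContainerAwareStream here is hypothetical and the Object-typed payload is a simplification.

```java
import java.io.FilterInputStream;
import java.io.InputStream;

/**
 * Hypothetical stand-in for the TikaInputStream pattern: a stream that
 * can carry an already-opened container (e.g. a parsed zip or a POI
 * filesystem) so that a later consumer reuses it rather than reopening
 * and reparsing the underlying bytes.
 */
public class ContainerAwareStream extends FilterInputStream {
    private Object openContainer;

    public ContainerAwareStream(InputStream in) {
        super(in);
    }

    /** Called by the detector after it has opened the container. */
    public void setOpenContainer(Object container) {
        this.openContainer = container;
    }

    /** Called by the parser; null means nobody opened the container yet. */
    public Object getOpenContainer() {
        return openContainer;
    }
}
```

The point of the thread is exactly that this handoff only works when the detector saw enough of the file to build the container in the first place; a detector fed only the first few KB cannot attach a complete filesystem.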