From: Antoni M. <ant...@gm...> - 2011-06-14 13:55:59
On 2011-06-14 15:02, Nick Burch wrote:
> On Tue, 14 Jun 2011, Antoni Mylka wrote:
>> You are right. There is still room for improvement. ZipContainerDetector
>> creates a temp file, which I'd rather avoid
>
> We'll need to buffer the whole file for zip either way. The current way
> will create a temp file if you start with an input stream (not if you
> have a file already), and will scan through the file looking for entries
> that'll identify the file. The parser needs the whole file, so if we did
> a streaming parse of the file for detection we'd need to have buffered
> it so we can rewind for the parser.

Why? Doesn't "we'll need to buffer the whole file for zip anyway" boil down to the question of using the commons-compress ZipFile vs. ZipArchiveInputStream?

I know that in the general case the zip file format isn't well suited to streaming processing, which makes ZipArchiveInputStream less reliable. The stream can contain entries which aren't supposed to appear in the zip, or multiple entries with the same name. Yet if I accept that, I can crawl 50M zips in email attachments without copying them.

You are already committed to using ZipFile in the zip-processing code, so using TikaInputStream.getFile() in ZipContainerDetector is not a problem. We stay with ZipArchiveInputStream (for the time being) and would therefore be interested in a stream-based ZipContainerDetector that consumes just a few kilobytes, accepting that in certain cases the accuracy may drop, because the entries in a zip are in general unordered. It's a reliability vs. performance tradeoff. Or am I missing something?

>> and with the POI detector, the entire stream is parsed once in the
>> detector, and a second time in the extractor/parser, which is bad for
>> performance
>
> Pass in a TikaInputStream. That supports attaching the opened (and
> processed) container to the stream, so the parser can re-use it.

I know.
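To make the tradeoff concrete, a stream-based detector of the kind described above could be sketched roughly like this. This is illustrative only: it uses java.util.zip.ZipInputStream from the JDK rather than commons-compress, and the class name, entry-count cap, and name prefixes are made up for the example. It reads only the first few entry headers, so it consumes a few KB instead of the whole file, and it can misdetect precisely when the identifying entries happen to come late in the archive.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class StreamingZipDetector {
    /**
     * Guess an OOXML media type from the first few entry names only.
     * Stops after maxEntries, so only the beginning of the stream is
     * consumed. Returns the generic "application/zip" when nothing
     * conclusive was seen -- which can happen for a valid OOXML file
     * whose identifying entries are stored late, since zip entries
     * are in general unordered.
     */
    public static String detect(InputStream in, int maxEntries) throws IOException {
        ZipInputStream zip = new ZipInputStream(in);
        ZipEntry entry;
        int seen = 0;
        while ((entry = zip.getNextEntry()) != null && seen++ < maxEntries) {
            String name = entry.getName();
            if (name.startsWith("word/")) {
                return "application/vnd.openxmlformats-officedocument.wordprocessingml.document";
            }
            if (name.startsWith("xl/")) {
                return "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet";
            }
            if (name.startsWith("ppt/")) {
                return "application/vnd.openxmlformats-officedocument.presentationml.presentation";
            }
        }
        return "application/zip"; // inconclusive
    }
}
```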
I was referring to making the Aperture extractors aware of the fact that they can reuse the NPOIFSFileSystem, which is something I want to implement before we fully migrate. For us the problem is that we give only the first few KB of a file to the mime type identifier, so for larger files the PoiContainerDetector can NOT build a proper POI filesystem for the extractors to reuse. That's why I'm building a generic POI extractor which will get the entire stream, build a proper filesystem, and perform both detection and extraction directly from it.

Antoni Myłka
ant...@gm...
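For readers unfamiliar with the pattern Nick mentions: the idea is that the stream itself carries the already-opened container, so the detector opens it once and the parser reuses it instead of reparsing. The following is a minimal stand-in sketch, not Tika's actual class; TikaInputStream exposes a similar pair of methods for attaching and retrieving the open container, but ContainerAwareStream here is hypothetical and the Object-typed payload is a simplification.

```java
import java.io.FilterInputStream;
import java.io.InputStream;

/**
 * Hypothetical stand-in for the TikaInputStream pattern: a stream that
 * can carry an already-opened container (e.g. a parsed zip or a POI
 * filesystem) so that a later consumer reuses it rather than reopening
 * and reparsing the underlying bytes.
 */
public class ContainerAwareStream extends FilterInputStream {
    private Object openContainer;

    public ContainerAwareStream(InputStream in) {
        super(in);
    }

    /** Called by the detector after it has opened the container. */
    public void setOpenContainer(Object container) {
        this.openContainer = container;
    }

    /** Called by the parser; null means nobody opened the container yet. */
    public Object getOpenContainer() {
        return openContainer;
    }
}
```

The point of the thread is exactly that this handoff only works when the detector saw enough of the file to build the container in the first place; a detector fed only the first few KB cannot attach a complete filesystem.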