From: Christiaan F. <chr...@ad...> - 2008-04-23 21:54:43
Grant Ingersoll wrote:
> I think, if I am reading this right, that I can set the
> "aperture.poiUtil.bufferSize" to use a bigger buffer, right? Should
> this value be set to be the same as the maximum number of bytes
> allowed per file? Doesn't make much sense to allow for larger files
> than we can reliably read in, right?

Yes, absolutely. I never realized this myself. Right now it is entirely up
to the application developer to make sure that the buffer size used by
PoiUtil is less than or equal to the file size thresholds used by the
Crawlers. Not that anything goes wrong when it isn't, it simply has no
effect, as you won't get to process those larger files anyway. As the
Crawler and Extractor frameworks are entirely separate (i.e. you can use
one without the other), I see no way to manage this in a smarter way. Note
that these limits also serve different purposes: on the Crawler side they
prevent downloading of very large files, on the Extractor side they
control main memory usage.

Antoni Mylka wrote about this:
> Is this ppt larger than 4MB? I must admit I'm not a POI specialist
> (yet). Is this reproducible? (i.e. does it always hurt to try to read an
> MSOffice file larger than the buffer). If it is, maybe throwing an
> ExtractorException is not a bad idea.

I've seen it happen on my own files on several occasions. What happens is
this: first, POI is applied to an MS Office file (or any other OLE file)
to extract text and metadata. If this does not succeed, i.e. it throws an
exception or returns empty text, the stream is reset and our heuristic
string extractor is applied. This only works when POI has consumed less
than 4 MB of the stream, else you get the "Resetting to invalid mark"
IOException.

A 4 MB buffer is not much by today's standards (assuming that you don't
have Extractors working in parallel!). Perhaps we could make the default
size somewhere around 10 MB? Note that this relates directly to file
size: I've got plenty of 4 MB+ office files, whereas 10 MB+ files are a
significantly smaller piece of the pie.

Grant wrote:
> The only problem with that is I
> can't change it on a per crawl basis in a thread-safe way. Any
> suggestions?

Not at the moment. Any suggestions on how we could accommodate such
settings?

> Also, and this one is a bit more dangerous, I think, I am seeing:
> java.lang.OutOfMemoryError: Java heap space
>     at java.util.Arrays.copyOf(Unknown Source)
>     at java.io.CharArrayWriter.write(Unknown Source)
>     at org.semanticdesktop.aperture.util.IOUtil.readFully(IOUtil.java:256)
>     at org.semanticdesktop.aperture.util.IOUtil.readString(IOUtil.java:84)
>     at org.semanticdesktop.aperture.extractor.plaintext.PlainTextExtractor.extract(PlainTextExtractor.java:84)

As Antoni already explained, the PlainTextExtractor has some memory
issues, due to Java using two bytes per char and the fact that the text is
read into a temporary char buffer that doubles in size whenever it needs
to grow. Add to that a toString on the buffer once it contains everything,
which copies the part of the array that holds actual content. Based on
this alone, the worst-case memory usage is 6x the file size: 2 bytes/char
in an array that may be just over 50% full (up to 4x the file size for
plain ASCII text), plus 2 bytes/char for the toString copy. A 10 MB text
file can thus transiently require roughly 60 MB of heap. Not included in
this calculation is the fact that the PlainTextExtractor first reads a
String of at most 256 chars, determines whether it looks even remotely
like readable text (if it makes your terminal beep, it doesn't :) ) and,
if so, reads the rest into a separate String and concatenates the two.
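
For illustration, here is a rough sketch of that two-step approach. It is
simplified, with made-up names and heuristic, not the actual
PlainTextExtractor code:

    import java.io.IOException;
    import java.io.Reader;
    import java.io.StringWriter;

    public class TwoStepTextReader {

        public static String readIfText(Reader reader) throws IOException {
            // Step 1: read at most 256 chars and apply a crude "is this text?" check.
            char[] head = new char[256];
            int headLength = reader.read(head);
            if (headLength <= 0 || !looksLikeText(head, headLength)) {
                return null;
            }

            // Step 2: read the remainder into a separate buffer and
            // concatenate it with the first 256 chars.
            StringWriter rest = new StringWriter();
            char[] buffer = new char[8192];
            int read;
            while ((read = reader.read(buffer)) != -1) {
                rest.write(buffer, 0, read);
            }
            return new String(head, 0, headLength) + rest.toString();
        }

        private static boolean looksLikeText(char[] chars, int length) {
            // Reject content containing control characters (the "terminal beep" test).
            for (int i = 0; i < length; i++) {
                char c = chars[i];
                if (c < 0x20 && c != '\n' && c != '\r' && c != '\t') {
                    return false;
                }
            }
            return true;
        }
    }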
That last concatenation has now been prevented by the use of a
PushbackReader. I don't think this lowers the maximum memory usage though,
as the concatenation takes place at a point where the CharArrayWriter (the
biggest temporary buffer) can already be garbage-collected.

> Not sure what is even possible here, as I know the issue is that we
> have to read in the whole contents into memory for that triple. Is
> there any other way to just deal with the stream?

Options you have:

- Give your VM the largest max. heap space you can get away with.
- Look at the code with the above explanation in mind and see where you
  think things can be changed. One possibility is to look at the metadata
  inside the RDFContainer passed to the PlainTextExtractor and see whether
  anything is known about the file size (NIE.byteSize). This could be used
  to properly initialize the buffer sizes, so that 6x the file size
  becomes 4x the file size.

> Will moving to the persistent model take care of this need?

Normally, the models contained in a DataObject are kept entirely in main
memory. Whether a persistent model makes sense depends on what you need to
do with those DataObjects. When you want to store and/or index their
metadata, a persistent store makes much sense, whether it is an RDF store,
Lucene, or whatever. A ModelSet operating on top of a NativeStore (as
explained in the persistent crawling example) does exactly this. You can
then process each DataObject sequentially and you always have at most one
DataObject in main memory. Normally that's perfectly feasible; only your
100,000-files folder and certain very large files are exceptions.

Antoni wrote:
> AFAIK RDF stores don't support streaming access to literals

That's exactly one of the issues I want to tackle in my proposal for RDF
usage in Aperture 2. Allowing Literals with Readers as labels, rather than
Strings, is one option (similar to Lucene's Fields). Changing Strings into
CharSequences (an interface implemented by String) could have been
another, if it weren't for the fact that you would then still need to know
the string length beforehand.

Regards,

Chris
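
For illustration, a small sketch of the NIE.byteSize idea mentioned above.
How the size would actually be obtained from the RDFContainer is left out,
since that depends on the Aperture API; the knownByteSize parameter is
just an assumption. The point is only the pre-sizing of the buffer:

    import java.io.CharArrayWriter;
    import java.io.IOException;
    import java.io.Reader;

    public class PreSizedRead {

        public static String readFully(Reader reader, long knownByteSize) throws IOException {
            // For single-byte encodings the char count is at most the byte
            // count, so a pre-sized buffer never needs to double and the
            // 6x worst case drops to roughly 4x.
            int initialCapacity = (knownByteSize > 0 && knownByteSize < Integer.MAX_VALUE)
                    ? (int) knownByteSize
                    : 64 * 1024;
            CharArrayWriter writer = new CharArrayWriter(initialCapacity);

            char[] buffer = new char[8192];
            int read;
            while ((read = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, read);
            }
            return writer.toString();
        }
    }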
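
And a sketch of the Lucene analogy mentioned above: a field whose value is
supplied as a Reader instead of a String, so the full text never has to be
materialized as one String (Lucene 2.x API; field names are made up):

    import java.io.Reader;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class ReaderFieldExample {

        public static Document buildDocument(String uri, Reader fullTextReader) {
            Document doc = new Document();
            // Stored, untokenized identifier field.
            doc.add(new Field("uri", uri, Field.Store.YES, Field.Index.UN_TOKENIZED));
            // Reader-based field: tokenized and indexed, but never stored
            // and never held in memory as a single String.
            doc.add(new Field("contents", fullTextReader));
            return doc;
        }
    }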