From: Christiaan F. <chr...@ad...> - 2008-04-23 21:54:43
Grant Ingersoll wrote:
> I think, if I am reading this right, that I can set the
> "aperture.poiUtil.bufferSize" to use a bigger buffer, right? Should
> this value be set to be the same as the maximum number of bytes
> allowed per file? Doesn't make much sense to allow for larger files
> than we can reliably read in, right?

Yes, absolutely. I never realized this myself. Right now it is entirely up
to the application developer to make sure that the buffer size used by
PoiUtil is less than or equal to the file size thresholds used by the
Crawlers. Not that anything goes wrong when it isn't, it simply has no
effect, as you won't get to process those larger files anyway. As the
Crawler and Extractor frameworks are entirely separate (i.e. you can use
one without the other), I see no way to manage this in a smarter way. Note
that these limits also serve different purposes: on the Crawler side they
prevent downloading of very large files, on the Extractor side they
control main memory usage.

Antoni Mylka wrote about this:
> Is this ppt larger than 4MB? I must admit I'm not a POI specialist
> (yet). Is this reproducible? (i.e. does it always hurt to try to read an
> MSOffice file larger than the buffer). If it is, maybe throwing an
> ExtractorException is not a bad idea.

I've seen it happen on my own files on several occasions. What happens is
this: first, POI is applied to an MS Office file (or any other OLE file)
to extract text and metadata. If this does not succeed, i.e. it throws an
exception or returns empty text, the stream is reset and our heuristic
string extractor is applied. This only works when POI has consumed less
than 4 MB of the stream, else you get the "Resetting to invalid mark"
IOException.

A 4 MB buffer is not much by today's standards (assuming that you don't
have Extractors working in parallel!). Perhaps we could make the default
size somewhere around 10 MB? Note that this relates directly to file
size: I've got plenty of 4 MB+ office files, whereas 10 MB+ files are a
significantly smaller piece of the pie.

Grant wrote:
> The only problem with that is I
> can't change it on a per crawl basis in a thread-safe way. Any
> suggestions?

Not at the moment. Any suggestions on how we could accommodate such
settings?

> Also, and this one is a bit more dangerous, I think, I am seeing:
> java.lang.OutOfMemoryError: Java heap space
>     at java.util.Arrays.copyOf(Unknown Source)
>     at java.io.CharArrayWriter.write(Unknown Source)
>     at org.semanticdesktop.aperture.util.IOUtil.readFully(IOUtil.java:256)
>     at org.semanticdesktop.aperture.util.IOUtil.readString(IOUtil.java:84)
>     at org.semanticdesktop.aperture.extractor.plaintext.PlainTextExtractor.extract(PlainTextExtractor.java:84)

As Antoni already explained, the PlainTextExtractor has some memory
issues, due to Java using two bytes per char and the fact that the text is
read into a temporary char buffer that doubles in size whenever it needs
to grow. Add to that a toString on the buffer once it contains everything,
which copies the part of the array that holds actual content. Based on
this alone, the worst-case memory usage is 6x the file size: 2 bytes/char
in an array that may be just over 50% full (up to 4x the file size for
plain ASCII text), plus 2 bytes/char for the toString copy. A 10 MB text
file can thus transiently require roughly 60 MB of heap. Not included in
this calculation is the fact that the PlainTextExtractor first reads a
String of at most 256 chars, determines whether it looks even remotely
like readable text (if it makes your terminal beep, it doesn't :) ) and,
if so, reads the rest into a separate String and concatenates the two.
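
For illustration, here is a rough sketch of that two-step approach. It is
simplified, with made-up names and heuristic, not the actual
PlainTextExtractor code:

    import java.io.IOException;
    import java.io.Reader;
    import java.io.StringWriter;

    public class TwoStepTextReader {

        public static String readIfText(Reader reader) throws IOException {
            // Step 1: read at most 256 chars and apply a crude "is this text?" check.
            char[] head = new char[256];
            int headLength = reader.read(head);
            if (headLength <= 0 || !looksLikeText(head, headLength)) {
                return null;
            }

            // Step 2: read the remainder into a separate buffer and
            // concatenate it with the first 256 chars.
            StringWriter rest = new StringWriter();
            char[] buffer = new char[8192];
            int read;
            while ((read = reader.read(buffer)) != -1) {
                rest.write(buffer, 0, read);
            }
            return new String(head, 0, headLength) + rest.toString();
        }

        private static boolean looksLikeText(char[] chars, int length) {
            // Reject content containing control characters (the "terminal beep" test).
            for (int i = 0; i < length; i++) {
                char c = chars[i];
                if (c < 0x20 && c != '\n' && c != '\r' && c != '\t') {
                    return false;
                }
            }
            return true;
        }
    }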
That last concatenation has now been prevented by the use of a
PushbackReader. I don't think this lowers the maximum memory usage though,
as the concatenation takes place at a point where the CharArrayWriter (the
biggest temporary buffer) can already be garbage-collected.

> Not sure what is even possible here, as I know the issue is that we
> have to read in the whole contents into memory for that triple. Is
> there any other way to just deal with the stream?

Options you have:

- Give your VM the largest max. heap space you can get away with.
- Look at the code with the above explanation in mind and see where you
  think things can be changed. One possibility is to look at the metadata
  inside the RDFContainer passed to the PlainTextExtractor and see whether
  anything is known about the file size (NIE.byteSize). This could be used
  to properly initialize the buffer sizes, so that 6x the file size
  becomes 4x the file size.

> Will moving to the persistent model take care of this need?

Normally, the models contained in a DataObject are kept entirely in main
memory. Whether a persistent model makes sense depends on what you need to
do with those DataObjects. When you want to store and/or index their
metadata, a persistent store makes much sense, whether it is an RDF store,
Lucene, or whatever. A ModelSet operating on top of a NativeStore (as
explained in the persistent crawling example) does exactly this. You can
then process each DataObject sequentially and you always have at most one
DataObject in main memory. Normally that's perfectly feasible; only your
100,000-files folder and certain very large files are exceptions.

Antoni wrote:
> AFAIK RDF stores don't support streaming access to literals

That's exactly one of the issues I want to tackle in my proposal for RDF
usage in Aperture 2. Allowing Literals with Readers as labels, rather than
Strings, is one option (similar to Lucene's Fields). Changing Strings into
CharSequences (an interface implemented by String) could have been
another, if it weren't for the fact that you would then still need to know
the string length beforehand.

Regards,

Chris
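
For illustration, a small sketch of the NIE.byteSize idea mentioned above.
How the size would actually be obtained from the RDFContainer is left out,
since that depends on the Aperture API; the knownByteSize parameter is
just an assumption. The point is only the pre-sizing of the buffer:

    import java.io.CharArrayWriter;
    import java.io.IOException;
    import java.io.Reader;

    public class PreSizedRead {

        public static String readFully(Reader reader, long knownByteSize) throws IOException {
            // For single-byte encodings the char count is at most the byte
            // count, so a pre-sized buffer never needs to double and the
            // 6x worst case drops to roughly 4x.
            int initialCapacity = (knownByteSize > 0 && knownByteSize < Integer.MAX_VALUE)
                    ? (int) knownByteSize
                    : 64 * 1024;
            CharArrayWriter writer = new CharArrayWriter(initialCapacity);

            char[] buffer = new char[8192];
            int read;
            while ((read = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, read);
            }
            return writer.toString();
        }
    }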
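
And a sketch of the Lucene analogy mentioned above: a field whose value is
supplied as a Reader instead of a String, so the full text never has to be
materialized as one String (Lucene 2.x API; field names are made up):

    import java.io.Reader;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class ReaderFieldExample {

        public static Document buildDocument(String uri, Reader fullTextReader) {
            Document doc = new Document();
            // Stored, untokenized identifier field.
            doc.add(new Field("uri", uri, Field.Store.YES, Field.Index.UN_TOKENIZED));
            // Reader-based field: tokenized and indexed, but never stored
            // and never held in memory as a single String.
            doc.add(new Field("contents", fullTextReader));
            return doc;
        }
    }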