We frequently encounter OutOfMemoryErrors when calling modifyDatastreamByValue and other API-M methods on relatively large digital objects using Fedora Commons 3.0b1, 3.0b2 and 3.0. To better understand the issue, we triggered heap dumps and analyzed them. The dumps revealed that Fedora uses up to 140M of heap space when modifyDatastreamByValue is called on a digital object of 15M.
To provoke heap dumps at each API call, we reduced the heap size. Additionally, we triggered heap dumps programmatically at specific locations using the Java 6 HotSpotDiagnosticMXBean.
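For anyone who wants to reproduce the programmatic dumps, the following is a minimal sketch of how a dump can be triggered through HotSpotDiagnosticMXBean. The class name HeapDumper and the file naming are our own illustration and not part of Fedora; the call would be placed at whatever location is of interest.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.File;
import java.io.IOException;
import java.lang.management.ManagementFactory;

/** Hypothetical helper (not part of Fedora) that dumps the heap at a chosen point in the code. */
class HeapDumper {
    // Object name under which HotSpot JVMs register the diagnostic bean.
    private static final String HOTSPOT_BEAN_NAME = "com.sun.management:type=HotSpotDiagnostic";

    /**
     * Writes an hprof heap dump to the given path.
     *
     * @param path     target file; it must not exist yet (recent JDKs also expect a ".hprof" suffix)
     * @param liveOnly if true, only objects that are still reachable are dumped
     */
    static void dumpHeap(String path, boolean liveOnly) throws IOException {
        HotSpotDiagnosticMXBean bean = ManagementFactory.newPlatformMXBeanProxy(
                ManagementFactory.getPlatformMBeanServer(),
                HOTSPOT_BEAN_NAME,
                HotSpotDiagnosticMXBean.class);
        bean.dumpHeap(path, liveOnly);
    }

    public static void main(String[] args) throws IOException {
        // Illustrative file name; a timestamp keeps successive dumps from colliding.
        File target = new File(System.getProperty("java.io.tmpdir"),
                "fedora-heap-" + System.currentTimeMillis() + ".hprof");
        dumpHeap(target.getPath(), true);
        System.out.println("Heap dump written to " + target + " (" + target.length() + " bytes)");
    }
}
```

Passing true for liveOnly restricts the dump to reachable objects, which is usually what matters when looking for the cause of an OutOfMemoryError.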
The OutOfMemoryError always occurs at DOTranslationUtility.writeToStream() after the serialization. This appears to be the peak of heap usage for modifyDatastreamByValue.
The heap dump shows the following composition of objects at the time of writeToStream() (see attached screenshot):
* StringBuffer (60M): 15M * 2 (internal UTF-16 representation) = 30M of content, plus 30M of unused capacity allocated by the StringBuffer. A StringBuffer automatically doubles its capacity when the remaining capacity is insufficient for appending a new String, so the capacity is likely to exceed the memory actually needed unless it is allocated explicitly.
* char array at writeToStream (StringBuffer.toString()) (31M): 15M * 2 + overhead
* BasicDigitalObject (24M): 15M DatastreamXMLMetadata, 9M AuditRecord
* DOReaderCache (25M): 1 BasicDigitalObject in cache at the time
* Some other small objects
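The StringBuffer overshoot described in the list above is easy to observe in isolation. The following sketch is our own illustration and is not taken from the Fedora code; the class name and the chunk size are arbitrary. It fills a default-sized buffer by repeated appends and compares its capacity with its length.

```java
/** Illustration only: shows how a StringBuffer grown by appends can hold more capacity than content. */
class StringBufferGrowthDemo {
    /** Appends chunk the given number of times to a buffer created with the default capacity. */
    static StringBuffer fill(String chunk, int times) {
        StringBuffer buf = new StringBuffer(); // default initial capacity, grows on demand
        for (int i = 0; i < times; i++) {
            buf.append(chunk);
        }
        return buf;
    }

    public static void main(String[] args) {
        // 110,000 appends of 10 characters each = 1,100,000 characters of content.
        StringBuffer buf = fill("0123456789", 110000);
        System.out.println("length   = " + buf.length());
        System.out.println("capacity = " + buf.capacity());
        System.out.println("unused   = " + (buf.capacity() - buf.length()) + " chars");
    }
}
```

How much capacity is left unused depends on where the content length falls relative to the last doubling step, so the 30M of slack we observed should be read as one data point and not as a fixed ratio.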
If most of the heap is already in use, allocating another large chunk of memory may fail and trigger an OutOfMemoryError. Explicitly calling the garbage collector is not a viable option, because most of the objects involved are still referenced from the executing thread and therefore remain reachable.
Increasing the heap size solves the issue only temporarily; depending on the size of the digital object, the problem may resurface. Suppose the digital object is 30M: according to our findings, a heap space of 120M (60M * 2) StringBuffer + 60M char array + ~50M DO + ~50M cache = 280M would be needed for a single digital object (we have not tried this, however).
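The projection above is a plain linear extrapolation from the 15M measurement. As a rough sketch, assuming that peak heap usage really does scale linearly with object size (which we have not verified), it can be written as follows. The class and method names are hypothetical.

```java
/** Back-of-the-envelope estimate; assumes peak heap usage scales linearly with object size. */
class HeapEstimate {
    // Figures observed in the heap dump for the 15M test object, in megabytes.
    static final double OBSERVED_OBJECT_MB = 15;
    // StringBuffer + char array + BasicDigitalObject + DOReaderCache
    static final double OBSERVED_PEAK_MB = 60 + 31 + 24 + 25;

    /** Scales the observed peak to another object size (assumption: linear growth). */
    static double estimatePeakMb(double objectMb) {
        return objectMb * OBSERVED_PEAK_MB / OBSERVED_OBJECT_MB;
    }

    public static void main(String[] args) {
        System.out.println("Estimated peak for a 30M object: " + estimatePeakMb(30) + "M");
    }
}
```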
We modified the Fedora code and tried the following options:
* We removed the indentation in FOXMLDOSerializer and DOTranslationUtility. Removing most of the nonessential whitespace (or replacing the indentation spaces with tabs) results in a much smaller DO size (about 20% in our test case) and therefore reduces the memory footprint.
* For the StringBuffer problem we tried two approaches. First, we trimmed the StringBuffer in FOXMLDOSerializer before the call to writeToStream() using the trimToSize() method, which adjusts the capacity of the StringBuffer to the number of characters it actually contains. The other option is to size the buffer explicitly.
* The 64-bit version of Java consumes considerably more heap space than the 32-bit version. Using a 32-bit version reduces memory usage.
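The two StringBuffer approaches from the list above can be sketched as follows. This is a standalone illustration under our own assumptions and not the actual change to FOXMLDOSerializer; the class name, the method names and the way the expected size is computed are hypothetical.

```java
/** Illustration only: two ways to keep a StringBuffer from holding unused capacity. */
class BufferSizingDemo {
    /** Approach 1: let the buffer grow on demand, then trim it before passing it on. */
    static StringBuffer buildAndTrim(String chunk, int times) {
        StringBuffer buf = new StringBuffer();
        for (int i = 0; i < times; i++) {
            buf.append(chunk);
        }
        // In OpenJDK this copies the content into a right-sized array, so both
        // arrays exist briefly until the larger one is collected.
        buf.trimToSize();
        return buf;
    }

    /** Approach 2: allocate the expected size up front so the buffer never has to grow. */
    static StringBuffer buildPresized(String chunk, int times) {
        // Assumption for this sketch: the final length is known (or can be estimated) in advance.
        StringBuffer buf = new StringBuffer(chunk.length() * times);
        for (int i = 0; i < times; i++) {
            buf.append(chunk);
        }
        return buf;
    }

    public static void main(String[] args) {
        StringBuffer trimmed = buildAndTrim("<foxml/>", 1000);
        StringBuffer presized = buildPresized("<foxml/>", 1000);
        System.out.println("trimmed:  length=" + trimmed.length() + " capacity=" + trimmed.capacity());
        System.out.println("presized: length=" + presized.length() + " capacity=" + presized.capacity());
    }
}
```

Presizing avoids both the slack and the intermediate array copies, but it only works if a reasonable size estimate is available before serialization starts.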
All options mentioned above work well and reduce memory consumption significantly, but solve the underlying problem only partially.
Perhaps a better solution would be to load and process only those parts of the digital object needed for the current operation (not viable for ingest, but e.g. modifyDatastreamByX), but that would probably involve lots of refactoring...