From: Simon P. <Sim...@ut...> - 2009-12-03 07:50:51
The way the local filesystem harvester currently works is that all XML files
in the directory tree are read and stored as JDOM Element objects before any
of them are written to the database - with 20,000 XML files in your directory
tree you will likely run out of memory, or thrash, before anything is ever
written (it did for me, anyway). A simple modification to the harvester is to
capture only the filename when the directory tree is traversed, and to defer
reading the file into a JDOM Element until we need to write it to the
database - much lighter on RAM, and it is working now on your 20,000-record
test case.
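In outline, the change looks something like this - a sketch only, not the
actual patch (collectXmlFiles() and writeToDatabase() are stand-in names for
illustration, not the harvester's real methods):

import java.io.File;
import java.util.ArrayList;
import java.util.List;

import org.jdom.Element;
import org.jdom.input.SAXBuilder;

public class LazyFilesystemHarvest {

    // Pass 1: walk the tree, keeping only File handles (cheap).
    static void collectXmlFiles(File dir, List<File> result) {
        File[] entries = dir.listFiles();
        if (entries == null) return;
        for (File f : entries) {
            if (f.isDirectory()) collectXmlFiles(f, result);
            else if (f.getName().toLowerCase().endsWith(".xml")) result.add(f);
        }
    }

    // Pass 2: parse one file at a time and write it, so only one
    // record's DOM is ever held in memory instead of all 20,000.
    static void harvest(File rootDir) throws Exception {
        List<File> files = new ArrayList<File>();
        collectXmlFiles(rootDir, files);

        SAXBuilder builder = new SAXBuilder();
        for (File f : files) {
            Element md = builder.build(f).getRootElement();
            writeToDatabase(md);
            // 'md' becomes garbage-collectable as soon as the next
            // iteration starts.
        }
    }

    static void writeToDatabase(Element metadata) {
        // stand-in for the real database insert
    }
}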
I'll commit the change later tonight if all is OK. Note this is not an
indexing issue - just a memory-use oversight for large harvests - and it
applies only to the local filesystem harvester.

Cheers,
Simon

Doug Nebert wrote:
> Gustavo González wrote:
>
>> This page:
>> http://java.sun.com/javase/technologies/hotspot/gc/gc_tuning_6.html#par_gc.oom
>> says that using -XX:-UseGCOverheadLimit should eliminate this error
>> (GC overhead limit exceeded).
>
> It still seems that there is something terribly inefficient in the
> harvest. Will try it. I am afraid we could set this and then suffer when
> anyone wants to visit the portal while it is spending 98% of its time in
> garbage collection. It would be better to improve the code than to
> require a 64-bit environment with tuned settings to behave properly on
> moderate-sized collections. We may need to routinely ingest up to 100K
> records without choking the system.
>
> From your reference:
>
> "Excessive GC Time and OutOfMemoryError
>
> The parallel collector will throw an OutOfMemoryError if too much time
> is being spent in garbage collection: if more than 98% of the total time
> is spent in garbage collection and less than 2% of the heap is
> recovered, an OutOfMemoryError will be thrown. This feature is designed
> to prevent applications from running for an extended period of time
> while making little or no progress because the heap is too small. If
> necessary, this feature can be disabled by adding the option
> -XX:-UseGCOverheadLimit to the command line."
>
> Doug.
>
>> -----Original Message-----
>> From: Doug Nebert [mailto:ddn...@us...]
>> Sent: Wednesday, 02 December 2009 15:08
>> To: Gustavo González
>> Cc: geo...@li...
>> Subject: Re: [GeoNetwork-devel] [GeoNetwork opensource Developer website]
>> #175: Index of 20,000 records maxes out java heap space
>>
>> Gustavo González wrote:
>>
>>> Maybe putting this in your settings can help:
>>>
>>> -XX:MaxPermSize=512m
>>
>> I just tried that and still ran out of memory:
>>
>> java.lang.OutOfMemoryError: GC overhead limit exceeded
>>
>> The harvest process needs to be more thoughtful: it should perform
>> incremental commits to the database and to Lucene, clearing memory as
>> it goes (see the batching sketch at the end of this message).
>>
>> Doug.
>>
>>> -----Original Message-----
>>> From: GeoNetwork opensource Developer website
>>> [mailto:tra...@os...]
>>> Sent: Wednesday, 02 December 2009 11:52
>>> To: undisclosed-recipients:
>>> Subject: [GeoNetwork-devel] [GeoNetwork opensource Developer website]
>>> #175: Index of 20,000 records maxes out java heap space
>>>
>>> #175: Index of 20,000 records maxes out java heap space
>>> ---------------------------------------+------------------------------------
>>>  Reporter:  ddnebert                   |      Owner:  geo...@li...
>>>      Type:  defect                     |     Status:  new
>>>  Priority:  major                      |  Milestone:  v2.4.2
>>> Component:  General                    |    Version:  v2.4.2
>>>  Keywords:  java heap, index, harvest  |
>>> ---------------------------------------+------------------------------------
>>> We have tried all manner of memory settings on the Jetty servlet to
>>> enable the indexing of 20K records in ISO format, each only ~40 KB in
>>> size. The current setting is: java -Xms48m -Xmx1024m -Xss36m, with the
>>> guidance being that one does not want to exceed the overall process
>>> limit of 2 GB in this 32-bit Linux environment.
>>>
>>> We are using the Harvest Management function to access a Local File
>>> System; the metadata are in valid ISO 19139 format; we are running
>>> 2.4.2 under 32-bit Linux with Jetty and MySQL.
>>>
>>> We don't want to split the files into smaller groups, as this defeats
>>> the purpose of identifying and working with a collection and setting
>>> harvest rules on it. There appears to be a defect in the code: it does
>>> not commit records as they are harvested until too late in the process,
>>> maxing out the Java heap space. The process should be more serial and
>>> progressive, allowing records to be visible soon after they are
>>> processed. As it stands, the system is not viable for use on these
>>> larger collections until the defect is fixed.
>>>
>>> The 113 MB file of metadata records is available at
>>> http://mapcontext.com/gcmdiso.zip for testing purposes.
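To make Doug's suggestion above concrete, here is a minimal sketch of an
incremental-commit harvest loop. The batch size, the class, and the helper
names (dbCommit, indexBatch) are all assumptions for illustration, not
GeoNetwork's actual API:

import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class BatchedHarvest {

    static final int BATCH_SIZE = 200; // tune to available heap

    // Write and index records in small batches so they become visible
    // while the harvest is still running and parsed records can be
    // reclaimed by the garbage collector.
    static void harvest(List<File> files) throws Exception {
        List<File> batch = new ArrayList<File>(BATCH_SIZE);
        for (File f : files) {
            batch.add(f);
            if (batch.size() == BATCH_SIZE) flush(batch);
        }
        if (!batch.isEmpty()) flush(batch); // remainder
    }

    static void flush(List<File> batch) throws Exception {
        for (File f : batch) {
            // parse and insert one record here (as in the earlier sketch)
        }
        dbCommit();          // records become queryable now, not at the end
        indexBatch(batch);   // add just this batch to the Lucene index
        batch.clear();       // drop references so the heap can be reclaimed
    }

    static void dbCommit() { /* hypothetical commit call */ }
    static void indexBatch(List<File> batch) { /* hypothetical index call */ }
}

With a commit every few hundred records, the heap holds at most one batch of
parsed metadata at a time, and records appear in search soon after they are
processed - rather than the all-or-nothing behaviour described in the ticket.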