Re: [Samtools-help] Picard mark duplicates memory usage

On 4/26/10 11:21 AM, Sendu Bala wrote:
> When I run picard I normally give the jvm 5GB of memory to play with.
Assuming you mean a maximum heap size of 5GB, I think you need to request
a somewhat higher memory limit from LSF. The Java process will consume more
memory/swap than the maximum heap size, in my experience often by a couple
of GB. Telling LSF you will use 7GB or 8GB will probably let this run to
completion.
-Bob
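
A rough way to see the gap Bob describes is to ask a running JVM for its heap ceiling and its non-heap usage side by side. The sketch below uses only the standard Runtime and MemoryMXBean calls; the class name HeapHeadroom and the helper suggestedLimitBytes are hypothetical, and the 2GB allowance is just the "couple of GB" from the message above, not a measured figure. The non-heap number reported by the JVM covers only part of the overhead (thread stacks and native buffers are not included), so treat it as a lower bound.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public class HeapHeadroom {
    // Hypothetical helper: scheduler request = max heap + a fixed allowance
    // for memory the JVM uses outside the heap. The allowance is an assumption.
    static long suggestedLimitBytes(long maxHeapBytes, long headroomBytes) {
        return maxHeapBytes + headroomBytes;
    }

    public static void main(String[] args) {
        final long gb = 1L << 30;
        Runtime rt = Runtime.getRuntime();
        MemoryUsage nonHeap =
                ManagementFactory.getMemoryMXBean().getNonHeapMemoryUsage();

        // Heap ceiling, i.e. what -Xmx controls.
        System.out.println("maxMemory (heap ceiling): " + rt.maxMemory());
        // Memory the JVM itself reports using outside the heap.
        System.out.println("non-heap used:            " + nonHeap.getUsed());
        // Example: 5GB heap plus an assumed 2GB allowance.
        System.out.println("suggested limit (bytes):  "
                + suggestedLimitBytes(5 * gb, 2 * gb));
    }
}
```

With a 5GB heap and a 2GB allowance the helper returns 7GB, which is the low end of the range suggested above.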
> I was just running MarkDuplicates and it said:
>
> [...]
> INFO    2010-04-26 14:48:58     MarkDuplicates  Read 3435000000 records.
> Tracking 26040 as yet unmatched pairs. 15382 records in RAM.  Last
> sequence index: 82
> INFO    2010-04-26 14:49:05     MarkDuplicates  Read 3435702578 records.
> 0 pairs never matched.
> INFO    2010-04-26 14:52:32     MarkDuplicates  After
> buildSortedReadEndLists freeMemory: 5096165648; totalMemory: 5127602176;
> maxMemory: 5127602176
> INFO    2010-04-26 14:52:32     MarkDuplicates  Will retain up to
> 160237568 duplicate indices before spilling to disk.
> INFO    2010-04-26 14:52:34     MarkDuplicates  Traversing read pair
> information and detecting duplicates.
>
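
The two numbers in the log above are related by a simple ratio: the reported maxMemory divided by 32 gives exactly the "duplicate indices" figure. The sketch below only reproduces that arithmetic. The class and method names are hypothetical, and reading the factor of 32 as "a quarter of the heap at 8 bytes per index" is an assumption inferred from the logged values, not something stated in this thread.

```java
public class DuplicateIndexBudget {
    // Ratio inferred from the log lines above (assumption: 1/4 of the heap,
    // 8 bytes per stored index, so maxMemory / 32).
    static final long BYTES_PER_RETAINED_INDEX = 32L;

    // Number of duplicate indices that would be retained before spilling,
    // given the JVM's reported maxMemory in bytes.
    static long retainedIndices(long maxMemoryBytes) {
        return maxMemoryBytes / BYTES_PER_RETAINED_INDEX;
    }

    public static void main(String[] args) {
        // maxMemory value copied from the log above.
        System.out.println(retainedIndices(5127602176L));
    }
}
```

If that reading is right, this stage is budgeted well inside the heap, which fits Bob's point that the overrun comes from memory the JVM uses on top of the heap.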
> Then it output nothing until 2010-04-26 15:20:15, when my job got killed
> because it had used over 6GB.
>
> Is there anything I can do to avoid this on my end right now, short of
> specifying it might use all the memory on the system? Can its memory
> handling be improved at all?
>
>
> On 23/04/2010 17:55, Alec Wysoker wrote:
>    
>> Hi Keiran,
>>
>> Yes, we typically use the default of 500,000 and run with 2GB RAM, so
>> multiplying both by 5 sounds plausible. Unfortunately I don't have a
>> good method for figuring out the right number given a particular JVM
>> size. Way back when I picked 500,000 as the default, it seemed
>> reasonable for the # of reads we were sorting at the time, and it's
>> worked well enough, so we haven't looked very hard at it. The
>> fundamental question of the memory footprint of a single SAMRecord
>> depends on a number of factors:
>>
>>      * read length
>>      * tag content. E.g. OQ and E2 tags can be large
>>      * SAM input generally has larger memory footprint than BAM input,
>>        but if validation stringency is not silent, then BAM can actually
>>        be larger.
>>      * Also, setting variable-length attributes onto a record read from a
>>        BAM file can expand its memory footprint even if validation
>>        stringency is silent.
>>
>> It might be possible to spill to disk when the sorter hits a configurable
>> RAM threshold rather than a # of records threshold, but figuring out the
>> memory footprint of a Java object is a bit of a challenge and this
>> hasn't been a big enough problem for us to feel motivated to change it.
>>      
>
>
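
The rule of thumb in Alec's message (500,000 records at 2GB, scale both together) can be written down as a one-line proportion. This is only a sketch of that heuristic: the class and method names are hypothetical, the setting is presumably Picard's MAX_RECORDS_IN_RAM option (an assumption, since the message does not name it), and the linear scaling ignores the per-record factors listed above (read length, tags, validation stringency), so the result is a starting point rather than a safe bound.

```java
public class RecordsInRamScale {
    static final long BASELINE_RECORDS = 500_000L;      // default quoted in the thread
    static final long BASELINE_HEAP_BYTES = 2L << 30;   // 2GB, as quoted in the thread

    // Scale the record threshold in proportion to the heap size.
    static long scaledRecords(long heapBytes) {
        return BASELINE_RECORDS * heapBytes / BASELINE_HEAP_BYTES;
    }

    public static void main(String[] args) {
        // Five times the baseline heap gives five times the baseline records.
        System.out.println(scaledRecords(10L << 30));
    }
}
```

For a 10GB heap this gives 2,500,000 records, i.e. both numbers multiplied by 5 as described in the message.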