From: Christoph B. <Chr...@uk...> - 2011-05-28 15:10:18
Hi André,

thanks a lot for your answer, it did the trick! I don't really know why, but it worked. So great, thanks!

Christoph

Am 27.05.2011 12:04, schrieb André Götze:
> Hi Christoph,
>
> I had a similar problem with a 170 GB BAM lately. Strangely, I had to
> realize that Picard's MarkDuplicates actually runs more stably with
> less heap. So I would recommend using only -Xmx4g, not -Xmx120g. At
> least that's the value that works for me.
>
> André
>
> Am 27.05.2011 11:40, schrieb Christoph Bartenhagen:
>> Hello everyone,
>>
>> I have a quite large alignment of paired-end reads in BAM format (ca.
>> 120 GB, almost 2 billion reads of 90 bp length). The file is
>> coordinate sorted and was generated by merging the alignments of 8
>> single lanes with MergeSamFiles.
>> When I try to remove duplicates (really removing them, not just
>> marking them) from this huge file with MarkDuplicates, I run into
>> serious memory problems. Duplicate removal seemed to work (it said
>> "net.sf.picard.sam.MarkDuplicates done." after 12 hours). But when it
>> comes to sorting, Java says:
>>
>> Exception in thread "main" java.lang.OutOfMemoryError: Requested array size exceeds VM limit
>>     at net.sf.samtools.util.SortingLongCollection.<init>(SortingLongCollection.java:101)
>>     at net.sf.picard.sam.MarkDuplicates.generateDuplicateIndexes(MarkDuplicates.java:426)
>>     at net.sf.picard.sam.MarkDuplicates.doWork(MarkDuplicates.java:111)
>>     at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:165)
>>     at net.sf.picard.sam.MarkDuplicates.main(MarkDuplicates.java:93)
>>
>> I allowed up to 120 GB of heap space for Java when I started
>> MarkDuplicates (the machine has 128 GB of RAM):
>>
>> java -Xmx120g -jar MarkDuplicates.jar INPUT=[...] OUTPUT=[...]
>> METRICS_FILE=[...] REMOVE_DUPLICATES=true ASSUME_SORTED=true
>> VALIDATION_STRINGENCY=LENIENT TMP_DIR=[...]
>>
>> Well, I used the quite old Picard version 1.33. Is this a problem?
>> Did the memory requirements change in the newer versions?
>> If not, does someone have ideas or workarounds to get this thing
>> running (like some Java or Picard options I'm not aware of)? What is
>> the usual practice for such large datasets?
>> I also tried to reduce the amount of data by first removing the
>> duplicates on every single lane, merging the duplicate-free
>> alignments, and then removing the duplicates again on this ca. 25%
>> smaller file. But I got the same error.
>> It would make me very happy if someone could help me out here. Thanks
>> in advance!
>>
>> Cheers,
>> Christoph
>>
>> _______________________________________________
>> Samtools-help mailing list
>> Sam...@li...
>> https://lists.sourceforge.net/lists/listinfo/samtools-help

--
Christoph Bartenhagen
Institute of Medical Informatics
University of Münster
Albert-Schweitzer-Campus 1, Building A11
48149 Münster, Germany
phone: +49 (0)251/83-58367
mail: Chr...@uk...
web: http://imi.uni-muenster.de
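André's fix — capping the heap rather than maximizing it — can be sketched as a small dry-run wrapper. A plausible explanation, not stated in the thread, is that MarkDuplicates sizes SortingLongCollection's in-memory long[] buffer from the available heap, and Java arrays are limited to roughly 2^31 elements, so an enormous -Xmx can make it request an array above the VM limit, while a modest -Xmx keeps the buffer within bounds. The file names, the TMP_DIR path, and the 4g heap value below are placeholders following André's suggestion, not values verified for any particular dataset:

```shell
#!/bin/sh
# Sketch: run Picard MarkDuplicates with a deliberately modest heap.
# All paths are placeholders; adjust for your own data.
HEAP=4g                 # small heap, per André's observation
INPUT=input.bam
OUTPUT=dedup.bam
METRICS=dup_metrics.txt
TMPDIR=/tmp/picard      # put spill files on a disk with plenty of space

# Build the command string so it can be inspected before running.
CMD="java -Xmx${HEAP} -jar MarkDuplicates.jar"
CMD="$CMD INPUT=${INPUT} OUTPUT=${OUTPUT} METRICS_FILE=${METRICS}"
CMD="$CMD REMOVE_DUPLICATES=true ASSUME_SORTED=true"
CMD="$CMD VALIDATION_STRINGENCY=LENIENT TMP_DIR=${TMPDIR}"

# Dry run: print the command instead of executing it.
# Replace the echo with `eval "$CMD"` (or call java directly) to run it.
echo "$CMD"
```

With a small heap, MarkDuplicates spills more intermediate data to TMP_DIR instead of holding it in RAM, so make sure that directory sits on a filesystem with ample free space.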