From: Christoph B. <Chr...@uk...> - 2011-05-27 09:40:05
Hello everyone,

I have a quite large alignment of paired-end reads in BAM format (ca. 120 GB, almost 2 billion reads of 90 bp length). The file is coordinate sorted and was generated by merging the alignments of 8 single lanes with MergeSamFiles.

When I try to remove duplicates (really removing them, not just marking them) from this huge file with MarkDuplicates, I run into serious memory problems. Duplicate removal seemed to work (it said "net.sf.picard.sam.MarkDuplicates done." after 12 hours). But when it comes to sorting, Java says:

    Exception in thread "main" java.lang.OutOfMemoryError: Requested array size exceeds VM limit
        at net.sf.samtools.util.SortingLongCollection.<init>(SortingLongCollection.java:101)
        at net.sf.picard.sam.MarkDuplicates.generateDuplicateIndexes(MarkDuplicates.java:426)
        at net.sf.picard.sam.MarkDuplicates.doWork(MarkDuplicates.java:111)
        at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:165)
        at net.sf.picard.sam.MarkDuplicates.main(MarkDuplicates.java:93)

I allowed up to 120 GB of heap space for Java when I started MarkDuplicates (the machine has 128 GB of RAM):

    java -Xmx120g -jar MarkDuplicates.jar INPUT=[...] OUTPUT=[...] \
        METRICS_FILE=[...] REMOVE_DUPLICATES=true ASSUME_SORTED=true \
        VALIDATION_STRINGENCY=LENIENT TMP_DIR=[...]

I used the quite old Picard version 1.33. Is this a problem? Did the memory requirements change in the newer versions? If not, does someone have ideas or workarounds to get this running (like some Java or Picard options I'm not aware of)? What is the usual practice for such large datasets?

I also tried to reduce the amount of data by first removing duplicates on every single lane, merging the duplicate-free alignments, and then removing duplicates again on this ca. 25% smaller file. But I got the same error.

It would make me very happy if someone could help me out here. Thanks in advance!

Cheers,
Christoph
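The stack trace points at the in-memory buffer that MarkDuplicates allocates in generateDuplicateIndexes for sorting duplicate indexes. A plausible reading, sketched below, is that the buffer length is derived as a fixed fraction of the JVM's maximum heap (the 0.25 fraction and all class/method names in the sketch are assumptions for illustration, not verified against the 1.33 source): with a very large -Xmx, the computed length clamps to Integer.MAX_VALUE, which is slightly above the largest array the VM will actually allocate, hence "Requested array size exceeds VM limit".

    // Hypothetical reconstruction of heap-proportional buffer sizing;
    // the 0.25 fraction and the names here are assumptions for illustration.
    public class HeapSizingSketch {
        static final int SIZEOF_LONG = 8; // SortingLongCollection stores 8-byte longs

        static int maxEntriesInRam(long maxHeapBytes) {
            // A double-to-int cast in Java clamps values above
            // Integer.MAX_VALUE to Integer.MAX_VALUE rather than wrapping.
            return (int) ((maxHeapBytes * 0.25) / SIZEOF_LONG);
        }

        public static void main(String[] args) {
            long gib = 1L << 30;
            System.out.println(maxEntriesInRam(120 * gib)); // 2147483647 (clamped)
            System.out.println(maxEntriesInRam(4 * gib));   // 134217728 (~134M entries)
            // Allocating new long[2147483647] then fails with "Requested array
            // size exceeds VM limit", since the JVM's true maximum array length
            // is a few elements below Integer.MAX_VALUE.
        }
    }

Under that reading, a smaller heap yields a modest buffer, and the external sort spills the remainder to disk under TMP_DIR.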
From: André G. <a.g...@dk...> - 2011-05-27 10:05:02
Hi Christoph,

I had a similar problem with a 170 GB BAM lately. Strangely, I had to realize that Picard's MarkDuplicates actually runs more stably with less heap. So I would recommend using only -Xmx4g, not -Xmx120g. At least that's the value that works for me.

André

On 27.05.2011 11:40, Christoph Bartenhagen wrote:
> Hello everyone,
>
> I have a quite large alignment of paired-end reads in BAM format (ca.
> 120 GB, almost 2 billion reads of 90 bp length). [...]
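Applied to the invocation from the first message, the suggestion only changes the -Xmx flag; the bracketed paths stay elided as in the original:

    java -Xmx4g -jar MarkDuplicates.jar INPUT=[...] OUTPUT=[...] \
        METRICS_FILE=[...] REMOVE_DUPLICATES=true ASSUME_SORTED=true \
        VALIDATION_STRINGENCY=LENIENT TMP_DIR=[...]

With a small heap, more of the sorting happens in spill files under TMP_DIR, so that directory should have plenty of free space.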
From: Christoph B. <Chr...@uk...> - 2011-05-28 15:10:18
Hi André,

thanks a lot for your answer, it did the trick! I don't really know why, but it worked. Great, thanks!

Christoph

On 27.05.2011 12:04, André Götze wrote:
> I had a similar problem with a 170 GB BAM lately. Strangely, I had to
> realize that Picard's MarkDuplicates actually runs more stably with
> less heap. So I would recommend using only -Xmx4g, not -Xmx120g. [...]

--
Christoph Bartenhagen
Institute of Medical Informatics
University of Münster
Albert-Schweitzer-Campus 1, Building A11
48149 Münster, Germany
phone: +49 (0)251/83-58367
mail: Chr...@uk...
web: http://imi.uni-muenster.de