From: Christoph B. <Chr...@uk...> - 2011-05-27 09:40:05
Hello everyone,

I have a quite large alignment of paired-end reads in BAM format (ca. 120 GB, almost 2 billion reads of 90 bp length). The file is coordinate sorted and was generated by merging the alignments of 8 single lanes with MergeSamFiles.

When I try to remove duplicates (really removing them, not just marking them) from this huge file with MarkDuplicates, I run into serious memory problems. Duplicate removal seemed to work (it said "net.sf.picard.sam.MarkDuplicates done." after 12 hours). But when it comes to sorting, Java says:

    Exception in thread "main" java.lang.OutOfMemoryError: Requested array size exceeds VM limit
        at net.sf.samtools.util.SortingLongCollection.<init>(SortingLongCollection.java:101)
        at net.sf.picard.sam.MarkDuplicates.generateDuplicateIndexes(MarkDuplicates.java:426)
        at net.sf.picard.sam.MarkDuplicates.doWork(MarkDuplicates.java:111)
        at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:165)
        at net.sf.picard.sam.MarkDuplicates.main(MarkDuplicates.java:93)

I allowed up to 120 GB of heap space for Java when I started MarkDuplicates (the machine has 128 GB of RAM):

    java -Xmx120g -jar MarkDuplicates.jar INPUT=[...] OUTPUT=[...] \
        METRICS_FILE=[...] REMOVE_DUPLICATES=true ASSUME_SORTED=true \
        VALIDATION_STRINGENCY=LENIENT TMP_DIR=[...]

I used the quite old Picard version 1.33. Is this a problem? Did the memory requirements change in the newer versions? If not, does someone have ideas or workarounds to get this running (like some Java or Picard options I'm not aware of)? What is the usual practice for such large datasets?

I also tried to reduce the amount of data by first removing duplicates on every single lane, merging the duplicate-free alignments, and then removing duplicates again on this ca. 25% smaller file. But I got the same error.

It would make me very happy if someone could help me out here. Thanks in advance!

Cheers,
Christoph
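The stack trace points at the in-memory buffer that MarkDuplicates allocates in generateDuplicateIndexes for sorting duplicate indexes. A plausible reading, sketched below, is that the buffer length is derived as a fixed fraction of the JVM's maximum heap (the 0.25 fraction and all class/method names in the sketch are assumptions for illustration, not verified against the 1.33 source): with a very large -Xmx, the computed length clamps to Integer.MAX_VALUE, which is slightly above the largest array the VM will actually allocate, hence "Requested array size exceeds VM limit".

    // Hypothetical reconstruction of heap-proportional buffer sizing;
    // the 0.25 fraction and the names here are assumptions for illustration.
    public class HeapSizingSketch {
        static final int SIZEOF_LONG = 8; // SortingLongCollection stores 8-byte longs

        static int maxEntriesInRam(long maxHeapBytes) {
            // A double-to-int cast in Java clamps values above
            // Integer.MAX_VALUE to Integer.MAX_VALUE rather than wrapping.
            return (int) ((maxHeapBytes * 0.25) / SIZEOF_LONG);
        }

        public static void main(String[] args) {
            long gib = 1L << 30;
            System.out.println(maxEntriesInRam(120 * gib)); // 2147483647 (clamped)
            System.out.println(maxEntriesInRam(4 * gib));   // 134217728 (~134M entries)
            // Allocating new long[2147483647] then fails with "Requested array
            // size exceeds VM limit", since the JVM's true maximum array length
            // is a few elements below Integer.MAX_VALUE.
        }
    }

Under that reading, a smaller heap yields a modest buffer, and the external sort spills the remainder to disk under TMP_DIR.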
From: André G. <a.g...@dk...> - 2011-05-27 10:05:02
Hi Christoph,

I had a similar problem with a 170 GB BAM lately. Strangely, I had to realize that Picard's MarkDuplicates actually runs more stably with less heap. So I would recommend using only -Xmx4g, not -Xmx120g. At least that's the value that works for me.

André

On 27.05.2011 11:40, Christoph Bartenhagen wrote:
> Hello everyone,
>
> I have a quite large alignment of paired-end reads in BAM format (ca.
> 120 GB, almost 2 billion reads of 90 bp length). [...]
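Applied to the invocation from the first message, the suggestion only changes the -Xmx flag; the bracketed paths stay elided as in the original:

    java -Xmx4g -jar MarkDuplicates.jar INPUT=[...] OUTPUT=[...] \
        METRICS_FILE=[...] REMOVE_DUPLICATES=true ASSUME_SORTED=true \
        VALIDATION_STRINGENCY=LENIENT TMP_DIR=[...]

With a small heap, more of the sorting happens in spill files under TMP_DIR, so that directory should have plenty of free space.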
From: Christoph B. <Chr...@uk...> - 2011-05-28 15:10:18
Hi André,

thanks a lot for your answer, it did the trick! I don't really know why, but it worked. Great, thanks!

Christoph

On 27.05.2011 12:04, André Götze wrote:
> I had a similar problem with a 170 GB BAM lately. Strangely, I had to
> realize that Picard's MarkDuplicates actually runs more stably with
> less heap. So I would recommend using only -Xmx4g, not -Xmx120g. [...]

--
Christoph Bartenhagen
Institute of Medical Informatics
University of Münster
Albert-Schweitzer-Campus 1, Building A11
48149 Münster, Germany
phone: +49 (0)251/83-58367
mail: Chr...@uk...
web: http://imi.uni-muenster.de