From: Alec W. <al...@br...> - 2011-06-27 17:41:57
Hi Jessica,

You appear to be trying all the right things. There was a problem with an older version of snappy-java in which it would initially allocate large buffers for reading (and we create lots of them), and would grow them but never shrink them. However, the version of snappy-java you are using should not have that problem. Note that if you are running on a Linux box then you shouldn't need to provide the snappy-java jar on the classpath. Some things you might try:

* Make sure you are using Picard 1.48.
* If running on Linux, omit putting snappy-java on the classpath, as it is included in MarkDuplicates.jar.
* If not running on Linux, grab the latest snappy-java from the Picard repository: http://picard.svn.sourceforge.net/viewvc/picard/trunk/lib/snappy-java-1.0.3-rc3.jar?revision=878 (this shouldn't matter, but I'm grasping at straws).
* Send me the stack trace in case it gives me some other idea.

-Alec

On 6/27/11 12:56 PM, Jessica Maia wrote:
> Hi there,
>
> I'm using Picard 1.48 to remove duplicates using the snappy library. I'm encountering memory issues similar to those I saw with an earlier version of Picard:
>
> Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded.
>
> We routinely use Picard to remove duplicates from our whole-genome sequencing samples, which have about 30-40x coverage. Our alignments are generated with bwa-0.5.5. Samtools 'rmdup' is able to remove duplicates for the 4 samples in question, and the BAM files before and after applying Samtools rmdup differ in size by less than 10%, so it seems unlikely that duplication is rampant.
>
> This is how I'm running Picard:
>
> java -jar -Xmx8g -Dsnappy.loader.verbosity=true -classpath snappy-java-1.0.3-rc3-20110610.011644-1.jar $picard_dir/MarkDuplicates.jar TMP_DIR=$out_dir VALIDATION_STRINGENCY=SILENT INPUT=$bam_file OUTPUT=$combined_rmdup_file METRICS_FILE=$duplicate_metrics REMOVE_DUPLICATES=true MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=5000000 MAX_RECORDS_IN_RAM=5000000 VERBOSITY=WARNING ASSUME_SORTED=true SORTING_COLLECTION_SIZE_RATIO=0.005
>
> Log file:
>
> [Fri Jun 24 13:42:01 EDT 2011] net.sf.picard.sam.MarkDuplicates INPUT=bam OUTPUT=rmdup.bam METRICS_FILE=duplicate_metrics REMOVE_DUPLICATES=true ASSUME_SORTED=true MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=5000000 SORTING_COLLECTION_SIZE_RATIO=0.0050 TMP_DIR=picard_148 VERBOSITY=WARNING VALIDATION_STRINGENCY=SILENT MAX_RECORDS_IN_RAM=5000000 MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 READ_NAME_REGEX=[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).* OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 QUIET=false COMPRESSION_LEVEL=5 CREATE_INDEX=false CREATE_MD5_FILE=false
> Snappy stream classes loaded.
> [Fri Jun 24 14:05:56 EDT 2011] net.sf.picard.sam.MarkDuplicates done. Elapsed time: 24.55 minutes.
> Runtime.totalMemory()=3295150080
> Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded.
>
> I've tried changing SORTING_COLLECTION_SIZE_RATIO and the -Xmx parameter:
>
> -Xmx  MAX_FILE_HANDLES_FOR_READ_ENDS_MAP  MAX_RECORDS_IN_RAM  SORTING_COLLECTION_SIZE_RATIO
> 12g   5 million                           5 million           0.1
> 14g   5 million                           5 million           0.1
> 14g   5 million                           5 million           0.15
> 14g   5 million                           5 million           0.05
> 14g   5 million                           5 million           0.01
> 14g   5 million                           5 million           0.005
> 8g    5 million                           5 million           0.005
>
> Picard has failed to remove duplicates in all instances. Are there any other suggestions to solve this issue?
>
> Thanks,
>
> Jessica
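
One detail not raised in the thread: when java is started with -jar, the JVM ignores the -classpath option entirely and takes the classpath from the named jar alone, so the snappy-java jar in Jessica's invocation never actually reaches the classpath (the "Snappy stream classes loaded." log line indicates the copy bundled in MarkDuplicates.jar was picked up instead). Below is a minimal sketch of an invocation that does put the external snappy-java jar on the classpath, assuming the main class net.sf.picard.sam.MarkDuplicates shown in the log can be run directly; jar names, shell variables, and tool options are carried over unchanged from the original command:

    java -Xmx8g -Dsnappy.loader.verbosity=true \
        -classpath snappy-java-1.0.3-rc3-20110610.011644-1.jar:$picard_dir/MarkDuplicates.jar \
        net.sf.picard.sam.MarkDuplicates \
        TMP_DIR=$out_dir VALIDATION_STRINGENCY=SILENT INPUT=$bam_file \
        OUTPUT=$combined_rmdup_file METRICS_FILE=$duplicate_metrics \
        REMOVE_DUPLICATES=true MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=5000000 \
        MAX_RECORDS_IN_RAM=5000000 VERBOSITY=WARNING ASSUME_SORTED=true \
        SORTING_COLLECTION_SIZE_RATIO=0.005

On Linux this should be unnecessary per Alec's note above, since MarkDuplicates.jar bundles snappy-java; on other platforms it ensures the snappy-java version placed first on the classpath is the one that gets loaded.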