From: Tim F. <tfe...@br...> - 2012-01-17 17:15:00
Hi Lorenzo,

Yes - the changes are quite recent - within the last 4 weeks.

There's really not a whole lot you can do to make it go a lot faster. Or rather, I can't conceive of a way to make it faster via parallelization that wouldn't use many times the system resources of the non-parallel version. If you have lots of compute and I/O to spare you could conceivably:

1) Split the input BAM so that both ends of a pair always end up in the same shard, organizing by the chromosome of the first in pair when the reads are on different chromosomes.
2) Run MarkDuplicates independently on each file.
3) Merge the shards back together.

But that's a ton of extra I/O, and with the latest version around 50% of the runtime is in that final file write, which you'd be duplicating in step 3.

Other things you can do to try and squeeze out some more speed:
- If you don't care about the output file size, you can use COMPRESSION_LEVEL=0, which will run faster.
- If your temp filesystem is blazingly fast, you can disable Snappy by passing "-Dsnappy.disable=true" on the command line. But if your temp filesystem isn't very fast, this might actually slow things down.

-t

On Jan 17, 2012, at 11:31 AM, Lorenzo Pesce wrote:

> On Jan 17, 2012, at 10:14 AM, Tim Fennell wrote:
>
>> Hi Lorenzo,
>>
>> MarkDuplicates has three major phases:
>> - Reading the input SAM/BAM file and collecting information about read positions
>> - Traversing the information gathered and detecting duplicates
>> - Writing the output file with duplicates marked/removed
>>
>> I can tell from your last phase (11 hours) that you're running an older version of MarkDuplicates, because we recently added more progress logging in that part so you can see what the program is doing and how far it has gotten. Around the same time we made a small change that dramatically improved performance, cutting runtime by about 50%.
>> So firstly I'd suggest updating to the latest version :)
>
> I was using picard-tools-1.57. Are the changes you describe in 1.60? I installed it just a couple of weeks ago, not a decade ago! ;-)
>
> Sounds good. Any idea what else I can change about my run flags? Did you add anything for multi-threading (I have 24 cores per node)?
>
> I have no idea how some of the flags can be changed.
>
>> Looking at my own server logs, I see a recent run of MarkDuplicates where we processed ~1.1bn records on a fairly modern Dell x86 compute node using a single CPU and 4GB of memory. The runtime was ~13 hours, and I would expect a file of approximately twice the size to take about twice as long.
>
> How can I best make use of my much larger memory, high-performance disks and multi-threading here? (I know how to parallelize calculations and run massive calculations in C++, but Java and Picard are alien to me.)
>
> Thanks!
>
> Lorenzo
>
>> -t
>>
>> On Jan 17, 2012, at 11:06 AM, Lorenzo Pesce wrote:
>>
>>> Hi --
>>> I am a novice to Picard. I ran Picard MarkDuplicates on a 2-billion-record BAM file with the command:
>>>
>>> java -Djava.io.tmpdir=${TMPDIR} -Xmx28g -jar MarkDuplicates.jar I=<input> O=<output> METRICS_FILE=<metric> ASSUME_SORTED=true MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=20000 SORTING_COLLECTION_SIZE_RATIO=0.25 MAX_RECORDS_IN_RAM=5000000 VALIDATION_STRINGENCY=LENIENT TMP_DIR=$TMPDIR >& <log>
>>>
>>>> java -version
>>> java version "1.6.0_22"
>>> Java(TM) SE Runtime Environment (build 1.6.0_22-b04)
>>> Java HotSpot(TM) 64-Bit Server VM (build 17.1-b03, mixed mode)
>>>
>>> The calculation took 56 hours on a single Cray XE node with 32 GB of RAM and a 24-core dual Magny-Cours processor:
>>>
>>> - 40 hours were spent tracking pairs,
>>> - 5 hours tracking clusters, and
>>> - 11 hours in not-so-clear activities (writing to disk?).
>>>
>>> My questions are:
>>> 1) I looked at various descriptions of issues affecting performance, but I couldn't find anything that would help me here. Any suggestions?
>>> 2) What is Picard doing in the last 11 hours, and am I doing something wrong there?
>>>
>>> If you have any references you can point me to, I would be happy to read them (one side effect of being called "picard" is that searches produce a lot of unwanted hits).
>>>
>>> Thanks a lot,
>>>
>>> Lorenzo
>>>
>>> _______________________________________________
>>> Samtools-help mailing list
>>> Sam...@li...
>>> https://lists.sourceforge.net/lists/listinfo/samtools-help
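For reference, the two speed tweaks Tim mentions (COMPRESSION_LEVEL=0 and -Dsnappy.disable=true) could be folded into Lorenzo's original invocation roughly as below. This is only a sketch: the file names input.bam, marked.bam, and metrics.txt are placeholders, not paths from the thread, and -Dsnappy.disable=true is only worthwhile when the temp filesystem is fast.

```shell
#!/usr/bin/env bash
# Sketch of Lorenzo's MarkDuplicates command with Tim's two speed tweaks.
# Placeholder paths: input.bam, marked.bam, metrics.txt, MarkDuplicates.jar.
TMPDIR=${TMPDIR:-/tmp}

# -Dsnappy.disable=true bypasses Snappy compression of temp spill files;
# COMPRESSION_LEVEL=0 skips compressing the output BAM (larger file, faster write).
CMD="java -Djava.io.tmpdir=${TMPDIR} -Dsnappy.disable=true -Xmx28g"
CMD="$CMD -jar MarkDuplicates.jar I=input.bam O=marked.bam METRICS_FILE=metrics.txt"
CMD="$CMD ASSUME_SORTED=true COMPRESSION_LEVEL=0 VALIDATION_STRINGENCY=LENIENT TMP_DIR=${TMPDIR}"

echo "$CMD"
```

As Tim notes, disabling output compression trades disk space for write speed, which matters because the final write dominated the unexplained 11 hours in Lorenzo's run.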