From: Christopher B. <Chr...@ph...> - 2012-01-17 17:33:40
Hi all,

Forgive my apparent ignorance, but when would you ever care about pairs on
different chromosomes? Isn't that clearly an invalid read?

Cheers,
Chris

--
Christopher Beck | Director of Bioinformatics | www.pgx.ca
Univ. de Montreal Pharmacogenomics Centre | 514.670.7663
5000 Belanger Est, Suite S-2070, Montreal, Quebec, H1T 1C8

> -----Original Message-----
> From: Tim Fennell [mailto:tfe...@br...]
> Sent: Tuesday, January 17, 2012 12:15 PM
> To: Lorenzo Pesce
> Cc: sam...@li...
> Subject: Re: [Samtools-help] Speed of picard markduplicates
>
> Hi Lorenzo,
>
> Yes - the changes are quite recent - within the last 4 weeks.
>
> There's really not a whole lot you can do to make it go a lot faster. Or
> rather, I can't conceive of a way of making it go faster via
> parallelization that wouldn't use many times the amount of system
> resources that the non-parallel version does. If you have lots of compute
> and I/O to spare you could conceivably:
>
> 1) Split the input BAM so that both ends of a pair always end up in the
> same shard, organized by the chromosome of the first in pair in cases
> where the reads are on different chromosomes.
>
> 2) Run MarkDuplicates independently on each file
>
> 3) Merge them back together
>
> But that's a ton of extra I/O to do, and with the latest version around
> 50% of the runtime is in that final file write that you'd be duplicating
> in step 3.
>
> Other things you can do to try and squeeze out some more speed:
> - If you don't care about the output file size you can use
> COMPRESSION_LEVEL=0, which will run faster
> - If your temp filesystem is blazingly fast you can disable snappy by
> passing "-Dsnappy.disable=true" on the command line. But if your temp
> filesystem isn't very fast this might actually slow things down.
>
> -t
>
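A rough, untested shell sketch of the split/mark/merge idea above might look
like the following. The file names, the -Xmx value, and the use of samtools
for splitting and merging are assumptions on my part, not anything prescribed
by Picard; note also that this naive per-chromosome split does not implement
step 1's careful handling of pairs whose mates map to different chromosomes,
and it drops unplaced reads entirely:

    # Index the coordinate-sorted input so per-chromosome extraction works.
    samtools index input.bam

    # Naive split: one shard per reference sequence. Mates that map to
    # different chromosomes land in different shards, so duplicates among
    # such pairs would be missed; unmapped reads are not captured at all.
    for chrom in $(samtools idxstats input.bam | cut -f1 | grep -v '^\*$'); do
        samtools view -b input.bam "$chrom" > shard."$chrom".bam
    done

    # Mark duplicates on each shard independently (Picard 1.x style jars).
    for shard in shard.*.bam; do
        java -Xmx4g -jar MarkDuplicates.jar I="$shard" O="marked.$shard" \
            METRICS_FILE="$shard".metrics ASSUME_SORTED=true
    done

    # Merge the marked shards back into a single BAM.
    samtools merge marked.merged.bam marked.shard.*.bam

Whether the extra reads and writes pay off depends on how fast the filesystem
is relative to the single-threaded MarkDuplicates run, as Tim notes above.
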
> On Jan 17, 2012, at 11:31 AM, Lorenzo Pesce wrote:
>
> >
> > On Jan 17, 2012, at 10:14 AM, Tim Fennell wrote:
> >
> >> Hi Lorenzo,
> >>
> >> MarkDuplicates has three major phases:
> >> - Reading the input SAM/BAM file and collecting information about
> >>   read positions
> >> - Traversing the information gathered and detecting duplicates
> >> - Writing the output file with duplicates marked/removed
> >>
> >> I can tell from your last part (11 hours) that you're running an
> >> older version of MarkDuplicates because we recently added some more
> >> progress logging in that part so that you can see what the program is
> >> doing and how far it has gotten. Around the same time we made a
> >> small change that dramatically improved performance, cutting runtime
> >> by about 50%. So firstly I'd suggest updating to the latest version
> >> :)
> >
> > I was using picard-tools-1.57. Are the changes you describe in 1.60?
> > I installed it just a couple of weeks ago, not a decade ago! ;-)
> >
> > Sounds good. Any idea about what else I can change about my running
> > flags? Did you add anything about multi-threading? (I have 24 cores
> > per node.)
> >
> > I have no idea how some of the flags can be changed.
> >
> >
> >> Looking at my own server logs I see a recent run of MarkDuplicates
> >> where we processed ~1.1bn records on a fairly modern Dell x86 compute
> >> node using a single CPU and 4GB of memory. The runtime was ~13 hours,
> >> and I would expect that a file of approximately twice the size would
> >> take about twice as long.
> >
> > How can I best make use of my much larger memory, high-performance
> > disks and multi-threading here? (I know how to parallelize and run
> > massive calculations in C++, but Java and Picard are alien to me.)
> >
> > Thanks!
> >
> > Lorenzo
> >
> >>
> >> -t
> >>
> >>
> >> On Jan 17, 2012, at 11:06 AM, Lorenzo Pesce wrote:
> >>
> >>> Hi --
> >>> I am a novice to Picard. I have run Picard MarkDuplicates on a
> >>> 2-billion-record BAM file with the command:
> >>>
> >>> java -Djava.io.tmpdir=${TMPDIR} -Xmx28g -jar MarkDuplicates.jar
> >>>   I=<input> O=<output> METRICS_FILE=<metric> ASSUME_SORTED=true
> >>>   MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=20000
> >>>   SORTING_COLLECTION_SIZE_RATIO=0.25
> >>>   MAX_RECORDS_IN_RAM=5000000 VALIDATION_STRINGENCY=LENIENT
> >>>   TMP_DIR=$TMPDIR >& <log>
> >>>
> >>>> java -version
> >>> java version "1.6.0_22"
> >>> Java(TM) SE Runtime Environment (build 1.6.0_22-b04)
> >>> Java HotSpot(TM) 64-Bit Server VM (build 17.1-b03, mixed mode)
> >>>
> >>> The calculation took 56 hours on a single Cray XE node with 32 GB of
> >>> RAM and a 24-core dual Magny-Cours processor.
> >>>
> >>> 40 hours were spent tracking pairs,
> >>> 5 hours tracking clusters, and
> >>> 11 hours in not-so-clear activities (writing to disk?).
> >>>
> >>> My questions are:
> >>> 1) I looked at various descriptions of issues affecting performance,
> >>> but I couldn't find anything that would help me here. Any suggestions?
> >>> 2) What is Picard doing in the last 11 hours, and am I doing something
> >>> wrong there?
> >>>
> >>> If you have any references you can point me to, I would be happy to
> >>> read them (one side effect of being called "picard" is that doing
> >>> searches produces a lot of unwanted hits).
> >>>
> >>> Thanks a lot,
> >>>
> >>> Lorenzo
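
For completeness, Lorenzo's command line with Tim's two tuning suggestions
folded in would look roughly like this (untested; <input>, <output>, <metric>,
<log> and the -Xmx value are the placeholders from the original post, the
larger output that comes with COMPRESSION_LEVEL=0 is assumed to be acceptable,
and -Dsnappy.disable=true is only worth trying when the temp filesystem is
fast):

    java -Xmx28g -Dsnappy.disable=true -Djava.io.tmpdir=${TMPDIR} \
        -jar MarkDuplicates.jar \
        I=<input> O=<output> METRICS_FILE=<metric> ASSUME_SORTED=true \
        COMPRESSION_LEVEL=0 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=20000 \
        SORTING_COLLECTION_SIZE_RATIO=0.25 MAX_RECORDS_IN_RAM=5000000 \
        VALIDATION_STRINGENCY=LENIENT TMP_DIR=${TMPDIR} >& <log>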