From: Rusch, M. <Mic...@ST...> - 2011-03-23 15:33:53
Thanks for the corrections/clarifications.

Specifically on read groups: I am sloppy when it comes to read groups, so you'll want to be aware of that if you want to not be sloppy.

Heng/others: does samtools sort report sort order in the output in the newer versions? If not, and if you have downstream analyses that read the sort order header and rely on it being correct, then you'll need to do one of the following:

1. Change the downstream analyses so that there is a way to tell them the sort order outside of reading it from the file (or to have them assume it's correct). This is a good practice anyway if you are the one writing your downstream analyses.
2. Use Picard for sort/merge
3. Add a step to inject the header with the proper sort order

It's a nitpicky thing, but if you get a bam file lacking one declaration in the header and your downstream tool refuses to use it, it's annoying.

With regards to Picard validation stringency, I still recommend at least starting with LENIENT. You will get a report of records that fail validation printed to stderr. If all records pass, then you could go back to the default of STRICT. With STRICT, the Picard command will exit with an exception on the first failed record. It can be maddening to find and deal with these one-by-one. Some popular aligners have been known to produce some records that Picard does not consider to be valid. There have been debates about whether these records are or are not valid. In my experience there are usually a very small number of these records and by using lenient validation and not "fixing" the records I avoid a lot of pain and effort.

Just my $0.02

Michael

-----Original Message-----
From: Heng Li [mailto:lh...@sa...]
Sent: Wednesday, March 23, 2011 10:06 AM
To: Rusch, Michael
Cc: 'Denis Reshetov'; sam...@li...
Subject: Re: [Samtools-help] sorting and merging order.

On Mar 23, 2011, at 10:24 AM, Rusch, Michael wrote:

> If you have a cluster or at least more than one CPU available, then you'll want to merge as late as possible in the process so that you can take advantage of having more than one process running in parallel.
>
> I suggest using Picard to do the sorting, merging, and marking duplicates, as it has some advantages over the samtools equivalents for all of these steps. Some of these may have changed in newer versions of samtools, though. For example, Picard SortSam can read sam and write bam, which handles your sam->bam conversion and sorting in a single step.

The following does SAM sorting with one command line too:

    samtools view -uS in.sam | samtools sort - out.srt

Usually when I run bwa, I get BAMs by:

    bwa sampe ref.fa r1.sai r2.sai r1.fq.gz r2.fq.gz | samtools view -bS - > aln.bam

or at least get compressed SAM by piping to gzip. For processing large-scale data, it is recommended to always have your data compressed.

> SortSam and MergeSamFiles also produce more complete headers than their samtools equivalents.

When read groups are not marked in the input SAM/BAM, samtools merge can add read groups to each read on the fly. This is frequently very handy.

> Just make sure that you specify VALIDATION_STRINGENCY=LENIENT when using the tools, as almost all bams break the spec at some point to some extent, and that will crash Picard without this option.

If "almost all" real-world BAMs break the spec, then that spec must be worth nothing. No, the SAM/BAM spec is not that bad. Most BAMs break Picard because in addition to the spec, Picard by default also checks recommended practices which are good to have but not required. If you are developing a pipeline for long-term use, it would be good to pass the default Picard validation; if you are processing data for quick results, conforming to the spec alone is usually sufficient.

> Also, there is no reason to index the single-lane bams unless you also want to use the individual bams for some purpose later. That would be an unusual workflow, though.
>
> Also, when you merge sorted files the result is already sorted, so you don't need that sort.
>
> You will probably also want to include a MarkDuplicates step.

Please beware that for RNA/ChIP-seq, you may not always want to run MarkDuplicates.

Heng

> So, I suggest:
>
> 1. Run SortSam on each sam file to create coordinate-sorted bam files. This can be done in parallel if you have the CPUs to do it.
> 2. Run MergeSamFiles on the sorted bams to create a merged (and sorted) bam
> 3. Run MarkDuplicates on that bam to create the final bam
> 4. Index that final bam using samtools index or Picard BuildBamIndex
>
> Attempt at ascii art:
>
> sam --(SortSam)--> sorted bam
>                              \
>                       ...     --(MergeSamFiles)--> bam --(MarkDuplicates)--> bam --(index)
>                              /
> sam --(SortSam)--> sorted bam
>
>
> Michael
>
> -----Original Message-----
> From: Denis Reshetov [mailto:res...@gm...]
> Sent: Wednesday, March 23, 2011 8:49 AM
> To: sam...@li...
> Subject: [Samtools-help] sorting and merging order.
>
> Dear colleagues,
>
> I'm trying to write a pipeline that runs bwa on each lane of a sequencing run and then merges the results into a single bam file.
> What I'm doing now is
> sam->bam->sorted bam->index
> for each lane,
> then merging the bam files from all lanes together, and sorting and indexing the resulting file.
> But it's a very slow process.
>
> Could you tell me whether it is possible to merge the sam files together and then do the bam conversion?
> Is that the right way to save processor time?
>
> Best regards,
>
> --
> Reshetov Denis
> tel.: +7-917-523-26-84
> skype: reshetovdenis1
>
> ------------------------------------------------------------------------------
> Enable your software for Intel(R) Active Management Technology to meet the
> growing manageability and security demands of your customers. Businesses
> are taking advantage of Intel(R) vPro (TM) technology - will your software
> be a part of the solution? Download the Intel(R) Manageability Checker
> today! http://p.sf.net/sfu/intel-dev2devmar
> _______________________________________________
> Samtools-help mailing list
> Sam...@li...
> https://lists.sourceforge.net/lists/listinfo/samtools-help
>
>
> Email Disclaimer: www.stjude.org/emaildisclaimer

--
The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.
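
For reference, the four-step workflow suggested in the thread (SortSam per lane, MergeSamFiles, MarkDuplicates, index) could be sketched roughly as below. This is only a sketch under stated assumptions: the `picard` and `build_pipeline` helpers, the install path, and all file names are hypothetical, and the commands are built but never executed. It assumes the per-tool jar layout (`SortSam.jar` etc.) and `KEY=value` options of Picard releases from around the time of this thread; check the option names against the version you have installed. As noted above, MarkDuplicates may not be appropriate for RNA/ChIP-seq data.

```python
import os
import shlex

# Hypothetical install location; adjust to wherever the Picard jars live.
PICARD_DIR = "/path/to/picard"


def picard(tool, options, picard_dir=PICARD_DIR):
    """Build a `java -jar Tool.jar KEY=value ...` command as an argument list.

    `options` is a list of (key, value) pairs so that a key such as INPUT
    can be repeated (MergeSamFiles takes one INPUT per file).
    """
    cmd = ["java", "-jar", f"{picard_dir}/{tool}.jar"]
    cmd += [f"{key}={value}" for key, value in options]
    # Start with LENIENT, as recommended in the thread; failures go to stderr.
    cmd.append("VALIDATION_STRINGENCY=LENIENT")
    return cmd


def build_pipeline(lane_sams, out_prefix):
    """Return (sort_cmds, final_cmds) for a list of per-lane SAM files.

    sort_cmds are independent of each other and can run in parallel;
    final_cmds (merge, mark duplicates, index) must run in order afterwards.
    """
    sorted_bams = [os.path.splitext(sam)[0] + ".sorted.bam" for sam in lane_sams]
    sort_cmds = [
        picard("SortSam", [("INPUT", sam), ("OUTPUT", bam),
                           ("SORT_ORDER", "coordinate")])
        for sam, bam in zip(lane_sams, sorted_bams)
    ]

    merged = f"{out_prefix}.merged.bam"
    final = f"{out_prefix}.dedup.bam"
    merge_cmd = picard(
        "MergeSamFiles",
        [("INPUT", bam) for bam in sorted_bams] + [("OUTPUT", merged)],
    )
    dedup_cmd = picard(
        "MarkDuplicates",
        [("INPUT", merged), ("OUTPUT", final),
         ("METRICS_FILE", f"{out_prefix}.dup_metrics.txt")],
    )
    index_cmd = ["samtools", "index", final]
    return sort_cmds, [merge_cmd, dedup_cmd, index_cmd]


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    sort_cmds, final_cmds = build_pipeline(["lane1.sam", "lane2.sam"], "sample")
    for cmd in sort_cmds + final_cmds:
        print(" ".join(shlex.quote(arg) for arg in cmd))
```

Keeping the per-lane sort commands separate from the merge/mark-duplicates/index commands mirrors the advice to merge as late as possible: the first list can be handed to a cluster scheduler or a process pool, and only the second list has to run serially.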