|
From: Alec W. <al...@br...> - 2011-02-20 17:56:01
|
Hi Folks, The COMMENT argument to MergeSamFiles can't be used to add @RG lines. It adds the (possibly ill-conceived) @CO record. Cliff needs either to add @RG records somewhere along the line before calling MarkDuplicates using Picard ReplaceSamHeader or Samtools reheader. Ideally the thing to do would be to add them before merging, and then use Picard's MergeSamFiles instead of samtools to merge. However, you could add @RG records on the file that you pass to MarkDuplicates. You need to know what the read groups IDs are on the SAM records in that file, and create an @RG header record for each different RG value in the SAM records, The importance of @RG records for MarkDuplicates is that it will not consider two alignments as duplicates if they come from different libraries, which is determined by looking at the LB field of the @RG record. If all your reads are from the same library, then having them all have the same read group will not be a problem. A read group typically corresponds with a single run from an instrument, e.g. an Illumina lane. If the same library was run on multiple lanes, then having the reads from different lanes having distinct RG IDs allows MarkDuplicates to better calculate the rate of optical duplication (as opposed to molecular duplication), since reads from different lanes can never be optical duplicates of one another. This is probably not very important for most users. -Alec On 2/19/11 1:41 PM, James W. MacDonald wrote: > Hi Cliff, > > On 2/19/2011 12:54 PM, Cliff Yiu wrote: >> >> Hi, James >> >> Thanks very much for your response! >> >> I just tried Picard's MergeSamFiles.jar to sort and merge two bam >> files which were generated from two flow cells. >> >> java -Xmx4g -jar >> /home/picard-tools-1.35/picard-tools-1.35/MergeSamFiles.jar >> INPUT=/home/FC42TW7AAXX_1.bam INPUT=/home/FC61517AAXX_3.bam >> OUTPUT=/home/test.bam TMP_DIR=/home/temp SORT_ORDER=coordinate >> VALIDATION_STRINGENCY=LENIENT >> >> Before I ran this command, I checked the two input bam files by >> "samtools view -H" and neither of them has any @RG header lines. I >> got these two input bam files by "samtools view -bS input.sam> >> input.bam". Does this mean the original sam file doesn't have any @RG >> lines? If this is true, even if I use MergeSamFiles.jar, it won't >> pass any @RG lines to the merged bam files.. Am I right? > > Possibly. I have never needed to merge, so can't say for sure. Note > that most aligners these days have an argument to add the read group > data to the header, so in the future you might want to do that at the > alignment stage. > > But it is clearly not worth it to realign just to add some lines to > the header. > > There is a COMMENT argument to MergeSamFiles that you could probably > use to add the @RG header; I am assuming here that this will just add > whatever you use as the comment verbatim into the header. Since you > can have 0 or more comments, you could just do > > COMMENT=@RG\tID:MyFirstReadgroup COMMENT=@RG\tID:MySecondReadgroup > > Here assuming that you don't need quotation marks, and should include > tab characters. > > However, as I mentioned in the previous email, I tend to just cat > things together and run back through samtools view. This way I know > exactly what I am doing, and it isn't much (if at all) slower than > other available methods. > > Since I don't know *exactly* what is expected for the COMMENTS > argument, I can't say for sure the above will work correctly. So you > will just have to play around with it until it does what you want. > >> >> Also, MergeSamFiles.jar has an option >> "MERGE_SEQUENCE_DICTIONARIES=BooleanMerge". Do I need to add this >> into the command? > > I think the argument value is 'Boolean', not BooleanMerge. Anyway, it > appears to me that this would only be necessary if you had used > different references to align the various samples that you are > merging. Otherwise, the sequence dictionaries should be identical (and > it's not necessary to do any merging). > > Best, > > Jim > > >> >> Thanks >> >> -C >> >>> Date: Sat, 19 Feb 2011 10:56:22 -0500 From: jm...@me... >>> To: mm...@ho... CC: kr...@sa...; >>> al...@br...; sam...@li... >>> Subject: Re: [Samtools-help] Error "RG ID on SAMRecord not found in >>> header" from Picard's MarkDuplicates.jar >>> >>> Hi Cliff, >>> >>> On 2/19/2011 10:27 AM, Cliff Yiu wrote: >>>> >>>> Hi, Keiran >>>> >>>> I just checked my bam file with your command and got lots of >>>> lines as >>>> >>>> @SQ SN:chr10 LN:135534747 @SQ SN:chr11 LN:135006516 @SQ >>>> SN:chr11_gl000202_random LN:40103 @SQ SN:chr12 LN:133851895 >>>> >>>> There is no @RG line.. What kind of output should I expect to see >>>> if there are @RG header lines? One @SQ line following by an @RG >>>> line? >>> >>> No. You should have a @SQ line per contig, followed by the @RG >>> lines for each read group you have in the bam file. >>>> >>>> Also, when I merged all the lanes or different flowcells, I used >>>> "merge -r" . The "-r" option is to attach an RG tag to each >>>> alignment based on SAMtools manual... Why don't I get any @RG >>>> header lines.? >>> >>> Because it doesn't do that. As stated in the help: >>> >>> Note: Samtools' merge does not reconstruct the @RG dictionary in >>> the header. Users must provide the correct header with -h, or uses >>> Picard which properly maintains the header dictionary in merging. >>> >>> So you could either use Picard's MergeSamFiles, which will do this >>> for you, or you can put the read groups in the header yourself. >>> There is samtools reheader, but I have found that to be less than >>> useful. Personally I usually just use unix command line tools to do >>> the deed. >>> >>> An easy way is >>> >>> samtools view -H yourbamfile.bam> header >>> >>> then use vi or emacs or whatever to add the @RG lines you need. If >>> you are not sure what they all are, you can use >>> >>> samtools view yourbamfile.bam | head >>> >>> to see which field contains the read groups, then >>> >>> samtools view yourbamfile.bam | cut -f<thecorrectfield> - | sort | >>> uniq >>>> thereadgroups >>> >>> then >>> >>> (cat header thereadgroups; samtools view yourbamfile;) | samtools >>> view -bS -> yourbamfilewithreadgroups.bam >>> >>> Best, >>> >>> Jim >>> >>> >>> Do >>>> you have an impression that which reader command I can use to add >>>> @RG lines? I just went through the SAMtools manual and couldn't >>>> figure it out... >>>> >>>> Thanks >>>> >>>> -C >>>> >>>> >>>> >>>> >>>> >>>> >>>>> To: al...@br...; mm...@ho... From: >>>>> kr...@sa... CC: sam...@li... Date: >>>>> Sat, 19 Feb 2011 14:18:35 +0000 Subject: RE: [Samtools-help] >>>>> Error "RG ID on SAMRecord not found in header" from Picard's >>>>> MarkDuplicates.jar >>>>> >>>>> Hi, >>>>> >>>>> I suspect you do not have any @RG header lines. Take a look >>>>> using: >>>>> >>>>> samtools view -H your.bam >>>>> >>>>> If not I believe there is a reader command you can pass a >>>>> valid header and the bam to that can fix this. >>>>> >>>>> Regards, Keiran >>>>> >>>>> Keiran Raine >>>>> >>>>> Senior Computer Biologist The Cancer Genome Project Ext: 7703 >>>>> >>>>> Sent from my HTC Desire Z >>>>> >>>>> >>>>> -----Original Message----- From: Cliff Yiu >>>>> [mm...@ho...] Received: Saturday, 19 Feb 2011, 1:30 To: >>>>> al...@br... [al...@br...] CC: >>>>> sam...@li... >>>>> [sam...@li...] Subject: [Samtools-help] >>>>> Error "RG ID on SAMRecord not found in header" from Picard's >>>>> MarkDuplicates.jar >>>>> >>>>> >>>>> >>>>> Hi >>>>> >>>>> I am trying the following command: >>>>> >>>>> java -Xmx4g -jar /home/picard-tools-1.35/MarkDuplicates.jar >>>>> INPUT=$dir/input.bam OUTPUT=$dir/output.bam TMP_DIR=/home/temp >>>>> METRICS_FILE=PCR_duplicates REMOVE_DUPLICATES=true >>>>> ASSUME_SORTED=true VALIDATION_STRINGENCY=LENIENT >>>>> >>>>> And getting lots of errors like "RG ID on SAMRecord not found >>>>> in header". >>>>> >>>>> The command is still running and I haven't seen output.bam >>>>> coming out...> From what I saw previously, it is most likely >>>>> that I will end up getting no output when the command stops. >>>>> >>>>> Anyone knows how to fix the error "RG ID on SAMRecord not found >>>>> in header"? >>>>> >>>>> BTW, I used BWA for reads mapping, then SAMtools for sorting >>>>> reads from each Illumina lane and merging all the lanes to the >>>>> input.bam. >>>>> >>>>> I am running commands in a cluster node which has 4GB RAM and >>>>> 10Tb space. >>>>> >>>>> Thanks >>>>> >>>>> -C >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> -- The Wellcome Trust Sanger Institute is operated by Genome >>>>> Research = >>>>> >>>>> Limited, a charity registered in England with number 1021457 >>>>> and a compa= ny registered in England with number 2742969, >>>>> whose registered office is 2= 15 Euston Road, London, NW1 2BE. >>>>> >>>>> >>>> >>>> >>>> >>>> >>>> ------------------------------------------------------------------------------ >>>> >>>> >>>> >>> >>>> > The ultimate all-in-one performance toolkit: Intel(R) Parallel Studio XE: >>>> Pinpoint memory and threading errors before they happen. Find and >>>> fix more than 250 security defects in the development cycle. >>>> Locate bottlenecks in serial and parallel code that limit >>>> performance. http://p.sf.net/sfu/intel-dev2devfeb >>>> >>>> >>>> >>>> _______________________________________________ Samtools-help >>>> mailing list Sam...@li... >>>> https://lists.sourceforge.net/lists/listinfo/samtools-help >>> >>> -- James W. MacDonald, M.S. Biostatistician Douglas Lab University >>> of Michigan Department of Human Genetics 5912 Buhl 1241 E. >>> Catherine St. Ann Arbor MI 48109-5618 734-615-7826 >>> ********************************************************** >>> Electronic Mail is not secure, may not be read every day, and >>> should not be used for urgent or sensitive issues >> > |