|
From: Alec W. <al...@br...> - 2011-02-22 14:01:56
|
Hi Tom, Actually, the code only looks at the LB field of the read group, and not the SM field. I don't think it would make sense, however, to have two read groups with the same value for LB but different values for SM. I make that statement somewhat reluctantly because of the creativity with which people put stuff into a a flowcell, but I believe it is generally true. -Alec On 2/21/11 8:44 AM, Tom Blackwell wrote: > > No, Richard Varhol's issue - which he correctly identified - was that > the sample identifiers as well as the library identifiers need to be > the same, in order to mark duplicates across read groups. > > - tom blackwell - > > On Sun, 20 Feb 2011, Tadigotla, Vasisht wrote: > >> Hi Alec, >> >> Just to clarify, Is MarkDuplicates removing duplicate reads (PCR >> duplicates) across different readgroups as long as the LB field of >> the @RG record is the same? As Richard Varhol pointed out in his >> email on Thu, when we modify the bam file so that the readgroup is >> the same for all reads, MarkDuplicates seems to find a lot more >> duplicate reads (~20% more). Is there a possible explanation? >> >> >> Thanks, >> Vasisht >> >> -----Original Message----- >> From: Alec Wysoker [mailto:al...@br...] >> Sent: Sunday, February 20, 2011 12:56 PM >> To: James W. MacDonald >> Cc: sam...@li... >> Subject: Re: [Samtools-help] Error "RG ID on SAMRecord not found in >> header" from Picard's MarkDuplicates.jar >> >> Hi Folks, >> >> The COMMENT argument to MergeSamFiles can't be used to add @RG lines. >> It adds the (possibly ill-conceived) @CO record. >> >> Cliff needs either to add @RG records somewhere along the line before >> calling MarkDuplicates using Picard ReplaceSamHeader or Samtools >> reheader. Ideally the thing to do would be to add them before merging, >> and then use Picard's MergeSamFiles instead of samtools to merge. >> However, you could add @RG records on the file that you pass to >> MarkDuplicates. You need to know what the read groups IDs are on the >> SAM records in that file, and create an @RG header record for each >> different RG value in the SAM records, >> >> The importance of @RG records for MarkDuplicates is that it will not >> consider two alignments as duplicates if they come from different >> libraries, which is determined by looking at the LB field of the @RG >> record. If all your reads are from the same library, then having them >> all have the same read group will not be a problem. >> >> A read group typically corresponds with a single run from an instrument, >> e.g. an Illumina lane. If the same library was run on multiple lanes, >> then having the reads from different lanes having distinct RG IDs allows >> MarkDuplicates to better calculate the rate of optical duplication (as >> opposed to molecular duplication), since reads from different lanes can >> never be optical duplicates of one another. This is probably not very >> important for most users. >> >> -Alec >> >> On 2/19/11 1:41 PM, James W. MacDonald wrote: >>> Hi Cliff, >>> >>> On 2/19/2011 12:54 PM, Cliff Yiu wrote: >>>> >>>> Hi, James >>>> >>>> Thanks very much for your response! >>>> >>>> I just tried Picard's MergeSamFiles.jar to sort and merge two bam >>>> files which were generated from two flow cells. >>>> >>>> java -Xmx4g -jar >>>> /home/picard-tools-1.35/picard-tools-1.35/MergeSamFiles.jar >>>> INPUT=/home/FC42TW7AAXX_1.bam INPUT=/home/FC61517AAXX_3.bam >>>> OUTPUT=/home/test.bam TMP_DIR=/home/temp SORT_ORDER=coordinate >>>> VALIDATION_STRINGENCY=LENIENT >>>> >>>> Before I ran this command, I checked the two input bam files by >>>> "samtools view -H" and neither of them has any @RG header lines. I >>>> got these two input bam files by "samtools view -bS input.sam> >>>> input.bam". Does this mean the original sam file doesn't have any @RG >>>> lines? If this is true, even if I use MergeSamFiles.jar, it won't >>>> pass any @RG lines to the merged bam files.. Am I right? >>> >>> Possibly. I have never needed to merge, so can't say for sure. Note >>> that most aligners these days have an argument to add the read group >>> data to the header, so in the future you might want to do that at the >>> alignment stage. >>> >>> But it is clearly not worth it to realign just to add some lines to >>> the header. >>> >>> There is a COMMENT argument to MergeSamFiles that you could probably >>> use to add the @RG header; I am assuming here that this will just add >>> whatever you use as the comment verbatim into the header. Since you >>> can have 0 or more comments, you could just do >>> >>> COMMENT=@RG\tID:MyFirstReadgroup COMMENT=@RG\tID:MySecondReadgroup >>> >>> Here assuming that you don't need quotation marks, and should include >>> tab characters. >>> >>> However, as I mentioned in the previous email, I tend to just cat >>> things together and run back through samtools view. This way I know >>> exactly what I am doing, and it isn't much (if at all) slower than >>> other available methods. >>> >>> Since I don't know *exactly* what is expected for the COMMENTS >>> argument, I can't say for sure the above will work correctly. So you >>> will just have to play around with it until it does what you want. >>> >>>> >>>> Also, MergeSamFiles.jar has an option >>>> "MERGE_SEQUENCE_DICTIONARIES=BooleanMerge". Do I need to add this >>>> into the command? >>> >>> I think the argument value is 'Boolean', not BooleanMerge. Anyway, it >>> appears to me that this would only be necessary if you had used >>> different references to align the various samples that you are >>> merging. Otherwise, the sequence dictionaries should be identical (and >>> it's not necessary to do any merging). >>> >>> Best, >>> >>> Jim >>> >>> >>>> >>>> Thanks >>>> >>>> -C >>>> >>>>> Date: Sat, 19 Feb 2011 10:56:22 -0500 From: jm...@me... >>>>> To: mm...@ho... CC: kr...@sa...; >>>>> al...@br...; sam...@li... >>>>> Subject: Re: [Samtools-help] Error "RG ID on SAMRecord not found in >>>>> header" from Picard's MarkDuplicates.jar >>>>> >>>>> Hi Cliff, >>>>> >>>>> On 2/19/2011 10:27 AM, Cliff Yiu wrote: >>>>>> >>>>>> Hi, Keiran >>>>>> >>>>>> I just checked my bam file with your command and got lots of >>>>>> lines as >>>>>> >>>>>> @SQ SN:chr10 LN:135534747 @SQ SN:chr11 LN:135006516 @SQ >>>>>> SN:chr11_gl000202_random LN:40103 @SQ SN:chr12 LN:133851895 >>>>>> >>>>>> There is no @RG line.. What kind of output should I expect to see >>>>>> if there are @RG header lines? One @SQ line following by an @RG >>>>>> line? >>>>> >>>>> No. You should have a @SQ line per contig, followed by the @RG >>>>> lines for each read group you have in the bam file. >>>>>> >>>>>> Also, when I merged all the lanes or different flowcells, I used >>>>>> "merge -r" . The "-r" option is to attach an RG tag to each >>>>>> alignment based on SAMtools manual... Why don't I get any @RG >>>>>> header lines.? >>>>> >>>>> Because it doesn't do that. As stated in the help: >>>>> >>>>> Note: Samtools' merge does not reconstruct the @RG dictionary in >>>>> the header. Users must provide the correct header with -h, or uses >>>>> Picard which properly maintains the header dictionary in merging. >>>>> >>>>> So you could either use Picard's MergeSamFiles, which will do this >>>>> for you, or you can put the read groups in the header yourself. >>>>> There is samtools reheader, but I have found that to be less than >>>>> useful. Personally I usually just use unix command line tools to do >>>>> the deed. >>>>> >>>>> An easy way is >>>>> >>>>> samtools view -H yourbamfile.bam> header >>>>> >>>>> then use vi or emacs or whatever to add the @RG lines you need. If >>>>> you are not sure what they all are, you can use >>>>> >>>>> samtools view yourbamfile.bam | head >>>>> >>>>> to see which field contains the read groups, then >>>>> >>>>> samtools view yourbamfile.bam | cut -f<thecorrectfield> - | sort | >>>>> uniq >>>>>> thereadgroups >>>>> >>>>> then >>>>> >>>>> (cat header thereadgroups; samtools view yourbamfile;) | samtools >>>>> view -bS -> yourbamfilewithreadgroups.bam >>>>> >>>>> Best, >>>>> >>>>> Jim >>>>> >>>>> >>>>> Do >>>>>> you have an impression that which reader command I can use to add >>>>>> @RG lines? I just went through the SAMtools manual and couldn't >>>>>> figure it out... >>>>>> >>>>>> Thanks >>>>>> >>>>>> -C >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>>> To: al...@br...; mm...@ho... From: >>>>>>> kr...@sa... CC: sam...@li... Date: >>>>>>> Sat, 19 Feb 2011 14:18:35 +0000 Subject: RE: [Samtools-help] >>>>>>> Error "RG ID on SAMRecord not found in header" from Picard's >>>>>>> MarkDuplicates.jar >>>>>>> >>>>>>> Hi, >>>>>>> >>>>>>> I suspect you do not have any @RG header lines. Take a look >>>>>>> using: >>>>>>> >>>>>>> samtools view -H your.bam >>>>>>> >>>>>>> If not I believe there is a reader command you can pass a >>>>>>> valid header and the bam to that can fix this. >>>>>>> >>>>>>> Regards, Keiran >>>>>>> >>>>>>> Keiran Raine >>>>>>> >>>>>>> Senior Computer Biologist The Cancer Genome Project Ext: 7703 >>>>>>> >>>>>>> Sent from my HTC Desire Z >>>>>>> >>>>>>> >>>>>>> -----Original Message----- From: Cliff Yiu >>>>>>> [mm...@ho...] Received: Saturday, 19 Feb 2011, 1:30 To: >>>>>>> al...@br... [al...@br...] CC: >>>>>>> sam...@li... >>>>>>> [sam...@li...] Subject: [Samtools-help] >>>>>>> Error "RG ID on SAMRecord not found in header" from Picard's >>>>>>> MarkDuplicates.jar >>>>>>> >>>>>>> >>>>>>> >>>>>>> Hi >>>>>>> >>>>>>> I am trying the following command: >>>>>>> >>>>>>> java -Xmx4g -jar /home/picard-tools-1.35/MarkDuplicates.jar >>>>>>> INPUT=$dir/input.bam OUTPUT=$dir/output.bam TMP_DIR=/home/temp >>>>>>> METRICS_FILE=PCR_duplicates REMOVE_DUPLICATES=true >>>>>>> ASSUME_SORTED=true VALIDATION_STRINGENCY=LENIENT >>>>>>> >>>>>>> And getting lots of errors like "RG ID on SAMRecord not found >>>>>>> in header". >>>>>>> >>>>>>> The command is still running and I haven't seen output.bam >>>>>>> coming out...> From what I saw previously, it is most likely >>>>>>> that I will end up getting no output when the command stops. >>>>>>> >>>>>>> Anyone knows how to fix the error "RG ID on SAMRecord not found >>>>>>> in header"? >>>>>>> >>>>>>> BTW, I used BWA for reads mapping, then SAMtools for sorting >>>>>>> reads from each Illumina lane and merging all the lanes to the >>>>>>> input.bam. >>>>>>> >>>>>>> I am running commands in a cluster node which has 4GB RAM and >>>>>>> 10Tb space. >>>>>>> >>>>>>> Thanks >>>>>>> >>>>>>> -C >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> -- The Wellcome Trust Sanger Institute is operated by Genome >>>>>>> Research = >>>>>>> >>>>>>> Limited, a charity registered in England with number 1021457 >>>>>>> and a compa= ny registered in England with number 2742969, >>>>>>> whose registered office is 2= 15 Euston Road, London, NW1 2BE. >>>>>>> >>>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> ------------------------------------------------------------------------------ >>>>>> >>>>>> >>>>>> >>>>>> >>>>> >>>>>> >>> The ultimate all-in-one performance toolkit: Intel(R) Parallel >>> Studio XE: >>>>>> Pinpoint memory and threading errors before they happen. Find and >>>>>> fix more than 250 security defects in the development cycle. >>>>>> Locate bottlenecks in serial and parallel code that limit >>>>>> performance. http://p.sf.net/sfu/intel-dev2devfeb >>>>>> >>>>>> >>>>>> >>>>>> _______________________________________________ Samtools-help >>>>>> mailing list Sam...@li... >>>>>> https://lists.sourceforge.net/lists/listinfo/samtools-help >>>>> >>>>> -- James W. MacDonald, M.S. Biostatistician Douglas Lab University >>>>> of Michigan Department of Human Genetics 5912 Buhl 1241 E. >>>>> Catherine St. Ann Arbor MI 48109-5618 734-615-7826 >>>>> ********************************************************** >>>>> Electronic Mail is not secure, may not be read every day, and >>>>> should not be used for urgent or sensitive issues >>>> >>> >> >> ------------------------------------------------------------------------------ >> >> The ultimate all-in-one performance toolkit: Intel(R) Parallel Studio >> XE: >> Pinpoint memory and threading errors before they happen. >> Find and fix more than 250 security defects in the development cycle. >> Locate bottlenecks in serial and parallel code that limit performance. >> http://p.sf.net/sfu/intel-dev2devfeb >> _______________________________________________ >> Samtools-help mailing list >> Sam...@li... >> https://lists.sourceforge.net/lists/listinfo/samtools-help >> >> ------------------------------------------------------------------------------ >> >> The ultimate all-in-one performance toolkit: Intel(R) Parallel Studio >> XE: >> Pinpoint memory and threading errors before they happen. >> Find and fix more than 250 security defects in the development cycle. >> Locate bottlenecks in serial and parallel code that limit performance. >> http://p.sf.net/sfu/intel-dev2devfeb >> _______________________________________________ >> Samtools-help mailing list >> Sam...@li... >> https://lists.sourceforge.net/lists/listinfo/samtools-help >> >> >> |