Re: [Samtools-help] Error "RG ID on SAMRecord not found in header" from Picard's MarkDuplicates.jar

SourceForge Headquarters 1320 Columbia Street Suite 310 San Diego, CA 92101 +1 (858) 422-6466

Hi Tom,

Actually, the code only looks at the LB field of the read group, and not 
the SM field.  I don't think it would make sense, however, to have two 
read groups with the same value for LB but different values for SM.  I 
make that statement somewhat reluctantly because of the creativity with 
which people put stuff into a a flowcell, but I believe it is generally 
true.

-Alec

On 2/21/11 8:44 AM, Tom Blackwell wrote:
>
> No, Richard Varhol's issue - which he correctly identified - was that 
> the sample identifiers as well as the library identifiers need to be 
> the same, in order to mark duplicates across read groups.
>
>                              -  tom blackwell  -
>
> On Sun, 20 Feb 2011, Tadigotla, Vasisht wrote:
>
>> Hi Alec,
>>
>> Just to clarify, Is MarkDuplicates removing duplicate reads (PCR 
>> duplicates) across different readgroups as long as the LB field of 
>> the @RG record is the same? As Richard Varhol pointed out in his 
>> email on Thu, when we modify the bam file so that the readgroup is 
>> the same for all reads, MarkDuplicates seems to find a lot more 
>> duplicate reads (~20% more). Is there a possible explanation?
>>
>>
>> Thanks,
>> Vasisht
>>
>> -----Original Message-----
>> From: Alec Wysoker [mailto:al...@br...]
>> Sent: Sunday, February 20, 2011 12:56 PM
>> To: James W. MacDonald
>> Cc: sam...@li...
>> Subject: Re: [Samtools-help] Error "RG ID on SAMRecord not found in 
>> header" from Picard's MarkDuplicates.jar
>>
>> Hi Folks,
>>
>> The COMMENT argument to MergeSamFiles can't be used to add @RG lines.
>> It adds the (possibly ill-conceived) @CO record.
>>
>> Cliff needs either to add @RG records somewhere along the line before
>> calling MarkDuplicates using Picard ReplaceSamHeader or Samtools
>> reheader.  Ideally the thing to do would be to add them before merging,
>> and then use Picard's MergeSamFiles instead of samtools to merge.
>> However, you could add @RG records on the file that you pass to
>> MarkDuplicates.  You need to know what the read groups IDs are on the
>> SAM records in that file, and create an @RG header record for each
>> different RG value in the SAM records,
>>
>> The importance of @RG records for MarkDuplicates is that it will not
>> consider two alignments as duplicates if they come from different
>> libraries, which is determined by looking at the LB field of the @RG
>> record.  If all your reads are from the same library, then having them
>> all have the same read group will not be a problem.
>>
>> A read group typically corresponds with a single run from an instrument,
>> e.g. an Illumina lane.  If the same library was run on multiple lanes,
>> then having the reads from different lanes having distinct RG IDs allows
>> MarkDuplicates to better calculate the rate of optical duplication (as
>> opposed to molecular duplication), since reads from different lanes can
>> never be optical duplicates of one another.  This is probably not very
>> important for most users.
>>
>> -Alec
>>
>> On 2/19/11 1:41 PM, James W. MacDonald wrote:
>>> Hi Cliff,
>>>
>>> On 2/19/2011 12:54 PM, Cliff Yiu wrote:
>>>>
>>>> Hi, James
>>>>
>>>> Thanks very much for your response!
>>>>
>>>> I just tried Picard's MergeSamFiles.jar to sort and merge two bam
>>>> files which were generated from two flow cells.
>>>>
>>>> java -Xmx4g -jar
>>>> /home/picard-tools-1.35/picard-tools-1.35/MergeSamFiles.jar
>>>> INPUT=/home/FC42TW7AAXX_1.bam INPUT=/home/FC61517AAXX_3.bam
>>>> OUTPUT=/home/test.bam TMP_DIR=/home/temp SORT_ORDER=coordinate
>>>> VALIDATION_STRINGENCY=LENIENT
>>>>
>>>> Before I ran this command, I checked the two input bam files by
>>>> "samtools view -H" and neither of them has any @RG header lines. I
>>>> got these two input bam files by "samtools view -bS input.sam>
>>>> input.bam". Does this mean the original sam file doesn't have any @RG
>>>> lines? If this is true, even if I use MergeSamFiles.jar, it won't
>>>> pass any @RG lines to the merged bam files.. Am I right?
>>>
>>> Possibly. I have never needed to merge, so can't say for sure. Note
>>> that most aligners these days have an argument to add the read group
>>> data to the header, so in the future you might want to do that at the
>>> alignment stage.
>>>
>>> But it is clearly not worth it to realign just to add some lines to
>>> the header.
>>>
>>> There is a COMMENT argument to MergeSamFiles that you could probably
>>> use to add the @RG header; I am assuming here that this will just add
>>> whatever you use as the comment verbatim into the header. Since you
>>> can have 0 or more comments, you could just do
>>>
>>> COMMENT=@RG\tID:MyFirstReadgroup COMMENT=@RG\tID:MySecondReadgroup
>>>
>>> Here assuming that you don't need quotation marks, and should include
>>> tab characters.
>>>
>>> However, as I mentioned in the previous email, I tend to just cat
>>> things together and run back through samtools view. This way I know
>>> exactly what I am doing, and it isn't much (if at all) slower than
>>> other available methods.
>>>
>>> Since I don't know *exactly* what is expected for the COMMENTS
>>> argument, I can't say for sure the above will work correctly. So you
>>> will just have to play around with it until it does what you want.
>>>
>>>>
>>>> Also, MergeSamFiles.jar has an option
>>>> "MERGE_SEQUENCE_DICTIONARIES=BooleanMerge". Do I need to add this
>>>> into the command?
>>>
>>> I think the argument value is 'Boolean', not BooleanMerge. Anyway, it
>>> appears to me that this would only be necessary if you had used
>>> different references to align the various samples that you are
>>> merging. Otherwise, the sequence dictionaries should be identical (and
>>> it's not necessary to do any merging).
>>>
>>> Best,
>>>
>>> Jim
>>>
>>>
>>>>
>>>> Thanks
>>>>
>>>> -C
>>>>
>>>>> Date: Sat, 19 Feb 2011 10:56:22 -0500 From: jm...@me...
>>>>> To: mm...@ho... CC: kr...@sa...;
>>>>> al...@br...; sam...@li...
>>>>> Subject: Re: [Samtools-help] Error "RG ID on SAMRecord not found in
>>>>> header" from Picard's MarkDuplicates.jar
>>>>>
>>>>> Hi Cliff,
>>>>>
>>>>> On 2/19/2011 10:27 AM, Cliff Yiu wrote:
>>>>>>
>>>>>> Hi, Keiran
>>>>>>
>>>>>> I just checked my bam file with your command and got lots of
>>>>>> lines as
>>>>>>
>>>>>> @SQ SN:chr10  LN:135534747 @SQ SN:chr11  LN:135006516 @SQ
>>>>>> SN:chr11_gl000202_random    LN:40103 @SQ SN:chr12  LN:133851895
>>>>>>
>>>>>> There is no @RG line.. What kind of output should I expect to see
>>>>>> if there are @RG header lines? One @SQ line following by an @RG
>>>>>> line?
>>>>>
>>>>> No. You should have a @SQ line per contig, followed by the @RG
>>>>> lines for each read group you have in the bam file.
>>>>>>
>>>>>> Also, when I merged all the lanes or different flowcells, I used
>>>>>> "merge -r" . The "-r" option is to attach an RG tag to each
>>>>>> alignment based on SAMtools manual... Why don't I get any @RG
>>>>>> header lines.?
>>>>>
>>>>> Because it doesn't do that. As stated in the help:
>>>>>
>>>>> Note: Samtools' merge does not reconstruct the @RG dictionary in
>>>>> the header. Users must provide the correct header with -h, or uses
>>>>> Picard which properly maintains the header dictionary in merging.
>>>>>
>>>>> So you could either use Picard's MergeSamFiles, which will do this
>>>>> for you, or you can put the read groups in the header yourself.
>>>>> There is samtools reheader, but I have found that to be less than
>>>>> useful. Personally I usually just use unix command line tools to do
>>>>> the deed.
>>>>>
>>>>> An easy way is
>>>>>
>>>>> samtools view -H yourbamfile.bam>  header
>>>>>
>>>>> then use vi or emacs or whatever to add the @RG lines you need. If
>>>>> you are not sure what they all are, you can use
>>>>>
>>>>> samtools view yourbamfile.bam | head
>>>>>
>>>>> to see which field contains the read groups, then
>>>>>
>>>>> samtools view yourbamfile.bam | cut -f<thecorrectfield>  - | sort |
>>>>> uniq
>>>>>> thereadgroups
>>>>>
>>>>> then
>>>>>
>>>>> (cat header thereadgroups; samtools view yourbamfile;) | samtools
>>>>> view -bS ->  yourbamfilewithreadgroups.bam
>>>>>
>>>>> Best,
>>>>>
>>>>> Jim
>>>>>
>>>>>
>>>>> Do
>>>>>> you have an impression that which reader command I can use to add
>>>>>> @RG lines? I just went through the SAMtools manual and couldn't
>>>>>> figure it out...
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> -C
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>> To: al...@br...; mm...@ho... From:
>>>>>>> kr...@sa... CC: sam...@li... Date:
>>>>>>> Sat, 19 Feb 2011 14:18:35 +0000 Subject: RE: [Samtools-help]
>>>>>>> Error "RG ID on SAMRecord not found in header" from Picard's
>>>>>>> MarkDuplicates.jar
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I suspect you do not have any @RG header lines. Take a look
>>>>>>> using:
>>>>>>>
>>>>>>> samtools view -H your.bam
>>>>>>>
>>>>>>> If not I believe there is a reader command you can pass a
>>>>>>> valid header and the bam to that can fix this.
>>>>>>>
>>>>>>> Regards, Keiran
>>>>>>>
>>>>>>> Keiran Raine
>>>>>>>
>>>>>>> Senior Computer Biologist The Cancer Genome Project Ext: 7703
>>>>>>>
>>>>>>> Sent from my HTC Desire Z
>>>>>>>
>>>>>>>
>>>>>>> -----Original Message----- From: Cliff Yiu
>>>>>>> [mm...@ho...] Received: Saturday, 19 Feb 2011, 1:30 To:
>>>>>>> al...@br... [al...@br...] CC:
>>>>>>> sam...@li...
>>>>>>> [sam...@li...] Subject: [Samtools-help]
>>>>>>> Error "RG ID on SAMRecord not found in header" from Picard's
>>>>>>> MarkDuplicates.jar
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Hi
>>>>>>>
>>>>>>> I am trying the following command:
>>>>>>>
>>>>>>> java -Xmx4g -jar /home/picard-tools-1.35/MarkDuplicates.jar
>>>>>>> INPUT=$dir/input.bam OUTPUT=$dir/output.bam TMP_DIR=/home/temp
>>>>>>> METRICS_FILE=PCR_duplicates REMOVE_DUPLICATES=true
>>>>>>> ASSUME_SORTED=true VALIDATION_STRINGENCY=LENIENT
>>>>>>>
>>>>>>> And getting lots of errors like "RG ID on SAMRecord not found
>>>>>>> in header".
>>>>>>>
>>>>>>> The command is still running and I haven't seen output.bam
>>>>>>> coming out...>   From what I saw previously, it is most likely
>>>>>>> that I will end up getting no output when the command stops.
>>>>>>>
>>>>>>> Anyone knows how to fix the error "RG ID on SAMRecord not found
>>>>>>> in header"?
>>>>>>>
>>>>>>> BTW, I used BWA for reads mapping, then SAMtools for sorting
>>>>>>> reads from each Illumina lane and merging all the lanes to the
>>>>>>> input.bam.
>>>>>>>
>>>>>>> I am running commands in a cluster node which has 4GB RAM and
>>>>>>> 10Tb space.
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> -C
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> -- The Wellcome Trust Sanger Institute is operated by Genome
>>>>>>> Research =
>>>>>>>
>>>>>>> Limited, a charity registered in England with number 1021457
>>>>>>> and a compa= ny registered in England with number 2742969,
>>>>>>> whose registered office is 2= 15 Euston Road, London, NW1 2BE.
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> ------------------------------------------------------------------------------ 
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>>
>>> The ultimate all-in-one performance toolkit: Intel(R) Parallel 
>>> Studio XE:
>>>>>> Pinpoint memory and threading errors before they happen. Find and
>>>>>> fix more than 250 security defects in the development cycle.
>>>>>> Locate bottlenecks in serial and parallel code that limit
>>>>>> performance. http://p.sf.net/sfu/intel-dev2devfeb
>>>>>>
>>>>>>
>>>>>>
>>>>>> _______________________________________________ Samtools-help
>>>>>> mailing list Sam...@li...
>>>>>> https://lists.sourceforge.net/lists/listinfo/samtools-help
>>>>>
>>>>> -- James W. MacDonald, M.S. Biostatistician Douglas Lab University
>>>>> of Michigan Department of Human Genetics 5912 Buhl 1241 E.
>>>>> Catherine St. Ann Arbor MI 48109-5618 734-615-7826
>>>>> **********************************************************
>>>>> Electronic Mail is not secure, may not be read every day, and
>>>>> should not be used for urgent or sensitive issues
>>>>
>>>
>>
>> ------------------------------------------------------------------------------ 
>>
>> The ultimate all-in-one performance toolkit: Intel(R) Parallel Studio 
>> XE:
>> Pinpoint memory and threading errors before they happen.
>> Find and fix more than 250 security defects in the development cycle.
>> Locate bottlenecks in serial and parallel code that limit performance.
>> http://p.sf.net/sfu/intel-dev2devfeb
>> _______________________________________________
>> Samtools-help mailing list
>> Sam...@li...
>> https://lists.sourceforge.net/lists/listinfo/samtools-help
>>
>> ------------------------------------------------------------------------------ 
>>
>> The ultimate all-in-one performance toolkit: Intel(R) Parallel Studio 
>> XE:
>> Pinpoint memory and threading errors before they happen.
>> Find and fix more than 250 security defects in the development cycle.
>> Locate bottlenecks in serial and parallel code that limit performance.
>> http://p.sf.net/sfu/intel-dev2devfeb
>> _______________________________________________
>> Samtools-help mailing list
>> Sam...@li...
>> https://lists.sourceforge.net/lists/listinfo/samtools-help
>>
>>
>>