The current spec says that for unmapped reads MAPQ should be 0 and CIGAR
should be *.
There is already a good use of CIGAR for unmapped reads to hold clip values.
Actually though, this raises a set of related issues. In using CIGAR
for clipping we are in fact
using it for two sorts of things: (1) what should be aligned, (2) what
actually is aligned.
Information of type (1) remains valid even when the read is unmapped.
Another
example of this distinction is more subtle. When GATK or DINDEL write
back a BAM that
has been locally realigned, they need to decide whether particular
insertions and deletions in
reads are real, i.e. in the originating sample from which the read was
obtained, or sequencing
errors, just present in this read. We know that there are sequencing
indel errors, albeit at a
low level. DINDEL in particular has a non-trivial model of read errors,
and uses a Bayesian
approach to decide how likely any indel is to be in each category.
However, there is no way
to transmit this information in the CIGAR string. DINDEL can't write
back a re-aligned
BAM file and indicate which insertions it thinks are real, and which are
sequencing errors.
Similarly I guess there are read error correction tools used by
assemblers, but these would
not be able to write back to SAM saying which mismatches should be more
trusted. I guess,
as with recalibrated qualities, we could write the original read into a
tag, and the read to be
aligned into the SEQ, QUAL and CIGAR fields. Hmmmmm. Food for thought.
My proposed assignment of meaning to non-zero MAPQ0 values for unmapped
reads is much
more straightforward. The more I think about it the more natural it is
to encode there in the
normal units -10 log_10 probability of the read being incorrectly
unmapped, i.e. the chance
that it should be mapped to the current reference. 0 is a perfectly
good and harmless default
for this, indicating no information.
Richard
On 15/03/2010 12:44, Keiran Raine wrote:
> Hi
>
> The edits I suggested were to highlight that the spec was incorrect
> for the way that it is being used, even by those who wrote it.
>
> I acknowledge you points below, most of which were discussed over a
> week ago, however, I have no control over the documentation, all I can
> do is make suggestions, if you have an alternative suggestion for how
> this should be documented I recommend that you post it to the list and
> we can hope that it makes it into the published documentation.
>
> Regards,
>
> Keiran
>
> The message that came through today was one from 5th March that had
> been stuck in moderation due to size issues (I removed the remaining
> thread and reposted).
>
>
> On 15 Mar 2010, at 15:30, Goncalo Abecasis wrote:
>
>
>> Hi Keiran,
>>
>> I think it would be helpful that we shoud be a bit more explicit as
>> to how
>> RNAME, POS, MAPQ and CIGAR should be interpreted for unmapped reads.
>>
>> Specifically, we should say that RNAME and POS can be set so that
>> unmapped reads would be sorted close to their mates, which is
>> usually convenient for processing.
>>
>> I don't think we have a meaning for MAPQ for unmapped reads, and
>> there were some differences of opinion. Perhaps one way of looking
>> at it is that, for current implementations, MAPQ really should
>> should be set as zero -- although it may have other uses in the
>> future. You could write in the spec that MAPQ for unmapped reads is
>> undefined and "could be set to 0".
>>
>> Finally, a similar wording might apply to the CIGAR string (if we
>> want it to represent clipping, we should say so in the spec and
>> request that downstream processors -- e.g. mappers that use BAM as
>> input -- respect it).
>>
>> Goncalo
>>
>> On Fri, 5 Mar 2010, Keiran Raine wrote:
>>
>>
>>> Hi,
>>>
>>> I don't think anyone objects to the specification being changed for
>>> the reasons that are being discussed. The problem is that the
>>> specification specifically states
>>>
>>>
>>> 1. QNAME and FLAG are required for all alignments. If the mapping
>>> position of the query is not available, RNAME and
>>> CIGAR are set as “*”, and POS and MAPQ as 0. If the query is
>>> unpaired or pairing information is not available, MRNM
>>> equals “*”, and MPOS and ISIZE equal 0. SEQ and QUAL can both be
>>> absent, represented as a star “*”. If QUAL is
>>> not a star, it must be of the same length as SEQ.
>>>
>>> Heng has stated that this is not the case, so unless we want to
>>> continually have the same queries posed to the devel and help lists
>>> the documentation should be updated to state
>>>
>>> 1. QNAME and FLAG are required for all alignments. If the mapping
>>> position of the query is not available, RNAME and
>>> CIGAR may be set as “*”, and POS and MAPQ as 0. If the query is
>>> unpaired or pairing information is not available, MRNM
>>> equals “*”, and MPOS and ISIZE equal 0. SEQ and QUAL can both be
>>> absent, represented as a star “*”. If QUAL is
>>> not a star, it must be of the same length as SEQ. The flag field is
>>> the only absolute method to determine if a read is unmapped.
>>>
>>> Regards,
>>>
>>> Keiran
>>>
>>>
>>> On 5 Mar 2010, at 17:46, Richard Durbin wrote:
>>>
>>>
>>>> It seems to me clear that we need to support giving a soft quality
>>>> clip
>>>> in the BAM record for an unmapped read.
>>>> I don't understand the problem in doing this in the CIGAR string.
>>>> As I
>>>> see it, this is not re-use for a different purpose.
>>>> It is perhaps important for people to know that bwa implements a
>>>> quality
>>>> clipping strategy and records it as
>>>> a soft clip in the CIGAR, in both mapped and I presume unmapped
>>>> reads(*). This is what will be in the next
>>>> 1000 Genomes bam file release. Surely it is good for the clip
>>>> information to be stored consistently independent
>>>> of whether the reads are mapped or not. For an mapped read, the
>>>> clip
>>>> info in the CIGAR indicates what part of
>>>> the read was aligned. For an unmapped read the clip info in the
>>>> CIGAR
>>>> should indicate what part of the read
>>>> could have been aligned, or was attempted to be aligned.
>>>> Furthermore, recording quality clipping in this way is consistent
>>>> with
>>>> modularisation and using different tools
>>>> for clipping and mapping. If the material beyond some point
>>>> should not
>>>> be used, then it shouldn't be aligned.
>>>> i.e. the same soft clip should be in the same place if the read is
>>>> subsequently aligned. There is lots of discussion
>>>> about moving away from fastq for unmapped data and using BAM for
>>>> it, as
>>>> a format able to support richer
>>>> representation and markup, and I believe that some mappers can
>>>> already
>>>> read BAM and output a new BAM.
>>>> In my view the goal should be for these to respect any existing
>>>> clipping
>>>> information in CIGAR field in the input.
>>>> Richard
>>>> (*) I haven't checked this statement about recording the clip
>>>> point for
>>>> unmapped reads, but the behaviour I describe
>>>> seems to me the logical one.
>>>> On 05/03/2010 11:44, Tim Fennell wrote:
>>>>
>>>>> We never actually finalized that that was how we should store
>>>>> this -
>>>>> and I'm the one who started/requested this feature. In fact, I
>>>>> particularly don't like the solution of using the CIGAR field to
>>>>> store
>>>>> this - though I think the idea of using the CIGAR format in an
>>>>> optional tag would work well. If we reuse the CIGAR string, and
>>>>> the
>>>>> information is lost once the read is aligned as we have no way to
>>>>> know
>>>>> if any post-alignment clipping was due to adapter sequences,
>>>>> quality
>>>>> or the aligner's own decision during pairwise alignment of the
>>>>> read.
>>>>> -t
>>>>> On Mar 5, 2010, at 11:12 AM, Mark A. DePristo wrote:
>>>>>
>>>>>> One reason a CIGAR was added for unmapped reads was to allow us to
>>>>>> represent soft clipping in unmapped reads, in the situation
>>>>>> where you
>>>>>> have done a lane wide clip of all reads. Not sure what MAPQ
>>>>>> would be
>>>>>> good for, though.
>>>>>> On Mar 5, 2010, at 11:09 AM, Bob Handsaker wrote:
>>>>>>
>>>>>>> What interpretation should tools give to the mapQ or cigar
>>>>>>> values on
>>>>>>> unmapped reads?
>>>>>>> If the answer is "they should ignore them", then I think it's
>>>>>>> bad to
>>>>>>> encourage including random data values in sam files.
>>>>>>> -Bob
>>>>>>> On 3/5/10 10:40 AM, Heng Li wrote:
>>>>>>>
>>>>>>>> On Fri, Mar 05, 2010 at 03:03:18PM +0000, Keiran Raine wrote:
>>>>>>>>
>>>>>>>>> If that is the case the published specification should be
>>>>>>>>> changed...
>>>>>>>>> Under the current published specification for an unmapped
>>>>>>>>> read the
>>>>>>>>> values should be
>>>>>>>>> RNAME = *
>>>>>>>>> POS = 0
>>>>>>>>> MAPQ = 0
>>>>>>>>> CIGAR = *
>>>>>>>>>
>>>>>>>> I think in spec when I wrote that paragraph I really meant
>>>>>>>> "may be
>>>>>>>> set
>>>>>>>> as...". There is almost no requirement on all fields for an
>>>>>>>> unmapped
>>>>>>>> read except that length(SEQ)==length(QUAL) if both present.
>>>>>>>>
>>>>>>>>> I cannot however see what value there is in not adhering to the
>>>>>>>>> specification for MAPQ or CIGAR. CIGAR not being set to '*'
>>>>>>>>> for an
>>>>>>>>> unmapped read causes confusion for those new to use of BWA as
>>>>>>>>> if
>>>>>>>>> you
>>>>>>>>> are not used to working with bitwise flags it makes complete
>>>>>>>>> sense
>>>>>>>>> to follow the specification and consider reads with '*'
>>>>>>>>> unmapped
>>>>>>>>> and
>>>>>>>>> not when set to a none '*' value.
>>>>>>>>>
>>>>>>>> Flag is the only reliable way to tell if a read is mapped or
>>>>>>>> not. I
>>>>>>>> do
>>>>>>>> not see the need of imposing two redundant requirements.
>>>>>>>>
>>>>>>>>> In addition, these files are large, and getting larger.
>>>>>>>>> Space can
>>>>>>>>> be saved by setting these values as described in the spec.
>>>>>>>>>
>>>>>>>> I am sure for most real-world alignments, changing CIGAR and
>>>>>>>> MAPQ
>>>>>>>> will
>>>>>>>> only make file<0.1% smaller.
>>>>>>>> I still think we should remove unnecessary redundant
>>>>>>>> requirements
>>>>>>>> in SAM
>>>>>>>> validation.
>>>>>>>> Heng
>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Keiran Raine
>>>>>>>>> Senior Computer Biologist
>>>>>>>>> The Cancer Genome Project
>>>>>>>>> Ext: 2100
>>>>>>>>> kr2@...
>>>>>>>>> The Wellcome Trust Sanger Institute is operated by Genome
>>>>>>>>> Research
>>>>>>>>> Limited, a charity registered in England with number 1021457
>>>>>>>>> and a
>>>>>>>>> company registered in England with number 2742969, whose
>>>>>>>>> registered
>>>>>>>>> office is 215 Euston Road, London, NW1 2BE.
>>>>>>>>> On 5 Mar 2010, at 14:32, Heng Li wrote:
>>>>>>>>>
>>>>>>>>>> Can Picard remove the requirement of zero mapping quality and
>>>>>>>>>> empty
>>>>>>>>>> CIGAR for unmapped reads? I do not see how a SAM file is
>>>>>>>>>> invalid
>>>>>>>>>> due to
>>>>>>>>>> this and how this may affect downstream analyses in any way.
>>>>>>>>>> Thanks,
>>>>>>>>>> Heng
>>>>>>>>>> On Fri, Mar 05, 2010 at 10:59:56AM -0000, John Marshall wrote:
>>>>>>>>>>
>>>>>>>>>>>> On 5 Mar 2010, at 09:56, Kevin Lam wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi I am not too sure what to make of this.
>>>>>>>>>>>>> the sam file was from bwa
>>>>>>>>>>>>> is it a bug in bwa output?
>>>>>>>>>>>>> using bwa-0.5.6
>>>>>>>>>>>>>
>>>>>>>>>>> [...]
>>>>>>>>>>>
>>>>>>>>>>>>> Exception in thread "main"
>>>>>>>>>>>>> net.sf.samtools.SAMFormatException: Error
>>>>>>>>>>>>> parsing text SAM file. MAPQ must should be 0 for unmapped
>>>>>>>>>>>>> read.;
>>>>>>>>>>>>> File sorted.sam; Line 8910023
>>>>>>>>>>>>> Line: ./S2:747_1696_219 4 chr18 90771994 25 48M * 0 0
>>>>>>>>>>>>> AGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGTTTGGGGCTT
>>>>>>>>>>>>> ]]]]]]PLQ]]XW[]QA6H]]VI+3]M9FSIFG@... XT:A:U
>>>>>>>>>>>>> CM:i:1 XN:i:
>>>>>>>>>>>>> 10 X0:i:1 X1:i:0 XM:i:4 XO:i:0 XG:i:0 MD:Z:40C7
>>>>>>>>>>>>>
>>>>>>>>>>> You don't say what creature this is or how long its chr18
>>>>>>>>>>> sequence is, but
>>>>>>>>>>> this looks a lot like our favourite bwa "feature" in which
>>>>>>>>>>> spurious
>>>>>>>>>>> mappings bridging adjacent reference sequences are marked as
>>>>>>>>>>> unmapped but
>>>>>>>>>>> otherwise left intact (see the FAQ at http://bio-
>>>>>>>>>>> bwa.sourceforge.net/ ).
>>>>>>>>>>> It would appear that bwa's leaving such non-mappings with a
>>>>>>>>>>> non-
>>>>>>>>>>> zero MAPQ
>>>>>>>>>>> and non-empty CIGAR conflicts with Picard's validity
>>>>>>>>>>> checking.
>>>>>>>>>>> (And other
>>>>>>>>>>> non-zero/empty fields are not great either, but those are the
>>>>>>>>>>> main ones
>>>>>>>>>>> that Picard currently checks.)
>>>>>>>>>>> Depending on your reference, this may or may not explain your
>>>>>>>>>>> particular
>>>>>>>>>>> problem. Is your chr18 reference about 90772000 bases?
>>>>>>>>>>> (Also: picard typo :-))
>>>>>>>>>>> Cheers,
>>>>>>>>>>> John
>>>>>>>>>>> --
>>>>>>>>>>> The Wellcome Trust Sanger Institute is operated by Genome
>>>>>>>>>>> Research
>>>>>>>>>>> Limited, a charity registered in England with number 1021457
>>>>>>>>>>> and a
>>>>>>>>>>> company registered in England with number 2742969, whose
>>>>>>>>>>> registered
>>>>>>>>>>> office is 215 Euston Road, London, NW1 2BE.
>>>>>>>>>>> ------------------------------------------------------------------------------
>>>>>>>>>>> Download Intel® Parallel Studio Eval
>>>>>>>>>>> Try the new software tools for yourself. Speed compiling,
>>>>>>>>>>> find
>>>>>>>>>>> bugs
>>>>>>>>>>> proactively, and fine-tune applications for parallel
>>>>>>>>>>> performance.
>>>>>>>>>>> See why Intel Parallel Studio got high marks during beta.
>>>>>>>>>>> http://p.sf.net/sfu/intel-sw-dev
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> Bio-bwa-help mailing list
>>>>>>>>>>> Bio-bwa-help@...
>>>>>>>>>>> https://lists.sourceforge.net/lists/listinfo/bio-bwa-help
>>>>>>>>>>>
>>>>>>> ------------------------------------------------------------------------------
>>>>>>> Download Intel® Parallel Studio Eval
>>>>>>> Try the new software tools for yourself. Speed compiling, find
>>>>>>> bugs
>>>>>>> proactively, and fine-tune applications for parallel performance.
>>>>>>> See why Intel Parallel Studio got high marks during beta.
>>>>>>> http://p.sf.net/sfu/intel-sw-dev
>>>>>>> _______________________________________________
>>>>>>> Samtools-devel mailing list
>>>>>>> Samtools-devel@...
>>>>>>> https://lists.sourceforge.net/lists/listinfo/samtools-devel
>>>>>>>
>>>>>> Mark A. DePristo, Ph.D.
>>>>>> Manager, Medical and Population Genetics Analysis
>>>>>> Broad Institute of Harvard and MIT
>>>>>> depristo@...
>>>>>> mark@...
>>>>>> ------------------------------------------------------------------------------
>>>>>> Download Intel® Parallel Studio Eval
>>>>>> Try the new software tools for yourself. Speed compiling, find
>>>>>> bugs
>>>>>> proactively, and fine-tune applications for parallel performance.
>>>>>> See why Intel Parallel Studio got high marks during beta.
>>>>>> http://p.sf.net/sfu/intel-sw-dev
>>>>>> _______________________________________________
>>>>>> Samtools-devel mailing list
>>>>>> Samtools-devel@...
>>>>>> https://lists.sourceforge.net/lists/listinfo/samtools-devel
>>>>>>
>>>>> ------------------------------------------------------------------------------
>>>>> Download Intel® Parallel Studio Eval
>>>>> Try the new software tools for yourself. Speed compiling, find bugs
>>>>> proactively, and fine-tune applications for parallel performance.
>>>>> See why Intel Parallel Studio got high marks during beta.
>>>>> http://p.sf.net/sfu/intel-sw-dev
>>>>> _______________________________________________
>>>>> Samtools-devel mailing list
>>>>> Samtools-devel@...
>>>>> https://lists.sourceforge.net/lists/listinfo/samtools-devel
>>>>>
>>>> --
>>>> The Wellcome Trust Sanger Institute is operated by Genome Research
>>>> Limited, a charity registered in England with number 1021457 and a
>>>> company registered in England with number 2742969, whose registered
>>>> office is 215 Euston Road, London, NW1 2BE.
>>>> ------------------------------------------------------------------------------
>>>> Download Intel® Parallel Studio Eval
>>>> Try the new software tools for yourself. Speed compiling, find bugs
>>>> proactively, and fine-tune applications for parallel performance.
>>>> See why Intel Parallel Studio got high marks during beta.
>>>> http://p.sf.net/sfu/intel-sw-dev
>>>> _______________________________________________
>>>> Samtools-devel mailing list
>>>> Samtools-devel@...
>>>> https://lists.sourceforge.net/lists/listinfo/samtools-devel
>>>>
>>>
>>>
>>>
>>> --
>>> The Wellcome Trust Sanger Institute is operated by Genome Research
>>> Limited, a charity registered in England with number 1021457 and
>>> acompa
>>> ny registered in England with number 2742969, whose
>>> registeredoffice is 2
>>> 15 Euston Road, London, NW1 2BE.
>>>
>>>
>
>
>
--
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.
|