When using the Maq cns2cssnp command, the rules applied for calling single SNPs in color space are incorrect: too strict in one sense and not strict enough in another.
The current functionality discards many, many putative SNPs, and simultaneously allows a vaery small minority of incorrect SNPs through.
An excellent reference on color space SNP calling rules may be found here: http://marketing.appliedbiosystems.com/mk/submit/SOLID_KNOWLEDGE_RD?_JS=T&rd=ex
Specifically: in aux_utils.c (0.6.6) at line 120-121:
if (r != 15 && r != 15 && b != 15 && b != 15 && r != b && r != b
&& (b&r) == (b&r))
The final constraint (b&r) == (b&r) is too strict. What this says is that in order to be considered a candidate SNP, the bitwise and of colors at the first and second positions must be the same, e.g.
11 -> 22 1&2 == 1&2
48 -> 84 4&8 == 8&4
but NOT: 12 -> 48 1&4 != 2&8
However, that last example IS a permissable pair of color changes according to ABIs base transition color codes (see whitepaper linked above)...
This can be corrected by removing (b&r) == (b&r) from line 121 of aux_utils.
In this case, an additional constraint is needed further down to disallow these cases:
11 -> 23 1&2 == 1&3
12 -> 23 1&2 == 2&3
This can be accomplished by adding following constraint at line 127
(nst_nt16_count_table[b] == nst_nt16_count_table[b])
requiring both of the consensus positions to contain the same level of degeneracy (1 or 2 alleles, not one of each).
Also, since the consensus sequence often contains "degenerate codes" mapped to 4 bit codes with two bits "on" (corresponding to two alleles), additional verification is required (at some point) to ensure that the reads leading to the degenerate consensus comply with the color transition rules. For example:
11 -> 33 seems OK, since 11 in the reference is assumed to be 11 and 22 in the underlying reads...
And 12 -> 33 is assumed to be 12 and 21 in the reads.
But the degenerate codes "33" do not actually contain enough information to validate the Assumptions above.
For that, we must return to the reads mapping to a given position to ensure that they are the allowed color combinations. Problems with this should be infrequent, but high error rates and low coverage will conspire to occasionally create these situations.
Thanks for making a great toolset here, I understand that I may be amoung the first to try to make it fully work with SOLiD data.
vsi [at] u.washington.edu