Hi,
I was wondering if there were details on the internal evolutionary model used by the alignment(match/substitution/gap) scores for BBMap. I know that standard aligners tend to use an evolutionary model based on Chapter 2.2 of Durbin et al's book.
To quickly summarize here, they determine the score for a match or mismatch pair as:
s(a,b) = log (p_ab / q_a q_b)
where s(a,b) is the score for two base pairs a and b and p_ab is the probability of observing them together if they shared common ancestry and q_a and q_b is the probability of observing the base pairs at random.
BBMap states that it assigns a score of +100 for each match and -127 for each mismatch. (Let's ignore gaps for ease of exposition). Thus, this implies that
s(a,a) = 100 = log(p_aa) - log(q_a q_a)
s(a,b) = -127 = log(p_ab) - log(q_a q_b) if a \neq b
It seems reasonable to assume q_a = q_b or at least very close, so we get:
s(a,a) - s(a,b) = 227 = log(p_aa/p_ab)
e^(227) = p_aa/p_ab
When I saw this number it shocked me because this implies that if we assume independence between base pairs in our evolutionary model (which I believe is implied by the additive nature of matches and mismatches in the alignment scores), then a perfectly aligned read is said to be e^227 times MORE likely than a read that is hamming distance 1 away from the reference irrespective of length.
This does not make intuitive sense to me, and I was wondering if you could shed some light as to what the underlying evolutionary model that is being used is, or if any of my calculations are incorrect.
Thanks!
Jeffrey