From: Alexey L. <ale...@ho...> - 2013-10-12 23:21:43
Dear list,

The Picard tool EstimateLibraryComplexity attempts to calculate the number of unique molecules in "the whole library" based on the Lander-Waterman equation. This is how it is described in the Picard manual:

"Estimates the size of a library based on the number of paired end molecules observed and the number of unique pairs observed. Based on the Lander-Waterman equation that states:

    C/X = 1 - exp( -N/X )

where
    X = number of distinct molecules in library
    N = number of read pairs
    C = number of distinct fragments observed in read pairs"

In fact, the Lander-Waterman equation estimates the probability of genome coverage based on the size of the genome, the number of fragments and the length of each fragment:

    E = 1 - exp( -NL/G )

where N is the number of fragments, L is the length of a fragment and G is the size of the genome (from Wikipedia).

It seems to me that Picard's interpretation of the equation requires more explanation than is given. My initial instinct is that the interpretation is simply wrong, because it accounts neither (i) for the size of the genome nor (ii) for the fact that only a fraction of the actual library was captured in the experiment.

I have seen the question of how Picard calculates the library size come up again and again on this list. A typical reply is that the algorithm is openly available and that it is based on complex statistics that are beyond a biologically educated person.

Could you please explain, for statistically educated people, how exactly you derived your interpretation of the Lander-Waterman equation, and give a link to the relevant paper(s)? Please note that I do not mean a link to the seminal Lander-Waterman paper, but to any paper explaining how exactly this interpretation was derived.

Thank you,
Alexey
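
For concreteness, the equation quoted from the manual, C/X = 1 - exp( -N/X ), has no closed-form solution for X but can be solved numerically. The following is only a minimal sketch under that assumption; it is not Picard's actual code, and the function names (distinct_observed, estimate_library_size) and the bisection approach are chosen here purely for illustration.

```python
import math


def distinct_observed(x, n):
    """Distinct molecules C predicted by the quoted formula, C = X * (1 - exp(-N/X)).

    x: number of distinct molecules in the library (X)
    n: number of read pairs (N)
    """
    # -expm1(-t) equals 1 - exp(-t) but stays accurate when t = n/x is tiny.
    return x * -math.expm1(-n / x)


def estimate_library_size(n, c, tol=1e-6):
    """Solve C/X = 1 - exp(-N/X) for X by bisection (illustrative only).

    n: number of read pairs (N)
    c: number of distinct fragments observed (C); must satisfy 0 < c < n
    """
    if not 0 < c < n:
        raise ValueError("need 0 < c < n, i.e. at least one duplicate pair")

    # The predicted C grows monotonically with X and approaches N from below,
    # so the root lies above X = C. Double the upper bound until it brackets it.
    lo = hi = float(c)
    while distinct_observed(hi, n) < c:
        hi *= 2.0

    # Plain bisection down to a relative bracket width of tol.
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if distinct_observed(mid, n) < c:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    # Round trip with made-up numbers: pick X, compute C, then recover X.
    true_x = 5_000_000
    n_pairs = 1_000_000
    c_seen = distinct_observed(true_x, n_pairs)
    print(estimate_library_size(n_pairs, c_seen))
```

Note that only N and C enter this calculation; neither the genome size G nor the fragment length L appears anywhere, which is exactly the point raised in (i) above.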