From: Alexey L. <ale...@ho...> - 2013-10-19 21:56:23
|
Hi Nancy, Thank you very much for the detailed reply. I am relieved to hear that Lander-Waterman equation has nothing to do with the library size estimate :) I have some practical experience with PCR; this is why I like your suggestion, that in fact we have a sampling from a library X by N elements, where X is the number of unique elements and N is the number of repetitions of each unique element. Not sure yet how the Picard's equation can be derived from this. However, I am more inclined to believe in it now, when I know that it is based on a proper assumption. I also agree that complexity of library preparation procedures, which could include a couple rounds of capture and a couple of PCRs, raises a question about the accuracy of assumptions and estimates... Of course, providing the estimate was correct, this estimate would be very useful for scaling-up experiments. I hope that development of the new PCR-free library preparation kits may render this issue less practical :) Thanks again, Alexey -----Original Message----- From: Hansen, Nancy (NIH/NHGRI) [E] [mailto:nh...@ma...] Sent: 17 October 2013 16:18 To: Alexey Larionov; sam...@li... Cc: al...@ca... Subject: Re: [Samtools-help] Pickard estimate for the size of a library - wrong or non-transparent? Hi Alexey, I didn't see an answer to your question sent to the list, so in case others are interested as well: You are right that claiming that Picard's equation for library size is "based on the Lander-Waterman equation" is highly misleading. The two cases are mathematically similar, but the sampling problems they describe are vastly different. In the case of the Picard library size estimate, it looks as if the assumption being made is that in sequencing a library, we are choosing (with replacement) a random fragment from a library of X unique fragments N times, and then using C, the number of unique fragments we've observed, to estimate X, the total number of fragments in the population from which we are sampling. Given all of the assumptions made (see next paragraph), the equation used by Picard is valid enough (see http://math.stackexchange.com/questions/32800/probability-distribution-of-co verage-of-a-set-after-x-independently-randomly for a description of the statistics). But when we sequence a library using typical next generation protocols, there is much more happening behind the scenes. Very importantly, there is an initial step in which the library is amplified, so that rather than having one copy of each unique fragment, we have a distribution of multiplicities for our original fragments. Once a fragment is multiplied, it is not only more likely to be selected for sequencing: it is also more likely to be amplified again during the amplification step! In addition, different capture efficiencies will also affect our distribution of fragments (much to the dismay of developers of copy number calling software ;-)) The true statistical analysis gets very complicated very quickly, so I wouldn't even want to hazard a guess as to how accurate the Picard equation is in practice. Still, I think the equation is good enough for a rough estimate and useful for analyses of return on investment in sequencing, since if we have the ideal situation (equal representation of all the desired fragments in our library), the equation gives us a good picture of how much of the library we've sampled so far with our sequencing. This library size estimate can be highly valuable in deciding whether sequencing more of the library is likely to yield more unique information, or simply duplicate information, so it's a very nice thing that it's including in the Picard package. Regards, --Nancy -- ************************************* Nancy F. Hansen, PhD nh...@nh... Comparative Genomics Unit, NHGRI 5625 Fishers Lane Rockville, MD 20852 Phone: (301) 435-1560 Fax: (301) 435-6170 From: Alexey Larionov <ale...@ho...<mailto:ale...@ho...>> Date: Sun, 13 Oct 2013 00:21:29 +0100 To: <sam...@li...<mailto:sam...@li...urceforge. net>> Cc: <al...@ca...<mailto:al...@ca...>> Subject: [Samtools-help] Pickard estimate for the size of a library - wrong or non-transparent? Dear list, EstimateLibraryComplexity Picard tool attempts to calculate number of unique molecules in "the whole library" basing on the Lander-Waterman equation. This is how it is described on Picard's manual: "Estimates the size of a library based on the number of paired end molecules observed and the number of unique pairs observed. Based on the Lander-Waterman equation that states: C/X = 1 - exp( -N/X ) where X = number of distinct molecules in library N = number of read pairs C = number of distinct fragments observed in read pairs" In fact the Lander-Waterman equation estimates probability of genome coverage basing on size of genome, number of fragments and length of each fragment: E = 1 - exp(-NL/G) where N is number of fragments, L is size of fragment and G is size of genome (from Wikipedia) It looks to me that the Picard's interpretation of the equation requires more explanation than is given. My initial instinct is that the interpretation is merely wrong, for it does not account (i) for the size of genome and (ii) for the fact that only a fraction of actual library was captured in the experiment. I have seen that question about the way how Picard calculates the library size comes again and again in this list. A typical reply is that the algorithm is openly available and that it is based on some complex statistics that is beyond a biologically educated person. Could you please explain for statistically educated people how exactly you derived your interpretation of the Lander-Waterman equation and give a link to the relevant paper(s). Please, I do not mean link to the seminal paper of Lander-Waterman, but any paper explaining how exactly the interpretation was derived. Thank you, Alexey ---------------------------------------------------------------------------- -- October Webinars: Code for Performance Free Intel webinars can help you accelerate application performance. Explore tips for MPI, OpenMP, advanced profiling, and more. Get the most from the latest Intel processors and coprocessors. See abstracts and register > http://pubads.g.doubleclick.net/gampad/clk?id=60134071&iu=/4140/ostg.clktrk_ ______________________________________________ Samtools-help mailing list Sam...@li...<mailto:Sam...@li...urceforge.n et> https://lists.sourceforge.net/lists/listinfo/samtools-help |