## samtools-help

 [Samtools-help] Pickard estimate for the size of a library - wrong or non-transparent? From: Alexey Larionov - 2013-10-12 23:21:43 Attachments: Message as HTML ```Dear list, EstimateLibraryComplexity Picard tool attempts to calculate number of unique molecules in "the whole library" basing on the Lander-Waterman equation. This is how it is described on Picard's manual: "Estimates the size of a library based on the number of paired end molecules observed and the number of unique pairs observed. Based on the Lander-Waterman equation that states: C/X = 1 - exp( -N/X ) where X = number of distinct molecules in library N = number of read pairs C = number of distinct fragments observed in read pairs" In fact the Lander-Waterman equation estimates probability of genome coverage basing on size of genome, number of fragments and length of each fragment: E = 1 - exp(-NL/G) where N is number of fragments, L is size of fragment and G is size of genome (from Wikipedia) It looks to me that the Picard's interpretation of the equation requires more explanation than is given. My initial instinct is that the interpretation is merely wrong, for it does not account (i) for the size of genome and (ii) for the fact that only a fraction of actual library was captured in the experiment. I have seen that question about the way how Picard calculates the library size comes again and again in this list. A typical reply is that the algorithm is openly available and that it is based on some complex statistics that is beyond a biologically educated person. Could you please explain for statistically educated people how exactly you derived your interpretation of the Lander-Waterman equation and give a link to the relevant paper(s). Please, I do not mean link to the seminal paper of Lander-Waterman, but any paper explaining how exactly the interpretation was derived. Thank you, Alexey ```
 Re: [Samtools-help] Pickard estimate for the size of a library - wrong or non-transparent? From: Hansen, Nancy (NIH/NHGRI) [E] - 2013-10-17 15:18:28 ```Hi Alexey, I didn't see an answer to your question sent to the list, so in case others are interested as well: You are right that claiming that Picard's equation for library size is "based on the Lander-Waterman equation" is highly misleading. The two cases are mathematically similar, but the sampling problems they describe are vastly different. In the case of the Picard library size estimate, it looks as if the assumption being made is that in sequencing a library, we are choosing (with replacement) a random fragment from a library of X unique fragments N times, and then using C, the number of unique fragments we've observed, to estimate X, the total number of fragments in the population from which we are sampling. Given all of the assumptions made (see next paragraph), the equation used by Picard is valid enough (see http://math.stackexchange.com/questions/32800/probability-distribution-of-coverage-of-a-set-after-x-independently-randomly for a description of the statistics). But when we sequence a library using typical next generation protocols, there is much more happening behind the scenes. Very importantly, there is an initial step in which the library is amplified, so that rather than having one copy of each unique fragment, we have a distribution of multiplicities for our original fragments. Once a fragment is multiplied, it is not only more likely to be selected for sequencing: it is also more likely to be amplified again during the amplification step! In addition, different capture efficiencies will also affect our distribution of fragments (much to the dismay of developers of copy number calling software ;-)) The true statistical analysis gets very complicated very quickly, so I wouldn't even want to hazard a guess as to how accurate the Picard equation is in practice. Still, I think the equation is good enough for a rough estimate and useful for analyses of return on investment in sequencing, since if we have the ideal situation (equal representation of all the desired fragments in our library), the equation gives us a good picture of how much of the library we've sampled so far with our sequencing. This library size estimate can be highly valuable in deciding whether sequencing more of the library is likely to yield more unique information, or simply duplicate information, so it's a very nice thing that it's including in the Picard package. Regards, --Nancy -- ************************************* Nancy F. Hansen, PhD nhansen@... Comparative Genomics Unit, NHGRI 5625 Fishers Lane Rockville, MD 20852 Phone: (301) 435-1560 Fax: (301) 435-6170 From: Alexey Larionov > Date: Sun, 13 Oct 2013 00:21:29 +0100 To: > Cc: > Subject: [Samtools-help] Pickard estimate for the size of a library - wrong or non-transparent? Dear list, EstimateLibraryComplexity Picard tool attempts to calculate number of unique molecules in “the whole library” basing on the Lander-Waterman equation. This is how it is described on Picard’s manual: “Estimates the size of a library based on the number of paired end molecules observed and the number of unique pairs observed. Based on the Lander-Waterman equation that states: C/X = 1 - exp( -N/X ) where X = number of distinct molecules in library N = number of read pairs C = number of distinct fragments observed in read pairs” In fact the Lander-Waterman equation estimates probability of genome coverage basing on size of genome, number of fragments and length of each fragment: E = 1 – exp(-NL/G) where N is number of fragments, L is size of fragment and G is size of genome (from Wikipedia) It looks to me that the Picard’s interpretation of the equation requires more explanation than is given. My initial instinct is that the interpretation is merely wrong, for it does not account (i) for the size of genome and (ii) for the fact that only a fraction of actual library was captured in the experiment… I have seen that question about the way how Picard calculates the library size comes again and again in this list. A typical reply is that the algorithm is openly available and that it is based on some complex statistics that is beyond a biologically educated person. Could you please explain for statistically educated people how exactly you derived your interpretation of the Lander-Waterman equation and give a link to the relevant paper(s). Please, I do not mean link to the seminal paper of Lander-Waterman, but any paper explaining how exactly the interpretation was derived. Thank you, Alexey ------------------------------------------------------------------------------ October Webinars: Code for Performance Free Intel webinars can help you accelerate application performance. Explore tips for MPI, OpenMP, advanced profiling, and more. Get the most from the latest Intel processors and coprocessors. See abstracts and register > http://pubads.g.doubleclick.net/gampad/clk?id=60134071&iu=/4140/ostg.clktrk_______________________________________________ Samtools-help mailing list Samtools-help@... https://lists.sourceforge.net/lists/listinfo/samtools-help ```
 Re: [Samtools-help] Pickard estimate for the size of a library - wrong or non-transparent? From: Heng Li - 2013-10-17 15:39:31 ```On Oct 17, 2013, at 11:18 AM, "Hansen, Nancy (NIH/NHGRI) [E]" wrote: > But when we sequence a library using typical next generation protocols, there is much more happening behind the scenes. Very importantly, there is an initial step in which the library is amplified, so that rather than having one copy of each unique fragment, we have a distribution of multiplicities for our original fragments. Once a fragment is multiplied, it is not only more likely to be selected for sequencing: it is also more likely to be amplified again during the amplification step! Actually, if we assume the number of amplifications follows a Poisson distribution, the Picard formula is still true. You can find the detailed derivation here ;. Equation 1.2 in the document is the same as the Picard equation but with different notations. Heng -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. ```
 Re: [Samtools-help] Pickard estimate for the size of a library - wrong or non-transparent? From: Heng Li - 2013-10-17 16:33:20 ```On Oct 17, 2013, at 11:39 AM, Heng Li wrote: > > On Oct 17, 2013, at 11:18 AM, "Hansen, Nancy (NIH/NHGRI) [E]" wrote: > >> But when we sequence a library using typical next generation protocols, there is much more happening behind the scenes. Very importantly, there is an initial step in which the library is amplified, so that rather than having one copy of each unique fragment, we have a distribution of multiplicities for our original fragments. Once a fragment is multiplied, it is not only more likely to be selected for sequencing: it is also more likely to be amplified again during the amplification step! > > Actually, if we assume the number of amplifications follows a Poisson distribution, the Picard formula is still true. You can find the detailed derivation here ;. Equation 1.2 in the document is the same as the Picard equation but with different notations. Hmm.. The note was written 5 years ago. When I revisit it, I realize that the Picard equation is quite accurate if 1) amplifications always work at 100% efficiency; or 2) the distribution is Poisson. For uniform or binomial distribution, the first order approximation is still good enough if we assume N< > Heng > > > > -- > The Wellcome Trust Sanger Institute is operated by Genome Research > Limited, a charity registered in England with number 1021457 and a > company registered in England with number 2742969, whose registered > office is 215 Euston Road, London, NW1 2BE. > > ------------------------------------------------------------------------------ > October Webinars: Code for Performance > Free Intel webinars can help you accelerate application performance. > Explore tips for MPI, OpenMP, advanced profiling, and more. Get the most from > the latest Intel processors and coprocessors. See abstracts and register > > http://pubads.g.doubleclick.net/gampad/clk?id=60135031&iu=/4140/ostg.clktrk > _______________________________________________ > Samtools-help mailing list > Samtools-help@... > https://lists.sourceforge.net/lists/listinfo/samtools-help -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. ```
 Re: [Samtools-help] Pickard estimate for the size of a library - wrong or non-transparent? From: Richard Corbett - 2013-10-17 16:42:38 Attachments: downSampleCorrelations2.png ```Hi all, I was wrestling a little with comparing dup rates, specifically trying to answer the question "how much genome sequence do we need to get a confident measure of the library complexity?". I ended up doing some testing and this is what I found... At first, since all my libraries had different numbers of reads, it seemed logical to compare the "ESTIMATED_LIBRARY_SIZE" (ELS) to compare between different libraries. Overall, it seemed that libraries with low ELS were, in fact, the least complex, or otherwise bad libraries. So I strongly agree with Nancy that these estimates are functionally very useful. However, there were a few outlier libraries where the ELS didn't seem to correlate as well with my expectation (loosely based on successfully getting X useful coverage from some amount of sequence reads.). Looking in to the outliers further I was surprised that some of the outliers were actually just the libraries that had more, or less sequence than expected. So next, I tried downsampling thousands of libraries at different levels, and checking the ELS. This is where it got interesting. When downsampling (and re-dup marking) within single lanes of sequence the ELS estimates showed a linear relationship with the number of reads used. See attached. There seems to be a not-insignificant offset when comparing ELS for libraries with different numbers of reads. Long story short(ish), I ended up just downsampling 1500 genome libraries to 10 million reads to build a distribution of the ELS values at that level. We can reasonably place any new libraries on the curve to see how they compare to the old libraries, which seems to give us the level of feedback we need to know if the libraries are complex enough for our needs. RIchard On 10/17/2013 08:39 AM, Heng Li wrote: > On Oct 17, 2013, at 11:18 AM, "Hansen, Nancy (NIH/NHGRI) [E]" wrote: > >> But when we sequence a library using typical next generation protocols, there is much more happening behind the scenes. Very importantly, there is an initial step in which the library is amplified, so that rather than having one copy of each unique fragment, we have a distribution of multiplicities for our original fragments. Once a fragment is multiplied, it is not only more likely to be selected for sequencing: it is also more likely to be amplified again during the amplification step! > Actually, if we assume the number of amplifications follows a Poisson distribution, the Picard formula is still true. You can find the detailed derivation here ;. Equation 1.2 in the document is the same as the Picard equation but with different notations. > > Heng > > > ```
 Re: [Samtools-help] Pickard estimate for the size of a library - wrong or non-transparent? From: Alexey Larionov - 2013-10-19 21:56:23 ```Hi Nancy, Thank you very much for the detailed reply. I am relieved to hear that Lander-Waterman equation has nothing to do with the library size estimate :) I have some practical experience with PCR; this is why I like your suggestion, that in fact we have a sampling from a library X by N elements, where X is the number of unique elements and N is the number of repetitions of each unique element. Not sure yet how the Picard's equation can be derived from this. However, I am more inclined to believe in it now, when I know that it is based on a proper assumption. I also agree that complexity of library preparation procedures, which could include a couple rounds of capture and a couple of PCRs, raises a question about the accuracy of assumptions and estimates... Of course, providing the estimate was correct, this estimate would be very useful for scaling-up experiments. I hope that development of the new PCR-free library preparation kits may render this issue less practical :) Thanks again, Alexey -----Original Message----- From: Hansen, Nancy (NIH/NHGRI) [E] [mailto:nhansen@...] Sent: 17 October 2013 16:18 To: Alexey Larionov; samtools-help@... Cc: al720@... Subject: Re: [Samtools-help] Pickard estimate for the size of a library - wrong or non-transparent? Hi Alexey, I didn't see an answer to your question sent to the list, so in case others are interested as well: You are right that claiming that Picard's equation for library size is "based on the Lander-Waterman equation" is highly misleading. The two cases are mathematically similar, but the sampling problems they describe are vastly different. In the case of the Picard library size estimate, it looks as if the assumption being made is that in sequencing a library, we are choosing (with replacement) a random fragment from a library of X unique fragments N times, and then using C, the number of unique fragments we've observed, to estimate X, the total number of fragments in the population from which we are sampling. Given all of the assumptions made (see next paragraph), the equation used by Picard is valid enough (see http://math.stackexchange.com/questions/32800/probability-distribution-of-co verage-of-a-set-after-x-independently-randomly for a description of the statistics). But when we sequence a library using typical next generation protocols, there is much more happening behind the scenes. Very importantly, there is an initial step in which the library is amplified, so that rather than having one copy of each unique fragment, we have a distribution of multiplicities for our original fragments. Once a fragment is multiplied, it is not only more likely to be selected for sequencing: it is also more likely to be amplified again during the amplification step! In addition, different capture efficiencies will also affect our distribution of fragments (much to the dismay of developers of copy number calling software ;-)) The true statistical analysis gets very complicated very quickly, so I wouldn't even want to hazard a guess as to how accurate the Picard equation is in practice. Still, I think the equation is good enough for a rough estimate and useful for analyses of return on investment in sequencing, since if we have the ideal situation (equal representation of all the desired fragments in our library), the equation gives us a good picture of how much of the library we've sampled so far with our sequencing. This library size estimate can be highly valuable in deciding whether sequencing more of the library is likely to yield more unique information, or simply duplicate information, so it's a very nice thing that it's including in the Picard package. Regards, --Nancy -- ************************************* Nancy F. Hansen, PhD nhansen@... Comparative Genomics Unit, NHGRI 5625 Fishers Lane Rockville, MD 20852 Phone: (301) 435-1560 Fax: (301) 435-6170 From: Alexey Larionov > Date: Sun, 13 Oct 2013 00:21:29 +0100 To: > Cc: > Subject: [Samtools-help] Pickard estimate for the size of a library - wrong or non-transparent? Dear list, EstimateLibraryComplexity Picard tool attempts to calculate number of unique molecules in "the whole library" basing on the Lander-Waterman equation. This is how it is described on Picard's manual: "Estimates the size of a library based on the number of paired end molecules observed and the number of unique pairs observed. Based on the Lander-Waterman equation that states: C/X = 1 - exp( -N/X ) where X = number of distinct molecules in library N = number of read pairs C = number of distinct fragments observed in read pairs" In fact the Lander-Waterman equation estimates probability of genome coverage basing on size of genome, number of fragments and length of each fragment: E = 1 - exp(-NL/G) where N is number of fragments, L is size of fragment and G is size of genome (from Wikipedia) It looks to me that the Picard's interpretation of the equation requires more explanation than is given. My initial instinct is that the interpretation is merely wrong, for it does not account (i) for the size of genome and (ii) for the fact that only a fraction of actual library was captured in the experiment. I have seen that question about the way how Picard calculates the library size comes again and again in this list. A typical reply is that the algorithm is openly available and that it is based on some complex statistics that is beyond a biologically educated person. Could you please explain for statistically educated people how exactly you derived your interpretation of the Lander-Waterman equation and give a link to the relevant paper(s). Please, I do not mean link to the seminal paper of Lander-Waterman, but any paper explaining how exactly the interpretation was derived. Thank you, Alexey ---------------------------------------------------------------------------- -- October Webinars: Code for Performance Free Intel webinars can help you accelerate application performance. Explore tips for MPI, OpenMP, advanced profiling, and more. Get the most from the latest Intel processors and coprocessors. See abstracts and register > http://pubads.g.doubleclick.net/gampad/clk?id=60134071&iu=/4140/ostg.clktrk_ ______________________________________________ Samtools-help mailing list Samtools-help@... https://lists.sourceforge.net/lists/listinfo/samtools-help ```