Re: [Samtools-help] Pickard estimate for the size of a library - wrong or non-transparent?

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 454-5900

Hi Nancy,  

Thank you very much for the detailed reply.  I am relieved to hear that
Lander-Waterman equation has nothing to do with the library size estimate :)

I have some practical experience with PCR; this is why I like your
suggestion, that in fact we have a sampling from a library X by N elements,
where X is the number of unique elements and N is the number of repetitions
of each unique element.  Not sure yet how the Picard's equation can be
derived from this.  However, I am more inclined to believe in it now, when I
know that it is based on a proper assumption.  

I also agree that complexity of library preparation procedures, which could
include a couple rounds of capture and a couple of PCRs, raises a question
about the accuracy of assumptions and estimates...  Of course, providing the
estimate was correct, this estimate would be very useful for scaling-up
experiments.  

I hope that development of the new PCR-free library preparation kits may
render this issue less practical :)

Thanks again,
Alexey

-----Original Message-----
From: Hansen, Nancy (NIH/NHGRI) [E] [mailto:nh...@ma...] 
Sent: 17 October 2013 16:18
To: Alexey Larionov; sam...@li...
Cc: al...@ca...
Subject: Re: [Samtools-help] Pickard estimate for the size of a library -
wrong or non-transparent?

Hi Alexey,

I didn't see an answer to your question sent to the list, so in case others
are interested as well:

You are right that claiming that Picard's equation for library size is
"based on the Lander-Waterman equation" is highly misleading.  The two cases
are mathematically similar, but the sampling problems they describe are
vastly different.

In the case of the Picard library size estimate, it looks as if the
assumption being made is that in sequencing a library, we are choosing (with
replacement) a random fragment from a library of X unique fragments N times,
and then using C, the number of unique fragments we've observed, to estimate
X, the total number of fragments in the population from which we are
sampling.  Given all of the assumptions made (see next paragraph), the
equation used by Picard is valid enough (see
http://math.stackexchange.com/questions/32800/probability-distribution-of-co
verage-of-a-set-after-x-independently-randomly for a description of the
statistics).

But when we sequence a library using typical next generation protocols,
there is much more happening behind the scenes.  Very importantly, there is
an initial step in which the library is amplified, so that rather than
having one copy of each unique fragment, we have a distribution of
multiplicities for our original fragments.  Once a fragment is multiplied,
it is not only more likely to be selected for sequencing: it is also more
likely to be amplified again during the amplification step!  In addition,
different capture efficiencies will also affect our distribution of
fragments (much to the dismay of developers of copy number calling software
;-)) The true statistical analysis gets very complicated very quickly, so I
wouldn't even want to hazard a guess as to how accurate the Picard equation
is in practice.

Still, I think the equation is good enough for a rough estimate and useful
for analyses of return on investment in sequencing, since if we have the
ideal situation (equal representation of all the desired fragments in our
library), the equation gives us a good picture of how much of the library
we've sampled so far with our sequencing.  This library size estimate can be
highly valuable in deciding whether sequencing more of the library is likely
to yield more unique information, or simply duplicate information, so it's a
very nice thing that it's including in the Picard package.

Regards,
--Nancy

--
*************************************
Nancy F. Hansen, PhD    nh...@nh...
Comparative Genomics Unit, NHGRI
5625 Fishers Lane
Rockville, MD 20852
Phone: (301) 435-1560   Fax: (301) 435-6170

From: Alexey Larionov
<ale...@ho...<mailto:ale...@ho...>>
Date: Sun, 13 Oct 2013 00:21:29 +0100
To:
<sam...@li...<mailto:sam...@li...urceforge.
net>>
Cc: <al...@ca...<mailto:al...@ca...>>
Subject: [Samtools-help] Pickard estimate for the size of a library - wrong
or non-transparent?

Dear list,

EstimateLibraryComplexity Picard tool attempts to calculate number of unique
molecules in "the whole library" basing on the Lander-Waterman equation.
This is how it is described on Picard's manual:

"Estimates the size of a library based on the number of paired end molecules
observed and the number of unique pairs observed. Based on the
Lander-Waterman equation that states:
C/X = 1 - exp( -N/X )
where X = number of distinct molecules in library N = number of read pairs C
= number of distinct fragments observed in read pairs"

In fact the Lander-Waterman equation estimates probability of genome
coverage basing on size of genome, number of fragments and length of each
fragment:
E = 1 - exp(-NL/G)
where N is number of fragments, L is size of fragment and G is size of
genome (from Wikipedia)

It looks to me that the Picard's interpretation of the equation requires
more explanation than is given.  My initial instinct is that the
interpretation is merely wrong, for it does not account (i) for the size of
genome and (ii) for the fact that only a fraction of actual library was
captured in the experiment.

I have seen that question about the way how Picard calculates the library
size comes again and again in this list.  A typical reply is that the
algorithm is openly available and that it is based on some complex
statistics that is beyond a biologically educated person.

Could you please explain for statistically educated people how exactly you
derived your interpretation of the Lander-Waterman equation and give a link
to the relevant paper(s).  Please, I do not mean link to the seminal paper
of Lander-Waterman, but any paper explaining how exactly the interpretation
was derived.

Thank you,
Alexey
----------------------------------------------------------------------------
-- October Webinars: Code for Performance Free Intel webinars can help you
accelerate application performance. Explore tips for MPI, OpenMP, advanced
profiling, and more. Get the most from the latest Intel processors and
coprocessors. See abstracts and register >
http://pubads.g.doubleclick.net/gampad/clk?id=60134071&iu=/4140/ostg.clktrk_
______________________________________________ Samtools-help mailing list
Sam...@li...<mailto:Sam...@li...urceforge.n
et> https://lists.sourceforge.net/lists/listinfo/samtools-help