[Samtools-help] Picard estimate for the size of a library - wrong or non-transparent?

Dear list, 

The Picard tool EstimateLibraryComplexity attempts to calculate the number
of unique molecules in "the whole library" based on the Lander-Waterman
equation.  This is how it is described in Picard's manual:

"Estimates the size of a library based on the number of paired end molecules
observed and the number of unique pairs observed. Based on the
Lander-Waterman equation that states: 

C/X = 1 - exp( -N/X ) 

where
X = number of distinct molecules in library
N = number of read pairs
C = number of distinct fragments observed in read pairs"
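
To make concrete what the quoted formula asks for: given N and C, one has to
solve C/X = 1 - exp( -N/X ) numerically for X.  Below is a minimal sketch of
that, assuming a plain bisection search.  The function name and the
tolerance are my own choices for illustration; this is not Picard's actual
code.

```python
import math


def estimate_library_size(n_read_pairs, n_distinct, tol=1e-9):
    """Solve C/X = 1 - exp(-N/X) for X by bisection (illustrative only).

    n_read_pairs is N, n_distinct is C in the quoted formula.
    """
    if not 0 < n_distinct < n_read_pairs:
        # With C == N no duplicates were seen and X is unbounded.
        raise ValueError("need 0 < C < N for a finite estimate")

    def g(x):
        # g(x) = x * (1 - exp(-N/x)) - C; expm1 keeps precision for large x.
        return -x * math.expm1(-n_read_pairs / x) - n_distinct

    # g is increasing in x, g(C) < 0, and g(x) -> N - C > 0 as x grows,
    # so the root lies above C. Double the upper bound until it brackets it.
    lo = float(n_distinct)
    hi = 2.0 * lo
    while g(hi) < 0:
        hi *= 2.0

    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
        if hi - lo <= tol * hi:
            break
    return 0.5 * (lo + hi)
```

Note that the only inputs are N and C; neither the genome size nor the
fraction of the library that was sequenced appears anywhere.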

In fact, the Lander-Waterman equation estimates the probability of genome
coverage based on the size of the genome, the number of fragments and the
length of each fragment:

E = 1 - exp(-NL/G)

where N is the number of fragments, L is the length of a fragment and G is
the size of the genome (from Wikipedia).
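
For comparison, here is the same kind of sketch for the coverage form
E = 1 - exp(-NL/G).  The function name and the example numbers are made up
for illustration only.

```python
import math


def expected_coverage(n_fragments, fragment_length, genome_size):
    """Return E = 1 - exp(-N*L/G) for N fragments of length L, genome size G."""
    if genome_size <= 0:
        raise ValueError("genome size must be positive")
    depth = n_fragments * fragment_length / genome_size  # N*L/G
    return -math.expm1(-depth)  # equals 1 - exp(-depth)


# Made-up numbers: 1000 fragments of length 100 on a genome of size 100000
# give N*L/G = 1, so E = 1 - exp(-1), roughly 0.632.
print(expected_coverage(1000, 100, 100000))
```

Here the genome size G enters explicitly, which is what I do not see in the
Picard version.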

It looks to me that Picard's interpretation of the equation requires more
explanation than is given.  My initial instinct is that the interpretation
is simply wrong, because it does not account (i) for the size of the genome
and (ii) for the fact that only a fraction of the actual library was
captured in the experiment.

I have seen that the question of how Picard calculates the library size
comes up again and again on this list.  A typical reply is that the
algorithm is openly available and that it is based on complex statistics
that are beyond a biologically educated person.

Could you please explain, for statistically educated people, exactly how
you derived your interpretation of the Lander-Waterman equation, and give a
link to the relevant paper(s)?  Please note that I do not mean a link to the
seminal paper by Lander and Waterman, but to any paper explaining exactly
how the interpretation was derived.

Thank you,

Alexey