Re: [Samtools-help] Question about MarkDuplicates tool from Picard

See http://en.wikipedia.org/wiki/DNA_sequencing_theory#Lander-Waterman_theory :

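One of the more frequently used results from this model is the expected number of contigs, given the number of fragments sequenced. If one neglects the amount of sequence that is essentially "wasted" by having to detect overlaps, their theory yields

    E(contigs) = N * exp(-N*L/G)

where N is the number of fragments sequenced, L is the fragment length, and G is the length of the target.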

See also the definition of estimated library size: http://picard.sourceforge.net/picard-metric-definitions.shtml#DuplicationMetrics

ESTIMATED_LIBRARY_SIZE: The estimated number of unique molecules in the library based on PE duplication.
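
To make the link between the two references concrete, here is a minimal sketch of a Lander-Waterman style estimate. It assumes the library size X is the solution of C/X = 1 - exp(-N/X), where N is the number of read pairs examined excluding optical duplicates and C is the number of read pairs left after removing all duplicates. That choice of N and C is an assumption based on the explanation quoted further down this thread (only PCR duplicates count toward the estimate), and the function name `estimate_library_size` is purely illustrative; this is not taken from the Picard source.

```python
import math


def estimate_library_size(read_pairs, unique_read_pairs):
    """Solve C/X = 1 - exp(-N/X) for X by bisection.

    N = read_pairs        (assumed: pairs examined minus optical duplicates)
    C = unique_read_pairs (assumed: pairs examined minus all duplicates)
    Returns None when there are no duplicates to base an estimate on.
    """
    n, c = float(read_pairs), float(unique_read_pairs)
    if n <= 0 or c <= 0 or c >= n:
        return None

    def f(x):
        # Positive while x is below the root, negative above it.
        return c / x - 1.0 + math.exp(-n / x)

    lo, hi = c, 100.0 * c
    while f(hi) >= 0:          # widen the bracket until it contains the root
        hi *= 10.0
    for _ in range(100):       # plain bisection
        mid = (lo + hi) / 2.0
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return int((lo + hi) / 2.0)


# Numbers copied from the two metrics files quoted below in this thread.
READ_PAIRS_EXAMINED = 120683136
READ_PAIR_DUPLICATES = 33499791

for pixels, optical in ((100, 18770305), (10, 13320324)):
    n = READ_PAIRS_EXAMINED - optical
    c = READ_PAIRS_EXAMINED - READ_PAIR_DUPLICATES
    print(pixels, estimate_library_size(n, c))
```

Under these assumptions the two results should land close to the reported 317 706 634 and 248 569 048. C is identical in both runs; only N changes with the optical duplicate count, and the estimate is sensitive to that change.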

-Alec

On Mar 26, 2013, at 8:13 AM, Emeric Dubois <Eme...@mg...> wrote:

> Hi Alec,
> 
> It was not easy to understand your definition of "library size" because in most software "library size" is sequencing depth for each sample.
> Where can I find the algorithm of this estimate?
> 
> Emeric
> On 25/03/2013 17:58, Alec Wysoker wrote:
>> Hi Emeric,
>> 
>> I think you are not understanding the concept of library size.  Library size is the number of unique molecules in your sample.  When you sequence your sample, you sequence some subset of the molecules in your sample.  I'm guessing that 171 million is the number of molecules that you sequenced, rather than the entire library.  
>> 
>> As for why changing pixel distance changes the estimated library size, I don't know that I can provide a clearer explanation than the one I already provided.
>> 
>> You might want to do some googling about duplication detection, or get a colleague to walk you through these concepts.
>> 
>> -Alec
>> 
>> 
>> 
>> On Mar 21, 2013, at 5:06 AM, Emeric Dubois <Eme...@mg...> wrote:
>> 
>>> Hi Alec,
>>> 
>>> Thanks for your response.
>>> But I don't understand why there is 317 706 634 for "ESTIMATED_LIBRARY_SIZE" in the first test (100 pixels)?
>>> There are only 171M clusters in my library...
>>> And why is the "ESTIMATED_LIBRARY_SIZE" in the second test (10 pixels) 248 569 048? The difference between the two tests is only about 5M in "READ_PAIR_OPTICAL_DUPLICATES"...
>>> Thanks for your help
>>> 
>>> Emeric
>>> 
>>> On 20/03/2013 17:31, Alec Wysoker wrote:
>>>> Hi Emeric,
>>>> 
>>>> There are 3 possible reasons that reads could be identified as duplicates of one another:
>>>> 1. the same fragment from the original sample was duplicated via PCR;
>>>> 2. the (highly unlikely) case that two fragments from the original sample were identical.  This number is assumed to be zero;
>>>> 3. a single cluster sequenced by the instrument was mis-identified as two clusters.
>>>> 
>>>> The algorithm for determining library size only considers category (1) above.  As you change OPTICAL_DUPLICATE_PIXEL_DISTANCE, you change the number of reads in category 1 vs. category 3, and therefore the estimated library size is changed.
>>>> 
>>>> -Alec
>>>> 
>>>> 
>>>> On Mar 20, 2013, at 11:46 AM, Emeric Dubois <Eme...@mg...> wrote:
>>>> 
>>>>> Hi,
>>>>> 
>>>>> I'm testing the MarkDuplicates tool from Picard and I don't understand the results.
>>>>> When I change only the "OPTICAL_DUPLICATE_PIXEL_DISTANCE" option, I get very different results for "ESTIMATED_LIBRARY_SIZE" in the metrics file, while the other metrics barely change (only the number of read pair duplicates caused by optical duplication: 18 770 305 to 13 320 324).
>>>>> Why is the estimated number of unique molecules in the library so different (317 706 634 and 248 569 048)? The percentage of mapped sequence marked as duplicate is the same...
>>>>> Thanks for your help
>>>>> 
>>>>> Emeric
>>>>> 
>>>>> First test :
>>>>> 
>>>>> java -Xmx8g -jar /data/software/picard-tools-1.87/MarkDuplicates.jar \
>>>>> INPUT=accepted_hits.sort.bam \
>>>>> OUTPUT=accepted_hits.sort.Picard-MarkDuplicates.bam \
>>>>> METRICS_FILE=Picard-MarkDuplicates.metrics.txt \
>>>>> REMOVE_DUPLICATES=true \
>>>>> ASSUME_SORTED=true \
>>>>> MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1024
>>>>> 
>>>>> Picard-MarkDuplicates.metrics.txt :
>>>>> 
>>>>> ## net.sf.picard.metrics.StringHeader
>>>>> # net.sf.picard.sam.MarkDuplicates INPUT=[accepted_hits.sort.bam] OUTPUT=accepted_hits.sort.Picard-MarkDuplicates.bam METRICS_FILE=Picard-MarkDuplicates.metrics.txt REMOVE_DUPLICATES=true ASSUME_SORTED=true MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1024    PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 SORTING_COLLECTION_SIZE_RATIO=0.25 READ_NAME_REGEX=[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).* OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
>>>>> ## net.sf.picard.metrics.StringHeader
>>>>> # Started on: Tue Mar 19 18:10:42 CET 2013
>>>>> 
>>>>> ## METRICS CLASS        net.sf.picard.sam.DuplicationMetrics
>>>>> LIBRARY UNPAIRED_READS_EXAMINED READ_PAIRS_EXAMINED     UNMAPPED_READS  UNPAIRED_READ_DUPLICATES        READ_PAIR_DUPLICATES    READ_PAIR_OPTICAL_DUPLICATES    PERCENT_DUPLICATION     ESTIMATED_LIBRARY_SIZE
>>>>> Unknown Library 23973276        120683136       0       20803847        33499791        18770305        0,33091 317706634
>>>>> 
>>>>> 
>>>>> Second test :
>>>>> 
>>>>> java -Xmx8g -jar /data/software/picard-tools-1.87/MarkDuplicates.jar \
>>>>> INPUT=accepted_hits.sort.bam \
>>>>> OUTPUT=accepted_hits.sort.Picard-MarkDuplicates_10pixels.bam \
>>>>> METRICS_FILE=Picard-MarkDuplicates.metrics_10pixels.txt \
>>>>> REMOVE_DUPLICATES=true \
>>>>> ASSUME_SORTED=true \
>>>>> MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1024 \
>>>>> OPTICAL_DUPLICATE_PIXEL_DISTANCE=10
>>>>> 
>>>>> ## net.sf.picard.metrics.StringHeader
>>>>> # net.sf.picard.sam.MarkDuplicates INPUT=[accepted_hits.sort.bam] OUTPUT=accepted_hits.sort.Picard-MarkDuplicates_10pixels.bam METRICS_FILE=Picard-MarkDuplicates.metrics_10pixels.txt REMOVE_DUPLICATES=true ASSUME_SORTED=true MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1024 OPTICAL_DUPLICATE_PIXEL_DISTANCE=10                        PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 SORTING_COLLECTION_SIZE_RATIO=0.25                     READ_NAME_REGEX=[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).* VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
>>>>> ## net.sf.picard.metrics.StringHeader
>>>>> # Started on: Tue Mar 19 19:46:11 CET 2013
>>>>> 
>>>>> ## METRICS CLASS        net.sf.picard.sam.DuplicationMetrics
>>>>> LIBRARY UNPAIRED_READS_EXAMINED READ_PAIRS_EXAMINED     UNMAPPED_READS  UNPAIRED_READ_DUPLICATES        READ_PAIR_DUPLICATES    READ_PAIR_OPTICAL_DUPLICATES    PERCENT_DUPLICATION     ESTIMATED_LIBRARY_SIZE
>>>>> Unknown Library 23973276        120683136       0       20803847        33499791        13320324        0,33091 248569048
>>>>> 