ngopt / Tickets / #8 Too many scaffolds

Originally posted by: jlklas...@wisc.edu

I have a similar problem: the a5_s5 step is skipped because no misassemblies are found (which, judging from the .pl script, is the intended behaviour when none are found).

What I am more concerned by is the behaviour of a5_qc. According to the a5_s3 .raw1.summaryfile.txt summary file, my PE library has an insert size of 160 +/-152. The s4 estimate agrees, but a5_qc then seems to remove the reads with those characteristics and use a minuscule subset of the data for correction:

[a5_qc] Found the following clusters:
[a5_qc] cluster1: mu=159 sd=24 n=47697 perc=80.02 (signal)
[a5_qc] cluster0: mu=180 sd=283 n=11907 perc=19.98 (noise)
[a5_qc] Removing cluster1
[a5_qc] Final stats for sample after filtering: mu=180 sd=283 n=11907
[a5_qc] Filtering read pairs with inserts between 1-318
[a5_qc] Reading SAM file.....10%..20%..30%..40%..50%..60%..70%..80%..90%..100%... done!... Took 60 seconds.
[a5_qc] Keeping 0.42% (18353/4405428) of reads.

This seems to me a possible reason why I am not getting misassembly detection: cluster1 is removed instead of cluster0 (i.e., the noise is kept, not the signal). Or am I way off and just not understanding how this step works?
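
To make the concern concrete, here is a minimal sketch in Python (my own, not part of a5_qc or the A5 pipeline) that parses the cluster lines quoted above, flags any removed cluster that is labelled as signal, and recomputes the fraction of reads kept. The regular expressions and function names are assumptions for illustration, based only on the log format shown in this post.

```python
import re

# Lines copied from the a5_qc output quoted above.
LOG = """\
[a5_qc] cluster1: mu=159 sd=24 n=47697 perc=80.02 (signal)
[a5_qc] cluster0: mu=180 sd=283 n=11907 perc=19.98 (noise)
[a5_qc] Removing cluster1
[a5_qc] Keeping 0.42% (18353/4405428) of reads.
"""

# Hypothetical pattern for the cluster summary lines; \s+ also tolerates
# the padded spacing used in the complete log.
CLUSTER_RE = re.compile(
    r"(cluster\d+): mu=(\d+)\s+sd=(\d+)\s+n=(\d+)\s+perc=([\d.]+)\s+\((\w+)\)"
)


def parse_clusters(log):
    """Return {cluster name: stats dict} for every cluster line in the log."""
    clusters = {}
    for m in CLUSTER_RE.finditer(log):
        name, mu, sd, n, perc, label = m.groups()
        clusters[name] = {
            "mu": int(mu),
            "sd": int(sd),
            "n": int(n),
            "perc": float(perc),
            "label": label,
        }
    return clusters


def removed_clusters(log):
    """Return the names of the clusters the log reports as removed."""
    return re.findall(r"Removing\s+(cluster\d+)", log)


def kept_fraction(log):
    """Return kept/total reads from the 'Keeping ...' line."""
    m = re.search(r"Keeping [\d.]+% \((\d+)/(\d+)\)", log)
    kept, total = map(int, m.groups())
    return kept / total


clusters = parse_clusters(LOG)
# Removed clusters that the log itself labels as signal.
suspicious = [c for c in removed_clusters(LOG) if clusters[c]["label"] == "signal"]

print("removed but labelled signal:", suspicious)
print("percent of reads kept: %.2f" % (100 * kept_fraction(LOG)))
```

On the lines above this reports cluster1 as removed despite its signal label, and the recomputed percentage matches the 0.42% in the log, so the numbers are at least internally consistent with the wrong cluster being dropped.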

The complete .qc.libraw1.qc.out is below, with most of the block-finding output skipped:

[a5_qc] Reading run336_good.s4/run336_good.qc.libraw1.sam
[a5_qc] Found 6163 contigs
[a5_qc] Reading in a subset of reads for insert size estimation.
[a5_qc] Took 3 seconds to read in 100000 read pairs.
[a5_qc] Found a substantial amount of innies, but found no outties.
[a5_qc] EM-clustering insert sizes with K=2... stopping after 5 iterations with delta=5.0E-8. L = NaN. Took 0 seconds.
[a5_qc] EM-clustering insert sizes with K=3... stopping after 9 iterations with delta=5.0E-8. L = NaN. Took 2 seconds.
[a5_qc] EM-clustering insert sizes with K=4... stopping after 6 iterations with delta=5.0E-8. L = NaN. Took 1 seconds.
[a5_qc] EM-clustering insert sizes with K=5... stopping after 24 iterations with delta=5.0E-8. L = -73608.67790434555. Took 5 seconds.
[a5_qc] EM-clustering insert sizes with K=6... stopping after 26 iterations with delta=5.0E-8. L = -87322.21447382745. Took 6 seconds.
[a5_qc] EM-clustering insert sizes with K=7... stopping after 53 iterations with delta=5.0E-8. L = -90065.26303883643. Took 15 seconds.
[a5_qc] EM-clustering insert sizes with K=8... stopping after 48 iterations with delta=5.0E-8. L = -99419.94083752079. Took 16 seconds.
[a5_qc] EM-clustering insert sizes with K=9... stopping after 78 iterations with delta=5.0E-8. L = -110666.50469422666. Took 27 seconds.
[a5_qc] EM-clustering insert sizes with K=10... stopping after 97 iterations with delta=5.0E-8. L = -110737.19307312598. Took 36 seconds.
[a5_qc] EM-clustering insert sizes with K=11... stopping after 215 iterations with delta=5.0E-8. L = -115798.31962673985. Took 84 seconds.
[a5_qc] EM-clustering insert sizes with K=12... stopping after 40 iterations with delta=5.0E-8. L = -128587.19678705752. Took 15 seconds.
[a5_qc] EM-clustering insert sizes with K=13... stopping after 158 iterations with delta=5.0E-8. L = -120193.90440601777. Took 56 seconds.
[a5_qc] EM-clustering insert sizes with K=14... stopping after 95 iterations with delta=5.0E-8. L = -124348.34369786637. Took 34 seconds.
[a5_qc] EM-clustering insert sizes with K=15... stopping after 280 iterations with delta=5.0E-8. L = -128767.13967597009. Took 105 seconds.
[a5_qc] EM-clustering insert sizes with K=16... stopping after 768 iterations with delta=5.0E-8. L = -131085.39492548717. Took 296 seconds.
[a5_qc] EM-clustering insert sizes with K=17... stopping after 106 iterations with delta=5.0E-8. L = -146030.94750065517. Took 44 seconds.
[a5_qc] EM-clustering insert sizes with K=18... stopping after 984 iterations with delta=5.0E-8. L = -133506.33790708426. Took 414 seconds.
[a5_qc] EM-clustering insert sizes with K=19... stopping after 776 iterations with delta=5.0E-8. L = -137479.52186023252. Took 316 seconds.
[a5_qc] EM-clustering insert sizes with K=20... stopping after 1000 iterations with delta=5.0E-8. L = -141556.22416973184. Took 419 seconds.
[a5_qc] EM-clustering insert sizes with K=21... stopping after 1000 iterations with delta=5.0E-8. L = -143044.80630908583. Took 531 seconds.
[a5_qc] Found 1 clusters.
[a5_qc] Found the following clusters:
[a5_qc] cluster1: mu=159       sd=24        n=47697     perc=80.02       (signal)
[a5_qc] cluster0: mu=180       sd=283       n=11907     perc=19.98       (noise)
[a5_qc] Removing cluster1
[a5_qc] Final stats for sample after filtering: mu=180 sd=283 n=11907
[a5_qc] Filtering read pairs with inserts between 1-318
[a5_qc] Reading SAM file.....10%..20%..30%..40%..50%..60%..70%..80%..90%..100%... done!... Took 60 seconds.
[a5_qc] Keeping 0.42% (18353/4405428) of reads.
[a5_qc] parameters:
        P                   = 0.062
        MIN_BLOCK_LEN       = 284
        MEAN_BLOCK_LEN      = 159
        MAX_BLOCK_LEN       = 303
        MAX_INTERBLOCK_DIST = 318
        MAX_INTERPOINT_DIST = 142
        EPSILON             = 142.0
        MIN_POINTS          = 17
[a5_qc] Found 1 initial blocks between contigs 3233 and 5105
        548-680 <-> 22-230
[a5_qc] Found 1 initial blocks between contigs 147 and 640
        1-212 <-> 3-224

...

[a5_qc] Found 0 blocks on contig 3481
[a5_qc] Found 0 blocks on contig 2595
