#8 Too many scaffolds

New
nobody
None
Medium
Defect
2013-07-02
2013-05-05
Anonymous
No

Originally created by: cadan...@mail.usf.edu

What steps will reproduce the problem?

Trying to assemble a bacterial genome using raw Illumina PE data generated from TruSeq libraries.

perl a5_pipeline.pl ~/ruegeria_genome_0413/R1_001.fastq ~/ruegeria_genome_0413/R2_001.fastq 0413_genome.out

What is the expected output? What do you see instead?
Using the command above, I get only the .s1-.s4 files and no fifth file, which should contain the final assembly stats. I also expected fewer scaffolds, but I get more than 1,800. The output below reports "PE Mode: 0", which suggests that paired-end mode is not being used.

preprocess: WARNING - it is suggested that the min read length is 40
preprocess: Using very short reads may considerably impact the performance
Parameters:
QualTrim: 10
QualFilter: at most 20 low quality bases
HardClip: 0
Min length: 29
Sample freq: 1
PE Mode: 0
Quality scaling: 2
MinGC: 0
MaxGC: 1
Outfile: stdout

[samopen] SAM header is present: 1860 sequences.
[bam_sort_core] merging from 12 files...
[a5] java -Xmx15912m -jar A5qc.jar test_ruegeria_genomev2.out.s4/test_ruegeria_genomev2.out.qc.libraw1.sam test_ruegeria_genomev2.out.crude.scaffolds.fasta t1 > test_ruegeria_genomev2.out.s4/test_ruegeria_genomev2.out.qc.libraw1.qc.out
[a5_s5] No misassemblies found.
[a5] Final assembly in test_ruegeria_genomev2.out.final.scaffolds.fasta
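
A minimal sketch for checking the reported preprocessing settings, assuming the log keeps the "Parameters:" layout shown above. The function name `parse_preprocess_params` and the sample text are illustrative and not part of the A5 pipeline; reading "PE Mode: 0" as "reads not treated as paired" follows the reporter's interpretation and should be confirmed against the preprocessing tool's own documentation.

```python
def parse_preprocess_params(log_text):
    """Collect the 'key: value' lines printed under 'Parameters:' in the log."""
    params = {}
    in_block = False
    for line in log_text.splitlines():
        stripped = line.strip()
        if stripped == "Parameters:":
            in_block = True
            continue
        if in_block:
            # The block ends at the first blank line or line without a colon.
            if not stripped or ":" not in stripped:
                break
            key, value = stripped.split(":", 1)
            params[key.strip()] = value.strip()
    return params


# Sample text copied from the log in this report.
SAMPLE_LOG = """\
preprocess: WARNING - it is suggested that the min read length is 40
Parameters:
QualTrim: 10
QualFilter: at most 20 low quality bases
HardClip: 0
Min length: 29
Sample freq: 1
PE Mode: 0
Quality scaling: 2
MinGC: 0
MaxGC: 1
Outfile: stdout

[samopen] SAM header is present: 1860 sequences.
"""

params = parse_preprocess_params(SAMPLE_LOG)
pe_mode = params.get("PE Mode")
if pe_mode == "0":
    # Assumption: 0 means the reads were not handled as pairs.
    print("PE Mode is 0; reads may not have been treated as paired")
else:
    print("PE Mode:", pe_mode)
```

Running this against the full pipeline output (for example, a saved copy of the terminal log) gives a quick way to confirm which settings the preprocessing step actually used.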

What version of the product are you using? On what operating system?
ngopt_a5pipeline_linux-x64_20120518

Please provide any additional information below.
Please help to resolve these issues. Thanks!

Discussion

  • Anonymous

    Anonymous - 2013-07-02

    Originally posted by: jlklas...@wisc.edu

    I have a similar problem: the a5_s5 step is skipped because no misassemblies are found (from the .pl script, it is clear that skipping is the intended behaviour in that case).

    What concerns me more is the behaviour of a5_qc. According to the a5_s3 .raw1.summaryfile.txt summary file, my PE library has an insert size of 160 +/- 152. The s4 estimate agrees, but a5_qc then seems to remove the reads with those characteristics and use a minuscule subset of the data for correction:

    [a5_qc] Found the following clusters:
    [a5_qc] cluster1: mu=159       sd=24        n=47697     perc=80.02       (signal)
    [a5_qc] cluster0: mu=180       sd=283       n=11907     perc=19.98       (noise)
    [a5_qc] Removing  cluster1
    [a5_qc] Final stats for sample after filtering: mu=180 sd=283 n=11907
    [a5_qc] Filtering read pairs with inserts between 1-318
    [a5_qc] Reading SAM file.....10%..20%..30%..40%..50%..60%..70%..80%..90%..100%... done!... Took 60 seconds.
    [a5_qc] Keeping 0.42% (18353/4405428) of reads.

    This seems to me a possible reason why I am not getting misassembly detection: cluster1 is removed instead of cluster0 (i.e., the noise is kept, not the signal). Or am I way off and just not understanding how this step works?
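
The check described above can be sketched as follows, assuming the a5_qc log keeps the line format quoted in this comment. The regular expressions and the `summarize` helper are illustrative, not part of A5qc; the sketch only reports what the log says and does not show how a5_qc chooses which cluster to remove.

```python
import re

# Patterns written against the log lines quoted above (an assumption about format).
CLUSTER_RE = re.compile(
    r"\[a5_qc\] (cluster\d+): mu=(\d+)\s+sd=(\d+)\s+n=(\d+)\s+perc=([\d.]+)\s+\((\w+)\)"
)
REMOVED_RE = re.compile(r"\[a5_qc\] Removing\s+(cluster\d+)")
KEPT_RE = re.compile(r"\[a5_qc\] Keeping [\d.]+% \((\d+)/(\d+)\) of reads")


def summarize(log_text):
    """Return the clusters, the names of removed clusters, and the kept fraction."""
    clusters = {}
    for m in CLUSTER_RE.finditer(log_text):
        name, mu, sd, n, perc, label = m.groups()
        clusters[name] = {
            "mu": int(mu), "sd": int(sd), "n": int(n),
            "perc": float(perc), "label": label,
        }
    removed = [m.group(1) for m in REMOVED_RE.finditer(log_text)]
    kept = KEPT_RE.search(log_text)
    kept_fraction = int(kept.group(1)) / int(kept.group(2)) if kept else None
    return clusters, removed, kept_fraction


# Sample text copied from the excerpt above.
SAMPLE_LOG = """\
[a5_qc] Found the following clusters:
[a5_qc] cluster1: mu=159       sd=24        n=47697     perc=80.02       (signal)
[a5_qc] cluster0: mu=180       sd=283       n=11907     perc=19.98       (noise)
[a5_qc] Removing  cluster1
[a5_qc] Keeping 0.42% (18353/4405428) of reads.
"""

clusters, removed, kept_fraction = summarize(SAMPLE_LOG)

# Flag any removed cluster that the log itself labelled as signal.
flagged = [name for name in removed if clusters[name]["label"] == "signal"]
for name in flagged:
    print(f"{name} is labelled 'signal' but was removed")
print(f"Fraction of reads kept: {kept_fraction:.4%}")
```
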

    The complete .qc.libraw1.qc.out follows, with most of the block-finding output omitted:

    [a5_qc] Reading run336_good.s4/run336_good.qc.libraw1.sam
    [a5_qc] Found 6163 contigs
    [a5_qc] Reading in a subset of reads for insert size estimation.
    [a5_qc] Took 3 seconds to read in 100000 read pairs.
    [a5_qc] Found a substantial amount of innies, but found no outties.
    [a5_qc] EM-clustering insert sizes with K=2... stopping after 5 iterations with delta=5.0E-8. L = NaN. Took 0 seconds.
    [a5_qc] EM-clustering insert sizes with K=3... stopping after 9 iterations with delta=5.0E-8. L = NaN. Took 2 seconds.
    [a5_qc] EM-clustering insert sizes with K=4... stopping after 6 iterations with delta=5.0E-8. L = NaN. Took 1 seconds.
    [a5_qc] EM-clustering insert sizes with K=5... stopping after 24 iterations with delta=5.0E-8. L = -73608.67790434555. Took 5 seconds.
    [a5_qc] EM-clustering insert sizes with K=6... stopping after 26 iterations with delta=5.0E-8. L = -87322.21447382745. Took 6 seconds.
    [a5_qc] EM-clustering insert sizes with K=7... stopping after 53 iterations with delta=5.0E-8. L = -90065.26303883643. Took 15 seconds.
    [a5_qc] EM-clustering insert sizes with K=8... stopping after 48 iterations with delta=5.0E-8. L = -99419.94083752079. Took 16 seconds.
    [a5_qc] EM-clustering insert sizes with K=9... stopping after 78 iterations with delta=5.0E-8. L = -110666.50469422666. Took 27 seconds.
    [a5_qc] EM-clustering insert sizes with K=10... stopping after 97 iterations with delta=5.0E-8. L = -110737.19307312598. Took 36 seconds.
    [a5_qc] EM-clustering insert sizes with K=11... stopping after 215 iterations with delta=5.0E-8. L = -115798.31962673985. Took 84 seconds.
    [a5_qc] EM-clustering insert sizes with K=12... stopping after 40 iterations with delta=5.0E-8. L = -128587.19678705752. Took 15 seconds.
    [a5_qc] EM-clustering insert sizes with K=13... stopping after 158 iterations with delta=5.0E-8. L = -120193.90440601777. Took 56 seconds.
    [a5_qc] EM-clustering insert sizes with K=14... stopping after 95 iterations with delta=5.0E-8. L = -124348.34369786637. Took 34 seconds.
    [a5_qc] EM-clustering insert sizes with K=15... stopping after 280 iterations with delta=5.0E-8. L = -128767.13967597009. Took 105 seconds.
    [a5_qc] EM-clustering insert sizes with K=16... stopping after 768 iterations with delta=5.0E-8. L = -131085.39492548717. Took 296 seconds.
    [a5_qc] EM-clustering insert sizes with K=17... stopping after 106 iterations with delta=5.0E-8. L = -146030.94750065517. Took 44 seconds.
    [a5_qc] EM-clustering insert sizes with K=18... stopping after 984 iterations with delta=5.0E-8. L = -133506.33790708426. Took 414 seconds.
    [a5_qc] EM-clustering insert sizes with K=19... stopping after 776 iterations with delta=5.0E-8. L = -137479.52186023252. Took 316 seconds.
    [a5_qc] EM-clustering insert sizes with K=20... stopping after 1000 iterations with delta=5.0E-8. L = -141556.22416973184. Took 419 seconds.
    [a5_qc] EM-clustering insert sizes with K=21... stopping after 1000 iterations with delta=5.0E-8. L = -143044.80630908583. Took 531 seconds.
    [a5_qc] Found 1 clusters.
    [a5_qc] Found the following clusters:
    [a5_qc] cluster1: mu=159       sd=24        n=47697     perc=80.02       (signal)
    [a5_qc] cluster0: mu=180       sd=283       n=11907     perc=19.98       (noise)
    [a5_qc] Removing  cluster1
    [a5_qc] Final stats for sample after filtering: mu=180 sd=283 n=11907
    [a5_qc] Filtering read pairs with inserts between 1-318
    [a5_qc] Reading SAM file.....10%..20%..30%..40%..50%..60%..70%..80%..90%..100%... done!... Took 60 seconds.
    [a5_qc] Keeping 0.42% (18353/4405428) of reads.
    [a5_qc] parameters:
            P                   = 0.062
            MIN_BLOCK_LEN       = 284
            MEAN_BLOCK_LEN      = 159
            MAX_BLOCK_LEN       = 303
            MAX_INTERBLOCK_DIST = 318
            MAX_INTERPOINT_DIST = 142
            EPSILON             = 142.0
            MIN_POINTS          = 17
    [a5_qc] Found 1 initial blocks between contigs 3233 and 5105
            548-680 <-> 22-230
    [a5_qc] Found 1 initial blocks between contigs 147 and 640
            1-212 <-> 3-224

    ...

    [a5_qc] Found 0 blocks on contig 3481
    [a5_qc] Found 0 blocks on contig 2595
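
For anyone triaging a similar log, a short sketch that tabulates the EM-clustering lines may help spot runs worth a closer look, such as those reporting `L = NaN`. It assumes the line format shown in the log above. The `ITERATION_CAP` value of 1000 is a guess based only on the K=20 and K=21 runs both stopping at exactly 1000 iterations; it is not a documented A5qc setting.

```python
import math
import re

# Pattern written against the EM-clustering lines in the log above.
EM_RE = re.compile(
    r"EM-clustering insert sizes with K=(\d+)\.\.\. "
    r"stopping after (\d+) iterations with delta=\S+ L = (\S+)\. Took"
)

# Assumption: 1000 looks like an iteration limit, judging from the log alone.
ITERATION_CAP = 1000


def parse_em_runs(log_text):
    """Return one (K, iterations, log-likelihood) tuple per EM-clustering line."""
    runs = []
    for m in EM_RE.finditer(log_text):
        k, iterations, likelihood = m.groups()
        runs.append((int(k), int(iterations), float(likelihood)))
    return runs


# Three sample lines copied from the log above.
SAMPLE_LOG = """\
[a5_qc] EM-clustering insert sizes with K=2... stopping after 5 iterations with delta=5.0E-8. L = NaN. Took 0 seconds.
[a5_qc] EM-clustering insert sizes with K=5... stopping after 24 iterations with delta=5.0E-8. L = -73608.67790434555. Took 5 seconds.
[a5_qc] EM-clustering insert sizes with K=20... stopping after 1000 iterations with delta=5.0E-8. L = -141556.22416973184. Took 419 seconds.
"""

runs = parse_em_runs(SAMPLE_LOG)
nan_ks = [k for k, _, likelihood in runs if math.isnan(likelihood)]
capped_ks = [k for k, iterations, _ in runs if iterations >= ITERATION_CAP]

print("K values with NaN likelihood:", nan_ks)
print("K values that reached the assumed iteration cap:", capped_ks)
```
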

     
  • Anonymous

    Anonymous - 2013-07-02

    Originally posted by: jlklas...@wisc.edu

    Sorry, forgot to add: I am using ngopt_a5pipeline_linux-x64_20120518 on Ubuntu 12 Linux.

     
