Originally created by: cadan...@mail.usf.edu
What steps will reproduce the problem?
Trying to assemble a bacterial genome using raw Illumina PE data generated from TruSeq libraries.
perl a5_pipeline.pl ~/ruegeria_genome_0413/R1_001.fastq ~/ruegeria_genome_0413/R2_001.fastq 0413_genome.out
What is the expected output? What do you see instead?
Using the command above, I get only the .s1-.s4 files and not a fifth file, which should contain the final assembly stats. I also expected fewer scaffolds, but I get 1,800+. The output below reports "PE Mode: 0", which suggests that paired-end mode is not being used.
preprocess: WARNING - it is suggested that the min read length is 40
preprocess: Using very short reads may considerably impact the performance
Parameters:
QualTrim: 10
QualFilter: at most 20 low quality bases
HardClip: 0
Min length: 29
Sample freq: 1
PE Mode: 0
Quality scaling: 2
MinGC: 0
MaxGC: 1
Outfile: stdout
[samopen] SAM header is present: 1860 sequences.
[bam_sort_core] merging from 12 files...
[a5] java -Xmx15912m -jar A5qc.jar test_ruegeria_genomev2.out.s4/test_ruegeria_genomev2.out.qc.libraw1.sam test_ruegeria_genomev2.out.crude.scaffolds.fasta t1 > test_ruegeria_genomev2.out.s4/test_ruegeria_genomev2.out.qc.libraw1.qc.out
[a5_s5] No misassemblies found.
[a5] Final assembly in test_ruegeria_genomev2.out.final.scaffolds.fasta
What version of the product are you using? On what operating system?
ngopt_a5pipeline_linux-x64_20120518
Please provide any additional information below.
Please help to resolve these issues. Thanks!
Originally posted by: jlklas...@wisc.edu
I have a similar problem: the a5_s5 step is skipped because no misassemblies are found (which, judging from the .pl script, is what it is supposed to do in that case).
What concerns me more is the behaviour of a5_qc. According to the a5_s3 .raw1.summaryfile.txt summary file, my PE library has an insert size of 160 +/- 152. The s4 estimate agrees, but a5_qc then seems to remove the reads with those characteristics and to use a minuscule subset of the data for correction:
[a5_qc] Found the following clusters:
[a5_qc] cluster1: mu=159 sd=24 n=47697 perc=80.02 (signal)
[a5_qc] cluster0: mu=180 sd=283 n=11907 perc=19.98 (noise)
[a5_qc] Removing cluster1
[a5_qc] Final stats for sample after filtering: mu=180 sd=283 n=11907
[a5_qc] Filtering read pairs with inserts between 1-318
[a5_qc] Reading SAM file.....10%..20%..30%..40%..50%..60%..70%..80%..90%..100%... done!... Took 60 seconds.
[a5_qc] Keeping 0.42% (18353/4405428) of reads.
This seems to me a possible reason why I am not getting misassembly detection: cluster1 is removed instead of cluster0 (i.e., the noise is kept, not the signal). Or am I way off and just not understanding how this step works?
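As a sanity check on the numbers quoted above, here is a minimal Python sketch that recomputes the percentages from the counts in the log. It is not part of A5; the variable names are made up for illustration, and the only inputs are the counts copied from the `[a5_qc]` lines.

```python
# Counts copied from the [a5_qc] lines quoted above.
cluster1_n = 47697     # cluster labelled "(signal)"
cluster0_n = 11907     # cluster labelled "(noise)"
kept_reads = 18353     # numerator of "Keeping 0.42% (18353/4405428)"
total_reads = 4405428  # denominator of the same line

# Share of each cluster among the read pairs that were clustered.
clustered = cluster1_n + cluster0_n
signal_pct = round(100 * cluster1_n / clustered, 2)
noise_pct = round(100 * cluster0_n / clustered, 2)

# Share of all reads that survive the insert-size filter.
kept_pct = round(100 * kept_reads / total_reads, 2)

print(signal_pct, noise_pct, kept_pct)  # 80.02 19.98 0.42
```

The recomputed values match the log, so the percentages themselves are consistent; the open question is only why the cluster labelled "signal" is the one being removed.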
The complete .qc.libraw1.qc.out is below, with most of the block-finding output omitted:
[a5_qc] Reading run336_good.s4/run336_good.qc.libraw1.sam
[a5_qc] Found 6163 contigs
[a5_qc] Reading in a subset of reads for insert size estimation.
[a5_qc] Took 3 seconds to read in 100000 read pairs.
[a5_qc] Found a substantial amount of innies, but found no outties.
[a5_qc] EM-clustering insert sizes with K=2... stopping after 5 iterations with delta=5.0E-8. L = NaN. Took 0 seconds.
[a5_qc] EM-clustering insert sizes with K=3... stopping after 9 iterations with delta=5.0E-8. L = NaN. Took 2 seconds.
[a5_qc] EM-clustering insert sizes with K=4... stopping after 6 iterations with delta=5.0E-8. L = NaN. Took 1 seconds.
[a5_qc] EM-clustering insert sizes with K=5... stopping after 24 iterations with delta=5.0E-8. L = -73608.67790434555. Took 5 seconds.
[a5_qc] EM-clustering insert sizes with K=6... stopping after 26 iterations with delta=5.0E-8. L = -87322.21447382745. Took 6 seconds.
[a5_qc] EM-clustering insert sizes with K=7... stopping after 53 iterations with delta=5.0E-8. L = -90065.26303883643. Took 15 seconds.
[a5_qc] EM-clustering insert sizes with K=8... stopping after 48 iterations with delta=5.0E-8. L = -99419.94083752079. Took 16 seconds.
[a5_qc] EM-clustering insert sizes with K=9... stopping after 78 iterations with delta=5.0E-8. L = -110666.50469422666. Took 27 seconds.
[a5_qc] EM-clustering insert sizes with K=10... stopping after 97 iterations with delta=5.0E-8. L = -110737.19307312598. Took 36 seconds.
[a5_qc] EM-clustering insert sizes with K=11... stopping after 215 iterations with delta=5.0E-8. L = -115798.31962673985. Took 84 seconds.
[a5_qc] EM-clustering insert sizes with K=12... stopping after 40 iterations with delta=5.0E-8. L = -128587.19678705752. Took 15 seconds.
[a5_qc] EM-clustering insert sizes with K=13... stopping after 158 iterations with delta=5.0E-8. L = -120193.90440601777. Took 56 seconds.
[a5_qc] EM-clustering insert sizes with K=14... stopping after 95 iterations with delta=5.0E-8. L = -124348.34369786637. Took 34 seconds.
[a5_qc] EM-clustering insert sizes with K=15... stopping after 280 iterations with delta=5.0E-8. L = -128767.13967597009. Took 105 seconds.
[a5_qc] EM-clustering insert sizes with K=16... stopping after 768 iterations with delta=5.0E-8. L = -131085.39492548717. Took 296 seconds.
[a5_qc] EM-clustering insert sizes with K=17... stopping after 106 iterations with delta=5.0E-8. L = -146030.94750065517. Took 44 seconds.
[a5_qc] EM-clustering insert sizes with K=18... stopping after 984 iterations with delta=5.0E-8. L = -133506.33790708426. Took 414 seconds.
[a5_qc] EM-clustering insert sizes with K=19... stopping after 776 iterations with delta=5.0E-8. L = -137479.52186023252. Took 316 seconds.
[a5_qc] EM-clustering insert sizes with K=20... stopping after 1000 iterations with delta=5.0E-8. L = -141556.22416973184. Took 419 seconds.
[a5_qc] EM-clustering insert sizes with K=21... stopping after 1000 iterations with delta=5.0E-8. L = -143044.80630908583. Took 531 seconds.
[a5_qc] Found 1 clusters.
[a5_qc] Found the following clusters:
[a5_qc] cluster1: mu=159 sd=24 n=47697 perc=80.02 (signal)
[a5_qc] cluster0: mu=180 sd=283 n=11907 perc=19.98 (noise)
[a5_qc] Removing cluster1
[a5_qc] Final stats for sample after filtering: mu=180 sd=283 n=11907
[a5_qc] Filtering read pairs with inserts between 1-318
[a5_qc] Reading SAM file.....10%..20%..30%..40%..50%..60%..70%..80%..90%..100%... done!... Took 60 seconds.
[a5_qc] Keeping 0.42% (18353/4405428) of reads.
[a5_qc] parameters:
P = 0.062
MIN_BLOCK_LEN = 284
MEAN_BLOCK_LEN = 159
MAX_BLOCK_LEN = 303
MAX_INTERBLOCK_DIST = 318
MAX_INTERPOINT_DIST = 142
EPSILON = 142.0
MIN_POINTS = 17
[a5_qc] Found 1 initial blocks between contigs 3233 and 5105
548-680 <-> 22-230
[a5_qc] Found 1 initial blocks between contigs 147 and 640
1-212 <-> 3-224
...
[a5_qc] Found 0 blocks on contig 3481
[a5_qc] Found 0 blocks on contig 2595
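For anyone who wants to check their own qc.out for the same pattern, the following is a rough Python sketch that scans the log text for clusters labelled "signal" that are subsequently removed. It is not part of A5: the function name `removed_signal_clusters` and the regular expressions are assumptions based only on the log lines shown in this post, so other A5 versions may format these lines differently.

```python
import re

# Lines copied from the qc.out excerpt in this post.
LOG = """\
[a5_qc] cluster1: mu=159 sd=24 n=47697 perc=80.02 (signal)
[a5_qc] cluster0: mu=180 sd=283 n=11907 perc=19.98 (noise)
[a5_qc] Removing cluster1
"""

# Patterns inferred from the lines above (an assumption, not a documented format).
CLUSTER_RE = re.compile(
    r"\[a5_qc\] (cluster\d+): mu=(\d+) sd=(\d+) n=(\d+) perc=([\d.]+) \((\w+)\)"
)
REMOVE_RE = re.compile(r"\[a5_qc\] Removing (cluster\d+)")


def removed_signal_clusters(log_text):
    """Return names of clusters labelled 'signal' that the log says were removed."""
    labels = {}   # cluster name -> "signal" or "noise"
    removed = []  # cluster names from "Removing ..." lines, in order
    for line in log_text.splitlines():
        m = CLUSTER_RE.match(line)
        if m:
            labels[m.group(1)] = m.group(6)
            continue
        m = REMOVE_RE.match(line)
        if m:
            removed.append(m.group(1))
    return [name for name in removed if labels.get(name) == "signal"]


print(removed_signal_clusters(LOG))  # ['cluster1']
```

To run it on a real file, replace `LOG` with the contents of the .qc.libraw1.qc.out file; an empty list would mean no "signal" cluster was reported as removed.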
Originally posted by: jlklas...@wisc.edu
Sorry, forgot to add: I am using ngopt_a5pipeline_linux-x64_20120518 on Ubuntu 12 Linux.