From: mathog <ma...@ca...> - 2015-01-20 20:37:18
(This is a followup to: Re: [wgs-assembler-users] mer, mertrim running
single threaded on large SMP machine)

On 19-Jan-2015 18:52, Brian Walenz wrote:
> I didn't poke through the data much, just enough to see it was Illumina.
> My immediate reaction is to suggest trying masurca. It handles illumina
> much much better than plain CA, but does probably require more reads
> because more crap gets filtered out.

Will look into that. Also found Meraculous, also for Illumina. (So many
assemblers, so little time...)

> With your current assembly, I see two things I don't like: 1) bog instead
> of bogart, 2) 3% error rate.
>
> You can do some experiments with the current assembly without too much
> pain. All we're going to do is run bogart a few times, and look at the
> resulting unitigs. No consensus generation, just unitig layouts.
>
> On a COPY of the gkpStore, run
>
> gatekeeper --revertclear OBTCHIMERA *gkpStore

Did this:

cp -r ..gkpStore copygkpStore
cp ..gkpStore.err copygkpStore.err
cp ..gkpStore.errorLog copygkpStore.errorLog
cp ..gkpStore.fastqUIDmap copygkpStore.fastqUIDmap
cp ..gkpStore.info copygkpStore.info
export PATH=$PATH:/home/wgs_project/wgs/Linux-amd64/bin
# gatekeeper --revertclear OBTCHIMERA copygkpStore

> This will restore the clear ranges to the state they had just after
> trimming, and just before unitigging.
>
> Then a bunch of iterations of bogart:
>
> bogart -G *.gkpStore -O *.ovlStore -T e10.tigStore -o test.bogart -eg 0.10 -Eg 2.5 -em 0.10 -Em 2.5
>
> Where the eg and em parameter is varied between 2 and 6 (percent error).
> By default, overlaps are generated to only 6% error, not that higher would
> be feasible with short reads. The Eg and Em parameters measure overlap
> error as 'number of errors', to get around the problem of a 50-base overlap
> with one error resulting in 2% error. You can mostly ignore this for the
> higher error rates.

Sorry, the wild card in that line is throwing me.
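
The quoted remark about a 50-base overlap with one error coming out at 2% is plain arithmetic, and it may help to see how quickly a single mismatch shrinks as the overlap grows. The sketch below is only an illustration: the function name `one_error_percent` is made up here, and the 100- and 150-base lengths are example values, not figures from this assembly.

```shell
# Percent error that a single mismatch represents for a given overlap length.
# one_error_percent is a hypothetical helper, not part of the assembler.
one_error_percent() {
    awk -v len="$1" 'BEGIN { printf "%.2f\n", 100 / len }'
}

# 50 is the overlap length from the quoted text; 100 and 150 are illustrative.
for len in 50 100 150; do
    echo "${len}-base overlap, 1 error: $(one_error_percent "$len")%"
done
```

For short overlaps one mismatch alone can exceed a low percent-error cutoff, which appears to be the reason given above for also having a limit expressed as a number of errors.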
Also I'm confused if you mean big Eg,Em (where 2.5 is in the range
specified) or little eg,em (where values are not in that range). Given what
I called the copy, is this what you want to run?

VAL=2.5 #2.5 percent
bogart -G copygkpStore -O copyovlStore -T e10.tigStore -o test.bogart \
  -eg 0.10 -Eg $VAL -em 0.10 -Em $VAL
tigStore -g copygkpStore -t e10.tigStore 1 -U -d sizes -s 800000000
VAL=3.0 #3.0 percent
bogart -G copygkpStore -O copyovlStore -T e10.tigStore -o test.bogart \
  -eg 0.10 -Eg $VAL -em 0.10 -Em $VAL
tigStore -g copygkpStore -t e10.tigStore 1 -U -d sizes -s 800000000
# etc.

The bogart command fails because "'copyovlStore' is not an ovelrapStore".
Use the overlapStore from the first run in that command? (note the typo in
the error message, that's what it says)

Erase the e10.tigStore between runs?
Do something to the overlapStore between runs?

running tigStore on the original (not so useful) run gave this:

tigStore -g ..gkpStore -t ..tigStore 1 -U -d sizes -s copygkpStore.info
utgLenUnassigned n10 siz 528 sum 304316578 idx 479977
utgLenUnassigned n20 siz 400 sum 608633078 idx 1148939
utgLenUnassigned n30 siz 291 sum 912949618 idx 2026098
utgLenUnassigned n40 siz 179 sum 1217266213 idx 3353557
utgLenUnassigned n50 siz 150 sum 1521582630 idx 5307416
utgLenUnassigned n60 siz 145 sum 1825899170 idx 7367619
utgLenUnassigned n70 siz 126 sum 2130215760 idx 9584603
utgLenUnassigned n80 siz 122 sum 2434532234 idx 12033900
utgLenUnassigned n90 siz 102 sum 2738848751 idx 14689647
utgLenUnassigned sum 3043165239 (genomeSize 0)
utgLenUnassigned num 18384123
utgLenUnassigned ave 165
tigLenSingleton n10 siz 150 sum 142617831 idx 907450
tigLenSingleton n20 siz 148 sum 285235697 idx 1865321
tigLenSingleton n30 siz 145 sum 427853436 idx 2837943
tigLenSingleton n40 siz 134 sum 570471289 idx 3850926
tigLenSingleton n50 siz 125 sum 713089018 idx 4969720
tigLenSingleton n60 siz 123 sum 855706883 idx 6116341
tigLenSingleton n70 siz 121 sum 998324590 idx 7282617
tigLenSingleton n80 siz 108 sum 1140942414 idx 8518814
tigLenSingleton n90 siz 87 sum 1283560221 idx 9981733
tigLenSingleton sum 1426177984 (genomeSize 0)
tigLenSingleton num 11893391
tigLenSingleton ave 119
tigLenAssembled n10 siz 630 sum 161699171 idx 231237
tigLenAssembled n20 siz 517 sum 323397821 idx 516513
tigLenAssembled n30 siz 443 sum 485096301 idx 855316
tigLenAssembled n40 siz 389 sum 646795227 idx 1245703
tigLenAssembled n50 siz 335 sum 808493956 idx 1690952
tigLenAssembled n60 siz 266 sum 970192570 idx 2232349
tigLenAssembled n70 siz 205 sum 1131891234 idx 2921817
tigLenAssembled n80 siz 157 sum 1293589836 idx 3836637
tigLenAssembled n90 siz 136 sum 1455288608 idx 4933675
tigLenAssembled sum 1616987255 (genomeSize 0)
tigLenAssembled num 6490732
tigLenAssembled ave 249

Presumably we want to see many more of the tigLenAssembled and fewer of the
utgLenUnassigned and tigLenSingleton.

Thanks,

David Mathog
ma...@ca...
Manager, Sequence Analysis Facility, Biology Division, Caltech
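
Since the plan is to compare several bogart runs, it may be easier to boil each size report down to one line per category rather than reading the full table every time. The following is a rough sketch, assuming the report has the same field layout as the output in this message; `summarize_sizes` is a hypothetical helper name, and the sample input is copied from the n50, sum and num lines of the report.

```shell
# Reduce a "-d sizes" style report to one line per category:
# the n50 size, the total bases (sum) and the number of tigs (num).
# summarize_sizes is a hypothetical helper; it reads the report on stdin.
summarize_sizes() {
    awk '
        $2 == "n50" { n50[$1]   = $4 }   # e.g. "tigLenAssembled n50 siz 335 ..."
        $2 == "sum" { total[$1] = $3 }   # e.g. "tigLenAssembled sum 1616987255 ..."
        $2 == "num" { count[$1] = $3 }   # e.g. "tigLenAssembled num 6490732"
        END {
            for (name in total)
                printf "%s n50=%s bases=%s count=%s\n", name, n50[name], total[name], count[name]
        }
    ' | sort
}

# Sample input: selected lines from the report in this message.
summarize_sizes <<'EOF'
utgLenUnassigned n50 siz 150 sum 1521582630 idx 5307416
utgLenUnassigned sum 3043165239 (genomeSize 0)
utgLenUnassigned num 18384123
tigLenSingleton n50 siz 125 sum 713089018 idx 4969720
tigLenSingleton sum 1426177984 (genomeSize 0)
tigLenSingleton num 11893391
tigLenAssembled n50 siz 335 sum 808493956 idx 1690952
tigLenAssembled sum 1616987255 (genomeSize 0)
tigLenAssembled num 6490732
EOF
```

If the report is written to standard output (an assumption, not checked here), the tigStore command could be piped straight into the function, and the one-line summaries from each error-rate setting compared side by side. The values are printed as strings so the large base counts are passed through unchanged.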