From: Walenz, B. <bw...@jc...> - 2013-08-16 03:15:30
|
Hi, Geoff, Heiner- Lots of questions here, I'll try to (finally) answer them. The unitig splitter is looking for a pattern of low sequence coverage, no good mate coverage and bad mate coverage. Bad meaning misoriented, or mate read not present where it should be. The intent was to find bad joins caused by chimeric reads. With long reads present, repeats internal to the read can have either zero short read coverage (because it's a common repeat excluded from overlaps) OR can have the wrong short read placed there (because it's in a common repeat with multiple equally good placements). I can certainly see, with low long read coverage, lots and lots of incorrect splitting being done. The splitter works by marking a region in the unitig as being the chimeric junction, then moving all reads that touch that region to a new 'bad' unitig. Heiner, this is likely why you end up with long singleton reads - nothing else touched the region. You can disable splitting by creating an empty file '5-consensus-split/splitUnitigs.out'. I also just added a 'doUnitigSplitting' option to revision r4387. 'doFragmentCorrection' is a misnomer. It actually changes the error rate on overlaps, but doesn't change the reads at all. It seems to help in general, but can be quite expensive to run. We turn it off on big assemblies because of this. Possibly, increasing the various unitig error rates slightly will compensate. The degen issue is much harder, and we probably should write more on it. There are two parameters to directly influence this (astatLowBound=1 and astatHighBound=5). We've set the low bound to -20 for a very repetitive genome in the past. Logs for this are in 5-consensus-coverage-stat - scatter plotting rho (basically length) against covStat might help pick a reasonable threshold. https://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=RunCA#S caffolder If the genome isn't huge, you can also run the 'toggler'. This will run an assembly to completion, then analyze unitigs and force a change to their repeat/unique label, and rerun ALL of scaffolding. https://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=RunCA#U nitig_Repeat.2FUnique_Toggling You could also, with some effort, combine all these steps. Analyze the logs in 5-consensus-coverage-stat, decide which unitigs need to be promoted to unique (which unitigs need to have a low covStat modified to a high covStat) and set the appropriate flag in the tigStore (by dumping the layout, modifying, and loading it back). b On 8/10/13 6:08 AM, "kuhl" <ku...@mo...> wrote: > Hi, > > I am running into similar problems, when combining very long read assembly > (65000 bp reads which are assembled contigs from a de bruijn graph > assembler) with short read paired end and matepair data. The assembly gets > messed up by unitig splitting, some of the long reads even end up as > singletons. Other contigs end up as degenerates although they are very > large (>100 kbp) and as far as I can tell they are unique genomic sequences > (not repeats od duplications). Is there a workaround for these problems, or > should one not mix very long reads and very short read data? > > Best wishes, > > Heiner > > > On Fri, 9 Aug 2013 20:08:50 +0000, "Waldbieser, Geoff" > <Geo...@AR...> wrote: >> Hi, >> CA v7.0 (build assembled 3.3M pacBioToCA-corrected long reads into > 146,537 >> unitigs (~1Gb) in 24 hrs using 62cpu and 512GB RAM. >> >> $> /home/software/wgs1/Linux-amd64/bin/runCA -version >> CA version CVS TIP ($Id: AS_GKP_main.c,v 1.105 2012-09-13 17:41:13 > skoren >> Exp $). >> CA version CVS TIP ($Id: AS_CGB_unitigger.c,v 1.45 2011-09-06 02:15:18 >> mkotelbajcvi Exp $). >> CA version CVS TIP ($Id: BuildUnitigs.cc,v 1.88 2012-01-15 23:49:34 >> brianwalenz Exp $). >> Using up to 64 OpenMP threads. >> CA version CVS TIP ($Id: AS_CGW_main.c,v 1.116 2012-11-15 05:04:54 >> brianwalenz Exp $). >> CA version CVS TIP ($Id: terminator.C,v 1.17 2012-09-10 08:58:11 >> brianwalenz Exp $). >> >> I then added Illumina mate pairs that had been corrected, deduplicated, >> and chimeric reads removed. >> Library >> Pairs >> Illumina MP, 3kb insert 4,455,475 >> Illumina MP, 8kb insert 2,834,858 >> Illumina MP, 36kb insert 1,222,830 >> >> When these were added, CA spent a few days in splitUnitigs before > failing. >> >> ------------------- >> $> tail -60 splitUnitigs.out.FAILED >> Creating new unitig 8643669 with 28 fragments >> unitig 2478 interval 0 0,920 good >> unitig 2478 interval 1 920,1123 bad >> unitig 2478 interval 2 1124,1428 good >> Fixing contains. >> prev 1143,1230 -- 16110937 1220,1143 (no overlap to new 1193,1413) >> prev 1143,1230 -- 11823198 1220,1143 (no overlap to new 1193,1413) >> prev 1143,1230 -- 14595211 1217,1143 (no overlap to new 1193,1413) >> prev 1143,1230 -- 13147242 1213,1143 (no overlap to new 1193,1413) >> prev 1143,1230 -- 22697521 1230,1145 (no overlap to new 1193,1413) prev >> 1143,1230 -- 4687598 1145,1220 (no overlap to new 1193,1413) >> Creating new unitig 8643670 with 24 fragments >> Creating new unitig 8643671 with 8 fragments >> Creating new unitig 8643672 with 1 fragments >> unitig 2494 interval 0 0,16451 good >> unitig 2494 interval 1 16451,16502 bad >> unitig 2494 interval 2 16503,48319 good >> Creating new unitig 8643673 with 265 fragments >> Creating new unitig 8643674 with 2 fragments >> Creating new unitig 8643675 with 497 fragments >> unitig 2504 interval 0 0,18635 good >> unitig 2504 interval 1 18635,18769 bad >> unitig 2504 interval 2 18770,26279 good >> Creating new unitig 8643676 with 295 fragments >> Creating new unitig 8643677 with 2 fragments >> splitUnitigs: MultiAlignUnitig.C:469: int >> unitigConsensus::computePositionFromParent(bool): Assertion >> `cnspos[tiid].bgn < cnspos[tiid].end' failed. >> >> Failed with 'Aborted' >> >> Backtrace (mangled): >> >> > /home/software/wgs1/Linux-amd64/bin/splitUnitigs(_Z17AS_UTL_catchCrashiP7sigin > foPv+0x23)[0x410a13] >> /lib64/libpthread.so.0(+0xfd00)[0x7f55fe5bdd00] >> /lib64/libc.so.6(gsignal+0x35)[0x7f55fe252d95] >> /lib64/libc.so.6(abort+0x17b)[0x7f55fe2542ab] >> /lib64/libc.so.6(+0x2d8fe)[0x7f55fe24b8fe] >> /lib64/libc.so.6(+0x2d9a2)[0x7f55fe24b9a2] >> /home/software/wgs1/Linux-amd64/bin/splitUnitigs[0x42bf9f] >> > /home/software/wgs1/Linux-amd64/bin/splitUnitigs(_Z16MultiAlignUnitigP11MultiA > lignTP7gkStoreP11CNS_OptionsPi+0xf0)[0x42fa10] >> /home/software/wgs1/Linux-amd64/bin/splitUnitigs(main+0x28bd)[0x40c78d] >> /lib64/libc.so.6(__libc_start_main+0xed)[0x7f55fe23f23d] >> /home/software/wgs1/Linux-amd64/bin/splitUnitigs[0x40cbf9] >> >> Backtrace (demangled): >> >> [0] >> /home/software/wgs1/Linux-amd64/bin/splitUnitigs::AS_UTL_catchCrash(int, >> siginfo*, void*) + 0x23 [0x410a13] >> [1] /lib64/libpthread.so.0::(null) + 0xfd00 [0x7f55fe5bdd00] >> [2] /lib64/libc.so.6::(null) + 0x35 [0x7f55fe252d95] >> [3] /lib64/libc.so.6::(null) + 0x17b [0x7f55fe2542ab] >> [4] /lib64/libc.so.6::(null) + 0x2d8fe [0x7f55fe24b8fe] >> [5] /lib64/libc.so.6::(null) + 0x2d9a2 [0x7f55fe24b9a2] >> [6] /home/software/wgs1/Linux-amd64/bin/splitUnitigs() [0x42bf9f] >> [7] >> > /home/software/wgs1/Linux-amd64/bin/splitUnitigs::MultiAlignUnitig(MultiAlignT > *, >> gkStore*, CNS_Options*, int*) + 0xf0 [0x42fa10] >> [8] /home/software/wgs1/Linux-amd64/bin/splitUnitigs::(null) + 0x28bd >> [0x40c78d] >> [9] /lib64/libc.so.6::(null) + 0xed [0x7f55fe23f23d] >> [10] /home/software/wgs1/Linux-amd64/bin/splitUnitigs() [0x40cbf9] >> >> GDB: >> ------------------ >> >> After seeing that someone else had a splitUnitig problem, I installed >> build 4371 and restarted. So far it has run splitUnitigs for 24 hrs and > it >> is currently working on unitig 35551 out of 9254397. >> >> The Illumina jump read parameters are: >> forceBOGunitigger=0 >> isNotRandom=0 >> doNotTrustHomopolymerRuns=0 >> doTrim_initialNone=0 >> doTrim_initialMerBased=0 >> doTrim_initialFlowBased=0 >> doTrim_initialQualityBased=0 >> doRemoveDuplicateReads=0 >> doTrim_finalLargestCovered=1 >> doTrim_finalEvidenceBased=0 >> doTrim_finalBestEdge=0 >> doRemoveSpurReads=1 >> doRemoveChimericReads=1 >> doConsensusCorrection=1 >> forceShortReadFormat=1 >> constantInsertSize=0 >> fastqQualityValues=sanger >> fastqOrientation=innie >> >> The corrPacBio parameters are: >> forceBOGunitigger=0 >> isNotRandom=0 >> doNotTrustHomopolymerRuns=0 >> doTrim_initialNone=0 >> doTrim_initialMerBased=0 >> doTrim_initialFlowBased=0 >> doTrim_initialQualityBased=0 >> doRemoveDuplicateReads=0 >> doTrim_finalLargestCovered=0 >> doTrim_finalEvidenceBased=1 >> doRemoveSpurReads=1 >> doRemoveChimericReads=1 >> doConsensusCorrection=1 >> forceShortReadFormat=0 >> constantInsertSize=0 >> fastqQualityValues=sanger >> fastqOrientation=innie >> >> >> I have declared bogart as the unitigger. I also set >> "doFragmentCorrection=0" and "doOverlapBasedTrimming = 0" because the >> Illumina data had already been cleaned and I assumed the Illumina >> correction of the PacBio reads was an error correction. Is this leading > to >> false joins that the unitigger is identifying and having to correct by >> splitting? >> >> Thanks for any input. >> Geoff >> ___________________________________ >> Geoffrey C. Waldbieser >> Research Molecular Biologist >> Warmwater Aquaculture Research Unit >> Agricultural Research Service >> United States Department of Agriculture >> Stoneville, MS 38776 >> (662) 686-3593 >> >> >> >> >> >> This electronic message contains information generated by the USDA > solely >> for the intended recipients. Any unauthorized interception of this > message >> or the use or disclosure of the information it contains may violate the > law >> and subject the violator to civil or criminal penalties. If you believe > you >> have received this message in error, please notify the sender and delete >> the email immediately. |