From: kuhl <ku...@mo...> - 2014-07-25 16:18:46
Dear Brian, just a comment, would batRebuildRepeats = 1 batMateExtension = 1 help with this issue? I am also running long reads (~4000 bp) with short reads and found this to be helping with some issues I had with cgw. Anyway, I never could use the full memory with bogart with these parameters, because it crashed in step 10. I had to limit bogart to 100Gb RAM (on 2-3 Gbp vertebrate genomes). And then it worked. The result was lower N50 unitigs, but this was solved by cgw. Regarding missassemblies in scaffolds, I also find a lot, which are actually limiting the final N50 and are forcing me to do a lot of manual final polishing of the assemblies (splitting / rescaffolding / gap closing again etc). If I set "doUnitigSplitting = 1" it helps, but is there any way to speed this up, like doing the unitig splitting on partitions in parallel? Seems there is still no perfect solution for hybrid data assemblies.... Heiner On Fri, 25 Jul 2014 15:23:47 +0000, "Waldbieser, Geoff" <Geo...@AR...> wrote: > So in this case adding the Illumina PE reads would not have helped? > Is the graph trying to detangle or is it likely to be a mess that needs to > be axed now? > > > From: Brian Walenz [mailto:th...@gm...] > Sent: Friday, July 25, 2014 8:11 AM > To: Waldbieser, Geoff > Subject: Re: [wgs-assembler-users] Does scaffolding scale with available > RAM? > > Sorry, I owe you a few replies. I switched jobs, and now can't read gmail > at work, or work at home. > It's not that the pacbio assembled through repeats, but that the pacbio > reads themselves get through (larger) repeats. Without the pacbio, bogart > will detect the repeat, notice that no read spans it, and excise it from > the unitig. With the pacbio, bogart again detects the repeat, but now that > a read spans it, the repeat is left in the unitig. > That would be great, except that the repeat illumina mates are now a total > mess. 
> With just illumina, the repeats are isolated to short unitigs, and only those mates are a mess, but the scaffolder was designed to handle this case. With the longer repeats included in longer unitigs, and illumina mates placed incorrectly in those, the scaffold graph is a mess.
>
> E.g.,
> unitig1: unique1-repeatA-unique2
> unitig2: unique3-repeatB-unique4 (where repeatA and repeatB are related)
> It is possible to get a mate between repeatA and unique4, when really it should be in repeatB.
>
> Your pacbio-only assembly was from correction of the pacbio with illumina? I'm surprised it was that bad.
>
> On Mon, Jul 21, 2014 at 6:32 PM, Waldbieser, Geoff <Geo...@ar...> wrote:
> First of all, thanks for saving us $100k on a high-mem server.
>
> When I mapped BAC end sequences to the Illumina-only assembly (MaSuRCA-2.2.0), the avg insert length of contained mates was 165 kb, which was on the dot for that BAC library. When I mapped to the PacBio-only assembly, the insert sizes were in the 30 kb range, so I knew something was wrong. That would support your idea of assembling through repeats, and perhaps through the wrong repeats. So I thought including the Illumina mate pairs might help the PacBio assembly, but apparently the MPs just made it more convoluted.
>
> Aleksey had suggested not using the PacBio at all for assembly, just for gap closure. Maybe it's time to pull the plug on this one, maybe shred the PacBio reads to overlapping 2 kb lengths to use in MaSuRCA. But then again it could end soon (I tell myself every day). Is there a reasonable way to estimate how many contigs have been incorporated, and thus how many there are to go?
>
> From: Brian Walenz [mailto:th...@gm...]
> Sent: Monday, July 21, 2014 5:19 PM
> To: Waldbieser, Geoff
> Subject: Re: [wgs-assembler-users] Does scaffolding scale with available RAM?
>
> Yup, that looks like a perfectly well behaved process.
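Geoff's BAC-end sanity check above can be sketched like this (the pairs and coordinates are hypothetical; in practice they would come from BAC ends mapped to the assembly, keeping only mates contained in one contig):

```python
def implied_inserts(pairs):
    """Insert sizes implied by BAC-end mates mapped to the same contig.

    pairs: list of (left_coord, right_coord) for mate pairs whose two
           ends both landed in a single contig ("contained" mates).
    """
    return [abs(b - a) for a, b in pairs]

def median(xs):
    xs = sorted(xs)
    return xs[len(xs) // 2]

# Hypothetical mapped BAC-end pairs; the library's true insert is ~165 kb.
ok_pairs  = [(1000, 166500), (50000, 214000), (7000, 171800)]
bad_pairs = [(1000, 31000), (50000, 82000), (7000, 36500)]

print(median(implied_inserts(ok_pairs)))   # ~165 kb: consistent with the library
print(median(implied_inserts(bad_pairs)))  # ~30 kb: repeats likely collapsed
```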
> I can't explain what Linux is doing with the memory -- filesystem cache would be my guess -- but the cgw process is small and, more importantly, getting 100% CPU and using no swap.
>
> My guess is that the PacBio sequenced/assembled through repeats, and the illumina is now overlapping to the wrong repeat copy, resulting in a very messy mate graph. Compare this against an illumina-only assembly where unitigs broke at repeat boundaries: the graph is much cleaner, but possibly disjoint.
>
> I think Aleksey Zimin @ UMD had some success removing overlaps where none of the kmer seeds were 'unique', for some definition of unique. The process was rather involved: build unitigs, then decide what isn't unique (by counting kmers in the assembled unitigs), recompute overlaps, and re-unitig. I've never seen code to do it, nor the results. Just word of mouth.
>
> On Mon, Jul 21, 2014 at 9:21 AM, Waldbieser, Geoff <Geo...@ar...> wrote:
> The Bri,
>
> So for Linux halfwits like me: I look at the Mem line and see that it's using about all of the 512 GB RAM available. But then I look at the cgw command line and see that it's only using 5.7% of memory. So is that what you're talking about -- that most of the RAM is taken up in cached data, and only 5% of the memory is actually involved in the active processes of cgw?
>
> [inline screenshot of `top` output: image001.png]
>
> The PacBio-only assemblies (no scaffolds) require about 2 days to complete. The Illumina-only assemblies complete in about 2 weeks. So in the present case, when the Illumina mate pairs are added to PacBio data but Illumina PE reads are not included, is it something like the PacBio data not having the depth of coverage to identify the repetitive elements like the deep Illumina PE data did, so the Illumina mates are aligning to more repetitive sequence?
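The Zimin-style filter Brian mentions above was never published as code, so the following is purely a guess at the idea: count k-mers across the assembled unitig consensus sequences, then keep an overlap only if its seed k-mer is unique in the assembly (k and the sequences here are toy values).

```python
from collections import Counter

K = 5  # toy value; a real assembly would use a much larger k

def kmer_counts(unitigs, k=K):
    """Count every k-mer across the assembled unitig consensus sequences."""
    counts = Counter()
    for seq in unitigs:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

def keep_overlap(seed_kmer, counts):
    """Keep an overlap only if its seed k-mer is 'unique' in the assembly."""
    return counts[seed_kmer] == 1

unitigs = ["ACGTACGTTTGCA", "GGGTACGTTTCCC"]  # both contain the repeat ACGTT
counts = kmer_counts(unitigs)
print(keep_overlap("TTGCA", counts))  # True: unique seed, overlap kept
print(keep_overlap("ACGTT", counts))  # False: repeated seed, overlap discarded
```

After discarding the repeat-seeded overlaps, overlaps would be recomputed and unitigs rebuilt, as Brian describes.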
> Geoff
>
> From: Brian Walenz [mailto:th...@gm...]
> Sent: Saturday, July 19, 2014 10:40 AM
> To: Waldbieser, Geoff
> Subject: Re: [wgs-assembler-users] Does scaffolding scale with available RAM?
>
> Aye, no improvement by moving to 3 TB... assuming it's not paging on whatever tiny machine it is running on now!
>
> -recomputegaps, I think, only matters at the start of the run, and only on the later iterations. kickOutNonOvlContigs=0 is the previous default, so no trouble there. Filter level 2 was developed during our salmon assembly headache. It seemed to be as sensitive as the default, maybe a little faster, and it also decreased the 'huge gap in scaffold' problem that results in massive slowdowns and enormous (and incorrect) scaffolds.
>
> On Fri, Jul 18, 2014 at 1:38 PM, Waldbieser, Geoff <Geo...@ar...> wrote:
> Maybe I have exacerbated the slowdown by using 'cgwMergeFilterLevel=2 -recomputegaps' and 'kickOutNonOvlContigs = 0'? At least for now it seems to be avoiding the 50 Mb incorrect scaffold and the constant cycle of merging/excluding specific contigs. If it's a good assembly, then it will have been worth the time.
>
> From: Brian Walenz [mailto:th...@gm...]
> Sent: Thursday, July 17, 2014 5:29 AM
> To: Waldbieser, Geoff
> Subject: Re: [wgs-assembler-users] Does scaffolding scale with available RAM?
>
> Hi, Geoff-
>
> Sadly, no control over memory in CGW. It's already using the most it can. Most of the memory usage is for caching unitigs/contigs; if space is really tight, the cache can be turned off and they'll be loaded from disk every time. Not what you're after.
>
> Before we had a large-memory machine, I ran a ~200 GB CGW on a 128 GB machine. It ran perfectly fine. The infrequently used unitigs/contigs ended up swapped out, just as if the cache was disabled. So, unless your CGW process is much, much bigger than 512 GB, you won't gain anything.
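A quick way to check where the RAM actually is, in the spirit of Brian's diagnosis (the `$$` below is this shell's own PID, used only so the sketch runs standalone; in practice substitute the cgw PID, e.g. from `pgrep cgw`):

```shell
# "buff/cache" is reclaimable filesystem cache, not memory owned by any process
free -h

# Resident and swapped memory of one specific process
grep -E 'VmRSS|VmSwap' /proc/$$/status
```

If `free` shows the machine "full" but the process's VmRSS is small and VmSwap is zero, the memory is cache, as Brian suspected, and a bigger node will not speed cgw up.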
> There are a few options that can make significant improvements in run time. cgwMergeFilterLevel of 2 should be a little faster and not much worse. cgwMergeFilterLevel of 5 will be quite speedy, but not aggressive. cgwMinMergeWeight sets the minimum number of mates needed to attempt a scaffold join; the default is 2. This is shown in the logs. If it gets stuck doing a bunch of weight-2 merges, increasing it to 3 will help, but could sacrifice some joins.
>
> b
>
> On Wed, Jul 16, 2014 at 4:07 PM, Waldbieser, Geoff <Geo...@ar...> wrote:
> Hi Brian,
>
> I'm once again using a calendar to measure a scaffolding job (basically scaffolding PacBio reads with Illumina mate pairs). Does the scaffolding speed scale with increases in RAM? The current setup has 512 GB RAM, but if this were to run on a node with 1 TB or 2 TB RAM, would the job take half or a quarter of the time?
>
> Geoff
>
> Geoff Waldbieser
> USDA, ARS, Warmwater Aquaculture Research Unit
> 141 Experiment Station Road
> Stoneville, Mississippi 38776
> Ofc. 662-686-3593
> Fax. 662-686-3567
>
> This electronic message contains information generated by the USDA solely for the intended recipients. Any unauthorized interception of this message or the use or disclosure of the information it contains may violate the law and subject the violator to civil or criminal penalties. If you believe you have received this message in error, please notify the sender and delete the email immediately.
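Pulled together, the assembler options discussed in this thread would sit in the runCA spec file something like this (values are the ones mentioned by the participants, shown for reference, not as a recommendation):

```
# bogart/unitigger options from Heiner's message
batRebuildRepeats = 1
batMateExtension  = 1
doUnitigSplitting = 1

# cgw options from Brian's and Geoff's messages
cgwMergeFilterLevel  = 2   # faster than the default; 5 = fastest, least aggressive
cgwMinMergeWeight    = 2   # the default; raise to 3 if stuck on many weight-2 merges
kickOutNonOvlContigs = 0   # the previous default
```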
_______________________________________________
wgs-assembler-users mailing list
wgs...@li...
https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users

--
---------------------------------------------------------------
Dr. Heiner Kuhl                   MPI Molecular Genetics
Tel: +49 30 / 8413 1776           Next Generation Sequencing
Ihnestrasse 73                    email: ku...@mo...
D-14195 Berlin                    http://www.molgen.mpg.de/SeqCore
---------------------------------------------------------------