From: Jason H. <jas...@zo...> - 2014-09-02 00:01:43
Here you go, thanks! -Jason On Aug 21, 2014, at 12:56 PM, Serge Koren <se...@um...> wrote: > Hi, > > Sorry for the delayed reply, I missed your post in my email. The high heterozygosity could definitely have an effect on the throughput of the correction. I would suggest increasing the sensitivity further and not specifying -pbCNS on your command line (this consensus module is faster but less robust to higher error data and so could be negatively affected by heterozygosity). > mhap = "-k 14 --num-hashes 768 --num-min-matches 3 --threshold 0.04" > merSize = 14 > > If you could send your asm.layout.err file, I can get more information and confirm whether the low output is due to the consensus or the sensitivity parameters. > > Sergey > > On Aug 18, 2014, at 12:52 PM, Jason Hill <jas...@zo...> wrote: > >> Hello PBcR and WGS community, >> >> I’m working with what should be 100x pacbio coverage and after using PBcR I’m ending up with at best 7x - 8x of corrected reads. My initial read set is about 11million reads, with an average length of 3000bp. After error correction my best run resulted in 1.2million reads with an average length of 2000bp. My genome has a relatively high heterozygosity as a terrestrial insect. I’ve adjusted both max_coverage and increased genome size to try to account for this but see fewer and shorter reads than using the default PBcR parameters. My current run is being done with following the command spec file. I’m using the latest version of all WGS, 8.2b. >> >> ############## pacbio.spec ############# >> assemble = 0 >> localStaging = /wgs_pacbio_assembly/PBcR_self_correction/staging >> >> #faster overlapper with more sensitive settings >> mhap = "-k 16 --num-hashes 1256 --num-min-matches 3 --threshold 0.04" >> merSize = 16 >> >> #system memory parameters to avoid fraction bug >> ovlMemory = 512 >> ovlStoreMemory = 512000 >> merylMemory = 512000 >> >> #increase coverage depth to counter heterozygosity/error rate >> #usually results in less corrected reads >> maxCoverage = 60 >> >> #increase genome size to counter heterozygosity, actual genome size 350MB >> #usually results in less corrected reads >> genomeSize = 500000000 >> ##################################### >> >> $PBcR -pbCNS\ >> -length 300\ >> -partitions 65\ >> -l corrected_pb_1\ >> -t 64\ >> -s pacbio.spec\ >> -noclean\ >> -fastq pb.fastq 2>&1 | tee self_corrected_pb_1.log >> >> When looking at the corrected read lists in the temporary directory I see what appear to be deleted reads of a length I would assume would make the cut, for example: >> >>> 100003680002,3680002 mate=0,0 lib=corrected_pb_1,1 clr=LATEST,1,2219 deleted=1 >> cgtatgtaaaccaattttatactgatggggcgcgaaataacttttcttaagttccttgtgtccaaaca… continues for a total of 2219 bp. >> >> As it is, none of the overlap layout assemblers can do much with the low coverage I end up with so I’m very eager to hear ideas of how I can move this forward. Would you please take a look and let me know how you would proceed? I would be happy to supply any additional information and files. >> >> -Jason >> >> >> >> >> > |
From: Serge K. <se...@um...> - 2014-08-21 19:56:43
Hi, Sorry for the delayed reply, I missed your post in my email. The high heterozygosity could definitely have an effect on the throughput of the correction. I would suggest increasing the sensitivity further and not specifying -pbCNS on your command line (this consensus module is faster but less robust to higher error data and so could be negatively affected by heterozygosity). mhap = "-k 14 --num-hashes 768 --num-min-matches 3 --threshold 0.04" merSize = 14 If you could send your asm.layout.err file, I can get more information and confirm whether the low output is due to the consensus or the sensitivity parameters. Sergey On Aug 18, 2014, at 12:52 PM, Jason Hill <jas...@zo...> wrote: > Hello PBcR and WGS community, > > I’m working with what should be 100x pacbio coverage and after using PBcR I’m ending up with at best 7x - 8x of corrected reads. My initial read set is about 11million reads, with an average length of 3000bp. After error correction my best run resulted in 1.2million reads with an average length of 2000bp. My genome has a relatively high heterozygosity as a terrestrial insect. I’ve adjusted both max_coverage and increased genome size to try to account for this but see fewer and shorter reads than using the default PBcR parameters. My current run is being done with following the command spec file. I’m using the latest version of all WGS, 8.2b. > > ############## pacbio.spec ############# > assemble = 0 > localStaging = /wgs_pacbio_assembly/PBcR_self_correction/staging > > #faster overlapper with more sensitive settings > mhap = "-k 16 --num-hashes 1256 --num-min-matches 3 --threshold 0.04" > merSize = 16 > > #system memory parameters to avoid fraction bug > ovlMemory = 512 > ovlStoreMemory = 512000 > merylMemory = 512000 > > #increase coverage depth to counter heterozygosity/error rate > #usually results in less corrected reads > maxCoverage = 60 > > #increase genome size to counter heterozygosity, actual genome size 350MB > #usually results in less corrected reads > genomeSize = 500000000 > ##################################### > > $PBcR -pbCNS\ > -length 300\ > -partitions 65\ > -l corrected_pb_1\ > -t 64\ > -s pacbio.spec\ > -noclean\ > -fastq pb.fastq 2>&1 | tee self_corrected_pb_1.log > > When looking at the corrected read lists in the temporary directory I see what appear to be deleted reads of a length I would assume would make the cut, for example: > >> 100003680002,3680002 mate=0,0 lib=corrected_pb_1,1 clr=LATEST,1,2219 deleted=1 > cgtatgtaaaccaattttatactgatggggcgcgaaataacttttcttaagttccttgtgtccaaaca… continues for a total of 2219 bp. > > As it is, none of the overlap layout assemblers can do much with the low coverage I end up with so I’m very eager to hear ideas of how I can move this forward. Would you please take a look and let me know how you would proceed? I would be happy to supply any additional information and files. > > -Jason > > > > > |
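Concretely, the suggestion above amounts to lowering the mhap sensitivity settings in the spec file and dropping -pbCNS from the PBcR invocation. The sketch below is illustrative only, not a tested configuration: it reuses the paths and remaining option values from Jason's original post and changes nothing else.

############## pacbio.spec (adjusted per the suggestion above) #############
assemble = 0
localStaging = /wgs_pacbio_assembly/PBcR_self_correction/staging

#more sensitive overlapper settings for a highly heterozygous genome
mhap = "-k 14 --num-hashes 768 --num-min-matches 3 --threshold 0.04"
merSize = 14

#memory, maxCoverage and genomeSize settings unchanged from the original spec
#####################################

# same command as before, but without -pbCNS so the slower, more robust consensus module is used
$PBcR -length 300 -partitions 65 -l corrected_pb_1 -t 64 \
  -s pacbio.spec -noclean -fastq pb.fastq 2>&1 | tee self_corrected_pb_1.log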
From: Jason H. <jas...@zo...> - 2014-08-18 17:11:24
Hello PBcR and WGS community,

I'm working with what should be 100x PacBio coverage, and after using PBcR I'm ending up with at best 7x-8x of corrected reads. My initial read set is about 11 million reads with an average length of 3000 bp. After error correction my best run resulted in 1.2 million reads with an average length of 2000 bp. My genome (a terrestrial insect) has relatively high heterozygosity. I've adjusted both maxCoverage and genome size to try to account for this, but I see fewer and shorter reads than with the default PBcR parameters. My current run uses the following command and spec file. I'm using the latest version of WGS, 8.2b.

############## pacbio.spec #############
assemble = 0
localStaging = /wgs_pacbio_assembly/PBcR_self_correction/staging

#faster overlapper with more sensitive settings
mhap = "-k 16 --num-hashes 1256 --num-min-matches 3 --threshold 0.04"
merSize = 16

#system memory parameters to avoid fraction bug
ovlMemory = 512
ovlStoreMemory = 512000
merylMemory = 512000

#increase coverage depth to counter heterozygosity/error rate
#usually results in fewer corrected reads
maxCoverage = 60

#increase genome size to counter heterozygosity, actual genome size 350MB
#usually results in fewer corrected reads
genomeSize = 500000000
#####################################

$PBcR -pbCNS \
  -length 300 \
  -partitions 65 \
  -l corrected_pb_1 \
  -t 64 \
  -s pacbio.spec \
  -noclean \
  -fastq pb.fastq 2>&1 | tee self_corrected_pb_1.log

When looking at the corrected read lists in the temporary directory, I see what appear to be deleted reads of a length I would assume would make the cut, for example:

>100003680002,3680002 mate=0,0 lib=corrected_pb_1,1 clr=LATEST,1,2219 deleted=1
cgtatgtaaaccaattttatactgatggggcgcgaaataacttttcttaagttccttgtgtccaaaca… (continues for a total of 2219 bp)

As it is, none of the overlap-layout assemblers can do much with the low coverage I end up with, so I'm very eager to hear ideas for how I can move this forward. Would you please take a look and let me know how you would proceed? I would be happy to supply any additional information and files.

-Jason
From: Brian W. <th...@gm...> - 2014-07-30 13:31:51
My message with the patch didn't seem to make it into the archive completely. The patch is there, but the message text isn't.

Here's the patch: https://sourceforge.net/p/wgs-assembler/mailman/message/32480476/
You can read the text in this reply: https://sourceforge.net/p/wgs-assembler/mailman/message/32481695/

The two values you need to change are from gatekeeper -dumpinfo. Search for "pacBio" to find where they are in the code.

b

On Tue, Jul 29, 2014 at 5:29 PM, Brian Foster <bf...@lb...> wrote:
> Hello All,
>
> I think I am running into the same partitioning problem that was mentioned in a previous thread. I am getting a single relatively large partition with many smaller same-sized partitions and the overlapStore stage is failing. I am looking for the patch to overlapStoreBuild.C and can't seem to find it. Was that sent as an email attachment? Any help would be appreciated.
>
> Thanks,
> Brian
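For anyone applying the patch by hand, the steps implied above reduce to two commands. This is a sketch only: the gatekeeper call follows the -dumpinfo usage mentioned in the reply, and the source path is an assumption about where overlapStoreBuild.C sits in the wgs-assembler tree.

# report store statistics; the two values to transcribe come from this output
gatekeeper -dumpinfo asm.gkpStore

# locate the hard-coded values to edit before recompiling (directory is assumed)
grep -n "pacBio" src/AS_OVS/overlapStoreBuild.C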
From: Brian F. <bf...@lb...> - 2014-07-29 21:29:13
Hello All,

I think I am running into the same partitioning problem that was mentioned in a previous thread. I am getting a single relatively large partition with many smaller same-sized partitions, and the overlapStore stage is failing. I am looking for the patch to overlapStoreBuild.C and can't seem to find it. Was that sent as an email attachment? Any help would be appreciated.

Thanks,
Brian
From: Brian W. <th...@gm...> - 2014-07-25 23:31:39
Hi, Heiner-

Wow, you've got an old version. ;-) Those two options don't exist in the latest code.

'rebuild repeats' would take all the reads detected by bogart as being repetitive, and do a second unitigging using just those reads. The idea was that maybe we could collapse/separate repeats better if all the unique reads were removed. I never saw any huge gains from doing this.

'mate extension' was a similar idea. Find all the reads that are in repeats. Then, for each unitig, reconstruct it using the reads in the unitig PLUS any mated reads in the repeats. The end result was that the unitig should be extended into repeats, but only using mated reads. Similar result - kind of worked, but nothing spectacular.

They were both decent ideas (and fun to remember), but I don't think they'll help here. We all (should) know that repeats bigger than a read can't be resolved (in general). A corollary of this is that if repeats bigger than the smaller reads are resolved, then the smaller reads cannot be uniquely resolved. It just took enormously different sizes (4k pacbio and 0.1k illumina) to make this a problem.

I've been pleased with ECtools from the Schatz Lab (http://schatzlab.cshl.edu/data/ectools/). Assemble the Illumina to unitigs, use that to correct the pacbio, then assemble the pacbio. I wasn't so pleased by the effort it took to run it (this was 1/2 a year ago) and it might not scale past 1/2 Gbp. But the assemblies were quite good.

b

On Fri, Jul 25, 2014 at 12:00 PM, kuhl <ku...@mo...> wrote:
> Dear Brian,
>
> just a comment, would
>
> batRebuildRepeats = 1
> batMateExtension = 1
>
> help with this issue? I am also running long reads (~4000 bp) with short reads and found this to be helping with some issues I had with cgw. Anyway, I never could use the full memory with bogart with these parameters, because it crashed in step 10. I had to limit bogart to 100Gb RAM (on 2-3 Gbp vertebrate genomes). And then it worked. The result was lower N50 unitigs, but this was solved by cgw. Regarding missassemblies in scaffolds, I also find a lot, which are actually limiting the final N50 and are forcing me to do a lot of manual final polishing of the assemblies (splitting / rescaffolding / gap closing again etc). If I set "doUnitigSplitting = 1" it helps, but is there any way to speed this up, like doing the unitig splitting on partitions in parallel? Seems there is still no perfect solution for hybrid data assemblies....
>
> Heiner
From: Brian W. <th...@gm...> - 2014-07-25 23:07:49
If my suspicion is correct - keep in mind, all this is a total guess on what I imagine is happening - it's likely a mess that can be pushed to completion now. All the obvious scaffolding should be done already. Bump it out of the scaffold merging steps, but let the other cgw steps run. Possibly, you can get away with increasing the min weight (6? 8? no good guess), instead of manually forcing it to stop merging. On Fri, Jul 25, 2014 at 11:23 AM, Waldbieser, Geoff < Geo...@ar...> wrote: > So in this case adding the Illumina PE reads would not have helped? > > Is the graph trying to detangle or is it likely to be a mess that needs to > be axed now? > > > > > > *From:* Brian Walenz [mailto:th...@gm...] > *Sent:* Friday, July 25, 2014 8:11 AM > > *To:* Waldbieser, Geoff > *Subject:* Re: [wgs-assembler-users] Does scaffolding scale with > available RAM? > > > > Sorry, I owe you a few replies. I switched jobs, and now can't read gmail > at work, or work at home. > > It's not that the pacbio assembled through repeats, but that the pacbio > reads themselves get through (larger) repeats. Without the pacbio, bogart > will detect the repeat, notice that no read spans it, and excise it from > the unitig. With the pacbio, bogart again detects the repeat, but now that > a read spans it, the repeat is left in the unitig. > > That would be great, except that the repeat illumina mates are now a total > mess. With just illumina, the repeats are isolated to short unitigs, and > only those mates are a mess, but scaffolder was designed to handle this > case. With the longer repeats included in longer unitigs, and illumina > mates placed incorrectly in those, the scaffold graph is a mess. > > E.g., > > unitig1: unique1-repeatA-unique2 > unitig2: unique3-repeatB-unique4 (where repeatA and repeatB are related) > > It is possible to get a mate between repeatA and unique4, when really it > should be in repeatB. > > Your pacbio-only assembly was from correction of the pacbio with > illumina? I'm surprised it was that bad. > > > > > > On Mon, Jul 21, 2014 at 6:32 PM, Waldbieser, Geoff < > Geo...@ar...> wrote: > > First of all, thanks for saving us $100k on a high Mem server. > > > > When I mapped BAC end sequences to the Illumina-only assembly > (MaSuRCA-2.2.0) the avg insert length of contained mates was 165kb which > was on the dot for that BAC library. When I mapped to the PacBio-only > assembly the insert sizes were in the 30kb range, so I knew something was > wrong. That would support your idea of assembling through repeats and > perhaps through the wrong repeats. So I thought including the Illumina mate > pairs might help the PacBio assembly but apparently the MPs just made it > more convoluted. > > > > Aleksey had suggested not using the PacBio at all for assembly, just for > gap closure. Maybe it’s time to pull the plug on this one, maybe shred the > PacBio reads to overlapping 2kb lengths to use on MaSuRCA. But then again > it could end soon (I tell myself every day). Is there a reasonable way to > estimate how many contigs have been incorporated thus estimating how many > there are to go? > > > > > > > > |
From: kuhl <ku...@mo...> - 2014-07-25 16:18:46
Dear Brian, just a comment, would batRebuildRepeats = 1 batMateExtension = 1 help with this issue? I am also running long reads (~4000 bp) with short reads and found this to be helping with some issues I had with cgw. Anyway, I never could use the full memory with bogart with these parameters, because it crashed in step 10. I had to limit bogart to 100Gb RAM (on 2-3 Gbp vertebrate genomes). And then it worked. The result was lower N50 unitigs, but this was solved by cgw. Regarding missassemblies in scaffolds, I also find a lot, which are actually limiting the final N50 and are forcing me to do a lot of manual final polishing of the assemblies (splitting / rescaffolding / gap closing again etc). If I set "doUnitigSplitting = 1" it helps, but is there any way to speed this up, like doing the unitig splitting on partitions in parallel? Seems there is still no perfect solution for hybrid data assemblies.... Heiner On Fri, 25 Jul 2014 15:23:47 +0000, "Waldbieser, Geoff" <Geo...@AR...> wrote: > So in this case adding the Illumina PE reads would not have helped? > Is the graph trying to detangle or is it likely to be a mess that needs to > be axed now? > > > From: Brian Walenz [mailto:th...@gm...] > Sent: Friday, July 25, 2014 8:11 AM > To: Waldbieser, Geoff > Subject: Re: [wgs-assembler-users] Does scaffolding scale with available > RAM? > > Sorry, I owe you a few replies. I switched jobs, and now can't read gmail > at work, or work at home. > It's not that the pacbio assembled through repeats, but that the pacbio > reads themselves get through (larger) repeats. Without the pacbio, bogart > will detect the repeat, notice that no read spans it, and excise it from > the unitig. With the pacbio, bogart again detects the repeat, but now that > a read spans it, the repeat is left in the unitig. > That would be great, except that the repeat illumina mates are now a total > mess. With just illumina, the repeats are isolated to short unitigs, and > only those mates are a mess, but scaffolder was designed to handle this > case. With the longer repeats included in longer unitigs, and illumina > mates placed incorrectly in those, the scaffold graph is a mess. > > E.g., > unitig1: unique1-repeatA-unique2 > unitig2: unique3-repeatB-unique4 (where repeatA and repeatB are related) > It is possible to get a mate between repeatA and unique4, when really it > should be in repeatB. > Your pacbio-only assembly was from correction of the pacbio with illumina? > I'm surprised it was that bad. > > > On Mon, Jul 21, 2014 at 6:32 PM, Waldbieser, Geoff > <Geo...@ar...<mailto:Geo...@ar...>> > wrote: > First of all, thanks for saving us $100k on a high Mem server. > > When I mapped BAC end sequences to the Illumina-only assembly > (MaSuRCA-2.2.0) the avg insert length of contained mates was 165kb which > was on the dot for that BAC library. When I mapped to the PacBio-only > assembly the insert sizes were in the 30kb range, so I knew something was > wrong. That would support your idea of assembling through repeats and > perhaps through the wrong repeats. So I thought including the Illumina mate > pairs might help the PacBio assembly but apparently the MPs just made it > more convoluted. > > Aleksey had suggested not using the PacBio at all for assembly, just for > gap closure. Maybe it’s time to pull the plug on this one, maybe shred the > PacBio reads to overlapping 2kb lengths to use on MaSuRCA. But then again > it could end soon (I tell myself every day). 
Is there a reasonable way to > estimate how many contigs have been incorporated thus estimating how many > there are to go? > > > > From: Brian Walenz [mailto:th...@gm...<mailto:th...@gm...>] > Sent: Monday, July 21, 2014 5:19 PM > > To: Waldbieser, Geoff > Subject: Re: [wgs-assembler-users] Does scaffolding scale with available > RAM? > > Yup, that looks like a perfectly well behaved process. I can't explain > what Linux is doing with the memory -- filesystem cache would be my guess > -- but the cgw process is small, and more importantly, getting 100% CPU and > using no swap. > My guess is that the PacBio sequenced/assembled through repeats, and the > illumina is now overlapping to the wrong repeat copy, resulting in a very > messy mate graph. Compare this against an illumina only assembly where > unitigs broke at repeat boundaries. The graph is much cleaner, but > possibly disjoint. > I think Aleksey Zimin @ UMD had some success removing overlaps where none > of the kmer seeds were 'unique', for some definition of unique. The > process was rather involved: build unitigs, then decide what isn't unique > (by counting kmers in the assembled unitigs), recompute overlaps, and > re-unitig. I've never seen code to do it, nor the results. Just word of > mouth. > > > On Mon, Jul 21, 2014 at 9:21 AM, Waldbieser, Geoff > <Geo...@ar...<mailto:Geo...@ar...>> > wrote: > The Bri, > > So for Linux halfwits like me, I look at the Mem line and see that it’s > using about all the 512M RAM available. But then I look at the cgw command > line and see that it’s only using 5.7% of memory. So is that what you’re > talking about - that most of the RAM is taken up in cached data and only 5% > of the memory is actually involved in the active processes of cgw? > > [cid:image001.png@01CFA7F2.858DF130] > > The PacBio-only assemblies (no scaffolds) require about 2 days to > complete. The Illumina-only assemblies complete in about 2 weeks. So in the > present case, when the Illumina mate pairs are added to PacBio data but > Illumina PE reads are not included, is it something like the PacBio data > not having the depth of coverage to identify the repetitive elements like > the deep Illumina PE data did, therefore the Illumina mates are aligning to > more repetitive sequence? > > Geoff > > > > > > From: Brian Walenz [mailto:th...@gm...<mailto:th...@gm...>] > Sent: Saturday, July 19, 2014 10:40 AM > > To: Waldbieser, Geoff > Subject: Re: [wgs-assembler-users] Does scaffolding scale with available > RAM? > > Aye, no improvement by moving to 3tb....assuming it's not paging on > whatever tiny machine it is running on now! > -recomputegaps, I think, only matters only at the start of the run, and > only on the later iterations. kickOutNonOvlContigs=0 is the previous > default, so no trouble there. Filter level 2 was developed during our > salmon assembly headache. It seemed to be as sensitive as the default, > maybe a little faster, and also decreased the 'huge gap in scaffold' > problem that results in massive slow downs and enormous (and incorrect) > scaffolds. > > > On Fri, Jul 18, 2014 at 1:38 PM, Waldbieser, Geoff > <Geo...@ar...<mailto:Geo...@ar...>> > wrote: > Maybe I have exacerbated the slowdown by using ‘cgwMergeFilterLevel=2 > –recomputegaps’ and ‘kickOutNonOvlContigs = 0’? At least for now it seems > to be avoiding the 50Mb incorrect scaffold or the constant cycle of > merge/exclude specific contigs. If it’s a good assembly then it will have > been worth the time. 
> > From: Brian Walenz [mailto:th...@gm...<mailto:th...@gm...>] > Sent: Thursday, July 17, 2014 5:29 AM > To: Waldbieser, Geoff > Subject: Re: [wgs-assembler-users] Does scaffolding scale with available > RAM? > > Hi, Geoff- > Sadly, no control over memory in CGW. Its already using the most it can. > Most of the memory usage is for caching untigis/contigs, if space is really > tight, the cache can be turned off and they'll be loaded from disk every > time. Not what you're after. > Before we had a large memory machine, I ran a ~200gb CGW on a 128gb > machine. It ran perfectly fine. The infrequently used unitigs/contigs > ended up swapped out, just as if the cache was disabled. So, unless your > CGW process is much much bigger than 512gb, you won't gain anything. > There are a few options that can make significant improvements in run > time. cgwMergeFilterLevel of 2 should be a little faster and not that much > worse. cgwMergeFilterLevel of 5 will be quite speedy, but not aggressive. > cgwMinMergeWeight sets the minimum number of mates that are needed to > attempt a scaffold join; default is 2. This is shown in the logs. If it > gets stuck doing a bunch of weight 2 merges, increasing to 3 will help, but > could sacrifice some joins. > > b > > On Wed, Jul 16, 2014 at 4:07 PM, Waldbieser, Geoff > <Geo...@ar...<mailto:Geo...@ar...>> > wrote: > Hi Brian, > I’m once again using a calendar to measure a scaffolding job (basically > scaffolding PacBio reads with Illumina mate pairs). Does the scaffolding > speed scale with increases in RAM? The current setup has 512GB RAM but if > this were to run on a node that contains 1TB or 2TB RAM would the job be ½ > or ¼ the length of time? > > Geoff > > > Geoff Waldbieser > USDA, ARS, Warmwater Aquaculture Research Unit > 141 Experiment Station Road > Stoneville, Mississippi 38776 > Ofc. 662-686-3593<tel:662-686-3593> > Fax. 662-686-3567<tel:662-686-3567> > > > > > > This electronic message contains information generated by the USDA solely > for the intended recipients. Any unauthorized interception of this message > or the use or disclosure of the information it contains may violate the law > and subject the violator to civil or criminal penalties. If you believe you > have received this message in error, please notify the sender and delete > the email immediately. > > ------------------------------------------------------------------------------ > Want fast and easy access to all the code in your enterprise? Index and > search up to 200,000 lines of code with a free copy of Black Duck > Code Sight - the same software that powers the world's largest code > search on Ohloh, the Black Duck Open Hub! Try it now. > http://p.sf.net/sfu/bds > _______________________________________________ > wgs-assembler-users mailing list > wgs...@li...<mailto:wgs...@li...> > https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users -- --------------------------------------------------------------- Dr. Heiner Kuhl MPI Molecular Genetics Tel: + 49 + 30 / 8413 1776 Next Generation Sequencing Ihnestrasse 73 email: ku...@mo... D-14195 Berlin http://www.molgen.mpg.de/SeqCore --------------------------------------------------------------- |
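For reference, the scaffolder controls discussed in this thread are ordinary spec-file settings. The snippet below only restates the values mentioned above as they would appear in a spec; it is not a general recommendation for other data sets.

# cgw scaffold-merging controls (see the replies quoted above)
cgwMergeFilterLevel = 2     # 2 is a little faster than the default; 5 is much faster but less aggressive
cgwMinMergeWeight = 3       # minimum mates needed to attempt a join; default 2, raise to 3 (or 6-8) to skip weak merges
kickOutNonOvlContigs = 0    # the previous default, as noted above
doUnitigSplitting = 1       # Heiner's setting; reported to help with scaffold misassemblies, at the cost of runtime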
From: 任一 <upf...@gm...> - 2014-07-18 01:00:21
Hi All:

I tried PBcR in wgs-8.2alpha for assembly of the phage and E. coli sample data. Unfortunately it failed at 5-consensus; the errors are below. I also changed the parameter to "consensus=cns" and retried, but it still failed at the same step. Using the corrected data, I tested runCA from other versions such as 8.1, 8.0 and 7.0, and all of them failed. I thought it might be because of some older library in my OS? Can anyone help me? Thanks very much.

/mnt/lustre/users/renyi/bio-softs/wgs-download/sampledata/ecoli/ry/5-consensus/consensus.sh 1 > /dev/null 2>&1
----------------------------------------END CONCURRENT Thu Jul 17 18:52:47 2014 (8004 seconds)
/mnt/lustre/users/renyi/bio-softs/wgs-download/sampledata/ecoli/ry/5-consensus/asm_001 failed -- no .success.
================================================================================
runCA failed.
----------------------------------------
Stack trace:
 at /mnt/lustre/users/renyi/bio-softs/wgs-8.2alpha/Linux-amd64/bin/runCA line 1568.
 main::caFailure("1 unitig consensus jobs failed; remove /mnt/lustre/users/reny"..., undef) called at /mnt/lustre/users/renyi/bio-softs/wgs-8.2alpha/Linux-amd64/bin/runCA line 4944
 main::postUnitiggerConsensus() called at /mnt/lustre/users/renyi/bio-softs/wgs-8.2alpha/Linux-amd64/bin/runCA line 6479
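When a consensus partition fails like this, the per-job output is discarded by the > /dev/null redirect shown in the log above, so the underlying error is never seen. A minimal sketch for getting at it is to rerun the failed partition by hand and keep its output; the command simply re-invokes the generated consensus.sh with the paths from the log, and the output filename is made up.

cd /mnt/lustre/users/renyi/bio-softs/wgs-download/sampledata/ecoli/ry/5-consensus
sh consensus.sh 1 2>&1 | tee consensus.001.rerun.err   # rerun job 1 without discarding stdout/stderr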
From: Waldbieser, G. <Geo...@AR...> - 2014-07-16 20:08:26
Hi Brian, I'm once again using a calendar to measure a scaffolding job (basically scaffolding PacBio reads with Illumina mate pairs). Does the scaffolding speed scale with increases in RAM? The current setup has 512GB RAM but if this were to run on a node that contains 1TB or 2TB RAM would the job be ½ or ¼ the length of time? Geoff Geoff Waldbieser USDA, ARS, Warmwater Aquaculture Research Unit 141 Experiment Station Road Stoneville, Mississippi 38776 Ofc. 662-686-3593 Fax. 662-686-3567 This electronic message contains information generated by the USDA solely for the intended recipients. Any unauthorized interception of this message or the use or disclosure of the information it contains may violate the law and subject the violator to civil or criminal penalties. If you believe you have received this message in error, please notify the sender and delete the email immediately. |
From: Serge K. <ser...@gm...> - 2014-07-11 16:26:04
Hi, For the PacBio raw reads you do need to have one fastq file. You can just concatenate all your smrtcell filtered data together or run the filtering on all the SMRTcells at once. For the correction data (Illumina/etc), you can provide an arbitrary number of FRG files which will get used for correction. On Jul 11, 2014, at 11:12 AM, nic blouin <nb...@ma...> wrote: > Hi there- > > I looked through the archives and din't see a post regarding this item, which makes me think i am being obtuse here. > > I wish to use pacBioToCa to correct a PacBio data set I have. > For error correction i can see that i can input an illumina dataset and off I go. > I have quite alot of gDNA data for this organism and would like to use it all thinking that more is beter. > From looking over the documentation i believe that I can submit only one correction file is this correct? > Or is there a way for me to include 4 read sets with different pared/mate distances to correct my PacBio data? > For example i have a 4 illumina runs with 300 bp, 500 bp, 4kb, and 7 kb inserts respectively. > > Thanks for any advice. > > > > > nic > > > Nicolas Achille Blouin, Ph.D. > Dept. of Biological Sciences > University of Rhode Island > 120 Flagg Road, CBLS 260 > Kingston, RI 02881 > > ------------------------------------------------------------------------------ > _______________________________________________ > wgs-assembler-users mailing list > wgs...@li... > https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users |
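A minimal sketch of what this looks like in practice is below. All file names are hypothetical, the FRG files are assumed to have been built beforehand with fastqToCA (one per Illumina library), and the option values are copied from the PBcR commands elsewhere in this archive rather than tuned for this data set.

# one fastq holding all raw PacBio reads, concatenated across SMRT cells
cat smrtcell_1.filtered_subreads.fastq smrtcell_2.filtered_subreads.fastq > all_pacbio.fastq

# correction run with several Illumina libraries supplied as trailing FRG files
# (the same pattern applies to pacBioToCA)
PBcR -length 500 -partitions 200 -l corrected_pb -t 16 -s pacbio.spec \
  -fastq all_pacbio.fastq \
  illumina_300bp.frg illumina_500bp.frg illumina_4kb.frg illumina_7kb.frg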
From: nic b. <nb...@ma...> - 2014-07-11 16:11:08
Hi there-

I looked through the archives and didn't see a post regarding this item, which makes me think I am being obtuse here.

I wish to use pacBioToCA to correct a PacBio data set I have. For error correction I can see that I can input an Illumina dataset and off I go. I have quite a lot of gDNA data for this organism and would like to use it all, thinking that more is better. From looking over the documentation I believe that I can submit only one correction file; is this correct? Or is there a way for me to include 4 read sets with different paired/mate distances to correct my PacBio data? For example, I have 4 Illumina runs with 300 bp, 500 bp, 4 kb, and 7 kb inserts respectively.

Thanks for any advice.

nic

Nicolas Achille Blouin, Ph.D.
Dept. of Biological Sciences
University of Rhode Island
120 Flagg Road, CBLS 260
Kingston, RI 02881
From: Serge K. <ser...@gm...> - 2014-06-27 19:23:13
The reason the last file didn't have an error is because it is only performing a self-comparison since overlaps are symmetric so it doesn't use the stream directory. When you specified the localScratch directory, did you remove all the temporary output and re-ran? Could you also send your overlap.sh file in 1-overlapper as well? On Jun 27, 2014, at 3:17 PM, Matthew Conte <co...@gm...> wrote: > 1.err is attached. 1.hash.err didn't get created. > > Also the overlap was broken up into 34 parts and only the last part (34.err) didn't have the "java.io.FileNotFoundException" in it, the rest all did. > > -Matt > > > On Fri, Jun 27, 2014 at 12:20 PM, Serge Koren <ser...@gm...> wrote: > Hmm, that is strange. Could you send the output in your 1.hash.err and 1.err files? > > Sergey > > On Jun 26, 2014, at 4:59 PM, Matthew Conte <co...@gm...> wrote: > >> Hi, >> >> I had tried adding the localStaging flag, but still got the same "java.io.FileNotFoundException" during the overlap step. I did try out the lambda phage sample data set and it ran fine so I don't think it is something with my installation. >> >> We currently only have 16X but are thinking of going higher. I wanted to try a de novo assembly with this current dataset and MHAP finally seems like a reasonable way to do so =) >> >> Thanks, >> Matt >> >> >> On Thu, Jun 26, 2014 at 2:34 PM, Serge Koren <ser...@gm...> wrote: >> Hi, >> >> Thanks, yes this looks like a bug in that the code recognized your genome is too big to do the precompute but didn't properly turn it off. Adding the localStaging="<path to local disk on node>" should let you work around the issue. We will make a new release candidate and fix this bug and the other one you encountered. I will say that with 16X you are probably not going to get a very good assembly because you'll likely have less than 10X after correction. I'd suggest trying ECTools as well (https://github.com/jgurtowski/ectools) as it is designed to work best with coverage in the 10-20X range in combination with short-read sequencing data. >> >> Sergey >> >> On Jun 25, 2014, at 2:33 PM, Matthew Conte <co...@gm...> wrote: >> >>> Hi Serge, >>> >>> On Wed, Jun 25, 2014 at 11:36 AM, Serge Koren <ser...@gm...> wrote: >>> Hi, >>> >>> On Jun 24, 2014, at 5:40 PM, Matthew Conte <co...@gm...> wrote: >>> >>>> Hi all, >>>> >>>> I'm trying out PBcR to make use of the new MHAP overlapper for self correcting a set of PacBio reads and I'm running into an issue. >>>> >>>> I'm getting the following errors in the temp_dir/1-overlapper/1.err: >>>> Exception in thread "main" java.io.FileNotFoundException: /raid3/PBcR_CA_8.2_alpha/tempLibrary/1-overlapper/stream_1/correct_reads_part000002.dat (No such file or directory) >>> The dat file is a pre-computed index that is used to speed up the computation for smaller genomes. For larger genomes or if you are using local disk, it should not get created. Do you have the output of the pipeline up to this step along with the command-line you used to start the run? That will help diagnose why it is not properly recognizing that the index is not built. As a workaround, you can add "localStaging=</path to local disk>" to your PBcR command which will force the pipeline to never pre-compute the index. 
>>> >>> The command that I ran was: >>> /sw/wgs-8.2alpha/Linux-amd64/bin/PBcR "mhap=-k 16 --num-hashes 1256 --num-min-matches 3 --threshold 0.04 localStaging=/path_to_working_dir/temp_staging" merSize=16 -length 500 -partitions 200 -threads 27 -lib >>> aryname PBcR -s pacbio.spec fastqFile=filtered_subreads.bbmap.rm_adapters.split.fastq -genomeSize 1000000000 >>> >>> I changed the MHAP settings according to the PBcR wiki since I only have about 16x coverage of PacBio data. >>> >>> I should mention that runCA continues to run until the '5-consensus' step, and errors out there. But I think the start of the problem is at this overlap step. >>> >>> The relevant output was: >>> ### Reading options from 'pacbio.spec' >>> ### Reading options from the command line. >>> >>> Warning: no frag files specified, assuming self-correction of pacbio sequences. >>> Running with 27 threads and 200 partitions >>> ********* Starting correction... >>> ... >>> ******** Configuration Summary ******** >>> bankPath = >>> maxCoverage = 40 >>> ... >>> mhap = -k 16 --num-hashes 1256 --num-min-matches 3 --threshold 0.04 localStaging=/path_to_working_dir/temp_staging >>> ovlRefBlockLength = 100000000000 >>> cnsErrorRate = 0.25 >>> ... >>> ----------------------------------------START Wed Jun 25 11:24:30 2014 >>> mkdir tempPBcR >>> ----------------------------------------END Wed Jun 25 11:24:30 2014 (0 seconds) >>> ----------------------------------------START Wed Jun 25 11:24:30 2014 >>> /sw/wgs-8.2alpha/Linux-amd64/bin/fastqToCA -libraryname PBcR -type sanger -technology none -feature doConsensusCorrection 1 -reads /path_to_working_dir/filtered_subreads.bbmap.rm_adapters.split.fastq > /path_to_working_dir//tempPBcR/PBcR.frg >>> ----------------------------------------END Wed Jun 25 11:24:30 2014 (0 seconds) >>> ----------------------------------------START Wed Jun 25 11:24:30 2014 >>> /sw/wgs-8.2alpha/Linux-amd64/bin/runCA -s /path_to_working_dir//tempPBcR/PBcR.spec -p asm -d tempPBcR stopAfter=initialStoreBuilding /path_to_working_dir//tempPBcR/PBcR.frg >>> ----------------------------------------START Wed Jun 25 11:24:30 2014 >>> /sw/wgs-8.2alpha/Linux-amd64/bin/gatekeeper -o /path_to_working_dir/tempPBcR/asm.gkpStore.BUILDING -F /path_to_working_dir//tempPBcR/PBcR.frg > /path_to_working_dir/tempPBcR/asm.gkpStore.err 2>&1 >>> ----------------------------------------END Wed Jun 25 11:35:27 2014 (657 seconds) >>> numFrags = 2995674 >>> Stop requested after 'initialstorebuilding'. 
>>> ----------------------------------------END Wed Jun 25 11:35:27 2014 (657 seconds) >>> Will be correcting PacBio library 1 with librarie[s] 1 - 1 >>> ----------------------------------------START Wed Jun 25 11:35:29 2014 >>> /sw/wgs-8.2alpha/Linux-amd64/bin/gatekeeper -dumpfragments -invert -tabular -longestovermin 1 500 -longestlength 1 8268329152 /path_to_working_dir//tempPBcR/asm.gkpStore 2> /path_to_working_dir//tempPBcR/asm.seedlength |awk '{if (!(match($1, "UID") != 0 && length($1) == 3)) { print "frg uid "$1" isdeleted 1"; } }' > /path_to_working_dir//tempPBcR/asm.toerase.uid >>> ----------------------------------------END Wed Jun 25 11:35:38 2014 (9 seconds) >>> ----------------------------------------START Wed Jun 25 11:35:38 2014 >>> /sw/wgs-8.2alpha/Linux-amd64/bin/gatekeeper --edit /path_to_working_dir//tempPBcR/asm.toerase.uid /path_to_working_dir//tempPBcR/asm.gkpStore > /path_to_working_dir//tempPBcR/asm.toerase.out 2> /path_to_working_dir//tempPBcR/asm.toerase.err >>> ----------------------------------------END Wed Jun 25 11:35:44 2014 (6 seconds) >>> Running with 8.268329256X (for genome size 1000000000) of PBcR sequences (8268329256 bp). >>> Correcting with 16X sequences (16536658304 bp). >>> Warning: performing self-correction with a total of 16. For best performance, at least 50 is recommended. >>> ----------------------------------------START Wed Jun 25 11:35:44 2014 >>> /sw/wgs-8.2alpha/Linux-amd64/bin/jellyfish count -m 16 -s 120000000 -t 32 -o /path_to_working_dir//tempPBcR/asm.mers /path_to_working_dir/filtered_subreads.bbmap.rm_adapters.split.fastq >>> ----------------------------------------END Wed Jun 25 12:05:11 2014 (1767 seconds) >>> ----------------------------------------START Wed Jun 25 12:05:11 2014 >>> /sw/wgs-8.2alpha/Linux-amd64/bin/jellyfish histo -t 32 -f /path_to_working_dir//tempPBcR/asm.mers > /path_to_working_dir//tempPBcR/asm.hist >>> ----------------------------------------END Wed Jun 25 12:09:10 2014 (239 seconds) >>> ----------------------------------------START Wed Jun 25 12:09:10 2014 >>> /sw/wgs-8.2alpha/Linux-amd64/bin/jellyfish dump -c -t -L 34 /path_to_working_dir//tempPBcR/asm.mers |awk -v TOTAL=3328265613 '{printf("%s\t%0.10f\t%d\t%d\n", $1, $2/TOTAL, $2, TOTAL)}' |sort -T . 
-rnk2> /path_to_working_dir//tempPBcR/asm.ignore >>> ----------------------------------------END Wed Jun 25 12:21:17 2014 (727 seconds) >>> ----------------------------------------START Wed Jun 25 12:21:17 2014 >>> rm /path_to_working_dir//tempPBcR/asm.mers* >>> ----------------------------------------END Wed Jun 25 12:21:23 2014 (6 seconds) >>> ----------------------------------------START Wed Jun 25 12:21:23 2014 >>> mkdir /path_to_working_dir//tempPBcR/1-overlapper >>> ----------------------------------------END Wed Jun 25 12:21:23 2014 (0 seconds) >>> ----------------------------------------START Wed Jun 25 12:21:23 2014 >>> /sw/wgs-8.2alpha/Linux-amd64/bin/gatekeeper -dumpfragments -tabular asm.gkpStore |awk '{print $1"\t"$2}' > asm.eidToIID >>> ----------------------------------------END Wed Jun 25 12:21:28 2014 (5 seconds) >>> ----------------------------------------START Wed Jun 25 12:21:28 2014 >>> /sw/wgs-8.2alpha/Linux-amd64/bin/gatekeeper -dumpfragments -tabular asm.gkpStore |awk '{print $2"\t"$10}' > asm.iidToLen >>> ----------------------------------------END Wed Jun 25 12:21:33 2014 (5 seconds) >>> ----------------------------------------START CONCURRENT Wed Jun 25 12:21:33 2014 >>> /path_to_working_dir//tempPBcR/1-overlapper/ovlprep.sh 1 >>> Scanning store to find libraries used and reads to dump. >>> Added 0 reads to maintain mate relationships. >>> Dumping 0 fragments from unknown library (version 1 has these) >>> Dumping 133125 fragments from library IID 1 >>> /path_to_working_dir//tempPBcR/1-overlapper/ovlprep.sh 2 >>> Scanning store to find libraries used and reads to dump. >>> Added 0 reads to maintain mate relationships. >>> ... >>> /path_to_working_dir//tempPBcR/1-overlapper/ovlprep.sh 23 >>> Scanning store to find libraries used and reads to dump. >>> Added 0 reads to maintain mate relationships. >>> Dumping 0 fragments from unknown library (version 1 has these) >>> Dumping 66924 fragments from library IID 1 >>> ----------------------------------------END CONCURRENT Wed Jun 25 12:27:16 2014 (343 seconds) >>> ----------------------------------------START CONCURRENT Wed Jun 25 12:27:16 2014 >>> /path_to_working_dir//tempPBcR/1-overlapper/overlap.sh 1 >>> Running partition 000001 with options -h 1-133125 -r 133126-1597500 start 133125 end 1597500 total 1464375 zero job 0 and stride 1 >>> /path_to_working_dir//tempPBcR/1-overlapper/overlap.sh 2 >>> Running partition 000002 with options -h 1-133125 -r 1597501-2995674 start 1597500 end 2995674 total 1398174 zero job 0 and stride 1 >>> ... >>> >>> >>> Thanks, >>> Matt >>> >>> >>>> >>>> There is no 'correct_reads_part000002.dat' file there, but there is a 'correct_reads_part000002.fasta' file where the 'stream_1/correct_reads_part000002.dat' points to. I'm not sure if it is just an extension naming issue or if the .dat files weren't created properly. >>>> >>>> Also, I've found another minor issue with the '-threads' option supplied to PBcR on the command line. It doesn't seem to use the number of threads supplied and simply uses the max number of cpus on the machine available. >>> Thanks, I'll check this and update the code. 
>>>> >>>> Thanks, >>>> Matt >>>> ------------------------------------------------------------------------------ >>>> Open source business process management suite built on Java and Eclipse >>>> Turn processes into business applications with Bonita BPM Community Edition >>>> Quickly connect people, data, and systems into organized workflows >>>> Winner of BOSSIE, CODIE, OW2 and Gartner awards >>>> http://p.sf.net/sfu/Bonitasoft_______________________________________________ >>>> wgs-assembler-users mailing list >>>> wgs...@li... >>>> https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users >>> >>> >> >> > > > <1.err> |
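One detail worth noting from the commands quoted in this thread: the localStaging=... setting was placed inside the quoted mhap string, and the Configuration Summary shows it being carried along as part of the mhap value, which may be why the workaround did not take effect. A sketch of passing it separately is below; the path is a placeholder and the spec syntax follows the other spec files in this archive.

# in pacbio.spec: keep localStaging as its own setting, outside the quoted mhap value
localStaging = /local/scratch/pbcr_staging
mhap = "-k 16 --num-hashes 1256 --num-min-matches 3 --threshold 0.04"
merSize = 16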
From: Serge K. <ser...@gm...> - 2014-06-27 16:20:56
Hmm, that is strange. Could you send the output in your 1.hash.err and 1.err files? Sergey On Jun 26, 2014, at 4:59 PM, Matthew Conte <co...@gm...> wrote: > Hi, > > I had tried adding the localStaging flag, but still got the same "java.io.FileNotFoundException" during the overlap step. I did try out the lambda phage sample data set and it ran fine so I don't think it is something with my installation. > > We currently only have 16X but are thinking of going higher. I wanted to try a de novo assembly with this current dataset and MHAP finally seems like a reasonable way to do so =) > > Thanks, > Matt > > > On Thu, Jun 26, 2014 at 2:34 PM, Serge Koren <ser...@gm...> wrote: > Hi, > > Thanks, yes this looks like a bug in that the code recognized your genome is too big to do the precompute but didn't properly turn it off. Adding the localStaging="<path to local disk on node>" should let you work around the issue. We will make a new release candidate and fix this bug and the other one you encountered. I will say that with 16X you are probably not going to get a very good assembly because you'll likely have less than 10X after correction. I'd suggest trying ECTools as well (https://github.com/jgurtowski/ectools) as it is designed to work best with coverage in the 10-20X range in combination with short-read sequencing data. > > Sergey > > On Jun 25, 2014, at 2:33 PM, Matthew Conte <co...@gm...> wrote: > >> Hi Serge, >> >> On Wed, Jun 25, 2014 at 11:36 AM, Serge Koren <ser...@gm...> wrote: >> Hi, >> >> On Jun 24, 2014, at 5:40 PM, Matthew Conte <co...@gm...> wrote: >> >>> Hi all, >>> >>> I'm trying out PBcR to make use of the new MHAP overlapper for self correcting a set of PacBio reads and I'm running into an issue. >>> >>> I'm getting the following errors in the temp_dir/1-overlapper/1.err: >>> Exception in thread "main" java.io.FileNotFoundException: /raid3/PBcR_CA_8.2_alpha/tempLibrary/1-overlapper/stream_1/correct_reads_part000002.dat (No such file or directory) >> The dat file is a pre-computed index that is used to speed up the computation for smaller genomes. For larger genomes or if you are using local disk, it should not get created. Do you have the output of the pipeline up to this step along with the command-line you used to start the run? That will help diagnose why it is not properly recognizing that the index is not built. As a workaround, you can add "localStaging=</path to local disk>" to your PBcR command which will force the pipeline to never pre-compute the index. >> >> The command that I ran was: >> /sw/wgs-8.2alpha/Linux-amd64/bin/PBcR "mhap=-k 16 --num-hashes 1256 --num-min-matches 3 --threshold 0.04 localStaging=/path_to_working_dir/temp_staging" merSize=16 -length 500 -partitions 200 -threads 27 -lib >> aryname PBcR -s pacbio.spec fastqFile=filtered_subreads.bbmap.rm_adapters.split.fastq -genomeSize 1000000000 >> >> I changed the MHAP settings according to the PBcR wiki since I only have about 16x coverage of PacBio data. >> >> I should mention that runCA continues to run until the '5-consensus' step, and errors out there. But I think the start of the problem is at this overlap step. >> >> The relevant output was: >> ### Reading options from 'pacbio.spec' >> ### Reading options from the command line. >> >> Warning: no frag files specified, assuming self-correction of pacbio sequences. >> Running with 27 threads and 200 partitions >> ********* Starting correction... >> ... >> ******** Configuration Summary ******** >> bankPath = >> maxCoverage = 40 >> ... 
>> mhap = -k 16 --num-hashes 1256 --num-min-matches 3 --threshold 0.04 localStaging=/path_to_working_dir/temp_staging >> ovlRefBlockLength = 100000000000 >> cnsErrorRate = 0.25 >> ... >> ----------------------------------------START Wed Jun 25 11:24:30 2014 >> mkdir tempPBcR >> ----------------------------------------END Wed Jun 25 11:24:30 2014 (0 seconds) >> ----------------------------------------START Wed Jun 25 11:24:30 2014 >> /sw/wgs-8.2alpha/Linux-amd64/bin/fastqToCA -libraryname PBcR -type sanger -technology none -feature doConsensusCorrection 1 -reads /path_to_working_dir/filtered_subreads.bbmap.rm_adapters.split.fastq > /path_to_working_dir//tempPBcR/PBcR.frg >> ----------------------------------------END Wed Jun 25 11:24:30 2014 (0 seconds) >> ----------------------------------------START Wed Jun 25 11:24:30 2014 >> /sw/wgs-8.2alpha/Linux-amd64/bin/runCA -s /path_to_working_dir//tempPBcR/PBcR.spec -p asm -d tempPBcR stopAfter=initialStoreBuilding /path_to_working_dir//tempPBcR/PBcR.frg >> ----------------------------------------START Wed Jun 25 11:24:30 2014 >> /sw/wgs-8.2alpha/Linux-amd64/bin/gatekeeper -o /path_to_working_dir/tempPBcR/asm.gkpStore.BUILDING -F /path_to_working_dir//tempPBcR/PBcR.frg > /path_to_working_dir/tempPBcR/asm.gkpStore.err 2>&1 >> ----------------------------------------END Wed Jun 25 11:35:27 2014 (657 seconds) >> numFrags = 2995674 >> Stop requested after 'initialstorebuilding'. >> ----------------------------------------END Wed Jun 25 11:35:27 2014 (657 seconds) >> Will be correcting PacBio library 1 with librarie[s] 1 - 1 >> ----------------------------------------START Wed Jun 25 11:35:29 2014 >> /sw/wgs-8.2alpha/Linux-amd64/bin/gatekeeper -dumpfragments -invert -tabular -longestovermin 1 500 -longestlength 1 8268329152 /path_to_working_dir//tempPBcR/asm.gkpStore 2> /path_to_working_dir//tempPBcR/asm.seedlength |awk '{if (!(match($1, "UID") != 0 && length($1) == 3)) { print "frg uid "$1" isdeleted 1"; } }' > /path_to_working_dir//tempPBcR/asm.toerase.uid >> ----------------------------------------END Wed Jun 25 11:35:38 2014 (9 seconds) >> ----------------------------------------START Wed Jun 25 11:35:38 2014 >> /sw/wgs-8.2alpha/Linux-amd64/bin/gatekeeper --edit /path_to_working_dir//tempPBcR/asm.toerase.uid /path_to_working_dir//tempPBcR/asm.gkpStore > /path_to_working_dir//tempPBcR/asm.toerase.out 2> /path_to_working_dir//tempPBcR/asm.toerase.err >> ----------------------------------------END Wed Jun 25 11:35:44 2014 (6 seconds) >> Running with 8.268329256X (for genome size 1000000000) of PBcR sequences (8268329256 bp). >> Correcting with 16X sequences (16536658304 bp). >> Warning: performing self-correction with a total of 16. For best performance, at least 50 is recommended. 
>> ----------------------------------------START Wed Jun 25 11:35:44 2014 >> /sw/wgs-8.2alpha/Linux-amd64/bin/jellyfish count -m 16 -s 120000000 -t 32 -o /path_to_working_dir//tempPBcR/asm.mers /path_to_working_dir/filtered_subreads.bbmap.rm_adapters.split.fastq >> ----------------------------------------END Wed Jun 25 12:05:11 2014 (1767 seconds) >> ----------------------------------------START Wed Jun 25 12:05:11 2014 >> /sw/wgs-8.2alpha/Linux-amd64/bin/jellyfish histo -t 32 -f /path_to_working_dir//tempPBcR/asm.mers > /path_to_working_dir//tempPBcR/asm.hist >> ----------------------------------------END Wed Jun 25 12:09:10 2014 (239 seconds) >> ----------------------------------------START Wed Jun 25 12:09:10 2014 >> /sw/wgs-8.2alpha/Linux-amd64/bin/jellyfish dump -c -t -L 34 /path_to_working_dir//tempPBcR/asm.mers |awk -v TOTAL=3328265613 '{printf("%s\t%0.10f\t%d\t%d\n", $1, $2/TOTAL, $2, TOTAL)}' |sort -T . -rnk2> /path_to_working_dir//tempPBcR/asm.ignore >> ----------------------------------------END Wed Jun 25 12:21:17 2014 (727 seconds) >> ----------------------------------------START Wed Jun 25 12:21:17 2014 >> rm /path_to_working_dir//tempPBcR/asm.mers* >> ----------------------------------------END Wed Jun 25 12:21:23 2014 (6 seconds) >> ----------------------------------------START Wed Jun 25 12:21:23 2014 >> mkdir /path_to_working_dir//tempPBcR/1-overlapper >> ----------------------------------------END Wed Jun 25 12:21:23 2014 (0 seconds) >> ----------------------------------------START Wed Jun 25 12:21:23 2014 >> /sw/wgs-8.2alpha/Linux-amd64/bin/gatekeeper -dumpfragments -tabular asm.gkpStore |awk '{print $1"\t"$2}' > asm.eidToIID >> ----------------------------------------END Wed Jun 25 12:21:28 2014 (5 seconds) >> ----------------------------------------START Wed Jun 25 12:21:28 2014 >> /sw/wgs-8.2alpha/Linux-amd64/bin/gatekeeper -dumpfragments -tabular asm.gkpStore |awk '{print $2"\t"$10}' > asm.iidToLen >> ----------------------------------------END Wed Jun 25 12:21:33 2014 (5 seconds) >> ----------------------------------------START CONCURRENT Wed Jun 25 12:21:33 2014 >> /path_to_working_dir//tempPBcR/1-overlapper/ovlprep.sh 1 >> Scanning store to find libraries used and reads to dump. >> Added 0 reads to maintain mate relationships. >> Dumping 0 fragments from unknown library (version 1 has these) >> Dumping 133125 fragments from library IID 1 >> /path_to_working_dir//tempPBcR/1-overlapper/ovlprep.sh 2 >> Scanning store to find libraries used and reads to dump. >> Added 0 reads to maintain mate relationships. >> ... >> /path_to_working_dir//tempPBcR/1-overlapper/ovlprep.sh 23 >> Scanning store to find libraries used and reads to dump. >> Added 0 reads to maintain mate relationships. >> Dumping 0 fragments from unknown library (version 1 has these) >> Dumping 66924 fragments from library IID 1 >> ----------------------------------------END CONCURRENT Wed Jun 25 12:27:16 2014 (343 seconds) >> ----------------------------------------START CONCURRENT Wed Jun 25 12:27:16 2014 >> /path_to_working_dir//tempPBcR/1-overlapper/overlap.sh 1 >> Running partition 000001 with options -h 1-133125 -r 133126-1597500 start 133125 end 1597500 total 1464375 zero job 0 and stride 1 >> /path_to_working_dir//tempPBcR/1-overlapper/overlap.sh 2 >> Running partition 000002 with options -h 1-133125 -r 1597501-2995674 start 1597500 end 2995674 total 1398174 zero job 0 and stride 1 >> ... 
>> >> >> Thanks, >> Matt >> >> >>> >>> There is no 'correct_reads_part000002.dat' file there, but there is a 'correct_reads_part000002.fasta' file where the 'stream_1/correct_reads_part000002.dat' points to. I'm not sure if it is just an extension naming issue or if the .dat files weren't created properly. >>> >>> Also, I've found another minor issue with the '-threads' option supplied to PBcR on the command line. It doesn't seem to use the number of threads supplied and simply uses the max number of cpus on the machine available. >> Thanks, I'll check this and update the code. >>> >>> Thanks, >>> Matt >>> ------------------------------------------------------------------------------ >>> Open source business process management suite built on Java and Eclipse >>> Turn processes into business applications with Bonita BPM Community Edition >>> Quickly connect people, data, and systems into organized workflows >>> Winner of BOSSIE, CODIE, OW2 and Gartner awards >>> http://p.sf.net/sfu/Bonitasoft_______________________________________________ >>> wgs-assembler-users mailing list >>> wgs...@li... >>> https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users >> >> > > |
From: Matthew C. <co...@gm...> - 2014-06-26 21:00:27
Hi, I had tried adding the localStaging flag, but still got the same " *java.io.FileNotFoundException*" during the overlap step. I did try out the lambda phage sample data set and it ran fine so I don't think it is something with my installation. We currently only have 16X but are thinking of going higher. I wanted to try a de novo assembly with this current dataset and MHAP finally seems like a reasonable way to do so =) Thanks, Matt On Thu, Jun 26, 2014 at 2:34 PM, Serge Koren <ser...@gm...> wrote: > Hi, > > Thanks, yes this looks like a bug in that the code recognized your genome > is too big to do the precompute but didn't properly turn it off. Adding the > localStaging="<path to local disk on node>" should let you work around the > issue. We will make a new release candidate and fix this bug and the other > one you encountered. I will say that with 16X you are probably not going to > get a very good assembly because you'll likely have less than 10X after > correction. I'd suggest trying ECTools as well ( > https://github.com/jgurtowski/ectools) as it is designed to work best > with coverage in the 10-20X range in combination with short-read sequencing > data. > > Sergey > > On Jun 25, 2014, at 2:33 PM, Matthew Conte <co...@gm...> wrote: > > Hi Serge, > > On Wed, Jun 25, 2014 at 11:36 AM, Serge Koren <ser...@gm...> > wrote: > >> Hi, >> >> On Jun 24, 2014, at 5:40 PM, Matthew Conte <co...@gm...> wrote: >> >> Hi all, >> >> I'm trying out PBcR to make use of the new MHAP overlapper for self >> correcting a set of PacBio reads and I'm running into an issue. >> >> I'm getting the following errors in the temp_dir/1-overlapper/1.err: >> *Exception in thread "main" java.io.FileNotFoundException: >> /raid3/PBcR_CA_8.2_alpha/tempLibrary/1-overlapper/stream_1/correct_reads_part000002.dat >> (No such file or directory)* >> >> The dat file is a pre-computed index that is used to speed up the >> computation for smaller genomes. For larger genomes or if you are using >> local disk, it should not get created. Do you have the output of the >> pipeline up to this step along with the command-line you used to start the >> run? That will help diagnose why it is not properly recognizing that the >> index is not built. As a workaround, you can add "localStaging=</path to >> local disk>" to your PBcR command which will force the pipeline to never >> pre-compute the index. >> > > The command that I ran was: > */sw/wgs-8.2alpha/Linux-amd64/bin/PBcR "mhap=-k 16 --num-hashes 1256 > --num-min-matches 3 --threshold 0.04 > localStaging=/path_to_working_dir/temp_staging" merSize=16 -length 500 > -partitions 200 -threads 27 -lib* > *aryname PBcR -s pacbio.spec > fastqFile=filtered_subreads.bbmap.rm_adapters.split.fastq -genomeSize > 1000000000* > > I changed the MHAP settings according to the PBcR wiki since I only have > about 16x coverage of PacBio data. > > I should mention that runCA continues to run until the '5-consensus' step, > and errors out there. But I think the start of the problem is at this > overlap step. 
> > The relevant output was: > *### Reading options from 'pacbio.spec'* > *### Reading options from the command line.* > > *Warning: no frag files specified, assuming self-correction of pacbio > sequences.* > *Running with 27 threads and 200 partitions* > ********** Starting correction...* > *...* > ********* Configuration Summary ********* > *bankPath = * > *maxCoverage = 40* > *...* > *mhap = -k 16 --num-hashes 1256 --num-min-matches 3 --threshold 0.04 > localStaging=/path_to_working_dir/temp_staging* > *ovlRefBlockLength = 100000000000* > *cnsErrorRate = 0.25* > *...* > *----------------------------------------START Wed Jun 25 11:24:30 2014* > *mkdir tempPBcR* > *----------------------------------------END Wed Jun 25 11:24:30 2014 (0 > seconds)* > *----------------------------------------START Wed Jun 25 11:24:30 2014* > */sw/wgs-8.2alpha/Linux-amd64/bin/fastqToCA -libraryname PBcR -type sanger > -technology none -feature doConsensusCorrection 1 -reads > /path_to_working_dir/filtered_subreads.bbmap.rm_adapters.split.fastq > > /path_to_working_dir//tempPBcR/PBcR.frg* > *----------------------------------------END Wed Jun 25 11:24:30 2014 (0 > seconds)* > *----------------------------------------START Wed Jun 25 11:24:30 2014* > */sw/wgs-8.2alpha/Linux-amd64/bin/runCA -s > /path_to_working_dir//tempPBcR/PBcR.spec -p asm -d tempPBcR > stopAfter=initialStoreBuilding /path_to_working_dir//tempPBcR/PBcR.frg* > *----------------------------------------START Wed Jun 25 11:24:30 2014* > */sw/wgs-8.2alpha/Linux-amd64/bin/gatekeeper -o > /path_to_working_dir/tempPBcR/asm.gkpStore.BUILDING -F > /path_to_working_dir//tempPBcR/PBcR.frg > > /path_to_working_dir/tempPBcR/asm.gkpStore.err 2>&1* > *----------------------------------------END Wed Jun 25 11:35:27 2014 (657 > seconds)* > *numFrags = 2995674* > *Stop requested after 'initialstorebuilding'.* > *----------------------------------------END Wed Jun 25 11:35:27 2014 (657 > seconds)* > *Will be correcting PacBio library 1 with librarie[s] 1 - 1* > *----------------------------------------START Wed Jun 25 11:35:29 2014* > */sw/wgs-8.2alpha/Linux-amd64/bin/gatekeeper -dumpfragments -invert > -tabular -longestovermin 1 500 -longestlength 1 8268329152 > /path_to_working_dir//tempPBcR/asm.gkpStore 2> > /path_to_working_dir//tempPBcR/asm.seedlength |awk '{if (!(match($1, "UID") > != 0 && length($1) == 3)) { print "frg uid "$1" isdeleted 1"; } }' > > /path_to_working_dir//tempPBcR/asm.toerase.uid* > *----------------------------------------END Wed Jun 25 11:35:38 2014 (9 > seconds)* > *----------------------------------------START Wed Jun 25 11:35:38 2014* > */sw/wgs-8.2alpha/Linux-amd64/bin/gatekeeper --edit > /path_to_working_dir//tempPBcR/asm.toerase.uid > /path_to_working_dir//tempPBcR/asm.gkpStore > > /path_to_working_dir//tempPBcR/asm.toerase.out 2> > /path_to_working_dir//tempPBcR/asm.toerase.err* > *----------------------------------------END Wed Jun 25 11:35:44 2014 (6 > seconds)* > *Running with 8.268329256X (for genome size 1000000000) of PBcR sequences > (8268329256 bp).* > *Correcting with 16X sequences (16536658304 bp).* > *Warning: performing self-correction with a total of 16. 
For best > performance, at least 50 is recommended.* > *----------------------------------------START Wed Jun 25 11:35:44 2014* > */sw/wgs-8.2alpha/Linux-amd64/bin/jellyfish count -m 16 -s 120000000 -t > 32 -o /path_to_working_dir//tempPBcR/asm.mers > /path_to_working_dir/filtered_subreads.bbmap.rm_adapters.split.fastq* > *----------------------------------------END Wed Jun 25 12:05:11 2014 > (1767 seconds)* > *----------------------------------------START Wed Jun 25 12:05:11 2014* > */sw/wgs-8.2alpha/Linux-amd64/bin/jellyfish histo -t 32 -f > /path_to_working_dir//tempPBcR/asm.mers > > /path_to_working_dir//tempPBcR/asm.hist* > *----------------------------------------END Wed Jun 25 12:09:10 2014 (239 > seconds)* > *----------------------------------------START Wed Jun 25 12:09:10 2014* > */sw/wgs-8.2alpha/Linux-amd64/bin/jellyfish dump -c -t -L 34 > /path_to_working_dir//tempPBcR/asm.mers |awk -v TOTAL=3328265613 > '{printf("%s\t%0.10f\t%d\t%d\n", $1, $2/TOTAL, $2, TOTAL)}' |sort -T . > -rnk2> /path_to_working_dir//tempPBcR/asm.ignore* > *----------------------------------------END Wed Jun 25 12:21:17 2014 (727 > seconds)* > *----------------------------------------START Wed Jun 25 12:21:17 2014* > *rm /path_to_working_dir//tempPBcR/asm.mers** > *----------------------------------------END Wed Jun 25 12:21:23 2014 (6 > seconds)* > *----------------------------------------START Wed Jun 25 12:21:23 2014* > *mkdir /path_to_working_dir//tempPBcR/1-overlapper* > *----------------------------------------END Wed Jun 25 12:21:23 2014 (0 > seconds)* > *----------------------------------------START Wed Jun 25 12:21:23 2014* > */sw/wgs-8.2alpha/Linux-amd64/bin/gatekeeper -dumpfragments -tabular > asm.gkpStore |awk '{print $1"\t"$2}' > asm.eidToIID* > *----------------------------------------END Wed Jun 25 12:21:28 2014 (5 > seconds)* > *----------------------------------------START Wed Jun 25 12:21:28 2014* > */sw/wgs-8.2alpha/Linux-amd64/bin/gatekeeper -dumpfragments -tabular > asm.gkpStore |awk '{print $2"\t"$10}' > asm.iidToLen* > *----------------------------------------END Wed Jun 25 12:21:33 2014 (5 > seconds)* > *----------------------------------------START CONCURRENT Wed Jun 25 > 12:21:33 2014* > */path_to_working_dir//tempPBcR/1-overlapper/ovlprep.sh 1* > *Scanning store to find libraries used and reads to dump.* > *Added 0 reads to maintain mate relationships.* > *Dumping 0 fragments from unknown library (version 1 has these)* > *Dumping 133125 fragments from library IID 1* > */path_to_working_dir//tempPBcR/1-overlapper/ovlprep.sh 2* > *Scanning store to find libraries used and reads to dump.* > *Added 0 reads to maintain mate relationships.* > *...* > */path_to_working_dir//tempPBcR/1-overlapper/ovlprep.sh 23* > *Scanning store to find libraries used and reads to dump.* > *Added 0 reads to maintain mate relationships.* > *Dumping 0 fragments from unknown library (version 1 has these)* > *Dumping 66924 fragments from library IID 1* > *----------------------------------------END CONCURRENT Wed Jun 25 > 12:27:16 2014 (343 seconds)* > *----------------------------------------START CONCURRENT Wed Jun 25 > 12:27:16 2014* > */path_to_working_dir//tempPBcR/1-overlapper/overlap.sh 1* > *Running partition 000001 with options -h 1-133125 -r 133126-1597500 start > 133125 end 1597500 total 1464375 zero job 0 and stride 1* > */path_to_working_dir//tempPBcR/1-overlapper/overlap.sh 2* > *Running partition 000002 with options -h 1-133125 -r 1597501-2995674 > start 1597500 end 2995674 total 1398174 zero 
job 0 and stride 1* > *...* > > > Thanks, > Matt > > >> >> >> There is no 'correct_reads_part000002.dat' file there, but there is a >> 'correct_reads_part000002.fasta' file where the >> 'stream_1/correct_reads_part000002.dat' points to. I'm not sure if it is >> just an extension naming issue or if the .dat files weren't created >> properly. >> >> Also, I've found another minor issue with the '*-threads*' option >> supplied to PBcR on the command line. It doesn't seem to use the number of >> threads supplied and simply uses the max number of cpus on the machine >> available. >> >> Thanks, I'll check this and update the code. >> >> >> Thanks, >> Matt >> >> ------------------------------------------------------------------------------ >> Open source business process management suite built on Java and Eclipse >> Turn processes into business applications with Bonita BPM Community >> Edition >> Quickly connect people, data, and systems into organized workflows >> Winner of BOSSIE, CODIE, OW2 and Gartner awards >> >> http://p.sf.net/sfu/Bonitasoft_______________________________________________ >> wgs-assembler-users mailing list >> wgs...@li... >> https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users >> >> >> > > |
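Since the localStaging workaround comes up several times in this exchange, here is a minimal sketch of the command with localStaging passed as its own key=value argument instead of inside the quoted mhap string (in the command shown earlier it sits inside the mhap quotes, which may or may not be why it was ignored; Sergey's reply points at a precompute bug as well). The staging path is a placeholder; the remaining flags are copied from the command already shown in this thread.

# Sketch only: /local/scratch/pbcr_staging is a placeholder for a real local disk.
/sw/wgs-8.2alpha/Linux-amd64/bin/PBcR \
    "mhap=-k 16 --num-hashes 1256 --num-min-matches 3 --threshold 0.04" \
    merSize=16 \
    localStaging=/local/scratch/pbcr_staging \
    -length 500 -partitions 200 -threads 27 \
    -libraryname PBcR -s pacbio.spec \
    fastqFile=filtered_subreads.bbmap.rm_adapters.split.fastq \
    -genomeSize 1000000000

With localStaging recognized, the pipeline should never try to pre-compute the stream_1/*.dat index at all, per Sergey's description below.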
From: Serge K. <ser...@gm...> - 2014-06-26 18:34:25
|
Hi, Thanks, yes this looks like a bug in that the code recognized your genome is too big to do the precompute but didn't properly turn it off. Adding the localStaging="<path to local disk on node>" should let you work around the issue. We will make a new release candidate and fix this bug and the other one you encountered. I will say that with 16X you are probably not going to get a very good assembly because you'll likely have less than 10X after correction. I'd suggest trying ECTools as well (https://github.com/jgurtowski/ectools) as it is designed to work best with coverage in the 10-20X range in combination with short-read sequencing data. Sergey On Jun 25, 2014, at 2:33 PM, Matthew Conte <co...@gm...> wrote: > Hi Serge, > > On Wed, Jun 25, 2014 at 11:36 AM, Serge Koren <ser...@gm...> wrote: > Hi, > > On Jun 24, 2014, at 5:40 PM, Matthew Conte <co...@gm...> wrote: > >> Hi all, >> >> I'm trying out PBcR to make use of the new MHAP overlapper for self correcting a set of PacBio reads and I'm running into an issue. >> >> I'm getting the following errors in the temp_dir/1-overlapper/1.err: >> Exception in thread "main" java.io.FileNotFoundException: /raid3/PBcR_CA_8.2_alpha/tempLibrary/1-overlapper/stream_1/correct_reads_part000002.dat (No such file or directory) > The dat file is a pre-computed index that is used to speed up the computation for smaller genomes. For larger genomes or if you are using local disk, it should not get created. Do you have the output of the pipeline up to this step along with the command-line you used to start the run? That will help diagnose why it is not properly recognizing that the index is not built. As a workaround, you can add "localStaging=</path to local disk>" to your PBcR command which will force the pipeline to never pre-compute the index. > > The command that I ran was: > /sw/wgs-8.2alpha/Linux-amd64/bin/PBcR "mhap=-k 16 --num-hashes 1256 --num-min-matches 3 --threshold 0.04 localStaging=/path_to_working_dir/temp_staging" merSize=16 -length 500 -partitions 200 -threads 27 -lib > aryname PBcR -s pacbio.spec fastqFile=filtered_subreads.bbmap.rm_adapters.split.fastq -genomeSize 1000000000 > > I changed the MHAP settings according to the PBcR wiki since I only have about 16x coverage of PacBio data. > > I should mention that runCA continues to run until the '5-consensus' step, and errors out there. But I think the start of the problem is at this overlap step. > > The relevant output was: > ### Reading options from 'pacbio.spec' > ### Reading options from the command line. > > Warning: no frag files specified, assuming self-correction of pacbio sequences. > Running with 27 threads and 200 partitions > ********* Starting correction... > ... > ******** Configuration Summary ******** > bankPath = > maxCoverage = 40 > ... > mhap = -k 16 --num-hashes 1256 --num-min-matches 3 --threshold 0.04 localStaging=/path_to_working_dir/temp_staging > ovlRefBlockLength = 100000000000 > cnsErrorRate = 0.25 > ... 
> ----------------------------------------START Wed Jun 25 11:24:30 2014 > mkdir tempPBcR > ----------------------------------------END Wed Jun 25 11:24:30 2014 (0 seconds) > ----------------------------------------START Wed Jun 25 11:24:30 2014 > /sw/wgs-8.2alpha/Linux-amd64/bin/fastqToCA -libraryname PBcR -type sanger -technology none -feature doConsensusCorrection 1 -reads /path_to_working_dir/filtered_subreads.bbmap.rm_adapters.split.fastq > /path_to_working_dir//tempPBcR/PBcR.frg > ----------------------------------------END Wed Jun 25 11:24:30 2014 (0 seconds) > ----------------------------------------START Wed Jun 25 11:24:30 2014 > /sw/wgs-8.2alpha/Linux-amd64/bin/runCA -s /path_to_working_dir//tempPBcR/PBcR.spec -p asm -d tempPBcR stopAfter=initialStoreBuilding /path_to_working_dir//tempPBcR/PBcR.frg > ----------------------------------------START Wed Jun 25 11:24:30 2014 > /sw/wgs-8.2alpha/Linux-amd64/bin/gatekeeper -o /path_to_working_dir/tempPBcR/asm.gkpStore.BUILDING -F /path_to_working_dir//tempPBcR/PBcR.frg > /path_to_working_dir/tempPBcR/asm.gkpStore.err 2>&1 > ----------------------------------------END Wed Jun 25 11:35:27 2014 (657 seconds) > numFrags = 2995674 > Stop requested after 'initialstorebuilding'. > ----------------------------------------END Wed Jun 25 11:35:27 2014 (657 seconds) > Will be correcting PacBio library 1 with librarie[s] 1 - 1 > ----------------------------------------START Wed Jun 25 11:35:29 2014 > /sw/wgs-8.2alpha/Linux-amd64/bin/gatekeeper -dumpfragments -invert -tabular -longestovermin 1 500 -longestlength 1 8268329152 /path_to_working_dir//tempPBcR/asm.gkpStore 2> /path_to_working_dir//tempPBcR/asm.seedlength |awk '{if (!(match($1, "UID") != 0 && length($1) == 3)) { print "frg uid "$1" isdeleted 1"; } }' > /path_to_working_dir//tempPBcR/asm.toerase.uid > ----------------------------------------END Wed Jun 25 11:35:38 2014 (9 seconds) > ----------------------------------------START Wed Jun 25 11:35:38 2014 > /sw/wgs-8.2alpha/Linux-amd64/bin/gatekeeper --edit /path_to_working_dir//tempPBcR/asm.toerase.uid /path_to_working_dir//tempPBcR/asm.gkpStore > /path_to_working_dir//tempPBcR/asm.toerase.out 2> /path_to_working_dir//tempPBcR/asm.toerase.err > ----------------------------------------END Wed Jun 25 11:35:44 2014 (6 seconds) > Running with 8.268329256X (for genome size 1000000000) of PBcR sequences (8268329256 bp). > Correcting with 16X sequences (16536658304 bp). > Warning: performing self-correction with a total of 16. For best performance, at least 50 is recommended. > ----------------------------------------START Wed Jun 25 11:35:44 2014 > /sw/wgs-8.2alpha/Linux-amd64/bin/jellyfish count -m 16 -s 120000000 -t 32 -o /path_to_working_dir//tempPBcR/asm.mers /path_to_working_dir/filtered_subreads.bbmap.rm_adapters.split.fastq > ----------------------------------------END Wed Jun 25 12:05:11 2014 (1767 seconds) > ----------------------------------------START Wed Jun 25 12:05:11 2014 > /sw/wgs-8.2alpha/Linux-amd64/bin/jellyfish histo -t 32 -f /path_to_working_dir//tempPBcR/asm.mers > /path_to_working_dir//tempPBcR/asm.hist > ----------------------------------------END Wed Jun 25 12:09:10 2014 (239 seconds) > ----------------------------------------START Wed Jun 25 12:09:10 2014 > /sw/wgs-8.2alpha/Linux-amd64/bin/jellyfish dump -c -t -L 34 /path_to_working_dir//tempPBcR/asm.mers |awk -v TOTAL=3328265613 '{printf("%s\t%0.10f\t%d\t%d\n", $1, $2/TOTAL, $2, TOTAL)}' |sort -T . 
-rnk2> /path_to_working_dir//tempPBcR/asm.ignore > ----------------------------------------END Wed Jun 25 12:21:17 2014 (727 seconds) > ----------------------------------------START Wed Jun 25 12:21:17 2014 > rm /path_to_working_dir//tempPBcR/asm.mers* > ----------------------------------------END Wed Jun 25 12:21:23 2014 (6 seconds) > ----------------------------------------START Wed Jun 25 12:21:23 2014 > mkdir /path_to_working_dir//tempPBcR/1-overlapper > ----------------------------------------END Wed Jun 25 12:21:23 2014 (0 seconds) > ----------------------------------------START Wed Jun 25 12:21:23 2014 > /sw/wgs-8.2alpha/Linux-amd64/bin/gatekeeper -dumpfragments -tabular asm.gkpStore |awk '{print $1"\t"$2}' > asm.eidToIID > ----------------------------------------END Wed Jun 25 12:21:28 2014 (5 seconds) > ----------------------------------------START Wed Jun 25 12:21:28 2014 > /sw/wgs-8.2alpha/Linux-amd64/bin/gatekeeper -dumpfragments -tabular asm.gkpStore |awk '{print $2"\t"$10}' > asm.iidToLen > ----------------------------------------END Wed Jun 25 12:21:33 2014 (5 seconds) > ----------------------------------------START CONCURRENT Wed Jun 25 12:21:33 2014 > /path_to_working_dir//tempPBcR/1-overlapper/ovlprep.sh 1 > Scanning store to find libraries used and reads to dump. > Added 0 reads to maintain mate relationships. > Dumping 0 fragments from unknown library (version 1 has these) > Dumping 133125 fragments from library IID 1 > /path_to_working_dir//tempPBcR/1-overlapper/ovlprep.sh 2 > Scanning store to find libraries used and reads to dump. > Added 0 reads to maintain mate relationships. > ... > /path_to_working_dir//tempPBcR/1-overlapper/ovlprep.sh 23 > Scanning store to find libraries used and reads to dump. > Added 0 reads to maintain mate relationships. > Dumping 0 fragments from unknown library (version 1 has these) > Dumping 66924 fragments from library IID 1 > ----------------------------------------END CONCURRENT Wed Jun 25 12:27:16 2014 (343 seconds) > ----------------------------------------START CONCURRENT Wed Jun 25 12:27:16 2014 > /path_to_working_dir//tempPBcR/1-overlapper/overlap.sh 1 > Running partition 000001 with options -h 1-133125 -r 133126-1597500 start 133125 end 1597500 total 1464375 zero job 0 and stride 1 > /path_to_working_dir//tempPBcR/1-overlapper/overlap.sh 2 > Running partition 000002 with options -h 1-133125 -r 1597501-2995674 start 1597500 end 2995674 total 1398174 zero job 0 and stride 1 > ... > > > Thanks, > Matt > > >> >> There is no 'correct_reads_part000002.dat' file there, but there is a 'correct_reads_part000002.fasta' file where the 'stream_1/correct_reads_part000002.dat' points to. I'm not sure if it is just an extension naming issue or if the .dat files weren't created properly. >> >> Also, I've found another minor issue with the '-threads' option supplied to PBcR on the command line. It doesn't seem to use the number of threads supplied and simply uses the max number of cpus on the machine available. > Thanks, I'll check this and update the code. 
>> >> Thanks, >> Matt >> ------------------------------------------------------------------------------ >> Open source business process management suite built on Java and Eclipse >> Turn processes into business applications with Bonita BPM Community Edition >> Quickly connect people, data, and systems into organized workflows >> Winner of BOSSIE, CODIE, OW2 and Gartner awards >> http://p.sf.net/sfu/Bonitasoft_______________________________________________ >> wgs-assembler-users mailing list >> wgs...@li... >> https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users > > |
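Since the 16X-versus-10X point comes up repeatedly in this thread, a rough way to check how much coverage a read set actually represents is to count the bases directly. A sketch, assuming an uncompressed four-line-per-record FASTQ and the 1 Gbp genome-size estimate used above:

# Total bases and approximate coverage for an assumed genome size (placeholder value).
awk -v GENOME_SIZE=1000000000 'NR % 4 == 2 { bp += length($0) } END { printf "%.0f bp, %.2fX coverage\n", bp, bp / GENOME_SIZE }' filtered_subreads.bbmap.rm_adapters.split.fastq

Run again on the corrected reads (if they are in FASTQ form), the same one-liner gives a quick before/after comparison against the 10-20X range ECTools is said to target.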
From: Frank B. <fra...@gm...> - 2014-06-26 14:42:52

|
Hi everybody, Has anybody had the opportunity to use Illumina's synthetic long reads (SLR = Moleculo) with CABOG? I am curious whether particular overlap parameters need to be adjusted to account for the high accuracy and haplotype nature of the assembled fragments. I am trying to compile recommendations on what to do with SLRs for new customers, and CABOG is at the top of my list of assemblers. Typical genome coverage generated in the beta program through FastTrack Services has been low, and I am curious whether anybody is willing to share their early experience with Illumina's long-read technology. Thank you very much for your help. Kind regards, Frank Frank Boellmann, PhD Regional Marketing Specialist, Informatics Illumina |
From: Matthew C. <co...@gm...> - 2014-06-25 18:33:41
|
Hi Serge, On Wed, Jun 25, 2014 at 11:36 AM, Serge Koren <ser...@gm...> wrote: > Hi, > > On Jun 24, 2014, at 5:40 PM, Matthew Conte <co...@gm...> wrote: > > Hi all, > > I'm trying out PBcR to make use of the new MHAP overlapper for self > correcting a set of PacBio reads and I'm running into an issue. > > I'm getting the following errors in the temp_dir/1-overlapper/1.err: > *Exception in thread "main" java.io.FileNotFoundException: > /raid3/PBcR_CA_8.2_alpha/tempLibrary/1-overlapper/stream_1/correct_reads_part000002.dat > (No such file or directory)* > > The dat file is a pre-computed index that is used to speed up the > computation for smaller genomes. For larger genomes or if you are using > local disk, it should not get created. Do you have the output of the > pipeline up to this step along with the command-line you used to start the > run? That will help diagnose why it is not properly recognizing that the > index is not built. As a workaround, you can add "localStaging=</path to > local disk>" to your PBcR command which will force the pipeline to never > pre-compute the index. > The command that I ran was: */sw/wgs-8.2alpha/Linux-amd64/bin/PBcR "mhap=-k 16 --num-hashes 1256 --num-min-matches 3 --threshold 0.04 localStaging=/path_to_working_dir/temp_staging" merSize=16 -length 500 -partitions 200 -threads 27 -lib* *aryname PBcR -s pacbio.spec fastqFile=filtered_subreads.bbmap.rm_adapters.split.fastq -genomeSize 1000000000* I changed the MHAP settings according to the PBcR wiki since I only have about 16x coverage of PacBio data. I should mention that runCA continues to run until the '5-consensus' step, and errors out there. But I think the start of the problem is at this overlap step. The relevant output was: *### Reading options from 'pacbio.spec'* *### Reading options from the command line.* *Warning: no frag files specified, assuming self-correction of pacbio sequences.* *Running with 27 threads and 200 partitions* ********** Starting correction...* *...* ********* Configuration Summary ********* *bankPath = * *maxCoverage = 40* *...* *mhap = -k 16 --num-hashes 1256 --num-min-matches 3 --threshold 0.04 localStaging=/path_to_working_dir/temp_staging* *ovlRefBlockLength = 100000000000* *cnsErrorRate = 0.25* *...* *----------------------------------------START Wed Jun 25 11:24:30 2014* *mkdir tempPBcR* *----------------------------------------END Wed Jun 25 11:24:30 2014 (0 seconds)* *----------------------------------------START Wed Jun 25 11:24:30 2014* */sw/wgs-8.2alpha/Linux-amd64/bin/fastqToCA -libraryname PBcR -type sanger -technology none -feature doConsensusCorrection 1 -reads /path_to_working_dir/filtered_subreads.bbmap.rm_adapters.split.fastq > /path_to_working_dir//tempPBcR/PBcR.frg* *----------------------------------------END Wed Jun 25 11:24:30 2014 (0 seconds)* *----------------------------------------START Wed Jun 25 11:24:30 2014* */sw/wgs-8.2alpha/Linux-amd64/bin/runCA -s /path_to_working_dir//tempPBcR/PBcR.spec -p asm -d tempPBcR stopAfter=initialStoreBuilding /path_to_working_dir//tempPBcR/PBcR.frg* *----------------------------------------START Wed Jun 25 11:24:30 2014* */sw/wgs-8.2alpha/Linux-amd64/bin/gatekeeper -o /path_to_working_dir/tempPBcR/asm.gkpStore.BUILDING -F /path_to_working_dir//tempPBcR/PBcR.frg > /path_to_working_dir/tempPBcR/asm.gkpStore.err 2>&1* *----------------------------------------END Wed Jun 25 11:35:27 2014 (657 seconds)* *numFrags = 2995674* *Stop requested after 'initialstorebuilding'.* *----------------------------------------END Wed 
Jun 25 11:35:27 2014 (657 seconds)* *Will be correcting PacBio library 1 with librarie[s] 1 - 1* *----------------------------------------START Wed Jun 25 11:35:29 2014* */sw/wgs-8.2alpha/Linux-amd64/bin/gatekeeper -dumpfragments -invert -tabular -longestovermin 1 500 -longestlength 1 8268329152 /path_to_working_dir//tempPBcR/asm.gkpStore 2> /path_to_working_dir//tempPBcR/asm.seedlength |awk '{if (!(match($1, "UID") != 0 && length($1) == 3)) { print "frg uid "$1" isdeleted 1"; } }' > /path_to_working_dir//tempPBcR/asm.toerase.uid* *----------------------------------------END Wed Jun 25 11:35:38 2014 (9 seconds)* *----------------------------------------START Wed Jun 25 11:35:38 2014* */sw/wgs-8.2alpha/Linux-amd64/bin/gatekeeper --edit /path_to_working_dir//tempPBcR/asm.toerase.uid /path_to_working_dir//tempPBcR/asm.gkpStore > /path_to_working_dir//tempPBcR/asm.toerase.out 2> /path_to_working_dir//tempPBcR/asm.toerase.err* *----------------------------------------END Wed Jun 25 11:35:44 2014 (6 seconds)* *Running with 8.268329256X (for genome size 1000000000) of PBcR sequences (8268329256 bp).* *Correcting with 16X sequences (16536658304 bp).* *Warning: performing self-correction with a total of 16. For best performance, at least 50 is recommended.* *----------------------------------------START Wed Jun 25 11:35:44 2014* */sw/wgs-8.2alpha/Linux-amd64/bin/jellyfish count -m 16 -s 120000000 -t 32 -o /path_to_working_dir//tempPBcR/asm.mers /path_to_working_dir/filtered_subreads.bbmap.rm_adapters.split.fastq* *----------------------------------------END Wed Jun 25 12:05:11 2014 (1767 seconds)* *----------------------------------------START Wed Jun 25 12:05:11 2014* */sw/wgs-8.2alpha/Linux-amd64/bin/jellyfish histo -t 32 -f /path_to_working_dir//tempPBcR/asm.mers > /path_to_working_dir//tempPBcR/asm.hist* *----------------------------------------END Wed Jun 25 12:09:10 2014 (239 seconds)* *----------------------------------------START Wed Jun 25 12:09:10 2014* */sw/wgs-8.2alpha/Linux-amd64/bin/jellyfish dump -c -t -L 34 /path_to_working_dir//tempPBcR/asm.mers |awk -v TOTAL=3328265613 '{printf("%s\t%0.10f\t%d\t%d\n", $1, $2/TOTAL, $2, TOTAL)}' |sort -T . 
-rnk2> /path_to_working_dir//tempPBcR/asm.ignore* *----------------------------------------END Wed Jun 25 12:21:17 2014 (727 seconds)* *----------------------------------------START Wed Jun 25 12:21:17 2014* *rm /path_to_working_dir//tempPBcR/asm.mers** *----------------------------------------END Wed Jun 25 12:21:23 2014 (6 seconds)* *----------------------------------------START Wed Jun 25 12:21:23 2014* *mkdir /path_to_working_dir//tempPBcR/1-overlapper* *----------------------------------------END Wed Jun 25 12:21:23 2014 (0 seconds)* *----------------------------------------START Wed Jun 25 12:21:23 2014* */sw/wgs-8.2alpha/Linux-amd64/bin/gatekeeper -dumpfragments -tabular asm.gkpStore |awk '{print $1"\t"$2}' > asm.eidToIID* *----------------------------------------END Wed Jun 25 12:21:28 2014 (5 seconds)* *----------------------------------------START Wed Jun 25 12:21:28 2014* */sw/wgs-8.2alpha/Linux-amd64/bin/gatekeeper -dumpfragments -tabular asm.gkpStore |awk '{print $2"\t"$10}' > asm.iidToLen* *----------------------------------------END Wed Jun 25 12:21:33 2014 (5 seconds)* *----------------------------------------START CONCURRENT Wed Jun 25 12:21:33 2014* */path_to_working_dir//tempPBcR/1-overlapper/ovlprep.sh 1* *Scanning store to find libraries used and reads to dump.* *Added 0 reads to maintain mate relationships.* *Dumping 0 fragments from unknown library (version 1 has these)* *Dumping 133125 fragments from library IID 1* */path_to_working_dir//tempPBcR/1-overlapper/ovlprep.sh 2* *Scanning store to find libraries used and reads to dump.* *Added 0 reads to maintain mate relationships.* *...* */path_to_working_dir//tempPBcR/1-overlapper/ovlprep.sh 23* *Scanning store to find libraries used and reads to dump.* *Added 0 reads to maintain mate relationships.* *Dumping 0 fragments from unknown library (version 1 has these)* *Dumping 66924 fragments from library IID 1* *----------------------------------------END CONCURRENT Wed Jun 25 12:27:16 2014 (343 seconds)* *----------------------------------------START CONCURRENT Wed Jun 25 12:27:16 2014* */path_to_working_dir//tempPBcR/1-overlapper/overlap.sh 1* *Running partition 000001 with options -h 1-133125 -r 133126-1597500 start 133125 end 1597500 total 1464375 zero job 0 and stride 1* */path_to_working_dir//tempPBcR/1-overlapper/overlap.sh 2* *Running partition 000002 with options -h 1-133125 -r 1597501-2995674 start 1597500 end 2995674 total 1398174 zero job 0 and stride 1* *...* Thanks, Matt > > > There is no 'correct_reads_part000002.dat' file there, but there is a > 'correct_reads_part000002.fasta' file where the > 'stream_1/correct_reads_part000002.dat' points to. I'm not sure if it is > just an extension naming issue or if the .dat files weren't created > properly. > > Also, I've found another minor issue with the '*-threads*' option > supplied to PBcR on the command line. It doesn't seem to use the number of > threads supplied and simply uses the max number of cpus on the machine > available. > > Thanks, I'll check this and update the code. 
> > > Thanks, > Matt > > ------------------------------------------------------------------------------ > Open source business process management suite built on Java and Eclipse > Turn processes into business applications with Bonita BPM Community Edition > Quickly connect people, data, and systems into organized workflows > Winner of BOSSIE, CODIE, OW2 and Gartner awards > > http://p.sf.net/sfu/Bonitasoft_______________________________________________ > wgs-assembler-users mailing list > wgs...@li... > https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users > > > |
From: Serge K. <ser...@gm...> - 2014-06-25 15:36:49
|
Hi, On Jun 24, 2014, at 5:40 PM, Matthew Conte <co...@gm...> wrote: > Hi all, > > I'm trying out PBcR to make use of the new MHAP overlapper for self correcting a set of PacBio reads and I'm running into an issue. > > I'm getting the following errors in the temp_dir/1-overlapper/1.err: > Exception in thread "main" java.io.FileNotFoundException: /raid3/PBcR_CA_8.2_alpha/tempLibrary/1-overlapper/stream_1/correct_reads_part000002.dat (No such file or directory) The dat file is a pre-computed index that is used to speed up the computation for smaller genomes. For larger genomes or if you are using local disk, it should not get created. Do you have the output of the pipeline up to this step along with the command-line you used to start the run? That will help diagnose why it is not properly recognizing that the index is not built. As a workaround, you can add "localStaging=</path to local disk>" to your PBcR command which will force the pipeline to never pre-compute the index. > > There is no 'correct_reads_part000002.dat' file there, but there is a 'correct_reads_part000002.fasta' file where the 'stream_1/correct_reads_part000002.dat' points to. I'm not sure if it is just an extension naming issue or if the .dat files weren't created properly. > > Also, I've found another minor issue with the '-threads' option supplied to PBcR on the command line. It doesn't seem to use the number of threads supplied and simply uses the max number of cpus on the machine available. Thanks, I'll check this and update the code. > > Thanks, > Matt > ------------------------------------------------------------------------------ > Open source business process management suite built on Java and Eclipse > Turn processes into business applications with Bonita BPM Community Edition > Quickly connect people, data, and systems into organized workflows > Winner of BOSSIE, CODIE, OW2 and Gartner awards > http://p.sf.net/sfu/Bonitasoft_______________________________________________ > wgs-assembler-users mailing list > wgs...@li... > https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users |
From: Matthew C. <co...@gm...> - 2014-06-24 21:40:43
|
Hi all, I'm trying out PBcR to make use of the new MHAP overlapper for self-correcting a set of PacBio reads, and I'm running into an issue. I'm getting the following error in the temp_dir/1-overlapper/1.err: *Exception in thread "main" java.io.FileNotFoundException: /raid3/PBcR_CA_8.2_alpha/tempLibrary/1-overlapper/stream_1/correct_reads_part000002.dat (No such file or directory)* There is no 'correct_reads_part000002.dat' file there, but there is a 'correct_reads_part000002.fasta' file at the location that 'stream_1/correct_reads_part000002.dat' points to. I'm not sure if it is just an extension naming issue or if the .dat files weren't created properly. Also, I've found another minor issue with the '*-threads*' option supplied to PBcR on the command line: it doesn't seem to use the number of threads supplied and simply uses the maximum number of CPUs available on the machine. Thanks, Matt |
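Not a fix, but a quick way to see how far the staging step got before the exception. A sketch using the paths taken from the error message above:

# Which overlap jobs hit the missing-index error, and what actually exists in stream_1.
grep -l 'FileNotFoundException' /raid3/PBcR_CA_8.2_alpha/tempLibrary/1-overlapper/*.err
ls -l /raid3/PBcR_CA_8.2_alpha/tempLibrary/1-overlapper/stream_1/ | grep -E 'correct_reads_part[0-9]+\.(dat|fasta)'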
From: Santiago R. <san...@gm...> - 2014-06-24 14:57:28
|
Hi guys, 1. ovb files were using 1.2T (they were not compressed) and .fasta, .qual and .qv, another 850Gb. All gone now. 2. in regards the -pbCNS option, no, haven't seen it by the time I've started. My problem now is that the process has been running for 3 days and at the moment it is using about 97.2% of available memory (and growing). It is a 256Gb standalone server where I'm just a guest. Should I wait a little more for it to finished? Why is it using all available memory? It is running the layout step (runCorrection.sh script). I'm attaching the pacBioToCA log, the runCorrection.sh script and the asm.layout.err file as a reference for the options, specs and status. Any help would be really appreciated. Thank you very much in advance again. Regards, Santiago On Mon, Jun 23, 2014 at 2:27 PM, Serge Koren <ser...@gm...> wrote: > Hi, > > 1. Yes, as long as you have the asm.ovlStore constructed you can delete > the contents of the 1-overlapper directory. I'm guessing it is fasta/qual > files that are taking al the space > > 2. The overlapping is the most expensive part of the computation so the > remaining steps should be relatively quick. The consensus can be another > expensive step. I'm not sure if you specified -pbCNS when you ran > pacBioToCA but if you haven't relaunched the run yet, you can add that > option and it will use a faster consensus module (which is actually on by > default in the next CA release). > > Sergey > > On Jun 21, 2014, at 11:58 AM, Santiago Revale <san...@gm...> > wrote: > > Hi Brian/Serge, > > Brian's patch worked like a charm. I'll be continue executing the > pacBioToCA script. > > A couple of quick questions before: > > 1) can I delete the "1-overlapper/" directory before the pacBioToCA script > ended? Because it is 2Tb long as "asm.ovlStore" is that size too (1.8Tb). > > 2) could you give an estimated time the remaining portion of the script > would take? And also an estimate on cores and memory usage? > > Thank you very much for your help and assistance. > > Regards, > Santiago > > > On Thu, Jun 19, 2014 at 12:53 PM, Santiago Revale < > san...@gm...> wrote: > >> Thank you very much, guys. >> >> I'll be trying your suggestions this days, starting from Brian's, and >> I'll be back to you with the outcome. >> >> Regards, >> Santiago >> >> >> >> On Thu, Jun 19, 2014 at 8:34 AM, Brian Walenz <th...@gm...> wrote: >> >>> Sergey is right; the vacation must be getting to me... >>> >>> Here is a simple patch to AS_OVS/overlapStoreBuild.C. This will change >>> the way the data is partitioned, so that the first partitions are merged >>> into a few and the last one is split into many. This should result in >>> partitions of around 10gb in size -- the 1tb partition should be split into >>> 128 pieces. >>> >>> The change is only an addition of ~15 lines, to function >>> writeToDumpFile(). The new lines are enclosed in a #if/#endif block, >>> currently enabled. You can just drop this file into a svn checkout and >>> recompile. DO NOT USE FOR PRODUCTION! There are hardcoded values specific >>> to your assembly. Please do check these values against gatekeeper >>> dumpinfo. I don't think they're critical to be exact, but if I'm off by an >>> order of magnitude, it probably won't work well. >>> >>> b >>> >>> >>> >>> >>> >>> On Wed, Jun 18, 2014 at 11:43 PM, Serge Koren <ser...@gm...> >>> wrote: >>> >>>> Hi, >>>> >>>> I don't believe the way the overlaps are created is a problem but the >>>> way the overlap store is doing the partitioning is. 
It looks like you have >>>> about 4X of PacBio data and about 150X of Illumina data. This a larger >>>> difference than we normally use (usually we recommend no more than 50X of >>>> Illumina data and 10X+ PacBio) which is likely why this error has not been >>>> encountered before. The overlaps are only computed between the PacBio and >>>> Illumina reads which are evenly distributed among the partitions so they >>>> should all have approximately the same number of overlaps. This should be >>>> easy to confirm if all your overlap ovb files are approximately the same >>>> size and your output log seems to confirm this. >>>> >>>> The overlap store bucketizing is assuming equal number of overlaps for >>>> each read in your dataset and your Illumina-Illumina overlaps do not exist >>>> so as a result all the IIDs with overlaps end up in the last bucket. You've >>>> got 505,893 pacbio fragments and 1,120,240,607 Illumina reads. To split the >>>> PacBio reads among multiple partitions, you'd want to have be able to open >>>> 10,000-20,000 files (partitions) which is above the current limit you have. >>>> If you can modify it using ulimit -n 50000 and then run the store creation >>>> specifying -f 20480 (or some other large number). That should make your >>>> last partition significantly smaller. If you cannot increase the limit then >>>> modifying the code is the only option. The good news is that if you are >>>> able to build the store, you can re-launch the PBcR pipeline and it will >>>> resume the correction after the overlapping step. >>>> >>>> Sergey >>>> >>>> >>>> The hash is only composed of the last set of reads (PacBio) and the >>>> refr sequences streamed against the hash are the Illumina data. >>>> On Jun 18, 2014, at 8:16 PM, Brian Walenz <th...@gm...> wrote: >>>> >>>> Unfortunately, I'm on vacation at the moment, and finding little time >>>> to spend helping you. >>>> >>>> "Too many open files" is a limit imposed by the OS. Can you increase >>>> this? We've set our large memory machines to allow 100,000 open files. >>>> >>>> The output files sizes -- and the problem you're suffering from -- are >>>> all caused by the way overlaps are created. Correction asked for only >>>> overlaps between Illumina and PacBio reads. All the illumina reads are >>>> 'first' in the store, and all the pacbio reads are at the end. Overlap >>>> jobs will find overlaps between 'other' reads and some subset of the store >>>> - e.g., the first overlap job will process the first 10% of the reads, the >>>> second will do the second 10% of the reads, etc. Since the pacbio are >>>> last, the last job found all the overlaps, so only the last file is of >>>> significant size. This also breaks the partitioning scheme used when >>>> sorting overlaps. It assumes overlaps are distributed randomly, but yours >>>> are all piled up at the end. >>>> >>>> I don't see an easy fix here, but I think I can come up with a one-off >>>> hack to get your store built. Are you comfortable working with C code and >>>> compiling? Send the output of 'gatekeeper -dumpinfo *gkpStore' so I can >>>> see the number of reads per library. >>>> >>>> >>>> >>>> >>>> >>>> >>>> On Tue, Jun 17, 2014 at 6:45 PM, Santiago Revale < >>>> san...@gm...> wrote: >>>> >>>>> Hi Brian, >>>>> >>>>> When using 1024, it said the OS wasn't able to handle it, and it >>>>> recommended using 1008. >>>>> When using 1008, CA ended arguing "Failed to open output file... Too >>>>> many open files". 
>>>>> >>>>> Now I'm trying with fewer parts, but I don't think this would solve >>>>> the problem. >>>>> >>>>> Do you have any more ideas? >>>>> >>>>> Thanks again in advance. >>>>> >>>>> Regards, >>>>> Santiago >>>>> >>>>> >>>>> On Sun, Jun 15, 2014 at 10:10 PM, Santiago Revale < >>>>> san...@gm...> wrote: >>>>> >>>>>> Hi Brian, >>>>>> >>>>>> Thanks for your reply. In regards of your suggestions: >>>>>> >>>>>> 1) the PBcR process generates OVB files without zipping them; just to >>>>>> be sure, I've tried to unzip some of them just in case the extension were >>>>>> missing; >>>>>> >>>>>> 2) I've re-launched the process with the suggested parameters, but >>>>>> using 512 instead of 1024; the result was exactly the same: same error in >>>>>> the same step. Also, again 511 out of 512 files had a size of 2.3Gb while >>>>>> the last file was 1.2Tb long. Do you know why does this happens? >>>>>> >>>>>> I'm trying one last time using 1024 instead. >>>>>> >>>>>> Thanks again for your reply. I'm open to some more suggestions. >>>>>> >>>>>> Regards, >>>>>> Santiago >>>>>> >>>>>> >>>>>> On Fri, Jun 13, 2014 at 4:25 PM, Brian Walenz <th...@gm...> >>>>>> wrote: >>>>>> >>>>>>> Hi- >>>>>>> >>>>>>> This is a flaw in gzip, where it doesn't report the uncompressed >>>>>>> size correctly for files larger than 2gb. I'm not intimately familiar with >>>>>>> this pipeline, so don't know exactly how to implement the fixes below. >>>>>>> >>>>>>> Fix with either: >>>>>>> >>>>>>> 1) gzip -d the *gz files before building the overlap store. The >>>>>>> 'find' command in the log indicates the pipeline will pick up the >>>>>>> uncompressed files. You might need to remove the 'asm.ovlStore.list' file >>>>>>> before restarting (this has the list of inputs to overlapStoreBuild). >>>>>>> >>>>>>> 2) Set ovlStoreMemory to (exactly) "0 -f 1024". This will tell it >>>>>>> to use 0MB memory, and instead use 1024 files regardless of the size. 512 >>>>>>> files will also work, and is a little safer (not near some Linux 'number of >>>>>>> open files' limits). >>>>>>> >>>>>>> 3) Build the overlap store by hand (with either the uncompressed >>>>>>> input, or the -f instead of -M option), outside the script, and then >>>>>>> restart the script. The script will notice there is an overlap store >>>>>>> already present, and skip the build. The command is in the log file -- >>>>>>> make sure the final store is called 'asm.ovlStore', and not >>>>>>> 'asm.ovlStore.BUILDING'. >>>>>>> >>>>>>> Option 1 should work, but option 2 is the easiest to try. I >>>>>>> wouldn't try option 3 until Sergey speaks up. >>>>>>> >>>>>>> b >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Fri, Jun 13, 2014 at 12:33 PM, Santiago Revale < >>>>>>> san...@gm...> wrote: >>>>>>> >>>>>>>> Dear CA community, >>>>>>>> >>>>>>>> I'm running the correction of some PacBio reads with high-identity >>>>>>>> Illumina reads, in a high memory server, for a 750 Mbp genome. I've >>>>>>>> considered the known issues addressed in the website when starting the >>>>>>>> correction. >>>>>>>> >>>>>>>> When executing the pipeline, I've reached to the overlapStoreBuild >>>>>>>> step with 48 ovb files, size 26 Gb each (totaling 1.2Tb). ovls files have >>>>>>>> already been deleted by the script. The error happened while executing >>>>>>>> overlapStoreBuild: >>>>>>>> >>>>>>>> ... >>>>>>>> bucketizing DONE! 
>>>>>>>> overlaps skipped: >>>>>>>> 0 OBT - low quality >>>>>>>> 0 DUP - non-duplicate overlap >>>>>>>> 0 DUP - different library >>>>>>>> 0 DUP - dedup not requested >>>>>>>> terminate called after throwing an instance of 'std::bad_alloc' >>>>>>>> what(): std::bad_alloc >>>>>>>> >>>>>>>> Failed with 'Aborted' >>>>>>>> ... >>>>>>>> >>>>>>>> >>>>>>>> I ran this step twice: the first one having set ovlStoreMemory to >>>>>>>> 8192 Mb, but the second one, set it on 160000 (160 Gb). In the "Overlap >>>>>>>> store failure" FAQ, it mentioned as possible causes "Out of disk space" >>>>>>>> (which is not my case) and "Corrupt gzip files / too many fragments". I >>>>>>>> don't have gzip files and I have only 15 fragments. Also, bucketizing step >>>>>>>> finishes OK. >>>>>>>> >>>>>>>> Also, some odd thing I've noticed (at least odd for me) is that 14 >>>>>>>> of the 15 temp files (tmp.sort.XXX) of the asm.ovlStore.BUILDING folder >>>>>>>> have a size 79Gb while the last one size is 1.2Tb. >>>>>>>> >>>>>>>> Could anybody tell me what could be the cause of this error and how >>>>>>>> to solve it? >>>>>>>> >>>>>>>> I'm attaching the asm.ovlStore.err and the pacBioToCA log files for >>>>>>>> complete descriptions of the error and the executed commands. >>>>>>>> >>>>>>>> Thank you very much in advance. >>>>>>>> >>>>>>>> Regards, >>>>>>>> Santiago >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> ------------------------------------------------------------------------------ >>>>>>>> HPCC Systems Open Source Big Data Platform from LexisNexis Risk >>>>>>>> Solutions >>>>>>>> Find What Matters Most in Your Big Data with HPCC Systems >>>>>>>> Open Source. Fast. Scalable. Simple. Ideal for Dirty Data. >>>>>>>> Leverages Graph Analysis for Fast Processing & Easy Data Exploration >>>>>>>> http://p.sf.net/sfu/hpccsystems >>>>>>>> _______________________________________________ >>>>>>>> wgs-assembler-users mailing list >>>>>>>> wgs...@li... >>>>>>>> https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users >>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>>> ------------------------------------------------------------------------------ >>>> HPCC Systems Open Source Big Data Platform from LexisNexis Risk >>>> Solutions >>>> Find What Matters Most in Your Big Data with HPCC Systems >>>> Open Source. Fast. Scalable. Simple. Ideal for Dirty Data. >>>> Leverages Graph Analysis for Fast Processing & Easy Data Exploration >>>> >>>> http://p.sf.net/sfu/hpccsystems_______________________________________________ >>>> wgs-assembler-users mailing list >>>> wgs...@li... >>>> https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users >>>> >>>> >>>> >>> >> > > |
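For the open-file-limit side of the advice quoted above, the limits can be checked and raised from the shell that launches the store build. A sketch; whether the hard limit can be raised without an administrator depends on the system:

ulimit -Sn          # current soft limit on open files
ulimit -Hn          # hard limit the soft limit can be raised to
ulimit -n 50000     # raise the soft limit for this shell, as suggested by Sergey

With the limit raised, the store build can be relaunched with a large partition count (the -f 20480 suggested above) so the PacBio reads are spread over many smaller buckets.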
From: Serge K. <ser...@gm...> - 2014-06-23 17:27:21
|
Hi, 1. Yes, as long as you have the asm.ovlStore constructed you can delete the contents of the 1-overlapper directory. I'm guessing it is fasta/qual files that are taking al the space 2. The overlapping is the most expensive part of the computation so the remaining steps should be relatively quick. The consensus can be another expensive step. I'm not sure if you specified -pbCNS when you ran pacBioToCA but if you haven't relaunched the run yet, you can add that option and it will use a faster consensus module (which is actually on by default in the next CA release). Sergey On Jun 21, 2014, at 11:58 AM, Santiago Revale <san...@gm...> wrote: > Hi Brian/Serge, > > Brian's patch worked like a charm. I'll be continue executing the pacBioToCA script. > > A couple of quick questions before: > > 1) can I delete the "1-overlapper/" directory before the pacBioToCA script ended? Because it is 2Tb long as "asm.ovlStore" is that size too (1.8Tb). > > 2) could you give an estimated time the remaining portion of the script would take? And also an estimate on cores and memory usage? > > Thank you very much for your help and assistance. > > Regards, > Santiago > > > On Thu, Jun 19, 2014 at 12:53 PM, Santiago Revale <san...@gm...> wrote: > Thank you very much, guys. > > I'll be trying your suggestions this days, starting from Brian's, and I'll be back to you with the outcome. > > Regards, > Santiago > > > > On Thu, Jun 19, 2014 at 8:34 AM, Brian Walenz <th...@gm...> wrote: > Sergey is right; the vacation must be getting to me... > > Here is a simple patch to AS_OVS/overlapStoreBuild.C. This will change the way the data is partitioned, so that the first partitions are merged into a few and the last one is split into many. This should result in partitions of around 10gb in size -- the 1tb partition should be split into 128 pieces. > > The change is only an addition of ~15 lines, to function writeToDumpFile(). The new lines are enclosed in a #if/#endif block, currently enabled. You can just drop this file into a svn checkout and recompile. DO NOT USE FOR PRODUCTION! There are hardcoded values specific to your assembly. Please do check these values against gatekeeper dumpinfo. I don't think they're critical to be exact, but if I'm off by an order of magnitude, it probably won't work well. > > b > > > > > > On Wed, Jun 18, 2014 at 11:43 PM, Serge Koren <ser...@gm...> wrote: > Hi, > > I don't believe the way the overlaps are created is a problem but the way the overlap store is doing the partitioning is. It looks like you have about 4X of PacBio data and about 150X of Illumina data. This a larger difference than we normally use (usually we recommend no more than 50X of Illumina data and 10X+ PacBio) which is likely why this error has not been encountered before. The overlaps are only computed between the PacBio and Illumina reads which are evenly distributed among the partitions so they should all have approximately the same number of overlaps. This should be easy to confirm if all your overlap ovb files are approximately the same size and your output log seems to confirm this. > > The overlap store bucketizing is assuming equal number of overlaps for each read in your dataset and your Illumina-Illumina overlaps do not exist so as a result all the IIDs with overlaps end up in the last bucket. You've got 505,893 pacbio fragments and 1,120,240,607 Illumina reads. 
To split the PacBio reads among multiple partitions, you'd want to have be able to open 10,000-20,000 files (partitions) which is above the current limit you have. If you can modify it using ulimit -n 50000 and then run the store creation specifying -f 20480 (or some other large number). That should make your last partition significantly smaller. If you cannot increase the limit then modifying the code is the only option. The good news is that if you are able to build the store, you can re-launch the PBcR pipeline and it will resume the correction after the overlapping step. > > Sergey > > > The hash is only composed of the last set of reads (PacBio) and the refr sequences streamed against the hash are the Illumina data. > On Jun 18, 2014, at 8:16 PM, Brian Walenz <th...@gm...> wrote: > >> Unfortunately, I'm on vacation at the moment, and finding little time to spend helping you. >> >> "Too many open files" is a limit imposed by the OS. Can you increase this? We've set our large memory machines to allow 100,000 open files. >> >> The output files sizes -- and the problem you're suffering from -- are all caused by the way overlaps are created. Correction asked for only overlaps between Illumina and PacBio reads. All the illumina reads are 'first' in the store, and all the pacbio reads are at the end. Overlap jobs will find overlaps between 'other' reads and some subset of the store - e.g., the first overlap job will process the first 10% of the reads, the second will do the second 10% of the reads, etc. Since the pacbio are last, the last job found all the overlaps, so only the last file is of significant size. This also breaks the partitioning scheme used when sorting overlaps. It assumes overlaps are distributed randomly, but yours are all piled up at the end. >> >> I don't see an easy fix here, but I think I can come up with a one-off hack to get your store built. Are you comfortable working with C code and compiling? Send the output of 'gatekeeper -dumpinfo *gkpStore' so I can see the number of reads per library. >> >> >> >> >> >> >> On Tue, Jun 17, 2014 at 6:45 PM, Santiago Revale <san...@gm...> wrote: >> Hi Brian, >> >> When using 1024, it said the OS wasn't able to handle it, and it recommended using 1008. >> When using 1008, CA ended arguing "Failed to open output file... Too many open files". >> >> Now I'm trying with fewer parts, but I don't think this would solve the problem. >> >> Do you have any more ideas? >> >> Thanks again in advance. >> >> Regards, >> Santiago >> >> >> On Sun, Jun 15, 2014 at 10:10 PM, Santiago Revale <san...@gm...> wrote: >> Hi Brian, >> >> Thanks for your reply. In regards of your suggestions: >> >> 1) the PBcR process generates OVB files without zipping them; just to be sure, I've tried to unzip some of them just in case the extension were missing; >> >> 2) I've re-launched the process with the suggested parameters, but using 512 instead of 1024; the result was exactly the same: same error in the same step. Also, again 511 out of 512 files had a size of 2.3Gb while the last file was 1.2Tb long. Do you know why does this happens? >> >> I'm trying one last time using 1024 instead. >> >> Thanks again for your reply. I'm open to some more suggestions. >> >> Regards, >> Santiago >> >> >> On Fri, Jun 13, 2014 at 4:25 PM, Brian Walenz <th...@gm...> wrote: >> Hi- >> >> This is a flaw in gzip, where it doesn't report the uncompressed size correctly for files larger than 2gb. 
I'm not intimately familiar with this pipeline, so don't know exactly how to implement the fixes below. >> >> Fix with either: >> >> 1) gzip -d the *gz files before building the overlap store. The 'find' command in the log indicates the pipeline will pick up the uncompressed files. You might need to remove the 'asm.ovlStore.list' file before restarting (this has the list of inputs to overlapStoreBuild). >> >> 2) Set ovlStoreMemory to (exactly) "0 -f 1024". This will tell it to use 0MB memory, and instead use 1024 files regardless of the size. 512 files will also work, and is a little safer (not near some Linux 'number of open files' limits). >> >> 3) Build the overlap store by hand (with either the uncompressed input, or the -f instead of -M option), outside the script, and then restart the script. The script will notice there is an overlap store already present, and skip the build. The command is in the log file -- make sure the final store is called 'asm.ovlStore', and not 'asm.ovlStore.BUILDING'. >> >> Option 1 should work, but option 2 is the easiest to try. I wouldn't try option 3 until Sergey speaks up. >> >> b >> >> >> >> >> On Fri, Jun 13, 2014 at 12:33 PM, Santiago Revale <san...@gm...> wrote: >> Dear CA community, >> >> I'm running the correction of some PacBio reads with high-identity Illumina reads, in a high memory server, for a 750 Mbp genome. I've considered the known issues addressed in the website when starting the correction. >> >> When executing the pipeline, I've reached to the overlapStoreBuild step with 48 ovb files, size 26 Gb each (totaling 1.2Tb). ovls files have already been deleted by the script. The error happened while executing overlapStoreBuild: >> >> ... >> bucketizing DONE! >> overlaps skipped: >> 0 OBT - low quality >> 0 DUP - non-duplicate overlap >> 0 DUP - different library >> 0 DUP - dedup not requested >> terminate called after throwing an instance of 'std::bad_alloc' >> what(): std::bad_alloc >> >> Failed with 'Aborted' >> ... >> >> I ran this step twice: the first one having set ovlStoreMemory to 8192 Mb, but the second one, set it on 160000 (160 Gb). In the "Overlap store failure" FAQ, it mentioned as possible causes "Out of disk space" (which is not my case) and "Corrupt gzip files / too many fragments". I don't have gzip files and I have only 15 fragments. Also, bucketizing step finishes OK. >> >> Also, some odd thing I've noticed (at least odd for me) is that 14 of the 15 temp files (tmp.sort.XXX) of the asm.ovlStore.BUILDING folder have a size 79Gb while the last one size is 1.2Tb. >> >> Could anybody tell me what could be the cause of this error and how to solve it? >> >> I'm attaching the asm.ovlStore.err and the pacBioToCA log files for complete descriptions of the error and the executed commands. >> >> Thank you very much in advance. >> >> Regards, >> Santiago >> >> >> >> ------------------------------------------------------------------------------ >> HPCC Systems Open Source Big Data Platform from LexisNexis Risk Solutions >> Find What Matters Most in Your Big Data with HPCC Systems >> Open Source. Fast. Scalable. Simple. Ideal for Dirty Data. >> Leverages Graph Analysis for Fast Processing & Easy Data Exploration >> http://p.sf.net/sfu/hpccsystems >> _______________________________________________ >> wgs-assembler-users mailing list >> wgs...@li... 
>> https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users >> >> >> >> >> >> ------------------------------------------------------------------------------ >> HPCC Systems Open Source Big Data Platform from LexisNexis Risk Solutions >> Find What Matters Most in Your Big Data with HPCC Systems >> Open Source. Fast. Scalable. Simple. Ideal for Dirty Data. >> Leverages Graph Analysis for Fast Processing & Easy Data Exploration >> http://p.sf.net/sfu/hpccsystems_______________________________________________ >> wgs-assembler-users mailing list >> wgs...@li... >> https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users > > > > |
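A small sketch of the clean-up Sergey describes above, with paths assumed to be relative to the correction working directory; confirm the store exists before deleting anything:

du -sh asm.ovlStore 1-overlapper     # the store should be present and non-trivial in size
rm -rf 1-overlapper/*                # per Sergey: safe once asm.ovlStore is constructed

Re-launching the pipeline afterwards with -pbCNS added (if it was not used originally) picks up the faster consensus module he mentions.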
From: Santiago R. <san...@gm...> - 2014-06-21 15:59:14
|
Hi Brian/Serge, Brian's patch worked like a charm. I'll be continue executing the pacBioToCA script. A couple of quick questions before: 1) can I delete the "1-overlapper/" directory before the pacBioToCA script ended? Because it is 2Tb long as "asm.ovlStore" is that size too (1.8Tb). 2) could you give an estimated time the remaining portion of the script would take? And also an estimate on cores and memory usage? Thank you very much for your help and assistance. Regards, Santiago On Thu, Jun 19, 2014 at 12:53 PM, Santiago Revale <san...@gm...> wrote: > Thank you very much, guys. > > I'll be trying your suggestions this days, starting from Brian's, and I'll > be back to you with the outcome. > > Regards, > Santiago > > > > On Thu, Jun 19, 2014 at 8:34 AM, Brian Walenz <th...@gm...> wrote: > >> Sergey is right; the vacation must be getting to me... >> >> Here is a simple patch to AS_OVS/overlapStoreBuild.C. This will change >> the way the data is partitioned, so that the first partitions are merged >> into a few and the last one is split into many. This should result in >> partitions of around 10gb in size -- the 1tb partition should be split into >> 128 pieces. >> >> The change is only an addition of ~15 lines, to function >> writeToDumpFile(). The new lines are enclosed in a #if/#endif block, >> currently enabled. You can just drop this file into a svn checkout and >> recompile. DO NOT USE FOR PRODUCTION! There are hardcoded values specific >> to your assembly. Please do check these values against gatekeeper >> dumpinfo. I don't think they're critical to be exact, but if I'm off by an >> order of magnitude, it probably won't work well. >> >> b >> >> >> >> >> >> On Wed, Jun 18, 2014 at 11:43 PM, Serge Koren <ser...@gm...> >> wrote: >> >>> Hi, >>> >>> I don't believe the way the overlaps are created is a problem but the >>> way the overlap store is doing the partitioning is. It looks like you have >>> about 4X of PacBio data and about 150X of Illumina data. This a larger >>> difference than we normally use (usually we recommend no more than 50X of >>> Illumina data and 10X+ PacBio) which is likely why this error has not been >>> encountered before. The overlaps are only computed between the PacBio and >>> Illumina reads which are evenly distributed among the partitions so they >>> should all have approximately the same number of overlaps. This should be >>> easy to confirm if all your overlap ovb files are approximately the same >>> size and your output log seems to confirm this. >>> >>> The overlap store bucketizing is assuming equal number of overlaps for >>> each read in your dataset and your Illumina-Illumina overlaps do not exist >>> so as a result all the IIDs with overlaps end up in the last bucket. You've >>> got 505,893 pacbio fragments and 1,120,240,607 Illumina reads. To split the >>> PacBio reads among multiple partitions, you'd want to have be able to open >>> 10,000-20,000 files (partitions) which is above the current limit you have. >>> If you can modify it using ulimit -n 50000 and then run the store creation >>> specifying -f 20480 (or some other large number). That should make your >>> last partition significantly smaller. If you cannot increase the limit then >>> modifying the code is the only option. The good news is that if you are >>> able to build the store, you can re-launch the PBcR pipeline and it will >>> resume the correction after the overlapping step. 
>>> >>> Sergey >>> >>> >>> The hash is only composed of the last set of reads (PacBio) and the refr >>> sequences streamed against the hash are the Illumina data. >>> On Jun 18, 2014, at 8:16 PM, Brian Walenz <th...@gm...> wrote: >>> >>> Unfortunately, I'm on vacation at the moment, and finding little time to >>> spend helping you. >>> >>> "Too many open files" is a limit imposed by the OS. Can you increase >>> this? We've set our large memory machines to allow 100,000 open files. >>> >>> The output files sizes -- and the problem you're suffering from -- are >>> all caused by the way overlaps are created. Correction asked for only >>> overlaps between Illumina and PacBio reads. All the illumina reads are >>> 'first' in the store, and all the pacbio reads are at the end. Overlap >>> jobs will find overlaps between 'other' reads and some subset of the store >>> - e.g., the first overlap job will process the first 10% of the reads, the >>> second will do the second 10% of the reads, etc. Since the pacbio are >>> last, the last job found all the overlaps, so only the last file is of >>> significant size. This also breaks the partitioning scheme used when >>> sorting overlaps. It assumes overlaps are distributed randomly, but yours >>> are all piled up at the end. >>> >>> I don't see an easy fix here, but I think I can come up with a one-off >>> hack to get your store built. Are you comfortable working with C code and >>> compiling? Send the output of 'gatekeeper -dumpinfo *gkpStore' so I can >>> see the number of reads per library. >>> >>> >>> >>> >>> >>> >>> On Tue, Jun 17, 2014 at 6:45 PM, Santiago Revale < >>> san...@gm...> wrote: >>> >>>> Hi Brian, >>>> >>>> When using 1024, it said the OS wasn't able to handle it, and it >>>> recommended using 1008. >>>> When using 1008, CA ended arguing "Failed to open output file... Too >>>> many open files". >>>> >>>> Now I'm trying with fewer parts, but I don't think this would solve the >>>> problem. >>>> >>>> Do you have any more ideas? >>>> >>>> Thanks again in advance. >>>> >>>> Regards, >>>> Santiago >>>> >>>> >>>> On Sun, Jun 15, 2014 at 10:10 PM, Santiago Revale < >>>> san...@gm...> wrote: >>>> >>>>> Hi Brian, >>>>> >>>>> Thanks for your reply. In regards of your suggestions: >>>>> >>>>> 1) the PBcR process generates OVB files without zipping them; just to >>>>> be sure, I've tried to unzip some of them just in case the extension were >>>>> missing; >>>>> >>>>> 2) I've re-launched the process with the suggested parameters, but >>>>> using 512 instead of 1024; the result was exactly the same: same error in >>>>> the same step. Also, again 511 out of 512 files had a size of 2.3Gb while >>>>> the last file was 1.2Tb long. Do you know why does this happens? >>>>> >>>>> I'm trying one last time using 1024 instead. >>>>> >>>>> Thanks again for your reply. I'm open to some more suggestions. >>>>> >>>>> Regards, >>>>> Santiago >>>>> >>>>> >>>>> On Fri, Jun 13, 2014 at 4:25 PM, Brian Walenz <th...@gm...> >>>>> wrote: >>>>> >>>>>> Hi- >>>>>> >>>>>> This is a flaw in gzip, where it doesn't report the uncompressed size >>>>>> correctly for files larger than 2gb. I'm not intimately familiar with this >>>>>> pipeline, so don't know exactly how to implement the fixes below. >>>>>> >>>>>> Fix with either: >>>>>> >>>>>> 1) gzip -d the *gz files before building the overlap store. The >>>>>> 'find' command in the log indicates the pipeline will pick up the >>>>>> uncompressed files. 
You might need to remove the 'asm.ovlStore.list' file >>>>>> before restarting (this has the list of inputs to overlapStoreBuild). >>>>>> >>>>>> 2) Set ovlStoreMemory to (exactly) "0 -f 1024". This will tell it to >>>>>> use 0MB memory, and instead use 1024 files regardless of the size. 512 >>>>>> files will also work, and is a little safer (not near some Linux 'number of >>>>>> open files' limits). >>>>>> >>>>>> 3) Build the overlap store by hand (with either the uncompressed >>>>>> input, or the -f instead of -M option), outside the script, and then >>>>>> restart the script. The script will notice there is an overlap store >>>>>> already present, and skip the build. The command is in the log file -- >>>>>> make sure the final store is called 'asm.ovlStore', and not >>>>>> 'asm.ovlStore.BUILDING'. >>>>>> >>>>>> Option 1 should work, but option 2 is the easiest to try. I wouldn't >>>>>> try option 3 until Sergey speaks up. >>>>>> >>>>>> b >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> On Fri, Jun 13, 2014 at 12:33 PM, Santiago Revale < >>>>>> san...@gm...> wrote: >>>>>> >>>>>>> Dear CA community, >>>>>>> >>>>>>> I'm running the correction of some PacBio reads with high-identity >>>>>>> Illumina reads, in a high memory server, for a 750 Mbp genome. I've >>>>>>> considered the known issues addressed in the website when starting the >>>>>>> correction. >>>>>>> >>>>>>> When executing the pipeline, I've reached to the overlapStoreBuild >>>>>>> step with 48 ovb files, size 26 Gb each (totaling 1.2Tb). ovls files have >>>>>>> already been deleted by the script. The error happened while executing >>>>>>> overlapStoreBuild: >>>>>>> >>>>>>> ... >>>>>>> bucketizing DONE! >>>>>>> overlaps skipped: >>>>>>> 0 OBT - low quality >>>>>>> 0 DUP - non-duplicate overlap >>>>>>> 0 DUP - different library >>>>>>> 0 DUP - dedup not requested >>>>>>> terminate called after throwing an instance of 'std::bad_alloc' >>>>>>> what(): std::bad_alloc >>>>>>> >>>>>>> Failed with 'Aborted' >>>>>>> ... >>>>>>> >>>>>>> >>>>>>> I ran this step twice: the first one having set ovlStoreMemory to >>>>>>> 8192 Mb, but the second one, set it on 160000 (160 Gb). In the "Overlap >>>>>>> store failure" FAQ, it mentioned as possible causes "Out of disk space" >>>>>>> (which is not my case) and "Corrupt gzip files / too many fragments". I >>>>>>> don't have gzip files and I have only 15 fragments. Also, bucketizing step >>>>>>> finishes OK. >>>>>>> >>>>>>> Also, some odd thing I've noticed (at least odd for me) is that 14 >>>>>>> of the 15 temp files (tmp.sort.XXX) of the asm.ovlStore.BUILDING folder >>>>>>> have a size 79Gb while the last one size is 1.2Tb. >>>>>>> >>>>>>> Could anybody tell me what could be the cause of this error and how >>>>>>> to solve it? >>>>>>> >>>>>>> I'm attaching the asm.ovlStore.err and the pacBioToCA log files for >>>>>>> complete descriptions of the error and the executed commands. >>>>>>> >>>>>>> Thank you very much in advance. >>>>>>> >>>>>>> Regards, >>>>>>> Santiago >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> ------------------------------------------------------------------------------ >>>>>>> HPCC Systems Open Source Big Data Platform from LexisNexis Risk >>>>>>> Solutions >>>>>>> Find What Matters Most in Your Big Data with HPCC Systems >>>>>>> Open Source. Fast. Scalable. Simple. Ideal for Dirty Data. 
>>>>>>> Leverages Graph Analysis for Fast Processing & Easy Data Exploration >>>>>>> http://p.sf.net/sfu/hpccsystems >>>>>>> _______________________________________________ >>>>>>> wgs-assembler-users mailing list >>>>>>> wgs...@li... >>>>>>> https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users >>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >>> ------------------------------------------------------------------------------ >>> HPCC Systems Open Source Big Data Platform from LexisNexis Risk Solutions >>> Find What Matters Most in Your Big Data with HPCC Systems >>> Open Source. Fast. Scalable. Simple. Ideal for Dirty Data. >>> Leverages Graph Analysis for Fast Processing & Easy Data Exploration >>> >>> http://p.sf.net/sfu/hpccsystems_______________________________________________ >>> wgs-assembler-users mailing list >>> wgs...@li... >>> https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users >>> >>> >>> >> > |
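A practical aside on question 1 above: the thread never answers it directly, but Brian's note (quoted above) that a completed store is named 'asm.ovlStore' rather than 'asm.ovlStore.BUILDING' suggests a conservative check before reclaiming the space used by the overlapper output. A minimal shell sketch, assuming the directory layout implied by the thread (asm.ovlStore and 1-overlapper/ in the same run directory); the paths and the cleanup itself are assumptions, not documented PBcR behaviour:

    # Hedged sketch: only reclaim 1-overlapper/ once the store has clearly finished.
    # Paths are assumed from the thread; adjust to your run directory.
    if [ -d asm.ovlStore ] && [ ! -d asm.ovlStore.BUILDING ]; then
        du -sh asm.ovlStore 1-overlapper     # sanity-check sizes before deleting anything
        # rm -rf 1-overlapper                # uncomment only once you are sure
    else
        echo "overlap store still building; keep 1-overlapper/" >&2
    fi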
From: Santiago R. <san...@gm...> - 2014-06-19 15:54:08
|
Thank you very much, guys. I'll be trying your suggestions this days, starting from Brian's, and I'll be back to you with the outcome. Regards, Santiago On Thu, Jun 19, 2014 at 8:34 AM, Brian Walenz <th...@gm...> wrote: > Sergey is right; the vacation must be getting to me... > > Here is a simple patch to AS_OVS/overlapStoreBuild.C. This will change > the way the data is partitioned, so that the first partitions are merged > into a few and the last one is split into many. This should result in > partitions of around 10gb in size -- the 1tb partition should be split into > 128 pieces. > > The change is only an addition of ~15 lines, to function > writeToDumpFile(). The new lines are enclosed in a #if/#endif block, > currently enabled. You can just drop this file into a svn checkout and > recompile. DO NOT USE FOR PRODUCTION! There are hardcoded values specific > to your assembly. Please do check these values against gatekeeper > dumpinfo. I don't think they're critical to be exact, but if I'm off by an > order of magnitude, it probably won't work well. > > b > > > > > > On Wed, Jun 18, 2014 at 11:43 PM, Serge Koren <ser...@gm...> > wrote: > >> Hi, >> >> I don't believe the way the overlaps are created is a problem but the way >> the overlap store is doing the partitioning is. It looks like you have >> about 4X of PacBio data and about 150X of Illumina data. This a larger >> difference than we normally use (usually we recommend no more than 50X of >> Illumina data and 10X+ PacBio) which is likely why this error has not been >> encountered before. The overlaps are only computed between the PacBio and >> Illumina reads which are evenly distributed among the partitions so they >> should all have approximately the same number of overlaps. This should be >> easy to confirm if all your overlap ovb files are approximately the same >> size and your output log seems to confirm this. >> >> The overlap store bucketizing is assuming equal number of overlaps for >> each read in your dataset and your Illumina-Illumina overlaps do not exist >> so as a result all the IIDs with overlaps end up in the last bucket. You've >> got 505,893 pacbio fragments and 1,120,240,607 Illumina reads. To split the >> PacBio reads among multiple partitions, you'd want to have be able to open >> 10,000-20,000 files (partitions) which is above the current limit you have. >> If you can modify it using ulimit -n 50000 and then run the store creation >> specifying -f 20480 (or some other large number). That should make your >> last partition significantly smaller. If you cannot increase the limit then >> modifying the code is the only option. The good news is that if you are >> able to build the store, you can re-launch the PBcR pipeline and it will >> resume the correction after the overlapping step. >> >> Sergey >> >> >> The hash is only composed of the last set of reads (PacBio) and the refr >> sequences streamed against the hash are the Illumina data. >> On Jun 18, 2014, at 8:16 PM, Brian Walenz <th...@gm...> wrote: >> >> Unfortunately, I'm on vacation at the moment, and finding little time to >> spend helping you. >> >> "Too many open files" is a limit imposed by the OS. Can you increase >> this? We've set our large memory machines to allow 100,000 open files. >> >> The output files sizes -- and the problem you're suffering from -- are >> all caused by the way overlaps are created. Correction asked for only >> overlaps between Illumina and PacBio reads. 
All the illumina reads are >> 'first' in the store, and all the pacbio reads are at the end. Overlap >> jobs will find overlaps between 'other' reads and some subset of the store >> - e.g., the first overlap job will process the first 10% of the reads, the >> second will do the second 10% of the reads, etc. Since the pacbio are >> last, the last job found all the overlaps, so only the last file is of >> significant size. This also breaks the partitioning scheme used when >> sorting overlaps. It assumes overlaps are distributed randomly, but yours >> are all piled up at the end. >> >> I don't see an easy fix here, but I think I can come up with a one-off >> hack to get your store built. Are you comfortable working with C code and >> compiling? Send the output of 'gatekeeper -dumpinfo *gkpStore' so I can >> see the number of reads per library. >> >> >> >> >> >> >> On Tue, Jun 17, 2014 at 6:45 PM, Santiago Revale < >> san...@gm...> wrote: >> >>> Hi Brian, >>> >>> When using 1024, it said the OS wasn't able to handle it, and it >>> recommended using 1008. >>> When using 1008, CA ended arguing "Failed to open output file... Too >>> many open files". >>> >>> Now I'm trying with fewer parts, but I don't think this would solve the >>> problem. >>> >>> Do you have any more ideas? >>> >>> Thanks again in advance. >>> >>> Regards, >>> Santiago >>> >>> >>> On Sun, Jun 15, 2014 at 10:10 PM, Santiago Revale < >>> san...@gm...> wrote: >>> >>>> Hi Brian, >>>> >>>> Thanks for your reply. In regards of your suggestions: >>>> >>>> 1) the PBcR process generates OVB files without zipping them; just to >>>> be sure, I've tried to unzip some of them just in case the extension were >>>> missing; >>>> >>>> 2) I've re-launched the process with the suggested parameters, but >>>> using 512 instead of 1024; the result was exactly the same: same error in >>>> the same step. Also, again 511 out of 512 files had a size of 2.3Gb while >>>> the last file was 1.2Tb long. Do you know why does this happens? >>>> >>>> I'm trying one last time using 1024 instead. >>>> >>>> Thanks again for your reply. I'm open to some more suggestions. >>>> >>>> Regards, >>>> Santiago >>>> >>>> >>>> On Fri, Jun 13, 2014 at 4:25 PM, Brian Walenz <th...@gm...> wrote: >>>> >>>>> Hi- >>>>> >>>>> This is a flaw in gzip, where it doesn't report the uncompressed size >>>>> correctly for files larger than 2gb. I'm not intimately familiar with this >>>>> pipeline, so don't know exactly how to implement the fixes below. >>>>> >>>>> Fix with either: >>>>> >>>>> 1) gzip -d the *gz files before building the overlap store. The >>>>> 'find' command in the log indicates the pipeline will pick up the >>>>> uncompressed files. You might need to remove the 'asm.ovlStore.list' file >>>>> before restarting (this has the list of inputs to overlapStoreBuild). >>>>> >>>>> 2) Set ovlStoreMemory to (exactly) "0 -f 1024". This will tell it to >>>>> use 0MB memory, and instead use 1024 files regardless of the size. 512 >>>>> files will also work, and is a little safer (not near some Linux 'number of >>>>> open files' limits). >>>>> >>>>> 3) Build the overlap store by hand (with either the uncompressed >>>>> input, or the -f instead of -M option), outside the script, and then >>>>> restart the script. The script will notice there is an overlap store >>>>> already present, and skip the build. The command is in the log file -- >>>>> make sure the final store is called 'asm.ovlStore', and not >>>>> 'asm.ovlStore.BUILDING'. 
>>>>> >>>>> Option 1 should work, but option 2 is the easiest to try. I wouldn't >>>>> try option 3 until Sergey speaks up. >>>>> >>>>> b >>>>> >>>>> >>>>> >>>>> >>>>> On Fri, Jun 13, 2014 at 12:33 PM, Santiago Revale < >>>>> san...@gm...> wrote: >>>>> >>>>>> Dear CA community, >>>>>> >>>>>> I'm running the correction of some PacBio reads with high-identity >>>>>> Illumina reads, in a high memory server, for a 750 Mbp genome. I've >>>>>> considered the known issues addressed in the website when starting the >>>>>> correction. >>>>>> >>>>>> When executing the pipeline, I've reached to the overlapStoreBuild >>>>>> step with 48 ovb files, size 26 Gb each (totaling 1.2Tb). ovls files have >>>>>> already been deleted by the script. The error happened while executing >>>>>> overlapStoreBuild: >>>>>> >>>>>> ... >>>>>> bucketizing DONE! >>>>>> overlaps skipped: >>>>>> 0 OBT - low quality >>>>>> 0 DUP - non-duplicate overlap >>>>>> 0 DUP - different library >>>>>> 0 DUP - dedup not requested >>>>>> terminate called after throwing an instance of 'std::bad_alloc' >>>>>> what(): std::bad_alloc >>>>>> >>>>>> Failed with 'Aborted' >>>>>> ... >>>>>> >>>>>> >>>>>> I ran this step twice: the first one having set ovlStoreMemory to >>>>>> 8192 Mb, but the second one, set it on 160000 (160 Gb). In the "Overlap >>>>>> store failure" FAQ, it mentioned as possible causes "Out of disk space" >>>>>> (which is not my case) and "Corrupt gzip files / too many fragments". I >>>>>> don't have gzip files and I have only 15 fragments. Also, bucketizing step >>>>>> finishes OK. >>>>>> >>>>>> Also, some odd thing I've noticed (at least odd for me) is that 14 of >>>>>> the 15 temp files (tmp.sort.XXX) of the asm.ovlStore.BUILDING folder have a >>>>>> size 79Gb while the last one size is 1.2Tb. >>>>>> >>>>>> Could anybody tell me what could be the cause of this error and how >>>>>> to solve it? >>>>>> >>>>>> I'm attaching the asm.ovlStore.err and the pacBioToCA log files for >>>>>> complete descriptions of the error and the executed commands. >>>>>> >>>>>> Thank you very much in advance. >>>>>> >>>>>> Regards, >>>>>> Santiago >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> ------------------------------------------------------------------------------ >>>>>> HPCC Systems Open Source Big Data Platform from LexisNexis Risk >>>>>> Solutions >>>>>> Find What Matters Most in Your Big Data with HPCC Systems >>>>>> Open Source. Fast. Scalable. Simple. Ideal for Dirty Data. >>>>>> Leverages Graph Analysis for Fast Processing & Easy Data Exploration >>>>>> http://p.sf.net/sfu/hpccsystems >>>>>> _______________________________________________ >>>>>> wgs-assembler-users mailing list >>>>>> wgs...@li... >>>>>> https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users >>>>>> >>>>>> >>>>> >>>> >>> >> >> ------------------------------------------------------------------------------ >> HPCC Systems Open Source Big Data Platform from LexisNexis Risk Solutions >> Find What Matters Most in Your Big Data with HPCC Systems >> Open Source. Fast. Scalable. Simple. Ideal for Dirty Data. >> Leverages Graph Analysis for Fast Processing & Easy Data Exploration >> >> http://p.sf.net/sfu/hpccsystems_______________________________________________ >> wgs-assembler-users mailing list >> wgs...@li... >> https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users >> >> >> > |
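Since the patched file itself was sent as an attachment and is not reproduced in the archive, the steps below are only a sketch of how such a one-off rebuild is usually applied to a wgs-assembler svn checkout; the checkout location and make invocation are assumptions, and, as Brian warns, the resulting binary is specific to this assembly and should not be reused.

    # Hedged sketch: drop Brian's patched overlapStoreBuild.C into an existing
    # source checkout and rebuild. CA_SRC is a hypothetical checkout path.
    CA_SRC=$HOME/wgs-assembler
    cp overlapStoreBuild.C "$CA_SRC"/src/AS_OVS/overlapStoreBuild.C   # the file Brian attached
    make -C "$CA_SRC"/src                                             # rebuild the assembler binaries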
From: Serge K. <ser...@gm...> - 2014-06-19 03:43:53
|
Hi, I don't believe the way the overlaps are created is the problem; the way the overlap store partitions them is. It looks like you have about 4X of PacBio data and about 150X of Illumina data. This is a larger difference than we normally use (we usually recommend no more than 50X of Illumina data and 10X+ PacBio), which is likely why this error has not been encountered before. The overlaps are only computed between the PacBio and Illumina reads, which are evenly distributed among the partitions, so the partitions should all have approximately the same number of overlaps. This is easy to confirm by checking whether all your overlap ovb files are approximately the same size, and your output log seems to confirm that they are. The overlap store bucketizing assumes an equal number of overlaps for each read in your dataset, and since your Illumina-Illumina overlaps do not exist, all the IIDs with overlaps end up in the last bucket. You've got 505,893 PacBio fragments and 1,120,240,607 Illumina reads. To split the PacBio reads among multiple partitions, you'd want to be able to open 10,000-20,000 files (partitions), which is above your current limit. If you can, raise it using ulimit -n 50000 and then run the store creation specifying -f 20480 (or some other large number); that should make your last partition significantly smaller. If you cannot increase the limit, then modifying the code is the only option. The good news is that once you are able to build the store, you can re-launch the PBcR pipeline and it will resume the correction after the overlapping step. Sergey The hash is only composed of the last set of reads (PacBio), and the reference sequences streamed against the hash are the Illumina data. On Jun 18, 2014, at 8:16 PM, Brian Walenz <th...@gm...> wrote: > Unfortunately, I'm on vacation at the moment, and finding little time to spend helping you. > > "Too many open files" is a limit imposed by the OS. Can you increase this? We've set our large memory machines to allow 100,000 open files. > > The output files sizes -- and the problem you're suffering from -- are all caused by the way overlaps are created. Correction asked for only overlaps between Illumina and PacBio reads. All the illumina reads are 'first' in the store, and all the pacbio reads are at the end. Overlap jobs will find overlaps between 'other' reads and some subset of the store - e.g., the first overlap job will process the first 10% of the reads, the second will do the second 10% of the reads, etc. Since the pacbio are last, the last job found all the overlaps, so only the last file is of significant size. This also breaks the partitioning scheme used when sorting overlaps. It assumes overlaps are distributed randomly, but yours are all piled up at the end. > > I don't see an easy fix here, but I think I can come up with a one-off hack to get your store built. Are you comfortable working with C code and compiling? Send the output of 'gatekeeper -dumpinfo *gkpStore' so I can see the number of reads per library. > > > > > > > On Tue, Jun 17, 2014 at 6:45 PM, Santiago Revale <san...@gm...> wrote: > Hi Brian, > > When using 1024, it said the OS wasn't able to handle it, and it recommended using 1008. > When using 1008, CA ended arguing "Failed to open output file... Too many open files". > > Now I'm trying with fewer parts, but I don't think this would solve the problem. > > Do you have any more ideas? > > Thanks again in advance.
> > Regards, > Santiago > > > On Sun, Jun 15, 2014 at 10:10 PM, Santiago Revale <san...@gm...> wrote: > Hi Brian, > > Thanks for your reply. In regards of your suggestions: > > 1) the PBcR process generates OVB files without zipping them; just to be sure, I've tried to unzip some of them just in case the extension were missing; > > 2) I've re-launched the process with the suggested parameters, but using 512 instead of 1024; the result was exactly the same: same error in the same step. Also, again 511 out of 512 files had a size of 2.3Gb while the last file was 1.2Tb long. Do you know why does this happens? > > I'm trying one last time using 1024 instead. > > Thanks again for your reply. I'm open to some more suggestions. > > Regards, > Santiago > > > On Fri, Jun 13, 2014 at 4:25 PM, Brian Walenz <th...@gm...> wrote: > Hi- > > This is a flaw in gzip, where it doesn't report the uncompressed size correctly for files larger than 2gb. I'm not intimately familiar with this pipeline, so don't know exactly how to implement the fixes below. > > Fix with either: > > 1) gzip -d the *gz files before building the overlap store. The 'find' command in the log indicates the pipeline will pick up the uncompressed files. You might need to remove the 'asm.ovlStore.list' file before restarting (this has the list of inputs to overlapStoreBuild). > > 2) Set ovlStoreMemory to (exactly) "0 -f 1024". This will tell it to use 0MB memory, and instead use 1024 files regardless of the size. 512 files will also work, and is a little safer (not near some Linux 'number of open files' limits). > > 3) Build the overlap store by hand (with either the uncompressed input, or the -f instead of -M option), outside the script, and then restart the script. The script will notice there is an overlap store already present, and skip the build. The command is in the log file -- make sure the final store is called 'asm.ovlStore', and not 'asm.ovlStore.BUILDING'. > > Option 1 should work, but option 2 is the easiest to try. I wouldn't try option 3 until Sergey speaks up. > > b > > > > > On Fri, Jun 13, 2014 at 12:33 PM, Santiago Revale <san...@gm...> wrote: > Dear CA community, > > I'm running the correction of some PacBio reads with high-identity Illumina reads, in a high memory server, for a 750 Mbp genome. I've considered the known issues addressed in the website when starting the correction. > > When executing the pipeline, I've reached to the overlapStoreBuild step with 48 ovb files, size 26 Gb each (totaling 1.2Tb). ovls files have already been deleted by the script. The error happened while executing overlapStoreBuild: > > ... > bucketizing DONE! > overlaps skipped: > 0 OBT - low quality > 0 DUP - non-duplicate overlap > 0 DUP - different library > 0 DUP - dedup not requested > terminate called after throwing an instance of 'std::bad_alloc' > what(): std::bad_alloc > > Failed with 'Aborted' > ... > > I ran this step twice: the first one having set ovlStoreMemory to 8192 Mb, but the second one, set it on 160000 (160 Gb). In the "Overlap store failure" FAQ, it mentioned as possible causes "Out of disk space" (which is not my case) and "Corrupt gzip files / too many fragments". I don't have gzip files and I have only 15 fragments. Also, bucketizing step finishes OK. > > Also, some odd thing I've noticed (at least odd for me) is that 14 of the 15 temp files (tmp.sort.XXX) of the asm.ovlStore.BUILDING folder have a size 79Gb while the last one size is 1.2Tb. 
> > Could anybody tell me what could be the cause of this error and how to solve it? > > I'm attaching the asm.ovlStore.err and the pacBioToCA log files for complete descriptions of the error and the executed commands. > > Thank you very much in advance. > > Regards, > Santiago > > > > ------------------------------------------------------------------------------ > HPCC Systems Open Source Big Data Platform from LexisNexis Risk Solutions > Find What Matters Most in Your Big Data with HPCC Systems > Open Source. Fast. Scalable. Simple. Ideal for Dirty Data. > Leverages Graph Analysis for Fast Processing & Easy Data Exploration > http://p.sf.net/sfu/hpccsystems > _______________________________________________ > wgs-assembler-users mailing list > wgs...@li... > https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users > > > > > > ------------------------------------------------------------------------------ > HPCC Systems Open Source Big Data Platform from LexisNexis Risk Solutions > Find What Matters Most in Your Big Data with HPCC Systems > Open Source. Fast. Scalable. Simple. Ideal for Dirty Data. > Leverages Graph Analysis for Fast Processing & Easy Data Exploration > http://p.sf.net/sfu/hpccsystems_______________________________________________ > wgs-assembler-users mailing list > wgs...@li... > https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users |
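Serge's ulimit/-f suggestion and Brian's earlier option 2 amount to the same fix: trade memory-based bucketing for a fixed, much larger number of partition files so that the PacBio-heavy tail is spread across many small buckets instead of one 1.2 Tb one. A rough sketch of how that might be set up, assuming the store build is re-driven through the spec file; the exact overlapStoreBuild invocation differs by CA version and should be taken from the pacBioToCA log:

    # Hedged sketch: raise the open-file limit, then force file-count partitioning.
    ulimit -n 50000   # must exceed the partition count; may need a limits.conf change
    ulimit -n         # confirm the new limit actually took effect

    # In the spec file, per Brian's option 2 but with Serge's larger count
    # (the quoting is deliberate):
    #   ovlStoreMemory = 0 -f 20480
    # Then re-launch the original pacBioToCA command with the same spec; once
    # asm.ovlStore exists, the pipeline resumes the correction after overlapping.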