From: Ole K. T. <o.k...@bi...> - 2012-05-14 18:46:41

On 14 May 2012 20:32, Mundy, Michael <Mun...@ma...> wrote:
> I'm using WGS 7.0 and I have two synchronized fastq files with paired-end
> reads. Based on the documentation at
> http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=FastqToCA,
> I tried this command:
>
> wgs-7.0/Linux-amd64/bin/fastqToCA -libraryname SRR067601.000 -mates SRR067601.000_1_pair.fq,SRR067601.000_2_pair.fq
>
> But it returns this error:
>
> ERROR: Mated reads (-mates) must have am insert size (-insertsize).
>
> The documentation page says that the -insertsize option is optional, so I
> thought that was the flag to distinguish between paired-end reads and
> mate-pair reads. How do I generate a FRG file for paired-end reads?

I guess the documentation is not up to date; it is in fact not optional to supply the -insertsize option. Just add -insertsize 300 30 if your reads are paired end from a 300 bp DNA fragment, or something like -insertsize 5000 500 -outtie if they are mate pairs from a 5k library.

Ole

> Mike Mundy
>
> ------------------------------------------------------------------------------
> Live Security Virtual Conference
> Exclusive live event will cover all the ways today's security and
> threat landscape has changed and how IT managers can respond. Discussions
> will include endpoint security, mobile security and the latest in malware
> threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
> _______________________________________________
> wgs-assembler-users mailing list
> wgs...@li...
> https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users
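Putting Ole's advice together, a corrected invocation might look like the following. This is a sketch: the 300/30 insert size is the example value from this thread, not a measured one, so substitute your library's actual mean and standard deviation, and note that the output redirection to a .frg file is how fastqToCA is typically used rather than something stated in this thread.

```shell
wgs-7.0/Linux-amd64/bin/fastqToCA \
  -libraryname SRR067601.000 \
  -insertsize 300 30 \
  -mates SRR067601.000_1_pair.fq,SRR067601.000_2_pair.fq \
  > SRR067601.000.frg

# For a 5 kb mate-pair library, the pairs point outward, so add -outtie:
#   -insertsize 5000 500 -outtie
```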
From: Mundy, M. <Mun...@ma...> - 2012-05-14 18:32:27

I'm using WGS 7.0 and I have two synchronized fastq files with paired-end reads. Based on the documentation at http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=FastqToCA, I tried this command:

wgs-7.0/Linux-amd64/bin/fastqToCA -libraryname SRR067601.000 -mates SRR067601.000_1_pair.fq,SRR067601.000_2_pair.fq

But it returns this error:

ERROR: Mated reads (-mates) must have am insert size (-insertsize).

The documentation page says that the -insertsize option is optional, so I thought that was the flag to distinguish between paired-end reads and mate-pair reads. How do I generate a FRG file for paired-end reads?

Mike Mundy
From: Walenz, B. <bw...@jc...> - 2012-05-11 19:00:37

On 5/10/12 2:55 PM, "Ole Kristian Tørresen" <o.k...@bi...> wrote:
> Hi Brian.
>
> Thank you for this, good to know. Our PacBio fastq files were over
> multiple lines (SMRT-Portal 1.3... Thank you a lot, PacBio!), and the
> correction pipeline ran for 17 days taking up 48 CPUs, and I guess we
> can just kill it now.

Multiple lines aren't nearly as bad as Illumina's new multi-word read names... ;-)

The paper on the correction pipeline will be appearing in Nature Biotechnology real soon. I'll send a link once I get one. I'm pretty sure nobody has tried correcting PacBio with 454 reads.

> On 10 May 2012 19:50, Walenz, Brian <bw...@jc...> wrote:
>> [...]
>> We've been recently disabling OBT (and fragmentCorrection) in runCA, and
>> doing all trimming/correction outside the assembler. In your case, you can
>> run the assembler up through OBT on all your 454 reads, then dump gatekeeper
>> to build a trimmed fragment set. If you're using CVS tip, dumping as fastq
>> will work too. With the PacBio reads, this is mandatory, since the pipeline
>> will split some of the PacBio reads into multiple pieces.
>
> I saw some submissions to the CVS about this, but couldn't figure out
> exactly what they meant. This clears that up. I recently started an
> assembly with 454 and Illumina reads (the Illumina reads corrected with
> Quake), and correct-frags has been running for several days now.
>
> Should I run OBT on all my 454 reads, dump the trimmed reads, and use
> them in a new assembly with the error-corrected Illumina reads? Will the
> default with the CVS tip then be to not run correct-frags etc. on those
> reads? What will be the effect of using these trimmed 454 reads for
> PacBio error correction?

If you have trimmed/corrected reads, then both OBT and the correction should be disabled:

  doOBT=0
  doFragmentCorrection=0

The correction process hasn't changed since the Sanger-only days, and it doesn't seem to scale easily to hundreds of millions of reads. The algorithm: in the first pass (fragment correction) a multiple sequence alignment is generated for each read, formed from all overlaps to the read; errors are detected and noted. In the second pass (overlap correction) these corrections are applied to change the error rate of overlaps. The bases in the read never change.

My opinion is that correction of the bases in the reads is now good enough that the reads should be corrected before assembly. The corrections can be specific to the technology (homopolymers for 454, no indels for Illumina), something that isn't done in CA and would be tough to do there.

>> The obt overlaps and ovl overlaps used for assembly aren't compatible. The
>> obt overlaps are more like blast matches (align a-b in read 1 to c-d in
>> read 2) while the ovl overlaps are ... overlaps; see
>> http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=Overlaps .
>> Since trimming will change the length of the read, it's impossible to
>> translate the overlaps on untrimmed reads to overlaps on trimmed reads.
>
> I hadn't seen that page. It's a useful reference (as are other
> "hidden" pages at that wiki.)

Thought we had a (one) link to it somewhere. *sigh*

b
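The two options Brian names can be dropped into the runCA spec file. A minimal sketch (the spec filename is arbitrary):

```shell
# Create a spec fragment that disables overlap-based trimming (OBT) and the
# fragment/overlap error correction, for reads trimmed and corrected elsewhere.
cat > trimmed-reads.spec <<'EOF'
doOBT=0
doFragmentCorrection=0
EOF
```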
From: Ole K. T. <o.k...@bi...> - 2012-05-10 18:55:20

Hi Brian.

Thank you for this, good to know. Our PacBio fastq files were over multiple lines (SMRT-Portal 1.3... Thank you a lot, PacBio!), and the correction pipeline ran for 17 days taking up 48 CPUs, and I guess we can just kill it now.

On 10 May 2012 19:50, Walenz, Brian <bw...@jc...> wrote:
> [...]
> We've been recently disabling OBT (and fragmentCorrection) in runCA, and
> doing all trimming/correction outside the assembler. In your case, you can
> run the assembler up through OBT on all your 454 reads, then dump gatekeeper
> to build a trimmed fragment set. If you're using CVS tip, dumping as fastq
> will work too. With the PacBio reads, this is mandatory, since the pipeline
> will split some of the PacBio reads into multiple pieces.

I saw some submissions to the CVS about this, but couldn't figure out exactly what they meant. This clears that up. I recently started an assembly with 454 and Illumina reads (the Illumina reads corrected with Quake), and correct-frags has been running for several days now.

Should I run OBT on all my 454 reads, dump the trimmed reads, and use them in a new assembly with the error-corrected Illumina reads? Will the default with the CVS tip then be to not run correct-frags etc. on those reads? What will be the effect of using these trimmed 454 reads for PacBio error correction?

> The obt overlaps and ovl overlaps used for assembly aren't compatible. The
> obt overlaps are more like blast matches (align a-b in read 1 to c-d in
> read 2) while the ovl overlaps are ... overlaps; see
> http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=Overlaps .
> Since trimming will change the length of the read, it's impossible to
> translate the overlaps on untrimmed reads to overlaps on trimmed reads.

I hadn't seen that page. It's a useful reference (as are other "hidden" pages at that wiki.)

Ole

> [...]
From: Walenz, B. <bw...@jc...> - 2012-05-10 17:50:41

Hi, Ole-

ovlHashLibrary=2 does mean to load only reads from the second library into the hash table; in this case, that's the PacBio reads. The 'ref' library is what fragments we search against the hash table. ovlRefLibrary=1-1 translates to 'starting at library 1 and ending at library 1'. Overlaps will be computed between libraries 1 and 2, but not within the same library.

I should point out that this isn't implemented perfectly. The overlap jobs for computing overlaps within library 1 are still launched, and the hash tables are still built, but no overlaps are output. The 'overlap_partition' command is responsible for setting up the hash and reference ranges for each overlap job, and it isn't aware of the ovlHashLibrary/ovlRefLibrary options.

We've been recently disabling OBT (and fragmentCorrection) in runCA, and doing all trimming/correction outside the assembler. In your case, you can run the assembler up through OBT on all your 454 reads, then dump gatekeeper to build a trimmed fragment set. If you're using CVS tip, dumping as fastq will work too. With the PacBio reads, this is mandatory, since the pipeline will split some of the PacBio reads into multiple pieces.

The obt overlaps and ovl overlaps used for assembly aren't compatible. The obt overlaps are more like blast matches (align a-b in read 1 to c-d in read 2) while the ovl overlaps are ... overlaps; see http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=Overlaps. Since trimming will change the length of the read, it's impossible to translate the overlaps on untrimmed reads to overlaps on trimmed reads.

b

On 5/10/12 4:53 AM, "Ole Kristian Tørresen" <o.k...@bi...> wrote:
> [...]
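Brian's explanation maps onto the correction pipeline's runCA invocation quoted in Ole's question. An annotated sketch, assuming (as in Ole's setup) that library 1 holds the 454 reads and library 2 the PacBio reads:

```shell
# ovlHashLibrary=2   : load only library-2 (PacBio) reads into the hash table
# ovlRefLibrary=1-1  : stream libraries 1 through 1 (the 454 reads) against it,
#                      so overlaps are computed between libraries 1 and 2
#                      but not within library 1
# obt*Library=1-1    : overlap-based trimming runs on the 454 reads only
runCA -s pacbio.spec -p asm -d temppacbio \
  ovlHashLibrary=2 ovlRefLibrary=1-1 \
  obtHashLibrary=1-1 obtRefLibrary=1-1 \
  stopAfter=overlapper
```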
From: Ole K. T. <o.k...@bi...> - 2012-05-10 08:53:47

Hi,

we have started doing some sequencing on PacBio, and are correcting the reads with the PacBioToCA pipeline. The genome is about 800 Mb, and we're trying to correct the PacBio reads from two SMRTcells with about 20x in 454 reads. This translates to 130,389 PacBio reads with 126 Mb of sequence, and 47M 454 reads with 17.6 Gb of sequence.

We see that 0-overlaptrim-overlap uses quite a bit of time, and I fear that 1-overlapper will use a long time too. Is it possible to compute the overlaps between the 454 reads ahead of time, and use the overlaps from that store to only compute the overlaps between 454 reads and PacBio reads, since I guess most of the time is spent computing the overlaps between 454 reads? This could be useful for assembly in general too; sometimes we only input some data to have a faster assembly, while later on we input more.

When I look at the command that's used to run CA in the error correction step: runCA -s pacbio.spec -p asm -d temppacbio ovlHashLibrary=2 ovlRefLibrary=1-1 obtHashLibrary=1-1 obtRefLibrary=1-1 sge=" -sync y" sgePropagateHold=corAsm stopAfter=overlapper, does it actually do what I ask for? It only loads hash fragments from library 2, but does it load all libraries in the other *Library options (1-1 = 0)? Could anyone explain to me what that really means?

Sincerely,
Ole
From: Arjun P. <ap...@ma...> - 2012-04-25 18:31:50

Hi,

Thanks, Brian, for the detailed explanation. The gkpStore.fastqUIDmap file is easy enough to parse. From what you said it seems like the generated UID in the output may be something you guys fix at some point, right?

I wrote a little perl script to convert the UIDs to read names in the posmap files. I didn't do it for the .asm file because posmaps are all I need for now. I posted it at http://arjunprasad.net/scripts/fixReadnamesInPosmap in case it's helpful for someone else. It took about 1.5 GB of RAM for 7 million reads with fairly long names.

It just occurred to me that fixReadnamesInPosmap doesn't handle the case where you have an assembly with some FASTQ files and some .frg files for input. That's easy to fix if it's useful to anyone.

Arjun

On Tue, 24 Apr 2012, Walenz, Brian wrote:
> Hi-
>
> I was fearing the day someone would ask about this. We had a choice of
> either doing lots of engineering to optimize directly saving names of fastq
> reads, or an inelegant - and only partially completed - solution of
> stripping the names when the reads are loaded into the gatekeeper store, and
> adding them back as a post process.
>
> The names and mapping are saved in the *.gkpStore.fastqUIDmap. The format
> is:
>
> UID IID Name (for unpaired reads)
> UID IID Name UID IID Name (for paired reads)
>
> IIDs are used internally by the assembler. Most logs refer to reads (and
> unitigs, contigs and scaffolds) using these. There is an implicit 'type'
> with each IID: "1" is a valid IID for four objects - a fragment, a unitig,
> a contig and a scaffold.
>
> UIDs appear in the outputs - posmap and asm. These are guaranteed to be
> unique within the assembly. For reads loaded as .frg, the UID is the read
> name.
>
> The iidtouid file gives a mapping from IID to UID, for every object in the
> assembly, not just reads.
>
> Sorry for the pain. We're a bit short on engineering time at the moment,
> and as this wasn't an issue critical to getting a good assembly, we only
> made it 'not break' for an assembly with > 1 billion reads.
>
> b
>
> On 4/24/12 1:52 PM, "Arjun Prasad" <ap...@ma...> wrote:
>> Hi,
>>
>> I need to get a read mapping with the actual read names for an assembly
>> that was created from FASTQ input sequences. I noticed the iidtouid
>> file in the 9-terminator directory, but it has numbers for fragments
>> rather than read names.
>>
>> Looking at the reads from the 9-terminator/.frg file I matched up some by
>> sequence, and it looks like the FRG numbers are alternating reads from
>> each of the paired ends, e.g.:
>>
>> FRG 1 110000000001 - first entry from read 1
>> No FRG 2
>> FRG 3 110000000003 - 2nd entry from read 1
>> FRG 4 120000000003 - 2nd entry from read 2
>> FRG 5 110000000005 - 3rd entry from read 1
>> FRG 6 120000000005 - 3rd entry from read 2
>> FRG 100000 120000099999 - entry 50,000 from read 2
>>
>> I'm guessing that I can figure out the read name to IID translation by
>> counting into the fastq files by FRG # / 2.
>>
>> Has anyone else done this? Did I correctly interpret what the FRG numbers
>> mean? Are there any gotchas at input file boundaries?
>>
>> Thanks,
>> Arjun
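Since the posted script URL may no longer resolve, here is a minimal awk sketch of the same UID-to-name substitution. It assumes the 3- or 6-column fastqUIDmap format Brian describes above, and that column 1 of the posmap file is the read UID; the file names asm.gkpStore.fastqUIDmap and asm.posmap.frgscf are placeholders for your own assembly's files.

```shell
# First pass (NR == FNR): build a UID -> read-name table from the fastqUIDmap.
# Second pass: rewrite column 1 of the posmap using that table.
awk 'NR == FNR {
       name[$1] = $3               # unpaired read, or first read of a pair
       if (NF == 6) name[$4] = $6  # second read of a paired entry
       next
     }
     {
       if ($1 in name) $1 = name[$1]
       print
     }' asm.gkpStore.fastqUIDmap asm.posmap.frgscf
```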
From: Christoph H. <chr...@gm...> - 2012-04-25 11:41:08

Hi Heiner,

Thanks for your effort and helpful comments! The overlap job did actually finish now, but unfortunately CABOG crashed right afterwards because it ran out of disk space. Very unfortunate, but I have to ask for more disk space before I can resume the assembly manually. I was not expecting to complete the whole assembly in ten days, just the overlap-trim stage for now.

Concerning reducing the coverage: I thought about that, but I have also tested several de Bruijn graph assemblers and have found that I get the best results when using all the Illumina data (instead of only a subset of it). The Illumina data I am using is already error-corrected. I decided to use the data like that and to rely on the CABOG trimming algorithm. With stringent manual trimming prior to CABOG I could reduce the number of Illumina reads to some 160 million (paired-end reads). Also, I suppose leaving the 14 million single-end Illumina reads out will not substantially affect the result. That would result in some 160 million Illumina reads (76 bp) + 1.1 million 454 reads (500 bp). Assuming a 100 Mb genome, that is still a theoretical 130x coverage; assuming some 20-30% host and bacterial contamination, we reach about 100x coverage.

The question now is what would be more effective: resume the assembly with the data as it is, or start from scratch with the trimmed data? An effective solution in terms of runtime is unfortunately very important to me, as I only have a limited amount of CPU hours available on the cluster. I can ask for more, but only after the initial quota is exceeded, and then it involves annoying bureaucracy and waiting time. Just to clarify why CPU hours are such an issue for me - sorry to bother you with that.

I put quite some time and effort into the configuration of the overlap jobs to reach a hash table load of some 70%, as suggested on the manual page. This was not so easy because the load varied between libraries, so I decided to focus on the paired-end Illumina library, as this is the vast majority of the data. I had configured for 8 threads and the pipeline was constantly using all 8 threads. My Illumina data is in zipped format.

The alternative approach you mention below sounds very interesting, especially as I already have the best possible (I believe so, at least :-)) Solexa-only assembly available. Can you give me some more detailed information on that? Where can I find this Celera version? The snag is that I would need to convince the cluster administration to install the other Celera version. Almost forgot: I am using Celera Assembler 7.0 right now.

Thanks again for your suggestions, and apologies for a long message!

cheers,
Christoph

On 25.04.2012 11:28, kuhl wrote:
> [...]
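The coverage arithmetic in the message above can be checked directly. A sketch using the thread's own numbers (127x rounds to the "about 130x" quoted; the 100 Mb genome size is the assumption stated in the message):

```shell
# Theoretical coverage = total bases sequenced / genome size
awk 'BEGIN {
  illumina = 160e6 * 76    # 160 million Illumina reads at 76 bp
  r454     = 1.1e6 * 500   # 1.1 million 454 reads at ~500 bp
  genome   = 100e6         # assumed genome size of 100 Mb
  printf "%.0fx\n", (illumina + r454) / genome
}'
# → 127x
```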
From: kuhl <ku...@mo...> - 2012-04-25 10:04:08
|
Dear Christoph, I have successfully done an assembly of about 350 mio reads for a 1.2 Gb genome using Celera Assembler 6.1 (which version do you use?) from 454 and Solexa data. Anyway it took about 1.5 month to complete on a 48 core server and used plenty of disk space (2 - 3 TB) and there were lots of manual work with failed contigs that had to be corrected manually. So 10 days might be not enough. (The data will not be lost after the ten days, as you can resume the overlap.sh jobs manually and everything done so far is saved to disk) I also see from your mail that you are using a very high coverage of your genome. Celera may not take profit from that. Maybe you could reduce your dataset to a 50-70X coverage. That would reduce the computing time dramatically as computing time increases quadratically with (readnumber/coverage). It also depends how you did configure the overlapper. Depending on the configuration calculating the overlap jobs might take longer for each job or be more or less constant in computing time for each job. Another possibility I tried for a different genome (2.5Gb 10^9 reads -> I did not want to wait for three month...) is to use an debrujin graph assembler to assemble the Illumina data (I would recommend SOAPdenovo or CLC, the later one can also make use of the 454 data), split the resulting scaffolds to contigs smaller than 32000bp and feed them together with 454 data and a little (i.e. 5X) coverage of the illumina paired ends into the long read version of Celera assembler supplied with the pacificbio correction pipeline. These steps took about 1 week and delivered a much better assembly compared to using de bruijn graph assemblers alone. Question to other users/developers, did you also experience that if Illumina reads are stored in the packed format, the overlap jobs do not reach the maximum speed they should? I mean for example an overlap job configured to 12 threads is running only on 8 threads on average. 
Has anyone encountered this problem? I wish you good luck, Heiner On Tue, 24 Apr 2012 19:59:07 +0200, Christoph Hahn <chr...@gm...> wrote: > Thanks for that Ariel! Leaves me with little hope though.. > Nevertheless I understand that these kind of jobs did finish in your > experience, right? > > From my tests and the number of overlap.sh jobs created in the inital > phase I was assuming to be on the safe side with a wall clock limit of > 10 days to finish this stage. I can maybe ask the cluster administration > to prolong the wall clock limit, but I`d need some estimate of by how > long.. > I am using some 1.1 Million 454 reads (~500 bp in length) plus some 200 > Million paired end reads plus some 14 Million single end illumina reads > (76 bp read length, respecitively). The genome is estimated to be only > about 70-100 Mb in size, but we have reason to expect a substantial > amount of contamination from the host (as we are dealing with a > parasitic organism), and also a fair bit of polymorphisms as the > libraries were prepared from a pooled sample. > > Can anyone suggest a reasonable time frame for reaching a checkpoint > from which I can then resume the assembly? > > Thanks in advance!! > > Christoph > > > Am 24.04.2012 18:47, schrieb Schwartz, Ariel: >> I have experienced the same issue with our hybrid assemblies. >> Currently I am waiting for an overlap job that has been running for >> almost two weeks. >> >> I wonder if there are some recommended settings that could be used to >> alleviate this problem. >> >> Thanks, >> >> Ariel >> >> Ariel Schwartz, Ph.D. >> Senior Scientist, Bioinformatics >> Synthetic Genomics, Inc. >> >> On 4/24/12 4:44 AM, "Christoph Hahn" <chr...@gm... >> <mailto:chr...@gm...>> wrote: >> >> Dear CABOG developers and users, >> >> I am trying to do a hybrid assembly using a combination of 454 and >> single- as well as paired-end illumina data. 
>> >> After initial trouble with optimization in the 0-overlaptrim-overlap >> stage of my assembly I got it to run succesfully and during the >> previous >> 7+ days the pipeline succesfully completetd some 2260 overlap.sh >> jobs. >> Now I am encoutering something strange: The last pending >> overlap.sh job >> (2148 of 2261) is running now already for over 36 hours. The >> 002148.ovb.WORKING.gz file created by this job is slowly but steadily >> growing. It presently has some 631 M. Is this normal? Has anyone >> had a >> similar experience before? Maybe it will sort out it self eventually >> anyway, I am just a little concerned that CABOG will not finish >> the job >> until it hits the 10 days wall clock limit that is set on the cluster >> for the job, which would result in thousands of CPU hours going >> down the >> drain.. >> >> Please share your wisdom with me! >> >> much obliged, >> Christoph Hahn >> PhD fellow >> University of Oslo >> Norway >> >> ------------------------------------------------------------------------------ >> Live Security Virtual Conference >> Exclusive live event will cover all the ways today's security and >> threat landscape has changed and how IT managers can respond. >> Discussions >> will include endpoint security, mobile security and the latest in >> malware >> threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ >> _______________________________________________ >> wgs-assembler-users mailing list >> wgs...@li... >> <mailto:wgs...@li...> >> https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users >> -- --------------------------------------------------------------- Dr. Heiner Kuhl MPI Molecular Genetics Tel: + 49 + 30 / 8413 1551 Next Generation Sequencing Ihnestrasse 73 email: ku...@mo... D-14195 Berlin http://www.molgen.mpg.de --------------------------------------------------------------- |
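Heiner's intermediate step above (splitting de Bruijn scaffolds into pieces below the 32000 bp limit before feeding them back to Celera Assembler) could be sketched roughly as below. The splitting strategy (cutting at runs of Ns first, then hard-cutting anything still over the limit) is an assumption for illustration, not a description of his actual script:

```python
# Sketch: split scaffolds into contigs < 32000 bp for re-input to Celera
# Assembler. Splitting first at N-runs (scaffold gaps), then hard-cutting
# oversized pieces, is an assumed strategy -- adapt to your own data.
import re

MAX_LEN = 32000

def read_fasta(text):
    """Yield (name, sequence) records from a multi-FASTA string."""
    name, chunks = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line.strip())
    if name is not None:
        yield name, "".join(chunks)

def split_scaffold(name, seq, max_len=MAX_LEN):
    """Split at gap runs (>= 10 Ns), then chop anything still too long."""
    pieces = []
    for frag in re.split(r"[Nn]{10,}", seq):
        for i in range(0, len(frag), max_len - 1):
            piece = frag[i:i + max_len - 1]
            if piece:
                pieces.append(piece)
    return [(f"{name}_{k}", p) for k, p in enumerate(pieces, 1)]
```

Each output piece keeps a derived name (`scaffoldname_1`, `scaffoldname_2`, ...) so the origin of a fragment stays traceable through the hybrid assembly.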
From: Walenz, B. <bw...@jc...> - 2012-04-24 20:22:16
|
Hi-

I was fearing the day someone would ask about this. We had a choice of either doing lots of engineering to optimize directly saving the names of fastq reads, or an inelegant - and only partially completed - solution of stripping the names when the reads are loaded into the gatekeeper store and adding them back as a post process.

The names and mapping are saved in the *.gkpStore.fastqUIDmap. The format is:

  UID IID Name                 (for unpaired reads)
  UID IID Name  UID IID Name   (for paired reads)

IIDs are used internally to the assembler. Most logs refer to reads (and unitigs, contigs and scaffolds) using these. There is an implicit 'type' with each IID: "1" is a valid IID for four objects - a fragment, a unitig, a contig and a scaffold.

UIDs appear in the outputs - posmap and asm. These are guaranteed to be unique within the assembly. For reads loaded as .frg, the UID is the read name. The iidtouid file gives a mapping from IID to UID for every object in the assembly, not just reads.

Sorry for the pain. We're a bit short on engineering time at the moment, and as this wasn't an issue critical to getting a good assembly, we only made it 'not break' for an assembly with > 1 billion reads.

b

On 4/24/12 1:52 PM, "Arjun Prasad" <ap...@ma...> wrote:

>
> Hi,
>
> I need to get a read-mapping with the actual read-names for an assembly
> that was created based on FASTQ input sequences. I noticed the iidtouid
> file in the 9-terminator directory, but it has numbers for fragments
> rather than read names.
>
> Looking at the reads from the 9-terminator/.frg file I matched up some by
> sequence, and it looks like the FRG numbers are alternating reads from
> each of the paired ends. 
> > e.g., > > FRG 1 110000000001 - first entry from read 1 > No FRG 2 > FRG 3 110000000003 - 2nd entry from read 1 > FRG 4 120000000003 - 2nd entry from read 2 > FRG 5 110000000005 - 3rd entry from read 1 > FRG 6 120000000005 - 3rd entry from read 2 > FRG 100000 120000099999 - Entry 50,000 from read 2 > > I'm guessing that I can figure out the read name to iid translation by > counting into the fastq files by FRG # / 2 > > Has anyone else done this? Did I correctly interpret what the FRG numbers > mean? Are there any gotchas at input file boundaries? > > Thanks, > Arjun |
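Brian's description of the fastqUIDmap layout (one UID/IID/name triplet per line for unpaired reads, two triplets per line for a mated pair) can be turned into a small lookup table. A rough sketch, assuming the fields really are plain whitespace-separated as described; column details may differ between CA versions:

```python
# Sketch: build IID -> read-name and UID -> read-name lookups from a
# *.gkpStore.fastqUIDmap file. Assumes each line holds one UID/IID/name
# triplet (unpaired read) or two triplets (mated pair), as described in
# Brian's message. UIDs are kept as strings to avoid assuming they are
# always numeric.
def parse_fastq_uid_map(lines):
    iid_to_name = {}
    uid_to_name = {}
    for line in lines:
        fields = line.split()
        if len(fields) not in (3, 6):
            continue  # skip blank or unexpected lines
        for i in range(0, len(fields), 3):
            uid, iid, name = fields[i:i + 3]
            iid_to_name[int(iid)] = name
            uid_to_name[uid] = name
    return iid_to_name, uid_to_name
```

With these dictionaries, IIDs found in assembler logs or the posmap outputs can be mapped back to the original fastq read names, without guessing at FRG-number arithmetic.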
From: Christoph H. <chr...@gm...> - 2012-04-24 17:59:22
|
Thanks for that, Ariel! Leaves me with little hope though.. Nevertheless, I understand that these kinds of jobs did finish in your experience, right?

From my tests and the number of overlap.sh jobs created in the initial phase, I assumed I was on the safe side with a wall-clock limit of 10 days to finish this stage. I can maybe ask the cluster administration to prolong the wall-clock limit, but I'd need some estimate of by how much.. I am using some 1.1 million 454 reads (~500 bp in length), plus some 200 million paired-end reads, plus some 14 million single-end Illumina reads (76 bp read length, respectively). The genome is estimated to be only about 70-100 Mb in size, but we have reason to expect a substantial amount of contamination from the host (as we are dealing with a parasitic organism), and also a fair bit of polymorphism, as the libraries were prepared from a pooled sample.

Can anyone suggest a reasonable time frame for reaching a checkpoint from which I can then resume the assembly?

Thanks in advance!!

Christoph

On 24.04.2012 18:47, Schwartz, Ariel wrote:
> I have experienced the same issue with our hybrid assemblies.
> Currently I am waiting for an overlap job that has been running for
> almost two weeks.
>
> I wonder if there are some recommended settings that could be used to
> alleviate this problem.
>
> Thanks,
>
> Ariel
>
> Ariel Schwartz, Ph.D.
> Senior Scientist, Bioinformatics
> Synthetic Genomics, Inc.
>
> On 4/24/12 4:44 AM, "Christoph Hahn" <chr...@gm...
> <mailto:chr...@gm...>> wrote:
>
> Dear CABOG developers and users,
>
> I am trying to do a hybrid assembly using a combination of 454 and
> single- as well as paired-end illumina data. 
> Now I am encoutering something strange: The last pending > overlap.sh job > (2148 of 2261) is running now already for over 36 hours. The > 002148.ovb.WORKING.gz file created by this job is slowly but steadily > growing. It presently has some 631 M. Is this normal? Has anyone > had a > similar experience before? Maybe it will sort out it self eventually > anyway, I am just a little concerned that CABOG will not finish > the job > until it hits the 10 days wall clock limit that is set on the cluster > for the job, which would result in thousands of CPU hours going > down the > drain.. > > Please share your wisdom with me! > > much obliged, > Christoph Hahn > PhD fellow > University of Oslo > Norway > > ------------------------------------------------------------------------------ > Live Security Virtual Conference > Exclusive live event will cover all the ways today's security and > threat landscape has changed and how IT managers can respond. > Discussions > will include endpoint security, mobile security and the latest in > malware > threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ > _______________________________________________ > wgs-assembler-users mailing list > wgs...@li... > <mailto:wgs...@li...> > https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users > |
From: Arjun P. <ap...@ma...> - 2012-04-24 17:53:45
|
Hi,

I need to get a read-mapping with the actual read-names for an assembly that was created based on FASTQ input sequences. I noticed the iidtouid file in the 9-terminator directory, but it has numbers for fragments rather than read names.

Looking at the reads from the 9-terminator/.frg file I matched up some by sequence, and it looks like the FRG numbers are alternating reads from each of the paired ends, e.g.:

FRG 1      110000000001 - first entry from read file 1
(no FRG 2)
FRG 3      110000000003 - 2nd entry from read file 1
FRG 4      120000000003 - 2nd entry from read file 2
FRG 5      110000000005 - 3rd entry from read file 1
FRG 6      120000000005 - 3rd entry from read file 2
FRG 100000 120000099999 - entry 50,000 from read file 2

I'm guessing that I can figure out the read-name-to-IID translation by counting into the fastq files by FRG # / 2.

Has anyone else done this? Did I correctly interpret what the FRG numbers mean? Are there any gotchas at input file boundaries?

Thanks,
Arjun

--
Genome Technology Branch
National Human Genome Research Institute
National Institutes of Health
5625 Fishers Lane          Phone: 301-594-9199
Room 5N-01L                Fax: 301-435-6170
Rockville, MD 20892-9400   E-Mail: ap...@nh...
|
From: Schwartz, A. <asc...@sy...> - 2012-04-24 17:00:21
|
I have experienced the same issue with our hybrid assemblies. Currently I am waiting for an overlap job that has been running for almost two weeks. I wonder if there are some recommended settings that could be used to alleviate this problem. Thanks, Ariel Ariel Schwartz, Ph.D. Senior Scientist, Bioinformatics Synthetic Genomics, Inc. On 4/24/12 4:44 AM, "Christoph Hahn" <chr...@gm...<mailto:chr...@gm...>> wrote: Dear CABOG developers and users, I am trying to do a hybrid assembly using a combination of 454 and single- as well as paired-end illumina data. After initial trouble with optimization in the 0-overlaptrim-overlap stage of my assembly I got it to run succesfully and during the previous 7+ days the pipeline succesfully completetd some 2260 overlap.sh jobs. Now I am encoutering something strange: The last pending overlap.sh job (2148 of 2261) is running now already for over 36 hours. The 002148.ovb.WORKING.gz file created by this job is slowly but steadily growing. It presently has some 631 M. Is this normal? Has anyone had a similar experience before? Maybe it will sort out it self eventually anyway, I am just a little concerned that CABOG will not finish the job until it hits the 10 days wall clock limit that is set on the cluster for the job, which would result in thousands of CPU hours going down the drain.. Please share your wisdom with me! much obliged, Christoph Hahn PhD fellow University of Oslo Norway ------------------------------------------------------------------------------ Live Security Virtual Conference Exclusive live event will cover all the ways today's security and threat landscape has changed and how IT managers can respond. Discussions will include endpoint security, mobile security and the latest in malware threats. 
http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ _______________________________________________ wgs-assembler-users mailing list wgs...@li...<mailto:wgs...@li...> https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users |
From: Christoph H. <chr...@gm...> - 2012-04-24 11:44:27
|
Dear CABOG developers and users,

I am trying to do a hybrid assembly using a combination of 454 and single- as well as paired-end Illumina data.

After initial trouble with optimization in the 0-overlaptrim-overlap stage of my assembly, I got it to run successfully, and during the previous 7+ days the pipeline successfully completed some 2260 overlap.sh jobs. Now I am encountering something strange: the last pending overlap.sh job (2148 of 2261) has now been running for over 36 hours. The 002148.ovb.WORKING.gz file created by this job is slowly but steadily growing; it presently has some 631 MB. Is this normal? Has anyone had a similar experience before? Maybe it will sort itself out eventually anyway; I am just a little concerned that CABOG will not finish the job before it hits the 10-day wall-clock limit that is set on the cluster for the job, which would result in thousands of CPU hours going down the drain..

Please share your wisdom with me!

much obliged,
Christoph Hahn
PhD fellow
University of Oslo
Norway
|
From: Ole K. T. <o.k...@bi...> - 2012-04-15 20:07:42
|
Hi Paul. You can use Hawkeye for this: http://sourceforge.net/apps/mediawiki/amos/index.php?title=Hawkeye (At least as long as your assembly is not too large, bacteria are fine, but mammal genomes will probably not work.) Ole On 15 April 2012 21:28, Paul Cantalupo <pca...@gm...> wrote: > Hi, > > Does anybody know if there are any graphical viewing programs for showing > the output of CA so that I can manually see the contigs (scaffolds and > degenerates), consensus sequence and reads that were used to construct the > contigs? Thank you, > > Paul > > University of Pittsburgh > Pittsburgh, PA > > > > ------------------------------------------------------------------------------ > For Developers, A Lot Can Happen In A Second. > Boundary is the first to Know...and Tell You. > Monitor Your Applications in Ultra-Fine Resolution. Try it FREE! > http://p.sf.net/sfu/Boundary-d2dvs2 > _______________________________________________ > wgs-assembler-users mailing list > wgs...@li... > https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users > |
From: Paul C. <pca...@gm...> - 2012-04-15 19:37:03
|
Hi Brian,

First, I'd like to thank you and the development team at your institution for making CABOG public. I am finding it a very valuable tool to use.

On Sun, Apr 15, 2012 at 2:56 PM, Walenz, Brian <bw...@jc...> wrote:
> Without getting into precise definitions: Scaffolder (cgw) promotes
> unitigs that looks like unique sequence (based on coverage, length and a
> few other signals) to contigs.

What command line options govern this? Your answer probably depends on what I'm trying to do. I usually do two types of assemblies:

1) metagenomic (therefore, a complex mixed sample containing sequences from many species)

2) targeted smaller assemblies with reads that are similar to one species. Here, I'm trying to make the assembly quicker and hopefully more accurate by only selecting reads that are similar to one species, in hopes of assembling a complete genome.

Thank you again for your help,

Paul

> The left over unitigs are available for gap filling as repeats or
> singletons. The unique contigs are then promoted almost immediately to
> single-contig scaffolds. With no mates, that's all scaffolder will do.
> The scaffolds/contigs are output as is, and the left over unitigs are
> output as degenerate contigs.
>
> bri
> --
> Brian Walenz
> Senior Software Engineer
> J. Craig Venter Institute
>
> ________________________________________
> From: Paul Cantalupo [pca...@gm...]
> Sent: Sunday, April 15, 2012 2:53 PM
> To: wgs-assembler-users
> Subject: [wgs-assembler-users] degenerate contigs
>
> Hi
>
> I work with non-paired end 454 sequences. When I perform an assembly, I
> always get a set of regular contigs and degenerate contigs. The celera
> assembler glossary says that degenerate contigs are those unitigs that
> cannot be placed into scaffolds. Well, with my non-paired end data, how can
> *any* contig be placed into a scaffold. Scaffolds cannot be built without
> paired-end data, right?. So, can somebody tell me the difference between a
> "regular" contig and a degenerate contig?
>
> Thank you for your help,
>
> Paul
>
> University of Pittsburgh
> Pittsburgh, PA 15260
|
From: Paul C. <pca...@gm...> - 2012-04-15 19:28:46
|
Hi,

Does anybody know if there are any graphical viewing programs for the output of CA, so that I can visually inspect the contigs (scaffolds and degenerates), the consensus sequence, and the reads that were used to construct the contigs? Thank you,

Paul

University of Pittsburgh
Pittsburgh, PA
|
From: Walenz, B. <bw...@jc...> - 2012-04-15 19:01:18
|
Without getting into precise definitions: Scaffolder (cgw) promotes unitigs that look like unique sequence (based on coverage, length and a few other signals) to contigs. The leftover unitigs are available for gap filling as repeats or singletons. The unique contigs are then promoted almost immediately to single-contig scaffolds. With no mates, that's all scaffolder will do. The scaffolds/contigs are output as is, and the leftover unitigs are output as degenerate contigs.

bri
--
Brian Walenz
Senior Software Engineer
J. Craig Venter Institute

________________________________________
From: Paul Cantalupo [pca...@gm...]
Sent: Sunday, April 15, 2012 2:53 PM
To: wgs-assembler-users
Subject: [wgs-assembler-users] degenerate contigs

Hi

I work with non-paired end 454 sequences. When I perform an assembly, I always get a set of regular contigs and degenerate contigs. The celera assembler glossary says that degenerate contigs are those unitigs that cannot be placed into scaffolds. Well, with my non-paired end data, how can *any* contig be placed into a scaffold. Scaffolds cannot be built without paired-end data, right?. So, can somebody tell me the difference between a "regular" contig and a degenerate contig?

Thank you for your help,

Paul

University of Pittsburgh
Pittsburgh, PA 15260
|
From: Paul C. <pca...@gm...> - 2012-04-15 18:53:06
|
Hi

I work with non-paired-end 454 sequences. When I perform an assembly, I always get a set of regular contigs and degenerate contigs. The Celera Assembler glossary says that degenerate contigs are those unitigs that cannot be placed into scaffolds. Well, with my non-paired-end data, how can *any* contig be placed into a scaffold? Scaffolds cannot be built without paired-end data, right? So, can somebody tell me the difference between a "regular" contig and a degenerate contig?

Thank you for your help,

Paul

University of Pittsburgh
Pittsburgh, PA 15260
|
From: Christoph H. <chr...@gm...> - 2012-04-15 14:06:02
|
Hi Brian,

Thanks so much for your help! I have resumed the assembly now with the following settings:

ovlHashBits=23
ovlHashBlockLength=260000000

This consumes some 8.5 GB per job, and in my tests gave me a nice load of some 70% (see ex1 below), but I have discovered that the load drops to some 43% after the 13th overlapper job and stays constant after that (currently job 77, see ex2 below). So, again not very efficient. What could be the reason for that? Could it be because I am feeding CA two separate Illumina datasets (one small single-end library and one large paired-end library)?

ex1:
HASH LOADING STOPPED: strings 3524789 out of 3524789 max.
HASH LOADING STOPPED: length 260000046 out of 260000046 max.
HASH LOADING STOPPED: entries 127378102 out of 132120576 max (load 72.31).
### realloc Extra_Ref_Space max_extra_ref_ct = 76183793 String_Ct = 3524789 Extra_String_Ct = 755 Extra_String_Subcount = 35
Read 563144 kmers to mark to skip
Kmer hits without olaps = 13633635
Kmer hits with olaps = 2890745
Multiple overlaps/pair = 0
Total overlaps produced = 2837254
Contained overlaps = 0
Dovetail overlaps = 0

ex2:
HASH LOADING STOPPED: strings 3393657 out of 3393657 max.
HASH LOADING STOPPED: length 260000052 out of 260000052 max.
HASH LOADING STOPPED: entries 76303061 out of 132120576 max (load 43.31).
### realloc Extra_Ref_Space max_extra_ref_ct = 127528828 String_Ct = 3393657 Extra_String_Ct = 13 Extra_String_Subcount = 7
Read 563144 kmers to mark to skip
Kmer hits without olaps = 5141573
Kmer hits with olaps = 3859708
Multiple overlaps/pair = 0
Total overlaps produced = 3728782
Contained overlaps = 0
Dovetail overlaps = 0

I also looked at the size of the *gkpStore/inf file. It has 1.1 GB. How do I affect which fragments are loaded first? Is it simply done by the order they are listed in the spec file? If so, I have loaded the Illumina fragments first.

Thanks again for your help! I really appreciate it! 
cheers,
Christoph

On 13.04.2012 17:00, Walenz, Brian wrote:
> I've seen this too, and am a bit confused where the extra space is used.
> Some assemblies are spot on, others are up to twice as large.
>
> The entries below is 264..., where 957... of them are used. In this case,
> you can either increase hashBlockLength (more memory) or decrease hashBits
> (less memory). The important stat in what you show is ~30% load - most of
> that 3.5gb hash table is empty. We target 70% load. Any higher and the
> table does inefficient lookups, and lower wastes space and increases
> overlapper overhead (more jobs).
>
> One thing to check is the size of file *gkpStore/inf. This is loaded into
> memory nThreads+1 times. The next version (or the CVS tip version) will
> make this less of a problem. If the 'inf' file is large, loading Illumina
> fragments first should reduce the size.
>
> b
>
> On 4/13/12 10:52 AM, "Christoph Hahn" <chr...@gm...> wrote:
>
>> Hi Brian,
>>
>> Thanks for your reply and suggestions!
>>
>> I did follow your suggestion and configured the overlap jobs with
>> "useGrid=1, scriptOnGrid=0". I subsequently ran overlap.sh 1, etc. to
>> check the memory usage.
>>
>> I am using the following overlap parameters:
>>
>> ovlHashBits=24, ovlHashBlockLength=200000000
>>
>> according to my calculations this would consume some 6 GB of memory
>> (3.5GB from ovlHashBits=24 + 0.5 GB overhang + some 2 GB for the 200 Mb
>> of sequence loaded) per thread.
>>
>> The actual max memory consumption is about 9.6 GB (I ran several
>> overlap.sh jobs by hand), so there is a difference of some 3.5 GB of
>> memory consumption between calculated and observed. Am I missing
>> anything? Where is the error in my calculation?
>>
>> When running the overlap.sh I get something like this:
>> HASH LOADING STOPPED: strings 2695151 out of 2695151 max.
>> HASH LOADING STOPPED: length 200000024 out of 200000024 max.
>> HASH LOADING STOPPED: entries 95738763 out of 264241152 max
>> (load 27.17). 
>>
>> In order to optimize, one question to your rule of thumb ("As a rule of
>> thumb, setting ovlHashBlockLength to twice the number of entries
>> available in the table seems reasonable."): in my example, which one is
>> the number of entries available in the table? 95738763 or 264241152? I
>> am a little confused with the terminology... sorry.
>>
>> Thanks again for your kind help!
>>
>> cheers,
>> Christoph
>>
>> On 12.04.2012 21:55, Walenz, Brian wrote:
>>> Hi, Christoph-
>>>
>>> In general (but with exceptions) you can delete a stage and runCA will
>>> pick up from there. For example, you can delete 4-unitigger, fiddle with
>>> parameters, and restart exactly at creating unitigs.
>>>
>>> This works fine with overlaps. Just delete 0-overlaptrim-overlap (and
>>> nothing else!), change parameters and restart runCA. It will skip
>>> gatekeeper, meryl, any trimming, and move straight to configuring overlaps.
>>>
>>> Tip: For overlaps on large assemblies, I like to set "useGrid=1
>>> scriptOnGrid=0". This will configure the overlap jobs, then print out a
>>> qsub command to run them on SGE, but not actually submit them. I then
>>> run several jobs by hand to see memory size and compute performance. To
>>> run by hand, in 0-overlaptrim-overlap, run "overlap.sh 1", "overlap.sh
>>> 2" etc. If you stop these early, they will leave an incomplete
>>> "*.WORKING.gz" file in the output directory (001/ 002/ 003/ etc). I
>>> don't think overlap.sh checks for these files, so you don't even have to
>>> remove them before submitting the full batch.
>>>
>>> b
>>>
>>> On 4/11/12 5:02 PM, "Christoph Hahn"<chr...@gm...> wrote:
>>>
>>> Dear CA developers and users,
>>>
>>> I am trying to use Celeara assembler 7.0 to assemble a medium sized
>>> genome (about 100 Mb) using a combination of 454 and illumina reads. 
>>> >>> I choose a bad combination of the ovlHashBits, ovlHashBlockLength >>> and ovlThreads options so that my last run stopped at the cluster I >>> am using due to exceeding memory limit in the overlaptrim step. I >>> think I know what the problem was, now, so my question is if it is >>> possible to resume runCA from any given stage. In my particular case >>> I would like to resume from the 0-overlaptrim-overlap stage with >>> altered ovlHashBits, ovlHashBlockLength and ovlThreads options. I >>> want to avaid doing the mercouts and initialtrim steps again, >>> because they seem to have worked fine. >>> >>> I read in the manual about using the /do*/ option to get a kind of >>> /startBefore/ effect. I cant seem to find any more details on this >>> in the manual, so can you maybe help me out or point me to the >>> required information on the webpage. Thanks! >>> >>> Your help is highly appreciated! >>> >>> much obliged, >>> Christoph Hahn >>> PhD student >>> University of Oslo >>> >>> |
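Christoph's back-of-envelope memory budget from this thread (hash table + overhang + loaded sequence) can be written down as a tiny calculator. The constants below are lifted from his own estimate (roughly 3.5 GB of hash table at ovlHashBits=24, doubling per extra bit, 0.5 GB of overhang, about 10 bytes per loaded base), not from the Celera Assembler source, and as the thread shows, observed usage can be several GB higher:

```python
# Rough per-job overlapper memory estimate, using the constants from
# Christoph's back-of-envelope numbers (NOT the actual CA internals):
# ~3.5 GB hash table at ovlHashBits=24 (doubling per extra bit),
# 0.5 GB overhang, ~10 bytes per base of ovlHashBlockLength.
def estimate_overlap_gb(ovl_hash_bits, ovl_hash_block_length):
    hash_table_gb = 3.5 * 2.0 ** (ovl_hash_bits - 24)
    overhang_gb = 0.5
    sequence_gb = ovl_hash_block_length * 10 / 1e9
    return hash_table_gb + overhang_gb + sequence_gb

# The two configurations discussed in the thread:
print(estimate_overlap_gb(24, 200_000_000))  # ~6 GB estimated (9.6 GB observed)
print(estimate_overlap_gb(23, 260_000_000))  # ~4.85 GB estimated (8.5 GB observed)
```

The gap between estimate and observation (about 3.5 GB in the thread) is at least partly explained by Brian's note that the gkpStore 'inf' file (1.1 GB here) is loaded into memory nThreads+1 times.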
From: Walenz, B. <bw...@jc...> - 2012-04-12 19:55:25
|
Hi, Christoph-

In general (but with exceptions) you can delete a stage and runCA will pick up from there. For example, you can delete 4-unitigger, fiddle with parameters, and restart exactly at creating unitigs.

This works fine with overlaps. Just delete 0-overlaptrim-overlap (and nothing else!), change parameters and restart runCA. It will skip gatekeeper, meryl, any trimming, and move straight to configuring overlaps.

Tip: For overlaps on large assemblies, I like to set “useGrid=1 scriptOnGrid=0”. This will configure the overlap jobs, then print out a qsub command to run them on SGE, but not actually submit them. I then run several jobs by hand to see memory size and compute performance. To run by hand, in 0-overlaptrim-overlap, run “overlap.sh 1”, “overlap.sh 2” etc. If you stop these early, they will leave an incomplete “*.WORKING.gz” file in the output directory (001/ 002/ 003/ etc). I don’t think overlap.sh checks for these files, so you don’t even have to remove them before submitting the full batch.

b

On 4/11/12 5:02 PM, "Christoph Hahn" <chr...@gm...> wrote:

Dear CA developers and users,

I am trying to use Celera Assembler 7.0 to assemble a medium-sized genome (about 100 Mb) using a combination of 454 and Illumina reads.

I chose a bad combination of the ovlHashBits, ovlHashBlockLength and ovlThreads options, so that my last run stopped at the cluster I am using due to exceeding the memory limit in the overlaptrim step. I think I know what the problem was now, so my question is whether it is possible to resume runCA from any given stage. In my particular case I would like to resume from the 0-overlaptrim-overlap stage with altered ovlHashBits, ovlHashBlockLength and ovlThreads options. I want to avoid doing the mercounts and initialtrim steps again, because they seem to have worked fine.

I read in the manual about using the do* option to get a kind of startBefore effect. 
I can't seem to find any more details on this in the manual, so could you maybe help me out or point me to the required information on the web page? Thanks!

Your help is highly appreciated!

much obliged,
Christoph Hahn
PhD student
University of Oslo
|
From: Ole K. T. <o.k...@bi...> - 2012-04-12 16:16:01
|
Hi Brian.

Incidentally, the numbers are the same in mine. I thought maybe you had gleaned my numbers from the files I sent you, and that you used them to make it easier for me to understand. :)

To sum up: the contig is identical in versions 14 and 16, with the same 'data.contig_status' (set to U) and, as far as I can see, the same consensus and sequence length. In version 15, however, the consensus (and quality scores) are lacking, the length of the contig is set to 0, and 'data.contig_status' is also U.

I dumped the contigs by using 'tigStore -g *gkpStore -t *tigStore 14 -c 107652 -d layout3 > ctg1076523', and just varying the version number. I tried loading the contig a couple of times into version 15 and then dumping it again, but still, it was without consensus sequence, and asmOutputFasta fails.

Have I messed up the latter stages of my assembly by doing this? Is it possible to fix this in any way?

Thanks for your help so far. It's good to learn more about Celera.

Ole

On 12 April 2012 16:50, Walenz, Brian <bw...@jc...> wrote:
> Hi Ole-
>
> The version numbers will be different in different assemblies. Mine came
> from a small assembly with little scaffolding work. Larger assemblies can
> have more than 100 versions. 'ls -l *tigStore' will show the versions - you
> want to use the last three.
>
> b
>
> On 4/12/12 3:52 AM, "Ole Kristian Tørresen" <o.k...@bi...> wrote:
>
>> Hi, Brian.
>>
>> Thank you for your help so far, but I seem to be missing something. 
>> >> I did this: >> tigStore -g *gkpStore -t *tigStore 14 -c 107652 -d layout3 > ctg1076523 >> tigStore -g *gkpStore -t *tigStore 15 -cp 36 -R ctg1076523 >> >> But when I dump the same contig from version 15: >> tigStore -g *gkpStore -t *tigStore 15 -c 107652 -d layout3 > ctg1076523_v15 >> it's without consensus sequence: >> contig 1076523 >> len 0 >> cns >> qlt >> data.unitig_coverage_stat -9874.792662 >> data.unitig_microhet_prob 0.000000 >> data.unitig_status X >> data.unitig_unique_rept X >> data.contig_status U >> data.num_frags 14302 >> data.num_unitigs 1 >> >> The if I dump it from version 16, it's identical to the one from >> version 14 (that is, with consensus). I've tried loading it several >> times, but each time I dump it again it's lost consensus. Do you know >> what I'm doing wrong? >> >> Ole >> >> On 11 April 2012 20:54, Walenz, Brian <bw...@jc...> wrote: >>> Hi, Ole- >>> >>> Yes, I overlooked a step. In the contig you insert to the latest version, >>> update the 'data.contig_status' with what the second to last version has. >>> >>> FYI, the tigStore should have versions such as: >>> >>> seqDB.v014.ctg >>> seqDB.v014.dat >>> seqDB.v014.utg >>> >>> seqDB.v015.ctg >>> seqDB.v015.p001.ctg >>> seqDB.v015.p001.dat >>> (etc) >>> seqDB.v015.utg >>> >>> seqDB.v016.ctg >>> seqDB.v016.p001.ctg >>> seqDB.v016.p001.dat >>> (etc) >>> seqDB.v016.utg >>> >>> (the v numbers will of course be different in your assembly) >>> >>> v015 contains the output of scaffolder, which is the input to consensus. >>> Contigs here have no consensus sequence, but otherwise all the data is >>> present. It is largely just rewriting the data from v014 into partitions >>> (p###), so each consensus job can load a single file instead of randomly >>> accessing a large file. The status flag on each unitig/contig is also set. >>> This flag tells if the unitig/contig was placed in a scaffold, is a >>> surrogate, degenerate, etc. 
>>> >>> v016 is the output of consensus, the input to terminator. All terminator >>> does is to repackage this into ASCII files. >>> >>> To summarize: grab the contig from v014 (the last with a consensus >>> sequence), the status flag from v015, change the status flag in the contig >>> you grabbed, and then insert the contig into v016. >>> >>> by doing this, you'll lose VAR records for this contig, but otherwise the >>> consensus sequence is the same (or largely the same; variant detection can >>> change it a bit). >>> >>> b >>> >>> >>> On 4/11/12 6:23 AM, "Ole Kristian Tørresen" <o.k...@bi...> wrote: >>> >>>> Hi Brian, >>>> ctgcns completed now, but I got an error with asmOutputFasta. From >>>> 9-terminator/asmOutputFasta.err: >>>> ERROR: Illegal unitigpos type type value 'X' (CCO) at line 1676575956 >>>> >>>> Is this connected with the procedure I did with inserting the contig >>>> from an older tigStore? >>>> >>>> Thank you for your help so far. >>>> >>>> Ole >>>> >>>> On 11 April 2012 08:13, Ole Kristian Tørresen <o.k...@bi...> >>>> wrote: >>>>> Hi Brian. >>>>> >>>>> I've done this, and rerunning ctgcns on that last partition. I'll send >>>>> the layout and log in a separate email. >>>>> >>>>> Ole >>>>> >>>>> On 10 April 2012 21:37, Walenz, Brian <bw...@jc...> wrote: >>>>>> Hi Ole- >>>>>> >>>>>> I don't see anything that looks like an error in the log, so I'll have to >>>>>> assume it crashed. You report it runs for 20 hours, which is odd for >>>>>> contig >>>>>> consensus, unless that contig is very very deep. If so, the ctgcns >>>>>> process >>>>>> will also be large. Do you know how big the process was? >>>>>> >>>>>> Can you make the full log available? >>>>>> >>>>>> It is possible to force the contig to have a consensus sequence. If the >>>>>> job >>>>>> did crash, the other contigs will still need to have consensus generated. 
>>>>>> >>>>>> The process is the same as editing a unitig in the tigStore: dump the >>>>>> contig >>>>>> in question, edit the file to have a consensus sequence, then load that >>>>>> contig back into the tigStore. A consensus sequence for this contig can >>>>>> be >>>>>> found in one of the earlier tigStore versions; the version just before >>>>>> this >>>>>> one will probably have it. That makes our process even easier: dump the >>>>>> version with a consensus sequence, and load it back into the latest >>>>>> version. >>>>>> >>>>>> A sketch of the steps: >>>>>> >>>>>> 1) Dump the previous version of the contig. check that 'file' does >>>>>> contain >>>>>> a consensus sequence. >>>>>> >>>>>> tigStore -g *gkpStore -t *tigStore <vers-1> -c <ctgID> -d layout > file >>>>>> >>>>>> 2) Load that pervious version into the tigStore as the latest version >>>>>> >>>>>> tigStore -g *gkpStore -t *tigStore <vers> <part> -c <ctgID> -R file >>>>>> >>>>>> Notice that this tigStore command specifies both a version and a partition >>>>>> for the tigStore. >>>>>> >>>>>> 3) Rerun consensus.sh on that partition. It will not attempt to compute >>>>>> the >>>>>> consensus for that contig. >>>>>> >>>>>> I'd be interested in seeing the contig you dump, if only to verify that it >>>>>> is deep. >>>>>> >>>>>> b >>>>>> >>>>>> >>>>>> >>>>>> On 4/10/12 4:05 AM, "Ole Kristian Tørresen" <o.k...@bi...> >>>>>> wrote: >>>>>> >>>>>>> Hi, >>>>>>> I'm having some problems while doing some low coverage sequencing >>>>>>> assembly testing. I've tried to assemble about 10x coverage of 150 nt >>>>>>> paired Illumina reads of 500 bp fragment size. These are from the >>>>>>> parrot used in the Assemblathon 2 >>>>>>> (http://assemblathon.org/pages/download-data). Everything seems to run >>>>>>> fine, until contig consensus, where 1 partition just don't succeed. It >>>>>>> seems to run for quite some time (20 hours or something) before >>>>>>> failing. 
These are the last 20 lines from the output of the ctgcns >>>>>>> partition that fails: >>>>>>> Alignment params: 297 333 200 200 0 0.12 1e-06 30 1 >>>>>>> -- e/l = 7/112 = 6.25% >>>>>>> A -----+------+----> [] >>>>>>> B 332 -------> 40 [] >>>>>>> GetAlignmentTrace()-- Overlap ACCEPTED! accept=1000.000000 >>>>>>> lScore=0.026087 (112 vs 115) aScore=0.160000 (332 vs 316) >>>>>>> bScore=0.150000 (-42 vs -27). (CONTIGF) >>>>>>> GetAlignmentTrace()-- Overlap found between 1076523 (U) and 25763657 >>>>>>> (R) expected hangs: a=316 b=-27 erate=0.060000 aligner=Local_Overlap >>>>>>> GetAlignmentTrace()-- Overlap ACCEPTED! accept=1000.000000 >>>>>>> lScore=0.026087 (112 vs 115) aScore=0.160000 (332 vs 316) >>>>>>> bScore=0.150000 (-42 vs -27). (CONTIGF) >>>>>>> Local_Overlap_AS_forCNS found overlap between 1076523 (U) and 25763657 >>>>>>> (R) ahang: 332, bhang: -42 (expected hang was 316) >>>>>>> Alignment params: 298 334 200 200 0 0.12 1e-06 30 1 >>>>>>> -- e/l = 6/112 = 5.36% >>>>>>> A -----+------+----> [] >>>>>>> B 332 -------> 42 [] >>>>>>> GetAlignmentTrace()-- Overlap ACCEPTED! accept=1000.000000 >>>>>>> lScore=0.009009 (110 vs 111) aScore=0.140000 (332 vs 318) >>>>>>> bScore=0.130000 (-42 vs -29). (CONTIGF) >>>>>>> GetAlignmentTrace()-- Overlap found between 1076523 (U) and 57537697 >>>>>>> (R) expected hangs: a=318 b=-29 erate=0.060000 aligner=Local_Overlap >>>>>>> GetAlignmentTrace()-- Overlap ACCEPTED! accept=1000.000000 >>>>>>> lScore=0.009009 (110 vs 111) aScore=0.140000 (332 vs 318) >>>>>>> bScore=0.130000 (-42 vs -29). 
(CONTIGF) >>>>>>> Local_Overlap_AS_forCNS found overlap between 1076523 (U) and 57537697 >>>>>>> (R) ahang: 332, bhang: -42 (expected hang was 318) >>>>>>> Alignment params: 300 336 200 200 0 0.12 1e-06 30 1 >>>>>>> -- e/l = 6/110 = 5.45% >>>>>>> A -----+------+----> [] >>>>>>> B 332 -------> 42 [] >>>>>>> >>>>>>> This is the error message: >>>>>>> at /usit/titan/u1/olekto/src/wgs-7.0/Linux-amd64/bin/runCA line 1237 >>>>>>> main::caFailure('1 consensusAfterScaffolder jobs failed; remove >>>>>>> 8-consensus/co...', undef) called at >>>>>>> /usit/titan/u1/olekto/src/wgs-7.0/Linux-amd64/bin/runCA line 5142 >>>>>>> main::postScaffolderConsensus() called at >>>>>>> /usit/titan/u1/olekto/src/wgs-7.0/Linux-amd64/bin/runCA line 5885 >>>>>>> >>>>>>> ---------------------------------------- >>>>>>> Failure message: >>>>>>> >>>>>>> 1 consensusAfterScaffolder jobs failed; remove >>>>>>> 8-consensus/consensus.sh to try again >>>>>>> >>>>>>> I've tried removing consensus.sh and running again, but get the same >>>>>>> error. 
>>>>>>> >>>>>>> This is the spec file: >>>>>>> utgErrorRate=0.03 >>>>>>> utgErrorLimit=2.5 >>>>>>> ovlErrorRate=0.06 >>>>>>> cnsErrorRate=0.06 >>>>>>> cgwErrorRate=0.10 >>>>>>> merSize = 22 >>>>>>> overlapper=ovl >>>>>>> unitigger = bogart >>>>>>> merylMemory = 128000 >>>>>>> merylThreads = 16 >>>>>>> merOverlapperThreads = 2 >>>>>>> merOverlapperExtendConcurrency = 8 >>>>>>> merOverlapperSeedConcurrency = 8 >>>>>>> ovlThreads = 2 >>>>>>> mbtThreads = 2 >>>>>>> mbtConcurrency = 8 >>>>>>> ovlConcurrency = 8 >>>>>>> ovlCorrConcurrency = 16 >>>>>>> ovlRefBlockSize = 32000000 >>>>>>> ovlHashBits = 24 >>>>>>> ovlHashBlockLength = 800000000 >>>>>>> ovlStoreMemory = 128000 >>>>>>> frgCorrThreads = 2 >>>>>>> frgCorrConcurrency = 8 >>>>>>> ovlCorrBatchSize = 1000000 >>>>>>> ovlCorrConcurrency = 16 >>>>>>> cnsConcurrency = 16 >>>>>>> doExtendClearRanges = 0 >>>>>>> >>>>>>> I don't need to have that unitig (1076523 (U)) in my finished >>>>>>> assembly, so it's possible to just remove it as long as I get a >>>>>>> finished assembly. I've also tried to just create the .success file, >>>>>>> but then terminator fails. >>>>>>> >>>>>>> Does anyone have any ideas of what I might do different? Can I just >>>>>>> remove that unitig and proceed? How do I do that? >>>>>>> >>>>>>> Sincerely, >>>>>>> Ole Kristian Tørresen >>>>>>> PhD student >>>>>>> University of Oslo >>>>>>> >>>>>>> ------------------------------------------------------------------------- >>>>>>> -- >>>>>>> --- >>>>>>> Better than sec? Nothing is better than sec when it comes to >>>>>>> monitoring Big Data applications. Try Boundary one-second >>>>>>> resolution app monitoring today. Free. >>>>>>> http://p.sf.net/sfu/Boundary-dev2dev >>>>>>> _______________________________________________ >>>>>>> wgs-assembler-users mailing list >>>>>>> wgs...@li... >>>>>>> https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users >>>>>> >>> > |
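[Editor's sketch] The recovery procedure quoted above — dump the contig from the last version that has a consensus, fix its `data.contig_status` flag to match the second-to-last version, and load it into the latest version — can be sketched as shell commands. The tigStore invocations are shown as comments because they need the real stores; only the in-place layout edit runs here. The store/file names and the status value `P` are illustrative assumptions, not values confirmed in this thread (the contig ID is the one under discussion).

```shell
# tigStore -g asm.gkpStore -t asm.tigStore 14 -c 1076523 -d layout > ctg.layout

# stand-in for a real dumped layout file:
cat > ctg.layout <<'EOF'
contig 1076523
len 12345
data.contig_status     U
data.num_frags         14302
EOF

# replace the status flag with the one reported by the v015 dump
# ('P' here is an assumed example value)
STATUS=P
sed -i "s/^\(data\.contig_status[[:space:]]*\).*/\1$STATUS/" ctg.layout

# tigStore -g asm.gkpStore -t asm.tigStore 16 36 -c 1076523 -R ctg.layout
grep '^data.contig_status' ctg.layout
```

After loading, re-dumping the contig from the final version is a cheap check that both the consensus sequence and the corrected flag survived.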
From: Walenz, B. <bw...@jc...> - 2012-04-12 14:51:04
|
Hi Ole- The version numbers will be different in different assemblies. Mine came from a small assembly with little scaffolding work. Larger assemblies can have more than 100 versions. 'ls -l *tigStore' will show the versions - you want to use the last three. b On 4/12/12 3:52 AM, "Ole Kristian Tørresen" <o.k...@bi...> wrote: > Hi, Brian. > > Thank you for your help so far, but I seem to be missing something. > > I did this: > tigStore -g *gkpStore -t *tigStore 14 -c 107652 -d layout3 > ctg1076523 > tigStore -g *gkpStore -t *tigStore 15 -cp 36 -R ctg1076523 > > But when I dump the same contig from version 15: > tigStore -g *gkpStore -t *tigStore 15 -c 107652 -d layout3 > ctg1076523_v15 > it's without consensus sequence: > contig 1076523 > len 0 > cns > qlt > data.unitig_coverage_stat -9874.792662 > data.unitig_microhet_prob 0.000000 > data.unitig_status X > data.unitig_unique_rept X > data.contig_status U > data.num_frags 14302 > data.num_unitigs 1 > > The if I dump it from version 16, it's identical to the one from > version 14 (that is, with consensus). I've tried loading it several > times, but each time I dump it again it's lost consensus. Do you know > what I'm doing wrong? > > Ole > > On 11 April 2012 20:54, Walenz, Brian <bw...@jc...> wrote: >> Hi, Ole- >> >> Yes, I overlooked a step. In the contig you insert to the latest version, >> update the 'data.contig_status' with what the second to last version has. >> >> FYI, the tigStore should have versions such as: >> >> seqDB.v014.ctg >> seqDB.v014.dat >> seqDB.v014.utg >> >> seqDB.v015.ctg >> seqDB.v015.p001.ctg >> seqDB.v015.p001.dat >> (etc) >> seqDB.v015.utg >> >> seqDB.v016.ctg >> seqDB.v016.p001.ctg >> seqDB.v016.p001.dat >> (etc) >> seqDB.v016.utg >> >> (the v numbers will of course be different in your assembly) >> >> v015 contains the output of scaffolder, which is the input to consensus. >> Contigs here have no consensus sequence, but otherwise all the data is >> present. 
It is largely just rewriting the data from v014 into partitions >> (p###), so each consensus job can load a single file instead of randomly >> accessing a large file. The status flag on each unitig/contig is also set. >> This flag tells if the unitig/contig was placed in a scaffold, is a >> surrogate, degenerate, etc. >> >> v016 is the output of consensus, the input to terminator. All terminator >> does is to repackage this into ASCII files. >> >> To summarize: grab the contig from v014 (the last with a consensus >> sequence), the status flag from v015, change the status flag in the contig >> you grabbed, and then insert the contig into v016. >> >> by doing this, you'll lose VAR records for this contig, but otherwise the >> consensus sequence is the same (or largely the same; variant detection can >> change it a bit). >> >> b >> >> >> On 4/11/12 6:23 AM, "Ole Kristian Tørresen" <o.k...@bi...> wrote: >> >>> Hi Brian, >>> ctgcns completed now, but I got an error with asmOutputFasta. From >>> 9-terminator/asmOutputFasta.err: >>> ERROR: Illegal unitigpos type type value 'X' (CCO) at line 1676575956 >>> >>> Is this connected with the procedure I did with inserting the contig >>> from an older tigStore? >>> >>> Thank you for your help so far. >>> >>> Ole >>> >>> On 11 April 2012 08:13, Ole Kristian Tørresen <o.k...@bi...> >>> wrote: >>>> Hi Brian. >>>> >>>> I've done this, and rerunning ctgcns on that last partition. I'll send >>>> the layout and log in a separate email. >>>> >>>> Ole >>>> >>>> On 10 April 2012 21:37, Walenz, Brian <bw...@jc...> wrote: >>>>> Hi Ole- >>>>> >>>>> I don't see anything that looks like an error in the log, so I'll have to >>>>> assume it crashed. You report it runs for 20 hours, which is odd for >>>>> contig >>>>> consensus, unless that contig is very very deep. If so, the ctgcns >>>>> process >>>>> will also be large. Do you know how big the process was? >>>>> >>>>> Can you make the full log available? 
>>>>> >>>>> It is possible to force the contig to have a consensus sequence. If the >>>>> job >>>>> did crash, the other contigs will still need to have consensus generated. >>>>> >>>>> The process is the same as editing a unitig in the tigStore: dump the >>>>> contig >>>>> in question, edit the file to have a consensus sequence, then load that >>>>> contig back into the tigStore. A consensus sequence for this contig can >>>>> be >>>>> found in one of the earlier tigStore versions; the version just before >>>>> this >>>>> one will probably have it. That makes our process even easier: dump the >>>>> version with a consensus sequence, and load it back into the latest >>>>> version. >>>>> >>>>> A sketch of the steps: >>>>> >>>>> 1) Dump the previous version of the contig. check that 'file' does >>>>> contain >>>>> a consensus sequence. >>>>> >>>>> tigStore -g *gkpStore -t *tigStore <vers-1> -c <ctgID> -d layout > file >>>>> >>>>> 2) Load that pervious version into the tigStore as the latest version >>>>> >>>>> tigStore -g *gkpStore -t *tigStore <vers> <part> -c <ctgID> -R file >>>>> >>>>> Notice that this tigStore command specifies both a version and a partition >>>>> for the tigStore. >>>>> >>>>> 3) Rerun consensus.sh on that partition. It will not attempt to compute >>>>> the >>>>> consensus for that contig. >>>>> >>>>> I'd be interested in seeing the contig you dump, if only to verify that it >>>>> is deep. >>>>> >>>>> b >>>>> >>>>> >>>>> >>>>> On 4/10/12 4:05 AM, "Ole Kristian Tørresen" <o.k...@bi...> >>>>> wrote: >>>>> >>>>>> Hi, >>>>>> I'm having some problems while doing some low coverage sequencing >>>>>> assembly testing. I've tried to assemble about 10x coverage of 150 nt >>>>>> paired Illumina reads of 500 bp fragment size. These are from the >>>>>> parrot used in the Assemblathon 2 >>>>>> (http://assemblathon.org/pages/download-data). Everything seems to run >>>>>> fine, until contig consensus, where 1 partition just don't succeed. 
It >>>>>> seems to run for quite some time (20 hours or something) before >>>>>> failing. These are the last 20 lines from the output of the ctgcns >>>>>> partition that fails: >>>>>> Alignment params: 297 333 200 200 0 0.12 1e-06 30 1 >>>>>> -- e/l = 7/112 = 6.25% >>>>>> A -----+------+----> [] >>>>>> B 332 -------> 40 [] >>>>>> GetAlignmentTrace()-- Overlap ACCEPTED! accept=1000.000000 >>>>>> lScore=0.026087 (112 vs 115) aScore=0.160000 (332 vs 316) >>>>>> bScore=0.150000 (-42 vs -27). (CONTIGF) >>>>>> GetAlignmentTrace()-- Overlap found between 1076523 (U) and 25763657 >>>>>> (R) expected hangs: a=316 b=-27 erate=0.060000 aligner=Local_Overlap >>>>>> GetAlignmentTrace()-- Overlap ACCEPTED! accept=1000.000000 >>>>>> lScore=0.026087 (112 vs 115) aScore=0.160000 (332 vs 316) >>>>>> bScore=0.150000 (-42 vs -27). (CONTIGF) >>>>>> Local_Overlap_AS_forCNS found overlap between 1076523 (U) and 25763657 >>>>>> (R) ahang: 332, bhang: -42 (expected hang was 316) >>>>>> Alignment params: 298 334 200 200 0 0.12 1e-06 30 1 >>>>>> -- e/l = 6/112 = 5.36% >>>>>> A -----+------+----> [] >>>>>> B 332 -------> 42 [] >>>>>> GetAlignmentTrace()-- Overlap ACCEPTED! accept=1000.000000 >>>>>> lScore=0.009009 (110 vs 111) aScore=0.140000 (332 vs 318) >>>>>> bScore=0.130000 (-42 vs -29). (CONTIGF) >>>>>> GetAlignmentTrace()-- Overlap found between 1076523 (U) and 57537697 >>>>>> (R) expected hangs: a=318 b=-29 erate=0.060000 aligner=Local_Overlap >>>>>> GetAlignmentTrace()-- Overlap ACCEPTED! accept=1000.000000 >>>>>> lScore=0.009009 (110 vs 111) aScore=0.140000 (332 vs 318) >>>>>> bScore=0.130000 (-42 vs -29). 
(CONTIGF) >>>>>> Local_Overlap_AS_forCNS found overlap between 1076523 (U) and 57537697 >>>>>> (R) ahang: 332, bhang: -42 (expected hang was 318) >>>>>> Alignment params: 300 336 200 200 0 0.12 1e-06 30 1 >>>>>> -- e/l = 6/110 = 5.45% >>>>>> A -----+------+----> [] >>>>>> B 332 -------> 42 [] >>>>>> >>>>>> This is the error message: >>>>>> at /usit/titan/u1/olekto/src/wgs-7.0/Linux-amd64/bin/runCA line 1237 >>>>>> main::caFailure('1 consensusAfterScaffolder jobs failed; remove >>>>>> 8-consensus/co...', undef) called at >>>>>> /usit/titan/u1/olekto/src/wgs-7.0/Linux-amd64/bin/runCA line 5142 >>>>>> main::postScaffolderConsensus() called at >>>>>> /usit/titan/u1/olekto/src/wgs-7.0/Linux-amd64/bin/runCA line 5885 >>>>>> >>>>>> ---------------------------------------- >>>>>> Failure message: >>>>>> >>>>>> 1 consensusAfterScaffolder jobs failed; remove >>>>>> 8-consensus/consensus.sh to try again >>>>>> >>>>>> I've tried removing consensus.sh and running again, but get the same >>>>>> error. 
>>>>>> >>>>>> This is the spec file: >>>>>> utgErrorRate=0.03 >>>>>> utgErrorLimit=2.5 >>>>>> ovlErrorRate=0.06 >>>>>> cnsErrorRate=0.06 >>>>>> cgwErrorRate=0.10 >>>>>> merSize = 22 >>>>>> overlapper=ovl >>>>>> unitigger = bogart >>>>>> merylMemory = 128000 >>>>>> merylThreads = 16 >>>>>> merOverlapperThreads = 2 >>>>>> merOverlapperExtendConcurrency = 8 >>>>>> merOverlapperSeedConcurrency = 8 >>>>>> ovlThreads = 2 >>>>>> mbtThreads = 2 >>>>>> mbtConcurrency = 8 >>>>>> ovlConcurrency = 8 >>>>>> ovlCorrConcurrency = 16 >>>>>> ovlRefBlockSize = 32000000 >>>>>> ovlHashBits = 24 >>>>>> ovlHashBlockLength = 800000000 >>>>>> ovlStoreMemory = 128000 >>>>>> frgCorrThreads = 2 >>>>>> frgCorrConcurrency = 8 >>>>>> ovlCorrBatchSize = 1000000 >>>>>> ovlCorrConcurrency = 16 >>>>>> cnsConcurrency = 16 >>>>>> doExtendClearRanges = 0 >>>>>> >>>>>> I don't need to have that unitig (1076523 (U)) in my finished >>>>>> assembly, so it's possible to just remove it as long as I get a >>>>>> finished assembly. I've also tried to just create the .success file, >>>>>> but then terminator fails. >>>>>> >>>>>> Does anyone have any ideas of what I might do different? Can I just >>>>>> remove that unitig and proceed? How do I do that? >>>>>> >>>>>> Sincerely, >>>>>> Ole Kristian Tørresen >>>>>> PhD student >>>>>> University of Oslo >>>>>> >>>>>> ------------------------------------------------------------------------- >>>>>> -- >>>>>> --- >>>>>> Better than sec? Nothing is better than sec when it comes to >>>>>> monitoring Big Data applications. Try Boundary one-second >>>>>> resolution app monitoring today. Free. >>>>>> http://p.sf.net/sfu/Boundary-dev2dev >>>>>> _______________________________________________ >>>>>> wgs-assembler-users mailing list >>>>>> wgs...@li... >>>>>> https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users >>>>> >> |
From: Ole K. T. <o.k...@bi...> - 2012-04-12 07:52:39
|
Hi, Brian. Thank you for your help so far, but I seem to be missing something. I did this: tigStore -g *gkpStore -t *tigStore 14 -c 107652 -d layout3 > ctg1076523 tigStore -g *gkpStore -t *tigStore 15 -cp 36 -R ctg1076523 But when I dump the same contig from version 15: tigStore -g *gkpStore -t *tigStore 15 -c 107652 -d layout3 > ctg1076523_v15 it's without consensus sequence: contig 1076523 len 0 cns qlt data.unitig_coverage_stat -9874.792662 data.unitig_microhet_prob 0.000000 data.unitig_status X data.unitig_unique_rept X data.contig_status U data.num_frags 14302 data.num_unitigs 1 The if I dump it from version 16, it's identical to the one from version 14 (that is, with consensus). I've tried loading it several times, but each time I dump it again it's lost consensus. Do you know what I'm doing wrong? Ole On 11 April 2012 20:54, Walenz, Brian <bw...@jc...> wrote: > Hi, Ole- > > Yes, I overlooked a step. In the contig you insert to the latest version, > update the 'data.contig_status' with what the second to last version has. > > FYI, the tigStore should have versions such as: > > seqDB.v014.ctg > seqDB.v014.dat > seqDB.v014.utg > > seqDB.v015.ctg > seqDB.v015.p001.ctg > seqDB.v015.p001.dat > (etc) > seqDB.v015.utg > > seqDB.v016.ctg > seqDB.v016.p001.ctg > seqDB.v016.p001.dat > (etc) > seqDB.v016.utg > > (the v numbers will of course be different in your assembly) > > v015 contains the output of scaffolder, which is the input to consensus. > Contigs here have no consensus sequence, but otherwise all the data is > present. It is largely just rewriting the data from v014 into partitions > (p###), so each consensus job can load a single file instead of randomly > accessing a large file. The status flag on each unitig/contig is also set. > This flag tells if the unitig/contig was placed in a scaffold, is a > surrogate, degenerate, etc. > > v016 is the output of consensus, the input to terminator. All terminator > does is to repackage this into ASCII files. 
> > To summarize: grab the contig from v014 (the last with a consensus > sequence), the status flag from v015, change the status flag in the contig > you grabbed, and then insert the contig into v016. > > by doing this, you'll lose VAR records for this contig, but otherwise the > consensus sequence is the same (or largely the same; variant detection can > change it a bit). > > b > > > On 4/11/12 6:23 AM, "Ole Kristian Tørresen" <o.k...@bi...> wrote: > >> Hi Brian, >> ctgcns completed now, but I got an error with asmOutputFasta. From >> 9-terminator/asmOutputFasta.err: >> ERROR: Illegal unitigpos type type value 'X' (CCO) at line 1676575956 >> >> Is this connected with the procedure I did with inserting the contig >> from an older tigStore? >> >> Thank you for your help so far. >> >> Ole >> >> On 11 April 2012 08:13, Ole Kristian Tørresen <o.k...@bi...> wrote: >>> Hi Brian. >>> >>> I've done this, and rerunning ctgcns on that last partition. I'll send >>> the layout and log in a separate email. >>> >>> Ole >>> >>> On 10 April 2012 21:37, Walenz, Brian <bw...@jc...> wrote: >>>> Hi Ole- >>>> >>>> I don't see anything that looks like an error in the log, so I'll have to >>>> assume it crashed. You report it runs for 20 hours, which is odd for contig >>>> consensus, unless that contig is very very deep. If so, the ctgcns process >>>> will also be large. Do you know how big the process was? >>>> >>>> Can you make the full log available? >>>> >>>> It is possible to force the contig to have a consensus sequence. If the job >>>> did crash, the other contigs will still need to have consensus generated. >>>> >>>> The process is the same as editing a unitig in the tigStore: dump the contig >>>> in question, edit the file to have a consensus sequence, then load that >>>> contig back into the tigStore. A consensus sequence for this contig can be >>>> found in one of the earlier tigStore versions; the version just before this >>>> one will probably have it. 
That makes our process even easier: dump the >>>> version with a consensus sequence, and load it back into the latest version. >>>> >>>> A sketch of the steps: >>>> >>>> 1) Dump the previous version of the contig. check that 'file' does contain >>>> a consensus sequence. >>>> >>>> tigStore -g *gkpStore -t *tigStore <vers-1> -c <ctgID> -d layout > file >>>> >>>> 2) Load that pervious version into the tigStore as the latest version >>>> >>>> tigStore -g *gkpStore -t *tigStore <vers> <part> -c <ctgID> -R file >>>> >>>> Notice that this tigStore command specifies both a version and a partition >>>> for the tigStore. >>>> >>>> 3) Rerun consensus.sh on that partition. It will not attempt to compute the >>>> consensus for that contig. >>>> >>>> I'd be interested in seeing the contig you dump, if only to verify that it >>>> is deep. >>>> >>>> b >>>> >>>> >>>> >>>> On 4/10/12 4:05 AM, "Ole Kristian Tørresen" <o.k...@bi...> wrote: >>>> >>>>> Hi, >>>>> I'm having some problems while doing some low coverage sequencing >>>>> assembly testing. I've tried to assemble about 10x coverage of 150 nt >>>>> paired Illumina reads of 500 bp fragment size. These are from the >>>>> parrot used in the Assemblathon 2 >>>>> (http://assemblathon.org/pages/download-data). Everything seems to run >>>>> fine, until contig consensus, where 1 partition just don't succeed. It >>>>> seems to run for quite some time (20 hours or something) before >>>>> failing. These are the last 20 lines from the output of the ctgcns >>>>> partition that fails: >>>>> Alignment params: 297 333 200 200 0 0.12 1e-06 30 1 >>>>> -- e/l = 7/112 = 6.25% >>>>> A -----+------+----> [] >>>>> B 332 -------> 40 [] >>>>> GetAlignmentTrace()-- Overlap ACCEPTED! accept=1000.000000 >>>>> lScore=0.026087 (112 vs 115) aScore=0.160000 (332 vs 316) >>>>> bScore=0.150000 (-42 vs -27). 
(CONTIGF) >>>>> GetAlignmentTrace()-- Overlap found between 1076523 (U) and 25763657 >>>>> (R) expected hangs: a=316 b=-27 erate=0.060000 aligner=Local_Overlap >>>>> GetAlignmentTrace()-- Overlap ACCEPTED! accept=1000.000000 >>>>> lScore=0.026087 (112 vs 115) aScore=0.160000 (332 vs 316) >>>>> bScore=0.150000 (-42 vs -27). (CONTIGF) >>>>> Local_Overlap_AS_forCNS found overlap between 1076523 (U) and 25763657 >>>>> (R) ahang: 332, bhang: -42 (expected hang was 316) >>>>> Alignment params: 298 334 200 200 0 0.12 1e-06 30 1 >>>>> -- e/l = 6/112 = 5.36% >>>>> A -----+------+----> [] >>>>> B 332 -------> 42 [] >>>>> GetAlignmentTrace()-- Overlap ACCEPTED! accept=1000.000000 >>>>> lScore=0.009009 (110 vs 111) aScore=0.140000 (332 vs 318) >>>>> bScore=0.130000 (-42 vs -29). (CONTIGF) >>>>> GetAlignmentTrace()-- Overlap found between 1076523 (U) and 57537697 >>>>> (R) expected hangs: a=318 b=-29 erate=0.060000 aligner=Local_Overlap >>>>> GetAlignmentTrace()-- Overlap ACCEPTED! accept=1000.000000 >>>>> lScore=0.009009 (110 vs 111) aScore=0.140000 (332 vs 318) >>>>> bScore=0.130000 (-42 vs -29). 
(CONTIGF) >>>>> Local_Overlap_AS_forCNS found overlap between 1076523 (U) and 57537697 >>>>> (R) ahang: 332, bhang: -42 (expected hang was 318) >>>>> Alignment params: 300 336 200 200 0 0.12 1e-06 30 1 >>>>> -- e/l = 6/110 = 5.45% >>>>> A -----+------+----> [] >>>>> B 332 -------> 42 [] >>>>> >>>>> This is the error message: >>>>> at /usit/titan/u1/olekto/src/wgs-7.0/Linux-amd64/bin/runCA line 1237 >>>>> main::caFailure('1 consensusAfterScaffolder jobs failed; remove >>>>> 8-consensus/co...', undef) called at >>>>> /usit/titan/u1/olekto/src/wgs-7.0/Linux-amd64/bin/runCA line 5142 >>>>> main::postScaffolderConsensus() called at >>>>> /usit/titan/u1/olekto/src/wgs-7.0/Linux-amd64/bin/runCA line 5885 >>>>> >>>>> ---------------------------------------- >>>>> Failure message: >>>>> >>>>> 1 consensusAfterScaffolder jobs failed; remove >>>>> 8-consensus/consensus.sh to try again >>>>> >>>>> I've tried removing consensus.sh and running again, but get the same error. >>>>> >>>>> This is the spec file: >>>>> utgErrorRate=0.03 >>>>> utgErrorLimit=2.5 >>>>> ovlErrorRate=0.06 >>>>> cnsErrorRate=0.06 >>>>> cgwErrorRate=0.10 >>>>> merSize = 22 >>>>> overlapper=ovl >>>>> unitigger = bogart >>>>> merylMemory = 128000 >>>>> merylThreads = 16 >>>>> merOverlapperThreads = 2 >>>>> merOverlapperExtendConcurrency = 8 >>>>> merOverlapperSeedConcurrency = 8 >>>>> ovlThreads = 2 >>>>> mbtThreads = 2 >>>>> mbtConcurrency = 8 >>>>> ovlConcurrency = 8 >>>>> ovlCorrConcurrency = 16 >>>>> ovlRefBlockSize = 32000000 >>>>> ovlHashBits = 24 >>>>> ovlHashBlockLength = 800000000 >>>>> ovlStoreMemory = 128000 >>>>> frgCorrThreads = 2 >>>>> frgCorrConcurrency = 8 >>>>> ovlCorrBatchSize = 1000000 >>>>> ovlCorrConcurrency = 16 >>>>> cnsConcurrency = 16 >>>>> doExtendClearRanges = 0 >>>>> >>>>> I don't need to have that unitig (1076523 (U)) in my finished >>>>> assembly, so it's possible to just remove it as long as I get a >>>>> finished assembly. 
I've also tried to just create the .success file, >>>>> but then terminator fails. >>>>> >>>>> Does anyone have any ideas of what I might do different? Can I just >>>>> remove that unitig and proceed? How do I do that? >>>>> >>>>> Sincerely, >>>>> Ole Kristian Tørresen >>>>> PhD student >>>>> University of Oslo >>>>> >>>>> --------------------------------------------------------------------------- >>>>> --- >>>>> Better than sec? Nothing is better than sec when it comes to >>>>> monitoring Big Data applications. Try Boundary one-second >>>>> resolution app monitoring today. Free. >>>>> http://p.sf.net/sfu/Boundary-dev2dev >>>>> _______________________________________________ >>>>> wgs-assembler-users mailing list >>>>> wgs...@li... >>>>> https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users >>>> > |
From: Christoph H. <chr...@gm...> - 2012-04-11 21:02:56
|
Dear CA developers and users, I am trying to use Celera Assembler 7.0 to assemble a medium-sized genome (about 100 Mb) using a combination of 454 and Illumina reads. I chose a bad combination of the ovlHashBits, ovlHashBlockLength and ovlThreads options, so my last run stopped on the cluster I am using after exceeding the memory limit in the overlaptrim step. I think I now know what the problem was, so my question is whether it is possible to resume runCA from any given stage. In my particular case I would like to resume from the 0-overlaptrim-overlap stage with altered ovlHashBits, ovlHashBlockLength and ovlThreads options. I want to avoid redoing the mercounts and initialtrim steps, because they seem to have worked fine. I read in the manual about using the /do*/ options to get a kind of /startBefore/ effect, but I can't seem to find any more details in the manual, so can you maybe help me out or point me to the required information on the web page? Thanks! Your help is highly appreciated! much obliged, Christoph Hahn PhD student University of Oslo |
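[Editor's sketch] runCA generally decides what to redo by checking which stage outputs already exist in the work directory, so the usual recipe is to delete only the failed stage's directory, correct the spec, and rerun the same command line. The sketch below uses assumed directory and spec names and example option values; it mimics a runCA work directory rather than operating on a real one.

```shell
# directory names mimic a runCA -d work directory; values are examples only
ASM=demo-assembly
mkdir -p "$ASM/0-mercounts" "$ASM/0-overlaptrim-overlap"

# remove only the stage to redo; finished stages (mer counting, initial
# trimming) keep their outputs and are skipped when runCA restarts
rm -rf "$ASM/0-overlaptrim-overlap"

# append corrected overlapper memory settings to the spec
cat >> "$ASM/asm.spec" <<'EOF'
ovlHashBits = 22
ovlHashBlockLength = 200000000
ovlThreads = 2
EOF

# then rerun with the original command line, e.g.:
# runCA -d demo-assembly -p asm -s demo-assembly/asm.spec
ls "$ASM"
```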