From: Ole K. T. <o.k...@bi...> - 2012-05-14 18:46:41

On 14 May 2012 20:32, Mundy, Michael <Mun...@ma...> wrote:
> I'm using WGS 7.0 and I have two synchronized fastq files with paired-end
> reads. Based on the documentation at
> http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=FastqToCA,
> I tried this command:
>
> wgs-7.0/Linux-amd64/bin/fastqToCA -libraryname SRR067601.000 -mates SRR067601.000_1_pair.fq,SRR067601.000_2_pair.fq
>
> But it returns this error:
>
> ERROR: Mated reads (-mates) must have am insert size (-insertsize).
>
> The documentation page says that the -insertsize option is optional, so I
> thought that was the flag to distinguish between paired-end reads and
> mate-pair reads. How do I generate a FRG file for paired-end reads?

I guess the documentation is not up to date; it is in fact not optional to supply the -insertsize option. Just add -insertsize 300 30 if your reads are paired end from a 300 bp DNA fragment, or something like -insertsize 5000 500 -outtie if they are mate pairs from a 5k library.

Ole

> Mike Mundy
>
> ------------------------------------------------------------------------------
> Live Security Virtual Conference
> Exclusive live event will cover all the ways today's security and
> threat landscape has changed and how IT managers can respond. Discussions
> will include endpoint security, mobile security and the latest in malware
> threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
> _______________________________________________
> wgs-assembler-users mailing list
> wgs...@li...
> https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users
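Putting Ole's advice together, a corrected invocation might look like the following. This is a sketch: the 300/30 insert size is the example value from this thread, not a measured one, so substitute your library's actual mean and standard deviation, and note that the output redirection to a .frg file is how fastqToCA is typically used rather than something stated in this thread.

```shell
wgs-7.0/Linux-amd64/bin/fastqToCA \
  -libraryname SRR067601.000 \
  -insertsize 300 30 \
  -mates SRR067601.000_1_pair.fq,SRR067601.000_2_pair.fq \
  > SRR067601.000.frg

# For a 5 kb mate-pair library, the pairs point outward, so add -outtie:
#   -insertsize 5000 500 -outtie
```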
From: Mundy, M. <Mun...@ma...> - 2012-05-14 18:32:27

I'm using WGS 7.0 and I have two synchronized fastq files with paired-end reads. Based on the documentation at http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=FastqToCA, I tried this command:

wgs-7.0/Linux-amd64/bin/fastqToCA -libraryname SRR067601.000 -mates SRR067601.000_1_pair.fq,SRR067601.000_2_pair.fq

But it returns this error:

ERROR: Mated reads (-mates) must have am insert size (-insertsize).

The documentation page says that the -insertsize option is optional, so I thought that was the flag to distinguish between paired-end reads and mate-pair reads. How do I generate a FRG file for paired-end reads?

Mike Mundy
From: Walenz, B. <bw...@jc...> - 2012-05-11 19:00:37

On 5/10/12 2:55 PM, "Ole Kristian Tørresen" <o.k...@bi...> wrote:
> Hi Brian.
>
> Thank you for this, good to know. Our PacBio fastq files were over
> multiple lines (SMRT-Portal 1.3... Thank you a lot, PacBio!), and the
> correction pipeline ran for 17 days taking up 48 CPUs, and I guess we
> can just kill it now.

Multiple lines aren't nearly as bad as Illumina's new multi-word read names... ;-)

The paper on the correction pipeline will be appearing in Nature Biotechnology real soon. I'll send a link once I get one. I'm pretty sure nobody has tried correcting PacBio with 454 reads.

> On 10 May 2012 19:50, Walenz, Brian <bw...@jc...> wrote:
>> [...]
>> We've been recently disabling OBT (and fragmentCorrection) in runCA, and
>> doing all trimming/correction outside the assembler. In your case, you can
>> run the assembler up through OBT on all your 454 reads, then dump gatekeeper
>> to build a trimmed fragment set. If you're using CVS tip, dumping as fastq
>> will work too. With the PacBio reads, this is mandatory, since the pipeline
>> will split some of the PacBio reads into multiple pieces.
>
> I saw some submissions to the CVS about this, but couldn't figure out
> exactly what they meant. This clears that up. I recently started an
> assembly with 454 and Illumina reads (the Illumina reads corrected with
> Quake), and correct-frags has been running for several days now.
>
> Should I run OBT on all my 454 reads, dump the trimmed reads, and use
> them in a new assembly with the error-corrected Illumina reads? Will the
> default with the CVS tip then be to not run correct-frags etc. on those
> reads? What will be the effect of using these trimmed 454 reads for
> PacBio error correction?

If you have trimmed/corrected reads, then both OBT and the correction should be disabled:

  doOBT=0
  doFragmentCorrection=0

The correction process hasn't changed since the Sanger-only days, and it doesn't seem to scale easily to hundreds of millions of reads. The algorithm: in the first pass (fragment correction) a multiple sequence alignment is generated for each read, formed from all overlaps to the read; errors are detected and noted. In the second pass (overlap correction) these corrections are applied to change the error rate of overlaps. The bases in the read never change.

My opinion is that correction of the bases in the reads is now good enough that the reads should be corrected before assembly. The corrections can be specific to the technology (homopolymers for 454, no indels for Illumina), something that isn't done in CA and would be tough to do there.

>> The obt overlaps and ovl overlaps used for assembly aren't compatible. The
>> obt overlaps are more like blast matches (align a-b in read 1 to c-d in
>> read 2) while the ovl overlaps are ... overlaps; see
>> http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=Overlaps .
>> Since trimming will change the length of the read, it's impossible to
>> translate the overlaps on untrimmed reads to overlaps on trimmed reads.
>
> I hadn't seen that page. It's a useful reference (as are other
> "hidden" pages at that wiki.)

Thought we had a (one) link to it somewhere. *sigh*

b
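The two options Brian names can be dropped into the runCA spec file. A minimal sketch (the spec filename is arbitrary):

```shell
# Create a spec fragment that disables overlap-based trimming (OBT) and the
# fragment/overlap error correction, for reads trimmed and corrected elsewhere.
cat > trimmed-reads.spec <<'EOF'
doOBT=0
doFragmentCorrection=0
EOF
```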
From: Ole K. T. <o.k...@bi...> - 2012-05-10 18:55:20

Hi Brian.

Thank you for this, good to know. Our PacBio fastq files were over multiple lines (SMRT-Portal 1.3... Thank you a lot, PacBio!), and the correction pipeline ran for 17 days taking up 48 CPUs, and I guess we can just kill it now.

On 10 May 2012 19:50, Walenz, Brian <bw...@jc...> wrote:
> [...]
> We've been recently disabling OBT (and fragmentCorrection) in runCA, and
> doing all trimming/correction outside the assembler. In your case, you can
> run the assembler up through OBT on all your 454 reads, then dump gatekeeper
> to build a trimmed fragment set. If you're using CVS tip, dumping as fastq
> will work too. With the PacBio reads, this is mandatory, since the pipeline
> will split some of the PacBio reads into multiple pieces.

I saw some submissions to the CVS about this, but couldn't figure out exactly what they meant. This clears that up. I recently started an assembly with 454 and Illumina reads (the Illumina reads corrected with Quake), and correct-frags has been running for several days now.

Should I run OBT on all my 454 reads, dump the trimmed reads, and use them in a new assembly with the error-corrected Illumina reads? Will the default with the CVS tip then be to not run correct-frags etc. on those reads? What will be the effect of using these trimmed 454 reads for PacBio error correction?

> The obt overlaps and ovl overlaps used for assembly aren't compatible. The
> obt overlaps are more like blast matches (align a-b in read 1 to c-d in
> read 2) while the ovl overlaps are ... overlaps; see
> http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=Overlaps .
> Since trimming will change the length of the read, it's impossible to
> translate the overlaps on untrimmed reads to overlaps on trimmed reads.

I hadn't seen that page. It's a useful reference (as are other "hidden" pages at that wiki.)

Ole

> [...]
From: Walenz, B. <bw...@jc...> - 2012-05-10 17:50:41

Hi, Ole-

ovlHashLibrary=2 does mean to load only reads from the second library into the hash table; in this case, that's the PacBio reads. The 'ref' library is what fragments we search against the hash table. ovlRefLibrary=1-1 translates to 'starting at library 1 and ending at library 1'. Overlaps will be computed between libraries 1 and 2, but not within the same library.

I should point out that this isn't implemented perfectly. The overlap jobs for computing overlaps within library 1 are still launched, and the hash tables are still built, but no overlaps are output. The 'overlap_partition' command is responsible for setting up the hash and reference ranges for each overlap job, and it isn't aware of the ovlHashLibrary/ovlRefLibrary options.

We've been recently disabling OBT (and fragmentCorrection) in runCA, and doing all trimming/correction outside the assembler. In your case, you can run the assembler up through OBT on all your 454 reads, then dump gatekeeper to build a trimmed fragment set. If you're using CVS tip, dumping as fastq will work too. With the PacBio reads, this is mandatory, since the pipeline will split some of the PacBio reads into multiple pieces.

The obt overlaps and ovl overlaps used for assembly aren't compatible. The obt overlaps are more like blast matches (align a-b in read 1 to c-d in read 2) while the ovl overlaps are ... overlaps; see http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=Overlaps. Since trimming will change the length of the read, it's impossible to translate the overlaps on untrimmed reads to overlaps on trimmed reads.

b

On 5/10/12 4:53 AM, "Ole Kristian Tørresen" <o.k...@bi...> wrote:
> [...]
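Brian's explanation maps onto the correction pipeline's runCA invocation quoted in Ole's question. An annotated sketch, assuming (as in Ole's setup) that library 1 holds the 454 reads and library 2 the PacBio reads:

```shell
# ovlHashLibrary=2   : load only library-2 (PacBio) reads into the hash table
# ovlRefLibrary=1-1  : stream libraries 1 through 1 (the 454 reads) against it,
#                      so overlaps are computed between libraries 1 and 2
#                      but not within library 1
# obt*Library=1-1    : overlap-based trimming runs on the 454 reads only
runCA -s pacbio.spec -p asm -d temppacbio \
  ovlHashLibrary=2 ovlRefLibrary=1-1 \
  obtHashLibrary=1-1 obtRefLibrary=1-1 \
  stopAfter=overlapper
```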
From: Ole K. T. <o.k...@bi...> - 2012-05-10 08:53:47

Hi,

we have started doing some sequencing on PacBio, and are correcting the reads with the PacBioToCA pipeline. The genome is about 800 Mb, and we're trying to correct the PacBio reads from two SMRTcells with about 20x in 454 reads. This translates to 130,389 PacBio reads with 126 Mb of sequence, and 47M 454 reads with 17.6 Gb of sequence.

We see that 0-overlaptrim-overlap uses quite a bit of time, and I fear that 1-overlapper will use a long time too. Is it possible to compute the overlaps between the 454 reads ahead of time, and use the overlaps from that store to only compute the overlaps between 454 reads and PacBio reads, since I guess most of the time is spent computing the overlaps between 454 reads? This could be useful for assembly in general too; sometimes we only input some data to have a faster assembly, while later on we input more.

When I look at the command that's used to run CA in the error correction step: runCA -s pacbio.spec -p asm -d temppacbio ovlHashLibrary=2 ovlRefLibrary=1-1 obtHashLibrary=1-1 obtRefLibrary=1-1 sge=" -sync y" sgePropagateHold=corAsm stopAfter=overlapper, does it actually do what I ask for? It only loads hash fragments from library 2, but does it load all libraries in the other *Library options (1-1 = 0)? Could anyone explain to me what that really means?

Sincerely,
Ole
From: Arjun P. <ap...@ma...> - 2012-04-25 18:31:50

Hi,

Thanks, Brian, for the detailed explanation. The gkpStore.fastqUIDmap file is easy enough to parse. From what you said it seems like the generated UID in the output may be something you guys fix at some point, right?

I wrote a little perl script to convert the UIDs to read names in the posmap files. I didn't do it for the .asm file because posmaps are all I need for now. I posted it at http://arjunprasad.net/scripts/fixReadnamesInPosmap in case it's helpful for someone else. It took about 1.5 GB of RAM for 7 million reads with fairly long names.

It just occurred to me that fixReadnamesInPosmap doesn't handle the case where you have an assembly with some FASTQ files and some .frg files for input. That's easy to fix if it's useful to anyone.

Arjun

On Tue, 24 Apr 2012, Walenz, Brian wrote:
> Hi-
>
> I was fearing the day someone would ask about this. We had a choice of
> either doing lots of engineering to optimize directly saving names of fastq
> reads, or an inelegant - and only partially completed - solution of
> stripping the names when the reads are loaded into the gatekeeper store, and
> adding them back as a post process.
>
> The names and mapping are saved in the *.gkpStore.fastqUIDmap. The format
> is:
>
> UID IID Name (for unpaired reads)
> UID IID Name UID IID Name (for paired reads)
>
> IIDs are used internally by the assembler. Most logs refer to reads (and
> unitigs, contigs and scaffolds) using these. There is an implicit 'type'
> with each IID: "1" is a valid IID for four objects - a fragment, a unitig,
> a contig and a scaffold.
>
> UIDs appear in the outputs - posmap and asm. These are guaranteed to be
> unique within the assembly. For reads loaded as .frg, the UID is the read
> name.
>
> The iidtouid file gives a mapping from IID to UID, for every object in the
> assembly, not just reads.
>
> Sorry for the pain. We're a bit short on engineering time at the moment,
> and as this wasn't an issue critical to getting a good assembly, we only
> made it 'not break' for an assembly with > 1 billion reads.
>
> b
>
> On 4/24/12 1:52 PM, "Arjun Prasad" <ap...@ma...> wrote:
>> Hi,
>>
>> I need to get a read mapping with the actual read names for an assembly
>> that was created from FASTQ input sequences. I noticed the iidtouid
>> file in the 9-terminator directory, but it has numbers for fragments
>> rather than read names.
>>
>> Looking at the reads from the 9-terminator/.frg file I matched up some by
>> sequence, and it looks like the FRG numbers are alternating reads from
>> each of the paired ends, e.g.:
>>
>> FRG 1 110000000001 - first entry from read 1
>> No FRG 2
>> FRG 3 110000000003 - 2nd entry from read 1
>> FRG 4 120000000003 - 2nd entry from read 2
>> FRG 5 110000000005 - 3rd entry from read 1
>> FRG 6 120000000005 - 3rd entry from read 2
>> FRG 100000 120000099999 - entry 50,000 from read 2
>>
>> I'm guessing that I can figure out the read name to IID translation by
>> counting into the fastq files by FRG # / 2.
>>
>> Has anyone else done this? Did I correctly interpret what the FRG numbers
>> mean? Are there any gotchas at input file boundaries?
>>
>> Thanks,
>> Arjun
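Since the posted script URL may no longer resolve, here is a minimal awk sketch of the same UID-to-name substitution. It assumes the 3- or 6-column fastqUIDmap format Brian describes above, and that column 1 of the posmap file is the read UID; the file names asm.gkpStore.fastqUIDmap and asm.posmap.frgscf are placeholders for your own assembly's files.

```shell
# First pass (NR == FNR): build a UID -> read-name table from the fastqUIDmap.
# Second pass: rewrite column 1 of the posmap using that table.
awk 'NR == FNR {
       name[$1] = $3               # unpaired read, or first read of a pair
       if (NF == 6) name[$4] = $6  # second read of a paired entry
       next
     }
     {
       if ($1 in name) $1 = name[$1]
       print
     }' asm.gkpStore.fastqUIDmap asm.posmap.frgscf
```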
From: Christoph H. <chr...@gm...> - 2012-04-25 11:41:08

Hi Heiner,

Thanks for your effort and helpful comments! The overlap job did actually finish now, but unfortunately CABOG crashed right afterwards because it ran out of disk space. Very unfortunate, but I have to ask for more disk space before I can resume the assembly manually. I was not expecting to complete the whole assembly in ten days, just the overlap-trim stage for now.

Concerning reducing the coverage: I thought about that, but I have also tested several de Bruijn graph assemblers and have found that I get the best results when using all the Illumina data (instead of only a subset of it). The Illumina data I am using is already error-corrected. I decided to use the data like that and to rely on the CABOG trimming algorithm. With stringent manual trimming prior to CABOG I could reduce the number of Illumina reads to some 160 million (paired-end reads). Also, I suppose leaving the 14 million single-end Illumina reads out will not substantially affect the result. That would result in some 160 million Illumina reads (76 bp) + 1.1 million 454 reads (500 bp). Assuming a 100 Mb genome, that is still a theoretical 130x coverage; assuming some 20-30% host and bacterial contamination, we reach about 100x coverage.

The question now is what would be more effective: resume the assembly with the data as it is, or start from scratch with the trimmed data? An effective solution in terms of runtime is unfortunately very important to me, as I only have a limited amount of CPU hours available on the cluster. I can ask for more, but only after the initial quota is exceeded, and then it involves annoying bureaucracy and waiting time. Just to clarify why CPU hours are such an issue for me - sorry to bother you with that.

I put quite some time and effort into the configuration of the overlap jobs to reach a hash table load of some 70%, as suggested on the manual page. This was not so easy because the load varied between libraries, so I decided to focus on the paired-end Illumina library, as this is the vast majority of the data. I had configured for 8 threads and the pipeline was constantly using all 8 threads. My Illumina data is in zipped format.

The alternative approach you mention below sounds very interesting, especially as I already have the best possible (I believe so, at least :-)) Solexa-only assembly available. Can you give me some more detailed information on that? Where can I find this Celera version? The snag is that I would need to convince the cluster administration to install the other Celera version. Almost forgot: I am using Celera Assembler 7.0 right now.

Thanks again for your suggestions, and apologies for a long message!

cheers,
Christoph

On 25.04.2012 11:28, kuhl wrote:
> [...]
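The coverage arithmetic in the message above can be checked directly. A sketch using the thread's own numbers (127x rounds to the "about 130x" quoted; the 100 Mb genome size is the assumption stated in the message):

```shell
# Theoretical coverage = total bases sequenced / genome size
awk 'BEGIN {
  illumina = 160e6 * 76    # 160 million Illumina reads at 76 bp
  r454     = 1.1e6 * 500   # 1.1 million 454 reads at ~500 bp
  genome   = 100e6         # assumed genome size of 100 Mb
  printf "%.0fx\n", (illumina + r454) / genome
}'
# → 127x
```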
From: kuhl <ku...@mo...> - 2012-04-25 10:04:08
|
Dear Christoph, I have successfully done an assembly of about 350 mio reads for a 1.2 Gb genome using Celera Assembler 6.1 (which version do you use?) from 454 and Solexa data. Anyway it took about 1.5 month to complete on a 48 core server and used plenty of disk space (2 - 3 TB) and there were lots of manual work with failed contigs that had to be corrected manually. So 10 days might be not enough. (The data will not be lost after the ten days, as you can resume the overlap.sh jobs manually and everything done so far is saved to disk) I also see from your mail that you are using a very high coverage of your genome. Celera may not take profit from that. Maybe you could reduce your dataset to a 50-70X coverage. That would reduce the computing time dramatically as computing time increases quadratically with (readnumber/coverage). It also depends how you did configure the overlapper. Depending on the configuration calculating the overlap jobs might take longer for each job or be more or less constant in computing time for each job. Another possibility I tried for a different genome (2.5Gb 10^9 reads -> I did not want to wait for three month...) is to use an debrujin graph assembler to assemble the Illumina data (I would recommend SOAPdenovo or CLC, the later one can also make use of the 454 data), split the resulting scaffolds to contigs smaller than 32000bp and feed them together with 454 data and a little (i.e. 5X) coverage of the illumina paired ends into the long read version of Celera assembler supplied with the pacificbio correction pipeline. These steps took about 1 week and delivered a much better assembly compared to using de bruijn graph assemblers alone. Question to other users/developers, did you also experience that if Illumina reads are stored in the packed format, the overlap jobs do not reach the maximum speed they should? I mean for example an overlap job configured to 12 threads is running only on 8 threads on average. 
Has anyone encountered this problem? I wish you good luck, Heiner On Tue, 24 Apr 2012 19:59:07 +0200, Christoph Hahn <chr...@gm...> wrote: > Thanks for that Ariel! Leaves me with little hope though.. > Nevertheless I understand that these kind of jobs did finish in your > experience, right? > > From my tests and the number of overlap.sh jobs created in the inital > phase I was assuming to be on the safe side with a wall clock limit of > 10 days to finish this stage. I can maybe ask the cluster administration > to prolong the wall clock limit, but I`d need some estimate of by how > long.. > I am using some 1.1 Million 454 reads (~500 bp in length) plus some 200 > Million paired end reads plus some 14 Million single end illumina reads > (76 bp read length, respecitively). The genome is estimated to be only > about 70-100 Mb in size, but we have reason to expect a substantial > amount of contamination from the host (as we are dealing with a > parasitic organism), and also a fair bit of polymorphisms as the > libraries were prepared from a pooled sample. > > Can anyone suggest a reasonable time frame for reaching a checkpoint > from which I can then resume the assembly? > > Thanks in advance!! > > Christoph > > > Am 24.04.2012 18:47, schrieb Schwartz, Ariel: >> I have experienced the same issue with our hybrid assemblies. >> Currently I am waiting for an overlap job that has been running for >> almost two weeks. >> >> I wonder if there are some recommended settings that could be used to >> alleviate this problem. >> >> Thanks, >> >> Ariel >> >> Ariel Schwartz, Ph.D. >> Senior Scientist, Bioinformatics >> Synthetic Genomics, Inc. >> >> On 4/24/12 4:44 AM, "Christoph Hahn" <chr...@gm... >> <mailto:chr...@gm...>> wrote: >> >> Dear CABOG developers and users, >> >> I am trying to do a hybrid assembly using a combination of 454 and >> single- as well as paired-end illumina data. 
>> >> After initial trouble with optimization in the 0-overlaptrim-overlap >> stage of my assembly I got it to run succesfully and during the >> previous >> 7+ days the pipeline succesfully completetd some 2260 overlap.sh >> jobs. >> Now I am encoutering something strange: The last pending >> overlap.sh job >> (2148 of 2261) is running now already for over 36 hours. The >> 002148.ovb.WORKING.gz file created by this job is slowly but steadily >> growing. It presently has some 631 M. Is this normal? Has anyone >> had a >> similar experience before? Maybe it will sort out it self eventually >> anyway, I am just a little concerned that CABOG will not finish >> the job >> until it hits the 10 days wall clock limit that is set on the cluster >> for the job, which would result in thousands of CPU hours going >> down the >> drain.. >> >> Please share your wisdom with me! >> >> much obliged, >> Christoph Hahn >> PhD fellow >> University of Oslo >> Norway >> >> ------------------------------------------------------------------------------ >> Live Security Virtual Conference >> Exclusive live event will cover all the ways today's security and >> threat landscape has changed and how IT managers can respond. >> Discussions >> will include endpoint security, mobile security and the latest in >> malware >> threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ >> _______________________________________________ >> wgs-assembler-users mailing list >> wgs...@li... >> <mailto:wgs...@li...> >> https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users >> -- --------------------------------------------------------------- Dr. Heiner Kuhl MPI Molecular Genetics Tel: + 49 + 30 / 8413 1551 Next Generation Sequencing Ihnestrasse 73 email: ku...@mo... D-14195 Berlin http://www.molgen.mpg.de --------------------------------------------------------------- |
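Heiner's intermediate step above (splitting de Bruijn scaffolds into pieces below the 32000 bp limit before feeding them back to Celera Assembler) could be sketched roughly as below. The splitting strategy (cutting at runs of Ns first, then hard-cutting anything still over the limit) is an assumption for illustration, not a description of his actual script:

```python
# Sketch: split scaffolds into contigs < 32000 bp for re-input to Celera
# Assembler. Splitting first at N-runs (scaffold gaps), then hard-cutting
# oversized pieces, is an assumed strategy -- adapt to your own data.
import re

MAX_LEN = 32000

def read_fasta(text):
    """Yield (name, sequence) records from a multi-FASTA string."""
    name, chunks = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line.strip())
    if name is not None:
        yield name, "".join(chunks)

def split_scaffold(name, seq, max_len=MAX_LEN):
    """Split at gap runs (>= 10 Ns), then chop anything still too long."""
    pieces = []
    for frag in re.split(r"[Nn]{10,}", seq):
        for i in range(0, len(frag), max_len - 1):
            piece = frag[i:i + max_len - 1]
            if piece:
                pieces.append(piece)
    return [(f"{name}_{k}", p) for k, p in enumerate(pieces, 1)]
```

Each output piece keeps a derived name (`scaffoldname_1`, `scaffoldname_2`, ...) so the origin of a fragment stays traceable through the hybrid assembly.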
From: Walenz, B. <bw...@jc...> - 2012-04-24 20:22:16
|
Hi-

I was fearing the day someone would ask about this. We had a choice of either doing lots of engineering to optimize directly saving the names of fastq reads, or an inelegant - and only partially completed - solution of stripping the names when the reads are loaded into the gatekeeper store and adding them back as a post process.

The names and mapping are saved in the *.gkpStore.fastqUIDmap. The format is:

  UID IID Name                 (for unpaired reads)
  UID IID Name  UID IID Name   (for paired reads)

IIDs are used internally to the assembler. Most logs refer to reads (and unitigs, contigs and scaffolds) using these. There is an implicit 'type' with each IID: "1" is a valid IID for four objects - a fragment, a unitig, a contig and a scaffold.

UIDs appear in the outputs - posmap and asm. These are guaranteed to be unique within the assembly. For reads loaded as .frg, the UID is the read name. The iidtouid file gives a mapping from IID to UID for every object in the assembly, not just reads.

Sorry for the pain. We're a bit short on engineering time at the moment, and as this wasn't an issue critical to getting a good assembly, we only made it 'not break' for an assembly with > 1 billion reads.

b

On 4/24/12 1:52 PM, "Arjun Prasad" <ap...@ma...> wrote:

>
> Hi,
>
> I need to get a read-mapping with the actual read-names for an assembly
> that was created based on FASTQ input sequences. I noticed the iidtouid
> file in the 9-terminator directory, but it has numbers for fragments
> rather than read names.
>
> Looking at the reads from the 9-terminator/.frg file I matched up some by
> sequence, and it looks like the FRG numbers are alternating reads from
> each of the paired ends. 
> > e.g., > > FRG 1 110000000001 - first entry from read 1 > No FRG 2 > FRG 3 110000000003 - 2nd entry from read 1 > FRG 4 120000000003 - 2nd entry from read 2 > FRG 5 110000000005 - 3rd entry from read 1 > FRG 6 120000000005 - 3rd entry from read 2 > FRG 100000 120000099999 - Entry 50,000 from read 2 > > I'm guessing that I can figure out the read name to iid translation by > counting into the fastq files by FRG # / 2 > > Has anyone else done this? Did I correctly interpret what the FRG numbers > mean? Are there any gotchas at input file boundaries? > > Thanks, > Arjun |
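Brian's description of the fastqUIDmap layout (one UID/IID/name triplet per line for unpaired reads, two triplets per line for a mated pair) can be turned into a small lookup table. A rough sketch, assuming the fields really are plain whitespace-separated as described; column details may differ between CA versions:

```python
# Sketch: build IID -> read-name and UID -> read-name lookups from a
# *.gkpStore.fastqUIDmap file. Assumes each line holds one UID/IID/name
# triplet (unpaired read) or two triplets (mated pair), as described in
# Brian's message. UIDs are kept as strings to avoid assuming they are
# always numeric.
def parse_fastq_uid_map(lines):
    iid_to_name = {}
    uid_to_name = {}
    for line in lines:
        fields = line.split()
        if len(fields) not in (3, 6):
            continue  # skip blank or unexpected lines
        for i in range(0, len(fields), 3):
            uid, iid, name = fields[i:i + 3]
            iid_to_name[int(iid)] = name
            uid_to_name[uid] = name
    return iid_to_name, uid_to_name
```

With these dictionaries, IIDs found in assembler logs or the posmap outputs can be mapped back to the original fastq read names, without guessing at FRG-number arithmetic.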
From: Christoph H. <chr...@gm...> - 2012-04-24 17:59:22
|
Thanks for that, Ariel! Leaves me with little hope though.. Nevertheless, I understand that these kinds of jobs did finish in your experience, right?

From my tests and the number of overlap.sh jobs created in the initial phase, I assumed I was on the safe side with a wall-clock limit of 10 days to finish this stage. I can maybe ask the cluster administration to prolong the wall-clock limit, but I'd need some estimate of by how much.. I am using some 1.1 million 454 reads (~500 bp in length), plus some 200 million paired-end reads, plus some 14 million single-end Illumina reads (76 bp read length, respectively). The genome is estimated to be only about 70-100 Mb in size, but we have reason to expect a substantial amount of contamination from the host (as we are dealing with a parasitic organism), and also a fair bit of polymorphism, as the libraries were prepared from a pooled sample.

Can anyone suggest a reasonable time frame for reaching a checkpoint from which I can then resume the assembly?

Thanks in advance!!

Christoph

On 24.04.2012 18:47, Schwartz, Ariel wrote:
> I have experienced the same issue with our hybrid assemblies.
> Currently I am waiting for an overlap job that has been running for
> almost two weeks.
>
> I wonder if there are some recommended settings that could be used to
> alleviate this problem.
>
> Thanks,
>
> Ariel
>
> Ariel Schwartz, Ph.D.
> Senior Scientist, Bioinformatics
> Synthetic Genomics, Inc.
>
> On 4/24/12 4:44 AM, "Christoph Hahn" <chr...@gm...
> <mailto:chr...@gm...>> wrote:
>
> Dear CABOG developers and users,
>
> I am trying to do a hybrid assembly using a combination of 454 and
> single- as well as paired-end illumina data. 
> Now I am encoutering something strange: The last pending > overlap.sh job > (2148 of 2261) is running now already for over 36 hours. The > 002148.ovb.WORKING.gz file created by this job is slowly but steadily > growing. It presently has some 631 M. Is this normal? Has anyone > had a > similar experience before? Maybe it will sort out it self eventually > anyway, I am just a little concerned that CABOG will not finish > the job > until it hits the 10 days wall clock limit that is set on the cluster > for the job, which would result in thousands of CPU hours going > down the > drain.. > > Please share your wisdom with me! > > much obliged, > Christoph Hahn > PhD fellow > University of Oslo > Norway > > ------------------------------------------------------------------------------ > Live Security Virtual Conference > Exclusive live event will cover all the ways today's security and > threat landscape has changed and how IT managers can respond. > Discussions > will include endpoint security, mobile security and the latest in > malware > threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ > _______________________________________________ > wgs-assembler-users mailing list > wgs...@li... > <mailto:wgs...@li...> > https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users > |
From: Arjun P. <ap...@ma...> - 2012-04-24 17:53:45
|
Hi,

I need to get a read-mapping with the actual read-names for an assembly that was created based on FASTQ input sequences. I noticed the iidtouid file in the 9-terminator directory, but it has numbers for fragments rather than read names.

Looking at the reads from the 9-terminator/.frg file I matched up some by sequence, and it looks like the FRG numbers are alternating reads from each of the paired ends, e.g.:

FRG 1      110000000001 - first entry from read file 1
(no FRG 2)
FRG 3      110000000003 - 2nd entry from read file 1
FRG 4      120000000003 - 2nd entry from read file 2
FRG 5      110000000005 - 3rd entry from read file 1
FRG 6      120000000005 - 3rd entry from read file 2
FRG 100000 120000099999 - entry 50,000 from read file 2

I'm guessing that I can figure out the read-name-to-IID translation by counting into the fastq files by FRG # / 2.

Has anyone else done this? Did I correctly interpret what the FRG numbers mean? Are there any gotchas at input file boundaries?

Thanks,
Arjun

--
Genome Technology Branch
National Human Genome Research Institute
National Institutes of Health
5625 Fishers Lane          Phone: 301-594-9199
Room 5N-01L                Fax: 301-435-6170
Rockville, MD 20892-9400   E-Mail: ap...@nh...
|
From: Schwartz, A. <asc...@sy...> - 2012-04-24 17:00:21
|
I have experienced the same issue with our hybrid assemblies. Currently I am waiting for an overlap job that has been running for almost two weeks. I wonder if there are some recommended settings that could be used to alleviate this problem. Thanks, Ariel Ariel Schwartz, Ph.D. Senior Scientist, Bioinformatics Synthetic Genomics, Inc. On 4/24/12 4:44 AM, "Christoph Hahn" <chr...@gm...<mailto:chr...@gm...>> wrote: Dear CABOG developers and users, I am trying to do a hybrid assembly using a combination of 454 and single- as well as paired-end illumina data. After initial trouble with optimization in the 0-overlaptrim-overlap stage of my assembly I got it to run succesfully and during the previous 7+ days the pipeline succesfully completetd some 2260 overlap.sh jobs. Now I am encoutering something strange: The last pending overlap.sh job (2148 of 2261) is running now already for over 36 hours. The 002148.ovb.WORKING.gz file created by this job is slowly but steadily growing. It presently has some 631 M. Is this normal? Has anyone had a similar experience before? Maybe it will sort out it self eventually anyway, I am just a little concerned that CABOG will not finish the job until it hits the 10 days wall clock limit that is set on the cluster for the job, which would result in thousands of CPU hours going down the drain.. Please share your wisdom with me! much obliged, Christoph Hahn PhD fellow University of Oslo Norway ------------------------------------------------------------------------------ Live Security Virtual Conference Exclusive live event will cover all the ways today's security and threat landscape has changed and how IT managers can respond. Discussions will include endpoint security, mobile security and the latest in malware threats. 
http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ _______________________________________________ wgs-assembler-users mailing list wgs...@li...<mailto:wgs...@li...> https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users |
From: Christoph H. <chr...@gm...> - 2012-04-24 11:44:27
|
Dear CABOG developers and users,

I am trying to do a hybrid assembly using a combination of 454 and single- as well as paired-end Illumina data.

After initial trouble with optimization in the 0-overlaptrim-overlap stage of my assembly, I got it to run successfully, and during the previous 7+ days the pipeline successfully completed some 2260 overlap.sh jobs. Now I am encountering something strange: the last pending overlap.sh job (2148 of 2261) has now been running for over 36 hours. The 002148.ovb.WORKING.gz file created by this job is slowly but steadily growing; it presently has some 631 MB. Is this normal? Has anyone had a similar experience before? Maybe it will sort itself out eventually anyway; I am just a little concerned that CABOG will not finish the job before it hits the 10-day wall-clock limit that is set on the cluster for the job, which would result in thousands of CPU hours going down the drain..

Please share your wisdom with me!

much obliged,
Christoph Hahn
PhD fellow
University of Oslo
Norway
|
From: Ole K. T. <o.k...@bi...> - 2012-04-15 20:07:42
|
Hi Paul. You can use Hawkeye for this: http://sourceforge.net/apps/mediawiki/amos/index.php?title=Hawkeye (At least as long as your assembly is not too large, bacteria are fine, but mammal genomes will probably not work.) Ole On 15 April 2012 21:28, Paul Cantalupo <pca...@gm...> wrote: > Hi, > > Does anybody know if there are any graphical viewing programs for showing > the output of CA so that I can manually see the contigs (scaffolds and > degenerates), consensus sequence and reads that were used to construct the > contigs? Thank you, > > Paul > > University of Pittsburgh > Pittsburgh, PA > > > > ------------------------------------------------------------------------------ > For Developers, A Lot Can Happen In A Second. > Boundary is the first to Know...and Tell You. > Monitor Your Applications in Ultra-Fine Resolution. Try it FREE! > http://p.sf.net/sfu/Boundary-d2dvs2 > _______________________________________________ > wgs-assembler-users mailing list > wgs...@li... > https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users > |
From: Paul C. <pca...@gm...> - 2012-04-15 19:37:03
|
Hi Brian,

First, I'd like to thank you and the development team at your institution for making CABOG public. I am finding it a very valuable tool to use.

On Sun, Apr 15, 2012 at 2:56 PM, Walenz, Brian <bw...@jc...> wrote:
> Without getting into precise definitions: Scaffolder (cgw) promotes
> unitigs that looks like unique sequence (based on coverage, length and a
> few other signals) to contigs.

What command line options govern this? Your answer probably depends on what I'm trying to do. I usually do two types of assemblies:

1) metagenomic (therefore, a complex mixed sample containing sequences from many species)

2) targeted smaller assemblies with reads that are similar to one species. Here, I'm trying to make the assembly quicker and hopefully more accurate by only selecting reads that are similar to one species, in hopes of assembling a complete genome.

Thank you again for your help,

Paul

> The left over unitigs are available for gap filling as repeats or
> singletons. The unique contigs are then promoted almost immediately to
> single-contig scaffolds. With no mates, that's all scaffolder will do.
> The scaffolds/contigs are output as is, and the left over unitigs are
> output as degenerate contigs.
>
> bri
> --
> Brian Walenz
> Senior Software Engineer
> J. Craig Venter Institute
>
> ________________________________________
> From: Paul Cantalupo [pca...@gm...]
> Sent: Sunday, April 15, 2012 2:53 PM
> To: wgs-assembler-users
> Subject: [wgs-assembler-users] degenerate contigs
>
> Hi
>
> I work with non-paired end 454 sequences. When I perform an assembly, I
> always get a set of regular contigs and degenerate contigs. The celera
> assembler glossary says that degenerate contigs are those unitigs that
> cannot be placed into scaffolds. Well, with my non-paired end data, how can
> *any* contig be placed into a scaffold. Scaffolds cannot be built without
> paired-end data, right?. So, can somebody tell me the difference between a
> "regular" contig and a degenerate contig?
>
> Thank you for your help,
>
> Paul
>
> University of Pittsburgh
> Pittsburgh, PA 15260
|
From: Paul C. <pca...@gm...> - 2012-04-15 19:28:46
|
Hi,

Does anybody know if there are any graphical viewing programs for the output of CA, so that I can visually inspect the contigs (scaffolds and degenerates), the consensus sequence, and the reads that were used to construct the contigs? Thank you,

Paul

University of Pittsburgh
Pittsburgh, PA
|
From: Walenz, B. <bw...@jc...> - 2012-04-15 19:01:18
|
Without getting into precise definitions: Scaffolder (cgw) promotes unitigs that look like unique sequence (based on coverage, length and a few other signals) to contigs. The leftover unitigs are available for gap filling as repeats or singletons. The unique contigs are then promoted almost immediately to single-contig scaffolds. With no mates, that's all scaffolder will do. The scaffolds/contigs are output as is, and the leftover unitigs are output as degenerate contigs.

bri
--
Brian Walenz
Senior Software Engineer
J. Craig Venter Institute

________________________________________
From: Paul Cantalupo [pca...@gm...]
Sent: Sunday, April 15, 2012 2:53 PM
To: wgs-assembler-users
Subject: [wgs-assembler-users] degenerate contigs

Hi

I work with non-paired end 454 sequences. When I perform an assembly, I always get a set of regular contigs and degenerate contigs. The celera assembler glossary says that degenerate contigs are those unitigs that cannot be placed into scaffolds. Well, with my non-paired end data, how can *any* contig be placed into a scaffold. Scaffolds cannot be built without paired-end data, right?. So, can somebody tell me the difference between a "regular" contig and a degenerate contig?

Thank you for your help,

Paul

University of Pittsburgh
Pittsburgh, PA 15260
|
From: Paul C. <pca...@gm...> - 2012-04-15 18:53:06
|
Hi

I work with non-paired-end 454 sequences. When I perform an assembly, I always get a set of regular contigs and degenerate contigs. The Celera Assembler glossary says that degenerate contigs are those unitigs that cannot be placed into scaffolds. Well, with my non-paired-end data, how can *any* contig be placed into a scaffold? Scaffolds cannot be built without paired-end data, right? So, can somebody tell me the difference between a "regular" contig and a degenerate contig?

Thank you for your help,

Paul

University of Pittsburgh
Pittsburgh, PA 15260
|
From: Christoph H. <chr...@gm...> - 2012-04-15 14:06:02
|
Hi Brian,

Thanks so much for your help! I have resumed the assembly now with the following settings:

ovlHashBits=23
ovlHashBlockLength=260000000

This consumes some 8.5 GB per job, and in my tests gave me a nice load of some 70% (see ex1 below), but I have discovered that the load drops to some 43% after the 13th overlapper job and stays constant after that (currently job 77, see ex2 below). So, again not very efficient. What could be the reason for that? Could it be because I am feeding CA two separate Illumina datasets (one small single-end library and one large paired-end library)?

ex1:
HASH LOADING STOPPED: strings 3524789 out of 3524789 max.
HASH LOADING STOPPED: length 260000046 out of 260000046 max.
HASH LOADING STOPPED: entries 127378102 out of 132120576 max (load 72.31).
### realloc Extra_Ref_Space max_extra_ref_ct = 76183793 String_Ct = 3524789 Extra_String_Ct = 755 Extra_String_Subcount = 35
Read 563144 kmers to mark to skip
Kmer hits without olaps = 13633635
Kmer hits with olaps = 2890745
Multiple overlaps/pair = 0
Total overlaps produced = 2837254
Contained overlaps = 0
Dovetail overlaps = 0

ex2:
HASH LOADING STOPPED: strings 3393657 out of 3393657 max.
HASH LOADING STOPPED: length 260000052 out of 260000052 max.
HASH LOADING STOPPED: entries 76303061 out of 132120576 max (load 43.31).
### realloc Extra_Ref_Space max_extra_ref_ct = 127528828 String_Ct = 3393657 Extra_String_Ct = 13 Extra_String_Subcount = 7
Read 563144 kmers to mark to skip
Kmer hits without olaps = 5141573
Kmer hits with olaps = 3859708
Multiple overlaps/pair = 0
Total overlaps produced = 3728782
Contained overlaps = 0
Dovetail overlaps = 0

I also looked at the size of the *gkpStore/inf file. It has 1.1 GB. How do I affect which fragments are loaded first? Is it simply done by the order they are listed in the spec file? If so, I have loaded the Illumina fragments first.

Thanks again for your help! I really appreciate it! 
cheers,
Christoph

On 13.04.2012 17:00, Walenz, Brian wrote:
> I've seen this too, and am a bit confused where the extra space is used.
> Some assemblies are spot on, others are up to twice as large.
>
> The entries below is 264..., where 957... of them are used. In this case,
> you can either increase hashBlockLength (more memory) or decrease hashBits
> (less memory). The important stat in what you show is ~30% load - most of
> that 3.5gb hash table is empty. We target 70% load. Any higher and the
> table does inefficient lookups, and lower wastes space and increases
> overlapper overhead (more jobs).
>
> One thing to check is the size of file *gkpStore/inf. This is loaded into
> memory nThreads+1 times. The next version (or the CVS tip version) will
> make this less of a problem. If the 'inf' file is large, loading Illumina
> fragments first should reduce the size.
>
> b
>
> On 4/13/12 10:52 AM, "Christoph Hahn" <chr...@gm...> wrote:
>
>> Hi Brian,
>>
>> Thanks for your reply and suggestions!
>>
>> I did follow your suggestion and configured the overlap jobs with
>> "useGrid=1, scriptOnGrid=0". I subsequently ran overlap.sh 1, etc. to
>> check the memory usage.
>>
>> I am using the following overlap parameters:
>>
>> ovlHashBits=24, ovlHashBlockLength=200000000
>>
>> according to my calculations this would consume some 6 GB of memory
>> (3.5GB from ovlHashBits=24 + 0.5 GB overhang + some 2 GB for the 200 Mb
>> of sequence loaded) per thread.
>>
>> The actual max memory consumption is about 9.6 GB (I ran several
>> overlap.sh jobs by hand), so there is a difference of some 3.5 GB of
>> memory consumption between calculated and observed. Am I missing
>> anything? Where is the error in my calculation?
>>
>> When running the overlap.sh I get something like this:
>> HASH LOADING STOPPED: strings 2695151 out of 2695151 max.
>> HASH LOADING STOPPED: length 200000024 out of 200000024 max.
>> HASH LOADING STOPPED: entries 95738763 out of 264241152 max
>> (load 27.17). 
>>
>> In order to optimize, one question to your rule of thumb ("As a rule of
>> thumb, setting ovlHashBlockLength to twice the number of entries
>> available in the table seems reasonable."): in my example, which one is
>> the number of entries available in the table? 95738763 or 264241152? I
>> am a little confused with the terminology... sorry.
>>
>> Thanks again for your kind help!
>>
>> cheers,
>> Christoph
>>
>> On 12.04.2012 21:55, Walenz, Brian wrote:
>>> Hi, Christoph-
>>>
>>> In general (but with exceptions) you can delete a stage and runCA will
>>> pick up from there. For example, you can delete 4-unitigger, fiddle with
>>> parameters, and restart exactly at creating unitigs.
>>>
>>> This works fine with overlaps. Just delete 0-overlaptrim-overlap (and
>>> nothing else!), change parameters and restart runCA. It will skip
>>> gatekeeper, meryl, any trimming, and move straight to configuring overlaps.
>>>
>>> Tip: For overlaps on large assemblies, I like to set "useGrid=1
>>> scriptOnGrid=0". This will configure the overlap jobs, then print out a
>>> qsub command to run them on SGE, but not actually submit them. I then
>>> run several jobs by hand to see memory size and compute performance. To
>>> run by hand, in 0-overlaptrim-overlap, run "overlap.sh 1", "overlap.sh
>>> 2" etc. If you stop these early, they will leave an incomplete
>>> "*.WORKING.gz" file in the output directory (001/ 002/ 003/ etc). I
>>> don't think overlap.sh checks for these files, so you don't even have to
>>> remove them before submitting the full batch.
>>>
>>> b
>>>
>>> On 4/11/12 5:02 PM, "Christoph Hahn"<chr...@gm...> wrote:
>>>
>>> Dear CA developers and users,
>>>
>>> I am trying to use Celeara assembler 7.0 to assemble a medium sized
>>> genome (about 100 Mb) using a combination of 454 and illumina reads. 
>>> >>> I choose a bad combination of the ovlHashBits, ovlHashBlockLength >>> and ovlThreads options so that my last run stopped at the cluster I >>> am using due to exceeding memory limit in the overlaptrim step. I >>> think I know what the problem was, now, so my question is if it is >>> possible to resume runCA from any given stage. In my particular case >>> I would like to resume from the 0-overlaptrim-overlap stage with >>> altered ovlHashBits, ovlHashBlockLength and ovlThreads options. I >>> want to avaid doing the mercouts and initialtrim steps again, >>> because they seem to have worked fine. >>> >>> I read in the manual about using the /do*/ option to get a kind of >>> /startBefore/ effect. I cant seem to find any more details on this >>> in the manual, so can you maybe help me out or point me to the >>> required information on the webpage. Thanks! >>> >>> Your help is highly appreciated! >>> >>> much obliged, >>> Christoph Hahn >>> PhD student >>> University of Oslo >>> >>> |
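Christoph's back-of-envelope memory budget from this thread (hash table + overhang + loaded sequence) can be written down as a tiny calculator. The constants below are lifted from his own estimate (roughly 3.5 GB of hash table at ovlHashBits=24, doubling per extra bit, 0.5 GB of overhang, about 10 bytes per loaded base), not from the Celera Assembler source, and as the thread shows, observed usage can be several GB higher:

```python
# Rough per-job overlapper memory estimate, using the constants from
# Christoph's back-of-envelope numbers (NOT the actual CA internals):
# ~3.5 GB hash table at ovlHashBits=24 (doubling per extra bit),
# 0.5 GB overhang, ~10 bytes per base of ovlHashBlockLength.
def estimate_overlap_gb(ovl_hash_bits, ovl_hash_block_length):
    hash_table_gb = 3.5 * 2.0 ** (ovl_hash_bits - 24)
    overhang_gb = 0.5
    sequence_gb = ovl_hash_block_length * 10 / 1e9
    return hash_table_gb + overhang_gb + sequence_gb

# The two configurations discussed in the thread:
print(estimate_overlap_gb(24, 200_000_000))  # ~6 GB estimated (9.6 GB observed)
print(estimate_overlap_gb(23, 260_000_000))  # ~4.85 GB estimated (8.5 GB observed)
```

The gap between estimate and observation (about 3.5 GB in the thread) is at least partly explained by Brian's note that the gkpStore 'inf' file (1.1 GB here) is loaded into memory nThreads+1 times.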
From: Walenz, B. <bw...@jc...> - 2012-04-12 19:55:25
|
Hi, Christoph-

In general (but with exceptions) you can delete a stage and runCA will pick up from there. For example, you can delete 4-unitigger, fiddle with parameters, and restart exactly at creating unitigs.

This works fine with overlaps. Just delete 0-overlaptrim-overlap (and nothing else!), change parameters and restart runCA. It will skip gatekeeper, meryl, any trimming, and move straight to configuring overlaps.

Tip: For overlaps on large assemblies, I like to set “useGrid=1 scriptOnGrid=0”. This will configure the overlap jobs, then print out a qsub command to run them on SGE, but not actually submit them. I then run several jobs by hand to see memory size and compute performance. To run by hand, in 0-overlaptrim-overlap, run “overlap.sh 1”, “overlap.sh 2” etc. If you stop these early, they will leave an incomplete “*.WORKING.gz” file in the output directory (001/ 002/ 003/ etc). I don’t think overlap.sh checks for these files, so you don’t even have to remove them before submitting the full batch.

b

On 4/11/12 5:02 PM, "Christoph Hahn" <chr...@gm...> wrote:

Dear CA developers and users,

I am trying to use Celera Assembler 7.0 to assemble a medium-sized genome (about 100 Mb) using a combination of 454 and Illumina reads.

I chose a bad combination of the ovlHashBits, ovlHashBlockLength and ovlThreads options, so that my last run stopped at the cluster I am using due to exceeding the memory limit in the overlaptrim step. I think I know what the problem was now, so my question is whether it is possible to resume runCA from any given stage. In my particular case I would like to resume from the 0-overlaptrim-overlap stage with altered ovlHashBits, ovlHashBlockLength and ovlThreads options. I want to avoid doing the mercounts and initialtrim steps again, because they seem to have worked fine.

I read in the manual about using the do* option to get a kind of startBefore effect. 
I can't seem to find any more details on this in the manual, so could you maybe help me out or point me to the required information on the web page? Thanks!

Your help is highly appreciated!

much obliged,
Christoph Hahn
PhD student
University of Oslo
|
From: Ole K. T. <o.k...@bi...> - 2012-04-12 16:16:01
|
Hi Brian.

Incidentally, the numbers are the same in mine. I thought maybe you had gleaned my numbers from the files I sent you, and that you used them to make it easier for me to understand. :)

To sum up: the contig is identical in versions 14 and 16, with the same 'data.contig_status' (set to U) and, as far as I can see, the same consensus and sequence length. In version 15, however, the consensus (and quality scores) are lacking, the length of the contig is set to 0, and 'data.contig_status' is also U.

I dumped the contigs by using 'tigStore -g *gkpStore -t *tigStore 14 -c 107652 -d layout3 > ctg1076523', and just varying the version number. I tried loading the contig a couple of times into version 15 and then dumping it again, but still, it was without consensus sequence, and asmOutputFasta fails.

Have I messed up the latter stages of my assembly by doing this? Is it possible to fix this in any way?

Thanks for your help so far. It's good to learn more about Celera.

Ole

On 12 April 2012 16:50, Walenz, Brian <bw...@jc...> wrote:
> Hi Ole-
>
> The version numbers will be different in different assemblies. Mine came
> from a small assembly with little scaffolding work. Larger assemblies can
> have more than 100 versions. 'ls -l *tigStore' will show the versions - you
> want to use the last three.
>
> b
>
> On 4/12/12 3:52 AM, "Ole Kristian Tørresen" <o.k...@bi...> wrote:
>
>> Hi, Brian.
>>
>> Thank you for your help so far, but I seem to be missing something. 
>> >> I did this: >> tigStore -g *gkpStore -t *tigStore 14 -c 107652 -d layout3 > ctg1076523 >> tigStore -g *gkpStore -t *tigStore 15 -cp 36 -R ctg1076523 >> >> But when I dump the same contig from version 15: >> tigStore -g *gkpStore -t *tigStore 15 -c 107652 -d layout3 > ctg1076523_v15 >> it's without consensus sequence: >> contig 1076523 >> len 0 >> cns >> qlt >> data.unitig_coverage_stat -9874.792662 >> data.unitig_microhet_prob 0.000000 >> data.unitig_status X >> data.unitig_unique_rept X >> data.contig_status U >> data.num_frags 14302 >> data.num_unitigs 1 >> >> The if I dump it from version 16, it's identical to the one from >> version 14 (that is, with consensus). I've tried loading it several >> times, but each time I dump it again it's lost consensus. Do you know >> what I'm doing wrong? >> >> Ole >> >> On 11 April 2012 20:54, Walenz, Brian <bw...@jc...> wrote: >>> Hi, Ole- >>> >>> Yes, I overlooked a step. In the contig you insert to the latest version, >>> update the 'data.contig_status' with what the second to last version has. >>> >>> FYI, the tigStore should have versions such as: >>> >>> seqDB.v014.ctg >>> seqDB.v014.dat >>> seqDB.v014.utg >>> >>> seqDB.v015.ctg >>> seqDB.v015.p001.ctg >>> seqDB.v015.p001.dat >>> (etc) >>> seqDB.v015.utg >>> >>> seqDB.v016.ctg >>> seqDB.v016.p001.ctg >>> seqDB.v016.p001.dat >>> (etc) >>> seqDB.v016.utg >>> >>> (the v numbers will of course be different in your assembly) >>> >>> v015 contains the output of scaffolder, which is the input to consensus. >>> Contigs here have no consensus sequence, but otherwise all the data is >>> present. It is largely just rewriting the data from v014 into partitions >>> (p###), so each consensus job can load a single file instead of randomly >>> accessing a large file. The status flag on each unitig/contig is also set. >>> This flag tells if the unitig/contig was placed in a scaffold, is a >>> surrogate, degenerate, etc. 
>>> >>> v016 is the output of consensus, the input to terminator. All terminator >>> does is to repackage this into ASCII files. >>> >>> To summarize: grab the contig from v014 (the last with a consensus >>> sequence), the status flag from v015, change the status flag in the contig >>> you grabbed, and then insert the contig into v016. >>> >>> by doing this, you'll lose VAR records for this contig, but otherwise the >>> consensus sequence is the same (or largely the same; variant detection can >>> change it a bit). >>> >>> b >>> >>> >>> On 4/11/12 6:23 AM, "Ole Kristian Tørresen" <o.k...@bi...> wrote: >>> >>>> Hi Brian, >>>> ctgcns completed now, but I got an error with asmOutputFasta. From >>>> 9-terminator/asmOutputFasta.err: >>>> ERROR: Illegal unitigpos type type value 'X' (CCO) at line 1676575956 >>>> >>>> Is this connected with the procedure I did with inserting the contig >>>> from an older tigStore? >>>> >>>> Thank you for your help so far. >>>> >>>> Ole >>>> >>>> On 11 April 2012 08:13, Ole Kristian Tørresen <o.k...@bi...> >>>> wrote: >>>>> Hi Brian. >>>>> >>>>> I've done this, and rerunning ctgcns on that last partition. I'll send >>>>> the layout and log in a separate email. >>>>> >>>>> Ole >>>>> >>>>> On 10 April 2012 21:37, Walenz, Brian <bw...@jc...> wrote: >>>>>> Hi Ole- >>>>>> >>>>>> I don't see anything that looks like an error in the log, so I'll have to >>>>>> assume it crashed. You report it runs for 20 hours, which is odd for >>>>>> contig >>>>>> consensus, unless that contig is very very deep. If so, the ctgcns >>>>>> process >>>>>> will also be large. Do you know how big the process was? >>>>>> >>>>>> Can you make the full log available? >>>>>> >>>>>> It is possible to force the contig to have a consensus sequence. If the >>>>>> job >>>>>> did crash, the other contigs will still need to have consensus generated. 
>>>>>> >>>>>> The process is the same as editing a unitig in the tigStore: dump the >>>>>> contig >>>>>> in question, edit the file to have a consensus sequence, then load that >>>>>> contig back into the tigStore. A consensus sequence for this contig can >>>>>> be >>>>>> found in one of the earlier tigStore versions; the version just before >>>>>> this >>>>>> one will probably have it. That makes our process even easier: dump the >>>>>> version with a consensus sequence, and load it back into the latest >>>>>> version. >>>>>> >>>>>> A sketch of the steps: >>>>>> >>>>>> 1) Dump the previous version of the contig. check that 'file' does >>>>>> contain >>>>>> a consensus sequence. >>>>>> >>>>>> tigStore -g *gkpStore -t *tigStore <vers-1> -c <ctgID> -d layout > file >>>>>> >>>>>> 2) Load that pervious version into the tigStore as the latest version >>>>>> >>>>>> tigStore -g *gkpStore -t *tigStore <vers> <part> -c <ctgID> -R file >>>>>> >>>>>> Notice that this tigStore command specifies both a version and a partition >>>>>> for the tigStore. >>>>>> >>>>>> 3) Rerun consensus.sh on that partition. It will not attempt to compute >>>>>> the >>>>>> consensus for that contig. >>>>>> >>>>>> I'd be interested in seeing the contig you dump, if only to verify that it >>>>>> is deep. >>>>>> >>>>>> b >>>>>> >>>>>> >>>>>> >>>>>> On 4/10/12 4:05 AM, "Ole Kristian Tørresen" <o.k...@bi...> >>>>>> wrote: >>>>>> >>>>>>> Hi, >>>>>>> I'm having some problems while doing some low coverage sequencing >>>>>>> assembly testing. I've tried to assemble about 10x coverage of 150 nt >>>>>>> paired Illumina reads of 500 bp fragment size. These are from the >>>>>>> parrot used in the Assemblathon 2 >>>>>>> (http://assemblathon.org/pages/download-data). Everything seems to run >>>>>>> fine, until contig consensus, where 1 partition just don't succeed. It >>>>>>> seems to run for quite some time (20 hours or something) before >>>>>>> failing. 
These are the last 20 lines from the output of the ctgcns >>>>>>> partition that fails: >>>>>>> Alignment params: 297 333 200 200 0 0.12 1e-06 30 1 >>>>>>> -- e/l = 7/112 = 6.25% >>>>>>> A -----+------+----> [] >>>>>>> B 332 -------> 40 [] >>>>>>> GetAlignmentTrace()-- Overlap ACCEPTED! accept=1000.000000 >>>>>>> lScore=0.026087 (112 vs 115) aScore=0.160000 (332 vs 316) >>>>>>> bScore=0.150000 (-42 vs -27). (CONTIGF) >>>>>>> GetAlignmentTrace()-- Overlap found between 1076523 (U) and 25763657 >>>>>>> (R) expected hangs: a=316 b=-27 erate=0.060000 aligner=Local_Overlap >>>>>>> GetAlignmentTrace()-- Overlap ACCEPTED! accept=1000.000000 >>>>>>> lScore=0.026087 (112 vs 115) aScore=0.160000 (332 vs 316) >>>>>>> bScore=0.150000 (-42 vs -27). (CONTIGF) >>>>>>> Local_Overlap_AS_forCNS found overlap between 1076523 (U) and 25763657 >>>>>>> (R) ahang: 332, bhang: -42 (expected hang was 316) >>>>>>> Alignment params: 298 334 200 200 0 0.12 1e-06 30 1 >>>>>>> -- e/l = 6/112 = 5.36% >>>>>>> A -----+------+----> [] >>>>>>> B 332 -------> 42 [] >>>>>>> GetAlignmentTrace()-- Overlap ACCEPTED! accept=1000.000000 >>>>>>> lScore=0.009009 (110 vs 111) aScore=0.140000 (332 vs 318) >>>>>>> bScore=0.130000 (-42 vs -29). (CONTIGF) >>>>>>> GetAlignmentTrace()-- Overlap found between 1076523 (U) and 57537697 >>>>>>> (R) expected hangs: a=318 b=-29 erate=0.060000 aligner=Local_Overlap >>>>>>> GetAlignmentTrace()-- Overlap ACCEPTED! accept=1000.000000 >>>>>>> lScore=0.009009 (110 vs 111) aScore=0.140000 (332 vs 318) >>>>>>> bScore=0.130000 (-42 vs -29). 
(CONTIGF) >>>>>>> Local_Overlap_AS_forCNS found overlap between 1076523 (U) and 57537697 >>>>>>> (R) ahang: 332, bhang: -42 (expected hang was 318) >>>>>>> Alignment params: 300 336 200 200 0 0.12 1e-06 30 1 >>>>>>> -- e/l = 6/110 = 5.45% >>>>>>> A -----+------+----> [] >>>>>>> B 332 -------> 42 [] >>>>>>> >>>>>>> This is the error message: >>>>>>> at /usit/titan/u1/olekto/src/wgs-7.0/Linux-amd64/bin/runCA line 1237 >>>>>>> main::caFailure('1 consensusAfterScaffolder jobs failed; remove >>>>>>> 8-consensus/co...', undef) called at >>>>>>> /usit/titan/u1/olekto/src/wgs-7.0/Linux-amd64/bin/runCA line 5142 >>>>>>> main::postScaffolderConsensus() called at >>>>>>> /usit/titan/u1/olekto/src/wgs-7.0/Linux-amd64/bin/runCA line 5885 >>>>>>> >>>>>>> ---------------------------------------- >>>>>>> Failure message: >>>>>>> >>>>>>> 1 consensusAfterScaffolder jobs failed; remove >>>>>>> 8-consensus/consensus.sh to try again >>>>>>> >>>>>>> I've tried removing consensus.sh and running again, but get the same >>>>>>> error. 
>>>>>>> >>>>>>> This is the spec file: >>>>>>> utgErrorRate=0.03 >>>>>>> utgErrorLimit=2.5 >>>>>>> ovlErrorRate=0.06 >>>>>>> cnsErrorRate=0.06 >>>>>>> cgwErrorRate=0.10 >>>>>>> merSize = 22 >>>>>>> overlapper=ovl >>>>>>> unitigger = bogart >>>>>>> merylMemory = 128000 >>>>>>> merylThreads = 16 >>>>>>> merOverlapperThreads = 2 >>>>>>> merOverlapperExtendConcurrency = 8 >>>>>>> merOverlapperSeedConcurrency = 8 >>>>>>> ovlThreads = 2 >>>>>>> mbtThreads = 2 >>>>>>> mbtConcurrency = 8 >>>>>>> ovlConcurrency = 8 >>>>>>> ovlCorrConcurrency = 16 >>>>>>> ovlRefBlockSize = 32000000 >>>>>>> ovlHashBits = 24 >>>>>>> ovlHashBlockLength = 800000000 >>>>>>> ovlStoreMemory = 128000 >>>>>>> frgCorrThreads = 2 >>>>>>> frgCorrConcurrency = 8 >>>>>>> ovlCorrBatchSize = 1000000 >>>>>>> ovlCorrConcurrency = 16 >>>>>>> cnsConcurrency = 16 >>>>>>> doExtendClearRanges = 0 >>>>>>> >>>>>>> I don't need to have that unitig (1076523 (U)) in my finished >>>>>>> assembly, so it's possible to just remove it as long as I get a >>>>>>> finished assembly. I've also tried to just create the .success file, >>>>>>> but then terminator fails. >>>>>>> >>>>>>> Does anyone have any ideas of what I might do different? Can I just >>>>>>> remove that unitig and proceed? How do I do that? >>>>>>> >>>>>>> Sincerely, >>>>>>> Ole Kristian Tørresen >>>>>>> PhD student >>>>>>> University of Oslo >>>>>>> >>>>>>> ------------------------------------------------------------------------- >>>>>>> -- >>>>>>> --- >>>>>>> Better than sec? Nothing is better than sec when it comes to >>>>>>> monitoring Big Data applications. Try Boundary one-second >>>>>>> resolution app monitoring today. Free. >>>>>>> http://p.sf.net/sfu/Boundary-dev2dev >>>>>>> _______________________________________________ >>>>>>> wgs-assembler-users mailing list >>>>>>> wgs...@li... >>>>>>> https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users >>>>>> >>> > |
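[Editor's sketch] The recovery procedure quoted above — dump the contig from the last version that has a consensus, fix its `data.contig_status` flag to match the second-to-last version, and load it into the latest version — can be sketched as shell commands. The tigStore invocations are shown as comments because they need the real stores; only the in-place layout edit runs here. The store/file names and the status value `P` are illustrative assumptions, not values confirmed in this thread (the contig ID is the one under discussion).

```shell
# tigStore -g asm.gkpStore -t asm.tigStore 14 -c 1076523 -d layout > ctg.layout

# stand-in for a real dumped layout file:
cat > ctg.layout <<'EOF'
contig 1076523
len 12345
data.contig_status     U
data.num_frags         14302
EOF

# replace the status flag with the one reported by the v015 dump
# ('P' here is an assumed example value)
STATUS=P
sed -i "s/^\(data\.contig_status[[:space:]]*\).*/\1$STATUS/" ctg.layout

# tigStore -g asm.gkpStore -t asm.tigStore 16 36 -c 1076523 -R ctg.layout
grep '^data.contig_status' ctg.layout
```

After loading, re-dumping the contig from the final version is a cheap check that both the consensus sequence and the corrected flag survived.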
From: Walenz, B. <bw...@jc...> - 2012-04-12 14:51:04
|
Hi Ole- The version numbers will be different in different assemblies. Mine came from a small assembly with little scaffolding work. Larger assemblies can have more than 100 versions. 'ls -l *tigStore' will show the versions - you want to use the last three. b On 4/12/12 3:52 AM, "Ole Kristian Tørresen" <o.k...@bi...> wrote: > Hi, Brian. > > Thank you for your help so far, but I seem to be missing something. > > I did this: > tigStore -g *gkpStore -t *tigStore 14 -c 107652 -d layout3 > ctg1076523 > tigStore -g *gkpStore -t *tigStore 15 -cp 36 -R ctg1076523 > > But when I dump the same contig from version 15: > tigStore -g *gkpStore -t *tigStore 15 -c 107652 -d layout3 > ctg1076523_v15 > it's without consensus sequence: > contig 1076523 > len 0 > cns > qlt > data.unitig_coverage_stat -9874.792662 > data.unitig_microhet_prob 0.000000 > data.unitig_status X > data.unitig_unique_rept X > data.contig_status U > data.num_frags 14302 > data.num_unitigs 1 > > The if I dump it from version 16, it's identical to the one from > version 14 (that is, with consensus). I've tried loading it several > times, but each time I dump it again it's lost consensus. Do you know > what I'm doing wrong? > > Ole > > On 11 April 2012 20:54, Walenz, Brian <bw...@jc...> wrote: >> Hi, Ole- >> >> Yes, I overlooked a step. In the contig you insert to the latest version, >> update the 'data.contig_status' with what the second to last version has. >> >> FYI, the tigStore should have versions such as: >> >> seqDB.v014.ctg >> seqDB.v014.dat >> seqDB.v014.utg >> >> seqDB.v015.ctg >> seqDB.v015.p001.ctg >> seqDB.v015.p001.dat >> (etc) >> seqDB.v015.utg >> >> seqDB.v016.ctg >> seqDB.v016.p001.ctg >> seqDB.v016.p001.dat >> (etc) >> seqDB.v016.utg >> >> (the v numbers will of course be different in your assembly) >> >> v015 contains the output of scaffolder, which is the input to consensus. >> Contigs here have no consensus sequence, but otherwise all the data is >> present. 
It is largely just rewriting the data from v014 into partitions >> (p###), so each consensus job can load a single file instead of randomly >> accessing a large file. The status flag on each unitig/contig is also set. >> This flag tells if the unitig/contig was placed in a scaffold, is a >> surrogate, degenerate, etc. >> >> v016 is the output of consensus, the input to terminator. All terminator >> does is to repackage this into ASCII files. >> >> To summarize: grab the contig from v014 (the last with a consensus >> sequence), the status flag from v015, change the status flag in the contig >> you grabbed, and then insert the contig into v016. >> >> by doing this, you'll lose VAR records for this contig, but otherwise the >> consensus sequence is the same (or largely the same; variant detection can >> change it a bit). >> >> b >> >> >> On 4/11/12 6:23 AM, "Ole Kristian Tørresen" <o.k...@bi...> wrote: >> >>> Hi Brian, >>> ctgcns completed now, but I got an error with asmOutputFasta. From >>> 9-terminator/asmOutputFasta.err: >>> ERROR: Illegal unitigpos type type value 'X' (CCO) at line 1676575956 >>> >>> Is this connected with the procedure I did with inserting the contig >>> from an older tigStore? >>> >>> Thank you for your help so far. >>> >>> Ole >>> >>> On 11 April 2012 08:13, Ole Kristian Tørresen <o.k...@bi...> >>> wrote: >>>> Hi Brian. >>>> >>>> I've done this, and rerunning ctgcns on that last partition. I'll send >>>> the layout and log in a separate email. >>>> >>>> Ole >>>> >>>> On 10 April 2012 21:37, Walenz, Brian <bw...@jc...> wrote: >>>>> Hi Ole- >>>>> >>>>> I don't see anything that looks like an error in the log, so I'll have to >>>>> assume it crashed. You report it runs for 20 hours, which is odd for >>>>> contig >>>>> consensus, unless that contig is very very deep. If so, the ctgcns >>>>> process >>>>> will also be large. Do you know how big the process was? >>>>> >>>>> Can you make the full log available? 
>>>>> >>>>> It is possible to force the contig to have a consensus sequence. If the >>>>> job >>>>> did crash, the other contigs will still need to have consensus generated. >>>>> >>>>> The process is the same as editing a unitig in the tigStore: dump the >>>>> contig >>>>> in question, edit the file to have a consensus sequence, then load that >>>>> contig back into the tigStore. A consensus sequence for this contig can >>>>> be >>>>> found in one of the earlier tigStore versions; the version just before >>>>> this >>>>> one will probably have it. That makes our process even easier: dump the >>>>> version with a consensus sequence, and load it back into the latest >>>>> version. >>>>> >>>>> A sketch of the steps: >>>>> >>>>> 1) Dump the previous version of the contig. check that 'file' does >>>>> contain >>>>> a consensus sequence. >>>>> >>>>> tigStore -g *gkpStore -t *tigStore <vers-1> -c <ctgID> -d layout > file >>>>> >>>>> 2) Load that pervious version into the tigStore as the latest version >>>>> >>>>> tigStore -g *gkpStore -t *tigStore <vers> <part> -c <ctgID> -R file >>>>> >>>>> Notice that this tigStore command specifies both a version and a partition >>>>> for the tigStore. >>>>> >>>>> 3) Rerun consensus.sh on that partition. It will not attempt to compute >>>>> the >>>>> consensus for that contig. >>>>> >>>>> I'd be interested in seeing the contig you dump, if only to verify that it >>>>> is deep. >>>>> >>>>> b >>>>> >>>>> >>>>> >>>>> On 4/10/12 4:05 AM, "Ole Kristian Tørresen" <o.k...@bi...> >>>>> wrote: >>>>> >>>>>> Hi, >>>>>> I'm having some problems while doing some low coverage sequencing >>>>>> assembly testing. I've tried to assemble about 10x coverage of 150 nt >>>>>> paired Illumina reads of 500 bp fragment size. These are from the >>>>>> parrot used in the Assemblathon 2 >>>>>> (http://assemblathon.org/pages/download-data). Everything seems to run >>>>>> fine, until contig consensus, where 1 partition just don't succeed. 
It >>>>>> seems to run for quite some time (20 hours or something) before >>>>>> failing. These are the last 20 lines from the output of the ctgcns >>>>>> partition that fails: >>>>>> Alignment params: 297 333 200 200 0 0.12 1e-06 30 1 >>>>>> -- e/l = 7/112 = 6.25% >>>>>> A -----+------+----> [] >>>>>> B 332 -------> 40 [] >>>>>> GetAlignmentTrace()-- Overlap ACCEPTED! accept=1000.000000 >>>>>> lScore=0.026087 (112 vs 115) aScore=0.160000 (332 vs 316) >>>>>> bScore=0.150000 (-42 vs -27). (CONTIGF) >>>>>> GetAlignmentTrace()-- Overlap found between 1076523 (U) and 25763657 >>>>>> (R) expected hangs: a=316 b=-27 erate=0.060000 aligner=Local_Overlap >>>>>> GetAlignmentTrace()-- Overlap ACCEPTED! accept=1000.000000 >>>>>> lScore=0.026087 (112 vs 115) aScore=0.160000 (332 vs 316) >>>>>> bScore=0.150000 (-42 vs -27). (CONTIGF) >>>>>> Local_Overlap_AS_forCNS found overlap between 1076523 (U) and 25763657 >>>>>> (R) ahang: 332, bhang: -42 (expected hang was 316) >>>>>> Alignment params: 298 334 200 200 0 0.12 1e-06 30 1 >>>>>> -- e/l = 6/112 = 5.36% >>>>>> A -----+------+----> [] >>>>>> B 332 -------> 42 [] >>>>>> GetAlignmentTrace()-- Overlap ACCEPTED! accept=1000.000000 >>>>>> lScore=0.009009 (110 vs 111) aScore=0.140000 (332 vs 318) >>>>>> bScore=0.130000 (-42 vs -29). (CONTIGF) >>>>>> GetAlignmentTrace()-- Overlap found between 1076523 (U) and 57537697 >>>>>> (R) expected hangs: a=318 b=-29 erate=0.060000 aligner=Local_Overlap >>>>>> GetAlignmentTrace()-- Overlap ACCEPTED! accept=1000.000000 >>>>>> lScore=0.009009 (110 vs 111) aScore=0.140000 (332 vs 318) >>>>>> bScore=0.130000 (-42 vs -29). 
(CONTIGF) >>>>>> Local_Overlap_AS_forCNS found overlap between 1076523 (U) and 57537697 >>>>>> (R) ahang: 332, bhang: -42 (expected hang was 318) >>>>>> Alignment params: 300 336 200 200 0 0.12 1e-06 30 1 >>>>>> -- e/l = 6/110 = 5.45% >>>>>> A -----+------+----> [] >>>>>> B 332 -------> 42 [] >>>>>> >>>>>> This is the error message: >>>>>> at /usit/titan/u1/olekto/src/wgs-7.0/Linux-amd64/bin/runCA line 1237 >>>>>> main::caFailure('1 consensusAfterScaffolder jobs failed; remove >>>>>> 8-consensus/co...', undef) called at >>>>>> /usit/titan/u1/olekto/src/wgs-7.0/Linux-amd64/bin/runCA line 5142 >>>>>> main::postScaffolderConsensus() called at >>>>>> /usit/titan/u1/olekto/src/wgs-7.0/Linux-amd64/bin/runCA line 5885 >>>>>> >>>>>> ---------------------------------------- >>>>>> Failure message: >>>>>> >>>>>> 1 consensusAfterScaffolder jobs failed; remove >>>>>> 8-consensus/consensus.sh to try again >>>>>> >>>>>> I've tried removing consensus.sh and running again, but get the same >>>>>> error. 
>>>>>> >>>>>> This is the spec file: >>>>>> utgErrorRate=0.03 >>>>>> utgErrorLimit=2.5 >>>>>> ovlErrorRate=0.06 >>>>>> cnsErrorRate=0.06 >>>>>> cgwErrorRate=0.10 >>>>>> merSize = 22 >>>>>> overlapper=ovl >>>>>> unitigger = bogart >>>>>> merylMemory = 128000 >>>>>> merylThreads = 16 >>>>>> merOverlapperThreads = 2 >>>>>> merOverlapperExtendConcurrency = 8 >>>>>> merOverlapperSeedConcurrency = 8 >>>>>> ovlThreads = 2 >>>>>> mbtThreads = 2 >>>>>> mbtConcurrency = 8 >>>>>> ovlConcurrency = 8 >>>>>> ovlCorrConcurrency = 16 >>>>>> ovlRefBlockSize = 32000000 >>>>>> ovlHashBits = 24 >>>>>> ovlHashBlockLength = 800000000 >>>>>> ovlStoreMemory = 128000 >>>>>> frgCorrThreads = 2 >>>>>> frgCorrConcurrency = 8 >>>>>> ovlCorrBatchSize = 1000000 >>>>>> ovlCorrConcurrency = 16 >>>>>> cnsConcurrency = 16 >>>>>> doExtendClearRanges = 0 >>>>>> >>>>>> I don't need to have that unitig (1076523 (U)) in my finished >>>>>> assembly, so it's possible to just remove it as long as I get a >>>>>> finished assembly. I've also tried to just create the .success file, >>>>>> but then terminator fails. >>>>>> >>>>>> Does anyone have any ideas of what I might do different? Can I just >>>>>> remove that unitig and proceed? How do I do that? >>>>>> >>>>>> Sincerely, >>>>>> Ole Kristian Tørresen >>>>>> PhD student >>>>>> University of Oslo >>>>>> >>>>>> ------------------------------------------------------------------------- >>>>>> -- >>>>>> --- >>>>>> Better than sec? Nothing is better than sec when it comes to >>>>>> monitoring Big Data applications. Try Boundary one-second >>>>>> resolution app monitoring today. Free. >>>>>> http://p.sf.net/sfu/Boundary-dev2dev >>>>>> _______________________________________________ >>>>>> wgs-assembler-users mailing list >>>>>> wgs...@li... >>>>>> https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users >>>>> >> |
From: Ole K. T. <o.k...@bi...> - 2012-04-12 07:52:39
|
Hi, Brian. Thank you for your help so far, but I seem to be missing something. I did this: tigStore -g *gkpStore -t *tigStore 14 -c 107652 -d layout3 > ctg1076523 tigStore -g *gkpStore -t *tigStore 15 -cp 36 -R ctg1076523 But when I dump the same contig from version 15: tigStore -g *gkpStore -t *tigStore 15 -c 107652 -d layout3 > ctg1076523_v15 it's without consensus sequence: contig 1076523 len 0 cns qlt data.unitig_coverage_stat -9874.792662 data.unitig_microhet_prob 0.000000 data.unitig_status X data.unitig_unique_rept X data.contig_status U data.num_frags 14302 data.num_unitigs 1 The if I dump it from version 16, it's identical to the one from version 14 (that is, with consensus). I've tried loading it several times, but each time I dump it again it's lost consensus. Do you know what I'm doing wrong? Ole On 11 April 2012 20:54, Walenz, Brian <bw...@jc...> wrote: > Hi, Ole- > > Yes, I overlooked a step. In the contig you insert to the latest version, > update the 'data.contig_status' with what the second to last version has. > > FYI, the tigStore should have versions such as: > > seqDB.v014.ctg > seqDB.v014.dat > seqDB.v014.utg > > seqDB.v015.ctg > seqDB.v015.p001.ctg > seqDB.v015.p001.dat > (etc) > seqDB.v015.utg > > seqDB.v016.ctg > seqDB.v016.p001.ctg > seqDB.v016.p001.dat > (etc) > seqDB.v016.utg > > (the v numbers will of course be different in your assembly) > > v015 contains the output of scaffolder, which is the input to consensus. > Contigs here have no consensus sequence, but otherwise all the data is > present. It is largely just rewriting the data from v014 into partitions > (p###), so each consensus job can load a single file instead of randomly > accessing a large file. The status flag on each unitig/contig is also set. > This flag tells if the unitig/contig was placed in a scaffold, is a > surrogate, degenerate, etc. > > v016 is the output of consensus, the input to terminator. All terminator > does is to repackage this into ASCII files. 
> > To summarize: grab the contig from v014 (the last with a consensus > sequence), the status flag from v015, change the status flag in the contig > you grabbed, and then insert the contig into v016. > > by doing this, you'll lose VAR records for this contig, but otherwise the > consensus sequence is the same (or largely the same; variant detection can > change it a bit). > > b > > > On 4/11/12 6:23 AM, "Ole Kristian Tørresen" <o.k...@bi...> wrote: > >> Hi Brian, >> ctgcns completed now, but I got an error with asmOutputFasta. From >> 9-terminator/asmOutputFasta.err: >> ERROR: Illegal unitigpos type type value 'X' (CCO) at line 1676575956 >> >> Is this connected with the procedure I did with inserting the contig >> from an older tigStore? >> >> Thank you for your help so far. >> >> Ole >> >> On 11 April 2012 08:13, Ole Kristian Tørresen <o.k...@bi...> wrote: >>> Hi Brian. >>> >>> I've done this, and rerunning ctgcns on that last partition. I'll send >>> the layout and log in a separate email. >>> >>> Ole >>> >>> On 10 April 2012 21:37, Walenz, Brian <bw...@jc...> wrote: >>>> Hi Ole- >>>> >>>> I don't see anything that looks like an error in the log, so I'll have to >>>> assume it crashed. You report it runs for 20 hours, which is odd for contig >>>> consensus, unless that contig is very very deep. If so, the ctgcns process >>>> will also be large. Do you know how big the process was? >>>> >>>> Can you make the full log available? >>>> >>>> It is possible to force the contig to have a consensus sequence. If the job >>>> did crash, the other contigs will still need to have consensus generated. >>>> >>>> The process is the same as editing a unitig in the tigStore: dump the contig >>>> in question, edit the file to have a consensus sequence, then load that >>>> contig back into the tigStore. A consensus sequence for this contig can be >>>> found in one of the earlier tigStore versions; the version just before this >>>> one will probably have it. 
That makes our process even easier: dump the >>>> version with a consensus sequence, and load it back into the latest version. >>>> >>>> A sketch of the steps: >>>> >>>> 1) Dump the previous version of the contig. check that 'file' does contain >>>> a consensus sequence. >>>> >>>> tigStore -g *gkpStore -t *tigStore <vers-1> -c <ctgID> -d layout > file >>>> >>>> 2) Load that pervious version into the tigStore as the latest version >>>> >>>> tigStore -g *gkpStore -t *tigStore <vers> <part> -c <ctgID> -R file >>>> >>>> Notice that this tigStore command specifies both a version and a partition >>>> for the tigStore. >>>> >>>> 3) Rerun consensus.sh on that partition. It will not attempt to compute the >>>> consensus for that contig. >>>> >>>> I'd be interested in seeing the contig you dump, if only to verify that it >>>> is deep. >>>> >>>> b >>>> >>>> >>>> >>>> On 4/10/12 4:05 AM, "Ole Kristian Tørresen" <o.k...@bi...> wrote: >>>> >>>>> Hi, >>>>> I'm having some problems while doing some low coverage sequencing >>>>> assembly testing. I've tried to assemble about 10x coverage of 150 nt >>>>> paired Illumina reads of 500 bp fragment size. These are from the >>>>> parrot used in the Assemblathon 2 >>>>> (http://assemblathon.org/pages/download-data). Everything seems to run >>>>> fine, until contig consensus, where 1 partition just don't succeed. It >>>>> seems to run for quite some time (20 hours or something) before >>>>> failing. These are the last 20 lines from the output of the ctgcns >>>>> partition that fails: >>>>> Alignment params: 297 333 200 200 0 0.12 1e-06 30 1 >>>>> -- e/l = 7/112 = 6.25% >>>>> A -----+------+----> [] >>>>> B 332 -------> 40 [] >>>>> GetAlignmentTrace()-- Overlap ACCEPTED! accept=1000.000000 >>>>> lScore=0.026087 (112 vs 115) aScore=0.160000 (332 vs 316) >>>>> bScore=0.150000 (-42 vs -27). 
(CONTIGF) >>>>> GetAlignmentTrace()-- Overlap found between 1076523 (U) and 25763657 >>>>> (R) expected hangs: a=316 b=-27 erate=0.060000 aligner=Local_Overlap >>>>> GetAlignmentTrace()-- Overlap ACCEPTED! accept=1000.000000 >>>>> lScore=0.026087 (112 vs 115) aScore=0.160000 (332 vs 316) >>>>> bScore=0.150000 (-42 vs -27). (CONTIGF) >>>>> Local_Overlap_AS_forCNS found overlap between 1076523 (U) and 25763657 >>>>> (R) ahang: 332, bhang: -42 (expected hang was 316) >>>>> Alignment params: 298 334 200 200 0 0.12 1e-06 30 1 >>>>> -- e/l = 6/112 = 5.36% >>>>> A -----+------+----> [] >>>>> B 332 -------> 42 [] >>>>> GetAlignmentTrace()-- Overlap ACCEPTED! accept=1000.000000 >>>>> lScore=0.009009 (110 vs 111) aScore=0.140000 (332 vs 318) >>>>> bScore=0.130000 (-42 vs -29). (CONTIGF) >>>>> GetAlignmentTrace()-- Overlap found between 1076523 (U) and 57537697 >>>>> (R) expected hangs: a=318 b=-29 erate=0.060000 aligner=Local_Overlap >>>>> GetAlignmentTrace()-- Overlap ACCEPTED! accept=1000.000000 >>>>> lScore=0.009009 (110 vs 111) aScore=0.140000 (332 vs 318) >>>>> bScore=0.130000 (-42 vs -29). 
(CONTIGF) >>>>> Local_Overlap_AS_forCNS found overlap between 1076523 (U) and 57537697 >>>>> (R) ahang: 332, bhang: -42 (expected hang was 318) >>>>> Alignment params: 300 336 200 200 0 0.12 1e-06 30 1 >>>>> -- e/l = 6/110 = 5.45% >>>>> A -----+------+----> [] >>>>> B 332 -------> 42 [] >>>>> >>>>> This is the error message: >>>>> at /usit/titan/u1/olekto/src/wgs-7.0/Linux-amd64/bin/runCA line 1237 >>>>> main::caFailure('1 consensusAfterScaffolder jobs failed; remove >>>>> 8-consensus/co...', undef) called at >>>>> /usit/titan/u1/olekto/src/wgs-7.0/Linux-amd64/bin/runCA line 5142 >>>>> main::postScaffolderConsensus() called at >>>>> /usit/titan/u1/olekto/src/wgs-7.0/Linux-amd64/bin/runCA line 5885 >>>>> >>>>> ---------------------------------------- >>>>> Failure message: >>>>> >>>>> 1 consensusAfterScaffolder jobs failed; remove >>>>> 8-consensus/consensus.sh to try again >>>>> >>>>> I've tried removing consensus.sh and running again, but get the same error. >>>>> >>>>> This is the spec file: >>>>> utgErrorRate=0.03 >>>>> utgErrorLimit=2.5 >>>>> ovlErrorRate=0.06 >>>>> cnsErrorRate=0.06 >>>>> cgwErrorRate=0.10 >>>>> merSize = 22 >>>>> overlapper=ovl >>>>> unitigger = bogart >>>>> merylMemory = 128000 >>>>> merylThreads = 16 >>>>> merOverlapperThreads = 2 >>>>> merOverlapperExtendConcurrency = 8 >>>>> merOverlapperSeedConcurrency = 8 >>>>> ovlThreads = 2 >>>>> mbtThreads = 2 >>>>> mbtConcurrency = 8 >>>>> ovlConcurrency = 8 >>>>> ovlCorrConcurrency = 16 >>>>> ovlRefBlockSize = 32000000 >>>>> ovlHashBits = 24 >>>>> ovlHashBlockLength = 800000000 >>>>> ovlStoreMemory = 128000 >>>>> frgCorrThreads = 2 >>>>> frgCorrConcurrency = 8 >>>>> ovlCorrBatchSize = 1000000 >>>>> ovlCorrConcurrency = 16 >>>>> cnsConcurrency = 16 >>>>> doExtendClearRanges = 0 >>>>> >>>>> I don't need to have that unitig (1076523 (U)) in my finished >>>>> assembly, so it's possible to just remove it as long as I get a >>>>> finished assembly. 
I've also tried to just create the .success file, >>>>> but then terminator fails. >>>>> >>>>> Does anyone have any ideas of what I might do different? Can I just >>>>> remove that unitig and proceed? How do I do that? >>>>> >>>>> Sincerely, >>>>> Ole Kristian Tørresen >>>>> PhD student >>>>> University of Oslo >>>>> >>>>> --------------------------------------------------------------------------- >>>>> --- >>>>> Better than sec? Nothing is better than sec when it comes to >>>>> monitoring Big Data applications. Try Boundary one-second >>>>> resolution app monitoring today. Free. >>>>> http://p.sf.net/sfu/Boundary-dev2dev >>>>> _______________________________________________ >>>>> wgs-assembler-users mailing list >>>>> wgs...@li... >>>>> https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users >>>> > |
From: Christoph H. <chr...@gm...> - 2012-04-11 21:02:56
|
Dear CA developers and users, I am trying to use Celera Assembler 7.0 to assemble a medium-sized genome (about 100 Mb) using a combination of 454 and Illumina reads. I chose a bad combination of the ovlHashBits, ovlHashBlockLength and ovlThreads options, so my last run stopped on the cluster I am using after exceeding the memory limit in the overlaptrim step. I think I now know what the problem was, so my question is whether it is possible to resume runCA from any given stage. In my particular case I would like to resume from the 0-overlaptrim-overlap stage with altered ovlHashBits, ovlHashBlockLength and ovlThreads options. I want to avoid redoing the mercounts and initialtrim steps, because they seem to have worked fine. I read in the manual about using the /do*/ options to get a kind of /startBefore/ effect, but I can't seem to find any more details in the manual, so can you maybe help me out or point me to the required information on the web page? Thanks! Your help is highly appreciated! much obliged, Christoph Hahn PhD student University of Oslo |
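[Editor's sketch] runCA generally decides what to redo by checking which stage outputs already exist in the work directory, so the usual recipe is to delete only the failed stage's directory, correct the spec, and rerun the same command line. The sketch below uses assumed directory and spec names and example option values; it mimics a runCA work directory rather than operating on a real one.

```shell
# directory names mimic a runCA -d work directory; values are examples only
ASM=demo-assembly
mkdir -p "$ASM/0-mercounts" "$ASM/0-overlaptrim-overlap"

# remove only the stage to redo; finished stages (mer counting, initial
# trimming) keep their outputs and are skipped when runCA restarts
rm -rf "$ASM/0-overlaptrim-overlap"

# append corrected overlapper memory settings to the spec
cat >> "$ASM/asm.spec" <<'EOF'
ovlHashBits = 22
ovlHashBlockLength = 200000000
ovlThreads = 2
EOF

# then rerun with the original command line, e.g.:
# runCA -d demo-assembly -p asm -s demo-assembly/asm.spec
ls "$ASM"
```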