From: Langhorst, B. <Lan...@ne...> - 2015-03-20 20:15:09
|
Hi: I ran into a problem creating a gkpStore from 3 frg files (pointing to PE fastq files, each about 500M reads). It failed due to a memory allocation error when I tried to import all 3 at once, so I thought I'd try to append to the store like this:

$gatekeeper -o $store -T -F $frg_path/run15.frg
$gatekeeper -a -o $store -T -F $frg_path/run16.frg
$gatekeeper -a -o $store -T -F $frg_path/run17.frg

The first one succeeds, but the append fails immediately. Seems like the store is somehow marked read-only; I didn't expect that since the first command succeeded. Should appending to a store work? Should I try an older gatekeeper? Will that cause trouble later if I try to use 8.3 for the following steps? Here's the log:

Starting file '/mnt/galaxy/data/langhorst/deer_unitigs/run15.frg'.
Processing INNIE SANGER QV encoding reads from:
  '/mnt/ngswork/langhorst/deer_assembly/ovi_run15.1.fastq' and
  '/mnt/ngswork/langhorst/deer_assembly/ovi_run15.2.fastq'
GKP finished with 1 alerts or errors:
1 # LIB Alert: stddev too big for mean; reset stddev to 0.1 * mean.
Starting file '/mnt/galaxy/data/langhorst/deer_unitigs/run16.frg'.
gatekeeper: AS_PER_genericStore.C:425: int64 appendStringStore(StoreStruct*, char*, uint32): Assertion `s->readOnly == false' failed.
…
[0] /home/NEB/langhorst/wgs-8.3rc1/Linux-amd64/bin/gatekeeper::AS_UTL_catchCrash(int, siginfo*, void*) + 0x2a [0x42587a]
[1] /lib/x86_64-linux-gnu/libpthread.so.0::(null) + 0x10340 [0x7ffb82749340]
[2] /lib/x86_64-linux-gnu/libc.so.6::(null) + 0x39 [0x7ffb823aacc9]
[3] /lib/x86_64-linux-gnu/libc.so.6::(null) + 0x148 [0x7ffb823ae0d8]
[4] /lib/x86_64-linux-gnu/libc.so.6::(null) + 0x2fb86 [0x7ffb823a3b86]
[5] /lib/x86_64-linux-gnu/libc.so.6::(null) + 0x2fc32 [0x7ffb823a3c32]
[6] /home/NEB/langhorst/wgs-8.3rc1/Linux-amd64/bin/gatekeeper() [0x43129d]
[7] /home/NEB/langhorst/wgs-8.3rc1/Linux-amd64/bin/gatekeeper::gkStore::gkStore_addUID(char*) + 0x13f [0x436d6f]
[8] /home/NEB/langhorst/wgs-8.3rc1/Linux-amd64/bin/gatekeeper::AS_UID_load(char*) + 0x196 [0x4254b6]
[9] /home/NEB/langhorst/wgs-8.3rc1/Linux-amd64/bin/gatekeeper::GetUID(char*, _IO_FILE*) + 0x11 [0x4264d1]
[10] /home/NEB/langhorst/wgs-8.3rc1/Linux-amd64/bin/gatekeeper() [0x42f902]
[11] /home/NEB/langhorst/wgs-8.3rc1/Linux-amd64/bin/gatekeeper::ReadProtoMesg_AS(_IO_FILE*, GenericMesg**) + 0x4aa [0x42719a]
[12] /home/NEB/langhorst/wgs-8.3rc1/Linux-amd64/bin/gatekeeper::(null) + 0x5c1 [0x4087b1]
[13] /lib/x86_64-linux-gnu/libc.so.6::(null) + 0xf5 [0x7ffb82395ec5]
[14] /home/NEB/langhorst/wgs-8.3rc1/Linux-amd64/bin/gatekeeper::(null) + 0xf1 [0x406949] |
From: Serge K. <ser...@gm...> - 2015-03-20 19:47:37
|
Hi, I’d suggest starting with the PBcR wiki page which has examples and spec files: http://wgs-assembler.sourceforge.net/wiki/index.php/PBcR <http://wgs-assembler.sourceforge.net/wiki/index.php/PBcR> You can test those datasets to make sure the installation is working on your system. From your command it looks like you are correcting PacBio reads with Illumina reads. As documented on the PBcR wiki page this mode is no longer being updated and is significantly slower than using only PacBio data (which requires at least 30X coverage but 50X+ is best). If you have enough coverage, I’d recommend that approach instead. You could also try alternate tools to correct the PacBio data with Illumina (like ECTools, provread, LorDEC, etc). If you’d still like to use PBcR for Illumina-based correction, the wiki page documents setting parameters when you have SMRTportal installed/in your path: http://wgs-assembler.sourceforge.net/wiki/index.php/PBcR#Correcting_Large_.28.3E_100Mbp.29_Genomes_.28Using_high-identity_data_or_CA_8.1.29 <http://wgs-assembler.sourceforge.net/wiki/index.php/PBcR#Correcting_Large_.28.3E_100Mbp.29_Genomes_.28Using_high-identity_data_or_CA_8.1.29> or not: http://wgs-assembler.sourceforge.net/wiki/index.php/PBcR#Correcting_Large_.28.3E_100Mbp.29_Genomes_With_CA_7.0_or_older_.28not_recommended.29 <http://wgs-assembler.sourceforge.net/wiki/index.php/PBcR#Correcting_Large_.28.3E_100Mbp.29_Genomes_With_CA_7.0_or_older_.28not_recommended.29> That should help you configure your run. However, as I said, the correction with Illumina data will be significantly slower than self-correction of PacBio data and you should expect it to take a few hours on bacterial genomes. I’d also recommend not using more than 50X Illumina data as well. Sergey > On Mar 20, 2015, at 10:57 AM, Seth Munholland <mu...@uw...> wrote: > > Hello Everyone, > > I'm new to CA and I'm trying to use 8.3 to correct some PacBio reads. Installation went smoothly, but when it comes time to run I first hit the problem of a missing spec file. After some googling I found an example posted on the seqanswers forums (http://seqanswers.com/forums/archive/index.php/t-18478.html <http://seqanswers.com/forums/archive/index.php/t-18478.html>), however it's for Celera 7. I went through and compared the spec options to the options parameters that print at the start of the PBcR run and removed everything except the memory related entries. I wanted to see what the default values gave me before I tried tweaking things, but I have more memory available to me and I presumed the command line -threads option did the same thing as altering the spec values concerning threads. > > My spec file consisted of the following: > assemble = 0 > ovlMemory = 250 > merylMemory = 256000 > ovlStoreMemory = 256000 > > The command I ran it with was: > PBcR -threads 30 -libraryname PI440795_A08 -s PI440795.spec -fastq PI440795_A08.fastq Pacu1.frg Pacu2.frg > > Once I try to run it, however, I realize I've done something wrong. The PBcR run has been on OverlapInCore for hours at this point and is using ~5GB of RAM. > > The second problem I faced came when I tried using a smaller dataset to see if it was a size based issue and it moved through that stage within a day, and moved beyond the correction, but then it stalled on runPartition.sh, using ~10GB of RAM and taking ~1.5 hours per partition, while showing essentially no CPU usage. 
> > I've since come across the RunCA wiki page which outlines many of the spec options, and found that many the options I started with from the example spec file don't even exist anymore. Would anyone be able and willing to lend me a hand so I can properly configure my Celera pipeline to correct my PacBio reads please? > > Seth Munholland, B.Sc. > Department of Biological Sciences > Rm. 304 Biology Building > University of Windsor > 401 Sunset Ave. N9B 3P4 > T: (519) 253-3000 Ext: 4755 <>------------------------------------------------------------------------------ > Dive into the World of Parallel Programming The Go Parallel Website, sponsored > by Intel and developed in partnership with Slashdot Media, is your hub for all > things parallel software development, from weekly thought leadership blogs to > news, videos, case studies, tutorials and more. Take a look and join the > conversation now. http://goparallel.sourceforge.net/_______________________________________________ > wgs-assembler-users mailing list > wgs...@li... > https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users |
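For reference, a minimal sketch of the PacBio self-correction route Sergey recommends above; the library name, spec file, read file, and the genomeSize value are placeholders, and the exact option syntax should be checked against the PBcR wiki pages linked in his reply:

    # Hypothetical PBcR self-correction/assembly run (CA 8.3, PacBio-only input).
    # All names and the genome size below are made up for illustration;
    # genomeSize is assumed to be accepted as a trailing spec-style option.
    PBcR -threads 16 \
         -libraryname my_pacbio \
         -s selfcorrect.spec \
         -fastq pacbio_subreads.fastq \
         genomeSize=5000000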
From: Seth M. <mu...@uw...> - 2015-03-20 15:20:33
|
Hello Everyone, I'm new to CA and I'm trying to use 8.3 to correct some PacBio reads. Installation went smoothly, but when it comes time to run I first hit the problem of a missing spec file. After some googling I found an example posted on the seqanswers forums (http://seqanswers.com/forums/archive/index.php/t-18478.html), however it's for Celera 7. I went through and compared the spec options to the option parameters that print at the start of the PBcR run and removed everything except the memory-related entries. I wanted to see what the default values gave me before I tried tweaking things, but I have more memory available to me and I presumed the command line -threads option did the same thing as altering the spec values concerning threads. My spec file consisted of the following:

assemble = 0
ovlMemory = 250
merylMemory = 256000
ovlStoreMemory = 256000

The command I ran it with was:

PBcR -threads 30 -libraryname PI440795_A08 -s PI440795.spec -fastq PI440795_A08.fastq Pacu1.frg Pacu2.frg

Once I try to run it, however, I realize I've done something wrong. The PBcR run has been on OverlapInCore for hours at this point and is using ~5GB of RAM. The second problem I faced came when I tried using a smaller dataset to see if it was a size-based issue: it moved through that stage within a day and moved beyond the correction, but then it stalled on runPartition.sh, using ~10GB of RAM and taking ~1.5 hours per partition, while showing essentially no CPU usage. I've since come across the RunCA wiki page which outlines many of the spec options, and found that many of the options I started with from the example spec file don't even exist anymore. Would anyone be able and willing to lend me a hand so I can properly configure my Celera pipeline to correct my PacBio reads please? Seth Munholland, B.Sc. Department of Biological Sciences Rm. 304 Biology Building University of Windsor 401 Sunset Ave. N9B 3P4 T: (519) 253-3000 Ext: 4755 |
From: Takashi K. <tak...@gm...> - 2015-03-14 10:06:06
|
I will test both. Thanks guys!! 2015-03-14 12:07 GMT+09:00 Liu, Xinyue <xy...@so...>: > To run Consed while bypassing phd files you can run "consed -nophd". But > as Brian pointed out it might crash with big assemblies. Another assembly > viewer I used to run on CA assemblies is Hawkeye ( > http://amos.sourceforge.net/wiki/index.php?title=Hawkeye). > > Best, > Jerry > > ------------------------------ > *From:* Brian Walenz [th...@gm...] > *Sent:* Friday, March 13, 2015 10:13 PM > *To:* Takashi Koyama > *Cc:* wgs...@li... > *Subject:* Re: [wgs-assembler-users] Question about contaminant trimming > and referring quality value > > I'll be no help on visualization. Most of the things I've assembled > were too big for that. > > For trimming, the assembler will do a respectable job without cleaning up > vector. Reads with both vector and genomic sequence will look like > chimera, and will have the smaller portion removed. Reads of entire vector > will assemble together, and will need to be screened from the output. Mate > pairs across the junction (one read vector, one read genomic) are a > potential problem, and could confuse the scaffold graph enough to prevent > assembly. > > If you have high coverage Illumina, mapping to the ecoli/vector and > discarding the entire pair for any hit is the simplest and safest. Bowtie2 > will do this, but I forget the option. > > It's been a long time since I've had to clean up Sanger reads, and > hopefully I can continue forgetting how to do it. > > In all cases, your best bet is to hard trim -- remove the vector/ecoli > sequence from the reads -- before giving the reads to the assembler. Don't > pass in a clear range to the assembler, as it will probably ignore it. > > b > > > On Fri, Mar 13, 2015 at 9:18 AM, Takashi Koyama <tak...@gm...> > wrote: > >> Hello, I have two questions. >> >> First question. I now trying to assemble BAC clones. As BAC samples >> include BAC vector and some of E. coli genome, assemblers I've ever used >> refer vector and E. coli fasta sequences to trim them. However, I could not >> find an instruction to do that in celera assembly. >> Is it possible to refer some fasta files for trimming in CA? Or if >> impossible, could anyone tell me how I trim them in CA? >> >> Second question. I would like to see how good CA works by assembly >> viewer. I usually use Consed. I could get an ace file using ca2ace.pl >> and open it in Consed. However, Consed told me there is no quality files >> such as phd file. Could anyone tell me any solutions making Consed work >> smooth? >> >> Thank you for your kind helps. >> >> TK >> >> >> ------------------------------------------------------------------------------ >> Dive into the World of Parallel Programming The Go Parallel Website, >> sponsored >> by Intel and developed in partnership with Slashdot Media, is your hub >> for all >> things parallel software development, from weekly thought leadership >> blogs to >> news, videos, case studies, tutorials and more. Take a look and join the >> conversation now. http://goparallel.sourceforge.net/ >> _______________________________________________ >> wgs-assembler-users mailing list >> wgs...@li... >> https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users >> >> > |
From: Liu, X. <xy...@so...> - 2015-03-14 03:41:18
|
To run Consed while bypassing phd files you can run "consed -nophd". But as Brian pointed out it might crash with big assemblies. Another assembly viewer I used to run on CA assemblies is Hawkeye (http://amos.sourceforge.net/wiki/index.php?title=Hawkeye). Best, Jerry ________________________________ From: Brian Walenz [th...@gm...] Sent: Friday, March 13, 2015 10:13 PM To: Takashi Koyama Cc: wgs...@li... Subject: Re: [wgs-assembler-users] Question about contaminant trimming and referring quality value I'll be no help on visualization. Most of the things I've assembled were too big for that. For trimming, the assembler will do a respectable job without cleaning up vector. Reads with both vector and genomic sequence will look like chimera, and will have the smaller portion removed. Reads of entire vector will assemble together, and will need to be screened from the output. Mate pairs across the junction (one read vector, one read genomic) are a potential problem, and could confuse the scaffold graph enough to prevent assembly. If you have high coverage Illumina, mapping to the ecoli/vector and discarding the entire pair for any hit is the simplest and safest. Bowtie2 will do this, but I forget the option. It's been a long time since I've had to clean up Sanger reads, and hopefully I can continue forgetting how to do it. In all cases, your best bet is to hard trim -- remove the vector/ecoli sequence from the reads -- before giving the reads to the assembler. Don't pass in a clear range to the assembler, as it will probably ignore it. b On Fri, Mar 13, 2015 at 9:18 AM, Takashi Koyama <tak...@gm...<mailto:tak...@gm...>> wrote: Hello, I have two questions. First question. I now trying to assemble BAC clones. As BAC samples include BAC vector and some of E. coli genome, assemblers I've ever used refer vector and E. coli fasta sequences to trim them. However, I could not find an instruction to do that in celera assembly. Is it possible to refer some fasta files for trimming in CA? Or if impossible, could anyone tell me how I trim them in CA? Second question. I would like to see how good CA works by assembly viewer. I usually use Consed. I could get an ace file using ca2ace.pl<http://ca2ace.pl> and open it in Consed. However, Consed told me there is no quality files such as phd file. Could anyone tell me any solutions making Consed work smooth? Thank you for your kind helps. TK ------------------------------------------------------------------------------ Dive into the World of Parallel Programming The Go Parallel Website, sponsored by Intel and developed in partnership with Slashdot Media, is your hub for all things parallel software development, from weekly thought leadership blogs to news, videos, case studies, tutorials and more. Take a look and join the conversation now. http://goparallel.sourceforge.net/ _______________________________________________ wgs-assembler-users mailing list wgs...@li...<mailto:wgs...@li...> https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users |
From: Brian W. <th...@gm...> - 2015-03-14 02:14:04
|
I'll be no help on visualization. Most of the things I've assembled were too big for that. For trimming, the assembler will do a respectable job without cleaning up vector. Reads with both vector and genomic sequence will look like chimera, and will have the smaller portion removed. Reads of entire vector will assemble together, and will need to be screened from the output. Mate pairs across the junction (one read vector, one read genomic) are a potential problem, and could confuse the scaffold graph enough to prevent assembly. If you have high coverage Illumina, mapping to the ecoli/vector and discarding the entire pair for any hit is the simplest and safest. Bowtie2 will do this, but I forget the option. It's been a long time since I've had to clean up Sanger reads, and hopefully I can continue forgetting how to do it. In all cases, your best bet is to hard trim -- remove the vector/ecoli sequence from the reads -- before giving the reads to the assembler. Don't pass in a clear range to the assembler, as it will probably ignore it. b On Fri, Mar 13, 2015 at 9:18 AM, Takashi Koyama <tak...@gm...> wrote: > Hello, I have two questions. > > First question. I now trying to assemble BAC clones. As BAC samples > include BAC vector and some of E. coli genome, assemblers I've ever used > refer vector and E. coli fasta sequences to trim them. However, I could not > find an instruction to do that in celera assembly. > Is it possible to refer some fasta files for trimming in CA? Or if > impossible, could anyone tell me how I trim them in CA? > > Second question. I would like to see how good CA works by assembly viewer. > I usually use Consed. I could get an ace file using ca2ace.pl and open it > in Consed. However, Consed told me there is no quality files such as phd > file. Could anyone tell me any solutions making Consed work smooth? > > Thank you for your kind helps. > > TK > > > ------------------------------------------------------------------------------ > Dive into the World of Parallel Programming The Go Parallel Website, > sponsored > by Intel and developed in partnership with Slashdot Media, is your hub for > all > things parallel software development, from weekly thought leadership blogs > to > news, videos, case studies, tutorials and more. Take a look and join the > conversation now. http://goparallel.sourceforge.net/ > _______________________________________________ > wgs-assembler-users mailing list > wgs...@li... > https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users > > |
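A sketch of the pair-level contaminant screen Brian describes, using bowtie2; the index and file names are placeholders, and --un-conc (which writes out pairs that fail to align concordantly) seems to be the option he could not recall, though note it is slightly more permissive than dropping a pair on any single-mate hit:

    # Build an index from the BAC vector and E. coli sequences to screen against.
    bowtie2-build vector_and_ecoli.fasta contam
    # Keep only pairs that do NOT align concordantly to the contaminant index;
    # surviving pairs are written to clean_1.fastq / clean_2.fastq.
    bowtie2 -x contam -1 reads_1.fastq -2 reads_2.fastq \
            --un-conc clean_%.fastq -S /dev/null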
From: Takashi K. <tak...@gm...> - 2015-03-13 13:18:20
|
Hello, I have two questions. First question. I now trying to assemble BAC clones. As BAC samples include BAC vector and some of E. coli genome, assemblers I've ever used refer vector and E. coli fasta sequences to trim them. However, I could not find an instruction to do that in celera assembly. Is it possible to refer some fasta files for trimming in CA? Or if impossible, could anyone tell me how I trim them in CA? Second question. I would like to see how good CA works by assembly viewer. I usually use Consed. I could get an ace file using ca2ace.pl and open it in Consed. However, Consed told me there is no quality files such as phd file. Could anyone tell me any solutions making Consed work smooth? Thank you for your kind helps. TK |
From: Brian W. <th...@gm...> - 2015-03-13 04:13:41
|
Unfortunately, there is a fair bit of obsolete code in the assembler, and it appears you're trying to use some of it. 1) Don't use option vectorTrimmer=figaro. 2) Don't use option closureOverlaps. I'll stick these on my list for the next time I need to kill some time. Thanks for testing them. ;-) b On Thu, Mar 12, 2015 at 10:55 PM, Takashi Koyama <tak...@gm...> wrote: > Hello. > I recently started to work with wgs-8.3rc1 and have two problems in runCA. > It would be appreciated if anyone give me solutions. > > In first problem, I got an error when runCA run gatekeeper but error has > gone away if I retry runCA without any modification. > The error message is below: > ----------------------------------------START Fri Mar 13 11:39:14 2015 > /opt/bio/wgs-8.3rc1/Linux-amd64/bin/gatekeeper -dumpfastaseq -clear UNTRIM > /home/tkoyama/Documents/SquSex/seqs/SquSD_BACassembly/CeleraAssembler/163a23/assembly1/163a23_assembly1.gkpStore > 2> > /home/tkoyama/Documents/SquSex/seqs/SquSD_BACassembly/CeleraAssembler/163a23/assembly1/0-preoverlap/gatekeeper.err > > > /home/tkoyama/Documents/SquSex/seqs/SquSD_BACassembly/CeleraAssembler/163a23/assembly1/163a23_assembly1.fasta > sh: > /home/tkoyama/Documents/SquSex/seqs/SquSD_BACassembly/CeleraAssembler/163a23/assembly1/0-preoverlap/gatekeeper.err: > そのようなファイルやディレクトリはありません > ----------------------------------------END Fri Mar 13 11:39:14 2015 (0 > seconds) > ERROR: Failed with signal HUP (1) > > ================================================================================ > > runCA failed. > > ---------------------------------------- > Stack trace: > > at /usr/local/genome/bin/runCA line 1649. > main::caFailure("failed to dump gatekeeper store for figaro trimmer", > "/home/tkoyama/Documents/SquSex/seqs/SquSD_BACassembly/CeleraA"...) called > at /usr/local/genome/bin/runCA line 2971 > main::generateVectorTrim() called at /usr/local/genome/bin/runCA line > 1991 > > main::preoverlap("/home/tkoyama/Documents/SquSex/seqs/SquSD_BACassembly/CeleraA"..., > "/home/tkoyama/Documents/SquSex/seqs/SquSD_BACassembly/CeleraA"..., > "/home/tkoyama/Documents/SquSex/seqs/SquSD_BACassembly/CeleraA"..., > "/home/tkoyama/Documents/SquSex/seqs/SquSD_BACassembly/CeleraA"..., > "/home/tkoyama/Documents/SquSex/seqs/SquSD_BACassembly/CeleraA"..., > "/home/tkoyama/Documents/SquSex/seqs/SquSD_BACassembly/CeleraA"...) called > at /usr/local/genome/bin/runCA line 6551 > > ---------------------------------------- > Failure message: > > failed to dump gatekeeper store for figaro trimmer > > > > > In second problem, I got an error when runCA run overlapStoreBuild. The > error message is below: > ----------------------------------------START Fri Mar 13 11:47:29 2015 > /opt/bio/wgs-8.3rc1/Linux-amd64/bin/overlapStoreBuild -o > /home/tkoyama/Documents/SquSex/seqs/SquSD_BACassembly/CeleraAssembler/163a23/assembly1/163a23_assembly1.ovlStore.BUILDING > -g > /home/tkoyama/Documents/SquSex/seqs/SquSD_BACassembly/CeleraAssembler/163a23/assembly1/163a23_assembly1.gkpStore > -i 0 -M 24000 -L > /home/tkoyama/Documents/SquSex/seqs/SquSD_BACassembly/CeleraAssembler/163a23/assembly1/163a23_assembly1.ovlStore.list > > > /home/tkoyama/Documents/SquSex/seqs/SquSD_BACassembly/CeleraAssembler/163a23/assembly1/163a23_assembly1.ovlStore.err > 2>&1 > ----------------------------------------END Fri Mar 13 11:47:29 2015 (0 > seconds) > ERROR: Failed with signal HUP (1) > > ================================================================================ > > runCA failed. 
> > ---------------------------------------- > Stack trace: > > at /usr/local/genome/bin/runCA line 1649, <J> line 2. > main::caFailure("failed to create the overlap store", > "/home/tkoyama/Documents/SquSex/seqs/SquSD_BACassembly/CeleraA"...) called > at /usr/local/genome/bin/runCA line 3993 > main::createOverlapStore() called at /usr/local/genome/bin/runCA line > 6556 > > ---------------------------------------- > Last few lines of the relevant log file > (/home/tkoyama/Documents/SquSex/seqs/SquSD_BACassembly/CeleraAssembler/163a23/assembly1/163a23_assembly1.ovlStore.err): > > /opt/bio/wgs-8.3rc1/Linux-amd64/bin/overlapStoreBuild: unknown option '-i'. > usage: /opt/bio/wgs-8.3rc1/Linux-amd64/bin/overlapStoreBuild -o > asm.ovlStore -g asm.gkpStore [opts] [-L fileList | *.ovb.gz] > -o asm.ovlStore path to store to create > -g asm.gkpStore path to gkpStore for this assembly > > -F f use up to 'f' files for store creation > -M m use up to 'm' MB memory for store creation > > -plc t type of filtering for PLC fragments -- NOT > SUPPORTED > -obt filter overlaps for OBT > -dup filter overlaps for OBT/dedupe > > -e e filter overlaps above e fraction error > -L fileList read input filenames from 'flieList' > > -big iid handle a large number of overlaps in the last > library > iid is the first read iid in the last library, from > 'gatekeeper -dumpinfo *gkpStore' > > ---------------------------------------- > Failure message: > > failed to create the overlap store > > > ------------------------------------------------------------------------------ > Dive into the World of Parallel Programming The Go Parallel Website, > sponsored > by Intel and developed in partnership with Slashdot Media, is your hub for > all > things parallel software development, from weekly thought leadership blogs > to > news, videos, case studies, tutorials and more. Take a look and join the > conversation now. http://goparallel.sourceforge.net/ > _______________________________________________ > wgs-assembler-users mailing list > wgs...@li... > https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users > > |
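In spec-file terms, the two fixes above amount to deleting (or commenting out) the corresponding entries; a hypothetical fragment of the offending spec:

    # Obsolete in wgs-8.3rc1 -- remove both lines and let runCA fall back to its defaults:
    # vectorTrimmer   = figaro
    # closureOverlaps = <...>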
From: Takashi K. <tak...@gm...> - 2015-03-13 02:55:12
|
Hello. I recently started to work with wgs-8.3rc1 and have two problems in runCA. It would be appreciated if anyone could give me solutions.

For the first problem, I got an error when runCA ran gatekeeper, but the error went away when I retried runCA without any modification. The error message is below:

----------------------------------------START Fri Mar 13 11:39:14 2015
/opt/bio/wgs-8.3rc1/Linux-amd64/bin/gatekeeper -dumpfastaseq -clear UNTRIM /home/tkoyama/Documents/SquSex/seqs/SquSD_BACassembly/CeleraAssembler/163a23/assembly1/163a23_assembly1.gkpStore 2> /home/tkoyama/Documents/SquSex/seqs/SquSD_BACassembly/CeleraAssembler/163a23/assembly1/0-preoverlap/gatekeeper.err > /home/tkoyama/Documents/SquSex/seqs/SquSD_BACassembly/CeleraAssembler/163a23/assembly1/163a23_assembly1.fasta
sh: /home/tkoyama/Documents/SquSex/seqs/SquSD_BACassembly/CeleraAssembler/163a23/assembly1/0-preoverlap/gatekeeper.err: No such file or directory
----------------------------------------END Fri Mar 13 11:39:14 2015 (0 seconds)
ERROR: Failed with signal HUP (1)
================================================================================
runCA failed.
----------------------------------------
Stack trace:
at /usr/local/genome/bin/runCA line 1649.
main::caFailure("failed to dump gatekeeper store for figaro trimmer", "/home/tkoyama/Documents/SquSex/seqs/SquSD_BACassembly/CeleraA"...) called at /usr/local/genome/bin/runCA line 2971
main::generateVectorTrim() called at /usr/local/genome/bin/runCA line 1991
main::preoverlap("/home/tkoyama/Documents/SquSex/seqs/SquSD_BACassembly/CeleraA"..., "/home/tkoyama/Documents/SquSex/seqs/SquSD_BACassembly/CeleraA"..., "/home/tkoyama/Documents/SquSex/seqs/SquSD_BACassembly/CeleraA"..., "/home/tkoyama/Documents/SquSex/seqs/SquSD_BACassembly/CeleraA"..., "/home/tkoyama/Documents/SquSex/seqs/SquSD_BACassembly/CeleraA"..., "/home/tkoyama/Documents/SquSex/seqs/SquSD_BACassembly/CeleraA"...) called at /usr/local/genome/bin/runCA line 6551
----------------------------------------
Failure message:
failed to dump gatekeeper store for figaro trimmer

For the second problem, I got an error when runCA ran overlapStoreBuild. The error message is below:

----------------------------------------START Fri Mar 13 11:47:29 2015
/opt/bio/wgs-8.3rc1/Linux-amd64/bin/overlapStoreBuild -o /home/tkoyama/Documents/SquSex/seqs/SquSD_BACassembly/CeleraAssembler/163a23/assembly1/163a23_assembly1.ovlStore.BUILDING -g /home/tkoyama/Documents/SquSex/seqs/SquSD_BACassembly/CeleraAssembler/163a23/assembly1/163a23_assembly1.gkpStore -i 0 -M 24000 -L /home/tkoyama/Documents/SquSex/seqs/SquSD_BACassembly/CeleraAssembler/163a23/assembly1/163a23_assembly1.ovlStore.list > /home/tkoyama/Documents/SquSex/seqs/SquSD_BACassembly/CeleraAssembler/163a23/assembly1/163a23_assembly1.ovlStore.err 2>&1
----------------------------------------END Fri Mar 13 11:47:29 2015 (0 seconds)
ERROR: Failed with signal HUP (1)
================================================================================
runCA failed.
----------------------------------------
Stack trace:
at /usr/local/genome/bin/runCA line 1649, <J> line 2.
main::caFailure("failed to create the overlap store", "/home/tkoyama/Documents/SquSex/seqs/SquSD_BACassembly/CeleraA"...) called at /usr/local/genome/bin/runCA line 3993
main::createOverlapStore() called at /usr/local/genome/bin/runCA line 6556
----------------------------------------
Last few lines of the relevant log file (/home/tkoyama/Documents/SquSex/seqs/SquSD_BACassembly/CeleraAssembler/163a23/assembly1/163a23_assembly1.ovlStore.err):

/opt/bio/wgs-8.3rc1/Linux-amd64/bin/overlapStoreBuild: unknown option '-i'.
usage: /opt/bio/wgs-8.3rc1/Linux-amd64/bin/overlapStoreBuild -o asm.ovlStore -g asm.gkpStore [opts] [-L fileList | *.ovb.gz]
  -o asm.ovlStore   path to store to create
  -g asm.gkpStore   path to gkpStore for this assembly
  -F f              use up to 'f' files for store creation
  -M m              use up to 'm' MB memory for store creation
  -plc t            type of filtering for PLC fragments -- NOT SUPPORTED
  -obt              filter overlaps for OBT
  -dup              filter overlaps for OBT/dedupe
  -e e              filter overlaps above e fraction error
  -L fileList       read input filenames from 'flieList'
  -big iid          handle a large number of overlaps in the last library
                    iid is the first read iid in the last library, from 'gatekeeper -dumpinfo *gkpStore'
----------------------------------------
Failure message:
failed to create the overlap store |
From: Brian W. <th...@gm...> - 2015-03-06 22:37:27
|
Restatement: you want to assemble three BACs where the ends of them share something artificial that shouldn't be assembled across BACs. Correct? Add the kmers in the vector that shouldn't be assembled to the nmers.fasta in 0-mercounts. The overlapper will not seed overlaps with these kmers, but will extend overlaps into them. For two reads [vector][seq] and [vector][seq], the overlap will be seeded from a kmer in [seq], and the overlap will cover both reads entirely. Reads [vector] and [vector][seq] will share only kmers in nmers.fasta which will be ignored. To get the kmers, build a fasta of all the vector sequences from all the reads, and run meryl as the assembler does (IMPORTANT: with the -C flag). Append these to the nmers.fasta, or use only these kmers (with option ovlFrequentMers) and seed off of all overlaps in the BAC sequences (if your pool is small). An alternative -- but a pita to do -- would be to filter the overlaps to remove any vector-vector overlaps you don't want to assemble together. To do this, the ovlStore need to be dumped, then filtered, then rebuilt. We can't edit an overlap store to mark overlaps as 'don't use'. The filtering can probably be done based only on read id, so easy to do from the dumps. b On Fri, Mar 6, 2015 at 4:49 PM, mathog <ma...@ca...> wrote: > We have some data that consists of reads (Sanger) from pooled BACs. > Let's say for the sake of illustration that there are 3 BACs in each > pool and let's look at the 5' end of the insert. There will be 3 > classes of reads that look like: > > [vector][seq1] > [vector][seq2] > [vector][seq3] > > where vector is the BAC vector, not the sequencing vector, and where of > course the amount of sequence one each side of the junction will vary > from read to read. > > It is important to keep track of these end sequences. Is that possible > with this assembler? > > One option is to note in a file somewhere that these reads are ends, and > cut off the vector ahead of time. A problem with that is that there > isn't a huge amount of data in hand and some of the remaining pieces > will be small, so they will be dropped from the assembly. That is, it > may cause an "edge effect" which would most likely cause many bases to > be lost from each end, even if the rest of the assembly works. One > would also need to tell the assembler somehow that these are ends, so it > doesn't mistakenly assemble things on the other side if the sequence at > the junction happens to be repetitive. (Is there a way to mark an input > sequence like that?) Finally, one would need to be able to map whatever > name the assembler uses internally for the reads back to the ones in the > saved file. > > The other option is to leave the vector in, but that will result in a > forked structure when the vector sequences line up during overlap, and > the assembler will cut off the vector at the base of the fork anyway. > Which goes right back to the first case. Unless there is some way to > give the assembler the BAC vector and then have it "do the right thing" > by not cutting the forked structure at the junction, but instead > splitting it into the 3 classes. Is there a way to tell wgs to do that? > > Thanks, > > David Mathog > ma...@ca... 
> Manager, Sequence Analysis Facility, Biology Division, Caltech > > > ------------------------------------------------------------------------------ > Dive into the World of Parallel Programming The Go Parallel Website, > sponsored > by Intel and developed in partnership with Slashdot Media, is your hub for > all > things parallel software development, from weekly thought leadership blogs > to > news, videos, case studies, tutorials and more. Take a look and join the > conversation now. http://goparallel.sourceforge.net/ > _______________________________________________ > wgs-assembler-users mailing list > wgs...@li... > https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users > |
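A rough sketch of the meryl step described above; the mer size, file names, and the exact name of the nmers file under 0-mercounts are assumptions and should be matched to what the assembler actually used:

    # Count canonical mers (-C, as the assembler does) in the vector sequence.
    meryl -B -C -m 22 -s vector.fasta -o vector-mers
    # Dump every mer seen at least once as fasta and append to the frequent-mer
    # list the overlapper reads (check 0-mercounts/ for the real file name).
    meryl -Dt -n 1 -s vector-mers >> 0-mercounts/asm.nmers.ovl.fasta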
From: mathog <ma...@ca...> - 2015-03-06 21:49:43
|
We have some data that consists of reads (Sanger) from pooled BACs. Let's say for the sake of illustration that there are 3 BACs in each pool and let's look at the 5' end of the insert. There will be 3 classes of reads that look like: [vector][seq1] [vector][seq2] [vector][seq3] where vector is the BAC vector, not the sequencing vector, and where of course the amount of sequence one each side of the junction will vary from read to read. It is important to keep track of these end sequences. Is that possible with this assembler? One option is to note in a file somewhere that these reads are ends, and cut off the vector ahead of time. A problem with that is that there isn't a huge amount of data in hand and some of the remaining pieces will be small, so they will be dropped from the assembly. That is, it may cause an "edge effect" which would most likely cause many bases to be lost from each end, even if the rest of the assembly works. One would also need to tell the assembler somehow that these are ends, so it doesn't mistakenly assemble things on the other side if the sequence at the junction happens to be repetitive. (Is there a way to mark an input sequence like that?) Finally, one would need to be able to map whatever name the assembler uses internally for the reads back to the ones in the saved file. The other option is to leave the vector in, but that will result in a forked structure when the vector sequences line up during overlap, and the assembler will cut off the vector at the base of the fork anyway. Which goes right back to the first case. Unless there is some way to give the assembler the BAC vector and then have it "do the right thing" by not cutting the forked structure at the junction, but instead splitting it into the 3 classes. Is there a way to tell wgs to do that? Thanks, David Mathog ma...@ca... Manager, Sequence Analysis Facility, Biology Division, Caltech |
From: Waldbieser, G. <Geo...@AR...> - 2015-01-27 02:12:45
|
MHAP in wgs-8.2 can utilize PacBio fastq files (extracted from .hd5 files) to self-correct and assemble. Is there a way to feed the pairwise alignments into quiver to polish the assembly? Is there a better method for polishing? Geoff ________________________________ Geoffrey C. Waldbieser Research Molecular Biologist USDA, ARS, Warmwater Aquaculture Research Unit 141 Experiment Station Road Stoneville, MS 38776 This electronic message contains information generated by the USDA solely for the intended recipients. Any unauthorized interception of this message or the use or disclosure of the information it contains may violate the law and subject the violator to civil or criminal penalties. If you believe you have received this message in error, please notify the sender and delete the email immediately. |
From: Walenz, B. <wa...@nb...> - 2015-01-23 22:04:26
|
You can generate the fasta and posmap outputs with: asmOutputFasta -p a6 < a6.asm buildPosMap -o a6 -g a6.gkpStore < a6.asm I'm worried that the asm file is incomplete. It's possible that an incomplete asm has no data that these two programs would report, resulting in no output files. If the asm file is incomplete, move 9-terminator someplace safe (like 9-terminator-old) and restart runCA. It will recreate outputs. Check disk space! The caqc script died because it relies on perl module Statistics/Descriptive. Caqc computes the stats that are in the *.qc file. b ________________________________ From: Miguel Grau [mi...@uj...] Sent: Friday, January 23, 2015 3:03 AM To: wgs...@li... Subject: [wgs-assembler-users] Missing output fasta files and information from qc file, . Dear all, I have finished my assembly using wgs 8.2 assembler but in my ouput folder, I only have the .asm and the .qc file, without fasta files. I didn't have any error during all the process apart of this one in the last step: .... ----------------------------------------START Fri Jan 23 10:25:04 2015 /usr/bin/env perl /miquel/wgs-8.2/Linux-amd64/bin/caqc.pl -euid /reads/a6/9-terminator/a6.asm Can't locate Statistics/Descriptive.pm in @INC (@INC contains: /usr/local/lib64/perl5 /usr/local/share/perl5 /usr/lib64/perl5/vendor_perl /usr/share/perl5/vendor_perl /usr/lib64/perl5 /usr/share/perl5 .) at /miquel/wgs-8.2/Linux-amd64/bin/caqc.pl line 18. BEGIN failed--compilation aborted at /miquel/wgs-8.2/Linux-amd64/bin/caqc.pl line 18. ----------------------------------------END Fri Jan 23 10:25:04 2015 (0 seconds) ERROR: Failed with signal INT (2) The Cleaner has arrived. Doing 'none'. And the first lines from my qc file: [Unitig Consensus] NumColumnsInUnitigs=5385950647 NumGapsInUnitigs=378938 NumRunsOfGapsInUnitigReads=13067488 [Contig Consensus] NumColumnsInUnitigs=312297292 NumGapsInUnitigs=398693 NumRunsOfGapsInUnitigReads=10707275 NumColumnsInContigs=312248428 NumGapsInContigs=347998 NumRunsOfGapsInContigReads=9673938 NumAAMismatches=171477362 NumVARStringsWithFlankingGaps=80202 [Read Depth Histogram] d < 3Kbp < 10Kbp < 1Mbp < inf 0 162862 601904 619465 0 1 136303 189622 163968 0 ... ... Thank you, Miquel |
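For the caqc failure specifically, installing the missing Perl module and re-running the script by hand should regenerate the QC report; a sketch using the paths from the log above:

    # Install the module caqc.pl depends on (use cpanm or a system package if preferred).
    cpan Statistics::Descriptive
    # Re-run the QC script against the finished .asm file.
    perl /miquel/wgs-8.2/Linux-amd64/bin/caqc.pl -euid /reads/a6/9-terminator/a6.asm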
From: Miguel G. <mi...@uj...> - 2015-01-23 08:04:08
|
Dear all, I have finished my assembly using the wgs 8.2 assembler, but in my output folder I only have the .asm and the .qc file, without fasta files. I didn't have any errors during the whole process apart from this one in the last step:

----------------------------------------START Fri Jan 23 10:25:04 2015
/usr/bin/env perl /miquel/wgs-8.2/Linux-amd64/bin/caqc.pl -euid /reads/a6/9-terminator/a6.asm
Can't locate Statistics/Descriptive.pm in @INC (@INC contains: /usr/local/lib64/perl5 /usr/local/share/perl5 /usr/lib64/perl5/vendor_perl /usr/share/perl5/vendor_perl /usr/lib64/perl5 /usr/share/perl5 .) at /miquel/wgs-8.2/Linux-amd64/bin/caqc.pl line 18.
BEGIN failed--compilation aborted at /miquel/wgs-8.2/Linux-amd64/bin/caqc.pl line 18.
----------------------------------------END Fri Jan 23 10:25:04 2015 (0 seconds)
ERROR: Failed with signal INT (2)
The Cleaner has arrived. Doing 'none'.

And the first lines from my qc file:

[Unitig Consensus]
NumColumnsInUnitigs=5385950647
NumGapsInUnitigs=378938
NumRunsOfGapsInUnitigReads=13067488

[Contig Consensus]
NumColumnsInUnitigs=312297292
NumGapsInUnitigs=398693
NumRunsOfGapsInUnitigReads=10707275
NumColumnsInContigs=312248428
NumGapsInContigs=347998
NumRunsOfGapsInContigReads=9673938
NumAAMismatches=171477362
NumVARStringsWithFlankingGaps=80202

[Read Depth Histogram]
d < 3Kbp < 10Kbp < 1Mbp < inf
0 162862 601904 619465 0
1 136303 189622 163968 0
...

Thank you, Miquel |
From: Walenz, B. <wa...@nb...> - 2015-01-22 22:58:00
|
Unfortunately there isn’t anything as nice as ‘run X jobs for stage Y’. For overlaps, the hash parameters generally set how much memory each job will use, and the ref parameters control how long each job will run – and indirectly, how many jobs are generated. There is, of course, some dependence on the number of jobs on the hash parameters. I tend to find a set of those that work well on my data type (Illumina, 454, etc) and my hardware, then leave them alone and use the ref size to tune the number of jobs. A trick I use here is to “useGrid=1 scriptOnGrid=0” to get up to the overlap stage. This will set up the overlap compute but not launch it. You can then see how many jobs, and maybe run a few to see how much memory they use. To reconfigure, remove the overlap.sh script and rerun runCA. All it will do is recompute the job partitioning, make a new script, and tell you to run it. For consensus, cnsMinFrags and cnsPartitions control the number of jobs. it will try to make cnsPartitions jobs, unless there are fewer than cnsMinFrags in a job, in which case, there will be fewer jobs. Going above, iirc, cnsPartitions about 200 can cause the ‘too many open files’ error. Unlike overlaps, the partitioning here is set at the output of unitigger, and to change, unitigs must be recomputed. b From: Miguel Grau [mailto:mi...@uj...] Sent: Wednesday, January 21, 2015 8:37 PM To: Brian Walenz Cc: wgs...@li... Subject: Re: [wgs-assembler-users] Configuration SGE Hi Brian, There is some way to select the number of trim jobs running on sge mode? I mean, if I select the sge mode, for some steps (utg for example) the main job is trimmed in several small jobs (130 in my case). There are some parameters to set, for example, 1000 trim small jobs? Or the only way is to play with ovlHashBits & ovlHashBlockLength parameters? Thanks for your help, Miquel On 2015年01月20日 12:00, Brian Walenz wrote: Definitely better with smaller values! NOTE! You need to keep ovlThreads and the sge -pe thread values the same. As you have it now, you've told overlapper to run with 6 threads, but only requested 2 from SGE. This value is totally up to you, whatever works at your site. If possible, run a job or two by hand before submitting to the grid (sh overlap.sh <jobNum>). This will report stats on the hash table usage, and let you see memory usage and run time. If not possible, check the log files in the overlap directory as it runs. You want to check that the hash table isn't totally empty (a load less than 50%). If it is, increase the hash block length or decrease the bits. The other side (too full) isn't really a problem - it'll just do multiple passes to compute all the overlaps. b On Mon, Jan 19, 2015 at 9:17 PM, Miguel Grau <mi...@uj...<mailto:mi...@uj...>> wrote: @Ludovic. virtual_free and h_vmem are mandatory to work in our cluster. Thanks for the answer. @Brian. 
I increased these values because my batch of fastq files has around 40Gb so I thought I had to use (following the ovlHashBits table from here<http://wgs-assembler.sourceforge.net/wiki/index.php/RunCA>, if I want to use 2 threads on sge): ovlHashBits = 27 ovlHashBlockLength = 260000000 ovlRefBlockSize = 7630000 ovlThreads = 2 sge = -pe thread 2 -l h_vmem=50G Instead this, it works better if I decrease the ovlHasBits & ovlHashBlockLength values and increase the ovlRefBlockSize & ovlThreads values?: ovlHashBits = 25 ovlHashBlockLength = 240000000 ovlRefBlockSize = 18000000 ovlThreads = 6 sge = -pe thread 2 -l h_vmem=50G Thanks for your help, Miquel On 2015年01月20日 00:29, Brian Walenz wrote: I've never seen large overlap jobs perform better than small jobs. Target an 8gb job with ~4 CPUs each. My default configuration is: ovlHashBits = 22 ovlHashBlockLength = 200000000 ovlRefBlockSize = 18000000 ovlThreads = 6 The two 'hash' sizes control how big the job is. The 'ref block size' controls how many reads are processed by each job, i.e., how long the job runs. b On Mon, Jan 19, 2015 at 5:10 AM, Ludovic Mallet <lud...@un...<mailto:lud...@un...>> wrote: Hi, Not the best expert, but to me, virtual_free allow the job to swap, which you should try to avoid. and I think h_vmem is the hard limit, so the job would be killed whenever the line is crossed. from http://gridengine.eu/grid-engine-internals "hard limitation: All processes of the job combined are limited from the Linux kernel that they are able to use only the requested amount of memory. Further malloc() calls will fail." whether h_vmem is hard by default if GE has to be checked again, but I'd rather use mem_free instead Best, ludovic On 19/01/15 02:22, Miguel Grau wrote: > Dear all, > > I am having some troubles to config wgs 8.2 assembler with SGE options. > I always get a malloc memory error and I am not sure why. I am working > with 3 paired fastq files (6 files in total) with 100b length reads (15 > million reads in each fastq file). My config file: > > useGrid = 1 > scriptOnGrid = 1 > > sge = -A assembly > sgeMerTrim = -l h_vmem=150G -l virtual_free=150G > sgeScript = -l h_vmem=50G -l virtual_free=50G > sgeOverlap = -l h_vmem=100G -l virtual_free=100G > sgeMerOverlapSeed = -l h_vmem=100G -l virtual_free=100G > sgeMerOverlapExtend = -l h_vmem=100G -l virtual_free=100G > sgeConsensus = -l h_vmem=100G -l virtual_free=100G > sgeFragmentCorrection = -l h_vmem=100G -l virtual_free=100G > sgeOverlapCorrection = -l h_vmem=100G -l virtual_free=100G > > overlapper = ovl #Best for illumina > unitigger = bogart #Best for illumina > > #For 50GB... > ovlHashBits = 28 > ovlHashBlockLength = 480000000 > #100Gb for overlap > ovlStoreMemory=102400 > > ovlThreads = 2 > ovlRefBlockSize = 7630000 > frgCorrBatchSize = 1000000 > frgCorrThreads = 8 > > The error that I have now is: > > ------------------------------------------------------------------------------ > bucketizing /reads/a6/0-overlaptrim-overlap/001/000278.ovb.gz > bucketizing /reads/a6/0-overlaptrim-overlap/001/000276.ovb.gz > bucketizing /reads/a6/0-overlaptrim-overlap/001/000275.ovb.gz > bucketizing /reads/a6/0-overlaptrim-overlap/001/000280.ovb.gz > bucketizing DONE! 
> overlaps skipped: > 1211882406 OBT - low quality > 0 DUP - non-duplicate overlap > 0 DUP - different library > 0 DUP - dedup not requested > terminate called after throwing an instance of 'std::bad_alloc' > what(): std::bad_alloc > > Failed with 'Aborted' > > Backtrace (mangled): > > /miquel/wgs-8.2/Linux-amd64/bin/overlapStoreBuild(_Z17AS_UTL_catchCrashiP7siginfoPv+0x27)[0x40a697] > /lib64/libpthread.so.0[0x3ff1c0f710] > /lib64/libc.so.6(gsignal+0x35)[0x3ff1432925] > /lib64/libc.so.6(abort+0x175)[0x3ff1434105] > .... > ---------------------------------------------------------------------------------- > > Some idea for the best config? > > Thank you, > > > Miguel > > > > > > > ------------------------------------------------------------------------------ > New Year. New Location. New Benefits. New Data Center in Ashburn, VA. > GigeNET is offering a free month of service with a new server in Ashburn. > Choose from 2 high performing configs, both with 100TB of bandwidth. > Higher redundancy.Lower latency.Increased capacity.Completely compliant. > http://p.sf.net/sfu/gigenet > _______________________________________________ > wgs-assembler-users mailing list > wgs...@li...<mailto:wgs...@li...> > https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users ------------------------------------------------------------------------------ New Year. New Location. New Benefits. New Data Center in Ashburn, VA. GigeNET is offering a free month of service with a new server in Ashburn. Choose from 2 high performing configs, both with 100TB of bandwidth. Higher redundancy.Lower latency.Increased capacity.Completely compliant. http://p.sf.net/sfu/gigenet _______________________________________________ wgs-assembler-users mailing list wgs...@li...<mailto:wgs...@li...> https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users ------------------------------------------------------------------------------ New Year. New Location. New Benefits. New Data Center in Ashburn, VA. GigeNET is offering a free month of service with a new server in Ashburn. Choose from 2 high performing configs, both with 100TB of bandwidth. Higher redundancy.Lower latency.Increased capacity.Completely compliant. http://p.sf.net/sfu/gigenet _______________________________________________ wgs-assembler-users mailing list wgs...@li...<mailto:wgs...@li...> https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users |
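Collecting Brian's suggestions into spec form; the overlap values are the defaults he quotes earlier in this thread and the consensus values are runCA's usual defaults, so treat them as a starting point rather than a recommendation for any particular dataset:

    # Configure the overlap jobs but do not launch them, so the partitioning can be inspected;
    # to re-partition, delete the generated overlap.sh and re-run runCA.
    useGrid      = 1
    scriptOnGrid = 0
    # Hash parameters set per-job memory; the ref block size sets run time and job count.
    ovlHashBits        = 22
    ovlHashBlockLength = 200000000
    ovlRefBlockSize    = 18000000
    ovlThreads         = 6
    # Consensus partitioning (keeping partitions below ~200 avoids 'too many open files').
    cnsPartitions = 128
    cnsMinFrags   = 75000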
From: Walenz, B. <wa...@nb...> - 2015-01-22 22:39:51
|
Thanks, Sergey! David - Sorry for being vague with the wildcards. You figured it out correctly. For bogart, the screwed up mate pairs won't matter. It uses mates to decide if it crossed a repeat correctly, but not to break/construct unitigs. If you can't get much better unitigs, you can cross CA off your list (if you haven't already). Someone should compile a list of genomes that can't be assembled. I've been on a few that failed to assemble no matter what awful hacks we tried. -----Original Message----- From: Serge Koren [mailto:ser...@gm...] Sent: Wednesday, January 21, 2015 4:16 PM To: mathog Cc: wgs...@li... Subject: Re: [wgs-assembler-users] S. purpuratus parameters In reply to your earlier error with bogart (bogart -G copygkpStore -O ..ovlStore -T e10.tigStore -o test.bogart \ -eg 0.10 -Eg $VAL -em 0.10 -Em $VAL 2>&1 | tee bogart_25.log). When writing output, bogart will split the resulting contigs into multiple partitions for consensus to be computed in parallel. Since you’re not setting this option, it will put one contig per partition and thus you are running out of file pointers. If you add -B 75000 that should fix your issue (where -B specifies # sequences per partition). If you have 10M sequences this will mean about 130 partitions. You can adjust the 75000 up to ensure you end up with less than 1024 partitions and fit into your open file limit. As far as specifying innie/outtie, I don’t think the classification does not run automatically. It has to be enabled with the runCA parameters dncMPlibraries and dncBBlibraries, where dncMPlibraries is the list of your mate-pair libraries while dncBBlibraries are the paired-end libraries. I think it will corrected the innie/outtie designation for any libraries listed in the dncMPlibraries list. Brian would know better so he can correct me. I’d also second Brian’s suggestion to run MaSuRCA or another assembler like ALLPATHS-LG (if you have the required libraries) for an Illumina dataset. Celera Assembler is not well-designed to handle Illumina datasets and there are other/faster options available for assembly. Sergey > On Jan 21, 2015, at 3:45 PM, mathog <ma...@ca...> wrote: > > Three of the four Illumina data sets used are "Nextera Mate Pair > Reads", but the frg files for those differ from the other one only in > library and file names, and this: > > < mea:1000.000 > < std:100.000 > --- >> mea:3000.000 >> std:300.000 > > These mate pair reads were described as "innie" in the frg files, but > with respect to the source DNA, I'm thinking now they probably should > have been "outie". Or maybe not, since there is supposed to be code > to detect and handle these: > > http://wgs-assembler.sourceforge.net/wiki/index.php/Pair_classificatio > n_within_Illumina_mate_pair_data > > Does the analysis described in the preceding link occur automatically > for Illumina data, or is something special needed to turn it on? > > Thanks, > > David Mathog > ma...@ca... > Manager, Sequence Analysis Facility, Biology Division, Caltech > > ---------------------------------------------------------------------- > -------- New Year. New Location. New Benefits. New Data Center in > Ashburn, VA. > GigeNET is offering a free month of service with a new server in Ashburn. > Choose from 2 high performing configs, both with 100TB of bandwidth. > Higher redundancy.Lower latency.Increased capacity.Completely compliant. > http://p.sf.net/sfu/gigenet > _______________________________________________ > wgs-assembler-users mailing list > wgs...@li... 
> https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users ------------------------------------------------------------------------------ New Year. New Location. New Benefits. New Data Center in Ashburn, VA. GigeNET is offering a free month of service with a new server in Ashburn. Choose from 2 high performing configs, both with 100TB of bandwidth. Higher redundancy.Lower latency.Increased capacity.Completely compliant. http://p.sf.net/sfu/gigenet _______________________________________________ wgs-assembler-users mailing list wgs...@li... https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users |
From: Miguel G. <mi...@uj...> - 2015-01-22 01:37:47
|
Hi Brian, There is some way to select the number of trim jobs running on sge mode? I mean, if I select the sge mode, for some steps (utg for example) the main job is trimmed in several small jobs (130 in my case). There are some parameters to set, for example, 1000 trim small jobs? Or the only way is to play with ovlHashBits & ovlHashBlockLength parameters? Thanks for your help, Miquel On 2015年01月20日 12:00, Brian Walenz wrote: > Definitely better with smaller values! > > NOTE! You need to keep ovlThreads and the sge -pe thread values the > same. As you have it now, you've told overlapper to run with 6 > threads, but only requested 2 from SGE. This value is totally up to > you, whatever works at your site. > > If possible, run a job or two by hand before submitting to the grid > (sh overlap.sh <jobNum>). This will report stats on the hash table > usage, and let you see memory usage and run time. If not possible, > check the log files in the overlap directory as it runs. You want to > check that the hash table isn't totally empty (a load less than 50%). > If it is, increase the hash block length or decrease the bits. The > other side (too full) isn't really a problem - it'll just do multiple > passes to compute all the overlaps. > > b > > > On Mon, Jan 19, 2015 at 9:17 PM, Miguel Grau <mi...@uj... > <mailto:mi...@uj...>> wrote: > > @Ludovic. virtual_free and h_vmem are mandatory to work in our > cluster. Thanks for the answer. > > @Brian. I increased these values because my batch of fastq files > has around 40Gb so I thought I had to use (following the > ovlHashBits table from here > <http://wgs-assembler.sourceforge.net/wiki/index.php/RunCA>, if I > want to use 2 threads on sge): > > ovlHashBits = 27 > ovlHashBlockLength = 260000000 > ovlRefBlockSize = 7630000 > ovlThreads = 2 > sge = -pe thread 2 -l h_vmem=50G > > Instead this, it works better if I decrease the ovlHasBits & > ovlHashBlockLength values and increase the ovlRefBlockSize & > ovlThreads values?: > > ovlHashBits = 25 > ovlHashBlockLength = 240000000 > ovlRefBlockSize = 18000000 > ovlThreads = 6 > sge = -pe thread 2 -l h_vmem=50G > > Thanks for your help, > > Miquel > > > > > > On 2015年01月20日 00:29, Brian Walenz wrote: >> I've never seen large overlap jobs perform better than small >> jobs. Target an 8gb job with ~4 CPUs each. My default >> configuration is: >> >> ovlHashBits = 22 >> ovlHashBlockLength = 200000000 >> ovlRefBlockSize = 18000000 >> ovlThreads = 6 >> >> The two 'hash' sizes control how big the job is. The 'ref block >> size' controls how many reads are processed by each job, i.e., >> how long the job runs. >> >> b >> >> >> On Mon, Jan 19, 2015 at 5:10 AM, Ludovic Mallet >> <lud...@un... >> <mailto:lud...@un...>> wrote: >> >> Hi, >> Not the best expert, but to me, virtual_free allow the job to >> swap, >> which you should try to avoid. >> and I think h_vmem is the hard limit, so the job would be killed >> whenever the line is crossed. >> >> from http://gridengine.eu/grid-engine-internals >> "hard limitation: All processes of the job combined are >> limited from the >> Linux kernel that they are able to use only the requested >> amount of >> memory. Further malloc() calls will fail." >> >> whether h_vmem is hard by default if GE has to be checked >> again, but I'd >> rather use mem_free instead >> >> Best, >> ludovic >> >> On 19/01/15 02:22, Miguel Grau wrote: >> > Dear all, >> > >> > I am having some troubles to config wgs 8.2 assembler with >> SGE options. 
>> > I always get a malloc memory error and I am not sure why. I >> am working >> > with 3 paired fastq files (6 files in total) with 100b >> length reads (15 >> > million reads in each fastq file). My config file: >> > >> > useGrid = 1 >> > scriptOnGrid = 1 >> > >> > sge = -A assembly >> > sgeMerTrim = -l h_vmem=150G -l virtual_free=150G >> > sgeScript = -l h_vmem=50G -l virtual_free=50G >> > sgeOverlap = -l h_vmem=100G -l virtual_free=100G >> > sgeMerOverlapSeed = -l h_vmem=100G -l virtual_free=100G >> > sgeMerOverlapExtend = -l h_vmem=100G -l virtual_free=100G >> > sgeConsensus = -l h_vmem=100G -l virtual_free=100G >> > sgeFragmentCorrection = -l h_vmem=100G -l virtual_free=100G >> > sgeOverlapCorrection = -l h_vmem=100G -l virtual_free=100G >> > >> > overlapper = ovl #Best for illumina >> > unitigger = bogart #Best for illumina >> > >> > #For 50GB... >> > ovlHashBits = 28 >> > ovlHashBlockLength = 480000000 >> > #100Gb for overlap >> > ovlStoreMemory=102400 >> > >> > ovlThreads = 2 >> > ovlRefBlockSize = 7630000 >> > frgCorrBatchSize = 1000000 >> > frgCorrThreads = 8 >> > >> > The error that I have now is: >> > >> > >> ------------------------------------------------------------------------------ >> > bucketizing /reads/a6/0-overlaptrim-overlap/001/000278.ovb.gz >> > bucketizing /reads/a6/0-overlaptrim-overlap/001/000276.ovb.gz >> > bucketizing /reads/a6/0-overlaptrim-overlap/001/000275.ovb.gz >> > bucketizing /reads/a6/0-overlaptrim-overlap/001/000280.ovb.gz >> > bucketizing DONE! >> > overlaps skipped: >> > 1211882406 OBT - low quality >> > 0 DUP - non-duplicate overlap >> > 0 DUP - different library >> > 0 DUP - dedup not requested >> > terminate called after throwing an instance of 'std::bad_alloc' >> > what(): std::bad_alloc >> > >> > Failed with 'Aborted' >> > >> > Backtrace (mangled): >> > >> > >> /miquel/wgs-8.2/Linux-amd64/bin/overlapStoreBuild(_Z17AS_UTL_catchCrashiP7siginfoPv+0x27)[0x40a697] >> > /lib64/libpthread.so.0[0x3ff1c0f710] >> > /lib64/libc.so.6(gsignal+0x35)[0x3ff1432925] >> > /lib64/libc.so.6(abort+0x175)[0x3ff1434105] >> > .... >> > >> ---------------------------------------------------------------------------------- >> > >> > Some idea for the best config? >> > >> > Thank you, >> > >> > >> > Miguel >> > >> > >> > >> > >> > >> > >> > >> ------------------------------------------------------------------------------ >> > New Year. New Location. New Benefits. New Data Center in >> Ashburn, VA. >> > GigeNET is offering a free month of service with a new >> server in Ashburn. >> > Choose from 2 high performing configs, both with 100TB of >> bandwidth. >> > Higher redundancy.Lower latency.Increased >> capacity.Completely compliant. >> > http://p.sf.net/sfu/gigenet >> > _______________________________________________ >> > wgs-assembler-users mailing list >> > wgs...@li... >> <mailto:wgs...@li...> >> > >> https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users >> >> >> ------------------------------------------------------------------------------ >> New Year. New Location. New Benefits. New Data Center in >> Ashburn, VA. >> GigeNET is offering a free month of service with a new server >> in Ashburn. >> Choose from 2 high performing configs, both with 100TB of >> bandwidth. >> Higher redundancy.Lower latency.Increased capacity.Completely >> compliant. >> http://p.sf.net/sfu/gigenet >> _______________________________________________ >> wgs-assembler-users mailing list >> wgs...@li... 
>> <mailto:wgs...@li...> >> https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users >> >> >> >> >> ------------------------------------------------------------------------------ >> New Year. New Location. New Benefits. New Data Center in Ashburn, VA. >> GigeNET is offering a free month of service with a new server in Ashburn. >> Choose from 2 high performing configs, both with 100TB of bandwidth. >> Higher redundancy.Lower latency.Increased capacity.Completely compliant. >> http://p.sf.net/sfu/gigenet >> >> >> _______________________________________________ >> wgs-assembler-users mailing list >> wgs...@li... <mailto:wgs...@li...> >> https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users > > |
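A note on how the job count falls out of these knobs: each overlap job loads one hash block (ovlHashBlockLength bases of reads) into the hash table and streams one reference block (ovlRefBlockSize reads) against it, so the total number of jobs is roughly the product of the two block counts. A back-of-envelope sketch in shell, with purely illustrative input sizes (about 40 Gbp in 400 M reads assumed, not measured from any real store):

# rough overlap-job count; all input sizes below are placeholders
TOTAL_BASES=40000000000        # ~40 Gbp of input (assumed)
TOTAL_READS=400000000          # ~400 M reads of ~100 bp (assumed)
HASH_BLOCK=200000000           # ovlHashBlockLength
REF_BLOCK=18000000             # ovlRefBlockSize

HASH_BLOCKS=$(( (TOTAL_BASES + HASH_BLOCK - 1) / HASH_BLOCK ))   # ~200
REF_BLOCKS=$((  (TOTAL_READS + REF_BLOCK  - 1) / REF_BLOCK  ))   # ~23
echo "approx overlap jobs: $(( HASH_BLOCKS * REF_BLOCKS ))"      # ~4600

So, to first order, fewer and larger jobs come from raising ovlHashBlockLength and ovlRefBlockSize, and more, smaller jobs from lowering them; runCA reports the exact partitioning when it creates the overlap jobs, so treat the arithmetic above as an approximation only.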
From: Serge K. <ser...@gm...> - 2015-01-21 21:16:00
|
In reply to your earlier error with bogart (bogart -G copygkpStore -O ..ovlStore -T e10.tigStore -o test.bogart \ -eg 0.10 -Eg $VAL -em 0.10 -Em $VAL 2>&1 | tee bogart_25.log). When writing output, bogart splits the resulting contigs into multiple partitions so that consensus can be computed in parallel. Since you're not setting the partition size, it puts one contig per partition, and that is why you are running out of file descriptors. If you add -B 75000 that should fix your issue (-B specifies the number of sequences per partition). If you have 10M sequences this will mean about 130 partitions. You can adjust the 75000 upward to make sure you end up with fewer than 1024 partitions and fit within your open-file limit.

As far as specifying innie/outie, I don't think the classification runs automatically. It has to be enabled with the runCA parameters dncMPlibraries and dncBBlibraries, where dncMPlibraries is the list of your mate-pair libraries and dncBBlibraries is the list of your paired-end libraries. I think it will correct the innie/outie designation for any library listed in dncMPlibraries. Brian would know better, so he can correct me.

I'd also second Brian's suggestion to run MaSuRCA or another assembler like ALLPATHS-LG (if you have the required libraries) for an Illumina dataset. Celera Assembler is not well-designed to handle Illumina datasets and there are other/faster options available for assembly.

Sergey

> On Jan 21, 2015, at 3:45 PM, mathog <ma...@ca...> wrote:
>
> Three of the four Illumina data sets used are "Nextera Mate Pair Reads",
> but
> the frg files for those differ from the other one only in library and
> file names, and this:
>
> < mea:1000.000
> < std:100.000
> ---
>> mea:3000.000
>> std:300.000
>
> These mate pair reads were described as "innie" in the frg files, but
> with respect to the source DNA, I'm thinking now they probably should
> have been "outie". Or maybe not, since there is supposed to be code to
> detect and handle these:
>
> http://wgs-assembler.sourceforge.net/wiki/index.php/Pair_classification_within_Illumina_mate_pair_data
>
> Does the analysis described in the preceding link occur automatically
> for Illumina data, or is something special needed to turn it on?
>
> Thanks,
>
> David Mathog
> ma...@ca...
> Manager, Sequence Analysis Facility, Biology Division, Caltech
>
> ------------------------------------------------------------------------------
> New Year. New Location. New Benefits. New Data Center in Ashburn, VA.
> GigeNET is offering a free month of service with a new server in Ashburn.
> Choose from 2 high performing configs, both with 100TB of bandwidth.
> Higher redundancy.Lower latency.Increased capacity.Completely compliant.
> http://p.sf.net/sfu/gigenet
> _______________________________________________
> wgs-assembler-users mailing list
> wgs...@li...
> https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users |
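For anyone hitting the same limit, a sketch of the fix Sergey describes, with -B sized from the read count instead of a fixed 75000 (the read count and the target of ~1000 partitions are assumptions made only to illustrate the arithmetic; the rest of the command is unchanged from the failing one):

READS=155000000                          # assumed total reads in the gkpStore; substitute yours
TARGET=1000                              # stay under the 1024 open-file limit
B=$(( (READS + TARGET - 1) / TARGET ))   # -> 155000 sequences per partition
echo "-B $B gives $(( (READS + B - 1) / B )) partitions"

bogart -G copygkpStore -O ..ovlStore -T e10.tigStore -o test.bogart -B $B \
  -eg 0.10 -Eg $VAL -em 0.10 -Em $VAL 2>&1 | tee bogart_25.log

And if the mate-pair classification does have to be switched on explicitly, the spec entries would presumably look something like the following; the library names are placeholders and the exact list syntax should be checked against the runCA documentation:

dncMPlibraries = nextera_mp_1,nextera_mp_2,nextera_mp_3
dncBBlibraries = pe_lib_1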
From: mathog <ma...@ca...> - 2015-01-21 20:45:08
|
Three of the four Illumina data sets used are "Nextera Mate Pair Reads", but the frg files for those differ from the other one only in library and file names, and this:

< mea:1000.000
< std:100.000
---
> mea:3000.000
> std:300.000

These mate pair reads were described as "innie" in the frg files, but with respect to the source DNA, I'm thinking now they probably should have been "outie". Or maybe not, since there is supposed to be code to detect and handle these:

http://wgs-assembler.sourceforge.net/wiki/index.php/Pair_classification_within_Illumina_mate_pair_data

Does the analysis described in the preceding link occur automatically for Illumina data, or is something special needed to turn it on?

Thanks,

David Mathog
ma...@ca...
Manager, Sequence Analysis Facility, Biology Division, Caltech |
From: mathog <ma...@ca...> - 2015-01-21 18:51:55
|
On 21-Jan-2015 00:39, Ludovic Mallet wrote: > on debian-like at least, be root: > > #setting the maximum number of file open > sed -i 's/#<domain> <type> <item> <value>/#<domain> <type> > <item> <value>\n\* soft nofile > 65536\n#/' > /etc/security/limits.conf > > though it might be tweaked for RH flavors. > Should be run on every node. Added to limits.conf: mathog hard nofiles 60000 mathog soft nofiles 60000 logged out, logged back in, and saw: % ulimit -Sn 1024 % ulimit -Hn 4096 % ulimit -n 1024 % ulimit -n 60000 bash: ulimit: open files: cannot modify limit: Operation not permitted % ulimit -n 4096 % ulimit -n 4096 The 4096 limit seems to be coming from /proc/1/limits which has: Max open files 1024 4096 files root can set ulimit -n as high as it wants, while also running in bash. Not sure where the 4096 being applied to normal processes is coming from. Regards, David Mathog ma...@ca... Manager, Sequence Analysis Facility, Biology Division, Caltech |
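One detail that may explain why the new limits had no effect: in /etc/security/limits.conf the resource is named nofile (singular), not nofiles, so pam_limits would silently ignore the entries above. Lines along these lines should be what it expects instead:

mathog    soft    nofile    60000
mathog    hard    nofile    60000

After a fresh login, ulimit -Sn and ulimit -Hn should then show the new values, at least for interactive shells; jobs launched by SGE may still need the limit raised on the execution hosts themselves.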
From: Ludovic M. <lud...@un...> - 2015-01-21 08:40:56
|
on debian-like at least, be root: #setting the maximum number of file open sed -i 's/#<domain> <type> <item> <value>/#<domain> <type> <item> <value>\n\* soft nofile 65536\n#/' /etc/security/limits.conf though it might be tweaked for RH flavors. Should be run on every node. best, On 21/01/15 01:03, mathog wrote: > On 20-Jan-2015 12:37, mathog wrote: > >> VAL=2.5 #2.5 percent >> bogart -G copygkpStore -O copyovlStore -T e10.tigStore -o test.bogart \ >> -eg 0.10 -Eg $VAL -em 0.10 -Em $VAL >> tigStore -g copygkpStore -t e10.tigStore 1 -U -d sizes -s 800000000 > Tried this: > > VAL=2.5 > bogart -G copygkpStore -O ..ovlStore -T e10.tigStore -o test.bogart \ > -eg 0.10 -Eg $VAL -em 0.10 -Em $VAL 2>&1 | tee bogart_25.log > > and it ran along happily until dropping dead here: > > ... > OverlapCache()-- Loading overlap information: overlaps processed > 4128333921 (098.08%) loaded 4128333921 (098.08%) (at read iid > 152548896) > OverlapCache()-- Loading overlap information: overlaps processed > 4158431504 (098.79%) loaded 4158431504 (098.79%) (at read iid > 153676157) > OverlapCache()-- Loading overlap information: overlaps processed > 4188318291 (099.50%) loaded 4188318291 (099.50%) (at read iid > 154804535) > OverlapCache()-- Loading overlap information: overlaps processed > 4209225138 (100.00%) loaded 4209225138 (100.00%) > setLogFile()-- Now logging to 'test.bogart.002.bestOverlapGraph' > setLogFile()-- Now logging to 'test.bogart.004.ChunkGraph' > setLogFile()-- Now logging to 'test.bogart.005.buildUnitigs' > setLogFile()-- Now logging to 'test.bogart.006.placeContains' > setLogFile()-- Now logging to 'test.bogart.007.placeZombies' > setLogFile()-- Now logging to 'test.bogart.008.mergeSplitJoin' > setLogFile()-- Now logging to 'test.bogart.009.popBubbles' > setLogFile()-- Now logging to 'test.bogart.010.mergeSplitJoin' > setLogFile()-- Now logging to 'test.bogart.011.cleanup' > setLogFile()-- Now logging to 'test.bogart.012.setParentAndHang' > setLogFile()-- Now logging to 'test.bogart.013.output' > MultiAlignStore::openDB()-- Failed to open > 'e10.tigStore/seqDB.v001.p1010.dat': Too many open files > MultiAlignStore::openDB()-- Trying again. > MultiAlignStore::openDB()-- Failed to open > 'e10.tigStore/seqDB.v001.p1010.dat': Too many open files > WARNING: open file 'test.bogart.013.output.thr000' > > Not suprisingly, tigStore wouldn't work with what was left: > > % tigStore -g copygkpStore -t e10.tigStore 1 -U -d sizes -s 800000000 > MultiAlignStore::MultiAlignStore()-- ERROR, didn't find any unitigs or > contigs in the store. > MultiAlignStore::MultiAlignStore()-- asked for store > 'e10.tigStore', correct? > MultiAlignStore::MultiAlignStore()-- asked for version '1', > correct? > MultiAlignStore::MultiAlignStore()-- asked for partition unitig=0 > contig=0, correct? > MultiAlignStore::MultiAlignStore()-- asked for writable=0 > inplace=0 append=0, correct? > > System information: > % cat /etc/centos-release > CentOS release 6.6 (Final) > % ulimit > unlimited > % ulimit -n > 1024 > % cat /proc/sys/fs/file-max > 52605611 > > The version of wgs is trunk downloaded and built on July 3, 2014. > > Suggestions? > > Thanks, > > David Mathog > ma...@ca... > Manager, Sequence Analysis Facility, Biology Division, Caltech > > ------------------------------------------------------------------------------ > New Year. New Location. New Benefits. New Data Center in Ashburn, VA. > GigeNET is offering a free month of service with a new server in Ashburn. 
> Choose from 2 high performing configs, both with 100TB of bandwidth. > Higher redundancy.Lower latency.Increased capacity.Completely compliant. > http://p.sf.net/sfu/gigenet > _______________________________________________ > wgs-assembler-users mailing list > wgs...@li... > https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users |
From: mathog <ma...@ca...> - 2015-01-21 00:03:14
|
On 20-Jan-2015 12:37, mathog wrote: > VAL=2.5 #2.5 percent > bogart -G copygkpStore -O copyovlStore -T e10.tigStore -o test.bogart \ > -eg 0.10 -Eg $VAL -em 0.10 -Em $VAL > tigStore -g copygkpStore -t e10.tigStore 1 -U -d sizes -s 800000000 Tried this: VAL=2.5 bogart -G copygkpStore -O ..ovlStore -T e10.tigStore -o test.bogart \ -eg 0.10 -Eg $VAL -em 0.10 -Em $VAL 2>&1 | tee bogart_25.log and it ran along happily until dropping dead here: ... OverlapCache()-- Loading overlap information: overlaps processed 4128333921 (098.08%) loaded 4128333921 (098.08%) (at read iid 152548896) OverlapCache()-- Loading overlap information: overlaps processed 4158431504 (098.79%) loaded 4158431504 (098.79%) (at read iid 153676157) OverlapCache()-- Loading overlap information: overlaps processed 4188318291 (099.50%) loaded 4188318291 (099.50%) (at read iid 154804535) OverlapCache()-- Loading overlap information: overlaps processed 4209225138 (100.00%) loaded 4209225138 (100.00%) setLogFile()-- Now logging to 'test.bogart.002.bestOverlapGraph' setLogFile()-- Now logging to 'test.bogart.004.ChunkGraph' setLogFile()-- Now logging to 'test.bogart.005.buildUnitigs' setLogFile()-- Now logging to 'test.bogart.006.placeContains' setLogFile()-- Now logging to 'test.bogart.007.placeZombies' setLogFile()-- Now logging to 'test.bogart.008.mergeSplitJoin' setLogFile()-- Now logging to 'test.bogart.009.popBubbles' setLogFile()-- Now logging to 'test.bogart.010.mergeSplitJoin' setLogFile()-- Now logging to 'test.bogart.011.cleanup' setLogFile()-- Now logging to 'test.bogart.012.setParentAndHang' setLogFile()-- Now logging to 'test.bogart.013.output' MultiAlignStore::openDB()-- Failed to open 'e10.tigStore/seqDB.v001.p1010.dat': Too many open files MultiAlignStore::openDB()-- Trying again. MultiAlignStore::openDB()-- Failed to open 'e10.tigStore/seqDB.v001.p1010.dat': Too many open files WARNING: open file 'test.bogart.013.output.thr000' Not suprisingly, tigStore wouldn't work with what was left: % tigStore -g copygkpStore -t e10.tigStore 1 -U -d sizes -s 800000000 MultiAlignStore::MultiAlignStore()-- ERROR, didn't find any unitigs or contigs in the store. MultiAlignStore::MultiAlignStore()-- asked for store 'e10.tigStore', correct? MultiAlignStore::MultiAlignStore()-- asked for version '1', correct? MultiAlignStore::MultiAlignStore()-- asked for partition unitig=0 contig=0, correct? MultiAlignStore::MultiAlignStore()-- asked for writable=0 inplace=0 append=0, correct? System information: % cat /etc/centos-release CentOS release 6.6 (Final) % ulimit unlimited % ulimit -n 1024 % cat /proc/sys/fs/file-max 52605611 The version of wgs is trunk downloaded and built on July 3, 2014. Suggestions? Thanks, David Mathog ma...@ca... Manager, Sequence Analysis Facility, Biology Division, Caltech |
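Two things that might be worth trying before the next run, both untested here: raise the per-process soft limit up to the existing hard limit in the same shell that launches bogart, and reduce the number of tigStore partitions bogart writes (see the -B suggestion elsewhere in this thread). The first is just:

ulimit -Sn 4096      # the soft limit can be raised up to the hard limit (4096) without root
ulimit -Sn           # confirm before re-running bogart from this shell

Since one partition file was already numbered p1010, even 4096 descriptors may not be enough, in which case the partition count itself has to come down (a larger -B) or the hard limit has to be raised via limits.conf.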
From: mathog <ma...@ca...> - 2015-01-20 20:37:18
|
(This is a followup to: Re: [wgs-assembler-users] mer, mertrim running single threaded on large SMP machine) On 19-Jan-2015 18:52, Brian Walenz wrote: > I didn't poke through the data much, just enough to see it was > Illumina. > My immediate reaction is to suggest trying masurca. It handles > illumina > much much better than plain CA, but does probably require more reads > because more crap gets filtered out. Will look into that. Also found Meraculous, also for Illumina. (So many assemblers, so little time...) > With your current assembly, I see two things I don't like: 1) bog > instead > of bogart, 2) 3% error rate. > > > You can do some experiments with the current assembly without too much > pain. All we're going to do is run bogart a few times, and look at the > resulting unitigs. No consensus generation, just unitig layouts. > > On a COPY of the gkpStore, run > > gatekeeper --revertclear OBTCHIMERA *gkpStore Did this: cp -r ..gkpStore copygkpStore cp ..gkpStore.err copygkpStore.err cp ..gkpStore.errorLog copygkpStore.errorLog cp ..gkpStore.fastqUIDmap copygkpStore.fastqUIDmap cp ..gkpStore.info copygkpStore.info export PATH=$PATH:/home/wgs_project/wgs/Linux-amd64/bin # gatekeeper --revertclear OBTCHIMERA copygkpStore > > This will restore the clear ranges to the state they had just after > trimming, and just before unitigging. > > Then a bunch of iterations of bogart: > > bogart -G *.gkpStore -O *.ovlStore -T e10.tigStore -o test.bogart -eg > 0.10 > -Eg 2.5 -em 0.10 -Em 2.5 > > Where the eg and em parameter is varied between 2 and 6 (percent > error). > By default, overlaps are generated to only 6% error, not that higher > would > be feasible with short reads. The Eg and Em parameters measure overlap > error as 'number of errors', to get around the problem of a 50-base > overlap > with one error resulting in 2% error. You can mostly ignore this for > the > higher error rates. Sorry, the wild card in that line is throwing me. Also I'm confused if you mean big Eg,Em (where 2.5 is in the range specified) or little eg,em (where values are not in that range). Given what I called the copy, is this what you want to run? VAL=2.5 #2.5 percent bogart -G copygkpStore -O copyovlStore -T e10.tigStore -o test.bogart \ -eg 0.10 -Eg $VAL -em 0.10 -Em $VAL tigStore -g copygkpStore -t e10.tigStore 1 -U -d sizes -s 800000000 VAL=3.0 #3.0 percent bogart -G copygkpStore -O copyovlStore -T e10.tigStore -o test.bogart \ -eg 0.10 -Eg $VAL -em 0.10 -Em $VAL tigStore -g copygkpStore -t e10.tigStore 1 -U -d sizes -s 800000000 # etc. The bogart command fails because "'copyovlStore' is not an ovelrapStore". Use the overlapStore from the first run in that command? (note the typo in the error message, that's what it says) Erase the e10.tigStore between runs? Do something to the overlapStore between runs? 
running tigStore on the original (not so useful) run gave this: tigStore -g ..gkpStore -t ..tigStore 1 -U -d sizes -s copygkpStore.info utgLenUnassigned n10 siz 528 sum 304316578 idx 479977 utgLenUnassigned n20 siz 400 sum 608633078 idx 1148939 utgLenUnassigned n30 siz 291 sum 912949618 idx 2026098 utgLenUnassigned n40 siz 179 sum 1217266213 idx 3353557 utgLenUnassigned n50 siz 150 sum 1521582630 idx 5307416 utgLenUnassigned n60 siz 145 sum 1825899170 idx 7367619 utgLenUnassigned n70 siz 126 sum 2130215760 idx 9584603 utgLenUnassigned n80 siz 122 sum 2434532234 idx 12033900 utgLenUnassigned n90 siz 102 sum 2738848751 idx 14689647 utgLenUnassigned sum 3043165239 (genomeSize 0) utgLenUnassigned num 18384123 utgLenUnassigned ave 165 tigLenSingleton n10 siz 150 sum 142617831 idx 907450 tigLenSingleton n20 siz 148 sum 285235697 idx 1865321 tigLenSingleton n30 siz 145 sum 427853436 idx 2837943 tigLenSingleton n40 siz 134 sum 570471289 idx 3850926 tigLenSingleton n50 siz 125 sum 713089018 idx 4969720 tigLenSingleton n60 siz 123 sum 855706883 idx 6116341 tigLenSingleton n70 siz 121 sum 998324590 idx 7282617 tigLenSingleton n80 siz 108 sum 1140942414 idx 8518814 tigLenSingleton n90 siz 87 sum 1283560221 idx 9981733 tigLenSingleton sum 1426177984 (genomeSize 0) tigLenSingleton num 11893391 tigLenSingleton ave 119 tigLenAssembled n10 siz 630 sum 161699171 idx 231237 tigLenAssembled n20 siz 517 sum 323397821 idx 516513 tigLenAssembled n30 siz 443 sum 485096301 idx 855316 tigLenAssembled n40 siz 389 sum 646795227 idx 1245703 tigLenAssembled n50 siz 335 sum 808493956 idx 1690952 tigLenAssembled n60 siz 266 sum 970192570 idx 2232349 tigLenAssembled n70 siz 205 sum 1131891234 idx 2921817 tigLenAssembled n80 siz 157 sum 1293589836 idx 3836637 tigLenAssembled n90 siz 136 sum 1455288608 idx 4933675 tigLenAssembled sum 1616987255 (genomeSize 0) tigLenAssembled num 6490732 tigLenAssembled ave 249 Presumably we want to see many more of the tigLenAssembled and fewer of the utgLenUnassigned and tigLenSingleton. Thanks, David Mathog ma...@ca... Manager, Sequence Analysis Facility, Biology Division, Caltech |
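For what it's worth, one way to script the sweep being discussed, under a few assumptions that are not from Brian's instructions: each error rate writes to its own tigStore so the runs do not collide, the overlap store from the original run is reused via a placeholder variable, and -B 75000 is carried over from Sergey's reply above to avoid the open-file failure:

#!/bin/sh
OVL=..ovlStore                       # placeholder: substitute the real ovlStore path from the original run
for VAL in 2.0 2.5 3.0 4.0 5.0 6.0 ; do
    TIG=e${VAL}.tigStore             # one tigStore per error rate (assumed layout)
    bogart -G copygkpStore -O $OVL -T $TIG -o test.bogart.$VAL -B 75000 \
      -eg 0.10 -Eg $VAL -em 0.10 -Em $VAL 2>&1 | tee bogart_${VAL}.log
    tigStore -g copygkpStore -t $TIG 1 -U -d sizes -s 800000000 \
      > sizes_${VAL}.txt 2>&1
done

Comparing the tigLenAssembled lines across the sizes_*.txt files should then show whether the higher error rates actually buy longer unitigs.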
From: Brian W. <th...@gm...> - 2015-01-20 03:00:43
|
Definitely better with smaller values! NOTE! You need to keep ovlThreads and the sge -pe thread values the same. As you have it now, you've told overlapper to run with 6 threads, but only requested 2 from SGE. This value is totally up to you, whatever works at your site. If possible, run a job or two by hand before submitting to the grid (sh overlap.sh <jobNum>). This will report stats on the hash table usage, and let you see memory usage and run time. If not possible, check the log files in the overlap directory as it runs. You want to check that the hash table isn't totally empty (a load less than 50%). If it is, increase the hash block length or decrease the bits. The other side (too full) isn't really a problem - it'll just do multiple passes to compute all the overlaps. b On Mon, Jan 19, 2015 at 9:17 PM, Miguel Grau <mi...@uj...> wrote: > @Ludovic. virtual_free and h_vmem are mandatory to work in our cluster. Thanks > for the answer. > > @Brian. I increased these values because my batch of fastq files has > around 40Gb so I thought I had to use (following the ovlHashBits table from > here <http://wgs-assembler.sourceforge.net/wiki/index.php/RunCA>, if I > want to use 2 threads on sge): > > ovlHashBits = 27 > ovlHashBlockLength = 260000000 > ovlRefBlockSize = 7630000 > ovlThreads = 2 > sge = -pe thread 2 -l h_vmem=50G > > Instead this, it works better if I decrease the ovlHasBits & > ovlHashBlockLength values and increase the ovlRefBlockSize & ovlThreads > values?: > > ovlHashBits = 25 > ovlHashBlockLength = 240000000 > ovlRefBlockSize = 18000000 > ovlThreads = 6 > sge = -pe thread 2 -l h_vmem=50G > > Thanks for your help, > > Miquel > > > > > > On 2015年01月20日 00:29, Brian Walenz wrote: > > I've never seen large overlap jobs perform better than small jobs. > Target an 8gb job with ~4 CPUs each. My default configuration is: > > ovlHashBits = 22 > ovlHashBlockLength = 200000000 > ovlRefBlockSize = 18000000 > ovlThreads = 6 > > The two 'hash' sizes control how big the job is. The 'ref block size' > controls how many reads are processed by each job, i.e., how long the job > runs. > > b > > > On Mon, Jan 19, 2015 at 5:10 AM, Ludovic Mallet < > lud...@un...> wrote: > >> Hi, >> Not the best expert, but to me, virtual_free allow the job to swap, >> which you should try to avoid. >> and I think h_vmem is the hard limit, so the job would be killed >> whenever the line is crossed. >> >> from http://gridengine.eu/grid-engine-internals >> "hard limitation: All processes of the job combined are limited from the >> Linux kernel that they are able to use only the requested amount of >> memory. Further malloc() calls will fail." >> >> whether h_vmem is hard by default if GE has to be checked again, but I'd >> rather use mem_free instead >> >> Best, >> ludovic >> >> On 19/01/15 02:22, Miguel Grau wrote: >> > Dear all, >> > >> > I am having some troubles to config wgs 8.2 assembler with SGE options. >> > I always get a malloc memory error and I am not sure why. I am working >> > with 3 paired fastq files (6 files in total) with 100b length reads (15 >> > million reads in each fastq file). 
My config file: >> > >> > useGrid = 1 >> > scriptOnGrid = 1 >> > >> > sge = -A assembly >> > sgeMerTrim = -l h_vmem=150G -l virtual_free=150G >> > sgeScript = -l h_vmem=50G -l virtual_free=50G >> > sgeOverlap = -l h_vmem=100G -l virtual_free=100G >> > sgeMerOverlapSeed = -l h_vmem=100G -l virtual_free=100G >> > sgeMerOverlapExtend = -l h_vmem=100G -l virtual_free=100G >> > sgeConsensus = -l h_vmem=100G -l virtual_free=100G >> > sgeFragmentCorrection = -l h_vmem=100G -l virtual_free=100G >> > sgeOverlapCorrection = -l h_vmem=100G -l virtual_free=100G >> > >> > overlapper = ovl #Best for illumina >> > unitigger = bogart #Best for illumina >> > >> > #For 50GB... >> > ovlHashBits = 28 >> > ovlHashBlockLength = 480000000 >> > #100Gb for overlap >> > ovlStoreMemory=102400 >> > >> > ovlThreads = 2 >> > ovlRefBlockSize = 7630000 >> > frgCorrBatchSize = 1000000 >> > frgCorrThreads = 8 >> > >> > The error that I have now is: >> > >> > >> ------------------------------------------------------------------------------ >> > bucketizing /reads/a6/0-overlaptrim-overlap/001/000278.ovb.gz >> > bucketizing /reads/a6/0-overlaptrim-overlap/001/000276.ovb.gz >> > bucketizing /reads/a6/0-overlaptrim-overlap/001/000275.ovb.gz >> > bucketizing /reads/a6/0-overlaptrim-overlap/001/000280.ovb.gz >> > bucketizing DONE! >> > overlaps skipped: >> > 1211882406 OBT - low quality >> > 0 DUP - non-duplicate overlap >> > 0 DUP - different library >> > 0 DUP - dedup not requested >> > terminate called after throwing an instance of 'std::bad_alloc' >> > what(): std::bad_alloc >> > >> > Failed with 'Aborted' >> > >> > Backtrace (mangled): >> > >> > >> /miquel/wgs-8.2/Linux-amd64/bin/overlapStoreBuild(_Z17AS_UTL_catchCrashiP7siginfoPv+0x27)[0x40a697] >> > /lib64/libpthread.so.0[0x3ff1c0f710] >> > /lib64/libc.so.6(gsignal+0x35)[0x3ff1432925] >> > /lib64/libc.so.6(abort+0x175)[0x3ff1434105] >> > .... >> > >> ---------------------------------------------------------------------------------- >> > >> > Some idea for the best config? >> > >> > Thank you, >> > >> > >> > Miguel >> > >> > >> > >> > >> > >> > >> > >> ------------------------------------------------------------------------------ >> > New Year. New Location. New Benefits. New Data Center in Ashburn, VA. >> > GigeNET is offering a free month of service with a new server in >> Ashburn. >> > Choose from 2 high performing configs, both with 100TB of bandwidth. >> > Higher redundancy.Lower latency.Increased capacity.Completely compliant. >> > http://p.sf.net/sfu/gigenet >> > _______________________________________________ >> > wgs-assembler-users mailing list >> > wgs...@li... >> > https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users >> >> >> >> ------------------------------------------------------------------------------ >> New Year. New Location. New Benefits. New Data Center in Ashburn, VA. >> GigeNET is offering a free month of service with a new server in Ashburn. >> Choose from 2 high performing configs, both with 100TB of bandwidth. >> Higher redundancy.Lower latency.Increased capacity.Completely compliant. >> http://p.sf.net/sfu/gigenet >> _______________________________________________ >> wgs-assembler-users mailing list >> wgs...@li... >> https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users >> > > > > ------------------------------------------------------------------------------ > New Year. New Location. New Benefits. New Data Center in Ashburn, VA. > GigeNET is offering a free month of service with a new server in Ashburn. 
> Choose from 2 high performing configs, both with 100TB of bandwidth. > Higher redundancy.Lower latency.Increased capacity.Completely compliant.http://p.sf.net/sfu/gigenet > > > > _______________________________________________ > wgs-assembler-users mailing lis...@li...https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users > > > |
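To do the by-hand check described above, something along these lines may work (the directory name and log handling are assumptions about the runCA work-area layout; the trimming-stage jobs live under 0-overlaptrim-overlap instead):

cd asm-dir/1-overlapper             # assumed location of overlap.sh for the main overlap stage
sh overlap.sh 1 2>&1 | tee overlap.job1.log
grep -i hash overlap.job1.log       # inspect the hash-table fill statistics

A load well under 50% suggests ovlHashBlockLength can go up (or ovlHashBits down); an over-full table just means extra passes, per the note above.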
From: Miguel G. <mi...@uj...> - 2015-01-20 02:24:00
|
@Ludovic. virtual_free and h_vmem are mandatory to work in our cluster. Thanks for the answer. @Brian. I increased these values because my batch of fastq files has around 40Gb so I thought I had to use (following the ovlHashBits table from here <http://wgs-assembler.sourceforge.net/wiki/index.php/RunCA>, if I want to use 2 threads on sge): ovlHashBits = 27 ovlHashBlockLength = 260000000 ovlRefBlockSize = 7630000 ovlThreads = 2 sge = -pe thread 2 -l h_vmem=50G Instead this, it works better if I decrease the ovlHasBits & ovlHashBlockLength values and increase the ovlRefBlockSize & ovlThreads values?: ovlHashBits = 25 ovlHashBlockLength = 240000000 ovlRefBlockSize = 18000000 ovlThreads = 6 sge = -pe thread 2 -l h_vmem=50G Thanks for your help, Miquel On 2015?01?20? 00:29, Brian Walenz wrote: > I've never seen large overlap jobs perform better than small jobs. > Target an 8gb job with ~4 CPUs each. My default configuration is: > > ovlHashBits = 22 > ovlHashBlockLength = 200000000 > ovlRefBlockSize = 18000000 > ovlThreads = 6 > > The two 'hash' sizes control how big the job is. The 'ref block size' > controls how many reads are processed by each job, i.e., how long the > job runs. > > b > > > On Mon, Jan 19, 2015 at 5:10 AM, Ludovic Mallet > <lud...@un... > <mailto:lud...@un...>> wrote: > > Hi, > Not the best expert, but to me, virtual_free allow the job to swap, > which you should try to avoid. > and I think h_vmem is the hard limit, so the job would be killed > whenever the line is crossed. > > from http://gridengine.eu/grid-engine-internals > "hard limitation: All processes of the job combined are limited > from the > Linux kernel that they are able to use only the requested amount of > memory. Further malloc() calls will fail." > > whether h_vmem is hard by default if GE has to be checked again, > but I'd > rather use mem_free instead > > Best, > ludovic > > On 19/01/15 02:22, Miguel Grau wrote: > > Dear all, > > > > I am having some troubles to config wgs 8.2 assembler with SGE > options. > > I always get a malloc memory error and I am not sure why. I am > working > > with 3 paired fastq files (6 files in total) with 100b length > reads (15 > > million reads in each fastq file). My config file: > > > > useGrid = 1 > > scriptOnGrid = 1 > > > > sge = -A assembly > > sgeMerTrim = -l h_vmem=150G -l virtual_free=150G > > sgeScript = -l h_vmem=50G -l virtual_free=50G > > sgeOverlap = -l h_vmem=100G -l virtual_free=100G > > sgeMerOverlapSeed = -l h_vmem=100G -l virtual_free=100G > > sgeMerOverlapExtend = -l h_vmem=100G -l virtual_free=100G > > sgeConsensus = -l h_vmem=100G -l virtual_free=100G > > sgeFragmentCorrection = -l h_vmem=100G -l virtual_free=100G > > sgeOverlapCorrection = -l h_vmem=100G -l virtual_free=100G > > > > overlapper = ovl #Best for illumina > > unitigger = bogart #Best for illumina > > > > #For 50GB... > > ovlHashBits = 28 > > ovlHashBlockLength = 480000000 > > #100Gb for overlap > > ovlStoreMemory=102400 > > > > ovlThreads = 2 > > ovlRefBlockSize = 7630000 > > frgCorrBatchSize = 1000000 > > frgCorrThreads = 8 > > > > The error that I have now is: > > > > > ------------------------------------------------------------------------------ > > bucketizing /reads/a6/0-overlaptrim-overlap/001/000278.ovb.gz > > bucketizing /reads/a6/0-overlaptrim-overlap/001/000276.ovb.gz > > bucketizing /reads/a6/0-overlaptrim-overlap/001/000275.ovb.gz > > bucketizing /reads/a6/0-overlaptrim-overlap/001/000280.ovb.gz > > bucketizing DONE! 
> > overlaps skipped: > > 1211882406 OBT - low quality > > 0 DUP - non-duplicate overlap > > 0 DUP - different library > > 0 DUP - dedup not requested > > terminate called after throwing an instance of 'std::bad_alloc' > > what(): std::bad_alloc > > > > Failed with 'Aborted' > > > > Backtrace (mangled): > > > > > /miquel/wgs-8.2/Linux-amd64/bin/overlapStoreBuild(_Z17AS_UTL_catchCrashiP7siginfoPv+0x27)[0x40a697] > > /lib64/libpthread.so.0[0x3ff1c0f710] > > /lib64/libc.so.6(gsignal+0x35)[0x3ff1432925] > > /lib64/libc.so.6(abort+0x175)[0x3ff1434105] > > .... > > > ---------------------------------------------------------------------------------- > > > > Some idea for the best config? > > > > Thank you, > > > > > > Miguel > > > > > > > > > > > > > > > ------------------------------------------------------------------------------ > > New Year. New Location. New Benefits. New Data Center in > Ashburn, VA. > > GigeNET is offering a free month of service with a new server in > Ashburn. > > Choose from 2 high performing configs, both with 100TB of bandwidth. > > Higher redundancy.Lower latency.Increased capacity.Completely > compliant. > > http://p.sf.net/sfu/gigenet > > _______________________________________________ > > wgs-assembler-users mailing list > > wgs...@li... > <mailto:wgs...@li...> > > https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users > > > ------------------------------------------------------------------------------ > New Year. New Location. New Benefits. New Data Center in Ashburn, VA. > GigeNET is offering a free month of service with a new server in > Ashburn. > Choose from 2 high performing configs, both with 100TB of bandwidth. > Higher redundancy.Lower latency.Increased capacity.Completely > compliant. > http://p.sf.net/sfu/gigenet > _______________________________________________ > wgs-assembler-users mailing list > wgs...@li... > <mailto:wgs...@li...> > https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users > > > > > ------------------------------------------------------------------------------ > New Year. New Location. New Benefits. New Data Center in Ashburn, VA. > GigeNET is offering a free month of service with a new server in Ashburn. > Choose from 2 high performing configs, both with 100TB of bandwidth. > Higher redundancy.Lower latency.Increased capacity.Completely compliant. > http://p.sf.net/sfu/gigenet > > > _______________________________________________ > wgs-assembler-users mailing list > wgs...@li... > https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users |
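Tying this back to the note about keeping ovlThreads and the -pe request in sync: for the second configuration above, the overlap-specific SGE request would presumably need to ask for six slots, for example as below. The memory figures are placeholders only, and whether h_vmem is counted per slot or per job depends on the local SGE setup:

ovlHashBits        = 25
ovlHashBlockLength = 240000000
ovlRefBlockSize    = 18000000
ovlThreads         = 6
sgeOverlap         = -pe thread 6 -l h_vmem=10G -l virtual_free=10G

Alternatively, keep ovlThreads = 2 and request -pe thread 2 — whichever matches the slots actually available on the nodes.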