From: Elton V. <elt...@iq...> - 2015-06-09 18:30:13
Thanks very much, Serge! I'll try your suggestions and we'll get back to you if necessary.

Cheers,
Elton

2015-06-09 14:47 GMT-03:00 Serge Koren <ser...@gm...>:

> See my replies below inline.
>
> On Jun 9, 2015, at 1:29 PM, Elton Vasconcelos <elt...@iq...> wrote:
>
> Hi Brian and Serge,
>
> I forgot to tell you last week that I am not doing PacBio read self-correction. Instead I'm doing a hybrid assembly (26 SMRT cells plus 3 paired-end Illumina libraries).
>
> ### Question 1: ###
> Is it still worthwhile doing that on a single multi-threaded machine? I ask because I've seen Sergey's comment that the pipeline is considerably slower when correcting with Illumina reads (http://ehc.ac/p/wgs-assembler/mailman/message/33620582/).
>
> A single machine will likely take several weeks to run the correction with Illumina data for a genome of your size, so I'd advise against it. Based on your genome size and number of SMRT cells, I'd guess you have around 30X PacBio coverage, so I'd suggest trying the low-coverage options for self-correction instead:
>
> http://wgs-assembler.sourceforge.net/wiki/index.php/PBcR#Low_Coverage_Assembly
>
> It will still be significantly faster than the Illumina-based correction. On a recent 20X assembly of a 2.4 Gb genome, self-correction ran about 50 times faster than Illumina-based correction. You can try one of the more recent tools for hybrid correction (LoRDEC, proovread), though I haven't personally run them and can't say how much time they would require.
>
> I am now running CA on a single multi-threaded server that has 80 threads and 1 TB of RAM. It spent about 5 days on the "overlapInCore" step, and I had to kill the process because the server's owner complained that too many threads were being consumed for a long period.
>
> ### Question 2: ###
> Could you explain to me why "overlapInCore" is using all available threads (80) instead of only the 20 requested? My spec file is attached, and I ran the following command:
>
> $ nohup /home/elton/wgs-8.3rc2/Linux-amd64/bin/pacBioToCA -l Test01 -threads 20 -shortReads -genomeSize 380000000 -s pacbio.spec -fastq 26-SMRTcells-filtered_subreads.fastq illumina-NEW.frg &
>
> Your spec file specifies:
>
> ovlThreads=20
> ovlConcurrency=20
>
> This means: run 20 jobs, each using 20 cores, so it is really trying to use 400 cores on your system. If you set ovlConcurrency=1 it will use 20 cores.
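A minimal spec-file sketch of the fix Serge describes, using only the two options named in the thread (values are illustrative; adjust to your machine):

    # pacbio.spec -- overlap stage settings
    ovlThreads=20      # cores used by each overlapInCore job
    ovlConcurrency=1   # jobs run at once; total cores = ovlThreads * ovlConcurrency = 20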
> Thanks a lot again for your attention and support,
> Best,
> Elton
>
> 2015-06-03 11:43 GMT-03:00 Serge Koren <ser...@gm...>:
>
>> Yes, the latest CA 8.3 release can assemble D. melanogaster in < 700 CPU hours. You can see the updated timings here:
>>
>> http://wgs-assembler.sourceforge.net/wiki/index.php/Version_8.3_Release_Notes
>>
>> I've routinely run D. melanogaster on a 16-core, 32 GB machine in less than a day (I haven't timed it exactly), so for your genome you're looking at 3-4K CPU hours. You should be able to run it on a single 16-core, 32 GB machine in a couple of days, so I think it's easiest to run it on a single largish machine you have access to.
>>
>> Sergey
>>
>> On Jun 2, 2015, at 9:12 PM, Brian Walenz <th...@gm...> wrote:
>>
>> That's an old page. The most recent page, linked from http://wgs-assembler.sourceforge.net/wiki/index.php?title=Main_Page, is:
>>
>> http://wgs-assembler.sourceforge.net/wiki/index.php/PBcR
>>
>> (look for 'self correction')
>>
>> I've run Drosophila on my 12-core development machine in a few hours to overnight (I haven't timed it). Sergey replaced blasr with a much faster algorithm, and that was where most of the time was spent.
>>
>> b
>>
>> On Tue, Jun 2, 2015 at 9:02 PM, Elton Vasconcelos <elt...@iq...> wrote:
>>
>>> Thanks for the hints, Brian!
>>>
>>> We'll try everything you suggested tomorrow, back in the lab. Then I'll tell you what we got. For now, I only want to say that our main concern, rather than running runCA itself, is going to be the pre-assembly (correction) step, running the PacBioToCA and PBcR pipelines that are embedded in the wgs package. Please take a look at the following strategy for assembling the Drosophila genome sequenced with PacBio technology (which has a high base-calling error rate, ~15%) at CBCB in Maryland: http://cbcb.umd.edu/software/PBcR/dmel.html
>>> They mention 621K CPU hours to correct that ~122 Mb genome. Our organism's genome is about 380 Mb long, roughly three times the size of Drosophila's. Well, just to let you know again! ;-)
>>>
>>> Talk to you later. Thanks again. Good night!
>>> Elton
>>>
>>> 2015-06-02 20:19 GMT-03:00 Brian Walenz <th...@gm...>:
>>>
>>>> For the link problems: all those symbols come from the kmer package. Check that the flags, compilers, and so on are compatible with those in wgs-assembler.
>>>>
>>>> The kmer configuration is a bit awkward. A shell script (configure.sh) dumps a config to Make.compilers, which is read by the main Makefile. 'gmake real-clean' will remove the previous build AND the Make.compilers file. 'gmake' by itself will first build a Make.compilers by calling configure.sh, then continue with the build. The proper way to modify this is:
>>>>
>>>> edit configure.sh
>>>> gmake real-clean
>>>> gmake install
>>>> repeat until it works
>>>>
>>>> In configure.sh, there is a block of flags for Linux-amd64. I think it'll be easy to apply the same changes made for wgs-assembler.
>>>>
>>>> After rebuilding kmer, the wgs-assembler build should only need to relink. In other words, remove just wgs-assembler/Linux-amd64/bin; don't do 'gmake clean' here! You might need to remove the dependency directory 'dep' too.
>>>>
>>>> For running: the assembler will emit an SGE submit command to run a single shell script as tens to hundreds to thousands of jobs. Each job will need 8-32 GB of memory (tunable) and 1-32 cores (nothing special here: more is faster, fewer is slower). If you can figure out how to run jobs of the form "command.sh 1", "command.sh 2", "command.sh 3", ..., "command.sh N" on BG/Q, you're most of the way to running CA. To make it output such a submit command, supply "useGrid=1 scriptOnGrid=0" to runCA.
>>>>
>>>> The other half of the assembler will be either large I/O or large memory. If you've got access to a machine with 256 GB and 32 cores you should be fine. I don't know what the minimum usable machine size would be.
>>>>
>>>> So, the flow of the computation will be:
>>>>
>>>> On the 256 GB machine: runCA useGrid=1 scriptOnGrid=0 ...
>>>> Wait for it to emit a submit command
>>>> Launch those jobs on BG/Q
>>>> Wait for those to finish
>>>> Relaunch runCA on the 256 GB machine. It'll check that the job outputs are complete and continue processing, probably emitting another submit command, so repeat.
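A minimal sketch of the manual grid loop Brian describes, assuming a plain shell loop stands in for whatever job launcher BG/Q provides; the job count and the script name come from the submit command runCA emits ("command.sh" is the name used in the thread, N is a placeholder):

    #!/bin/sh
    # Run the N array jobs runCA emitted, then let runCA resume.
    N=100                       # however many jobs the submit command asked for
    for i in $(seq 1 $N); do
        sh ./command.sh $i &    # substitute your BG/Q job launcher here
    done
    wait                        # all jobs must finish before relaunching runCA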
>>>> Historical note: back when runCA was first developed, we had a DEC Alpha Tru64 machine with 4 CPUs and 32 GB of RAM, and a grid of a few hundred 2-CPU, 2 GB, 32-bit Linux machines. The Alpha wasn't in the grid, and was a different architecture anyway, so we had to run CA this way. It was a real chore. We're all spoiled with our 4-core, 8 GB laptops now...
>>>>
>>>> b
>>>>
>>>> On Tue, Jun 2, 2015 at 5:49 PM, Elton Vasconcelos <elt...@iq...> wrote:
>>>>
>>>>> Thanks Brian, Serge, and Huang,
>>>>>
>>>>> We've worked through fixing several error messages during compilation within the src/ dir of the latest wgs-8.3rc2.tar.bz2 package. At the end of the day we got stuck on "undefined reference" errors against static libraries (mainly libseq.a; please see the make_progs.log file).
>>>>>
>>>>> The 'gmake install' command within the kmer/ dir ran just fine.
>>>>>
>>>>> The following indicates the BG/Q OS type:
>>>>> [erv3@bgq-fn src]$ uname -a
>>>>> Linux bgq-fn.rcsg.rice.edu 2.6.32-431.el6.ppc64 #1 SMP Sun Nov 10 22:17:43 EST 2013 ppc64 ppc64 ppc64 GNU/Linux
>>>>>
>>>>> We also had to edit the c_make.as file, adding some -I options (to point at library paths) in the CFLAGS fields of the "OSTYPE, Linux" section.
>>>>>
>>>>> Running "make objs" and "make libs" separately, everything appeared to work fine (see the attached make_objs.log and make_libs.log files). The above-mentioned trouble came up on the final "make progs" command we ran (make_progs.log file).
>>>>>
>>>>> Well, just to let you guys know, and to see whether some light can be shed.
>>>>>
>>>>> Thanks a lot,
>>>>> Cheers,
>>>>> Elton
>>>>>
>>>>> PS: I also noticed the MPI cluster system on BG/Q, Brian. So, do you think it isn't worthwhile to keep trying to install CA on BG/Q?

> <pacbio.spec>

--
Elton Vasconcelos, DVM, PhD
Post-doc at Verjovski-Almeida Lab
Department of Biochemistry - Institute of Chemistry
University of Sao Paulo, Brazil
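A minimal sketch pulling together the rebuild steps Brian outlines above, assuming kmer/ and src/ sit side by side inside the wgs-8.3rc2 tree as in the stock tarball (paths and the location of the 'dep' directory may differ on your checkout):

    # Rebuild kmer with flags matched to wgs-assembler, then force
    # wgs-assembler to relink without a full recompile.
    cd wgs-8.3rc2/kmer
    # edit configure.sh here: apply the same flag changes made for wgs-assembler
    gmake real-clean
    gmake install
    cd ..
    rm -rf Linux-amd64/bin     # remove binaries only; do NOT run 'gmake clean'
    rm -rf Linux-amd64/dep     # Brian notes the dependency dir may need removing too
    cd src
    gmake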