From: Elton V. <elt...@iq...> - 2015-06-09 18:30:13
Thanks very much, Serge! I'll try your suggestions and we'll get back to you if necessary.

Cheers,
Elton

2015-06-09 14:47 GMT-03:00 Serge Koren <ser...@gm...>:

> See my replies below inline.
>
> On Jun 9, 2015, at 1:29 PM, Elton Vasconcelos <elt...@iq...> wrote:
>
> Hi Brian and Serge,
>
> I forgot to tell you last week that I am not doing PacBio read self-correction. Instead I'm doing a hybrid assembly (26 SMRT cells plus 3 paired-end Illumina libraries).
>
> ### Question 1: ###
> Is it still worthwhile doing that on a single multi-threaded machine? I ask because I've seen Sergey's comment that the pipeline is considerably slower when correcting with Illumina reads (http://ehc.ac/p/wgs-assembler/mailman/message/33620582/).
>
> A single machine will likely take several weeks to run the correction with Illumina data for a genome of your size, so I'd advise against it. Based on your genome size and number of SMRT cells, I'd guess you have around 30X PacBio coverage, so I'd suggest trying the low-coverage options for self-correction instead:
>
> http://wgs-assembler.sourceforge.net/wiki/index.php/PBcR#Low_Coverage_Assembly
>
> It will still be significantly faster than the Illumina-based correction. On a recent 20X assembly of a 2.4 Gb genome, self-correction ran about 50 times faster than Illumina-based correction. You can try one of the more recent tools for hybrid correction (LoRDEC, proovread), though I haven't personally run them and can't say how much time they would require.
>
> I am now running CA on a single multi-threaded server that has 80 threads and 1 TB of RAM. It spent about 5 days on the "overlapInCore" step, and I had to kill the process because the server's owner complained that too many threads were being consumed for a long period.
>
> ### Question 2: ###
> Could you explain to me why "overlapInCore" is using all available threads (80) instead of only the 20 requested? My spec file is attached, and I ran the following command:
>
> $ nohup /home/elton/wgs-8.3rc2/Linux-amd64/bin/pacBioToCA -l Test01 -threads 20 -shortReads -genomeSize 380000000 -s pacbio.spec -fastq 26-SMRTcells-filtered_subreads.fastq illumina-NEW.frg &
>
> Your spec file specifies:
>
> ovlThreads=20
> ovlConcurrency=20
>
> This means: run 20 jobs, each using 20 cores, so it is really trying to use 400 cores on your system. If you set ovlConcurrency=1 it will use 20 cores.
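A minimal spec-file sketch of the fix Serge describes, using only the two options named in the thread (values are illustrative; adjust to your machine):

    # pacbio.spec -- overlap stage settings
    ovlThreads=20      # cores used by each overlapInCore job
    ovlConcurrency=1   # jobs run at once; total cores = ovlThreads * ovlConcurrency = 20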
> Thanks a lot again for your attention and support,
> Best,
> Elton
>
> 2015-06-03 11:43 GMT-03:00 Serge Koren <ser...@gm...>:
>
>> Yes, the latest CA 8.3 release can assemble D. melanogaster in < 700 CPU hours. You can see the updated timings here:
>>
>> http://wgs-assembler.sourceforge.net/wiki/index.php/Version_8.3_Release_Notes
>>
>> I've routinely run D. melanogaster on a 16-core, 32 GB machine in less than a day (I haven't timed it exactly), so for your genome you're looking at 3-4K CPU hours. You should be able to run it on a single 16-core, 32 GB machine in a couple of days, so I think it's easiest to run it on a single largish machine you have access to.
>>
>> Sergey
>>
>> On Jun 2, 2015, at 9:12 PM, Brian Walenz <th...@gm...> wrote:
>>
>> That's an old page. The most recent page, linked from http://wgs-assembler.sourceforge.net/wiki/index.php?title=Main_Page, is:
>>
>> http://wgs-assembler.sourceforge.net/wiki/index.php/PBcR
>>
>> (look for 'self correction')
>>
>> I've run Drosophila on my 12-core development machine in a few hours to overnight (I haven't timed it). Sergey replaced blasr with a much faster algorithm, and that was where most of the time was spent.
>>
>> b
>>
>> On Tue, Jun 2, 2015 at 9:02 PM, Elton Vasconcelos <elt...@iq...> wrote:
>>
>>> Thanks for the hints, Brian!
>>>
>>> We'll try everything you suggested tomorrow, back in the lab. Then I'll tell you what we got. For now, I only want to say that our main concern, rather than running runCA itself, is going to be the pre-assembly (correction) step, running the PacBioToCA and PBcR pipelines that are embedded in the wgs package. Please take a look at the following strategy for assembling the Drosophila genome sequenced with PacBio technology (which has a high base-calling error rate, ~15%) at CBCB in Maryland: http://cbcb.umd.edu/software/PBcR/dmel.html
>>> They mention 621K CPU hours to correct that ~122 Mb genome. Our organism's genome is about 380 Mb long, roughly three times the size of Drosophila's. Well, just to let you know again! ;-)
>>>
>>> Talk to you later. Thanks again. Good night!
>>> Elton
>>>
>>> 2015-06-02 20:19 GMT-03:00 Brian Walenz <th...@gm...>:
>>>
>>>> For the link problems: all those symbols come from the kmer package. Check that the flags, compilers, and so on are compatible with those in wgs-assembler.
>>>>
>>>> The kmer configuration is a bit awkward. A shell script (configure.sh) dumps a config to Make.compilers, which is read by the main Makefile. 'gmake real-clean' will remove the previous build AND the Make.compilers file. 'gmake' by itself will first build a Make.compilers by calling configure.sh, then continue with the build. The proper way to modify this is:
>>>>
>>>> edit configure.sh
>>>> gmake real-clean
>>>> gmake install
>>>> repeat until it works
>>>>
>>>> In configure.sh, there is a block of flags for Linux-amd64. I think it'll be easy to apply the same changes made for wgs-assembler.
>>>>
>>>> After rebuilding kmer, the wgs-assembler build should only need to relink. In other words, remove just wgs-assembler/Linux-amd64/bin; don't do 'gmake clean' here! You might need to remove the dependency directory 'dep' too.
>>>>
>>>> For running: the assembler will emit an SGE submit command to run a single shell script as tens to hundreds to thousands of jobs. Each job will need 8-32 GB of memory (tunable) and 1-32 cores (nothing special here: more is faster, fewer is slower). If you can figure out how to run jobs of the form "command.sh 1", "command.sh 2", "command.sh 3", ..., "command.sh N" on BG/Q, you're most of the way to running CA. To make it output such a submit command, supply "useGrid=1 scriptOnGrid=0" to runCA.
>>>>
>>>> The other half of the assembler will be either large I/O or large memory. If you've got access to a machine with 256 GB and 32 cores you should be fine. I don't know what the minimum usable machine size would be.
>>>>
>>>> So, the flow of the computation will be:
>>>>
>>>> On the 256 GB machine: runCA useGrid=1 scriptOnGrid=0 ...
>>>> Wait for it to emit a submit command
>>>> Launch those jobs on BG/Q
>>>> Wait for those to finish
>>>> Relaunch runCA on the 256 GB machine. It'll check that the job outputs are complete and continue processing, probably emitting another submit command, so repeat.
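A minimal sketch of the manual grid loop Brian describes, assuming a plain shell loop stands in for whatever job launcher BG/Q provides; the job count and the script name come from the submit command runCA emits ("command.sh" is the name used in the thread, N is a placeholder):

    #!/bin/sh
    # Run the N array jobs runCA emitted, then let runCA resume.
    N=100                       # however many jobs the submit command asked for
    for i in $(seq 1 $N); do
        sh ./command.sh $i &    # substitute your BG/Q job launcher here
    done
    wait                        # all jobs must finish before relaunching runCA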
>>>> Historical note: back when runCA was first developed, we had a DEC Alpha Tru64 machine with 4 CPUs and 32 GB of RAM, and a grid of a few hundred 2-CPU, 2 GB, 32-bit Linux machines. The Alpha wasn't in the grid, and was a different architecture anyway, so we had to run CA this way. It was a real chore. We're all spoiled with our 4-core, 8 GB laptops now...
>>>>
>>>> b
>>>>
>>>> On Tue, Jun 2, 2015 at 5:49 PM, Elton Vasconcelos <elt...@iq...> wrote:
>>>>
>>>>> Thanks Brian, Serge, and Huang,
>>>>>
>>>>> We've worked through fixing several error messages during compilation within the src/ dir of the latest wgs-8.3rc2.tar.bz2 package. At the end of the day we got stuck on "undefined reference" errors against static libraries (mainly libseq.a; please see the make_progs.log file).
>>>>>
>>>>> The 'gmake install' command within the kmer/ dir ran just fine.
>>>>>
>>>>> The following indicates the BG/Q OS type:
>>>>> [erv3@bgq-fn src]$ uname -a
>>>>> Linux bgq-fn.rcsg.rice.edu 2.6.32-431.el6.ppc64 #1 SMP Sun Nov 10 22:17:43 EST 2013 ppc64 ppc64 ppc64 GNU/Linux
>>>>>
>>>>> We also had to edit the c_make.as file, adding some -I options (to point at library paths) in the CFLAGS fields of the "OSTYPE, Linux" section.
>>>>>
>>>>> Running "make objs" and "make libs" separately, everything appeared to work fine (see the attached make_objs.log and make_libs.log files). The above-mentioned trouble came up on the final "make progs" command we ran (make_progs.log file).
>>>>>
>>>>> Well, just to let you guys know, and to see whether some light can be shed.
>>>>>
>>>>> Thanks a lot,
>>>>> Cheers,
>>>>> Elton
>>>>>
>>>>> PS: I also noticed the MPI cluster system on BG/Q, Brian. So, do you think it isn't worthwhile to keep trying to install CA on BG/Q?

> <pacbio.spec>

--
Elton Vasconcelos, DVM, PhD
Post-doc at Verjovski-Almeida Lab
Department of Biochemistry - Institute of Chemistry
University of Sao Paulo, Brazil
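A minimal sketch pulling together the rebuild steps Brian outlines above, assuming kmer/ and src/ sit side by side inside the wgs-8.3rc2 tree as in the stock tarball (paths and the location of the 'dep' directory may differ on your checkout):

    # Rebuild kmer with flags matched to wgs-assembler, then force
    # wgs-assembler to relink without a full recompile.
    cd wgs-8.3rc2/kmer
    # edit configure.sh here: apply the same flag changes made for wgs-assembler
    gmake real-clean
    gmake install
    cd ..
    rm -rf Linux-amd64/bin     # remove binaries only; do NOT run 'gmake clean'
    rm -rf Linux-amd64/dep     # Brian notes the dependency dir may need removing too
    cd src
    gmake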