Thread: Re: [wgs-assembler-users] resuming runCA after stop

Hi Brian,

Thanks so much for your help!

I have resumed the assembly now with the following settings:
ovlHashBits=23
ovlHashBlockLength=260000000

This consumes about 8.5 GB per job and in my tests gave me a good load of
about 70% (see ex1 below), but I have found that the load drops to
about 43% after the 13th overlapper job and stays constant after that
(currently at job 77, see ex2 below). So, again, not very efficient. What
could be the reason for that? Could it be because I am feeding CA
two separate Illumina datasets (one small single-end library and one
large paired-end library)?

ex1:
HASH LOADING STOPPED: strings       3524789 out of      3524789 max.
HASH LOADING STOPPED: length      260000046 out of    260000046 max.
HASH LOADING STOPPED: entries     127378102 out of    132120576 max (load 72.31).
### realloc  Extra_Ref_Space  max_extra_ref_ct = 76183793
String_Ct = 3524789  Extra_String_Ct = 755  Extra_String_Subcount = 35
Read 563144 kmers to mark to skip
  Kmer hits without olaps = 13633635
     Kmer hits with olaps = 2890745
   Multiple overlaps/pair = 0
  Total overlaps produced = 2837254
       Contained overlaps = 0
        Dovetail overlaps = 0

ex2:
HASH LOADING STOPPED: strings       3393657 out of      3393657 max.
HASH LOADING STOPPED: length      260000052 out of    260000052 max.
HASH LOADING STOPPED: entries      76303061 out of    132120576 max (load 43.31).
### realloc  Extra_Ref_Space  max_extra_ref_ct = 127528828
String_Ct = 3393657  Extra_String_Ct = 13  Extra_String_Subcount = 7
Read 563144 kmers to mark to skip
  Kmer hits without olaps = 5141573
     Kmer hits with olaps = 3859708
   Multiple overlaps/pair = 0
  Total overlaps produced = 3728782
       Contained overlaps = 0
        Dovetail overlaps = 0
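
The relationship between the counts and the printed load in ex1 and ex2 can be checked with a short script. This is only a sketch: the names parse_hash_stats, estimated_load and LOAD_PER_MAX_FRACTION are hypothetical, and the regular expressions are written against the log lines quoted in this thread, not taken from the overlapper source. In the excerpts shown in this thread, the printed load matches 0.75 x entries/max to two decimals. That factor is inferred from the numbers alone and should be treated as an assumption, but it would explain why 127378102 of 132120576 entries is reported as a load of about 72% rather than 96%.

```python
import re

# One "HASH LOADING STOPPED" line: field name, used count, maximum count.
COUNT_RE = re.compile(r"HASH LOADING STOPPED:\s+(\w+)\s+(\d+)\s+out of\s+(\d+)\s+max")
# The load percentage printed after the "entries" line.
LOAD_RE = re.compile(r"\(load\s+([0-9.]+)\)")

# ASSUMPTION: inferred from the figures in this thread, not from the overlapper source.
LOAD_PER_MAX_FRACTION = 0.75


def parse_hash_stats(text):
    """Return {'strings': (used, max), 'length': ..., 'entries': ..., 'load': float or None}."""
    stats = {name: (int(used), int(limit)) for name, used, limit in COUNT_RE.findall(text)}
    match = LOAD_RE.search(text)
    stats["load"] = float(match.group(1)) if match else None
    return stats


def estimated_load(stats):
    """Load in percent, estimated from the entries counts under the assumption above."""
    used, limit = stats["entries"]
    return 100.0 * LOAD_PER_MAX_FRACTION * used / limit


EX1 = """\
HASH LOADING STOPPED: strings       3524789 out of      3524789 max.
HASH LOADING STOPPED: length      260000046 out of    260000046 max.
HASH LOADING STOPPED: entries     127378102 out of    132120576 max (load 72.31).
"""

EX2 = """\
HASH LOADING STOPPED: strings       3393657 out of      3393657 max.
HASH LOADING STOPPED: length      260000052 out of    260000052 max.
HASH LOADING STOPPED: entries      76303061 out of    132120576 max (load 43.31).
"""

for label, text in (("ex1", EX1), ("ex2", EX2)):
    stats = parse_hash_stats(text)
    print(label, "reported load:", stats["load"],
          "estimated:", round(estimated_load(stats), 2))
```

Both jobs fill the same 260 Mb of sequence, so under this reading the drop from 72% to 43% reflects fewer hash entries per base loaded in the later jobs, not a change in table size.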

I also looked at the size of the *gkpStore/inf file. It is 1.1 GB. How
do I control which fragments are loaded first? Is it simply determined by
the order in which they are listed in the specfile? If so, I have already
loaded the Illumina fragments first.
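
For a rough sense of how much memory the inf file could account for, Brian's note quoted below (the file is loaded into memory nThreads+1 times) can be turned into a one-line estimate. This is a minimal sketch: the function name inf_memory_bytes is hypothetical, 1.1 GB is taken as 1.1e9 bytes, and the thread counts are example values only, since the ovlThreads setting actually used is not stated in this thread.

```python
def inf_memory_bytes(inf_file_bytes, n_threads):
    """Memory taken by copies of gkpStore/inf if it is loaded n_threads + 1 times."""
    if n_threads < 1:
        raise ValueError("n_threads must be at least 1")
    return inf_file_bytes * (n_threads + 1)


INF_BYTES = 1_100_000_000  # the 1.1 GB reported above, taken as 1.1e9 bytes (assumption)

for n_threads in (1, 2, 4):  # example values only, not the settings used in this run
    gb = inf_memory_bytes(INF_BYTES, n_threads) / 1e9
    print(f"ovlThreads={n_threads}: about {gb:.1f} GB for copies of inf")
```

Whatever the real thread count is, this amount would come on top of the hash table and sequence memory estimated earlier in the thread.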

Thanks again for your help! I really appreciate it!

cheers,
Christoph

On 13.04.2012 17:00, Walenz, Brian wrote:
> I've seen this too, and am a bit confused where the extra space is used.
> Some assemblies are spot on, others are up to twice as large.
>
> The number of entries below is 264..., of which 957... are used.  In this case,
> you can either increase hashBlockLength (more memory) or decrease hashBits
> (less memory).  The important stat in what you show is ~30% load - most of
> that 3.5gb hash table is empty.  We target 70% load.  Any higher and the
> table does inefficient lookups, and lower wastes space and increases
> overlapper overhead (more jobs).
>
> One thing to check is the size of file *gkpStore/inf.  This is loaded into
> memory nThreads+1 times.  The next version (or the CVS tip version) will
> make this less of a problem.  If the 'inf' file is large, loading Illumina
> fragments first should reduce the size.
>
> b
>
>
>
> On 4/13/12 10:52 AM, "Christoph Hahn"<chr...@gm...>  wrote:
>
>> Hi Brian,
>>
>> Thanks for your reply and suggestions!
>>
>> I did follow your suggestion and configured the overlap jobs with
>> "useGrid=1, scriptOnGrid=0". I subsequently ran overlap.sh 1, etc. to
>> check the memory usage.
>>
>> I am using the following overlap parameters:
>>
>> ovlHashBits=24, ovlHashBlockLength=200000000
>>
>> according to my calculations this would consume some 6 GB of memory
>> (3.5 GB from ovlHashBits=24 + 0.5 GB overhead + some 2 GB for the 200 Mb
>> of sequence loaded) per thread.
>>
>> The actual max memory consumption is about 9.6 GB (I ran several
>> overlap.sh jobs by hand), so there is a difference of some 3.5 GB of
>> memory consumption between calculated and observed. Am I missing
>> anything? Where is the error in my calculation?
>>
>> When running the overlap.sh I get something like this:
>> HASH LOADING STOPPED: strings       2695151 out of      2695151 max.
>> HASH LOADING STOPPED: length      200000024 out of    200000024 max.
>> HASH LOADING STOPPED: entries      95738763 out of    264241152 max (load 27.17).
>>
>> In order to optimize, one question about your rule of thumb ("As a rule of
>> thumb, setting ovlHashBlockLength to twice the number of entries
>> available in the table seems reasonable."): in my example, which one is
>> the number of entries available in the table? 95738763 or 264241152? I
>> am a little confused by the terminology... sorry.
>>
>> Thanks again for your kind help!
>>
>> cheers,
>> Christoph
>>
>> On 12.04.2012 21:55, Walenz, Brian wrote:
>>> Hi, Christoph-
>>>
>>> In general (but with exceptions) you can delete a stage and runCA will
>>> pick up from there. For example, you can delete 4-unitigger, fiddle with
>>> parameters, and restart exactly at creating unitigs.
>>>
>>> This works fine with overlaps. Just delete 0-overlaptrim-overlap (and
>>> nothing else!), change parameters and restart runCA. It will skip
>>> gatekeeper, meryl, any trimming, and move straight to configuring overlaps.
>>>
>>> Tip: For overlaps on large assemblies, I like to set "useGrid=1
>>> scriptOnGrid=0". This will configure the overlap jobs, then print out a
>>> qsub command to run them on SGE, but not actually submit them. I then
>>> run several jobs by hand to see memory size and compute performance. To
>>> run by hand, in 0-overlaptrim-overlap, run "overlap.sh 1", "overlap.sh
>>> 2" etc. If you stop these early, they will leave an incomplete
>>> "*.WORKING.gz" file in the output directory (001/ 002/ 003/ etc). I
>>> don't think overlap.sh checks for these files, so you don't even have to
>>> remove them before submitting the full batch.
>>>
>>> b
>>>
>>>
>>> On 4/11/12 5:02 PM, "Christoph Hahn"<chr...@gm...>  wrote:
>>>
>>>      Dear CA developers and users,
>>>
>>>      I am trying to use Celera Assembler 7.0 to assemble a medium-sized
>>>      genome (about 100 Mb) using a combination of 454 and Illumina reads.
>>>
>>>      I chose a bad combination of the ovlHashBits, ovlHashBlockLength
>>>      and ovlThreads options, so my last run was stopped on the cluster I
>>>      am using for exceeding the memory limit in the overlaptrim step. I
>>>      think I now know what the problem was, so my question is whether it is
>>>      possible to resume runCA from any given stage. In my particular case
>>>      I would like to resume from the 0-overlaptrim-overlap stage with
>>>      altered ovlHashBits, ovlHashBlockLength and ovlThreads options. I
>>>      want to avoid doing the mercounts and initialtrim steps again,
>>>      because they seem to have worked fine.
>>>
>>>      I read in the manual about using the /do*/ option to get a kind of
>>>      /startBefore/ effect. I can't seem to find any more details on this
>>>      in the manual, so could you help me out or point me to the
>>>      relevant information on the webpage? Thanks!
>>>
>>>      Your help is highly appreciated!
>>>
>>>      much obliged,
>>>      Christoph Hahn
>>>      PhD student
>>>      University of Oslo
>>>
>>>