From: Christoph H. <chr...@gm...> - 2012-04-15 14:06:02
Hi Brian,

Thanks so much for your help! I have now resumed the assembly with the following settings:

ovlHashBits=23
ovlHashBlockLength=260000000

This consumes about 8.5 GB per job and, in my tests, gave a good load of about 70% (see ex1 below). However, I have discovered that the load drops to about 43% after the 13th overlapper job and stays constant after that (currently at job 77, see ex2 below). So, again, not very efficient. What could be the reason for that? Could it be because I am feeding CA two separate Illumina datasets (one small single-end library and one large paired-end library)?

ex1:
HASH LOADING STOPPED: strings 3524789 out of 3524789 max.
HASH LOADING STOPPED: length 260000046 out of 260000046 max.
HASH LOADING STOPPED: entries 127378102 out of 132120576 max (load 72.31).
### realloc Extra_Ref_Space max_extra_ref_ct = 76183793
String_Ct = 3524789 Extra_String_Ct = 755 Extra_String_Subcount = 35
Read 563144 kmers to mark to skip
Kmer hits without olaps = 13633635
Kmer hits with olaps = 2890745
Multiple overlaps/pair = 0
Total overlaps produced = 2837254
Contained overlaps = 0
Dovetail overlaps = 0

ex2:
HASH LOADING STOPPED: strings 3393657 out of 3393657 max.
HASH LOADING STOPPED: length 260000052 out of 260000052 max.
HASH LOADING STOPPED: entries 76303061 out of 132120576 max (load 43.31).
### realloc Extra_Ref_Space max_extra_ref_ct = 127528828
String_Ct = 3393657 Extra_String_Ct = 13 Extra_String_Subcount = 7
Read 563144 kmers to mark to skip
Kmer hits without olaps = 5141573
Kmer hits with olaps = 3859708
Multiple overlaps/pair = 0
Total overlaps produced = 3728782
Contained overlaps = 0
Dovetail overlaps = 0

I also looked at the size of the *gkpStore/inf file. It is 1.1 GB. How do I control which fragments are loaded first? Is it simply the order in which they are listed in the spec file? If so, I have loaded the Illumina fragments first.

Thanks again for your help! I really appreciate it!

cheers,
Christoph

On 13.04.2012 17:00, Walenz, Brian wrote:
> I've seen this too, and am a bit confused where the extra space is used.
> Some assemblies are spot on, others are up to twice as large.
>
> The entries below is 264..., where 957... of them are used. In this case,
> you can either increase hashBlockLength (more memory) or decrease hashBits
> (less memory). The important stat in what you show is ~30% load - most of
> that 3.5gb hash table is empty. We target 70% load. Any higher and the
> table does inefficient lookups, and lower wastes space and increases
> overlapper overhead (more jobs).
>
> One thing to check is the size of file *gkpStore/inf. This is loaded into
> memory nThreads+1 times. The next version (or the CVS tip version) will
> make this less of a problem. If the 'inf' file is large, loading Illumina
> fragments first should reduce the size.
>
> b
>
>
>
> On 4/13/12 10:52 AM, "Christoph Hahn"<chr...@gm...> wrote:
>
>> Hi Brian,
>>
>> Thanks for your reply and suggestions!
>>
>> I did follow your suggestion and configured the overlap jobs with
>> "useGrid=1, scriptOnGrid=0". I subsequently ran overlap.sh 1, etc. to
>> check the memory usage.
>>
>> I am using the following overlap parameters:
>>
>> ovlHashBits=24, ovlHashBlockLength=200000000
>>
>> according to my calculations this would consume some 6 GB of memory
>> (3.5GB from ovlHashBits=24 + 0.5 GB overhang + some 2 GB for the 200 Mb
>> of sequence loaded) per thread.
>>
>> The actual max memory consumption is about 9.6 GB (I ran several
>> overlap.sh jobs by hand), so there is a difference of some 3.5 GB of
>> memory consumption between calculated and observed. Am I missing
>> anything? Where is the error in my calculation?
>>
>> When running the overlap.sh I get something like this:
>> HASH LOADING STOPPED: strings 2695151 out of 2695151 max.
>> HASH LOADING STOPPED: length 200000024 out of 200000024 max.
>> HASH LOADING STOPPED: entries 95738763 out of 264241152 max (load 27.17).
>>
>> In order to optimize, one question to your rule of thumb ("As a rule of
>> thumb, setting ovlHashBlockLength to twice the number of entries
>> available in the table seems reasonable."): in my example, which one is
>> the number of entries available in the table? 95738763 or 264241152? I
>> am a little confused with the terminology... sorry.
>>
>> Thanks again for your kind help!
>>
>> cheers,
>> Christoph
>>
>> On 12.04.2012 21:55, Walenz, Brian wrote:
>>> Hi, Christoph-
>>>
>>> In general (but with exceptions) you can delete a stage and runCA will
>>> pick up from there. For example, you can delete 4-unitigger, fiddle with
>>> parameters, and restart exactly at creating unitigs.
>>>
>>> This works fine with overlaps. Just delete 0-overlaptrim-overlap (and
>>> nothing else!), change parameters and restart runCA. It will skip
>>> gatekeeper, meryl, any trimming, and move straight to configuring overlaps.
>>>
>>> Tip: For overlaps on large assemblies, I like to set "useGrid=1
>>> scriptOnGrid=0". This will configure the overlap jobs, then print out a
>>> qsub command to run them on SGE, but not actually submit them. I then
>>> run several jobs by hand to see memory size and compute performance. To
>>> run by hand, in 0-overlaptrim-overlap, run "overlap.sh 1", "overlap.sh
>>> 2" etc. If you stop these early, they will leave an incomplete
>>> "*.WORKING.gz" file in the output directory (001/ 002/ 003/ etc). I
>>> don't think overlap.sh checks for these files, so you don't even have to
>>> remove them before submitting the full batch.
>>>
>>> b
>>>
>>>
>>> On 4/11/12 5:02 PM, "Christoph Hahn"<chr...@gm...> wrote:
>>>
>>> Dear CA developers and users,
>>>
>>> I am trying to use Celera Assembler 7.0 to assemble a medium sized
>>> genome (about 100 Mb) using a combination of 454 and Illumina reads.
>>>
>>> I chose a bad combination of the ovlHashBits, ovlHashBlockLength
>>> and ovlThreads options, so my last run was stopped on the cluster I
>>> am using for exceeding the memory limit in the overlaptrim step. I
>>> think I know what the problem was now, so my question is whether it is
>>> possible to resume runCA from any given stage. In my particular case
>>> I would like to resume from the 0-overlaptrim-overlap stage with
>>> altered ovlHashBits, ovlHashBlockLength and ovlThreads options. I
>>> want to avoid doing the mercouts and initialtrim steps again,
>>> because they seem to have worked fine.
>>>
>>> I read in the manual about using the /do*/ option to get a kind of
>>> /startBefore/ effect. I can't seem to find any more details on this
>>> in the manual, so can you maybe help me out or point me to the
>>> required information on the webpage? Thanks!
>>>
>>> Your help is highly appreciated!
>>>
>>> much obliged,
>>> Christoph Hahn
>>> PhD student
>>> University of Oslo
>>>
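
The load figures quoted in this thread can be reproduced with a little arithmetic, which may help when picking ovlHashBlockLength for a given ovlHashBits. The sketch below is only that: the constants (21 slots per hash bucket, and a logged "max" equal to 75% of the slots) are inferred from the three log excerpts above and are not taken from the Celera Assembler source. The function names are hypothetical, and the block-length estimate assumes that the number of hash entries grows linearly with the sequence length loaded. The ex1/ex2 logs suggest this holds only roughly, since the same block length gave 72% and 43% load on different batches.

```python
# Constants inferred from the log lines in this thread (assumptions,
# not values read from the Celera Assembler source code).
SLOTS_PER_BUCKET = 21  # assumption: 21 * 2**bits reproduces the logged loads


def table_slots(hash_bits: int) -> int:
    """Total hash-table slots implied by the logged load percentages."""
    return SLOTS_PER_BUCKET * 2 ** hash_bits


def max_entries(hash_bits: int) -> int:
    """The 'out of N max' figure in the logs; matches 75% of the slots."""
    return table_slots(hash_bits) * 3 // 4


def load_percent(entries: int, hash_bits: int) -> float:
    """Load as printed in the 'HASH LOADING STOPPED: entries' line."""
    return 100.0 * entries / table_slots(hash_bits)


def block_length_for_target(obs_length: int, obs_entries: int,
                            hash_bits: int,
                            target_percent: float = 70.0) -> int:
    """Estimate a block length that would reach the target load.

    Assumes entries scale linearly with loaded sequence length,
    using one observed (length, entries) pair from an overlap.sh log.
    """
    entries_per_base = obs_entries / obs_length
    target_entries = target_percent / 100.0 * table_slots(hash_bits)
    return int(target_entries / entries_per_base)


if __name__ == "__main__":
    # Values copied from the logs quoted in this thread.
    print(max_entries(24), max_entries(23))
    print(f"{load_percent(95738763, 24):.2f}")   # ovlHashBits=24 run
    print(f"{load_percent(127378102, 23):.2f}")  # ex1
    print(f"{load_percent(76303061, 23):.2f}")   # ex2
    # Block length for ~70% load at ovlHashBits=23, based on the 24-bit run.
    print(block_length_for_target(200000024, 95738763, 23))
```

With the numbers from the ovlHashBits=24 run, the estimate for ovlHashBits=23 comes out a little under 260 million bases, close to the ovlHashBlockLength=260000000 setting used in the latest message. Treat the result as a starting point and check the load reported by a few hand-run overlap.sh jobs, as suggested earlier in the thread.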