From: kuhl <ku...@mo...> - 2012-04-25 10:04:08
Dear Christoph,

I have successfully done an assembly of about 350 million reads for a
1.2 Gb genome using Celera Assembler 6.1 (which version do you use?)
from 454 and Solexa data. Anyway, it took about 1.5 months to complete
on a 48-core server, used plenty of disk space (2-3 TB), and involved a
lot of manual work correcting failed contigs. So 10 days might not be
enough. (The data will not be lost after the ten days, as you can
resume the overlap.sh jobs manually; everything done so far is saved to
disk.)

I also see from your mail that you are using a very high coverage of
your genome. Celera may not benefit from that. Maybe you could reduce
your dataset to 50-70X coverage. That would reduce the computing time
dramatically, as computing time increases roughly quadratically with
read number, i.e. with coverage (see the first sketch below). It also
depends on how you configured the overlapper: depending on the
configuration, the overlap jobs may each take longer and longer, or
stay more or less constant in computing time per job.

Another possibility, which I tried for a different genome (2.5 Gb, 10^9
reads; I did not want to wait for three months...), is to use a de
Bruijn graph assembler to assemble the Illumina data (I would recommend
SOAPdenovo or CLC; the latter can also make use of the 454 data), split
the resulting scaffolds into contigs smaller than 32,000 bp (see the
second sketch below), and feed them, together with the 454 data and a
little (i.e. 5X) coverage of the Illumina paired ends, into the
long-read version of Celera Assembler supplied with the PacBio
correction pipeline. These steps took about one week and delivered a
much better assembly than de Bruijn graph assemblers alone.
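To make the coverage argument concrete, here is a back-of-the-envelope
sketch in Python. The genome size, read count, and target coverage are
illustrative placeholders rather than numbers from this thread, and the
quadratic scaling is the assumption stated above:

    # Rough estimate of current coverage and of the speedup expected
    # from downsampling to ~60X.  All inputs are placeholder values.
    genome_size = 85e6        # assumed genome size in bp
    read_bases = 200e6 * 76   # e.g. 200 million 76 bp Illumina reads

    coverage = read_bases / genome_size
    target = 60.0             # desired coverage after downsampling

    if coverage > target:
        keep = target / coverage              # fraction of reads to keep
        speedup = (coverage / target) ** 2    # quadratic-scaling assumption
        print(f"~{coverage:.0f}X now; keep {keep:.0%} of the reads for "
              f"~{target:.0f}X and ~{speedup:.0f}x faster overlapping")
    else:
        print(f"already at ~{coverage:.0f}X; no downsampling needed")

Any random read sampler can then do the actual subsampling; seqtk
sample, for example, takes a random seed and the fraction of reads to
keep.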
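For the scaffold-splitting step, a minimal sketch, assuming plain FASTA
scaffolds from the de Bruijn assembler; the file names and the exact
31,999 bp cap are placeholders chosen to stay under the 32,000 bp limit
mentioned above:

    import re

    LIMIT = 31999  # keep every output contig under 32,000 bp

    def read_fasta(path):
        """Yield (name, sequence) pairs from a FASTA file."""
        name, chunks = None, []
        with open(path) as fh:
            for line in fh:
                line = line.strip()
                if line.startswith(">"):
                    if name is not None:
                        yield name, "".join(chunks)
                    name, chunks = line[1:].split()[0], []
                elif line:
                    chunks.append(line)
        if name is not None:
            yield name, "".join(chunks)

    with open("contigs_lt32k.fasta", "w") as out:
        for name, seq in read_fasta("soap_scaffolds.fasta"):
            n = 0
            # break each scaffold into contigs at runs of N (the gaps)...
            for contig in (c for c in re.split("[Nn]+", seq) if c):
                # ...and chop anything still longer than the cap
                for start in range(0, len(contig), LIMIT):
                    n += 1
                    out.write(f">{name}_part{n}\n"
                              f"{contig[start:start + LIMIT]}\n")

Breaking at runs of N first keeps the de Bruijn assembler's gap padding
from entering Celera as literal sequence; the inner loop then only has
to chop the contigs that are still over the cap.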
A question to other users/developers: have you also noticed that, when
Illumina reads are stored in the packed format, the overlap jobs do not
reach the maximum speed they should? I mean, for example, an overlap
job configured for 12 threads running on only 8 threads on average. Has
anyone encountered this problem?

I wish you good luck,
Heiner

On Tue, 24 Apr 2012 19:59:07 +0200, Christoph Hahn <chr...@gm...> wrote:
> Thanks for that, Ariel! Leaves me with little hope though..
> Nevertheless, I understand that these kinds of jobs did finish in your
> experience, right?
>
> From my tests and the number of overlap.sh jobs created in the initial
> phase, I assumed I was on the safe side with a wall clock limit of 10
> days to finish this stage. I can maybe ask the cluster administration
> to extend the wall clock limit, but I'd need some estimate of by how
> much..
>
> I am using some 1.1 million 454 reads (~500 bp in length) plus some
> 200 million paired-end reads plus some 14 million single-end Illumina
> reads (76 bp read length, respectively). The genome is estimated to be
> only about 70-100 Mb in size, but we have reason to expect a
> substantial amount of contamination from the host (as we are dealing
> with a parasitic organism), and also a fair bit of polymorphism, as
> the libraries were prepared from a pooled sample.
>
> Can anyone suggest a reasonable time frame for reaching a checkpoint
> from which I can then resume the assembly?
>
> Thanks in advance!!
>
> Christoph
>
> On 24.04.2012 18:47, Schwartz, Ariel wrote:
>> I have experienced the same issue with our hybrid assemblies.
>> Currently I am waiting for an overlap job that has been running for
>> almost two weeks.
>>
>> I wonder if there are some recommended settings that could be used
>> to alleviate this problem.
>>
>> Thanks,
>>
>> Ariel
>>
>> Ariel Schwartz, Ph.D.
>> Senior Scientist, Bioinformatics
>> Synthetic Genomics, Inc.
>>
>> On 4/24/12 4:44 AM, "Christoph Hahn" <chr...@gm...> wrote:
>>
>> Dear CABOG developers and users,
>>
>> I am trying to do a hybrid assembly using a combination of 454 and
>> single- as well as paired-end Illumina data.
>>
>> After initial trouble with optimization in the 0-overlaptrim-overlap
>> stage of my assembly, I got it to run successfully, and during the
>> previous 7+ days the pipeline successfully completed some 2260
>> overlap.sh jobs. Now I am encountering something strange: the last
>> pending overlap.sh job (2148 of 2261) has now been running for over
>> 36 hours. The 002148.ovb.WORKING.gz file created by this job is
>> slowly but steadily growing; it presently has some 631 MB. Is this
>> normal? Has anyone had a similar experience before? Maybe it will
>> sort itself out eventually anyway; I am just a little concerned that
>> CABOG will not finish the job before it hits the 10-day wall clock
>> limit set on the cluster for the job, which would result in
>> thousands of CPU hours going down the drain..
>>
>> Please share your wisdom with me!
>>
>> much obliged,
>> Christoph Hahn
>> PhD fellow
>> University of Oslo
>> Norway

--
---------------------------------------------------------------
Dr. Heiner Kuhl            MPI Molecular Genetics
Tel: +49 30 / 8413 1551    Next Generation Sequencing
Ihnestrasse 73             email: ku...@mo...
D-14195 Berlin             http://www.molgen.mpg.de
---------------------------------------------------------------