From: Christoph H. <chr...@gm...> - 2012-04-25 11:41:08
Hi Heiner,

Thanks for your effort and helpful comments! The overlap job did actually finish now, but unfortunately CABOG crashed right afterwards because it ran out of disk space. Very unfortunate, but I have to ask for more disk space before I can resume the assembly manually. I was not expecting to complete the whole assembly in ten days, just the overlap-trim stage for now.

Concerning reducing the coverage: I thought about that, but I have also tested several de Bruijn graph assemblers and found that I get the best results when using all the Illumina data instead of only a subset of it. The Illumina data I am using is already error-corrected, so I decided to use it as it is and rely on the CABOG trimming algorithm. With stringent manual trimming prior to CABOG I could reduce the number of Illumina reads to some 160 million (paired-end reads). Also, I suppose leaving out the 14 million single-end Illumina reads will not substantially affect the result. That would leave some 160 million Illumina reads (76 bp) plus 1.1 million 454 reads (500 bp); assuming a 100 Mb genome, that is still a theoretical 130x coverage, and allowing for some 20-30% host and bacterial contamination we still reach about 100x.

The question now is what would be more effective: resume the assembly with the data as it is, or start from scratch with the trimmed data? A solution that is efficient in terms of runtime is unfortunately very important to me, as I only have a limited number of CPU hours available on the cluster. I can ask for more, but only after the initial quota is exceeded, and then it involves annoying bureaucracy and waiting time. That is just to clarify why CPU hours are such an issue for me; sorry to bother you with it.

I put quite some time and effort into the configuration of the overlap jobs to reach a hash table load of some 70%, as suggested on the manual page. This was not so easy because the load varied between libraries, so I decided to focus on the paired-end Illumina library, as it makes up the vast majority of the data. I had configured for 8 threads and the pipeline was constantly using all 8 threads. My Illumina data is in zipped format.
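To make that concrete, the part of the spec file I mean looks roughly like the sketch below. The option names are the runCA overlapper settings as far as I remember them, and the values and file names are only illustrative, not the exact ones I used:

    # overlap configuration (values are placeholders; I tuned ovlHashBits and
    # ovlHashBlockLength per library until the reported hash table load was ~70%)
    ovlThreads         = 8
    ovlHashBits        = 23
    ovlHashBlockLength = 180000000
    ovlRefBlockSize    = 32000000

    # run (or resume) as usual with something like:
    #   runCA -d run_dir -p asm -s asm.spec fragments/*.frg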
The alternative approach you mention below sounds very interesting, especially as I already have the best possible (I believe so, at least :-)) Solexa-only assembly available. Can you give me some more detailed information on that? Where can I find this Celera version? The snag is that I would need to convince the cluster administration to install the other Celera version. Almost forgot: I am using Celera Assembler 7.0 right now.

Thanks again for your suggestions, and apologies for the long message!

cheers,
Christoph

On 25.04.2012 11:28, kuhl wrote:
> Dear Christoph,
>
> I have successfully done an assembly of about 350 million reads for a 1.2 Gb
> genome using Celera Assembler 6.1 (which version do you use?) from 454 and
> Solexa data. It took about 1.5 months to complete on a 48-core server, used
> plenty of disk space (2-3 TB), and there was a lot of manual work with failed
> contigs that had to be corrected by hand. So 10 days might not be enough.
> (The data will not be lost after the ten days, as you can resume the
> overlap.sh jobs manually and everything done so far is saved to disk.) I also
> see from your mail that you are using a very high coverage of your genome.
> Celera may not profit from that. Maybe you could reduce your dataset to
> 50-70x coverage. That would reduce the computing time dramatically, as
> computing time increases roughly quadratically with read number (i.e. with
> coverage). It also depends on how you configured the overlapper: depending on
> the configuration, calculating the overlaps might take longer for each job,
> or be more or less constant in computing time per job.
>
> Another possibility I tried for a different genome (2.5 Gb, 10^9 reads; I did
> not want to wait for three months...) is to use a de Bruijn graph assembler
> to assemble the Illumina data (I would recommend SOAPdenovo or CLC, the
> latter can also make use of the 454 data), split the resulting scaffolds into
> contigs smaller than 32000 bp, and feed them together with the 454 data and a
> little (i.e. 5x) coverage of the Illumina paired ends into the long-read
> version of Celera Assembler supplied with the PacBio correction pipeline.
> These steps took about 1 week and delivered a much better assembly than
> using de Bruijn graph assemblers alone.
>
> Question to other users/developers: did you also experience that, if Illumina
> reads are stored in the packed format, the overlap jobs do not reach the
> maximum speed they should? I mean, for example, an overlap job configured for
> 12 threads running on only 8 threads on average. Has anyone encountered this
> problem?
>
>
> I wish you good luck,
>
> Heiner
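Regarding the scaffold-splitting step you describe: if I go down that route, I would probably split the scaffolds with something simple like the sketch below (untested; the 30000 bp chunk size, the "_part" naming and the file names are my own choices, picked just to stay safely under the 32000 bp limit). The pieces would of course still have to be converted into a frg file before Celera accepts them.

    # untested sketch: break de Bruijn scaffolds into unwrapped FASTA pieces
    # shorter than 32000 bp, keeping the original scaffold name plus a suffix
    awk -v MAX=30000 '
        function emit(    n) {
            n = 0
            while (length(seq) > 0) {
                n++
                print ">" name "_part" n
                print substr(seq, 1, MAX)
                seq = substr(seq, MAX + 1)
            }
        }
        /^>/ { if (seq != "") emit(); name = substr($1, 2); seq = ""; next }
             { seq = seq $0 }
        END  { if (seq != "") emit() }
    ' soap_scaffolds.fasta > scaffolds_lt32k.fasta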
> On Tue, 24 Apr 2012 19:59:07 +0200, Christoph Hahn
> <chr...@gm...> wrote:
>> Thanks for that Ariel! Leaves me with little hope though...
>> Nevertheless, I understand that these kinds of jobs did finish in your
>> experience, right?
>>
>> From my tests and the number of overlap.sh jobs created in the initial
>> phase, I assumed I would be on the safe side with a wall clock limit of
>> 10 days to finish this stage. I can maybe ask the cluster administration
>> to prolong the wall clock limit, but I'd need some estimate of by how
>> much.
>> I am using some 1.1 million 454 reads (~500 bp in length) plus some 200
>> million paired-end reads plus some 14 million single-end Illumina reads
>> (76 bp read length, respectively). The genome is estimated to be only
>> about 70-100 Mb in size, but we have reason to expect a substantial
>> amount of contamination from the host (as we are dealing with a
>> parasitic organism), and also a fair bit of polymorphism, as the
>> libraries were prepared from a pooled sample.
>>
>> Can anyone suggest a reasonable time frame for reaching a checkpoint
>> from which I can then resume the assembly?
>>
>> Thanks in advance!!
>>
>> Christoph
>>
>>
>> On 24.04.2012 18:47, Schwartz, Ariel wrote:
>>> I have experienced the same issue with our hybrid assemblies.
>>> Currently I am waiting for an overlap job that has been running for
>>> almost two weeks.
>>>
>>> I wonder if there are some recommended settings that could be used to
>>> alleviate this problem.
>>>
>>> Thanks,
>>>
>>> Ariel
>>>
>>> Ariel Schwartz, Ph.D.
>>> Senior Scientist, Bioinformatics
>>> Synthetic Genomics, Inc.
>>>
>>> On 4/24/12 4:44 AM, "Christoph Hahn" <chr...@gm...
>>> <mailto:chr...@gm...>> wrote:
>>>
>>> Dear CABOG developers and users,
>>>
>>> I am trying to do a hybrid assembly using a combination of 454 and
>>> single- as well as paired-end Illumina data.
>>>
>>> After initial trouble with optimization in the 0-overlaptrim-overlap
>>> stage of my assembly, I got it to run successfully, and during the
>>> previous 7+ days the pipeline successfully completed some 2260
>>> overlap.sh jobs.
>>> Now I am encountering something strange: the last pending overlap.sh
>>> job (2148 of 2261) has now been running for over 36 hours. The
>>> 002148.ovb.WORKING.gz file created by this job is slowly but steadily
>>> growing; it presently has some 631 MB. Is this normal? Has anyone had
>>> a similar experience before? Maybe it will sort itself out eventually
>>> anyway; I am just a little concerned that CABOG will not finish the
>>> job before it hits the 10-day wall clock limit that is set on the
>>> cluster for this job, which would result in thousands of CPU hours
>>> going down the drain.
>>>
>>> Please share your wisdom with me!
>>>
>>> much obliged,
>>> Christoph Hahn
>>> PhD fellow
>>> University of Oslo
>>> Norway
>>>
>>> _______________________________________________
>>> wgs-assembler-users mailing list
>>> wgs...@li...
>>> <mailto:wgs...@li...>
>>> https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users