From: kuhl <ku...@mo...> - 2012-04-25 10:04:08
Dear Christoph,

I have successfully done an assembly of about 350 million reads for a
1.2 Gb genome using Celera Assembler 6.1 (which version do you use?)
from 454 and Solexa data. Anyway, it took about 1.5 months to complete
on a 48-core server, used plenty of disk space (2-3 TB), and involved a
lot of manual work correcting failed contigs. So 10 days might not be
enough. (The data will not be lost after the ten days, as you can
resume the overlap.sh jobs manually; everything done so far is saved to
disk.)

I also see from your mail that you are using a very high coverage of
your genome. Celera may not benefit from that. Maybe you could reduce
your dataset to 50-70X coverage. That would reduce the computing time
dramatically, as computing time increases roughly quadratically with
read number, i.e. with coverage (see the first sketch below). It also
depends on how you configured the overlapper: depending on the
configuration, the overlap jobs may each take longer and longer, or
stay more or less constant in computing time per job.

Another possibility, which I tried for a different genome (2.5 Gb, 10^9
reads; I did not want to wait for three months...), is to use a de
Bruijn graph assembler to assemble the Illumina data (I would recommend
SOAPdenovo or CLC; the latter can also make use of the 454 data), split
the resulting scaffolds into contigs smaller than 32,000 bp (see the
second sketch below), and feed them, together with the 454 data and a
little (i.e. 5X) coverage of the Illumina paired ends, into the
long-read version of Celera Assembler supplied with the PacBio
correction pipeline. These steps took about one week and delivered a
much better assembly than de Bruijn graph assemblers alone.
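To make the coverage argument concrete, here is a back-of-the-envelope
sketch in Python. The genome size, read count, and target coverage are
illustrative placeholders rather than numbers from this thread, and the
quadratic scaling is the assumption stated above:

    # Rough estimate of current coverage and of the speedup expected
    # from downsampling to ~60X.  All inputs are placeholder values.
    genome_size = 85e6        # assumed genome size in bp
    read_bases = 200e6 * 76   # e.g. 200 million 76 bp Illumina reads

    coverage = read_bases / genome_size
    target = 60.0             # desired coverage after downsampling

    if coverage > target:
        keep = target / coverage              # fraction of reads to keep
        speedup = (coverage / target) ** 2    # quadratic-scaling assumption
        print(f"~{coverage:.0f}X now; keep {keep:.0%} of the reads for "
              f"~{target:.0f}X and ~{speedup:.0f}x faster overlapping")
    else:
        print(f"already at ~{coverage:.0f}X; no downsampling needed")

Any random read sampler can then do the actual subsampling; seqtk
sample, for example, takes a random seed and the fraction of reads to
keep.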
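For the scaffold-splitting step, a minimal sketch, assuming plain FASTA
scaffolds from the de Bruijn assembler; the file names and the exact
31,999 bp cap are placeholders chosen to stay under the 32,000 bp limit
mentioned above:

    import re

    LIMIT = 31999  # keep every output contig under 32,000 bp

    def read_fasta(path):
        """Yield (name, sequence) pairs from a FASTA file."""
        name, chunks = None, []
        with open(path) as fh:
            for line in fh:
                line = line.strip()
                if line.startswith(">"):
                    if name is not None:
                        yield name, "".join(chunks)
                    name, chunks = line[1:].split()[0], []
                elif line:
                    chunks.append(line)
        if name is not None:
            yield name, "".join(chunks)

    with open("contigs_lt32k.fasta", "w") as out:
        for name, seq in read_fasta("soap_scaffolds.fasta"):
            n = 0
            # break each scaffold into contigs at runs of N (the gaps)...
            for contig in (c for c in re.split("[Nn]+", seq) if c):
                # ...and chop anything still longer than the cap
                for start in range(0, len(contig), LIMIT):
                    n += 1
                    out.write(f">{name}_part{n}\n"
                              f"{contig[start:start + LIMIT]}\n")

Breaking at runs of N first keeps the de Bruijn assembler's gap padding
from entering Celera as literal sequence; the inner loop then only has
to chop the contigs that are still over the cap.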
A question to other users/developers: have you also noticed that, when
Illumina reads are stored in the packed format, the overlap jobs do not
reach the maximum speed they should? I mean, for example, an overlap
job configured for 12 threads running on only 8 threads on average. Has
anyone encountered this problem?

I wish you good luck,
Heiner

On Tue, 24 Apr 2012 19:59:07 +0200, Christoph Hahn <chr...@gm...> wrote:
> Thanks for that, Ariel! Leaves me with little hope though..
> Nevertheless, I understand that these kinds of jobs did finish in your
> experience, right?
>
> From my tests and the number of overlap.sh jobs created in the initial
> phase, I assumed I was on the safe side with a wall clock limit of 10
> days to finish this stage. I can maybe ask the cluster administration
> to extend the wall clock limit, but I'd need some estimate of by how
> much..
>
> I am using some 1.1 million 454 reads (~500 bp in length) plus some
> 200 million paired-end reads plus some 14 million single-end Illumina
> reads (76 bp read length, respectively). The genome is estimated to be
> only about 70-100 Mb in size, but we have reason to expect a
> substantial amount of contamination from the host (as we are dealing
> with a parasitic organism), and also a fair bit of polymorphism, as
> the libraries were prepared from a pooled sample.
>
> Can anyone suggest a reasonable time frame for reaching a checkpoint
> from which I can then resume the assembly?
>
> Thanks in advance!!
>
> Christoph
>
> On 24.04.2012 18:47, Schwartz, Ariel wrote:
>> I have experienced the same issue with our hybrid assemblies.
>> Currently I am waiting for an overlap job that has been running for
>> almost two weeks.
>>
>> I wonder if there are some recommended settings that could be used
>> to alleviate this problem.
>>
>> Thanks,
>>
>> Ariel
>>
>> Ariel Schwartz, Ph.D.
>> Senior Scientist, Bioinformatics
>> Synthetic Genomics, Inc.
>>
>> On 4/24/12 4:44 AM, "Christoph Hahn" <chr...@gm...> wrote:
>>
>> Dear CABOG developers and users,
>>
>> I am trying to do a hybrid assembly using a combination of 454 and
>> single- as well as paired-end Illumina data.
>>
>> After initial trouble with optimization in the 0-overlaptrim-overlap
>> stage of my assembly, I got it to run successfully, and during the
>> previous 7+ days the pipeline successfully completed some 2260
>> overlap.sh jobs. Now I am encountering something strange: the last
>> pending overlap.sh job (2148 of 2261) has now been running for over
>> 36 hours. The 002148.ovb.WORKING.gz file created by this job is
>> slowly but steadily growing; it presently has some 631 MB. Is this
>> normal? Has anyone had a similar experience before? Maybe it will
>> sort itself out eventually anyway; I am just a little concerned that
>> CABOG will not finish the job before it hits the 10-day wall clock
>> limit set on the cluster for the job, which would result in
>> thousands of CPU hours going down the drain..
>>
>> Please share your wisdom with me!
>>
>> much obliged,
>> Christoph Hahn
>> PhD fellow
>> University of Oslo
>> Norway

--
---------------------------------------------------------------
Dr. Heiner Kuhl            MPI Molecular Genetics
Tel: +49 30 / 8413 1551    Next Generation Sequencing
Ihnestrasse 73             email: ku...@mo...
D-14195 Berlin             http://www.molgen.mpg.de
---------------------------------------------------------------