Re: [wgs-assembler-users] PGM, multithreading and default settings

Hi Sebastian.

I'll try to answer some of your questions, see below.

On 18 February 2013 16:52, Sebastian Juenemann
<jue...@ce...> wrote:
> Hi everybody,
>
> i want to use CELERA to assemble data from different platforms (PGM,
> MiSeq, GS Junior) independently and i'm a bit confused about how to set
> the parameters right.
>
> First off, does anyone already have experience using CELERA and Ion
> Torrent PGM data? I would suggest to treat PGM data as 454 data in terms
> of parameters but some shared experience could help. Also, as PGM
> provides sff files i would like to know if CELERA is capable to handle
> PGM based sff files or if i should better use fastq exports.

I don't think CA has explicit support for PGM data, so I think you
should just treat it like 454 data. You could either use sffToCA
(http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=SffToCA)
or convert to fastq and use fastqToCA
(http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=FastqToCA).
I'm not sure which options you should use with sffToCA, which clear
range etc.

>
> Next, which options should be set to which values in order to run CELERA
> in a non-grid parallel environment. Say i do have a single machine
> having 12 CPUs (sockets) each consisting of 4 cores and i want that each
> consecutive step of CELERA is using all 48 cores. Is there one single
> option for that or do i have to set all *Threads options individually
> (e.g. 'ovlThreads=48', 'mbtThreads=48' etc.). Also, are all those steps
> for which a thread number can be given really executed consecutively?
> Regarding the memory consumption, are all *Memory settings (as e.g.
> 'merylMemory') process based? If so it's crucial to know, because if
> relying heavily on multithreading the memory should be raised
> accordingly. Would be something like memory=number_of_threads*400MB a
> good rule of thumb estimate for those settings?

For some steps you can set both the number of processes and number of
threads. The number of threads should not affect memory much.

I often use this for ovl:
ovlConcurrency=4
ovlThreads=8
This will run 4 concurrent processes, each using 8 threads.
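
As a rough sketch of the arithmetic (Python; the function names are
made up for illustration and are not part of Celera Assembler), the
number of cores kept busy is the product of the two settings, assuming
every process keeps all of its threads running:

```python
def cores_in_use(concurrency: int, threads: int) -> int:
    """Cores busy at once if each process keeps all its threads running."""
    return concurrency * threads


def full_splits(total_cores: int) -> list[tuple[int, int]]:
    """All (concurrency, threads) pairs whose product is exactly total_cores."""
    return [
        (c, total_cores // c)
        for c in range(1, total_cores + 1)
        if total_cores % c == 0
    ]


TOTAL_CORES = 48  # the 12 x 4 core machine from the question

print(cores_in_use(4, 8))        # 32, so the settings above leave 16 cores idle
print(full_splits(TOTAL_CORES))  # candidate splits that would use all 48 cores
```

Which split works best depends on memory as well, since each extra
process needs its own memory.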

Memory consumption is sometimes harder to control. It's easy for meryl:
merylMemory=32768
will let meryl use 32 GB of RAM (the value is given in MB).

For overlapper, I use
ovlHashBits = 24
ovlHashBlockLength = 800000000
which will consume about 13 GB RAM per process, so with 4 processes
(which I set above) this will consume a total of about 52 GB.
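
The memory budget can be sketched the same way (Python; helper names
are hypothetical). This assumes merylMemory is given in MB, and it
takes the 13 GB per overlapper process as an observed figure for the
settings above, not something derived from ovlHashBits:

```python
MB_PER_GB = 1024


def gb_to_mb(gigabytes: int) -> int:
    """Convert GB to MB for options that expect a value in MB."""
    return gigabytes * MB_PER_GB


def total_overlap_memory_gb(per_process_gb: int, concurrency: int) -> int:
    """Peak overlapper memory when all concurrent processes run at once."""
    return per_process_gb * concurrency


print(gb_to_mb(32))                    # 32768
print(total_overlap_memory_gb(13, 4))  # 52
```

Threads do not enter the calculation, only the number of concurrent
processes does.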

>
> There are also a lot of other options for which i could not find any
> useful recommendation or documentation. Do i have to set all those >150
> options in order to get good results or does CELERA have good default
> values based on the input data. This confusion already starts at what
> overlapper and unitigger should i use for PGM and MiSeq (ovl, mer, umd
> and utg, bog, bogart)? Where are the main differences? Does CELERA
> determine the input type based on the filename or on the type given when
> converting to FRG format (-type).

Use ovl for all the different types, PGM, Illumina, 454, PacBio etc.
For longer reads with not a huge amount of coverage, use bog, while
for shorter reads with a lot of coverage, use bogart. Someone else
needs to explain it better, but I usually use bogart for Illumina
reads or Illumina reads in combination with other reads, while I use
bog for 454 and PacBio reads.
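
My habit above, written down as a small Python sketch. This is only a
personal rule of thumb, not an official recommendation; the function
name is made up, and putting PGM on the bog side is an assumption that
follows from treating PGM like 454:

```python
# Platforms treated as Illumina; MiSeq is an Illumina instrument.
ILLUMINA_LIKE = {"illumina", "miseq"}


def suggest_unitigger(platforms: list[str]) -> str:
    """Return 'bogart' if any Illumina reads are in the mix, else 'bog'."""
    names = {p.lower() for p in platforms}
    return "bogart" if names & ILLUMINA_LIKE else "bog"


print(suggest_unitigger(["MiSeq"]))            # bogart
print(suggest_unitigger(["454", "PacBio"]))    # bog
print(suggest_unitigger(["454", "Illumina"]))  # bogart
```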

You have to set -type for any quality scores that are not sanger (but I
think sanger is the default for 454 and Illumina today, and probably PGM
and PacBio too). You have to set -technology for everything that's not
Illumina (MiSeq is Illumina).

>
> I know these are many questions but there are also a lot of settings that
> can be tweaked, so i hope you forgive me.

Set the options you have a good guess about (all the concurrency and
thread options for example) and then just try the defaults and see how
it works. There are some pretty good explanations on the website
(http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=RunCA),
but you have to tweak it to your organism and server. You have to
experiment a bit, and if it doesn't work, ask again on this mailing
list describing what you've done, and someone will usually be able to
come up with some good advice.

Ole

>
> Thanks,
>
> Sebastian
>
> --
> Sebastian Jünemann
> CeBiTec - Center for Biotechnology
> Bielefeld University, D-33594 Bielefeld, Germany
> Office: V6-147 -- Phone: +49-(0)521-106-4827
> eMail: jue...@Ce...
>