From: Brian W. <th...@gm...> - 2015-01-15 23:35:16
I can't argue with the option bloat in CA. There are a lot of options that should be removed, or that shouldn't have been exposed in the first place.

This is the first time I've seen merTrim be a bottleneck. I suspect it might be spending lots of time building data structures. I'll admit that runCA support for this part is weak; on large assemblies, I run the trimming by hand. The merTrim binary has a '-enablecache' option that will build, dump, and reuse the data structures between jobs, but there isn't runCA support for it.

Ah! If that is your bottleneck, then we are moving the wrong way by making jobs smaller. We want to be generating one job with 48 threads enabled: build the data structures once, then let 48 threads process all the reads in the same job. I had been assuming you weren't getting multiple threads for some other reason.

I'm also none too pleased with SourceForge performance. They killed off support for MediaWiki, forcing everyone either to rewrite their pages for the inferior replacement wiki (no tables in the markup!) or to install individual MediaWiki instances. It's free, so I can't really complain too much.

On Thu, Jan 15, 2015 at 12:59 PM, mathog <ma...@ca...> wrote:

> On 15-Jan-2015 09:02, Brian Walenz wrote:
>
>> The option you're looking for is mbtThreads, with a default of 4.
>>
>> Also look into option mbtBatchSize, which sets how many reads to process
>> per job. The default is 1 million, and you've already got at least 48
>> jobs, so this is probably not an issue.
>
> (snip)
>
>> So, in summary, I don't know why you're not getting multiple CPUs on
>> these. You can work around the problem by dropping the batch size to make
>> jobs with about 8 GB memory (smaller than 512/48), then run 48 jobs in
>> parallel.
>
> So many options, so little time. I don't suppose anybody has put together
> a script that asks for the relevant system and data information and then
> emits a SPEC file to run at something approximating optimal speed on the
> equipment at hand?
> The input would be something like (no doubt I'm leaving out key
> information):
>
>   primary node:
>     RAM=, CPU=, DISK=        # fill in the max to use; actual could be more
>   cluster: Y                 # N if none
>     type=older  N=10, RAM=, CPU=, DISK=
>     type=newer  N=20, RAM=, CPU=, DISK=
>     queue_system=SGE
>   FRG types: 2               # at least 1
>     Illumina  N=3, totalreads=
>     Sanger    N=2, totalreads=
>
> As it is now, there are a lot of parameters to fiddle with:
>
>   runCA -options | wc
>     184   <- !!!!
>
> They probably all make perfect sense to people experienced with this
> software, but they are fairly mysterious when first encountered.
>
> In any case, I did try modifying the -t parameter in 0-mertrim/mertrim.sh
> while the jobs were running, and the new settings "took" as each new job
> started. The run times were:
>
>   -t   ~minutes
>    4   22
>   16   14
>   40   12-13
>
> So there isn't much to be gained by pushing that parameter up.
>
>> You can increase the number of jobs running at once with mbtConcurrency.
>
> Kind of my point about the script: I overlooked that one. I did use
> merylThreads, but didn't realize that trimming and counting used different
> parameters. Concurrency x Threads, that is, simultaneous jobs x CPUs per
> job? There are 7 of the former parameters and 6 of the latter. Presumably,
> if I spent a couple of hours reading all the documentation (which for some
> reason has been loading really, really slowly from SourceForge), I could
> make a guess at what would probably work best. The hypothetical script I
> alluded to would be a lot more convenient!
>
> Thanks,
>
> David Mathog
> ma...@ca...
> Manager, Sequence Analysis Facility, Biology Division, Caltech
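The spec-file generator asked for above could at least be started in a few lines of shell. This is a minimal sketch, not anything that ships with CA: it assumes a single node, covers only the mer-trim parameters named in this thread (mbtConcurrency and mbtThreads), and takes the 8 GB-per-job figure from the workaround quoted above; the RAM and CPU values are placeholders to fill in.

```shell
#!/bin/sh
# Hypothetical spec-fragment generator; not part of runCA.
# Sizes the mer-trim stage only, using the rule of thumb from the
# thread: Concurrency x Threads = simultaneous jobs x CPUs per job.

RAM_GB=512        # fill in: max RAM to use on the node
CPUS=48           # fill in: max CPUs to use on the node
MEM_PER_JOB=8     # GB per mer-trim job (the workaround's figure)

JOBS=$(( RAM_GB / MEM_PER_JOB ))        # memory-limited job count
[ "$JOBS" -gt "$CPUS" ] && JOBS=$CPUS   # never exceed the CPU count
THREADS=$(( CPUS / JOBS ))              # CPUs left over for each job

cat <<EOF
mbtConcurrency=${JOBS}
mbtThreads=${THREADS}
EOF
```

With the 512 GB / 48 CPU placeholders this emits mbtConcurrency=48 and mbtThreads=1, i.e. the 48-parallel-jobs workaround from the quoted message; a real generator would also have to cover the cluster/queue inputs and the rest of the ~184 options.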