Re: [svtoolkit-help] scalability
From: Bob H. <han...@br...> - 2015-01-28 14:28:46
Hi, Amrita,

I'll take a stab below...

On 1/27/15, 7:27 PM, Basu, Amrita wrote:
> I am a potential user of GenomeStrip and had a few support related questions in order to make sure this tool is compatible with my system to process thousands of genomes.
>
> 1. Is it hadoop-friendly?

No, the methods are designed around direct BAM file access (or CRAM in the future).

> 2. Does it have a parallelization strategy?

Yes, based on the Queue workflow engine. See the documentation pages here:
http://www.broadinstitute.org/software/genomestrip/documentation

> 3. Is the algorithm tractable? Any I/O issues?

Well, all NGS algorithms have to swallow a lot of data, so disk bandwidth tends to be an issue at scale.

We have focused a lot of our efforts recently on the scalability of our new CNV discovery pipeline. We are routinely calling in batches of 500-1000 deep (30x) whole genomes with this pipeline and we appear to have headroom to scale further.

The older deletion discovery pipeline has suffered a bit from lack of recent attention to performance, but we did successfully run it on 1000 Genomes Phase 3 (2500 individuals at 4-8x coverage). The best strategy for scaling up the deletion discovery pipeline is to limit the batch sizes (we ran 1000G Phase 3 in 5 batches of 500 samples each).

> 4. Scalable? Can be run on multiple nodes? Which parts are not scalable?

Yes, see above. Our design target is to write all of the code to run in Java on large data sets using a maximum heap of 4 GB. We can't always achieve this, but most of the code meets this target. One notable place where the memory depends on the input data is the read pair clustering step in deletion discovery. That part of the code probably has the worst theoretical (and practical) scalability.

> 5. How well does it do on the genome in a bottle dataset?

As far as I know, Genome in a Bottle is only NA12878. Genome STRiP is a set of population-based methods, so we haven't tried to benchmark against Genome in a Bottle.

In theory, calling more samples together should improve results. In our recent testing, we are finding this to be even more true than I would have expected. For example, we do OK when calling in 100 deep (30x) whole genome samples, but much better when we call in 500 or 1000 together.

Hope this helps,
-Bob

> Thanks,
> Amrita
>
> _______________________________________________
> svtoolkit-help mailing list
> svt...@li...
> https://lists.sourceforge.net/lists/listinfo/svtoolkit-help
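
The batching advice in the reply (cap the batch size, but keep each batch large because results improve with more samples called together) can be illustrated with a small sketch. This is not part of Genome STRiP: the function name `make_batches`, the default cap of 500, and the sample file names are all hypothetical, chosen only to mirror the "5 batches of 500" figure quoted for 1000 Genomes Phase 3. It splits a sample list into the fewest batches that respect the cap and keeps the batches near-equal in size, so that no batch ends up as a small leftover.

```python
import math


def make_batches(samples, max_batch_size=500):
    """Split samples into the fewest batches that respect max_batch_size.

    Batch sizes differ by at most one, so there is never a tiny leftover batch.
    Hypothetical helper for illustration; not a Genome STRiP API.
    """
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    n = len(samples)
    if n == 0:
        return []

    # Fewest batches that keep every batch at or under the cap.
    n_batches = math.ceil(n / max_batch_size)
    # Spread the remainder over the first `extra` batches, one sample each.
    base, extra = divmod(n, n_batches)

    batches = []
    start = 0
    for i in range(n_batches):
        size = base + (1 if i < extra else 0)
        batches.append(samples[start:start + size])
        start += size
    return batches


if __name__ == "__main__":
    # Hypothetical file names standing in for 2500 per-sample BAM files.
    samples = [f"sample_{i:04d}.bam" for i in range(2500)]
    for index, batch in enumerate(make_batches(samples), start=1):
        print(f"batch {index}: {len(batch)} samples")
```

Each resulting batch would then be run through the pipeline as a separate job; how the batch is handed to the tool is left out here because the reply does not describe it.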