Re: [svtoolkit-help] scalability


Hi, Amrita,
I'll take a stab below....

On 1/27/15, 7:27 PM, Basu, Amrita wrote:
> I am a potential user of GenomeStrip and had a few support related questions in order to make sure this tool is compatible with my system to process thousands of genomes.
>
>
>    1.  Is it hadoop-friendly?
No, the methods are designed around direct BAM file access (or CRAM in 
the future).
>    2.  Does it have a parallelization strategy?
Yes, based on the Queue workflow engine.  See the documentation pages 
here: http://www.broadinstitute.org/software/genomestrip/documentation
>    3.  Is the algorithm tractable? Any I/O issues?
Well, all NGS algorithms have to swallow a lot of data, so disk 
bandwidth tends to be an issue at scale.

We have focused a lot of our efforts recently on the scalability of our 
new CNV discovery pipeline.
We are routinely calling in batches of 500-1000 deep (30x) whole genomes 
with this pipeline and we appear to have headroom to scale further.

The older deletion discovery pipeline has suffered a bit from lack of 
recent attention to performance, but we did successfully run it on 1000 
Genomes Phase 3 (2500 individuals at 4-8x coverage).
The best strategy for scaling up the deletion discovery pipeline is to 
limit the batch sizes (we ran 1000G phase 3 in 5 batches of 500 samples 
each).
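
To make the batching idea concrete, here is a minimal sketch of splitting a cohort into fixed-size batches. It is only an illustration, not part of Genome STRiP: the function name make_batches and the sample IDs are hypothetical, and a real run would take its samples from your own list of BAM files.

```python
def make_batches(sample_ids, batch_size=500):
    """Split sample IDs into consecutive batches of at most batch_size."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [
        sample_ids[i:i + batch_size]
        for i in range(0, len(sample_ids), batch_size)
    ]


# Hypothetical sample IDs standing in for a 2500-sample cohort.
samples = [f"SAMPLE{n:04d}" for n in range(2500)]
batches = make_batches(samples, batch_size=500)

# Each batch would then be run through deletion discovery separately.
print(len(batches), [len(b) for b in batches])  # 5 [500, 500, 500, 500, 500]
```

If the cohort size is not a multiple of the batch size, the last batch is simply smaller.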

>    4.  Scalable? Can be run on multiple nodes? Which parts are not scalable?
Yes, see above.
Our design target is for all of the code to run in Java on large 
data sets with a maximum heap of 4 GB.
We can't always achieve this, but most of the code meets this criterion.
One notable place where the memory depends on the input data is the read 
pair clustering step in deletion discovery.
That part of the code probably has the worst theoretical (and practical) 
scalability.
>    5.  How well does it do on the genome in a bottle dataset?
As far as I know, Genome in a Bottle covers only NA12878.
Genome STRiP is a set of population-based methods, so we haven't tried 
to benchmark against Genome in a Bottle.
In theory, calling more samples together should improve results.  In our 
recent testing, we are finding this to be even more true than I would 
have expected.
For example, we do OK when calling in 100 deep (30x) whole genome 
samples, but much better when we call in 500 or 1000 together.

Hope this helps,
-Bob
>
> Thanks,
> Amrita