Re: [Samtools-help] Fastest Way From BAM to SAM

Mark,

Those are some great results that you described in your email. I'll download and play with GATK to get more detailed information, and I especially want to try it on SGE. I would love to take advantage of parallel computing.

Thanks.
David
-----Original Message-----
From: Mark A. DePristo [mailto:dep...@br...] 
Sent: Saturday, December 12, 2009 8:25 AM
To: Zhao, David
Cc: Sean Davis; sam...@li...; gs...@br...; Matt Hanna; Aaron McKenna
Subject: Re: [Samtools-help] Fastest Way From BAM to SAM

I do think the GATK can be used to do distributed merging and  
conversion to SAM.  If you have N bam files available, each indexed  
separately, you can run:

	java -jar GenomeAnalysisTK.jar -T PrintReads -I bam1 -I bam2 -o merged.sam -L chr1:1-10,000,000

This will dynamically merge the N input BAM files and print the
resulting merged SAM record stream to merged.sam for all reads
covering chr1 1-10 Mb.  You can run this any number of ways in
parallel, one interval per job, to generate separately merged SAM
files.  You'll get some duplicate reads at the interval boundaries,
but if you keep the breaks at chromosome boundaries it'll be fine.
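
As a rough illustration of that fan-out, here is a minimal Python sketch that builds one PrintReads command line per interval, using only the flags shown in the command above (-T, -I, -o, -L). The function name, the default jar path, the output naming scheme, and the use of a bare contig name such as chr1 as a whole-chromosome interval are assumptions made for the example, not something this thread confirms.

```python
def build_print_reads_commands(bam_files, intervals, jar="GenomeAnalysisTK.jar"):
    """Return one PrintReads argument list per interval.

    Hypothetical helper: every job reads all input BAMs and writes the
    merged reads for its own interval to its own SAM file.
    """
    commands = []
    for interval in intervals:
        cmd = ["java", "-jar", jar, "-T", "PrintReads"]
        for bam in bam_files:
            cmd += ["-I", bam]
        # Output name is an assumed convention, e.g. merged.chr1.sam
        out_name = "merged." + interval.replace(":", "_") + ".sam"
        cmd += ["-o", out_name, "-L", interval]
        commands.append(cmd)
    return commands


if __name__ == "__main__":
    # One job per chromosome keeps the breaks at chromosome boundaries.
    for cmd in build_print_reads_commands(["bam1", "bam2"], ["chr1", "chr2"]):
        print(" ".join(cmd))
```

Each printed line could then be handed to whatever scheduler is available as an independent job.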

You can use the GATK on SGE -- it's just a command-line tool and
will work anywhere you have Java 1.6 and a shell available.

Also, it's worth noting that this is precisely how we create merged
BAM files for the pilot 1 arm of the 1000 Genomes project.  On a
thumper file system (not particularly fast) we can merge ~5 TB of BAM
files from 180 individuals into 3 populations of 60 individuals,
each with 23 single-chromosome BAM files, in < 10 hours running only
10 ways parallel on our farm.  Each single job there is accessing
60 BAM files + indices, so this is something like the equivalent of
600 parallel reads and writes.
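
To put those figures next to the original question (200 GB in 15 minutes), here is a back-of-the-envelope calculation in Python. It assumes decimal units (1 TB = 10^12 bytes), counts input bytes only, and treats "< 10 hours" as exactly 10 hours, so the merge rate is a lower bound; the function name is made up for the example.

```python
def throughput_mb_per_s(total_bytes, seconds):
    """Average throughput in MB/s (decimal megabytes, an assumed convention)."""
    return total_bytes / seconds / 1e6


TB = 1e12
GB = 1e9

# ~5 TB of input merged in under 10 hours: lower bound on aggregate read rate.
merge_rate = throughput_mb_per_s(5 * TB, 10 * 3600)

# Read rate needed to get through a 200 GB BAM in 15 minutes.
question_rate = throughput_mb_per_s(200 * GB, 15 * 60)

# 10 jobs at a time, each with 60 BAM files open.
concurrent_bam_streams = 10 * 60

print(f"merge: >= {merge_rate:.1f} MB/s aggregate read")
print(f"200 GB in 15 min: {question_rate:.1f} MB/s read")
print(f"concurrent BAM streams: {concurrent_bam_streams}")
```

The second figure covers reading the compressed BAM only; writing the uncompressed SAM output would add to it.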

Have a look at the docs here:

http://www.broadinstitute.org/gsa/wiki/index.php/Built-in_walkers

Best,

Mark

On Dec 12, 2009, at 8:21 AM, Zhao, David wrote:

> Haven't used GATK map/reduce yet. Can GATK be deployed on a Sun Grid
> Engine?
>
> Thanks.
> David
>
> -----Original Message-----
> From: sea...@gm... [mailto:sea...@gm...] On Behalf Of  
> Sean Davis
> Sent: Friday, December 11, 2009 8:44 PM
> To: Zhao, David
> Cc: sam...@li...
> Subject: Re: [Samtools-help] Fastest Way From BAM to SAM
>
> On Fri, Dec 11, 2009 at 6:10 PM, Zhao, David <Dav...@st...>  
> wrote:
>> If I have a large BAM file (e.g. 200 GB) and need to convert to a  
>> SAM file
>> in a very short period of time (e.g. 15 minutes), what options do I  
>> have?
>>
>
> My guess is that the process is IO limited.  You would need a VERY
> fast disk system on a local machine to do this, even if you just read
> and write without a conversion.  Even on a cluster, you may have
> problems if the cluster shares a single disk subsystem that is not
> distributed.
>
>>
>> Are there tools to split the large BAM file into multiple small BAM  
>> files,
>> then I can use a grid to run multiple BAM to SAM in parallel, and  
>> merge
>> these multiple SAM files back together into one SAM file?
>
> I wonder if GATK could be used in a hadoop cluster (which would, then,
> have a distributed file system) to accomplish the task?
>
> Sean
>
>

Mark A. DePristo, Ph.D.
Manager, Medical and Population Genetics Analysis
Broad Institute of Harvard and MIT
dep...@br...
ma...@de...