From: Zhao, D. <Dav...@ST...> - 2009-12-13 12:45:33
Mark,

Those are some great results that you described in your email. I'll download and play with GATK to get more detailed information, and I especially want to try it on SGE. I would love to take advantage of parallel computing.

Thanks.
David

-----Original Message-----
From: Mark A. DePristo [mailto:dep...@br...]
Sent: Saturday, December 12, 2009 8:25 AM
To: Zhao, David
Cc: Sean Davis; sam...@li...; gs...@br...; Matt Hanna; Aaron McKenna
Subject: Re: [Samtools-help] Fastest Way From BAM to SAM

I do think the GATK can be used to do distributed merging and conversion to SAM. If you have N BAM files available, each indexed separately, you can run:

    java -jar GenomeAnalysisTK -T PrintReads -I bam1 -I bam2 -o merged.sam -L chr1:1-10,000,000

This will dynamically merge the N input BAM files and print the resulting merged SAM record stream to merged.sam for all reads covering chr1 1-10mb. You can run this any number of ways parallel, one region per job, to generate separately merged SAM files. Reads that span a region boundary will show up in both neighboring outputs, so you'll get some duplicate reads at the boundaries. If you keep the breaks at chromosome boundaries it'll be fine.

You can use the GATK on SGE -- it's just a command line tool and will work anywhere you have Java 1.6 and a shell available.

Also, it's worth noting that this is precisely how we create merged BAM files for the pilot 1 arm of the 1000 Genomes project. On a thumper file system (not particularly fast) we can merge ~5 TB of BAM files from 180 individuals into 3 populations of 60 individuals, with 23 single-chromosome BAM files each, in < 10 hours running only 10 ways parallel on our farm. Each single job there is accessing 60 BAM files + indices, so this is something like the equivalent of 600 parallel reads and writes.

Have a look at the docs here:

http://www.broadinstitute.org/gsa/wiki/index.php/Built-in_walkers

Best,
Mark

On Dec 12, 2009, at 8:21 AM, Zhao, David wrote:

> Haven't used GATK map/reduce yet. Can GATK be deployed on a Sun Grid
> Engine?
>
> Thanks.
> David
>
> -----Original Message-----
> From: sea...@gm... [mailto:sea...@gm...] On Behalf Of
> Sean Davis
> Sent: Friday, December 11, 2009 8:44 PM
> To: Zhao, David
> Cc: sam...@li...
> Subject: Re: [Samtools-help] Fastest Way From BAM to SAM
>
> On Fri, Dec 11, 2009 at 6:10 PM, Zhao, David <Dav...@st...>
> wrote:
>> If I have a large BAM file (e.g. 200 GB) and need to convert it to a
>> SAM file in a very short period of time (e.g. 15 minutes), what
>> options do I have?
>>
>
> My guess is that the process is IO limited. You would need a VERY
> fast disk system on a local machine to do this, even if you just read
> and write without a conversion. Even on a cluster, you may have
> problems if the cluster shares a single disk subsystem that is not
> distributed.
>
>>
>> Are there tools to split the large BAM file into multiple small BAM
>> files, so that I can use a grid to run multiple BAM to SAM
>> conversions in parallel, and then merge these multiple SAM files
>> back together into one SAM file?
>
> I wonder if GATK could be used in a hadoop cluster (which would, then,
> have a distributed file system) to accomplish the task?
>
> Sean

Mark A. DePristo, Ph.D.
Manager, Medical and Population Genetics Analysis
Broad Institute of Harvard and MIT
dep...@br...
ma...@de...
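
For anyone who wants to script the per-chromosome approach described in this thread, here is a minimal sketch, not a tested pipeline. It only builds the command lines (one per region) and prints them, so they can be handed to whatever scheduler you use. The flags are copied from the command in Mark's message. The function name, the output file naming scheme, and passing a bare chromosome name to -L are my own assumptions for illustration, so check them against the GATK docs before relying on them.

```python
import shlex


def build_print_reads_commands(bam_files, regions, jar="GenomeAnalysisTK"):
    """Build one PrintReads command (as an argument list) per region.

    The flags (-T, -I, -o, -L) mirror the command quoted in the thread.
    The helper name and the output naming scheme are hypothetical.
    """
    commands = []
    for region in regions:
        cmd = ["java", "-jar", jar, "-T", "PrintReads"]
        # Every job reads all input BAM files and merges them on the fly.
        for bam in bam_files:
            cmd += ["-I", bam]
        # Hypothetical output name: one SAM file per region.
        out = "merged.{}.sam".format(region.replace(":", "_"))
        # Assumption: -L accepts a bare chromosome name such as "chr1".
        cmd += ["-o", out, "-L", region]
        commands.append(cmd)
    return commands


if __name__ == "__main__":
    bams = ["bam1", "bam2"]
    # Splitting at chromosome boundaries avoids duplicate boundary reads.
    chromosomes = ["chr1", "chr2", "chr3"]
    for cmd in build_print_reads_commands(bams, chromosomes):
        # Print shell-safe command lines; submit them with your scheduler.
        print(" ".join(shlex.quote(part) for part in cmd))
```

Each printed line is an independent job, so on SGE you would wrap each one in your usual submission command. Splitting by whole chromosome follows the advice in the thread about keeping the breaks at chromosome boundaries.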