From: Colin H. <co...@no...> - 2011-12-22 02:11:54
Hi Shaun,

Compression is pretty slow; using --fast helps a lot, and you still get most of the compression. It's a good option except perhaps for archival storage.

A lot of the samtools functions let you set the compression level to fast with option -1 or turn it off with -u. That option is in merge but not in sort, so with sort you're stuck with the default compression, which is quite slow. It's also a bit silly that sort doesn't have it, as a lot of sorted BAM files are subsequently merged, and fast compression for the sort could be a significant benefit.

If you have time, could you try Picard SortSam? It lets you set the compression level, so you could try it with the equivalent of --fast.

Kind Regards, Colin

On Thu, Dec 22, 2011 at 4:35 AM, Shaun Jackman <sja...@bc...> wrote:
> Hi Colin,
>
> Good point. I often work with SAM files or compressed SAM files rather
> than BAM files. For me, a tool that takes SAM input and produces SAM
> output is often more useful than a tool that produces BAM output.
>
> I ran some more timing tests taking a SAM input and producing a
> compressed output, either .bam or .sam.gz format. The compression, it
> seems, is as much work as the sorting (for a 3.5 GB SAM file).
>
> Jared Simpson pointed out that I should set the memory buffer to the
> same amount for the two tools. I've set the memory buffer to 8 GB for a
> 3.5 GB file so that both tools will sort entirely in main memory.
>
> The fastest way to sort and compress a SAM file was a UNIX sort piped
> into gzip --fast, which was 30% faster than samtools sort. The gzip
> --fast compressed SAM file was 18% larger than the BAM file. The default
> gzip compressed SAM file was 7% smaller than the BAM file, but took 15%
> longer than samtools sort.
>
> 2m57s  samtools view -Su |samtools sort
> 3m28s  sort |samtools view -Sb
> 3m47s  sort |gzip
> 2m3s   sort |gzip --fast
>
> 627 MB  samtools view -Su |samtools sort
> 627 MB  sort |samtools view -Sb
> 586 MB  sort |gzip
> 737 MB  sort |gzip --fast
>
> Cheers,
> Shaun
>
> $ time samtools view -Su test.sam |samtools sort -m 8589934592 -o - - >/dev/null
> real 2m57.482s
> user 2m55.054s
> sys 0m6.648s
>
> $ time sort -S8G -snk3 -k4 test.sam |samtools view -Sbt GRCh37.fa - >/dev/null
> real 3m28.060s
> user 3m26.836s
> sys 0m4.762s
>
> $ time sort -S8G -snk3 -k4 test.sam |gzip >/dev/null
>
> real 3m47.821s
> user 3m47.739s
> sys 0m3.286s
>
> $ time sort -S8G -snk3 -k4 test.sam |gzip --fast >/dev/null
>
> real 2m3.292s
> user 2m3.019s
> sys 0m4.336s
>
> On Tue, 2011-12-20 at 18:13 -0800, Colin Hercus wrote:
> > Hi Shaun,
> >
> > That's interesting but you are ending up with two different results.
> > With samtools sort you end up with a compressed bam file and with
> > Linux sort you still have a sam file (and with no headers). Add the
> > sam to compressed bam cost to Linux sort and I think samtools is the
> > winner.
> >
> > Kind Regards, Colin
> >
> > On Wed, Dec 21, 2011 at 4:04 AM, Shaun Jackman <sja...@bc...>
> > wrote:
> > Hi,
> >
> > To sort a SAM file, UNIX sort takes less than half the time of
> > samtools. Here's a test with a 3.5 GB SAM file:
> >
> > $ time samtools view -Su test.sam |samtools sort -o - - >/dev/null
> > [samopen] SAM header is present: 25 sequences.
> > [bam_sort_core] merging from 7 files...
> >
> > real 3m55.149s
> > user 3m48.554s
> > sys 0m5.623s
> >
> > $ time sort -snk3 -k4 test.sam >/dev/null
> >
> > real 1m38.004s
> > user 1m26.216s
> > sys 0m7.494s
> >
> > This trick works if your sequence IDs are in an order that can be
> > sorted by UNIX sort. That is, the @SQ headers must be sorted either
> > alphabetically or numerically. The above sort command uses the -n
> > option to sort numerically.
> >
> > Cheers,
> > Shaun
> >
> > $ sort --version
> > sort (GNU coreutils) 7.6
> > $ samtools
> > Program: samtools (Tools for alignments in the SAM format)
> > Version: 0.1.18 (r982:295)
> >
> > $ time samtools view -Su 30NE8AAXX_3.sam >test.bam
> > [samopen] SAM header is present: 25 sequences.
> >
> > real 1m8.586s
> > user 0m40.915s
> > sys 0m4.096s
> >
> > $ du -h test.bam
> > 2.9G test.bam
> >
> > $ time samtools sort -o test.bam - >/dev/null
> > [bam_sort_core] merging from 7 files...
> >
> > real 3m47.551s
> > user 3m3.334s
> > sys 0m3.145s
> >
> > $ time samtools view -Sb 30NE8AAXX_3.sam >test.bam
> > [samopen] SAM header is present: 25 sequences.
> >
> > real 2m37.593s
> > user 2m33.125s
> > sys 0m3.267s
> >
> > $ du -h test.bam
> > 835M test.bam
> >
> > $ time samtools sort -o test.bam - >/dev/null
> > [bam_sort_core] merging from 7 files...
> >
> > real 3m28.348s
> > user 3m16.909s
> > sys 0m2.065s
> >
> > _______________________________________________
> > Samtools-devel mailing list
> > Sam...@li...
> > https://lists.sourceforge.net/lists/listinfo/samtools-devel
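
For anyone wanting to try the UNIX sort approach from this thread while keeping the SAM header (the point Colin raised about the sorted output having no headers), here is a minimal sketch. The function name sort_sam is hypothetical and not part of samtools. The explicit tab separator and the per-field keys (-k3,3n -k4,4n) are a variation on the `sort -snk3 -k4` command used in the thread, not the exact command that was timed. As in the thread, it assumes the reference names in column 3 sort numerically in the same order as the @SQ headers.

```shell
#!/bin/sh
# Hypothetical helper (not part of samtools): print the header lines of a
# SAM file unchanged, then the alignment records sorted by reference name
# (column 3, numeric) and position (column 4, numeric).
sort_sam() {
    in=$1
    tab=$(printf '\t')
    # Header lines start with "@"; emit them first, in their original order.
    grep '^@' "$in"
    # Alignment records: stable numeric sort on columns 3 and 4 only.
    grep -v '^@' "$in" | sort -s -t "$tab" -k3,3n -k4,4n
}
```

It could then be piped into gzip --fast as in the timings above, for example `sort_sam test.sam | gzip --fast > sorted.sam.gz`, where sorted.sam.gz is just an illustrative output name. Note that this reads the input file twice, so it will not reproduce the timings reported in the thread.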