From: Shaun J. <sja...@bc...> - 2011-12-23 20:23:25
Hi Colin,

I've added Picard SortSam to the table.

2m57s  627 MB  samtools view -Su |samtools sort
3m28s  627 MB  sort |samtools view -Sb
3m47s  586 MB  sort |gzip
2m3s   737 MB  sort |gzip --fast
3m42s  605 MB  sort |bgzip
5m54s  737 MB  Picard SortSam COMPRESSION_LEVEL=1
6m46s  632 MB  Picard SortSam COMPRESSION_LEVEL=5

Cheers,
Shaun

$ java -version
java version "1.7.0_02"
Java(TM) SE Runtime Environment (build 1.7.0_02-b13)
Java HotSpot(TM) 64-Bit Server VM (build 22.0-b10, mixed mode)

$ time java -Xmx48G -jar ~/src/picard/picard-tools-1.58/SortSam.jar SO=coordinate I=30NE8AAXX_3.sam O=/dev/stdout MAX_RECORDS_IN_RAM=10000000 COMPRESSION_LEVEL=1 VALIDATION_STRINGENCY=SILENT |wc -c
[Thu Dec 22 17:34:29 PST 2011] net.sf.picard.sam.SortSam INPUT=30NE8AAXX_3.sam OUTPUT=/dev/stdout SORT_ORDER=coordinate VALIDATION_STRINGENCY=SILENT COMPRESSION_LEVEL=1 MAX_RECORDS_IN_RAM=10000000 VERBOSITY=INFO QUIET=false CREATE_INDEX=false CREATE_MD5_FILE=false
[Thu Dec 22 17:34:29 PST 2011] Executing as sja...@xh... on Linux 2.6.18-194.el5 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_02-b13; Picard version: 1.58(1046)
INFO 2011-12-22 17:36:24 SortSam Read 10000000 records.
INFO 2011-12-22 17:38:37 SortSam Finished reading inputs, merging and writing to output now.
[Thu Dec 22 17:40:23 PST 2011] net.sf.picard.sam.SortSam done. Elapsed time: 5.90 minutes.
Runtime.totalMemory()=24891424768
737349342

real 5m54.830s
user 21m50.175s
sys 0m24.438s

$ time java -Xmx48G -jar ~/src/picard/picard-tools-1.58/SortSam.jar SO=coordinate I=30NE8AAXX_3.sam O=/dev/stdout MAX_RECORDS_IN_RAM=10000000 COMPRESSION_LEVEL=5 VALIDATION_STRINGENCY=SILENT |wc -c
[Thu Dec 22 17:27:16 PST 2011] net.sf.picard.sam.SortSam INPUT=30NE8AAXX_3.sam OUTPUT=/dev/stdout SORT_ORDER=coordinate VALIDATION_STRINGENCY=SILENT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=10000000 VERBOSITY=INFO QUIET=false CREATE_INDEX=false CREATE_MD5_FILE=false
[Thu Dec 22 17:27:16 PST 2011] Executing as sja...@xh... on Linux 2.6.18-194.el5 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_02-b13; Picard version: 1.58(1046)
INFO 2011-12-22 17:29:18 SortSam Read 10000000 records.
INFO 2011-12-22 17:31:32 SortSam Finished reading inputs, merging and writing to output now.
[Thu Dec 22 17:34:00 PST 2011] net.sf.picard.sam.SortSam done. Elapsed time: 6.74 minutes.
Runtime.totalMemory()=24623054848
632187930

real 6m46.889s
user 23m20.322s
sys 0m24.080s

On Wed, 2011-12-21 at 17:48 -0800, Colin Hercus wrote:
> Hi Shaun,
>
> Compression is pretty slow, using --fast helps a lot and you still get
> most of the compression. It's a good option except perhaps for
> archival storage.
>
> A lot of the samtools functions let you set the compression level to
> fast with option -1 or off with -u. It's in merge but not in sort so
> you're stuck with default compression which is quite slow. It's also a
> bit silly they don't have it as a lot of sorted bam files are
> subsequently merged and use of fast compression for the sort could be
> a significant benefit.
>
> If you have time could you try picard SortSam, it lets you set the
> compression level so you could try with equivalent of --fast.
>
> Kind Regards, Colin
>
>
>
> On Thu, Dec 22, 2011 at 4:35 AM, Shaun Jackman <sja...@bc...>
> wrote:
>     Hi Colin,
>
>     Good point. I often work with SAM files or compressed SAM files
>     rather than BAM files. For me, a tool that takes SAM input and
>     produces SAM output is often more useful than a tool that produces
>     BAM output.
>
>     I ran some more timing tests taking a SAM input and producing a
>     compressed output, either .bam or .sam.gz format. The compression,
>     it seems, is as much work as the sorting (for a 3.5 GB SAM file).
>
>     Jared Simpson pointed out that I should set the memory buffer to
>     the same amount for the two tools. I've set the memory buffer to
>     8 GB for a 3.5 GB file so that both tools will sort entirely in
>     main memory.
>
>     The fastest way to sort and compress a SAM file was a UNIX sort
>     piped into gzip --fast, which was 30% faster than samtools sort.
>     The gzip --fast compressed SAM file was 18% larger than the BAM
>     file. The default gzip compressed SAM file was 7% smaller than the
>     BAM file, but took 15% longer than samtools sort.
>
>     2m57s samtools view -Su |samtools sort
>     3m28s sort |samtools view -Sb
>     3m47s sort |gzip
>     2m3s sort |gzip --fast
>
>     627 MB samtools view -Su |samtools sort
>     627 MB sort |samtools view -Sb
>     586 MB sort |gzip
>     737 MB sort |gzip --fast
>
>     Cheers,
>     Shaun
>
>     $ time samtools view -Su test.sam |samtools sort -m 8589934592 -o - - >/dev/null
>     real 2m57.482s
>     user 2m55.054s
>     sys 0m6.648s
>
>     $ time sort -S8G -snk3 -k4 test.sam |samtools view -Sbt GRCh37.fa - >/dev/null
>     real 3m28.060s
>     user 3m26.836s
>     sys 0m4.762s
>
>     $ time sort -S8G -snk3 -k4 test.sam |gzip >/dev/null
>
>     real 3m47.821s
>     user 3m47.739s
>     sys 0m3.286s
>
>     $ time sort -S8G -snk3 -k4 test.sam |gzip --fast >/dev/null
>
>     real 2m3.292s
>     user 2m3.019s
>     sys 0m4.336s
>
>     On Tue, 2011-12-20 at 18:13 -0800, Colin Hercus wrote:
>     > Hi Shaun,
>     >
>     > That's interesting but you are ending up with two different
>     > results. With samtools sort you end up with a compressed bam
>     > file and with Linux sort you still have a sam file (and with no
>     > headers). Add the sam to compressed bam cost to Linux sort and I
>     > think samtools is the winner.
>     >
>     > Kind Regards, Colin
>     >
>     > On Wed, Dec 21, 2011 at 4:04 AM, Shaun Jackman <sja...@bc...>
>     > wrote:
>     >     Hi,
>     >
>     >     To sort a SAM file, UNIX sort takes less than half the time
>     >     of samtools. Here's a test with a 3.5 GB SAM file:
>     >
>     >     $ time samtools view -Su test.sam |samtools sort -o - - >/dev/null
>     >     [samopen] SAM header is present: 25 sequences.
>     >     [bam_sort_core] merging from 7 files...
>     >
>     >     real 3m55.149s
>     >     user 3m48.554s
>     >     sys 0m5.623s
>     >
>     >     $ time sort -snk3 -k4 test.sam >/dev/null
>     >
>     >     real 1m38.004s
>     >     user 1m26.216s
>     >     sys 0m7.494s
>     >
>     >     This trick works if your sequence IDs are in an order that
>     >     can be sorted by UNIX sort. That is, the @SQ headers must be
>     >     sorted either alphabetically or numerically. The above sort
>     >     command uses the -n option to sort numerically.
>     >
>     >     Cheers,
>     >     Shaun
>     >
>     >     $ sort --version
>     >     sort (GNU coreutils) 7.6
>     >     $ samtools
>     >     Program: samtools (Tools for alignments in the SAM format)
>     >     Version: 0.1.18 (r982:295)
>     >
>     >     $ time samtools view -Su 30NE8AAXX_3.sam >test.bam
>     >     [samopen] SAM header is present: 25 sequences.
>     >
>     >     real 1m8.586s
>     >     user 0m40.915s
>     >     sys 0m4.096s
>     >
>     >     $ du -h test.bam
>     >     2.9G test.bam
>     >
>     >     $ time samtools sort -o test.bam - >/dev/null
>     >     [bam_sort_core] merging from 7 files...
>     >
>     >     real 3m47.551s
>     >     user 3m3.334s
>     >     sys 0m3.145s
>     >
>     >     $ time samtools view -Sb 30NE8AAXX_3.sam >test.bam
>     >     [samopen] SAM header is present: 25 sequences.
>     >
>     >     real 2m37.593s
>     >     user 2m33.125s
>     >     sys 0m3.267s
>     >
>     >     $ du -h test.bam
>     >     835M test.bam
>     >
>     >     $ time samtools sort -o test.bam - >/dev/null
>     >     [bam_sort_core] merging from 7 files...
>     >
>     >     real 3m28.348s
>     >     user 3m16.909s
>     >     sys 0m2.065s
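
For reference, here is a minimal sketch of the UNIX sort approach from the first message in this thread, arranged so that the header lines stay at the top of the output. The function name sort_sam is hypothetical and not part of samtools or Picard. It assumes, as noted in the thread, that the reference sequence names are numeric, and it spells out the sort keys as -k3,3n -k4,4n rather than the -snk3 -k4 used in the timings above, so it is an illustration rather than the exact command that was benchmarked.

```shell
# Hypothetical helper (not part of samtools or Picard): coordinate-sort a
# SAM file with UNIX sort while keeping the header at the top.
# Assumption: reference names in column 3 are numeric.
sort_sam() {
    # Header lines start with '@'; pass them through unchanged.
    grep '^@' "$1"
    # Sort alignment records by reference name (field 3), then by
    # position (field 4), both numerically, with a stable sort.
    grep -v '^@' "$1" | LC_ALL=C sort -t "$(printf '\t')" -s -k3,3n -k4,4n
}
```

The output can be piped into gzip --fast or samtools view -Sb - in the same way as the commands timed above, for example: sort_sam test.sam |gzip --fast >test.sorted.sam.gz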