About the SAM/BAM Format
How can I get alignments in the SAM/BAM format?
Many alignment programs generate SAM/BAM natively or output a format that can be converted to SAM/BAM. Please check out this page for the complete list. If your preferred software is not in this list, you may contact the developers or write your own, and then please let us know.
How unaligned reads are stored in SAM?
An unaligned reads must be flagged with 0x4. It may have no coordinate (i.e. a coordinate `*:0'), but may have an ordinary coordinate with the CIGAR field set to `*'. In SAM, if one read in a read pair is aligned but the mate is not, we strongly recommend to set the coordinate of the unmapped read the same as that of the mapped one such that in a position sorted SAM/BAM file, the unmapped read is adjacent to the mapped. This convention greatly helps local assembly when we want to collect all related reads in a small region.
CIGAR is `50M', but I see mismatches in the alignment.
CIGAR operation `M' means `alignment match' (i.e. not a gap). It may be a `sequence match' or a `sequence mismatch'. Mismatching information is stored in the `MD' tag which is optional but can be generated with the `calmd' samtools command. We are proposing new CIGAR operations `=' for sequence match and `X' for sequence mismatch, but they are not well supported by samtools.
How to convert SAM to BAM?
If your SAM file has header @SQ lines, you may get BAM by:
samtools view -bS aln.sam > aln.bam
If not, you need to have your reference file ref.fa and then do this:
samtools faidx ref.fa samtools view -bt ref.fa.fai aln.sam > aln.bam
The second method also works if your SAM file has @SQ lines. After conversion, you would probably like to sort and index the alignment to enable fast random access:
samtools sort aln.bam aln-sorted samtools index aln-sorted.bam
I want to call SNPs and short indels.
For a short answer, do this:
samtools pileup -vcf ref.fa aln.bam | tee raw.txt | samtools.pl varFilter -D100 > flt.txt awk '($3=="*"&&$6>=50)||($3!="*"&&$6>=20)' flt.txt > final.txt
For a long answer, see this protocol. Please always remember to set the maximum depth (-D) in filtering.
I want to call SNPs from one chromosome only.
Index your alignment with the `index' command and:
samtools view -u aln.bam chr10 | samtools pileup -vcf ref.fa - > chr10.raw.txt
Please read this page for more information.
The integer FLAG field is not friendly to eyes.
You may get string FLAG by:
samtools view -X aln.bam | less -S
For more information, please check out:
samtools view -?
I do not understand the columns in the pileup output.
This is explained in the manual page. Or briefly (when you invoke pileup with the -c option):
- reference sequence name
- reference coordinate
- reference base, or `*' for an indel line
- genotype where heterozygotes are encoded in the IUB code: M=A/C, R=A/G, W=A/T, S=C/G, Y=C/T and K=G/T; indels are indicated by, for example, */+A, -A/* or +CC/-C. There is no difference between */+A or +A/*.
- Phred-scaled likelihood that the genotype is wrong, which is also called `consensus quality'.
- Phred-scaled likelihood that the genotype is identical to the reference, which is also called `SNP quality'. Suppose the reference base is A and in alignment we see 17 G and 3 A. We will get a low consensus quality because it is difficult to distinguish an A/G heterozygote from a G/G homozygote. We will get a high SNP quality, though, because the evidence of a SNP is very strong.
- root mean square (RMS) mapping quality
- # reads covering the position
- read bases at a SNP line (check the manual page for more information); the 1st indel allele otherwise
- base quality at a SNP line; the 2nd indel allele otherwise
- indel line only: # reads directly supporting the 1st indel allele
- indel line only: # reads directly supporting the 2nd indel allele
- indel line only: # reads supporting a third indel allele
If pileup is invoked without `-c', indel lines and columns between 3 and 7 inclusive will not be outputted.
I see `*' in the pileup sequence column. What are they?
A star at the sequence column represents a deletion. It is a place holder to make sure the number of bases at that column matches the read depth column. Simply ignore `*' if you do not use this information.
I only want to use a subset of alignments in pileup.
If you want to filter on mapping quality, flags, one read group or one library, you may just use the view command. If want to apply more complex filters, you may write an awk command for SAM. For example, I only want to use alignment with two or fewer differences (mismatches+gaps):
samtools view -h aln.bam | perl -ne 'print if (/^@/||(/NM:i:(\d+)/&&$1<=2))' | samtools pileup -S - > out.txt
or exclude all gapped alignments:
samtools view -h aln.bam | awk '$6!~/[ID]/' | samtools pileup -S -
Does samtools generate the consensus sequence like Maq?
Yes. Try this:
samtools pileup -cf ref.fa aln.bam | samtools.pl pileup2fq -D100 > cns.fastq
Again, remember to set -D according to your read depth. Note that pileup2fq applies fewer filters in comparison to varFilter, and you may see tiny inconsistency between the two outputs.
I want to get `unique' alignments from SAM/BAM.
We prefer to say an alignment is `reliable' rather than `unique' as `uniqueness' is not well defined in general cases. You can get reliable alignments by setting a threshold on mapping quality:
samtools view -bq 1 aln.bam > aln-reliable.bam
You may want to set a more stringent threshold to get more reliable alignments.
In repetitive regions, SAMtools call all bases as 'A' although there are no 'A' bases in reads.
This is due to a floating underflow in the MAQ SNP calling model used by default and only happens in repetitive regions. These calls are always filtered out. However, if you are uncomfortable with this, you may use the simplified SOAPsnp model with:
samtools -avcf ref.fa aln.bam > raw.txt
The MAQ model and SOAPsnp model usually deliver very similar SNP calls.
How are SNPs and indels called and filtered by SAMtools?
By default, SNPs are called with a Bayesian model identical to the one used in MAQ. A simplified SOAPsnp model is implemented, too. Indels are called with a simple Bayesian model. The caller does local realignment to recover indels that occur at the end of a read but appear to be contiguous mismatches. For an example, see this picture.
The varFilter filters SNPs/indels in the following order:
- d: low depth
- D: high depth
- W: too many SNPs in a window (SNP only)
- G: close to a high-quality indel (SNP only)
- Q: low root-mean-square (RMS) mapping quality (SNP only)
- g: close to another indel with more evidence (indel only)
The first letter indicates the reason why SNPs/indels are filtered when you invoke varFilter with the `-p' option. A SNP/indel filtered by a rule higher in the list will not be tested against other rules.
The Windows version of SAMtools does not work sometimes.
We are sorry that this is due to bugs in the Windows port. The Windows version is mainly meant to be a cross-platform viewer. Most of samtools functionality are not tested. For heavy use of samtools, please run it on Linux machines instead.
How to make my aligner work best with samtools?
To get the best performance in SNP calling, we recommend the following rules.
- Try to generate all the mandatory fields, in particular the mate coordinates and insert size. Samtools' rmdup relies on the ISIZE field and Picard MarkDuplicates requires the mate coordinates. One may use the fixmate command afterward, but that is very inefficient.
- Choose a random position for a repetitive read. If an aligner discards repetitive reads, the read depth will be inaccurate, which may cause problems in filtering SNPs.
- Write mapping quality. It is recommended to compute mapping quality as samtools SNP caller may take advantage of this information. However, computing mapping quality requires an aligner to look into suboptimal hits and thus slows down alignment. If your aligner cannot do this, write 0 for repetitive reads and 60 for `unique' reads.
Why mapping quality?
The plot below shows alignment accuracy for 108bp simulated reads under different configurations of BWA. If we only retain `unique' alignment, we get a single spot ungap-se-unqiue which corresponds to ~2300 wrong alignments out of 1.68 million mapped reads. If we look at the mapping quality generated by BWA and set a stringent threshold on that, it is possible to get an accuracy of 400/1.67M (the ungap-se line). That is saying we get >80% fewer false alignments at the cost of 1% loss in sensitivity. Setting a higher threshold further reduces false alignments and helps to reduce noises in identifying structural variations bridging unique regions. The plot may vary with the aligner in use, but it is generally true that an algorithm seeing more suboptimal alignments is more accurate.