1. Summary
  2. Files
  3. Support
  4. Report Spam
  5. Create account
  6. Log in

SAM protocol

From samtools

Jump to: navigation, search

Contents

Abstract

Introduction

Basic Protocol 1: The Sequence Alignment/Map (SAM) Format

In SAM, each alignment line has 11 mandatory fields and a variable number of optional fields. The mandatory fields must be present but their value can be a * or a zero (depending on the field) if the corresponding information is unavailable.

Mandatory fields in SAM Modified from Table 1 in Li et al. (2009)
Col Name Description
1 QNAME Query NAME of the read or the read pair
2 FLAG bitwise FLAG (pairing, strand, mate strand, etc.)
3 RNAME Reference sequence NAME
4 POS 1-based leftmost POSition of clipped alignment
5 MAPQ MAPping Quality (Phred-scaled)
6 CIGAR extended CIGAR string (operations: MIDNSHP)
7 NRNM Mate Reference NaMe (`=' if same as RNAME)
8 MPOS 1-based leftmost Mate POSition
9 ISIZE inferred Insert SIZE
10 SEQ query SEQuence on the same strand as the reference
11 QUAL query QUALity (ASCII-33=Phred base quality)
Bitwise flags in the FLAG field
Flag Description
0x0001 Paired in sequencing
0x0002 Mapped in proper read pair
0x0004 Query unmapped
0x0008 Mate unmapped
0x0010 Strand of the query (1 for reverse)
0x0020 Strand of the mate (1 for reverse)
0x0040 First read in the pair
0x0080 Second read in the pair
0x0100 Secondary alignment
0x0200 QC failure
0x0400 Optical or PCR duplicates

Basic Protocol 2: Generating Alignments for One Human Individual

Necessary resources

Hardware
A computer with 64-bit CPU, 8GB or more memory and 100GB free disk space.
Software
Linux; wget; C compiler (e.g. gcc)
Files
Human genome sequences in one FASTA file (hsRef.fa in this example). Paired-end sequencing reads generated by Illumina GenomeAnalyzer in two FASTQ files, one for each end (ga1.fq and ga2.fq). Sequencing reads generated by Roche/454 in one FASTQ file (454.fq).

Procedure

Install software

  1. Install samtools and bwa. Sophisticated Unix users may copy the executables to any directory on the PATH.
    wget http://sourceforge.net/projects/bio-bwa/files/bwa-0.5.5.tar.bz2/download
    wget http://sourceforge.net/projects/samtools/files/samtools/0.1.7/samtools-0.1.7a.tar.bz2/download
    tar -jxf bwa-0.5.5.tar.bz2; tar -jxf samtools-0.1.7a.tar.bz2
    (cd bwa-0.5.5; make; cp bwa $HOME/bin)
    (cd samtools-0.1.7a; make; cp samtools misc/samtools.pl $HOME/bin)
    export PATH="$PATH:$HOME/bin" # this is bash only.

    Note that this example uses bwa-0.5.5 and samtools-0.1.7a, but latest versions are preferred. In addition, many other aligners also generate alignments in the SAM format (see this page for a complete list). We only use bwa as an example.

Perform alignment

  1. Index the genome. For the human genome, option -a bwtsw is required because the default is algorithm does not work with genomes longer than 2Gbp. It typically takes three hours for indexing.
    bwa index -a bwtsw hsRef.fa
  2. Align Illumina short reads to the reference genome, using the bwa-short read algorithm.
    bwa aln hsRef.fa ga1.fq > ga1.sai
    bwa aln hsRef.fa ga2.fq > ga2.sai
    bwa sampe hsref.fa ga1.sai ga2.sai ga1.fq ga2.fq | gzip > ga.sam.gz

    The first two command lines generate alignments in the suffix array coordinate for each end. The last command line pairs the two ends and produces the alignment in the compressed SAM format. The gzip output saves disk space but is optional. The input FASTQ files can be also compressed. Both bwa and samtools seamlessly work with uncompressed and gzip'ed input files.

  3. Align 454 reads to the reference genome, using the BWA-SW algorithm in bwa. This algorithm works with long reads, but does not support paired-end mapping.
    bwa bwasw hsRef.fa 454.fq | gzip > 454.sam.gz

    For large-scale data, step 3, 4 and 5 are usually parallelized across a compute cluster.

Generate merged alignment

  1. Convert SAM to BAM and perform sorting.
    samtools view -uS ga.sam.gz | samtools sort - ga
    samtools calmd -uS 454.sam.gz hsRef.fa | samtools sort - 454

    We use the calmd command to generate the NM and MD tags for each alignment which may be potentially useful to downstream analysis. The BWA-short algorithm writes these tags but BWA-SW does not.

    In both command lines, option -u renders samtools to output uncompressed BAM to reduce the computing time. An input file name `-' means the standard input. It is recommended to use samtools on a data stream to reduce disk space usage as well as the overall wall-clock time on processing the alignment.

  2. Merge sorted alignments and remove potential PCR/optical duplicates.
    perl -e 'print "@RG\tID:ga\tSM:hs\tLB:ga\tPL:Illumina\n@RG\tID:454\tSM:hs\tLB:454\tPL:454\n"' > rg.txt
    samtools merge -rh rg.txt - ga.bam 454.bam | samtools rmdup - - | samtools rmdup -s - aln.bam

    The library information of each read group is kept in rg.txt which will be embedded in the header of aln.bam. The -r option of merge tells samtools to attach `RG:Z:ga' to each alignment in ga.bam and `RG:Z:454' to each alignment in 454.bam.

    Removing duplicates is optional, but it helps to reduce false positives in SNP calling. However, duplicates removal for single-end reads is not recommended given very deep coverage as this step may discard too many good alignments.

Index alignment

  1. Index the reference sequences and the alignment to enable fast retrieval of alignment or subsequence in a specified region.
    samtools faidx hsRef.fa
    samtools index aln.bam

    File hsRef.fa.fai and aln.bam.bai will be created. After this step, alignment can be displayed in terminal with

    samtools tview aln.bam hsRef.fa

Basic Protocol 3: Variant Calling with SAMtools

Necessary resources

Hardware
A computer with 100GB free disk space.
Software
Linux; samtools
Files
Reference sequences in one FASTA file (hsRef.fa in this example). Position sorted alignment in the BAM format (aln.bam).

Procedure

  1. Get the raw variant calls on the X chromosome. Assume X chromosome is named X in both hsRef.fa and aln.bam.
    samtools view -u aln.bam X | samtools pileup -vcf hsRef.fa - > var-X.raw.txt

    A region is specified with chr, chr:begin or chr:begin-end.

  2. Filter the raw variant calls.
    samtools.pl varFilter -D100 var-X.raw.txt > var-X.flt.txt

    Please remember to set the -D to set a maximum read depth according to the average read depth. In whole genome shotgun (WGS) resequencing, SNPs with excessively high read depth are usually caused by structural variations or alignment artifacts and should not be trusted.

  3. Acquire final variant calls by setting a quality threshold.
    awk '($3=="*"&&$6>=50)||($3!="*"&&$6>=20)' var-X.flt.txt > var-X.final.txt

    Here 50 is the quality threshold for indels and 20 for substitutions.

  4. Get the consensus sequence of chrX:
    samtools view -u aln.bam X | samtools pileup -cf hsRef.fa - | samtools.pl pileup2fq -D100 > var-X.fq

    File var-X.fq is in the multi-line FASTQ format. Bases in lowercase are filtered out due to repeats, being close to indels or insufficient/excessive read depth. The consensus file is essential to the estimate of mutation rate.

    The pileup2fq command applies fewer filters than varFilter and may not give identical results. For variant discovery, varFilter is recommended.

Alternative Protocol: Variant Calling with GATK

Necessary resources

Hardware
A computer with 100GB free disk space.
Software
Linux; Java runtime; samtools
Files
Reference sequences in one FASTA file (hsRef.fa in this example). Position sorted alignment in the BAM format (aln.bam).

Procedure

  1. Download and install GATK:
    wget ftp://ftp.broadinstitute.org/pub/gsa/GenomeAnalysisTK/GenomeAnalysisTK-EarlyAccess.tar.bz2;
    tar -jxf GenomeAnalysisTK-EarlyAccess.tar.bz2;
    ln -s GenomeAnalysisTK/GenomeAnalysisTK.jar
  2. Call SNPs:
    [Note that multiple -I (input bam) arguments can be specified (or one merged bam used) to make multi-sample-aware calls]
    java -jar GenomeAnalysisTK.jar -I aln.bam -R hsRef.fa -T UnifiedGenotyper \
       -varout snpCalls.vcf -confidence 50.0 -U -S SILENT

Support Protocol 1: Base Quality Recalibration

Base quality recalibration is optional, but it helps to improve the SNP quality and is preferred whenever possible. This tool recalibrates base quality scores of sequencing-by-synthesis reads in an aligned BAM file. Often base quality scores as reported by the sequencing machines can be inaccurate or uninformative. After recalibration, the quality scores in the QUAL field in each read in the output BAM are more accurate in that the reported quality score is much closer to its actual probability of mismatching the reference genome. Moreover, the recalibration tool attempts to correct for variation in quality with machine cycle and sequence context, and by doing so provides not only more accurate quality scores but also more widely dispersed (and thus more informative) ones.

  1. Download and install GATK:
    wget ftp://ftp.broadinstitute.org/pub/gsa/GenomeAnalysisTK/GenomeAnalysisTK-EarlyAccess.tar.bz2;
    tar -jxf GenomeAnalysisTK-EarlyAccess.tar.bz2;
    ln -s GenomeAnalysisTK/GenomeAnalysisTK.jar
  2. Generate quality recalibration table:
    java -Xmx4g -jar GenomeAnalysisTK.jar -I aln.bam -R hsRef.fa -T CountCovariates -l INFO \
              -recalFile recal_data.csv -cov ReadGroupCovariate -cov QualityScoreCovariate \
              -cov CycleCovariate -cov DinucCovariate -cov TileCovariate

    Empirical quality is measured as ( # mismatches + 1 ) / ( # observations + 1 ) at all non-variant sites. Therefore it is strongly recommended to also apply "-D knownSNP.rod" with CountCovariates; otherwise base quality will be underestimated because known polymorphic sites will of course mismatch with the reference. CountCovariates also accepts VCF files as a list of known polymorphic sites to skip over. Users may take advantage of this feature by adding "-B variants,VCF,knownSNP.vcf" to the command.

  3. After counting covariates in the initial BAM File, we then walk through the BAM file again and rewrite the quality scores (in the QUAL field) using the data in the recal_data.csv file, into a new BAM file:
    java -Xmx4g -jar GenomeAnalysisTK.jar -I aln.bam -R hsRef.fa -T TableRecalibration -l INFO \
              -recalFile recal_data.csv --output_bam alnRecal.bam

There are many advanced options described in detail on the tool's wiki page: http://www.broadinstitute.org/gsa/wiki/index.php/Base_quality_score_recalibration

Support Protocol 2: Local Realignment

Samtools calls short indels with local realignment, but it does not write a modified BAM file after the realignment. The GATK though provides such a tool that realigns reads in regions with suspected indel artifacts and generates a BAM with cleaned alignments.

  1. Collect regions around potential indels and clusters of mismatches:
    samtools view -H aln.bam | grep ^@SQ > hsRef.dict;
    java -jar GenomeAnalysisTK.jar -I aln.bam -R hsRef.fa -T IndelIntervals -U -S SILENT -o indelIntv.txt;
    java -jar GenomeAnalysisTK.jar -I aln.bam -R hsRef.fa -T MismatchIntervals -U -S SILENT -o mmIntv.txt
  2. Merge intervals:
    java -jar GenomeAnalysisTK.jar -I aln.bam -R hsRef.fa -T IntervalMerger -U -S SILENT \
              --intervalsToMerge indelIntv.txt --intervalsToMerge mmIntv.txt -o mergedIntv.txt
  3. Realign reads overlapping specified regions:
    java -jar GenomeAnalysisTK.jar -I aln.bam -R hsRef.fa -T IntervalCleaner -L mergedIntv.txt -U -S SILENT \
              --OutputCleanedReadsOnly --OutputCleaned cleaned.bam --OutputIndels rawIndel.txt
  4. Write the cleaned alignment file:
    java -jar GenomeAnalysisTK.jar -I aln.bam -R hsRef.fa -T CleanedReadInjector --cleaned_reads cleaned.bam \
              -U -S SILENT --output_bam alnReAln.bam

Support Protocol 3: Assessing coverage

The GATK provides a tool to assess the depth of coverage over any number of intervals in a BAM file. The coverage will be displayed in the output for every single position plus as an average over each interval specified.

java -jar GenomeAnalysisTK.jar -I aln.bam -R hsRef.fa -T DepthOfCoverage -L intervals.txt -U -S SILENT


Support Protocol 4: Viewing Alignment with IGV

Samtools provides a concise text-based alignment viewer which does not support advanced features such as zooming and annotation tracks. For advanced features, the Integrative Genomics Viewer is more appropriate.

  1. Launch IGV. IGV can be launched via Java Web Start. The human reference genome and annotation tracks are available from the IGV server. Nonetheless, if you prefer to use your own reference sequences and annotation, you may download IGV and run it locally on your computer. To download and install IGV:
    wget http://www.broadinstitute.org/igvdata/downloads/IGV_1.4.04.zip;
    unzip IGV_1.4.04.zip;
    wget http://www.broadinstitute.org/igvdata/downloads/igvtools.zip;
    unzip igvtools.zip IGVTools/igvtools.jar

    and to lauch IGV locally:

    (cd IGV_1.4.04; sh igv_linux-64.sh)

    You should invoke another shell script if you are not using x86_64-linux.

  2. Import the genome sequence. Open the "File->Import Genome..." menu. Name the reference as "hsRef" and choose the reference sequence file "hsRef.fa". You may also import the gene annotation file if it is available.
  3. Generate the coverage information to be displayed in IGV. In a terminal:
    java -jar IGVTools/igvtools.jar count aln.bam aln.depth hsRef.genome

    where hsRef.genome was generated when we imported the genome sequence at the last step.

    This step is optional, but it is essential if you want to see the read depth information in large scale.

  4. Load the alignment. Come back to the IGV window. Open the "File->Load from File..." menu, and choose the alignment file "aln.bam". Open the "File->Load from File..." menu again, and choose the depth file "aln.depth.tdf".
  5. View the alignment. You may not see the alignment immediately when the file is loaded. You need to zoom in to see the details. A typical alignment view looks like:

Support Protocol 5: APIs for Accessing Alignments in SAM/BAM

C APIs (SAMtools-C)

  1. Download samtools and compile libbam.a:
    wget http://sourceforge.net/projects/samtools/files/samtools/0.1.7/samtools-0.1.7a.tar.bz2/download;
    tar -jxf samtools-0.1.7a.tar.bz2;
    export SAMTOOLS=`pwd`/samtools-0.1.7a;
    (cd $SAMTOOLS; make libbam.a)
  2. Create file bam2bed.c which is:
    #include <stdio.h>
    #include "sam.h"
    /* callback for bam_fetch() */
    static int fetch_func(const bam1_t *b, void *data)
    {
        samfile_t *fp = (samfile_t*)data;
        uint32_t *cigar = bam1_cigar(b);
        const bam1_core_t *c = &b->core;
        int i, l;
        if (c->flag&BAM_FUNMAP) return 0; /* skip unmapped reads */
        for (i = l = 0; i < c->n_cigar; ++i) {
            int op = cigar[i]&0xf;
            if (op == BAM_CMATCH || op == BAM_CDEL || op == BAM_CREF_SKIP)
                l += cigar[i]>>4;
        }
        printf("%s\t%d\t%d\t%s\t%d\t%c\n", fp->header->target_name[c->tid],
               c->pos, c->pos + l, bam1_qname(b), c->qual, (c->flag&BAM_FREVERSE)? '-' : '+');
        return 0;
    }
    int main(int argc, char *argv[])
    {
        samfile_t *fp;
        if (argc == 1) {
            fprintf(stderr, "Usage: bam2bed <in.bam> [region]\n");
            return 1;
        }
        if ((fp = samopen(argv[1], "rb", 0)) == 0) {
            fprintf(stderr, "bam2bed: Fail to open BAM file %s\n", argv[1]);
            return 1;
        }
        if (argc == 2) { /* if a region is not specified */
            bam1_t *b = bam_init1();
            while (samread(fp, b) >= 0) fetch_func(b, fp);
            bam_destroy1(b);
        } else {
            int ref, beg, end;
            bam_index_t *idx;
            if ((idx = bam_index_load(argv[1])) == 0) { /* load BAM index */
                fprintf(stderr, "bam2bed: BAM index file is not available.\n");
                return 1;
            }
            bam_parse_region(fp->header, argv[2], &ref, &beg, &end); /* parse region */
            if (ref < 0) {
                fprintf(stderr, "bam2bed: Invalid region %s\n", argv[2]);
                return 1;
            }
            bam_fetch(fp->x.bam, idx, ref, beg, end, fp, fetch_func);
            bam_index_destroy(idx);
        }
        samclose(fp);
        return 0;
    }
  3. Compile bam2bed.c:
    gcc -I$SAMTOOLS -g -O2 -Wall bam2bed.c -o bam2bed -lz -L$SAMTOOLS -lbam;
  4. Convert BAM to BED with:
    ./bam2bed aln.bam > aln.bed;
    ./bam2bed aln.bam X:10,000,000-20,000,000 > aln-partial.bed

Java APIs (Picard)

  1. Download Picard from [1]
  2. Create file BamToBed.java which is:
    import net.sf.picard.cmdline.CommandLineProgram;
    import net.sf.picard.cmdline.Option;
    import net.sf.picard.cmdline.StandardOptionDefinitions;
    import net.sf.picard.cmdline.Usage;
    import net.sf.picard.io.IoUtil;
    import net.sf.samtools.*;
    
    import java.io.File;
    import java.util.Iterator;
    
    /**
     * Command-line application for converting bam or sam files to bed files.
     */
    public class BamToBed extends CommandLineProgram {
    
        /* Specify usage string and command-line options. There is annotation-processing code elsewhere that will generate
            command-line parsing code based on these.
         */
    
        /** Usage string */
        @Usage
        public String USAGE = getStandardUsagePreamble() + "bam or sam file to bed file conversion.\n" +
                "Input and output formats are determined by file extension.";
    
        /** Command-line args */
        @Option(doc="The BAM or SAM file to convert.", shortName= StandardOptionDefinitions.INPUT_SHORT_NAME) public File INPUT;
        @Option(doc="The sequence name.", shortName="SEQ", optional=true) public String SEQUENCE;
        @Option(doc="The starting index (1-based inclusive).", shortName="S", optional=true) public Integer START;
        @Option(doc="The end index (1-based inclusive).", shortName="E", optional=true) public Integer END;
    
        
        /** Whether the user provided sequence, start, and end args on the command line */
        protected boolean rangeArgsProvided = false;
    
        /** This method contains the main logic of the application */
        protected int doWork() {
            IoUtil.assertFileIsReadable(INPUT);
    
            final SAMFileReader reader = new SAMFileReader(INPUT);
    
            Iterator<SAMRecord> iterator = null;
            if(!rangeArgsProvided )
            {
                iterator = reader.iterator();
            }
            else
            {
                iterator = reader.queryOverlapping(SEQUENCE, START, END);
            }
    
            
            while (iterator.hasNext()) {
                final SAMRecord record = iterator.next();
                if (record.getReadUnmappedFlag()) {
                    continue;
                }
    
                // Output tab-delimited fields in bed file format
                System.out.println(record.getReferenceName() + "\t" +
                        (record.getAlignmentStart() - 1) + "\t" + //subtract 1 to shift from one-based to zero-based
                        (record.getAlignmentEnd() - 1 + 1) + "\t" + //subtract 1 to shift from one-based to zero-based, and
                                                                    // then add 1 to shift from inclusive to exclusive
                        record.getReadName() + "\t" +
                        record.getMappingQuality() + "\t" +
                        (record.getReadNegativeStrandFlag()? "-": "+") );
            }
            reader.close();
    
            return 0;
        }
    
        
        /**
         * Put any custom command-line validation in an override of this method.
         * clp is initialized at this point and can be used to print usage and access argv.
         * Any options set by command-line parser can be validated.
         * @return null if command line is valid.  If command line is invalid, returns an array of error message
         * to be written to the appropriate place.
         */
        protected String[] customCommandLineValidation() {
        	if(SEQUENCE == null && START == null && END == null)
            {
                return null;
            }
    
            //validate SEQUENCE, START, END args
            if(SEQUENCE == null)
            {
                return new String[] { "Missing the SEQUENCE arg." };
            }
    
            if(START == null)
            {
                return new String[] { "Missing the START arg." };
            }
    
            if(END == null) 
            {
                return new String[] { "Missing the END arg." };
            }
    
            if(START >= END)
            {
                return new String[] { "START arg is >= END arg."};
            }
    
    
            rangeArgsProvided = true;
            return null;
        }
    
        
        /**
         * Main entry point of the application.
         *
         * @param argv Command line args.
         */
        public static void main(String[] argv) {
            new BamToBed().instanceMainWithExit(argv);
        }
    
    }
    
    
  3. Compile BamToBed.java, putting the picard-(version).jar and sam-(version).jar on the classpath:
    javac -cp picard-(version).jar:sam-(version).jar BamToBed.java
  4. Convert BAM to BED by running BamToBed.class:
    java -cp .:picard-(version).jar:sam-(version).jar BamToBed INPUT=test.bam > test.bed;
    java -cp picard-(version).jar:sam-(version).jar BamToBed INPUT=test.bam SEQUENCE=chr1 START=1329545 END=1321000 > test-partial.bed


Perl APIs (Bio-SamTools)

Python APIs (Pysam)

  1. Install Pyrex if absent:
    export PYTHONPATH="$PYTHONPATH:$HOME/opt/lib/python2.6/site-packages";
    wget http://www.cosc.canterbury.ac.nz/greg.ewing/python/Pyrex/Pyrex-0.9.8.5.tar.gz
    tar -zxf Pyrex-0.9.8.5.tar.gz;
    (cd Pyrex-0.9.8.5; python setup.py install --prefix=$HOME/opt)

    You do not need to set PYTHONPATH and --prefix if you have the write permission to the system default module search path. Install pysam:

    wget http://pysam.googlecode.com/files/pysam-0.1.1.tar.gz
    tar -zxf pysam-0.1.1.tar.gz;
    (cd pysam-0.1.1; python setup.py install --prefix=$HOME/opt)
  2. Create script bam2bed.py, which is:
    #!/bin/env python
    
    import os, sys, re, optparse
    import pysam
    
    def main( argv = None ):
        if not argv: argv = sys.argv
    
        # setup command line parser
        parser = optparse.OptionParser( version = "%prog version: $Id$", 
                                        usage = globals()["__doc__"] )
        parser.add_option("-r", "--region", dest="region", type="string",
                          help="samtools region string [default=%default]."  )
        parser.set_defaults( region = None, )
        (options, args) = parser.parse_args( argv[1:] )
        if len(args) != 1:
            raise ValueError( "no samfile specified - see --help for usage" )
    
        # open BAM and get the iterator
        samfile = pysam.Samfile( args[0], "rb" )
        if options.region != None:
            it = samfile.fetch( region=options.region )
        else:
            it = samfile.fetch()
    
        # calculate the end position and print out BED
        take = (0, 2, 3) # CIGAR operation (M/match, D/del, N/ref_skip)
        outfile = sys.stdout
        for read in it:
            if read.is_unmapped: continue
            # compute total length on reference
            t = sum([ l for op,l in read.cigar if op in take ])
            if read.is_reverse: strand = "-"
            else: strand = "+"
            outfile.write("%s\t%d\t%d\t%s\t%d\t%c\n" %\
                              ( samfile.getrname( read.rname ),
                                read.pos, read.pos+t, read.qname,
                                read.mapq, strand) )            
    
    if __name__ == "__main__":
        sys.exit( main( sys.argv) )
  3. Convert BAM to BED with:
    python bam2bed.py aln.bam > aln.bed;
    python bam2bed.py -r X:10,000,000-20,000,000 aln.bam > aln-partial.bed

Commentary

Literature Cited

  • Li H. and Durbin R. (2009) Fast and accurate short read alignment with Burrows-Wheeler Transform. Bioinformatics, 25:1754-60. [PMID: 19451168]
  • Li H. and Durbin R. (2009) Fast and accurate long read alignment with Burrows-Wheeler Transform. Bioinformatics, accepted.
  • Li H.*, Handsaker B.*, Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R. and 1000 Genome Project Data Processing Subgroup (2009) The Sequence alignment/map format and SAMtools. Bioinformatics, 25:2078-9. [PMID: 19505943]
Personal tools