Best Practices and Workflows

Uday Evani Atlas2Team Danny Challis Jin Yu

Variant calling on 1000 Genomes Phase3-like data

Following is the command line template to call SNP by taking high coverage exome capture sequencing BAM files as inputs and output the SNP calls in VCF file. We have already tuned the best practice parameters into the tool. Please refer Atlas-SNP2 usage page for the explanation and default value of each parameter.

Input: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/alignment_indices/20120522.exome.alignment.index

SNPs

Reference fasta: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/human_g1k_v37.fasta.gz

For each BAM file run:

ruby Atlas-SNP2.rb -i [bam file] -r [reference fasta file] -o [output file] --Illumina –n [sample name]

for example:

for sample in `cat ../samples ` 
  do echo "ruby ../Atlas-SNP2/Atlas-SNP2.rb -i ../bams/$sample.*.bam -r ~/refs/human_g1k_v37.fasta -o $sample.vcf --Illumina -n $sample" | msub -q analysis -d $PWD -l pmem=4gb -N atlas2.snps.$sample
done

Indels

For each BAM file run:

ruby Atlas-Indel.rb -i [bam file] -r [reference fasta file] -o [output file] -I

for example:

for sample in `cat ../samples ` 
  do echo "ruby ../Atlas-Indel2/Atlas-Indel2.rb -i ../bams/$sample.*.bam -r ~/refs/human_g1k_v37.fasta -o $sample.vcf -I" | msub -q analysis -d $PWD -l pmem=1gb -N atlas2.indels.$sample
done

Project-level VCF

Here we describe the workflow to combine single sample vcf files to generate a multi-sample vcf a.k.a project-level vcf. The goal is to provide genotype calls from sequence data across all samples at every variant position called in at least one sample. The idea is to report coverage information for reference homozygous sites. This approach is better than providing pileups for individual samples because of the differences in methods calculating reference and variant read depths and differences in filtering reads based on quality .

Step 0:(Optional) Filter calls to the capture target region

While not required, filtering the variant calls to the capture target or other region of interest will drastically reduce the amount of time and resources to build a project-level VCF. A simple tool for filtering VCF files to a target BED file is included in the Atlas2 package and can be used as follows:

ruby Atlas2/utils/bed_filter.rb [vcf_file] [bed_file] > [filtered_vcf_file]

So to filter all the VCF files we run:

mkdir target
for file in *.vcf
  do ruby Atlas2/utils/bed_filter.rb $file phase3-like.illum.capture.union.bed > target/$file
done

Step 1: Generate master site list

In order to provide genotypes and coverage information for all samples at all sites (not just the variant samples), you can run Atlas2 a second time providing a master list of sites that Atlas2 will always include in the output VCF. If you do not need the genotype/coverage information for non-variant sample-sites you can skip to Step 3.

Concatenate variant sites across all single samples vcfs create a master list of unique variants sites found in at least one sample.

Below code will create a unique master site list

for file in *.vcf
  do grep -v '^#' $file | cut -f1,2 >>tmp
done
sort -n -k1 -k2 tmp | uniq >master-site-list
rm tmp

Step 2: Genotype all called sites with Atlas2

Run Atlas2 a second time providing the master site list you have generated to generate genotypes and coverage data for all sample-sites.

SNPs

ruby Atlas-SNP2 -i [bam file] -r [reference fasta file] -o [output file] -n [sample name] --Illumina -a [master-site-list] -w

So for all the files you could do:

for sample in `cat ../samples ` 
  do echo "ruby ../Atlas-SNP2/Atlas-SNP2.rb -i ../bams/$sample.*.bam -r ~/refs/human_g1k_v37.fasta -o $sample.vcf --Illumina -n $sample -a master-site-list -w" | msub -q analysis -d $PWD -l pmem=4gb -N atlas2.snps.regeno.$sample
done

Indels

ruby Atlas-Indel.rb -i [bam file] -r [reference fasta file] -o [output file] -I -a [master-site-list]

So for all the files you could do:

for sample in `cat ../samples ` 
  do echo "ruby ../Atlas-Indel2/Atlas-Indel2.rb -b ../bams/$sample.*.bam -r ~/refs/human_g1k_v37.fasta -o $sample.vcf -I -n $sample -a master-site-list" | msub -q analysis -d $PWD -l pmem=1gb -N atlas2.indels.regeno.$sample
done

Note: As Atlas-Indel2 does not yet have the -w option (only call sites in the master-site-list), it is a good idea to once again filter the VCF files to the catpure target region as described in Step 0.

Step 3: VcfPrinter

Merge single sample vcf files generated in step 2 using vcfPrinter.

ruby RunPrinter.rb -i "*.vcf" -o merged-output.vcf -p

This step can take a very long time for a large number of sites or samples. For larger projects such as the 1000 Genomes Project we recommend running VcfPrinter in cluster mode. Note that cluster mode will only work on some clusters without modification (pertinent code is in RunPrinter.rb:59,83 and clusterJobStats.rb).

settings.rb

Running VcfPrinter in cluster mode requires a filled out settings.rb file. Inside the VcfPrinter directory copy settings.rb.default to settings.rb (NOTE: as of revision 172 the settings.rb file should be in your current working directory rather than the VcfPrinter directory).

cp settings.rb.default settings.rb

This gives you an empty settings.rb file that looks like this:

#cp this file to settings.rb
#and fill in the details

#Program settings (DO NOT CHANGE)
VERSION="1.4.3"
REVISION="158"
#User specific settings
RUBY_EXE = ""
INSTALL_DIR = ""
#Cluster specific settings
USER = ""
SCRATCH = ""
JOB_QUEUE = ""

You must fill in the path to your ruby executable, VcfPrinter install directory, cluster username, the location of an empty scratch directory (must be accessible to all nodes), and the name of the job queue to submit jobs to. For example:

#cp this file to settings.rb
#and fill in the details

#Program settings (DO NOT CHANGE)
VERSION="1.4.3"
REVISION="163"
#User specific settings
RUBY_EXE = "/usr/bin/ruby1.9.1"
INSTALL_DIR = "/projects/1000GENOMES/duane/lib/Atlas2/trunk/vcfPrinter/"
#Cluster specific settings
USER = "duane"
SCRATCH = "/projects/1000GENOMES/duane/phase3_like/snps.regeno/scratch"
JOB_QUEUE = "analysis"

Compress and Index VCFs

To work in cluster mode VcfPrinter requires that all VCF files are bgzipped and tabix indexed http://samtools.sourceforge.net/tabix.shtml. This can be done in this manner:

for file in *.vcf
  do bgzip $file
  tabix -p vcf $file.gz
end

Submitting Jobs

Once you have the settings.rb file ready you can start the VcfPrinter. In cluster mode VcfPrinter will split (map) the input VCFs into small regions and submit jobs to merge all the files in that region together. It will monitor the jobs it has submitted and once they have all finished it will concatenate (reduce) the merged VCFs together in the correct order. Note that you must run vcfPrinter on a worker node not a login node as it is doing much more that just submitting jobs. To run VcfPrinter in cluster mode:

ruby vcfPrinter/RunPrinter.rb -i "[glob to vcf.gz files]" -o [output vcf] -p --cluster -n [number of jobs per chromosome] (--indel)

For example:

ruby vcfPrinter/RunPrinter.rb -i "*.vcf.gz" -o ../all.snps.vcf -p --cluster -n 15

Make sure not to forget the quotes around the glob or it will not work. If you are merging indel VCFs you should include the --indel flag:

ruby vcfPrinter/RunPrinter.rb -i "*.vcf.gz" -o ../all.indels.vcf -p --cluster -n 15 --indel

Step 4: VCF Cleanup and Annotation

The project-level VCF produced by VcfPrinter is extremely data-rich and for large populations is often too crowded for some applications. To remove uncalled alleles and also add in allele-count and allele-frequency annotations (AN/AC/AF) you can use the add_allele_freq.rb scipt in the Atlas2 utils directory with the -clean flag as follows:

ruby Atlas2/utils/add_allele_freq.rb all.snps.vcf -clean > all.snps.final.vcf

Related

Wiki: Atlas2 Suite