Following is the command line template to call SNP by taking high coverage exome capture sequencing BAM files as inputs and output the SNP calls in VCF file. We have already tuned the best practice parameters into the tool. Please refer Atlas-SNP2 usage page for the explanation and default value of each parameter.
Reference fasta: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/human_g1k_v37.fasta.gz
For each BAM file run:
ruby Atlas-SNP2.rb -i [bam file] -r [reference fasta file] -o [output file] --Illumina –n [sample name]
for sample in `cat ../samples ` do echo "ruby ../Atlas-SNP2/Atlas-SNP2.rb -i ../bams/$sample.*.bam -r ~/refs/human_g1k_v37.fasta -o $sample.vcf --Illumina -n $sample" | msub -q analysis -d $PWD -l pmem=4gb -N atlas2.snps.$sample done
For each BAM file run:
ruby Atlas-Indel.rb -i [bam file] -r [reference fasta file] -o [output file] -I
for sample in `cat ../samples ` do echo "ruby ../Atlas-Indel2/Atlas-Indel2.rb -i ../bams/$sample.*.bam -r ~/refs/human_g1k_v37.fasta -o $sample.vcf -I" | msub -q analysis -d $PWD -l pmem=1gb -N atlas2.indels.$sample done
Here we describe the workflow to combine single sample vcf files to generate a multi-sample vcf a.k.a project-level vcf. The goal is to provide genotype calls from sequence data across all samples at every variant position called in at least one sample. The idea is to report coverage information for reference homozygous sites. This approach is better than providing pileups for individual samples because of the differences in methods calculating reference and variant read depths and differences in filtering reads based on quality .
While not required, filtering the variant calls to the capture target or other region of interest will drastically reduce the amount of time and resources to build a project-level VCF. A simple tool for filtering VCF files to a target BED file is included in the Atlas2 package and can be used as follows:
ruby Atlas2/utils/bed_filter.rb [vcf_file] [bed_file] > [filtered_vcf_file]
So to filter all the VCF files we run:
mkdir target for file in *.vcf do ruby Atlas2/utils/bed_filter.rb $file phase3-like.illum.capture.union.bed > target/$file done
In order to provide genotypes and coverage information for all samples at all sites (not just the variant samples), you can run Atlas2 a second time providing a master list of sites that Atlas2 will always include in the output VCF. If you do not need the genotype/coverage information for non-variant sample-sites you can skip to Step 3.
Concatenate variant sites across all single samples vcfs create a master list of unique variants sites found in at least one sample.
Below code will create a unique master site list
for file in *.vcf do grep -v '^#' $file | cut -f1,2 >>tmp done sort -n -k1 -k2 tmp | uniq >master-site-list rm tmp
Run Atlas2 a second time providing the master site list you have generated to generate genotypes and coverage data for all sample-sites.
ruby Atlas-SNP2 -i [bam file] -r [reference fasta file] -o [output file] -n [sample name] --Illumina -a [master-site-list] -w
So for all the files you could do:
for sample in `cat ../samples ` do echo "ruby ../Atlas-SNP2/Atlas-SNP2.rb -i ../bams/$sample.*.bam -r ~/refs/human_g1k_v37.fasta -o $sample.vcf --Illumina -n $sample -a master-site-list -w" | msub -q analysis -d $PWD -l pmem=4gb -N atlas2.snps.regeno.$sample done
ruby Atlas-Indel.rb -i [bam file] -r [reference fasta file] -o [output file] -I -a [master-site-list]
So for all the files you could do:
for sample in `cat ../samples ` do echo "ruby ../Atlas-Indel2/Atlas-Indel2.rb -b ../bams/$sample.*.bam -r ~/refs/human_g1k_v37.fasta -o $sample.vcf -I -n $sample -a master-site-list" | msub -q analysis -d $PWD -l pmem=1gb -N atlas2.indels.regeno.$sample done
Note: As Atlas-Indel2 does not yet have the -w option (only call sites in the master-site-list), it is a good idea to once again filter the VCF files to the catpure target region as described in Step 0.
Merge single sample vcf files generated in step 2 using vcfPrinter.
ruby RunPrinter.rb -i "*.vcf" -o merged-output.vcf -p
This step can take a very long time for a large number of sites or samples. For larger projects such as the 1000 Genomes Project we recommend running VcfPrinter in cluster mode. Note that cluster mode will only work on some clusters without modification (pertinent code is in RunPrinter.rb:59,83 and clusterJobStats.rb).
Running VcfPrinter in cluster mode requires a filled out settings.rb file. Inside the VcfPrinter directory copy settings.rb.default to settings.rb (NOTE: as of revision 172 the settings.rb file should be in your current working directory rather than the VcfPrinter directory).
cp settings.rb.default settings.rb
This gives you an empty settings.rb file that looks like this:
#cp this file to settings.rb #and fill in the details #Program settings (DO NOT CHANGE) VERSION="1.4.3" REVISION="158" #User specific settings RUBY_EXE = "" INSTALL_DIR = "" #Cluster specific settings USER = "" SCRATCH = "" JOB_QUEUE = ""
You must fill in the path to your ruby executable, VcfPrinter install directory, cluster username, the location of an empty scratch directory (must be accessible to all nodes), and the name of the job queue to submit jobs to. For example:
#cp this file to settings.rb #and fill in the details #Program settings (DO NOT CHANGE) VERSION="1.4.3" REVISION="163" #User specific settings RUBY_EXE = "/usr/bin/ruby1.9.1" INSTALL_DIR = "/projects/1000GENOMES/duane/lib/Atlas2/trunk/vcfPrinter/" #Cluster specific settings USER = "duane" SCRATCH = "/projects/1000GENOMES/duane/phase3_like/snps.regeno/scratch" JOB_QUEUE = "analysis"
To work in cluster mode VcfPrinter requires that all VCF files are bgzipped and tabix indexed http://samtools.sourceforge.net/tabix.shtml. This can be done in this manner:
for file in *.vcf do bgzip $file tabix -p vcf $file.gz end
Once you have the settings.rb file ready you can start the VcfPrinter. In cluster mode VcfPrinter will split (map) the input VCFs into small regions and submit jobs to merge all the files in that region together. It will monitor the jobs it has submitted and once they have all finished it will concatenate (reduce) the merged VCFs together in the correct order. Note that you must run vcfPrinter on a worker node not a login node as it is doing much more that just submitting jobs. To run VcfPrinter in cluster mode:
ruby vcfPrinter/RunPrinter.rb -i "[glob to vcf.gz files]" -o [output vcf] -p --cluster -n [number of jobs per chromosome] (--indel)
ruby vcfPrinter/RunPrinter.rb -i "*.vcf.gz" -o ../all.snps.vcf -p --cluster -n 15
Make sure not to forget the quotes around the glob or it will not work. If you are merging indel VCFs you should include the --indel flag:
ruby vcfPrinter/RunPrinter.rb -i "*.vcf.gz" -o ../all.indels.vcf -p --cluster -n 15 --indel
The project-level VCF produced by VcfPrinter is extremely data-rich and for large populations is often too crowded for some applications. To remove uncalled alleles and also add in allele-count and allele-frequency annotations (AN/AC/AF) you can use the add_allele_freq.rb scipt in the Atlas2 utils directory with the -clean flag as follows:
ruby Atlas2/utils/add_allele_freq.rb all.snps.vcf -clean > all.snps.final.vcf