Best Practices and Workflows

Uday Evani Atlas2Team Danny Challis Jin Yu

Variant calling on 1000 Genomes Phase3-like data

Following is the command line template to call SNP by taking high coverage exome capture sequencing BAM files as inputs and output the SNP calls in VCF file. We have already tuned the best practice parameters into the tool. Please refer Atlas-SNP2 usage page for the explanation and default value of each parameter.

Input: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/alignment_indices/20120522.exome.alignment.index

SNPs

Reference fasta: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/human_g1k_v37.fasta.gz

For each BAM file run:

ruby Atlas-SNP2.rb -i [bam file] -r [reference fasta file] -o [output file] --Illumina –n [sample name]

for example:

for sample in `cat ../samples ` 
  do echo "ruby ../Atlas-SNP2/Atlas-SNP2.rb -i ../bams/$sample.*.bam -r ~/refs/human_g1k_v37.fasta -o $sample.vcf --Illumina -n $sample" | msub -q analysis -d $PWD -l pmem=4gb -N atlas2.snps.$sample
done

Indels

For each BAM file run:

ruby Atlas-Indel.rb -i [bam file] -r [reference fasta file] -o [output file] -I 

for example:

for sample in `cat ../samples ` 
  do echo "ruby ../Atlas-Indel2/Atlas-Indel2.rb -i ../bams/$sample.*.bam -r ~/refs/human_g1k_v37.fasta -o $sample.vcf -I" | msub -q analysis -d $PWD -l pmem=1gb -N atlas2.indels.$sample
done

Project-level VCF

Here we describe the workflow to combine single sample vcf files to generate a multi-sample vcf a.k.a project-level vcf. The goal is to provide genotype calls from sequence data across all samples at every variant position called in at least one sample. The idea is to report coverage information for reference homozygous sites. This approach is better than providing pileups for individual samples because of the differences in methods calculating reference and variant read depths and differences in filtering reads based on quality .

Step 0:(Optional) Filter calls to the capture target region

While not required, filtering the variant calls to the capture target or other region of interest will drastically reduce the amount of time and resources to build a project-level VCF. A simple tool for filtering VCF files to a target BED file is included in the Atlas2 package and can be used as follows:

ruby Atlas2/utils/bed_filter.rb [vcf_file] [bed_file] > [filtered_vcf_file]

So to filter all the VCF files we run:

~~~~~~~~~
:::bash
mkdir target
for file in *.vcf
do ruby Atlas2/utils/bed_filter.rb $file phase3-like.illum.capture.union.bed > target/$file
done

##Step 1: Generate master site list

In order to provide genotypes and coverage information for all samples at all sites (not just the variant samples), you can run Atlas2 a second time providing a master list of sites that Atlas2 will always include in the output VCF.  If you do not need the genotype/coverage information for non-variant sample-sites you can skip to Step 3.

Concatenate variant sites across all single samples vcfs create a master list of unique variants sites found in at least one sample.

Below code will create a unique master site list

~~~~~~~
:::bash
for file in *.vcf
  do grep -v '^#' $file | cut -f1,2 >>tmp
done
sort -n -k1 -k2 tmp | uniq >master-site-list
rm tmp
~~~~~~~

##Step 2: Genotype all called sites with Atlas2
 Run Atlas2 a second time providing the master site list you have generated to generate genotypes and coverage data for all sample-sites.

###SNPs

~~~~~~
:::bash
ruby Atlas-SNP2 -i [bam file] -r [reference fasta file] -o [output file] -n [sample name] --Illumina -a [master-site-list] -w
~~~~~~

So for all the files you could do:

~~~~~~~
:::bash
for sample in `cat ../samples ` 
  do echo "ruby ../Atlas-SNP2/Atlas-SNP2.rb -i ../bams/$sample.*.bam -r ~/refs/human_g1k_v37.fasta -o $sample.vcf --Illumina -n $sample -a master-site-list -w" | msub -q analysis -d $PWD -l pmem=4gb -N atlas2.snps.regeno.$sample
done
~~~~~~~

###Indels

~~~~~~~~~~~
:::bash
ruby Atlas-Indel.rb -i [bam file] -r [reference fasta file] -o [output file] -I -a [master-site-list]
~~~~~~~~~~~~

So for all the files you could do:

~~~~~~~
:::bash
for sample in `cat ../samples ` 
  do echo "ruby ../Atlas-Indel2/Atlas-Indel2.rb -b ../bams/$sample.*.bam -r ~/refs/human_g1k_v37.fasta -o $sample.vcf -I -n $sample -a master-site-list" | msub -q analysis -d $PWD -l pmem=1gb -N atlas2.indels.regeno.$sample
done
~~~~~~~

**Note:** As Atlas-Indel2 does not yet have the -w option (only call sites in the master-site-list), it is a good idea to once again filter the VCF files to the catpure target region as described in Step 0.

##Step 3: VcfPrinter

Merge single sample vcf files generated in step 2 using vcfPrinter.

`ruby RunPrinter.rb -i "*.vcf" -o merged-output.vcf -p`

This step can take a very long time for a large number of sites or samples.  For larger projects such as the 1000 Genomes Project we recommend running VcfPrinter in cluster mode.  Note that cluster mode will only work on some clusters without modification (pertinent code is in RunPrinter.rb:59,83 and clusterJobStats.rb).

###settings.rb
Running VcfPrinter in cluster mode requires a filled out settings.rb file.  Inside the VcfPrinter directory copy settings.rb.default to settings.rb (NOTE: as of revision 172 the settings.rb file should be in your current working directory rather than the VcfPrinter directory).

~~~~~~
:::bash
cp settings.rb.default settings.rb
~~~~~~

This gives you an empty settings.rb file that looks like this:

~~~~~~
:::ruby
#cp this file to settings.rb
#and fill in the details

#Program settings (DO NOT CHANGE)
VERSION="1.4.3"
REVISION="158"
#User specific settings
RUBY_EXE = ""
INSTALL_DIR = ""
#Cluster specific settings
USER = ""
SCRATCH = ""
JOB_QUEUE = ""
~~~~~~~

You must fill in the path to your ruby executable, VcfPrinter install directory, cluster username, the location of an empty scratch directory (must be accessible to all nodes), and the name of the job queue to submit jobs to.  For example:

~~~~~~~
:::ruby
#cp this file to settings.rb
#and fill in the details

#Program settings (DO NOT CHANGE)
VERSION="1.4.3"
REVISION="163"
#User specific settings
RUBY_EXE = "/usr/bin/ruby1.9.1"
INSTALL_DIR = "/projects/1000GENOMES/duane/lib/Atlas2/trunk/vcfPrinter/"
#Cluster specific settings
USER = "duane"
SCRATCH = "/projects/1000GENOMES/duane/phase3_like/snps.regeno/scratch"
JOB_QUEUE = "analysis"
~~~~~~~~

###Compress and Index VCFs
To work in cluster mode VcfPrinter requires that all VCF files are bgzipped and tabix indexed <http://samtools.sourceforge.net/tabix.shtml>.  This can be done in this manner:

~~~~~~~~~~~~~
:::bash
for file in *.vcf
  do bgzip $file
  tabix -p vcf $file.gz
end
~~~~~~~~~~~~~

###Submitting Jobs###
Once you have the settings.rb file ready you can start the VcfPrinter.  In  cluster mode VcfPrinter will split (map) the input VCFs into small regions and submit jobs to merge all the files in that region together.  It will monitor the jobs it has submitted and once they have all finished it will concatenate (reduce) the merged VCFs together in the correct order.  Note that you must run vcfPrinter on a worker node not a login node as it is doing much more that just submitting jobs.  To run VcfPrinter in cluster mode:

~~~~~~~~
:::bash
ruby vcfPrinter/RunPrinter.rb -i "[glob to vcf.gz files]" -o [output vcf] -p --cluster -n [number of jobs per chromosome] (--indel)
~~~~~~~~

For example:

~~~~~~~~
:::bash
ruby vcfPrinter/RunPrinter.rb -i "*.vcf.gz" -o ../all.snps.vcf -p --cluster -n 15
~~~~~~~~~~~

Make sure not to forget the quotes around the glob or it will not work.  If you are merging indel VCFs you should include the --indel flag:

:::bash
ruby vcfPrinter/RunPrinter.rb -i "*.vcf.gz" -o ../all.indels.vcf -p --cluster -n 15 --indel
~~~~~~~~~~~~

Step 4: VCF Cleanup and Annotation

The project-level VCF produced by VcfPrinter is extremely data-rich and for large populations is often too crowded for some applications. To remove uncalled alleles and also add in allele-count and allele-frequency annotations (AN/AC/AF) you can use the add_allele_freq.rb scipt in the Atlas2 utils directory with the -clean flag as follows:

ruby Atlas2/utils/add_allele_freq.rb all.snps.vcf -clean > all.snps.final.vcf

Related

Wiki: Atlas2 Suite

Get latest updates about Open Source Projects, Conferences and News.

Sign up for the SourceForge newsletter:

JavaScript is required for this form.





No, thanks