bioruan

Summarizes and plots (with automatic detection of variable type) all or specified fields in a VCF file, except the CHROM, POS and ID fields, which I think are not necessary to summarize.
Extracts arbitrary fixed fields or values in sample fields to a TAB-delimited file.
Native support for multi-sample VCF files.

Before using, make the script executable:
chmod +x vcf-summarize.sh

Simplest usage: vcf-summarize.sh -f filename.vcf -a # extract and summarize all fields and subfields
For a large file: nohup vcf-summarize.sh -f filename.vcf -a &

Usage
-f [required] Takes 1 file: the target VCF file. Supports plain text, gz and bz files.
-a [optional] Takes 0 arguments. If specified, extracts and summarizes all variables.
-q [optional] Takes 0 arguments. If specified, skips the REF, ALT, QUAL and FILTER fields.
-c [optional] Takes 1 string. e.g. -c "chr1" limits the analysis to records whose CHROM field equals chr1.
-i [optional] Takes 1 string. e.g. -i "CHROM POS" will use CHR_POS as the index of the extracted tables [default]. You can choose any combination of "CHROM POS ID REF ALT". The idea is to generate a unique index from the smallest number of fields.
-I [optional] Takes 1 string. e.g. -I "AN DB" extracts and summarizes the AN and DB subfields of the INFO field. Overrides option -a, which analyzes all subfields in INFO.
-F [optional] Takes 1 string. e.g. -F "GT AD DP" extracts and summarizes the GT, AD and DP subfields in the sample columns. Overrides option -a, which analyzes all subfields in the sample columns.
-s [optional] Takes 0 arguments. If specified, only does data extraction; summarization and plotting are suppressed.
-o [optional] Takes 1 string: the output directory name. Default is vcfsummarize.
-h [optional] Shows this help.
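
To decide which fields to pass to -i, it can help to check beforehand whether a given combination actually yields a unique index. The sketch below is independent of vcf-summarize.sh and is not taken from its code: count_duplicate_index is a hypothetical helper, the toy VCF is made up for illustration, and fields are given as column numbers (1=CHROM 2=POS 3=ID 4=REF 5=ALT) joined with "_", which is only assumed to resemble the script's index format.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of vcf-summarize.sh): print how many records
# repeat an index that was already seen, for the given column numbers.
count_duplicate_index() {
    local vcf="$1" cols="$2"
    awk -F'\t' -v cols="$cols" '
        /^#/ { next }                       # skip meta and header lines
        {
            n = split(cols, c, " ")
            key = $(c[1])
            for (i = 2; i <= n; i++) key = key "_" $(c[i])
            if (seen[key]++) dup++          # key was already used by an earlier record
        }
        END { print dup + 0 }
    ' "$vcf"
}

# Toy VCF (illustrative data): two records share chr1:100 but differ in ALT.
demo_vcf="$(mktemp)"
printf '##fileformat=VCFv4.1\n' > "$demo_vcf"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> "$demo_vcf"
printf 'chr1\t100\t.\tA\tG\t50\tPASS\tDP=10\n' >> "$demo_vcf"
printf 'chr1\t100\t.\tA\tT\t40\tPASS\tDP=12\n' >> "$demo_vcf"
printf 'chr1\t200\t.\tC\tT\t60\tPASS\tDP=8\n' >> "$demo_vcf"

count_duplicate_index "$demo_vcf" "1 2"     # CHROM POS
count_duplicate_index "$demo_vcf" "1 2 5"   # CHROM POS ALT
```

On this toy file the first call prints 1 and the second prints 0, so "CHROM POS ALT" would be the smaller safe choice here, while "CHROM POS" alone is enough for files with one record per position.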

This is a personal effort without any funding support
Report suggestions and bugs to ruansun@163.com

Note: Two small bugs were fixed on Jun 11 2014.
The first bug could cause the program to crash if names in FORMAT contain "_" (i.e. an underscore). You are OK if your FORMAT field does not have any name with "_".
The second bug could cause a sample field (e.g. AD) to be filled with the value of another field (e.g. DP) if the AD field does not exist for that variant locus. This generally does not affect GT extraction.
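
For readers who want to double-check values extracted with an older copy of the script, the sketch below shows one way to look up a sample subfield by name using each record's own FORMAT string, so that a missing key gives "." rather than a neighbouring value. It is not the script's actual code and makes no claim about how the bug arose; get_sample_value is a hypothetical helper and the FORMAT and sample strings are made-up examples.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the value of one FORMAT key (e.g. AD) for a single
# sample column, locating the key by name in that record's FORMAT string.
# Prints "." when the key is absent for the record.
get_sample_value() {
    local format="$1" sample="$2" key="$3"
    awk -v fmt="$format" -v smp="$sample" -v key="$key" 'BEGIN {
        n = split(fmt, names, ":")          # FORMAT keys, e.g. GT AD DP
        split(smp, values, ":")             # sample values in the same order
        out = "."
        for (i = 1; i <= n; i++)
            if (names[i] == key && values[i] != "") out = values[i]
        print out
    }'
}

get_sample_value "GT:AD:DP" "0/1:12,8:20" "AD"   # AD present for this record
get_sample_value "GT:DP"    "0/1:20"      "AD"   # AD absent for this record
```

The first call prints 12,8 and the second prints a single dot, whereas a lookup that assumed AD always sits in the second position would have returned the DP value 20 for the second record.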