Deploying VIGOR3
----------------
1. un-tar VIGOR3.tgz. This creates a directory named VIGOR3 containing
the VIGOR software and reference databases.
$ tar xzvf VIGOR3.tgz -C /mypath
2. define a scratch space for VIGOR
$ # path used here is an example, any directory will do
$ mkdir /mypath/VIGOR3/tempspace
$ chmod 777 /mypath/VIGOR3/tempspace
3. define a symbolic link for the scratch space
$ # symbolic link requires FULL path
$ cd /mypath/VIGOR3
$ chmod 777 prod3
$ ln -s /mypath/VIGOR3/tempspace prod3/vigorscratch
4. define symbolic links for external programs
$ cd /mypath/VIGOR3
$ ln -s /usr/local/bin/perl prod3/perl
$ ln -s /usr/local/bin/blastall prod3/blastall
$ ln -s /usr/local/bin/bl2seq prod3/bl2seq
$ ln -s /usr/local/bin/formatdb prod3/formatdb
$ ln -s /usr/local/bin/fastacmd prod3/fastacmd
$ ln -s /usr/local/bin/clustalw prod3/clustalw2
$ ln -s /usr/local/bin/muscle prod3/muscle
$ ln -s /usr/local/bin/cd-hit prod3/cd-hit
$ chmod 555 prod3
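The links created in steps 3 and 4 can be checked with a small helper. This is a minimal sketch, not part of VIGOR: the function name check_links is made up for illustration, and the list of link names is copied from the commands above. Since muscle and cd-hit are only needed by the dbutils programs, a report for those two may be acceptable on some installations.

```shell
#!/bin/sh
# Report links under a prod3 directory that are missing or do not resolve.
# check_links is a hypothetical helper, not shipped with VIGOR.
# Usage: check_links /mypath/VIGOR3/prod3
check_links() {
    dir=$1
    bad=0
    for name in vigorscratch perl blastall bl2seq formatdb fastacmd \
                clustalw2 muscle cd-hit; do
        if [ ! -L "$dir/$name" ]; then
            # no symbolic link with this name exists
            echo "missing link: $name"
            bad=1
        elif [ ! -e "$dir/$name" ]; then
            # -e follows the link, so this catches dangling targets
            echo "broken link: $name"
            bad=1
        fi
    done
    return $bad
}
```

The function prints nothing and returns 0 when every link exists and resolves, so it can be used as "check_links /mypath/VIGOR3/prod3 && echo ok".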
notes:
1. the dbutils directory under prod3 contains utility programs used
to support the creation of reference databases for VIGOR
2. muscle and cd-hit are used by programs in "dbutils"; they are
not required by VIGOR.
3. the adhoc directory under prod3 contains a handful of ad hoc
programs created during the project; these programs use many of
VIGOR's library functions but are not part of VIGOR
4. three additional programs are contained in the prod3 directory
a. rna_finder - used by the JCVI pipeline to annotate non-coding
genes
b. tblUTR - used by the JCVI pipeline to extend gene
boundaries to include the UTRs
c. hmm3Evidence - used by the JCVI pipeline to supply HMM3
evidence supporting the functional annotation of the gene
Running VIGOR3
--------------
Example:
$ VIGOR3.pl -d yfv -i samples/westnile.fasta -o test/westnile
(sample fasta and output files can be found in the samples directory)
Usage:
-- allow VIGOR to choose the reference database
$ VIGOR3.pl -i inputfasta -o outputprefix
-- tell VIGOR which reference database to use
$ VIGOR3.pl -d refdb -i inputfasta -o outputprefix
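To annotate several FASTA files with the usage patterns above, the commands can be generated in a loop. This is a sketch under assumptions: the function name vigor_batch_cmds is made up, the inputs are assumed to end in ".fasta", and paths are assumed to contain no spaces. It only prints the commands (a dry run) so they can be reviewed first.

```shell
#!/bin/sh
# Print one VIGOR3.pl command per *.fasta file in a directory (dry run).
# vigor_batch_cmds is a hypothetical helper, not shipped with VIGOR.
# Usage: vigor_batch_cmds <input dir> <output dir> [ref db]
vigor_batch_cmds() {
    indir=$1
    outdir=$2
    refdb=${3:-}              # optional; empty lets VIGOR choose the database
    for fasta in "$indir"/*.fasta; do
        [ -e "$fasta" ] || continue   # no matches: the glob stays literal
        base=$(basename "$fasta" .fasta)
        if [ -n "$refdb" ]; then
            echo "VIGOR3.pl -d $refdb -i $fasta -o $outdir/$base"
        else
            echo "VIGOR3.pl -i $fasta -o $outdir/$base"
        fi
    done
}
```

Once the printed commands look right, they can be executed by piping the output to sh, e.g. "vigor_batch_cmds samples test yfv | sh".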
Command Line Options:
-a auto-select the reference database, equivalent to "-d any", default
behavior unless overridden by -d or -G, (-A is a synonym for this
option)
-d <ref db>, specify the reference database to be used, (-D is a synonym
for this option)
-e <evalue>, override the default evalue used to identify potential
genes, the default is usually 1E-5, but varies by reference database
-c <pct ref> minimum coverage of reference product (0-100) required to
report a gene, by default coverage is ignored
-C complete (linear) genome (do not treat edges as gaps)
-0 (zero) complete circular genome (allows gene to span origin)
-f <0, 1, or 2>, frameshift sensitivity, 0=low 1=normal 2=high
(defaults to 1)
-i <input fasta>, path to fasta with genomic sequences to be annotated
(-I is a synonym for this option)
-l do NOT use locus_tags in TBL file output (incompatible with -L)
-L USE locus_tags in TBL file output (incompatible with -l)
-o <output prefix>, prefix for output files, e.g. if the output
prefix is /mydir/anno VIGOR will create output files /mydir/anno.tbl,
/mydir/anno.stats, etc., (-O is a synonym for this option)
-P <parameter=value~~...~~parameter=value>, override default values of
VIGOR parameters
-j turn off JCVI rules, JCVI rules treat gaps and ambiguity codes
conservatively, use this option to relax these constraints and
produce a more speculative annotation
-m ignore reference match requirements (coverage/identity/similarity),
sometimes useful to evaluate raw contigs and rough draft sequences
-s <gene size> minimum size (aa) of product required to report a gene,
by default size is ignored
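The -P option takes a single argument in which parameter=value pairs are separated by "~~". The helper below is a minimal sketch for building that argument; join_params is a made-up name, and param_one and param_two in the usage line are placeholders, not real VIGOR parameter names.

```shell
#!/bin/sh
# Join parameter=value pairs with "~~", the separator -P expects.
# join_params is a hypothetical helper, not shipped with VIGOR.
# Usage: join_params <parameter=value> [<parameter=value> ...]
join_params() {
    joined=""
    for pair in "$@"; do
        if [ -z "$joined" ]; then
            joined=$pair                  # first pair: no separator
        else
            joined="${joined}~~${pair}"   # later pairs: prefix with "~~"
        fi
    done
    printf '%s\n' "$joined"
}
```

Quoting the result keeps the shell from altering it:
$ VIGOR3.pl -i inputfasta -o outputprefix -P "$(join_params param_one=10 param_two=abc)"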
Outputs:
outputprefix.rpt - summary of program results
outputprefix.stats - run statistics (per genome sequence) in tab-
delimited format
outputprefix.cds - fasta file of predicted CDSs
outputprefix.pep - fasta file of predicted proteins
outputprefix.tbl - predicted features in GenBank tbl format
outputprefix.aln - alignment of predicted protein to reference, and
reference protein to genome
outputprefix.fs - subset of aln report for those genes with
potential sequencing issues
outputprefix.at - potential sequencing issues in tab-delimited
format
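After a run, the output files listed above can be checked for a given prefix. This is a sketch only: missing_outputs is a made-up name, and it assumes that a run writes all eight files, which this README does not state explicitly.

```shell
#!/bin/sh
# Print the expected output files that do not exist for a given prefix.
# missing_outputs is a hypothetical helper, not shipped with VIGOR.
# Usage: missing_outputs /mydir/anno
missing_outputs() {
    prefix=$1
    # extensions taken from the Outputs list above
    for ext in rpt stats cds pep tbl aln fs at; do
        [ -e "$prefix.$ext" ] || echo "$prefix.$ext"
    done
}
```

Empty output means every listed file was found.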
Reference Datasets:
Name Description (Synonyms)
any any virus (vda)
cov_abcdx Alpha/Beta/Gamma/Delta/Unclassified Cov*
veev Alphaviruses (VEEV/EEEV) (alpha,eeev)
bunya Bunyaviridae
hanta Bunyaviridae Hantavirus (hantavirus)
obunya Bunyaviridae Orthobunyavirus
bunya_misc Bunyaviridae miscellaneous
gcv Coronavirus (cov)
gcv_g1a Coronavirus Group 1A (cov_g1a)
gcv_g1b Coronavirus Group 1B (cov_g1b)
gcv_g2a Coronavirus Group 2A (cov_g2a)
gcv_g2b Coronavirus Group 2B (SARS) (cov_g2b,
sars)
gcv_g2cd Coronavirus Group 2C & 2D (cov_g2c,
cov_g2d)
gcv_g3 Coronavirus Group 3 (cov_g3)
filo Filoviridae (Ebola/Marburg) (ebola,
marburg)
giv Flu (flu)
giv_a Flu A (flu_a)
giv_b Flu B (flu_b)
giv_c Flu C (flu_c)
hrv Human Rhinovirus/Enterovirus (entero,
rhino)
hadv Human adenovirus
hadv_a Human adenovirus A
hadv_b Human adenovirus B
hadv_c Human adenovirus C
hadv_d Human adenovirus D
hadv_e Human adenovirus E
hadv_f Human adenovirus F
hadv_g Human adenovirus G
hhv Human herpesvirus+ (hsv)
hhv1 Human herpesvirus 1+ (hsv1)
hhv2 Human herpesvirus 2+
hhv3 Human herpesvirus 3 (Varicellovirus)+ (var)
hhv4 Human herpesvirus 4+
hhv5 Human herpesvirus 5+
msl Measles / Morbillivirus (measles)
mpv Metapneumovirus (MPV)
mmp Mumps / Rubulavirus (mumps)
norv Norovirus (noro)
norv_1 Norovirus I (noro1)
norv_2 Norovirus II (noro2)
norv_misc Norovirus miscellaneous
norv_mur Norovirus murine
rabies Rabies
rsv Respiratory syncytial virus (RSV)
respiro Respirovirus (resp)
hpiv_1 Respirovirus HPIV-1 (hpiv1)
hpiv_3 Respirovirus HPIV-3 (hpiv3)
sendai Respirovirus Sendai
rtv Rotavirus (rota)
rtv_a Rotavirus A (rota_a)
rtv_b Rotavirus B (rota_b)
rtv_c Rotavirus C (rota_c)
rtv_f Rotavirus F (rota_f)
rtv_g Rotavirus G (rota_g)
rbl Rubella (rubella)
sapo Sapovirus
yfv Yellow Fever / Japanese encephalitis (JEV) (jev)
* non-standard grouping, must be invoked directly, not included in
"any virus" via -A or as a subset of other -D specifications
+ these datasets have not been curated