speciateIT Code

Status: Beta

Brought to you by: pgajer

Tree [r184] /

History

HTTPS access

File	Date	Author	Commit
MCclassifier	2010-08-28	pgajer	[r166]
R-scripts	2010-06-02	pgajer	[r1] init version
perl-scripts	2010-09-13	pgajer	[r182] bug fixes
spp-data	2010-07-11	pgajer	[r84] init version
tests	2010-09-13	pgajer	[r181]
utils	2010-08-28	pgajer	[r161] moved from fastasplitn
vicut	2010-08-28	pgajer	[r167]
.speciateIT_exclude	2010-06-15	pgajer	[r65] init version
INSTALL	2010-06-13	pgajer	[r59]
LICENSE	2011-03-31	pgajer	[r183] copy of license from speciateIT file
Makefile.PL	2010-06-08	pgajer	[r39] changed check_apps()
Makefile.in	2010-09-13	pgajer	[r180] fixed COPY bin/fastasplitn BIN_DIR/fastasplit bug
README	2010-09-13	pgajer	[r182] bug fixes
configure	2010-08-28	pgajer	[r162] moved fastasplitn dir to utils
configure.ac	2010-06-13	pgajer	[r64]
install-sh	2010-06-13	pgajer	[r57] init version
speciateIT	2011-03-31	pgajer	[r184] updated year in the license statement
test.pl	2010-06-02	pgajer	[r1] init version

Read Me

##############################################################################
#
# README file for speciateIT
#
##############################################################################

This is a beta version of a speciation pipeline for 16S rRNA amplicon data. In
principle, the pipeline can be used for speciation based on any highly preserved
gene.

In a nutshell, the speciation algorithm takes as an input fasta files of the 16S
sequences and *.taxon.results files produced by the RDP classifier and a
RDP_classifier_parser.pl script (supplied with this package). Next, unique
sequences of a specified genus (as classified by RDP classifier) are identified
by cdhit-est and then 5000 most abundant unique sequences are classified at the
species level using vicut. Sequences that do not match any of the traning species
sequences are clustered by vicut to for a presumably novel OTUs. Of course, the
number of unique sequence is a parameter that can be changed by the user. The
remaing sequences are classified using nearest neighbor Markov Chain classifier
(based on the classification of the 5000 most abundant sequences).

We are in a process of preparing a manuscript with a detailed description of the
algorithm and several of its variants.

If you have any problems with installation or questions about the usage, please
e-mail Pawel Gajer at pgajer@gmail.com

Installation instructions can be found in the INSTALL file.

##############################################################################
#
# USAGE
#
##############################################################################

speciateIT uses reference speciation data in the form of a fasta file of 16S
fragments of known species for a given genus as well as a taxon file that
consists of two columns: sequence Id and the corresponding species name. The
package comes with the reference speciation data for several genera. The rest
of the document describes how to use speciateIT with the supplied speciation data
and then how to build custom speciation reference files.

If you want to speciate Lactobacillus genus using sequences from V1-V2
hyper-variable region of the 16S gene (amplified using primers
spp-data/primers/V2V1primers.txt), then run

> speciateIT.pl -i xyz.sample.list -o output_dir

where 'xyz.sample.list' is a file consisting of three columns: sample Id, path to
the fasta file of the given sample, path to taxon file of the given sample and
'output_dir' is the name of the output directory.

EXAMPLE
Assuming that you are in the distribution directory, execute

> generateSampleList.pl -f tests/FASTA_BIN -t tests/RDP_BIN -o tests/full.sample.list
> head -n10 tests/full.sample.list > tests/small.sample.list
> speciateIT -i tests/small.sample.list -o tests/speciation_small_test

tests/small.sample.list is a sample list of ten samples of human vaginal 16S
sequences. The main tables of interest in the output directory are:

otu.counts.txt
otu.readCountGE500.withSIDs.counts.txt
otu.readCountGE500.withSIDs.sorted.perc.txt

otu.counts.txt contains a table of species/genera counts per sample.
otu.readCountGE500.withSIDs.counts.txt is a version of otu.counts.txt where
samples with less than 500 sequences has been filtered out from the
table. Similarily, genera with fewer than two sequences have been removed. The
last file is a percentage table obtained from
otu.readCountGE500.withSIDs.counts.txt by dividing each count by the total row
count and multiplying the result by 100.

A sample.list file can be generated as in the above example with

> generateSampleList.pl -f /path/to/FASTA -t /path/to/RDP -o my.sample.list

where '/path/to/FASTA' is a path to the directory containing fasta files of the
given project and '/path/to/RDP' is a path to a directory containing
*.results.taxon files associated with the fasta files. If your fasta/taxon files
are spread over several directories, then run generateSampleList.pl on each
directory pair (for fasta and taxon files) separately and then concatenate the
resulting sample.list files. generateSampleList.pl and other perl scripts
mentioned in this README file are located in perl-scripts and /usr/local/bin (or
user specified location if PREFIX flag is used) if 'make install' was run.

In oreder to generate .taxon files from RDP classifier's .results files, run

> RDP_classifier_parser.pl --input_list results.list --tree_file /path/to/spp-data/bergeyTrainingTree.xml --rank genus --output_format TN_SP --score_cutoff 0 -d outDir

where /path/to/spp-data/bergeyTrainingTree.xml is a path to
bergeyTrainingTree.xml file, located in spp-data subdirectory of the distribution
directory and 'results.list' is a file with paths to *.results files (one file per line).

EXAMPLE
Assuming you are at the top level of the distribution directory, run

> RDP_classifier_parser.pl --input_list tests/rdp.list --tree_file spp-data/bergeyTrainingTree.xml --rank genus --output_format TN_SP --score_cutoff 0 -d tests/RDP_BIN

The above command creates a directory tests/RDP_BIN with *.results.taxon files
associated with *.results files listed in tests/rdp.list.

By defualt speciateIT runs vicut on the 5000 most abundant unique sequences
within the specified genus (if there are that many of unique sequences and all
unique sequences if the number of unique sequences is less than 5000). You change
that number with -u or --unique_sequence flag.

Another parameter of speciatIT is --min_read_count or -s and it is used to
specified the minimum number of sequences within each OTU.

Here is rough outline what needs to be done to speciate some genus

1. look at the distribution of your 454 read lengths.
2. filter out the reads with Ns and lengths that are one standard deviation from the median or something like this.
3. Trim your species sequences to the median+(one st. dev.) length using

> trimSeqs.pl -i <species fasta file> -p ../primers/V2V1primers.txt -l <minLen> -f <maxLen> -o <output fasta file>

for example

> trimSeqs.pl -i rdp_vagi1200Len_Lactobacillus.fa -p ../primers/V2V1primers.txt -l 300 -f 360 -o rdp_vagi1200Len_Lactobacillus.V2V1.360bp.fa

Here the trimming is done so that the species sequences have length in the range 300-360. I am using V2V1 primers. trimSeqs.pl is in speciateIT/perl-scripts directory.

Make sure that your species sequences all contain species name in the header of the sequence (in the fasta file). Otherwise they are useless for speciation.

4. Create non-redundant species sequence fasta file

> cd-hit-est -i rdp_vagi1200Len_Lactobacillus.V2V1.360bp.fa -o rdp_vagi1200Len_Lactobacillus.V2V1.360bp_nr.fa -c 1.0 -n 9

5. Create taxonomy file

> fastaToTaxonomy.pl -i rdp_vagi1200Len_Lactobacillus.V2V1.360bp_nr.fa -o rdp_vagi1200Len_Lactobacillus.V2V1.360bp_nr.taxon

6. Run speciateIT with flags
--spp_fasta=/path/to/speciation/genus/fasta/file of non-redundant reads that you just created
--spp_taxon=/path/to/speciation/genus/taxon/file
--spp_genus=name-of-genus-to-be-speciated

speciateIT Code

Tree [r184] / Download Snapshot History

Read Me

Tree [r184] /

History