speciateIT Code
Status: Beta
Brought to you by:
pgajer
File | Date | Author | Commit |
---|---|---|---|
MCclassifier | 2010-08-28 | pgajer | [r166] |
R-scripts | 2010-06-02 | pgajer | [r1] init version |
perl-scripts | 2010-09-13 | pgajer | [r182] bug fixes |
spp-data | 2010-07-11 | pgajer | [r84] init version |
tests | 2010-09-13 | pgajer | [r181] |
utils | 2010-08-28 | pgajer | [r161] moved from fastasplitn |
vicut | 2010-08-28 | pgajer | [r167] |
.speciateIT_exclude | 2010-06-15 | pgajer | [r65] init version |
INSTALL | 2010-06-13 | pgajer | [r59] |
LICENSE | 2011-03-31 | pgajer | [r183] copy of license from speciateIT file |
Makefile.PL | 2010-06-08 | pgajer | [r39] changed check_apps() |
Makefile.in | 2010-09-13 | pgajer | [r180] fixed COPY bin/fastasplitn BIN_DIR/fastasplit bug |
README | 2010-09-13 | pgajer | [r182] bug fixes |
configure | 2010-08-28 | pgajer | [r162] moved fastasplitn dir to utils |
configure.ac | 2010-06-13 | pgajer | [r64] |
install-sh | 2010-06-13 | pgajer | [r57] init version |
speciateIT | 2011-03-31 | pgajer | [r184] updated year in the license statement |
test.pl | 2010-06-02 | pgajer | [r1] init version |
############################################################################## # # README file for speciateIT # ############################################################################## This is a beta version of a speciation pipeline for 16S rRNA amplicon data. In principle, the pipeline can be used for speciation based on any highly preserved gene. In a nutshell, the speciation algorithm takes as an input fasta files of the 16S sequences and *.taxon.results files produced by the RDP classifier and a RDP_classifier_parser.pl script (supplied with this package). Next, unique sequences of a specified genus (as classified by RDP classifier) are identified by cdhit-est and then 5000 most abundant unique sequences are classified at the species level using vicut. Sequences that do not match any of the traning species sequences are clustered by vicut to for a presumably novel OTUs. Of course, the number of unique sequence is a parameter that can be changed by the user. The remaing sequences are classified using nearest neighbor Markov Chain classifier (based on the classification of the 5000 most abundant sequences). We are in a process of preparing a manuscript with a detailed description of the algorithm and several of its variants. If you have any problems with installation or questions about the usage, please e-mail Pawel Gajer at pgajer@gmail.com Installation instructions can be found in the INSTALL file. ############################################################################## # # USAGE # ############################################################################## speciateIT uses reference speciation data in the form of a fasta file of 16S fragments of known species for a given genus as well as a taxon file that consists of two columns: sequence Id and the corresponding species name. The package comes with the reference speciation data for several genera. The rest of the document describes how to use speciateIT with the supplied speciation data and then how to build custom speciation reference files. If you want to speciate Lactobacillus genus using sequences from V1-V2 hyper-variable region of the 16S gene (amplified using primers spp-data/primers/V2V1primers.txt), then run > speciateIT.pl -i xyz.sample.list -o output_dir where 'xyz.sample.list' is a file consisting of three columns: sample Id, path to the fasta file of the given sample, path to taxon file of the given sample and 'output_dir' is the name of the output directory. EXAMPLE Assuming that you are in the distribution directory, execute > generateSampleList.pl -f tests/FASTA_BIN -t tests/RDP_BIN -o tests/full.sample.list > head -n10 tests/full.sample.list > tests/small.sample.list > speciateIT -i tests/small.sample.list -o tests/speciation_small_test tests/small.sample.list is a sample list of ten samples of human vaginal 16S sequences. The main tables of interest in the output directory are: otu.counts.txt otu.readCountGE500.withSIDs.counts.txt otu.readCountGE500.withSIDs.sorted.perc.txt otu.counts.txt contains a table of species/genera counts per sample. otu.readCountGE500.withSIDs.counts.txt is a version of otu.counts.txt where samples with less than 500 sequences has been filtered out from the table. Similarily, genera with fewer than two sequences have been removed. The last file is a percentage table obtained from otu.readCountGE500.withSIDs.counts.txt by dividing each count by the total row count and multiplying the result by 100. A sample.list file can be generated as in the above example with > generateSampleList.pl -f /path/to/FASTA -t /path/to/RDP -o my.sample.list where '/path/to/FASTA' is a path to the directory containing fasta files of the given project and '/path/to/RDP' is a path to a directory containing *.results.taxon files associated with the fasta files. If your fasta/taxon files are spread over several directories, then run generateSampleList.pl on each directory pair (for fasta and taxon files) separately and then concatenate the resulting sample.list files. generateSampleList.pl and other perl scripts mentioned in this README file are located in perl-scripts and /usr/local/bin (or user specified location if PREFIX flag is used) if 'make install' was run. In oreder to generate .taxon files from RDP classifier's .results files, run > RDP_classifier_parser.pl --input_list results.list --tree_file /path/to/spp-data/bergeyTrainingTree.xml --rank genus --output_format TN_SP --score_cutoff 0 -d outDir where /path/to/spp-data/bergeyTrainingTree.xml is a path to bergeyTrainingTree.xml file, located in spp-data subdirectory of the distribution directory and 'results.list' is a file with paths to *.results files (one file per line). EXAMPLE Assuming you are at the top level of the distribution directory, run > RDP_classifier_parser.pl --input_list tests/rdp.list --tree_file spp-data/bergeyTrainingTree.xml --rank genus --output_format TN_SP --score_cutoff 0 -d tests/RDP_BIN The above command creates a directory tests/RDP_BIN with *.results.taxon files associated with *.results files listed in tests/rdp.list. By defualt speciateIT runs vicut on the 5000 most abundant unique sequences within the specified genus (if there are that many of unique sequences and all unique sequences if the number of unique sequences is less than 5000). You change that number with -u or --unique_sequence flag. Another parameter of speciatIT is --min_read_count or -s and it is used to specified the minimum number of sequences within each OTU. Here is rough outline what needs to be done to speciate some genus 1. look at the distribution of your 454 read lengths. 2. filter out the reads with Ns and lengths that are one standard deviation from the median or something like this. 3. Trim your species sequences to the median+(one st. dev.) length using > trimSeqs.pl -i <species fasta file> -p ../primers/V2V1primers.txt -l <minLen> -f <maxLen> -o <output fasta file> for example > trimSeqs.pl -i rdp_vagi1200Len_Lactobacillus.fa -p ../primers/V2V1primers.txt -l 300 -f 360 -o rdp_vagi1200Len_Lactobacillus.V2V1.360bp.fa Here the trimming is done so that the species sequences have length in the range 300-360. I am using V2V1 primers. trimSeqs.pl is in speciateIT/perl-scripts directory. Make sure that your species sequences all contain species name in the header of the sequence (in the fasta file). Otherwise they are useless for speciation. 4. Create non-redundant species sequence fasta file > cd-hit-est -i rdp_vagi1200Len_Lactobacillus.V2V1.360bp.fa -o rdp_vagi1200Len_Lactobacillus.V2V1.360bp_nr.fa -c 1.0 -n 9 5. Create taxonomy file > fastaToTaxonomy.pl -i rdp_vagi1200Len_Lactobacillus.V2V1.360bp_nr.fa -o rdp_vagi1200Len_Lactobacillus.V2V1.360bp_nr.taxon 6. Run speciateIT with flags --spp_fasta=/path/to/speciation/genus/fasta/file of non-redundant reads that you just created --spp_taxon=/path/to/speciation/genus/taxon/file --spp_genus=name-of-genus-to-be-speciated