****** CITATION ******
Chesters D, Yu F, Cao HX, Dai QY, Wu QY, Shi WF, Zheng WM, Zhu CD.
in press.
Heuristic optimization for global species clustering of DNA sequence data from multiple loci
****** VERSION HISTORY ******
v1.02: runs on new version of multi_locus_MOTU.
uses HA adjusted rand index via call to R.
v1.04: almost release version ... removed development code.
now uses keyfile to indicate identity of reference dataset.
v1.05: improved usability
****** DEPENDENCIES ******
essentials:
R
clues R module
multi_locus_MOTU_4.03.pl (get from https://sourceforge.net/projects/multilocusmotu/files/?source=navbar)
Matching.pm (same as above)
accessories:
makeblastdb
blastn
esprit (hcluster)
blastout_to_esprit.pl
insert_ID_to_esprit_output.pl
****** INSTRUCTIONS ******
Assumes familiarity with the linux command line.
Make sure you have R installed, and the clues R module,
check this works by the R command: library("clues")
Place the Perl scripts and the file Matching.pm into your working folder.
Inputs to the main procedure are a load of files,
one for each gene/clustering threshold,
so 3 genes and 40 thresholds would be 120 files.
A given cluster file has one line per MOTU,
where the members of a MOTU are whitespace seperated.
The members (corresponding to individual sequences) must have
species label, followed by underscore, followed by accession-like identifier.
check the example input
You can linkage cluster using any method you wish,
as long as the resulting files are in the correct format.
Alternativly, i give an example procedure using the
blast+ (Camacho et al, 2009) and esprit (Sun et al, 2009) programs.
I dont claim this is the best way to do this,
in particular note using distances from local alignments
for species clustering has been critisised,
But i have found these programs to scale well to very large datasets.
You need to produce a keyfile (see example),
This contains a list of all species IDs,
followed by 1 (if the id is from the reference partition),
or 0 (if from the subject partition)
The output file MLCO_results_R_table can be read into R
(maybe requires tweak).
You will probably want to plot the column best_iteration_tRI,
find the stationary phase,
and calculate the average species counts for that phase,
where best_iteration_number_unidentified_motu is count of
global MOTU of unidentified data, and best_iteration_total_motu is total.
****** SIMPLIFIED ALGORITHM OUTLINE ******
get best clustering combination by:
propose new set of clustering parameters
obtain appropriate clusters and link loci by k-partite matching,
calculate combined rand index RI-T'
if RI-T' > RI-T (previous best), adopt new.
****** EXAMPLE PIPELINE START ******
# install blast+ and esprit. im using blastn-2.2.28 and esprit-version_unknown.
# you should have no problems installing blast. esprit requires a minor change to Makefile, see the manual.
# step 1, calculate all-against-all pairwise distances using blast+, do this for each of 2 files, 16S and COI
makeblastdb-2.2.25+-64bit -in COI.fas -dbtype nucl
makeblastdb-2.2.25+-64bit -in 16S.fas -dbtype nucl
blastn-2.2.28+-64bit -task blastn -db COI.fas -query COI.fas -out COI_distances -perc_identity 95 -max_target_seqs 10000 -evalue 1e-10 -dust no -strand both -outfmt '6 qseqid sseqid evalue pident length'
blastn-2.2.28+-64bit -task blastn -db 16S.fas -query 16S.fas -out 16S_distances -perc_identity 95 -max_target_seqs 10000 -evalue 1e-10 -dust no -strand both -outfmt '6 qseqid sseqid evalue pident length'
# step 2, run esprit to cluster based on the blast similarity scores.
# two additional scripts are required to create esprit input files, and process output
perl ~/usr_scripts/blastout_to_esprit.pl COI_distances
hcluster COI_distances.esprit_in -e 0.05 -t 1000000 -p 1 -s 0.001
perl ~/usr_scripts/insert_ID_to_esprit_output.pl COI_distances.Cluster COI_distances.esprit_keyfile COI.espritOUT 0.001
# do the same for 16S:
perl ~/usr_scripts/blastout_to_esprit.pl 16S_distances
hcluster 16S_distances.esprit_in -e 0.05 -t 1000000 -p 1 -s 0.001
perl ~/usr_scripts/insert_ID_to_esprit_output.pl 16S_distances.Cluster 16S_distances.esprit_keyfile 16S.espritOUT 0.001
# step 3, run the heuristic search.
# 1 command argument (-clust), after this you have to write the file prefix name for the esprit cluster files
# the program will scan the directory to get all the threshold files present, since esprit often omits some thresholds
perl ~/usr_scripts/MLCO_1.05.pl -clust COI.espritOUT 16S.espritOUT
# ****** pipeline end ******
REFERENCES
Yijun Sun, Yunpeng Cai, Li Liu, Fahong Yu, Michael L. Farrell, William McKendree, William Farmerie. 2009
ESPRIT: Estimating Species Richness Using Large Collections of 16S rRNA Shotgun Sequences
Nucleic Acids Research
Camacho, C., Coulouris, G., Avagyan, V., Ma, N., Papadopoulos, J., Bealer, K., Madden, T. (2009)
BLAST+: architecture and applications.
BMC Bioinformatics, 10