Updated versions of the programs/methods described in:
"Parallel BLAST on split databases" Bioinformatics
2003 Sep 22;19(14):1865-6.
https://doi.org/10.1093/bioinformatics/btg250
These now work with modern BLAST (C++ toolkit) versions.
In brief, these split a database across multiple compute
nodes, search each slice with blast in parallel, and then
reassemble into a single output file at the end on the
master node.
The blast programs were slightly modified so that they
can use a different form of taxon id restriction than the
NCBI now supports in V5 databases. This method was described
in the above paper and works with either V4 or V5 databases.
It supports internal taxonomy nodes, like "rodentia", rather than
having to list the taxon id for every type of rodent.
Example: to search all rodents other than mouse use "9989,-1758".
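As a rough illustration of how such a list reads (a sketch only, not code
from this package; the function name describe_taxon_list is made up for this
example), the following shell function walks the list in the order given and
reports which entries would be included and which excluded:

```shell
# Hypothetical helper, not part of parallelblast_plus.
# Reads a comma delimited taxon id list and prints, in order,
# whether each entry is an inclusion or an exclusion.
describe_taxon_list() {
    old_ifs=$IFS
    IFS=,
    for id in $1; do
        case $id in
            -*) echo "exclude ${id#-}" ;;   # leading '-' marks an exclusion
            *)  echo "include $id" ;;
        esac
    done
    IFS=$old_ifs
}

describe_taxon_list "9989,-1758"
# prints:
# include 9989
# exclude 1758
```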
Also included are scripts for downloading, splitting, formatting
databases, running blast queries against those databases, and backing
up and restoring data directories to other compute nodes. Pretty
much every script has a section near the top which must be modified.
Values which really, really must be changed, as they point to
our site, are marked with the string "CHANGETHIS_".
Example command (this assumes the database has already been
split across the compute nodes, parallel_dblist.txt filled
in with the database information, passwordless ssh set up, etc.):
pfm_ssh_blastmaster_rev5.pl \
-query test.pfa \
-program blastp \
-dbname=bac.aa \
-e 1e-80 \
-taxonid_cit 28211 \
-b 20 \
-v 20 \
-text
This would search the bacterial amino acid database, restricting
the search to taxa under 28211 (Alphaproteobacteria), and return
the best 20 hits and their corresponding alignments in text format.
If -text were omitted the result would be html.
If -nomerge were added the raw files from each node would be left on disk.
This is parallelblast_plus release 1.0.5.
The final C toolkit release, 2.0.9, is also present in the SourceForge
project files directory.
Jan 06, 2021
David Mathog
********************************************************************
Nomenclature specific to this package:
OAT "Oid AccessionNumber Taxonid". These triplets are
used to keep track of taxon information through
processing. Some are extracted from NCBI blast
databases and others are constructed on the fly.
********************************************************************
External Program Dependencies - these must be obtained from elsewhere:
ascp https://www.ncbi.nlm.nih.gov/books/NBK242625/
extract, accudate From https://sourceforge.net/projects/drmtools/files/
blastp (etc.) From blast_progs_2.9.0_CIT.tar.gz, should be in the
file section of the parallelblast_plus sourceforge
project.
Perl script module dependencies. To see which is needed where: grep '^use' *pl
Obtain from CPAN.
CGI
DirHandle
File::stat
File::Temp
File::Which
Getopt::Long
MIME::Lite
Parallel::ForkManager
POSIX
subs::parallel
********************************************************************
File Organization:
Cluster directories (/usr/common is exported by master, NFS mounted on compute nodes)
/usr/common/bin
#programs and scripts
/usr/common/BLASTDB (and subdirectories)
#used to download and unpack/repack sequence data
#Master node directory
/usr/common/ncbi
# Nodes define env variable NCBI
# Directory must contain the .ncbirc file
/usr/common/etc
# configuration files
Compute Node directories (local to node)
/usr/local/databases
#used to hold split database chunks for that node
/scratch/secondary
#holds a copy of a different node's databases (backup)
********************************************************************
Files in this distribution are:
blastcontrol_rev5.pl EXAMPLE web interface to run these programs. MANY DEPENDENCIES,
some of which are included here:
lookup_org_chr_position.html
fastarange.c
genericfailurehtml.pl
mailparallelblastresults.sh
mailparallelblastresults.pl
setuser.c (from w2h package)
blastcontrol_rev2.pl
blastcontrol_rev3.pl
blastcontrol_rev4.pl Older versions of the preceding.
blastplusmerge.c Merge output of database sliced queries.
blast_test_data.tar.gz Test data used in SAF_CHANGES_TEST.txt
Downloaded separately from Sourceforge.
cdd.ncbirc .ncbirc file for the CDD directory
COPYING GPL v3 license as text file.
fastafrag.c Fragment fasta files.
fastaproperties.c Count entries and summarize properties of a fasta file.
fastarange.c Read a fasta file and emit a range of entries and/or
sequence positions.
fasta_range_calculate.c
Calculate start/end offsets for "equal" sized
chunks of a fasta file. Emits for each chunk:
number, start offset, end offset, bytes.
Results may be used with QUERY_OFFSET_FIRST and
QUERY_OFFSET_LAST to segment blast query input
without having to traverse the entire file up
to that point.
fetch_cdd.sh Download CDD databases from the NCBI.
fetch_db_bac.pl Download and repack the reference set of
Bacterial genomes from the NCBI.
fetch_db_bdb.sh Download nr, nt, swissprot, pdbaa, pdbnt, refseq_protein,
and refseq_rna from the NCBI as blast databases,
convert to fasta, extract OAT files, remove blast databases.
Conversion works on chunks of the database at a time.
fetch_db.sh Download a few small databases from the NCBI
fetchtaxon.sh Retrieve current taxonomy_names.dmp and
taxonomy_nodes.dmp from the NCBI.
final-2.10.1.patch
final-2.10.0.patch
final-2.9.0.patch
Patches for NCBI C++ toolkit blast+ source code
ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.9.0+-src.tar.gz
As of 2.10.1 this adds support for:
BDB_OIDRANGES series of OID ranges, comma separated.
May be used to split work on compute nodes
which all have a copy of a given database.
Node1=1,1000; Node2=1001,2000; etc.
BDB_TAXONID is not required.
BDB_TAXONID root part of oid index filename, as prepared
by prep_taxon_oidl. Example, it would be
oid_bac.aa_lcl for the files
oid_bac.aa_lcl.bin and oid_bac.aa_lcl.idx
BDB_TAXONLIST Comma delimited list of taxonids, positive
are included, negative are excluded. Added
and removed in the order specified.
BDB_TAXONID is required.
QUERY_OFFSET_FIRST
QUERY_OFFSET_LAST
Byte offsets of the first and last bytes of the range of the query
file to process. This allows the input to be "sliced"
on the fly to partition the work. Offsets are in range
{0,file_length-1}. The '>' in the first header should
be at QUERY_OFFSET_FIRST and the '\n' following the
sequence at QUERY_OFFSET_LAST.
The program fasta_range_calculate may be used to
calculate these values.
Default is {1,LLONG_MAX} (external coordinates)
blastdbcmd works with these with no other command line parameters
required
blastp etc. need "-taxids 1" to enable control by BDB_* variables.
When BDB_TAXONLIST is specified the value of -taxids
on the command line is ignored.
When BDB_TAXONLIST is not supplied the -taxids
command line values are used. However, these will
be seen in order of decreasing value regardless of their
original order, which does not provide the same level of control
as BDB_TAXONLIST does.
fix_oid_bac.pl Reprocesses bac.aa.gz and bac.aa.oat.gz
to remove duplicate protein sequences,
retaining an OAT entry for each.
(Most are marked MULTISPECIES, but sometimes they are not.)
from_secondary Script to restore a node's /local/databases from
a copy on another node
genericfailurehtml.pl Failure handling script called by web site.
HOW_TO_DO_CDD.TXT Documentation for downloading and installing
CDD data for rpsblast.
HOW_TO_DO_CORE_DATABASES.TXT
Documentation for downloading data and installing
it on the compute nodes.
HOW_TO_DO_SECONDARY_STORAGE.TXT
Documentation for backup/restore to secondary node.
HOW_TO_DO_REFSEQ.TXT
Documentation for refseq (for historical purposes).
machines.LINUX_INTEL64
MPI configuration file, with comments (#) and data
lines like: node15:8
List of nodes available for parallel operation.
mailparallelblastresults.sh
mailparallelblastresults.pl
Called by web server.
main.ncbirc Configuration file for most blast programs.
Most likely the [NCBI] stanza is no longer needed
with the C++ toolbox. Save as .ncbirc in
database directory.
cdd.ncbirc Configuration file for rpsblast.
Used as: /usr/local/databases/cdd/.ncbirc
When needed, env variable NCBI is set in scripts
to this directory.
oid_db.sh Generate OID by TaxonID indices for a set
of NCBI databases.
parallel_dblist.txt EXAMPLE description of all databases, place in /usr/common/BLASTDB/
pfm_ssh_blastmaster_rev2.pl
Script to run BLAST jobs on remote nodes
where the split database pieces reside.
pfm_ssh_blastslave_rev2.sh
Run the blast job on this node, started remotely
by pfm_ssh_blastmaster_rev2.pl
pfm_ssh_rpsblastslave_rev2.sh
Called by pfm_ssh_blastmaster_rev2.pl
pfm_ssh_splitcddmaster.pl
Called by split_cdd.sh
pfm_ssh_splitcddslave.pl
Node script invoked by pfm_ssh_splitcddmaster.pl
pfm_ssh_splitdbmaster.pl
Called by split_db.sh.
pfm_ssh_splitdbslave.pl
Called by pfm_ssh_splitdbmaster.pl.
pfm_ssh_splitoidmaster.pl
Called by oid_db.sh
pfm_ssh_splitoidslave.pl
Node script invoked by pfm_ssh_splitoidmaster.pl
prep_taxon_oidl.c Makes OID databases from taxon information.
Run by pfm_ssh_splitoidslave.pl.
README.TXT This file
SAF_CHANGES_TEST.txt Instructions for testing modifications to blast programs.
Not exactly
secondary_storage Script to backup or restore the data in /local/databases
on a group of
secondary.txt Example configuration file for backup/restore.
setuser.c From w2h package, sets a run time user.
split_cdd.sh Split and distribute databases downloaded with fetch_cdd.sh.
split_db.sh Split and distribute databases downloaded with fetch_db.sh.
test_pfm_ssh.pl Test script, verify that ssh and Parallel::ForkManager
are working. Modify the list of nodes to match site.
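The BDB_OIDRANGES entry above gives Node1=1,1000 as an example of a per-node
range. One way to work out such ranges for an arbitrary database size is
sketched below. This is an illustration, not part of the distribution: the
function name oid_ranges is hypothetical, and it assumes OIDs are numbered
from 1 as in that example and that a range is written "first,last".

```shell
# Hypothetical helper, not part of parallelblast_plus.
# Split OIDs 1..total into contiguous "first,last" ranges, one per node.
# Usage: oid_ranges TOTAL_OIDS NUMBER_OF_NODES
oid_ranges() {
    total=$1
    nodes=$2
    per=$(( (total + nodes - 1) / nodes ))   # ceiling of total/nodes
    n=1
    first=1
    while [ "$n" -le "$nodes" ] && [ "$first" -le "$total" ]; do
        last=$(( first + per - 1 ))
        if [ "$last" -gt "$total" ]; then
            last=$total                      # final range may be shorter
        fi
        echo "node$n $first,$last"
        first=$(( last + 1 ))
        n=$(( n + 1 ))
    done
}

oid_ranges 2000 2
# prints:
# node1 1,1000
# node2 1001,2000
```

Each node could then export its own range in BDB_OIDRANGES before starting
the search. As noted above, blastp etc. also need "-taxids 1" for the BDB_*
variables to take effect. If there are more nodes than the ceiling division
can fill, the trailing nodes simply get no range.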
********************************************************************
Compiling C programs:
gcc -Wall -std=c99 -pedantic -DMAXINFILE=50 -o blastplusmerge blastplusmerge.c
gcc -Wall -std=c99 -pedantic -o fastafrag fastafrag.c
gcc -Wall -std=c99 -pedantic -o fastaproperties fastaproperties.c -lm
gcc -Wall -std=c99 -pedantic -o fastarange fastarange.c
gcc -Wall -std=c99 -pedantic -o fasta_range_calculate fasta_range_calculate.c
gcc -Wall -std=c99 -pedantic -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 \
-o prep_taxon_oidl prep_taxon_oidl.c
gcc -Wall -std=c99 -pedantic -o setuser setuser.c
********************************************************************
Compiling NCBI toolbox++ programs:
#define the next two before doing anything else!!!
export SOME_DIR= #where it will be unpacked, built
export TARGET_DIR= #where the binaries will be installed
#
cd $SOME_DIR
pversion=2.10.1
package=ncbi-toolbox-cpp
alt_pkg=ncbi-blast
wget ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/${pversion}/${alt_pkg}-${pversion}+-src.tar.gz
gunzip -c ${alt_pkg}-${pversion}+-src.tar.gz | tar -xf -
/bin/rm ${alt_pkg}-${pversion}+-src.tar.gz
cd ${alt_pkg}-${pversion}+-src/c++
#there are already .orig files in distro, so make backup .dist of those
#that will be patched
grep -- --- /usr/common/src/saf_utilities/parallelblast_plus-1.0.5/final-${pversion}.patch \
| extract -mt -dl '\t' -fmt '[1]' \
| extract -sc 5 \
| extract -mt -dl '.' -fmt 'cp [1,-2] [1,-2].dist' \
| execinput
patch -p0 </usr/common/src/saf_utilities/parallelblast_plus-1.0.5/final-${pversion}.patch
./configure --prefix=/usr/common 2>&1 | tee configure_2020_07_06.log
make -j 4 2>&1 | tee build.log
cat >/tmp/install_list.txt <<EOD
blastdb_aliastool
blastdbcheck
blastdbcmd
blastdb_convert
blastdbcp
blastdb_path
blast_formatter
blastn
blastp
blast_report
blastx
datatool
deltablast
dustmasker
makeblastdb
makembindex
makeprofiledb
psiblast
rpsblast
rpstblastn
segmasker
tblastn
tblastx
windowmasker
EOD
#as root
cd ReleaseMT/bin
extract -in /tmp/install_list.txt \
-fmt 'cp [1,] $TARGET_DIR' \
| execinput
********************************************************************
Running scripts:
Most of these assume that passwordless ssh has been configured from the
master to the compute nodes.
The web interface blastcontrol_rev5.pl goes in the web server's cgi-bin
(or equivalent) directory. The scripts it calls go somewhere on
the running process's PATH.
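A quick way to confirm the passwordless ssh assumption before running
anything else is sketched below (test_pfm_ssh.pl in this distribution is the
fuller test, as it also exercises Parallel::ForkManager). The function name
check_nodes and the node names in the usage comment are placeholders, not
names used by this package. BatchMode=yes makes ssh fail rather than prompt
for a password.

```shell
# Hypothetical helper, not part of parallelblast_plus.
# Try a no-op command on each node without allowing a password prompt.
# Prints one status line per node; returns nonzero if any node failed.
check_nodes() {
    failed=0
    for node in "$@"; do
        if ssh -o BatchMode=yes -o ConnectTimeout=5 "$node" true \
               </dev/null 2>/dev/null; then
            echo "$node ok"
        else
            echo "$node FAILED"
            failed=1
        fi
    done
    return "$failed"
}

# Example (node names are placeholders, substitute the site's own):
#   check_nodes node01 node02 node03
```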
********************************************************************
Revisions:
1.0.5 01/06/2021.
Add segmented query mode.
Add fasta_range_calculate program.
Add SAF_CHANGES_TEST.txt
Migrate PVM to Parallel::ForkManager with ssh.
1.0.4 07/17/2019. Fix minor bugs in blastcontrol_rev3.pl (invalid syntax
not handled, file deletion by time stamp broken.)
1.0.3 06/21/2019. Made blastcontrol_rev3.pl, which has "run again" links.
Modified blast patch so that BDB_TAXONLIST and BDB_OIDRANGES can
be specified together, selected entries are the intersection of
the two. Allows taxonid selection in a search of an intact database
which is being "virtually sliced" using BDB_OIDRANGES.
1.0.2 06/18/2019. blastplusmerge.c: fixed some problems resulting from
unexpected syntax in html formatted blast output and others (syntax
changes not caught when updating from blastmerge.c).
pvmblastmaster_rev2.pl: wrong filter used with translating programs,
some typos.
1.0.1 06/13/2019 first release