Name                             Modified    Size      Downloads / Week
README.TXT                       2021-01-26  16.7 kB
blast_progs_2.10.1_CIT.tar.gz    2021-01-07  172.2 MB
parallelblast_plus-1.0.5.tar.gz  2021-01-07  266.8 kB
blast_test_data.tar.gz           2021-01-07  57.3 MB
blast_progs_2.9.0_CIT.tar.gz     2019-06-21  156.7 MB
parallelblast_plus-1.0.3.tar.gz  2019-06-21  173.1 kB
parallelblast_plus-1.0.2.tar.gz  2019-06-18  141.6 kB
parallelblast_plus-1.0.1.tar.gz  2019-06-15  140.2 kB
parallelblast_2.0.9.tar.gz       2019-06-13  841.4 kB
Totals: 9 Items                              387.8 MB  0

Updated versions of the programs/methods described in:

  "Parallel BLAST on split databases" Bioinformatics
  2003 Sep 22;19(14):1865-6.
  https://doi.org/10.1093/bioinformatics/btg250

These now work with modern BLAST (C++ toolkit) versions.
In brief, these split a database across multiple compute
nodes, search each slice with blast in parallel, and then 
reassemble into a single output file at the end on the 
master node.

The blast programs were slightly modified so that they
can use a different form of taxon id restriction than the
NCBI now supports in V5 databases.  This method was described
in the above paper and works with either V4 or V5 databases.
It supports internal taxonomy nodes, like "rodentia", rather than
having to list the taxon id for every type of rodent. 
Example: to search all rodents other than mouse use "9989,-1758".
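
The include/exclude semantics can be illustrated with a small shell
sketch.  It is NOT the method used by the patched blast programs
(those read the index files built by prep_taxon_oidl); it only shows
how an ordered list such as "9989,-1758" first selects a subtree and
then removes part of it.  The function name select_taxa, the two
column "taxid parent" table, and the toy ids in the note below are
assumptions made for this example.

```shell
# select_taxa NODES_FILE LIST
# NODES_FILE: hypothetical table, one "taxid parent_taxid" pair per line.
# LIST: comma delimited taxon ids; a positive id adds its whole subtree,
#       a negative id removes the subtree, applied in the order given.
select_taxa() {
  awk -v list="$2" '
    NF >= 2 { parent[$1] = $2 }
    # returns 1 if node is anc itself or lies somewhere beneath it
    function under(node, anc,    guard) {
      guard = 0
      while (node != "" && guard++ < 1000) {
        if (node == anc) return 1
        if (!(node in parent) || parent[node] == node) return 0
        node = parent[node]
      }
      return 0
    }
    END {
      n = split(list, item, ",")
      for (t in parent) {
        keep = 0
        for (i = 1; i <= n; i++) {
          id = item[i] + 0
          if (id > 0 && under(t, id))  keep = 1   # include subtree
          if (id < 0 && under(t, -id)) keep = 0   # exclude subtree
        }
        if (keep) print t
      }
    }
  ' "$1" | sort -n
}
```

With a toy table in which 110 and 120 are children of 100, and 111 is
a child of 110, the list "100,-110" selects 100 and 120.  Reversing it
to "-110,100" selects 100, 110, 111 and 120, because the exclusion is
applied before the inclusion.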

Also included are scripts for downloading, splitting, formatting
databases, running blast queries against those databases, and backing
up and restoring data directories to other compute nodes.  Pretty
much every script has a section near the top which must be modified.
Values which really, really must be changed, as they point to
our site, are marked with the string "CHANGETHIS_".

Example command (this assumes the database has already been
split across the compute nodes, parallel_dblist.txt filled
in with the database information, passwordless ssh set up, etc.):

pfm_ssh_blastmaster_rev5.pl \
  -query test.pfa \
  -program blastp \
  -dbname=bac.aa \
  -e 1e-80 \
  -taxonid_cit 28211 \
  -b 20 \
  -v 20 \
  -text
  
This would search the bacterial amino acid database, restricting
the search to taxa under 28211 (Alphaproteobacteria), and return
the best 20 hits and their corresponding alignments in text format.
If -text were omitted the result would be html.
If -nomerge were added the raw files from each node would be left on disk.

This is parallelblast_plus release 1.0.5.

The final C toolkit release, 2.0.9, is also present in the sourceforge
project files directory.

Jan 06, 2021
David Mathog

********************************************************************
Nomenclature specific to this package:

OAT                    "Oid AccessionNumber Taxonid".  These triplets are
                       used to keep track of taxon information through
                       processing.  Some are extracted from NCBI blast
                       databases and others are constructed on the fly.
                      
********************************************************************
External Program Dependencies - these must be obtained from elsewhere:

ascp                   https://www.ncbi.nlm.nih.gov/books/NBK242625/
extract, accudate      From https://sourceforge.net/projects/drmtools/files/
blastp (etc.)          From blast_progs_2.9.0_CIT.tar.gz, should be in the
                          file section of the parallelblast_plus sourceforge 
                          project.
CGI                    Perl script module dependencies.
DirHandle                To see which is needed where: grep '^use' *pl
File::stat               Obtain from cpan
File::Temp
File::Which
Getopt::Long
MIME::Lite
Parallel::ForkManager
POSIX
subs::parallel
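
A quick way to see which of these Perl modules are already installed
is to ask perl to load each one.  This is only a convenience sketch;
check_perl_modules is a hypothetical helper name, not a script in this
package.

```shell
# check_perl_modules MODULE...
# Prints one "missing: NAME" line per module perl cannot load.
# Returns non-zero if anything is missing.
check_perl_modules() {
  missing=0
  for m in "$@"; do
    if ! perl -M"$m" -e 1 >/dev/null 2>&1; then
      echo "missing: $m"
      missing=1
    fi
  done
  return "$missing"
}

# Possible use with the list above:
#   check_perl_modules CGI DirHandle File::stat File::Temp File::Which \
#     Getopt::Long MIME::Lite Parallel::ForkManager POSIX subs::parallel
```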

********************************************************************
File Organization:

Cluster directories (/usr/common is exported by master, NFS mounted on compute nodes)
                        /usr/common/bin 
                          #programs and scripts    
                        /usr/common/BLASTDB (and subdirectories)
                          #used to download and unpack/repack sequence data
                          #Master node directory
                        /usr/common/ncbi  
                          # Nodes define env variable NCBI
                          # Directory must contain the .ncbirc file
                        /usr/common/etc
                          # configuration files
Compute Node directories (local to node)
                        /usr/local/databases
                          #used to hold split database chunks for that node
                        /scratch/secondary
                          #holds a copy of a different node's databases (backup)
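
The layout above can be checked before running anything else.  A
minimal sketch follows; check_layout and its optional ROOT prefix are
assumptions made for this example (the prefix only exists so the check
can be tried against a scratch tree), and the directory lists simply
repeat the paths given above.

```shell
# check_layout ROLE [ROOT]
# ROLE is "master" or "node".  ROOT is an optional path prefix.
check_layout() {
  role=$1
  root=${2:-}
  case $role in
    master) dirs="/usr/common/bin /usr/common/BLASTDB /usr/common/ncbi /usr/common/etc" ;;
    node)   dirs="/usr/common/bin /usr/common/ncbi /usr/common/etc /usr/local/databases /scratch/secondary" ;;
    *) echo "usage: check_layout master|node [root]" >&2; return 2 ;;
  esac
  status=0
  for d in $dirs; do
    if [ ! -d "$root$d" ]; then
      echo "missing directory: $root$d"
      status=1
    fi
  done
  # The ncbi directory must contain the .ncbirc file.
  if [ ! -f "$root/usr/common/ncbi/.ncbirc" ]; then
    echo "missing file: $root/usr/common/ncbi/.ncbirc"
    status=1
  fi
  return "$status"
}
```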

 
                        
********************************************************************
Files in this distribution are:


blastcontrol_rev5.pl   EXAMPLE web interface to run these programs.  MANY DEPENDENCIES,
                       some of which are included here:                     
  lookup_org_chr_position.html
  fastarange.c
  genericfailurehtml.pl
  mailparallelblastresults.sh
  mailparallelblastresults.pl
  setuser.c   (from w2h package)

blastcontrol_rev2.pl
blastcontrol_rev3.pl
blastcontrol_rev4.pl   Older versions of the preceding.

blastplusmerge.c       Merge output of database sliced queries.

blast_test_data.tar.gz Test data used in SAF_CHANGES_TEST.txt
                       Downloaded separately from Sourceforge.

cdd.ncbirc             .ncbirc file for the CDD directory

COPYING                GPL v3 license as text file.

fastafrag.c            Fragment fasta files.

fastaproperties.c      Count entries and summarize properties of a fasta file.

fastarange.c           Read a fasta file and emit a range of entries and/or
                       sequence positions.
                       
fasta_range_calculate.c
                       Calculate start/end offsets for "equal" sized 
                       chunks of a fasta file. Emits for each chunk: 
                       number, start offset, end offset, bytes.
                       Results may be used with QUERY_OFFSET_FIRST and  
                       QUERY_OFFSET_LAST to segment blast query input
                       without having to traverse the entire file up
                       to that point.

fetch_cdd.sh           Download CDD databases from the NCBI.

fetch_db_bac.pl        Download and repack the reference set of
                       Bacterial genomes from the NCBI.

fetch_db_bdb.sh        Download nr,nt, swissprot, pdbaa, pdbnt, refseq_protein,
                         and refseq_rna from the NCBI as blast databases, 
                         convert to fasta, extract OAT files, remove blast databases.
                         Conversion works on chunks of the database at a time.

fetch_db.sh            Download a few small databases from the NCBI

fetchtaxon.sh          Retrieve current taxonomy_names.dmp and
                       taxonomy_nodes.dmp from the NCBI.

final-2.10.1.patch
final-2.10.0.patch
final-2.9.0.patch
                       Patch for NCBI toolbox++ blast source code
                         ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.9.0+-src.tar.gz
                       As of 2.10.1 this adds support for:
                         BDB_OIDRANGES  series of OID ranges, comma separated.
                                        May be used to split work on compute nodes
                                        which all have a copy of a given database.
                                        Node1=1,1000; Node2=1001,2000, etc.
                                        BDB_TAXONID is not required.
                         BDB_TAXONID    root part of oid index filename, as prepared
                                        by prep_taxon_oidl. Example, it would be
                                        oid_bac.aa_lcl for the files
                                        oid_bac.aa_lcl.bin and oid_bac.aa_lcl.idx
                         BDB_TAXONLIST  Comma delimited list of taxonids, positive
                                        are included, negative are excluded. Added
                                        and removed in the order specified.
                                        BDB_TAXONID is required.
                         QUERY_OFFSET_FIRST
                         QUERY_OFFSET_LAST
                                        The byte offsets of the first and last bytes of the range to
                                        read from the query file.  This allows the input to be "sliced"
                                        on the fly to partition the work.  Offsets are in range
                                        {0,file_length-1}.  The '>' in the first header should
                                        be at QUERY_OFFSET_FIRST and the '\n' following the
                                        sequence at QUERY_OFFSET_LAST.
                                        The program fasta_range_calculate may be used to 
                                        calculate these values.
                                        Default is {1,LLONG_MAX} (external coordinates)
                        blastdbcmd works with these with no other command line parameters
                           required
                        blastp etc. need "-taxids 1" to enable control by BDB_* variables.  
                           When BDB_TAXONLIST is specified the value of -taxids 
                             on the command line is ignored.
                           When BDB_TAXONLIST is not supplied the -taxids
                             command line value(s) is(are) used.  However, these will 
                             be seen in order of decreasing value regardless of their
                             original order, which does not provide the same level of control
                             as does BDB_TAXONLIST.  
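
                       The BDB_OIDRANGES example (Node1=1,1000; Node2 1001,2000)
                       can be generated rather than typed.  A minimal sketch,
                       assuming OIDs are numbered from 1 as in that example and
                       that the total number of OIDs in the database is already
                       known; oid_range is a hypothetical helper, not part of
                       this package.

```shell
# oid_range NODE NNODES TOTAL
# Prints "first,last" (inclusive) for node NODE (1..NNODES) when TOTAL
# OIDs are divided as evenly as possible.
oid_range() {
  node=$1
  nnodes=$2
  total=$3
  base=$((total / nnodes))
  extra=$((total % nnodes))
  # the first $extra nodes each take one additional OID
  if [ "$node" -le "$extra" ]; then
    first=$(( (node - 1) * (base + 1) + 1 ))
    last=$(( first + base ))
  else
    first=$(( extra * (base + 1) + (node - 1 - extra) * base + 1 ))
    last=$(( first + base - 1 ))
  fi
  echo "$first,$last"
}

# Possible use on node 2 of 8 (query, database and total are placeholders):
#   BDB_OIDRANGES=$(oid_range 2 8 "$total") \
#     blastp -taxids 1 -query test.pfa -db bac.aa
```

                       For 2000 OIDs on two nodes this prints 1,1000 and
                       1001,2000, matching the example above.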

fix_oid_bac.pl         Reprocesses bac.aa.gz and bac.aa.oat.gz
                       to remove duplicate protein sequences,
                       retaining an OAT entry for each.
                       (Most are marked MULTISPECIES, but sometimes they are not.)

from_secondary         Script to restore a node's /usr/local/databases from
                       a copy on another node.

genericfailurehtml.pl  Failure handling script called by web site.

HOW_TO_DO_CDD.TXT      Documentation for downloading and installing
                       CDD data for rpsblast.

HOW_TO_DO_CORE_DATABASES.TXT
                       Documentation for downloading data and installing
                       it on the compute nodes.

HOW_TO_DO_SECONDARY_STORAGE.TXT
                       Documentation for backup/restore to secondary node.

HOW_TO_DO_REFSEQ.TXT
                       Documentation for refseq (for historical purposes).
                       
machines.LINUX_INTEL64
                       MPI configuration file, with comments (#) and data
                       lines like:   node15:8
                       List of nodes available for parallel operation.

mailparallelblastresults.sh
mailparallelblastresults.pl
                       Called by web server.
                       
main.ncbirc            Configuration file for most blast programs.
                       Most likely the [NCBI] stanza is no longer needed
                       with the C++ toolbox.  Save as .ncbirc in
                       database directory.

cdd.ncbirc             Configuration file for rpsblast.
                       Used as: /usr/local/databases/cdd/.ncbirc
                       When needed, the env variable NCBI is set in scripts
                       to this directory.

oid_db.sh              Generate OID by TaxonID indices for a set
                       of NCBI databases.

parallel_dblist.txt    EXAMPLE description of all databases, place in /usr/common/BLASTDB/

pfm_ssh_blastmaster_rev2.pl
                       Script to run BLAST jobs on remote nodes
                       where the split database pieces reside.

pfm_ssh_blastslave_rev2.sh
                       Run the blast job on this node, started remotely
                       by pfm_ssh_blastmaster_rev2.pl

pfm_ssh_rpsblastslave_rev2.sh
                       Called by pfm_ssh_blastmaster_rev2.pl 

pfm_ssh_splitcddmaster.pl   
                       Called by split_cdd.sh

pfm_ssh_splitcddslave.pl
                       Node script invoked by pfm_ssh_splitcddmaster.pl

pfm_ssh_splitdbmaster.pl    
                       Called by split_db.sh.

pfm_ssh_splitdbslave.pl     
                       Called by pfm_ssh_splitdbmaster.pl.

pfm_ssh_splitoidmaster.pl
                       Called by oid_db.sh

pfm_ssh_splitoidslave.pl
                       Node script invoked by pfm_ssh_splitoidmaster.pl

prep_taxon_oidl.c      Makes OID databases from taxon information.
                       Run by pfm_ssh_splitoidslave.pl.

README.TXT             This file

SAF_CHANGES_TEST.txt   Instructions for testing modifications to blast programs.
                       Not exactly

secondary_storage      Script to backup or restore the data in /usr/local/databases
                       on a group of compute nodes.

secondary.txt          Example configuration file for backup/restore.

setuser.c              From w2h package, sets a run time user.

split_cdd.sh           Split and distribute databases downloaded with fetch_cdd.sh.

split_db.sh            Split and distribute databases downloaded with fetch_db.sh.

test_pfm_ssh.pl        Test script, verify that ssh and Parallel::ForkManager
                       are working.  Modify the list of nodes to match site.
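
As an illustration of the numbers fasta_range_calculate (listed above)
reports for segmented query mode, the following awk sketch splits a
fasta file into roughly equal chunks at record boundaries.  It is not
the C program's algorithm, only an approximation of the idea.  It
assumes Unix line endings, a final newline, and 0-based offsets as in
the range {0,file_length-1} given for QUERY_OFFSET_FIRST and
QUERY_OFFSET_LAST; compare against the real program before relying on
it.  fasta_chunks is a hypothetical name.

```shell
# fasta_chunks FILE N
# Emits for each chunk: number, start offset, end offset, bytes.
# Start is the offset of the '>' of the chunk's first header, end is the
# offset of the newline after the chunk's last sequence line.
fasta_chunks() {
  LC_ALL=C awk -v n="$2" '
    /^>/ { start[++nrec] = pos }     # byte offset of each header
    { pos += length($0) + 1 }        # +1 for the newline
    END {
      if (nrec == 0) exit 1
      start[nrec + 1] = pos          # one past the last byte
      chunk = 0; first = 1
      for (r = 1; r <= nrec; r++) {
        # close a chunk once it reaches its share of the bytes,
        # and always at the last record
        if (start[r + 1] >= (chunk + 1) * pos / n || r == nrec) {
          chunk++
          printf "%d %d %d %d\n", chunk, start[first], \
                 start[r + 1] - 1, start[r + 1] - start[first]
          first = r + 1
        }
      }
    }
  ' "$1"
}
```

Fewer than N chunks are printed when the records are too large to
divide that finely.  The second and third columns are the values that
would be placed in QUERY_OFFSET_FIRST and QUERY_OFFSET_LAST.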

********************************************************************
Compiling C programs:

gcc -Wall -std=c99 -pedantic -DMAXINFILE=50  -o blastplusmerge blastplusmerge.c
gcc -Wall -std=c99 -pedantic -o fastafrag fastafrag.c
gcc -Wall -std=c99 -pedantic -o fastaproperties fastaproperties.c -lm
gcc -Wall -std=c99 -pedantic -o fastarange fastarange.c
gcc -Wall -std=c99 -pedantic -o fasta_range_calculate fasta_range_calculate.c
gcc -Wall -std=c99 -pedantic -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 \
  -o prep_taxon_oidl prep_taxon_oidl.c
gcc -Wall -std=c99 -pedantic -o setuser setuser.c

********************************************************************
Compiling NCBI toolbox++ programs:

#define the next two before doing anything else!!!
export SOME_DIR=   #where it will be unpacked, built
export TARGET_DIR= #where the binaries will be installed
#
cd $SOME_DIR
pversion=2.10.1
package=ncbi-toolbox-cpp
alt_pkg=ncbi-blast
wget ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/${pversion}/${alt_pkg}-${pversion}+-src.tar.gz
gunzip -c ${alt_pkg}-${pversion}+-src.tar.gz | tar -xf -
/bin/rm ${alt_pkg}-${pversion}+-src.tar.gz
cd ${alt_pkg}-${pversion}+-src/c++
#there are already .orig files in distro, so make backup .dist of those
#that will be patched
grep -- --- /usr/common/src/saf_utilities/parallelblast_plus-1.0.5/final-${pversion}.patch \
  | extract -mt -dl '\t' -fmt '[1]' \
  | extract -sc 5 \
  | extract -mt -dl '.' -fmt 'cp [1,-2] [1,-2].dist' \
  | execinput
#apply the patch that matches $pversion
patch -p0 </usr/common/src/saf_utilities/parallelblast_plus-1.0.5/final-${pversion}.patch
./configure --prefix=/usr/common 2>&1 | tee configure_2020_07_06.log
make -j 4 2>&1 | tee build.log
cat >/tmp/install_list.txt <<EOD
blastdb_aliastool
blastdbcheck
blastdbcmd
blastdb_convert
blastdbcp
blastdb_path
blast_formatter
blastn
blastp
blast_report
blastx
datatool
deltablast
dustmasker
makeblastdb
makembindex
makeprofiledb
psiblast
rpsblast
rpstblastn
segmasker
tblastn
tblastx
windowmasker
EOD
#as root
cd ReleaseMT/bin
extract -in /tmp/install_list.txt \
   -fmt 'cp [1,] $TARGET_DIR' \
   | execinput

********************************************************************
Running scripts:

Most of these assume that passwordless ssh has been configured from the
master to the compute nodes.
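
The package's own test_pfm_ssh.pl verifies ssh together with
Parallel::ForkManager.  For a quicker shell-only check, something like
the following sketch may help.  It reads node names from a file in the
machines.LINUX_INTEL64 format (comments and lines like node15:8) and
tries a non-interactive login to each.  list_nodes, check_ssh and the
SSH override variable are assumptions made for this example.

```shell
# list_nodes FILE
# Print node names from lines like "node15:8", skipping comments/blanks.
list_nodes() {
  sed -e 's/#.*//' -e 's/[[:space:]]//g' -e '/^$/d' -e 's/:.*//' "$1"
}

# check_ssh FILE
# BatchMode=yes makes ssh fail instead of prompting for a password.
# Set SSH to another command to dry-run the loop (hypothetical hook).
check_ssh() {
  status=0
  for node in $(list_nodes "$1"); do
    if ${SSH:-ssh} -o BatchMode=yes -o ConnectTimeout=5 "$node" true \
         </dev/null >/dev/null 2>&1; then
      echo "ok   $node"
    else
      echo "FAIL $node"
      status=1
    fi
  done
  return "$status"
}

# Possible use on the master:
#   check_ssh /usr/common/etc/machines.LINUX_INTEL64   # path is a guess
```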

The web interface blastcontrol_rev5.pl goes in the web server's cgi-bin
(or equivalent) directory.  The scripts it calls go somewhere on
the running process's PATH.

********************************************************************
Revisions:

1.0.5  01/06/2021.
         Add segmented query mode.
         Add fasta_range_calculate program.
         Add SAF_CHANGES_TEST.txt
         Migrate PVM to Parallel::ForkManager with ssh.
1.0.4  07/17/2019.  Fix minor bugs in blastcontrol_rev3.pl (invalid syntax
         not handled, file deletion by time stamp broken.)
1.0.3  06/21/2019.  Made blastcontrol_rev3.pl, which has "run again" links.
         Modified blast patch so that BDB_TAXONLIST and BDB_OIDRANGES can
         be specified together, selected entries are the intersection of
         the two.  Allows taxonid selection in a search of an intact database
         which is being "virtually sliced" using BDB_OIDRANGES.
1.0.2  06/18/2019.  blastplusmerge.c: fixed some problems resulting from
         unexpected syntax in html formatted blast output and others (syntax
         changes not caught when updating from blastmerge.c).
         pvmblastmaster_rev2.pl: wrong filter used with translating programs,
            some typos.
1.0.1  06/13/2019.  First release.

Source: README.TXT, updated 2021-01-26