USAGE:
  parasim.pl [options] -q query.txt[.gz] [-r reference.txt[.gz]]

OPTIONS:
  -min #min_similarity
      The minimum similarity (0.0 = dissimilarity, 1.0 = identity).
      This has impact on the performance.
      Default: 0.00
  -max #max_similarity
      The maximum similarity (0.0 = dissimilarity, 1.0 = identity).
      This has impact on the performance.
      Default: 1.00
  -n/k #num_similars
      The number of hits to keep (k nearest neighbors).
      Default: 1
  -c similarity_coeff
      The similarity coefficient to use.
      Allowed values:
        'tan'  : Tanimoto/Jaccard similarity coefficient
        'dice' : Dice similarity coefficient
      Default: 'tan'
  -v
      Verbose. Print detailed status and progress information.
  -q query.txt[.gz]
      The file containing the query fingerprints.
      Wildcards are expanded but have to be quoted.
  -r reference.txt[.gz]
      The file containing the reference fingerprints.
      Wildcards are expanded but have to be quoted.
      Use 'mem:#key' to identify a persistent memory object which was
      created with fp2mem-persist.pl before.
      Default: 'mem:0'
  -h/help
      Show this help.

ADVANCED OPTIONS:
  -t #threads
      The number of threads to be used in parallel.
      Default: Number of available cores on host
  -b binary_class
      The class used to represent the fingerprint.
      This has impact on the performance.
      Allowed values:
        'int'  : Integer representation of fingerprint bitset
        'char' : Character representation of fingerprint bitset
      Default: 'int' for fingerprints being a multiple of 32,
               'char' for fingerprints being a multiple of 8.
  -u on/off
      Switch on/off loop-unrolling.
      This has impact on the performance.
      Default: 'on' for 32 x sizeof(int) bit fingerprints,
               'on' for 64 x sizeof(char) bit fingerprints,
               'off' for all other fingerprint lengths.
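To make the options -c, -n/k, -min and -max concrete, the following Python sketch computes the Tanimoto and Dice coefficients on small bitsets and keeps the k best hits inside a similarity window. It only illustrates the standard definitions and is not ParaSim's own code: the function names, the use of Python integers as fingerprints, the inclusive threshold bounds and the value 0.0 for two empty fingerprints are all assumptions.

```python
import heapq


def popcount(x: int) -> int:
    """Number of on-bits in a fingerprint stored as a Python integer."""
    return bin(x).count("1")


def tanimoto(fp_a: int, fp_b: int) -> float:
    """Tanimoto/Jaccard: common on-bits divided by on-bits set in either."""
    union = popcount(fp_a | fp_b)
    # Returning 0.0 for two empty fingerprints is an assumed convention.
    return popcount(fp_a & fp_b) / union if union else 0.0


def dice(fp_a: int, fp_b: int) -> float:
    """Dice: twice the common on-bits divided by the sum of both counts."""
    total = popcount(fp_a) + popcount(fp_b)
    return 2 * popcount(fp_a & fp_b) / total if total else 0.0


def top_hits(query, references, k=1, coeff=tanimoto, min_sim=0.0, max_sim=1.0):
    """Return up to k (similarity, reference index) pairs, best first.

    The parameters mirror -n/k, -c, -min and -max; treating both
    thresholds as inclusive is an assumption of this sketch.
    """
    scored = ((coeff(query, ref), idx) for idx, ref in enumerate(references))
    kept = [(sim, idx) for sim, idx in scored if min_sim <= sim <= max_sim]
    return heapq.nlargest(k, kept)
```

For example, `top_hits(0b1101, [0b1101, 0b0011, 0b1011], k=2)` ranks the identical reference first and the reference sharing two of four bits second.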
Besides the standard options, which control the basic features of ParaSim, there is a set of advanced options for experienced users that control the technical behaviour of the software.
By default, ParaSim uses all available CPU cores for parallel calculations and automatically reduces the number of threads to the number of query fingerprints if necessary. If ParaSim should use only a smaller, limited number of cores, set this number manually with option -t.
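The default thread selection described above can be sketched in a few lines of Python. This is an illustration only: the function name and parameters are hypothetical, and capping a manually requested -t value at the number of queries is an assumption, since the text states the reduction only for the default case.

```python
import os


def effective_threads(num_queries, requested=None, available_cores=None):
    """Pick a thread count: cores by default, never more than the queries.

    requested       -- value given with -t, or None for the default
    available_cores -- override for testing; defaults to os.cpu_count()
    """
    if available_cores is None:
        available_cores = os.cpu_count() or 1  # cpu_count() may return None
    threads = requested if requested is not None else available_cores
    # More threads than query fingerprints would leave threads idle.
    return max(1, min(threads, num_queries))
```

With 8 cores and 3 query fingerprints this yields 3 threads, and with 1000 queries and `requested=2` it yields 2.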
ParaSim implements several different methods for the most time-consuming calculation, counting the on-bits in a fingerprint (the so-called bitcount or popcount). By default, it determines the best applicable method based on the length of the fingerprint. For test or research purposes, however, the user can control the calculation method completely:
1. How the fingerprint is interpreted internally (option -b): 'char' (character) or 'int' (integer), with a speed advantage for 'int'.
2. Loop-unrolling (option -u): 'on' or 'off'. For particular fingerprint lengths (currently 32 x sizeof(int) and 64 x sizeof(char), where sizeof here denotes the width in bits: 32 for int on most systems and usually 8 for char, i.e. 1024-bit and 512-bit fingerprints) a special internal algorithm is available that is intended to give an additional performance gain. If not set manually, it is used automatically where applicable.
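The difference between the 'char' and the 'int' view can be illustrated with a small Python sketch that counts the on-bits of the same fingerprint once in 8-bit units and once in 32-bit units. The function names, the lookup table and the little-endian unpacking are assumptions made for illustration; ParaSim's routines are written in C, and Python timings say nothing about the speed advantage mentioned above. The sketch only shows that both views give the same count and that the 'int' view requires a length that is a multiple of 32 bits.

```python
import struct

# Lookup table: number of on-bits for every possible 8-bit value.
BYTE_POPCOUNT = [bin(i).count("1") for i in range(256)]


def popcount_char(fp: bytes) -> int:
    """Count on-bits one 8-bit unit at a time ('char' view)."""
    return sum(BYTE_POPCOUNT[b] for b in fp)


def popcount_int(fp: bytes) -> int:
    """Count on-bits one 32-bit unit at a time ('int' view)."""
    if len(fp) % 4:
        raise ValueError("fingerprint length must be a multiple of 32 bits")
    # Reinterpret the bytes as unsigned 32-bit words (4 bytes each).
    words = struct.unpack(f"<{len(fp) // 4}I", fp)
    return sum(bin(w).count("1") for w in words)
```

A 1024-bit fingerprint (128 bytes) can be counted either way, whereas a 40-bit fingerprint (5 bytes) can only use `popcount_char`.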
ParaSim comes with a central configuration file, parasim-config.txt, which consolidates the different default values and makes them easy to modify. In particular, the paths to preinstalled third-party software packages for calculating fingerprints from chemical structure files are defined here. Use a text editor of your choice to edit the file and change default values if required. Comments within the file explain the meaning of each default value.
The maximum number of allowed parallel threads (set to 256) is the only default value that can be changed only in the C source code section of the ParaSim Perl script. This parameter limits the memory used for thread function parameters and is mainly of technical relevance. The number of threads actually used is set by option -t and must be less than or equal to this value (checked at runtime). If 256 is not sufficient, replace the value in the source code line #define MAX_THREADS 256 with the one you require.
Several factors directly influence calculation performance, some of them significantly.
1. The number of cores: Parallelisation has the strongest impact on performance (option -t, see Advanced Options).
2. The fingerprint binary class: Where applicable, fingerprints should be interpreted as integers, which is faster (option -b, see Advanced Options).
3. The fingerprint length: Depending on the length of the fingerprint, faster or slower calculation routines are called. A fingerprint length that is a multiple of 32 is advisable because such fingerprints can be interpreted as integers. If the length also meets the requirements for loop-unrolling (option -u, see Advanced Options), this gives additional speed. The current version of ParaSim contains algorithms optimized for a fingerprint length of 512 or, even better, 1024.
4. Thresholds: Similarity thresholds strongly influence computation speed because they allow reference compounds to be purged before any similarity is calculated. The narrower the thresholds, the faster the calculation. For finding nearest neighbors, a minimum similarity of about 0.3-0.5 is usually sufficient, which already saves about a third to half of the computation time.
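Why a minimum similarity allows references to be discarded before any similarity is calculated can be shown with a well-known bound for the Tanimoto coefficient: for on-bit counts a and b, the similarity can never exceed min(a, b) / max(a, b). The Python sketch below uses this bound. The text does not say how ParaSim purges references, so the bound, the function names and the returned counter are assumptions for illustration, not a description of ParaSim's implementation.

```python
def popcount(x: int) -> int:
    """Number of on-bits in a fingerprint stored as a Python integer."""
    return bin(x).count("1")


def tanimoto_upper_bound(a_bits: int, b_bits: int) -> float:
    """Largest Tanimoto value two fingerprints with these counts can reach."""
    larger = max(a_bits, b_bits)
    return min(a_bits, b_bits) / larger if larger else 0.0


def search_with_pruning(query: int, references, min_sim: float):
    """Return (hits, computed): hits as (index, similarity) pairs and the
    number of references for which the full similarity was calculated."""
    q_bits = popcount(query)
    hits, computed = [], 0
    for idx, ref in enumerate(references):
        r_bits = popcount(ref)  # in practice this count would be precomputed
        if tanimoto_upper_bound(q_bits, r_bits) < min_sim:
            continue  # cannot reach the threshold: skip the AND and its popcount
        computed += 1
        common = popcount(query & ref)
        union = q_bits + r_bits - common
        sim = common / union if union else 0.0
        if sim >= min_sim:
            hits.append((idx, sim))
    return hits, computed
```

With a 4-bit query and a threshold of 0.5, references with fewer than 2 or more than 8 on-bits are skipped without computing the intersection, which is where the saving comes from.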