
3. Using ParaSim

cherhaus

How to use ParaSim

Synopsis

USAGE: parasim.pl [options] -q query.txt[.gz] [-r reference.txt[.gz]]

OPTIONS: -min #min_similarity    The minimum similarity (0.0 = dissimilarity, 1.0 = identity).
                                 This has impact on the performance.
                                 Default: 0.00
         -max #max_similarity    The maximum similarity (0.0 = dissimilarity, 1.0 = identity).
                                 This has impact on the performance.
                                 Default: 1.00
         -n/k #num_similars      The number of hits to keep (k nearest neighbors).
                                 Default: 1
         -c similarity_coeff     The similarity coefficient to use. Allowed values:
                                 'tan' : Tanimoto/Jaccard similarity coefficient
                                 'dice' : Dice similarity coefficient
                                 Default: 'tan'
         -v                      Verbose. Print detailed status and progress information.
         -q query.txt[.gz]       The file containing the query fingerprints.
                                 Wildcards are expanded but have to be quoted.
         -r reference.txt[.gz]   The file containing the reference fingerprints.
                                 Wildcards are expanded but have to be quoted.
                                 Use 'mem:#key' to identify a persistent memory object
                                 which was created with fp2mem-persist.pl before.
                                 Default: 'mem:0'
         -h/help                 Show this help.

ADVANCED OPTIONS:
         -t #threads             The number of threads to be used in parallel.
                                 Default: Number of available cores on host
         -b binary_class         The class used to represent the fingerprint.
                                 This has impact on the performance. Allowed values:
                                 'int' : Integer representation of fingerprint bitset
                                 'char' : Character representation of fingerprint bitset
                                 Default: 'int' for fingerprints being a multiple of 32,
                                          'char' for fingerprints being a multiple of 8.
         -u on/off               Switch on/off loop-unrolling. This has impact on the performance.
                                 Default: 'on' for 32 x sizeof(int) bit fingerprints,
                                          'on' for 64 x sizeof(char) bit fingerprints,
                                          'off' for all other fingerprint lengths.
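
The two coefficients selectable with option -c can be sketched in C as follows. This is a minimal illustration, not ParaSim's own code: the function names tanimoto, dice and bits_set are hypothetical, fingerprints are assumed to be stored as arrays of 32-bit words, and returning 0.0 for two empty fingerprints is an assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Count the on-bits of one 32-bit word by clearing the lowest set bit
   until nothing is left. */
static unsigned bits_set(uint32_t x)
{
    unsigned n = 0;
    while (x) {
        x &= x - 1;
        ++n;
    }
    return n;
}

/* Tanimoto/Jaccard: |A & B| / |A | B|, where |A | B| = |A| + |B| - |A & B|.
   Assumption: two empty fingerprints score 0.0. */
double tanimoto(const uint32_t *fa, const uint32_t *fb, size_t n_words)
{
    unsigned a = 0, b = 0, c = 0;
    for (size_t i = 0; i < n_words; ++i) {
        a += bits_set(fa[i]);
        b += bits_set(fb[i]);
        c += bits_set(fa[i] & fb[i]);
    }
    return (a + b - c) ? (double)c / (a + b - c) : 0.0;
}

/* Dice: 2 |A & B| / (|A| + |B|). Same assumption for empty input. */
double dice(const uint32_t *fa, const uint32_t *fb, size_t n_words)
{
    unsigned a = 0, b = 0, c = 0;
    for (size_t i = 0; i < n_words; ++i) {
        a += bits_set(fa[i]);
        b += bits_set(fb[i]);
        c += bits_set(fa[i] & fb[i]);
    }
    return (a + b) ? 2.0 * c / (a + b) : 0.0;
}
```

For the one-word fingerprints 0x0F and 0x3C (four on-bits each, two of them shared), tanimoto yields 2/6 and dice yields 4/8.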


Advanced Options

Besides the standard options, which control the basic features of ParaSim, there is a set of advanced options for experienced users which control the technical behaviour of the software.

By default, ParaSim uses all available CPU cores for parallel calculations and automatically reduces the number of threads to the number of query fingerprints if necessary. If ParaSim should use fewer cores, the number can be set manually with option -t.

ParaSim implements several variants of its most time-consuming calculation, the count of on-bits in a fingerprint (the so-called bitcount or popcount). By default, it selects the best applicable method based on the length of the fingerprint. For test or research purposes, however, the user can control the calculation method completely:

1. How the fingerprint is interpreted internally (option -b): char (character) or int (integer), with a speed advantage for 'int'.

2. Loop-unrolling (option -u): on or off. For particular fingerprint lengths (currently 32 x sizeof(int) and 64 x sizeof(char); with an int being 32 bits wide on most systems and a char usually 8 bits, this means 1024-bit and 512-bit fingerprints) a special internal algorithm is available which is expected to give an additional gain in performance. If not set manually, it is used automatically where applicable.
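
The difference between the two binary classes and the unrolled variant can be sketched as follows. This is an illustration under stated assumptions, not ParaSim's implementation: the function names are hypothetical, the per-word count uses a simple portable loop, and the unrolling factor of four is chosen only for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Count the on-bits of a value of up to 32 bits. */
static unsigned bits_set(uint32_t x)
{
    unsigned n = 0;
    while (x) {
        x &= x - 1;   /* clears the lowest set bit */
        ++n;
    }
    return n;
}

/* 'char' view: walk the fingerprint one byte at a time.
   Works for any length that is a multiple of 8 bits. */
unsigned bitcount_char(const unsigned char *fp, size_t n_bytes)
{
    unsigned n = 0;
    for (size_t i = 0; i < n_bytes; ++i)
        n += bits_set(fp[i]);
    return n;
}

/* 'int' view: walk the fingerprint one 32-bit word at a time.
   Needs a length that is a multiple of 32 bits, but takes a quarter
   of the loop iterations of the byte-wise version. */
unsigned bitcount_int(const uint32_t *fp, size_t n_words)
{
    unsigned n = 0;
    for (size_t i = 0; i < n_words; ++i)
        n += bits_set(fp[i]);
    return n;
}

/* Unrolled 'int' variant for exactly 1024-bit fingerprints (32 words).
   The fixed length lets the loop handle four words per iteration
   (unrolling factor is an assumption made for this sketch). */
unsigned bitcount_int_1024(const uint32_t *fp)
{
    unsigned n = 0;
    for (size_t i = 0; i < 32; i += 4) {
        n += bits_set(fp[i])     + bits_set(fp[i + 1])
           + bits_set(fp[i + 2]) + bits_set(fp[i + 3]);
    }
    return n;
}
```

All three functions return the same count for the same 1024-bit fingerprint; they differ only in how many loop iterations and memory accesses are needed to get there.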


User Defaults

ParaSim comes with a central configuration file, parasim-config.txt, which consolidates the different default values and makes it easy to modify them. In particular, the paths to preinstalled third-party software packages for calculating fingerprints from chemical structure files are defined here. Use a text editor of your choice to edit the file and change default values if required. Comments within the file explain what each default value means.

The maximum number of allowed parallel threads (set to 256) is the only default value that has to be modified in the C source code section of the ParaSim Perl script. This parameter limits the memory used for thread function parameters and is mainly of technical relevance. The number of threads actually used is defined by option -t and must be equal to or lower than this value (checked at runtime). If 256 is not sufficient, replace the value in the source code directive #define MAX_THREADS 256 with the one you require.
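
How the thread count described in this document could be derived is sketched below. Only MAX_THREADS and its value of 256 come from the text; the function effective_threads, its parameters, and the choice to report an error with -1 are assumptions made for this example, since the text only says that the limit is checked at runtime.

```c
#define MAX_THREADS 256   /* compile-time cap named in the text */

/* Returns the number of threads to start, or -1 if the limit is exceeded.
   requested == 0 stands for "option -t not given" (assumption). */
int effective_threads(int requested, int n_cores, int n_queries)
{
    /* Default: one thread per available core. */
    int t = requested > 0 ? requested : n_cores;

    /* The limit is checked at runtime; the caller reports the error. */
    if (t > MAX_THREADS)
        return -1;

    /* Never start more threads than there are query fingerprints. */
    if (t > n_queries)
        t = n_queries;

    return t;
}
```

With 8 cores and 3 query fingerprints and no -t given, this sketch yields 3 threads; a request for 300 threads is rejected until MAX_THREADS is raised in the source.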


Factors influencing Calculation Performance

Several factors have a direct influence on the calculation performance. In some cases the effect is significant.

1. The number of cores: Parallelisation has the strongest impact on performance (option -t, see Advanced Options).

2. The fingerprint binary class: Where applicable, fingerprints should be interpreted as integers, which is faster (option -b, see Advanced Options).

3. The fingerprint length: Depending on the length of the fingerprint, faster or slower calculation routines are called. A fingerprint length that is a multiple of 32 is advisable because the fingerprint can then be interpreted as integers. Moreover, if the fingerprint length fulfils the requirements for loop-unrolling (option -u, see Advanced Options), this gives additional speed. The current version of ParaSim contains algorithms optimized for a fingerprint length of 512 or, even better, 1024 bits.

4. Thresholds: Similarity thresholds have a strong influence on the computation speed because they allow reference compounds to be purged prior to the similarity calculations. The narrower the thresholds are set, the faster the calculations are performed. For finding nearest neighbors, a minimum similarity of about 0.3-0.5 is usually sufficient, which already saves about a third to half of the computation time.
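
Why a minimum similarity allows references to be discarded before any fingerprint comparison can be illustrated with a well-known bound for the Tanimoto coefficient: two fingerprints with a and b on-bits share at most min(a, b) bits and their union has at least max(a, b) bits, so their similarity cannot exceed min(a, b) / max(a, b). The text does not state which criterion ParaSim itself applies; the sketch below, with hypothetical function names, only shows this general technique.

```c
#include <stdbool.h>

/* Upper bound on the Tanimoto similarity of two fingerprints that
   have a and b on-bits: min(a, b) / max(a, b). */
double tanimoto_upper_bound(unsigned a, unsigned b)
{
    unsigned lo = a < b ? a : b;
    unsigned hi = a < b ? b : a;
    return hi ? (double)lo / hi : 0.0;
}

/* A reference can be skipped without comparing any bits if even the
   best possible similarity stays below the minimum threshold. */
bool can_skip_reference(unsigned query_bits, unsigned ref_bits,
                        double min_similarity)
{
    return tanimoto_upper_bound(query_bits, ref_bits) < min_similarity;
}
```

With a query of 100 on-bits and a minimum similarity of 0.5, a reference with 20 on-bits can be skipped (bound 0.2), whereas a reference with 60 on-bits (bound 0.6) still has to be compared.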


Related

Wiki: Documentation
