ParaSim Wiki

Parallelized calculation of molecular similarities

Status: Beta

Brought to you by: herhaus

Introduction

Diversity assessments and structural comparisons of large compound databases require calculating similarities between millions of compounds in a reasonable time. The ParaSim programme addresses this challenge by adapting similarity calculations to high-performance computing environments.

ParaSim parallelizes the calculations across the computing cores available on a single machine. The programme is optimized for high-throughput comparison of very large numbers of query structures against very large numbers of reference structures. For that reason, the entire reference dataset is loaded into memory before the calculations start, so its size is limited only by the available memory. As a special feature, reference datasets that are queried repeatedly can be kept in memory as persistent memory objects so that they are immediately available.
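
The scheme described above can be illustrated with a minimal Python sketch. It is not ParaSim's implementation: the function names, the toy fingerprints and the use of a thread pool are assumptions made for illustration only, and Python threads will not deliver the speed-up of native threads. The sketch shows only the structure: the reference set is held in memory once, and the queries are divided among one worker per available core.

```python
import os
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory reference set: ID -> fingerprint stored as an int bitset.
REFERENCES = {"ref1": 0b1111_0000, "ref2": 0b1010_1010, "ref3": 0b0000_1111}

def tanimoto(a, b):
    """Tanimoto index of two bitsets stored as Python ints."""
    union = bin(a | b).count("1")
    return bin(a & b).count("1") / union if union else 0.0

def nearest(query):
    """Return (query ID, best reference ID, similarity) for one query."""
    qid, qfp = query
    best = max(REFERENCES, key=lambda rid: tanimoto(qfp, REFERENCES[rid]))
    return qid, best, tanimoto(qfp, REFERENCES[best])

def search(queries, workers=None):
    """Distribute the queries over a pool sized to the number of cores."""
    workers = workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(nearest, queries))

results = search([("q1", 0b1111_0001), ("q2", 0b1010_1000)])
```

Every worker reads the same shared reference dictionary, which is why the reference data needs to be loaded only once regardless of the number of cores.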

ParaSim calculates similarities based on binary structural fingerprints. A fingerprint is a set of bits, each either "on" (1) or "off" (0), indicating whether a given structural feature is present; it can be stored as a binary object.
ParaSim does not compute fingerprints itself but relies on third-party software to do so. In principle, any type of structural fingerprint that can be stored in a bitset can be used by ParaSim. Examples of usable fingerprints are provided by the open-source chemoinformatics toolkits RDKit (http://www.rdkit.org), CDK (http://cdk.sourceforge.net/) and OpenBabel (http://www.openbabel.org), as well as by the commercial chemoinformatics software packages Pipeline Pilot™ from Accelrys® (http://www.accelrys.com/products/pipeline-pilot/) and the Digital Chemistry® toolkit (http://www.digitalchemistry.co.uk). Because ParaSim calculates similarities on binary representations, fingerprint lengths must be a multiple of 8 (the character size is 8 bits) and should ideally be a multiple of 32 (the integer size is 32 bits on most systems, including 64-bit machines).
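
The length rules can be made concrete with a short Python sketch. The helper names check_length and pack_bits are hypothetical, and the bit order within each byte is an assumption; the sketch only illustrates how a fingerprint given as a list of set-bit positions maps onto whole bytes, not how ParaSim stores fingerprints internally.

```python
def check_length(n_bits):
    """Classify a fingerprint length against the two length rules."""
    if n_bits % 8 != 0:
        raise ValueError(f"{n_bits} bits is not a multiple of 8")
    return "optimal" if n_bits % 32 == 0 else "accepted"

def pack_bits(on_bits, n_bits):
    """Pack the positions of the set bits into n_bits // 8 bytes."""
    check_length(n_bits)
    buf = bytearray(n_bits // 8)
    for i in on_bits:
        buf[i // 8] |= 1 << (i % 8)  # bit order within a byte is an assumption
    return bytes(buf)

fp = pack_bits([0, 9, 63], 64)  # 64 bits -> 8 bytes, i.e. two 32-bit words
```

A 40-bit fingerprint would be classified as "accepted" but not "optimal", because it fills five bytes and therefore does not end on a 32-bit word boundary.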

What ParaSim does

ParaSim calculates the well-known Tanimoto (or Jaccard) and Dice similarity indexes from fingerprint query and reference input files. Complete dissimilarity is represented by a similarity index of 0.0, identity by 1.0. The user can define how many reference molecules (nearest neighbours) are identified per query molecule. By default, one hit molecule, the nearest neighbour, is captured. In addition, thresholds can be defined for the minimum and maximum similarity of hit molecules. The maximum similarity threshold can also be used to decide whether identity hits with a similarity of 1.0 are included or excluded. To exclude them, set the maximum similarity threshold to a value < 1, e.g. 0.999999.
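
For two fingerprints with a and b bits set, of which c are set in both, the Tanimoto index is c / (a + b - c) and the Dice index is 2c / (a + b). The following Python sketch applies these standard definitions together with the neighbour count and the similarity thresholds. The function and parameter names (nearest_neighbours, k, min_sim, max_sim) are hypothetical and are not ParaSim options, and treating both thresholds as inclusive is an assumption.

```python
import heapq

def _bits(fp):
    """Interpret a bytes fingerprint as one large integer bitset."""
    return int.from_bytes(fp, "little")

def _count(x):
    """Number of set bits in an integer."""
    return bin(x).count("1")

def tanimoto(q, r):
    a, b, c = _count(_bits(q)), _count(_bits(r)), _count(_bits(q) & _bits(r))
    return c / (a + b - c) if a + b - c else 0.0

def dice(q, r):
    a, b, c = _count(_bits(q)), _count(_bits(r)), _count(_bits(q) & _bits(r))
    return 2 * c / (a + b) if a + b else 0.0

def nearest_neighbours(query, references, k=1, min_sim=0.0, max_sim=1.0,
                       index=tanimoto):
    """Return up to k (reference ID, similarity) pairs within the thresholds."""
    scored = ((index(query, fp), rid) for rid, fp in references.items())
    kept = [(s, rid) for s, rid in scored if min_sim <= s <= max_sim]
    return [(rid, s) for s, rid in heapq.nlargest(k, kept)]

# Toy 16-bit fingerprints, invented for illustration.
refs = {"same": b"\x0f\x00", "close": b"\x07\x00", "far": b"\xf0\x00"}
query = b"\x0f\x00"
default_hit = nearest_neighbours(query, refs)
no_identity = nearest_neighbours(query, refs, max_sim=0.999999)
```

With the default settings the identical reference is returned with a similarity of 1.0; lowering max_sim to 0.999999 drops that identity hit, and the next-best reference is reported instead.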

ParaSim accepts query and reference input files containing fingerprint information in a format described below in more detail. ParaSim writes its output in a tab-delimited format to the system's standard output stream (stdout, usually the console), from where it can easily be redirected to files or pipes using the operating system's redirection mechanisms. The output consists of one row per result, containing the ID of the query structure, the ID of the reference structure found and the computed similarity index. If only one nearest neighbour per query is requested and no similarity thresholds are applied (so that the full reference set is queried), the average similarity of the query molecule against all reference molecules is also printed for statistical purposes. If multiple nearest neighbours are requested per query, multiple output lines are written, each containing the same query structure ID. In this case the statistical information is omitted to keep the output easy to read. With the "verbose" option -v, additional information describing the progress of the file reading and calculation steps is written to the standard error stream (stderr, usually also the console).
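
A downstream script might read this output as sketched below. The sketch assumes that the columns appear in the order listed above and that the average similarity, when present, is a fourth column; neither detail is confirmed here. The names Hit, parse_line and read_hits, as well as the sample values in the comments, are made up for illustration.

```python
from typing import NamedTuple, Optional

class Hit(NamedTuple):
    query_id: str
    reference_id: str
    similarity: float
    mean_similarity: Optional[float]  # assumed fourth column, may be absent

def parse_line(line):
    """Split one tab-delimited result row into a Hit record."""
    fields = line.rstrip("\n").split("\t")
    mean = float(fields[3]) if len(fields) > 3 else None
    return Hit(fields[0], fields[1], float(fields[2]), mean)

def read_hits(stream):
    """Yield a Hit for every non-empty line of a stream such as sys.stdin."""
    for line in stream:
        if line.strip():
            yield parse_line(line)

# Example: hits = list(read_hits(sys.stdin)) after importing sys.
```

Because progress messages go to stderr, a reader attached to stdout in this way receives only result rows, even when the verbose option is used.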

Notes

Depending on the fingerprint type generated with third-party software, a similarity index of 1.0 does not necessarily mean full structural identity! For example, fingerprints based on functional classifications may yield a similarity of 1.0 for highly similar but not identical structures.
For speed reasons, ParaSim does not sort the output in any way but returns results immediately as they are computed by the executing threads. The order of the output lines may therefore vary from run to run. If sorted output is required, this can easily be achieved by piping the output into a subsequent sort command.
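
Where a sort command is not convenient, the same effect can be obtained in a few lines of Python. This sketch assumes tab-delimited rows with the query ID in the first column and the similarity in the third; the function name sort_results and the sample rows are invented for illustration.

```python
def sort_results(lines):
    """Sort result rows by query ID, then by descending similarity."""
    def key(line):
        query_id, _reference_id, similarity = line.rstrip("\n").split("\t")[:3]
        return query_id, -float(similarity)
    return sorted(lines, key=key)

unsorted = ["Q2\tR5\t0.40", "Q1\tR9\t0.55", "Q1\tR3\t0.90"]
ordered = sort_results(unsorted)
```

Sorting on the query ID first groups all neighbours of a query together, and the negated similarity places the nearest neighbour at the top of each group.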