ParaSim can in principle be run on its own if appropriate fingerprint input files are provided. However, to extend the use of ParaSim to persistent memory objects and to support similarity searches directly from Smiles strings or structure files (SDF or Smiles), several additional tools are provided:
1.fp2mem.pl
: Creates and manages persistently stored memory objects with reference fingerprint data.
2.rdkit2parasim.py
: Generates ParaSim input files from Smiles strings or SDF/Smiles files applying RDKit's Morgan or feature-based Morgan fingerprints. Requires installation of Python and RDKit.
3.Molecule2Parasim.xml
: A Pipeline Pilot™ protocol for the generation of ParaSim input files from Smiles strings or SDF/Smiles files applying ECFP or FCFP fingerprints. Requires installation of Pipeline Pilot™.
4.parasim-conversion-knime-demo.zip
: An example workflow for the open-source workflow engine Knime, demonstrating how ParaSim input files can be generated from within Knime.
5.simsearch.pl
: Allows similarity searches directly from a Smiles string or structure files (SDF or Smiles) against a reference dataset stored in ParaSim format.
For testing, sample files with 10 records drawn from the freely available PubChem and ZINC databases are also provided in the data/ subdirectory.
Moreover, several installation-related files are packaged together with ParaSim:
1.parasim-config.txt
: This central configuration file may be edited by the user and stores several default values used by the different scripts.
2.prepare_and_call_pipeline_pilot.csh
: A sample configuration shell script to prepare the system environment for inclusion of Pipeline Pilot™ fingerprint calculations by simsearch.pl.
3.prepare_and_call_python_rdkit.csh
: A sample configuration shell script to prepare the system environment for inclusion of RDKit fingerprint calculations by simsearch.pl.
4.Parasim.pm
: A Perl module containing all shared ParaSim functions.
Operating System
In the current implementation, ParaSim itself is a single Perl script with a parallelized computational core written in C. The C core may use extensions of the GCC compiler or hardware routines of Intel® processors (Intel® Streaming SIMD Extensions/SSE4). For multithreading, the C core uses POSIX threads via the pthread library, which has to be accessible to the compiler. For persistent memory objects, ParaSim uses SysV inter-process communication (IPC) concepts. Due to these requirements, ParaSim can currently only be executed in a Unix/Linux OS environment with the GCC compiler installed.
Software
ParaSim was tested successfully with Perl versions 5.10.0 and 5.12.1 under OpenSuse Linux 11.3 on 32-bit dual-core machines and under Suse Linux Enterprise Server 11 SP3 on 64-bit multiprocessor machines with up to 192 cores. Some Perl modules which are not part of the standard distribution are required:
If you want to make use of the tools packaged together with ParaSim, installation of third-party software like Python, RDKit and Pipeline Pilot™ or further software packages for fingerprint calculations may be necessary.
Memory
Because ParaSim loads the reference set into memory, the size of the reference set is limited only by the available memory. Typically, memory consumption per 1 million reference fingerprints of length 1024 is ~150 MB as a persistent memory object and ~300 MB during runtime.
Since version 0.05, ParaSim allows storing additional data such as Smiles strings. Depending on the amount of additional data, this directly increases memory consumption.
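The ~150 MB figure can be sanity-checked with a little arithmetic. The following is a minimal sketch, not part of ParaSim: the function name raw_fingerprint_bytes is hypothetical, and it only counts the bare bit vectors. That the remainder of the ~150 MB is taken up by identifiers and bookkeeping is an assumption, not something stated by the ParaSim documentation.

```python
def raw_fingerprint_bytes(n_fingerprints: int, n_bits: int = 1024) -> int:
    """Bytes needed for the bare bit vectors only (no IDs, no bookkeeping)."""
    bytes_per_fingerprint = (n_bits + 7) // 8  # round up to whole bytes
    return n_fingerprints * bytes_per_fingerprint


if __name__ == "__main__":
    raw = raw_fingerprint_bytes(1_000_000)
    # 1024 bits = 128 bytes, so 1 million fingerprints need 128 MB of raw bits
    print(f"{raw / 1e6:.0f} MB of raw bits per 1 million fingerprints")
```

The raw bits alone account for 128 MB per 1 million fingerprints, which is in line with the ~150 MB quoted for a persistent memory object.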
ParaSim itself currently consists of just a single Perl script that also includes the C code. Compilation of the C source code is performed automatically by the Inline::C module when the Perl script is called. Therefore, basically no installation is required beyond making the script executable:
chmod 755 parasim.pl
ParaSim expects the perl executable to be located in /usr/bin/perl. If that is not true in your case, change the default Perl path in the first line of the script's source code.
In order to test if ParaSim runs correctly, try
perl parasim.pl -q data/pubchem-test-fcfp6.txt -r data/zinc-test-fcfp6.txt
The output should be
QUERY  REFERENCE     TANIMOTO           AVG_TANIMOTO
68664  ZINC01914437  0.198019801980198  0.104496307506587
71360  ZINC03775002  0.133333333333333  0.103979492391050
68938  ZINC03774999  0.160377358490566  0.122158970101436
71696  ZINC03774999  0.163636363636364  0.118017086925888
71917  ZINC03774999  0.147368421052632  0.102165139370256
71107  ZINC03774999  0.173076923076923  0.128406853662191
71542  ZINC01914437  0.185185185185185  0.107759423159295
71227  ZINC03774999  0.181818181818182  0.129684949182247
71767  ZINC03775009  0.174418604651163  0.122120643622887
71923  ZINC03774991  0.154761904761905  0.117569042869504
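For readers unfamiliar with the TANIMOTO column, the Tanimoto coefficient of two bit fingerprints is the number of bits set in both divided by the number of bits set in either. The following is a minimal illustration of that definition only, with fingerprints held as Python integers; the function name tanimoto and the toy 8-bit values are hypothetical, and this is not how ParaSim's C core is implemented.

```python
def tanimoto(a: int, b: int) -> float:
    """Tanimoto coefficient of two bit fingerprints stored as integers."""
    common = bin(a & b).count("1")  # bits set in both fingerprints
    total = bin(a | b).count("1")   # bits set in at least one fingerprint
    return common / total if total else 0.0


if __name__ == "__main__":
    query = 0b10110010      # toy 8-bit fingerprint, not real data
    reference = 0b00110110  # toy 8-bit fingerprint, not real data
    print(tanimoto(query, reference))  # 3 shared bits / 5 bits in total = 0.6
```

Identical fingerprints give 1.0 and fingerprints without any shared bit give 0.0, so the values around 0.1 to 0.2 in the test output indicate fairly dissimilar molecules.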
If you want to start similarity searches directly from SDF or Smiles files using simsearch.pl, fingerprints and input files for ParaSim need to be generated at runtime using third-party software. Therefore, third-party software packages such as Python and RDKit or Pipeline Pilot™ need to be installed separately:
rdkit2parasim.py
: Make sure that, besides the RDKit modules, the modules "sys", "argparse", "gzip" and "base64" are also accessible to the Python installation.
parasim-config.txt
: Replace placeholders like "my_path" or "my_server" in parasim-config.txt with the path and server information fitting your environment.
prepare_and_call_pipeline_pilot.csh and prepare_and_call_python_rdkit.csh
: Use these sample scripts to prepare the system environment and adapt them to your needs.
Technical note: In the current version of ParaSim, Inline::C compiles the C sources only if binaries do not yet exist or if the C sources were modified. Therefore, if you use ParaSim on different machines in a network, you may be unable to run ParaSim on one architecture because it was previously compiled on a different one. In this case, either re-run the script from a different run directory or make a slight change in the C section of the source code (a single space character is sufficient) to trigger a recompilation for the new architecture. This issue will be addressed in a future version of ParaSim.
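Whether the Python modules named above for rdkit2parasim.py are accessible can be checked before running a search. The following is a minimal sketch using only the Python standard library; the helper name missing_modules and the list REQUIRED_MODULES are hypothetical and not part of ParaSim. Run it with the same Python interpreter that simsearch.pl will call.

```python
import importlib.util

# Modules named in the requirements above; "rdkit" is RDKit's top-level package.
REQUIRED_MODULES = ["rdkit", "sys", "argparse", "gzip", "base64"]


def missing_modules(names):
    """Return the module names that the current interpreter cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are available.")
```

The check only looks the modules up and does not import them, so it will not detect an RDKit installation that is present but broken.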