[Joelib-devel] JOELib: Speed optimization

Hi all,

here are some theoretical considerations on loading molecule files with JOELib:

0. TEST: loading speed, 07/09/2002, AMD1400+, ASUS board, 1GB DDR RAM,
Win2K, SUN JDK1.4.0-beta2-b77
At the moment the loading process is very transparent, because it uses
text-based files, and maximally flexible, because descriptors can be
simple integer/double values but also user-defined values, such as
integer/double arrays/matrices, mixed input formats as used for CTX
files, or anything else you can imagine. For descriptor development and
processing that's really great, but let's now talk about speeding up the
loading process...

1. Molecular data
only molecules
10000 molecules successful loaded in 11406 ms.
20000 molecules successful loaded in 22562 ms.
30000 molecules successful loaded in 33228 ms.
--> 1.1 seconds/1000 molecules

2. Molecular data with descriptor data
with 204 double value descriptors
10000 molecules successful loaded in 92663 ms.
20000 molecules successful loaded in 185727 ms.
30000 molecules successful loaded in 273794 ms.
--> 9.13 seconds/1000 molecules
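
The per-1000 figures above follow directly from the 30000-molecule runs. As a
rough sketch (the class and method names are only illustrative, not part of
JOELib), the same arithmetic also gives the extra cost per single descriptor
value:

```java
class LoadTimings {
    // Seconds needed per 1000 molecules, given a total loading time in ms.
    static double secondsPer1000(long totalMillis, int molecules) {
        return (totalMillis / 1000.0) / (molecules / 1000.0);
    }

    // Extra microseconds spent per descriptor value: the difference between
    // the run with descriptors and the run without, divided by the number
    // of descriptor values that were read.
    static double microsPerDescriptorValue(long withDescMillis, long plainMillis,
                                           int molecules, int descriptorsPerMolecule) {
        double extraMillis = withDescMillis - plainMillis;
        return extraMillis * 1000.0 / ((double) molecules * descriptorsPerMolecule);
    }

    public static void main(String[] args) {
        // Timings taken from the 30000-molecule runs in the test above.
        System.out.printf("molecules only:   %.2f s/1000%n", secondsPer1000(33228, 30000));
        System.out.printf("with descriptors: %.2f s/1000%n", secondsPer1000(273794, 30000));
        System.out.printf("per descriptor:   %.1f us%n",
                microsPerDescriptorValue(273794, 33228, 30000, 204));
    }
}
```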

OPTIMIZATION POSSIBILITIES:

The question is what exactly we want ...

0. Techniques
 a. Can we define a faster SDF loader or a user-defined loader? YES,
    import/export types can be defined dynamically.
 b. The text loading process can be optimized by defining a loader which
    works directly on the input stream, which makes it necessary to
    define a streaming SDF parser. One possibility is to write our own
    parser, another is to use JavaCC (the Java Compiler Compiler) to
    generate a parser.
    Both possibilities are a lot of work; I assume the loading process
    can be sped up by a factor of 1.3 to 1.9.
 c. Use a binary import format for which you can define a loader. That's
    less flexible and less transparent, but the speed-up should be very
    high (I assume a factor greater than 2).
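
To make point b. a bit more concrete, here is a minimal sketch of a
hand-written reader that works directly on the stream, one line at a time.
It is not JOELib code: the class and method names are hypothetical, it only
collects the "> <NAME>" data items of each record and skips the connection
table, and it assumes records end with a "$$$$" line as in SD files.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

class SdfStreamSketch {
    // Reads the data items of the next record from the stream.
    // Returns null when no further record is available.
    static Map<String, String> nextDataItems(BufferedReader in) {
        Map<String, String> items = new LinkedHashMap<>();
        String currentName = null;              // data item being collected
        StringBuilder value = new StringBuilder();
        boolean sawContent = false;
        try {
            String line;
            while ((line = in.readLine()) != null) {
                if (!line.isEmpty()) sawContent = true;
                if (line.startsWith("$$$$")) {  // end of this record
                    if (currentName != null) items.put(currentName, value.toString());
                    return items;
                }
                if (line.startsWith(">")) {     // data header, e.g. "> <LogP>"
                    if (currentName != null) items.put(currentName, value.toString());
                    int open = line.indexOf('<');
                    int close = line.indexOf('>', open + 1);
                    currentName = (open >= 0 && close > open)
                            ? line.substring(open + 1, close) : null;
                    value.setLength(0);
                } else if (currentName != null) {
                    if (line.isEmpty()) {       // a blank line closes the item
                        items.put(currentName, value.toString());
                        currentName = null;
                    } else {
                        if (value.length() > 0) value.append('\n');
                        value.append(line);
                    }
                }
                // Lines before the first data header (the connection table)
                // are simply skipped in this sketch.
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (!sawContent) return null;           // nothing left in the stream
        if (currentName != null) items.put(currentName, value.toString());
        return items;
    }
}
```

Calling nextDataItems repeatedly on the same BufferedReader walks through the
file record by record without holding the whole file in memory. Whether this
reaches the speed-up I assume above would have to be measured.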

1. Molecular data
 a. Speeding up the loading of molecular data specifically is only
    possible by using the techniques from 0.

2. Descriptor data
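 a. Speeding up the loading of descriptor data specifically is possible
    by using a text-based or binary flat file format, or the techniques
    from 0. Descriptor data sets have a greater potential for
    optimization: in the test above, the 204 descriptors account for
    about 8 of the 9.13 seconds per 1000 molecules.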
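
As an illustration of a binary flat file for descriptor data, the following
sketch writes a molecules-by-descriptors matrix of doubles with
java.io.DataOutputStream and reads it back with DataInputStream. The layout
(two int counts followed by the raw doubles) and the class name are
assumptions made for this example only; it is not an existing JOELib format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class BinaryDescriptorSketch {
    // Assumed layout: int rows, int columns, then rows*columns doubles.
    static byte[] write(double[][] rows) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(rows.length);
            out.writeInt(rows.length == 0 ? 0 : rows[0].length);
            for (double[] row : rows) {
                for (double v : row) {
                    out.writeDouble(v);   // 8 bytes each, no text parsing later
                }
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads back a matrix written by write().
    static double[][] read(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            int molecules = in.readInt();
            int descriptors = in.readInt();
            double[][] rows = new double[molecules][descriptors];
            for (int i = 0; i < molecules; i++) {
                for (int j = 0; j < descriptors; j++) {
                    rows[i][j] = in.readDouble();
                }
            }
            return rows;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Such a format gives up the transparency of the text files, and only plain
double values fit into it; user-defined descriptor types would still need
their own encoding.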

Regards, Joerg K. Wegner

Dipl. Chem. Joerg K. Wegner
Univ. Tuebingen, Computer Architecture, Sand 1, D-72076 Tuebingen, Germany
Tel. (+49/0) 7071 29 78970, Fax (+49/0) 7071 29 5091
E-Mail: mailto:we...@in...
WWW:    http://www-ra.informatik.uni-tuebingen.de