Re: [Open Babel] Serializing OBMol to memory


ern...@ba... wrote:
> maybe I shall clarify my question. :-)
> 
> I filter many matches of OBMols through a SMARTS pattern in PostgreSQL. 
> Within PostgreSQL, every time a function is called - per table row - it 
> forgets all that was before. So I have to parse the OBMol _every row_ and 
> instantiate the SMARTS Pattern _every row_.
> 
> For the SMARTS pattern, there might be a solution of saving a pointer to 
> in shared memory between calls to avoid init()-ing it every call, but for 
> the parsing of the OBMols I'm still searching for a clever method to avoid 
> the parsing and perception steps every time. As parsing and perception can 
> be done upon saving the OBMol the first time in the database, a means of 
> storing the OBMol, or at least the essential data structures to recreate 
> it faster than from a textual format like SMILES, would be great. Is this 
> RTFM? I have not found any documentation about serializing OBMol, parts of 
> OBMol or something like the fastest input format yet...

One thing that might help a little: sometimes Postgres will call the same function more than once in a single query. With something like:

  select a, ob_foobar(a) from tbl where ... having ob_foobar(a) > 2;

it might call ob_foobar() twice unless you declare the function "IMMUTABLE" when you install it as a plug-in function:

   CREATE OR REPLACE FUNCTION ob_foobar(text) RETURNS integer
   AS '/usr/local/pgsql/lib/libmylib.so', 'ob_foobar'
   LANGUAGE 'C' STRICT IMMUTABLE;

Another thing that can help: the Postgres back end uses a process-per-client model, that is, each client has its own server process running on the back end for the duration of the session.  So if you're linking code into the back end, anything you keep in static memory survives from one function call to the next within that session, and you can create a cache of parsed OBMol objects.

I keep a single string -- whatever I parsed to create the last OBMol object, usually either a SMILES or a MOL file, along with a pointer to the OBMol object, in a static global variable.  When a new SQL statement comes along, I do a simple string comparison to the last molecule, and if the strings are identical, I reuse the OBMol object.

I do the same thing with SMARTS patterns -- keep the most-recent SMARTS and the pattern object.
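
To make the last-object trick concrete, here is a rough C++ sketch of it. It is only an illustration under assumptions: ParsedMol, parse_molecule() and g_parse_count are made-up stand-ins so the example compiles without Open Babel; in a real plug-in the cached object would be the OBMol (or the SMARTS pattern object) and the parser would be whatever Open Babel call you already use.

```cpp
#include <memory>
#include <string>

// Hypothetical stand-in for a parsed molecule. In a real plug-in this
// would be the Open Babel molecule object built from the input text.
struct ParsedMol {
    std::string source;
};

static int g_parse_count = 0;  // counts real parses, for illustration only

// Hypothetical parser: replace the body with the actual parsing code.
std::unique_ptr<ParsedMol> parse_molecule(const std::string& text) {
    ++g_parse_count;
    return std::unique_ptr<ParsedMol>(new ParsedMol{text});
}

// Returns the cached object when `text` is identical to the text seen on
// the previous call; otherwise parses again and replaces the cache entry.
const ParsedMol& get_molecule(const std::string& text) {
    static std::string last_text;               // input of the last parse
    static std::unique_ptr<ParsedMol> last_mol; // result of the last parse
    if (!last_mol || text != last_text) {
        last_mol = parse_molecule(text);
        last_text = text;
    }
    return *last_mol;
}
```

Note that the returned reference is only good until the next call with a different string, since that call replaces the cached object. The statics live as long as the back-end process, which is what makes the reuse across calls possible.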

There's no reason you couldn't keep a small cache of pre-parsed objects, say, an array of a dozen or so, or even a hash table that could hold hundreds of objects.  Whether this is useful or not just depends on your application, how likely it is that a particular object will be reused.
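
A small cache along those lines might look like the sketch below. This is one possible design, not something taken from an existing plug-in: the class name ParsedCache, the least-recently-used eviction policy, and the parser callback are all assumptions made for the example. The cached type T would be the parsed molecule or pattern object.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <list>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical bounded cache of parsed objects, keyed by the input text.
// When full, it evicts the entry that was used least recently.
template <typename T>
class ParsedCache {
public:
    using Parser = std::function<std::shared_ptr<T>(const std::string&)>;

    ParsedCache(std::size_t capacity, Parser parser)
        : capacity_(capacity), parser_(std::move(parser)) {
        assert(capacity_ > 0);
    }

    std::shared_ptr<T> get(const std::string& text) {
        auto hit = index_.find(text);
        if (hit != index_.end()) {
            // Cache hit: move the entry to the front (most recently used).
            entries_.splice(entries_.begin(), entries_, hit->second);
            return hit->second->second;
        }
        if (entries_.size() == capacity_) {
            // Cache full: drop the least recently used entry.
            index_.erase(entries_.back().first);
            entries_.pop_back();
        }
        entries_.emplace_front(text, parser_(text));
        index_[text] = entries_.begin();
        return entries_.front().second;
    }

    std::size_t size() const { return entries_.size(); }

private:
    using Entry = std::pair<std::string, std::shared_ptr<T>>;
    std::size_t capacity_;
    Parser parser_;
    std::list<Entry> entries_;  // front = most recently used
    std::unordered_map<std::string,
                       typename std::list<Entry>::iterator> index_;
};
```

The shared_ptr return means a caller can keep using an object even if a later lookup evicts it from the cache. A capacity of a dozen behaves like the small array mentioned above; a capacity in the hundreds behaves like the hash table.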

On the other hand, if you're scanning through an entire database of molecules, then this strategy won't help.  In that case, the serialization methods you're talking about would be a benefit ... but there is no such method.

In my experience, though, parsing a MOL file is not the bottleneck; it's the SMARTS matching that is slow.  Serializing would help somewhat, but the OB SMARTS pattern matcher is overdue for some serious optimizing.

The real trick, of course, is indexing -- figure out in advance how to NOT parse a particular molecule object, because you know in advance that it can't meet your criteria.
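
One common way to do that kind of pre-filtering is a fingerprint screen, sketched below in C++. This is an assumed technique chosen for illustration, not something spelled out above: the Row layout, the 64-bit fingerprint, and the function names are hypothetical. The idea is that each stored molecule carries a precomputed bit mask of features, the query gets a mask of the features it requires, and only rows whose mask contains every query bit need to be parsed and matched.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stored row: the fingerprint is computed once, when the
// molecule is first saved, and kept next to the molecule text.
struct Row {
    std::uint64_t fingerprint;  // one bit per precomputed feature
    std::string mol_text;       // SMILES or MOL file, parsed only if needed
};

// True when every bit required by the query is also set for the molecule.
// A false result means the molecule cannot match, so it is never parsed.
bool may_match(std::uint64_t mol_fp, std::uint64_t query_fp) {
    return (mol_fp & query_fp) == query_fp;
}

// Returns the rows that survive the screen. Only these still need the
// expensive parse and SMARTS match.
std::vector<const Row*> candidates(const std::vector<Row>& rows,
                                   std::uint64_t query_fp) {
    std::vector<const Row*> out;
    for (const Row& row : rows) {
        if (may_match(row.fingerprint, query_fp)) {
            out.push_back(&row);
        }
    }
    return out;
}
```

The screen can only rule molecules out. A row that passes may still fail the real SMARTS match, so the full match has to run on every candidate.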

Craig