Re: [Rdkit-discuss] Polymers, S-Groups, and molblock-parsing (oh my!)
From: Greg L. <gre...@gm...> - 2011-11-03 04:29:28
James,
On Thu, Oct 27, 2011 at 2:30 AM, James Davidson <J.D...@ve...> wrote:
>
> I found some time to have a look into this a bit more myself, and would be
> inclined to agree that the best thing to do would be to reject polymers.
> From my reading of the CTFILE spec, and the (extremely useful) Gushurst
> paper (J. Chem. Inf. Comput. Sci. 1991, 31, 447-454) I would suggest
> rejecting any molecule with "polymer" or "components, mixtures, and
> formulations" Sgroup data in the molblock; and ignoring or handling the
> "drawing and displaying shortcuts" Sgroup data (if there are no Sgroup data
> in the previous two categories).
>
> I have copied below my understanding of the categories:
>
> Sgroup types for "polymers":
>
> SRU - structural repeating unit (for structure-based representation)
> MON - monomer type (for source-based representation)
> COP - copolymer
> CRO - cross-link across two polymers
> GRA - graft (e.g. terminally attached) polymer2 on repeat unit of polymer1
> MOD - for representing incomplete(?) modifications
> MER - used when monomer repeat is 1, i.e. alternating copolymers
> ANY - (query) for posing more general polymer search queries
>
>
> Sgroup types for "components, mixtures, and formulations":
>
> COM - components (members of mixtures/formulations)
> MIX - mixtures (order is not important)
> FOR - formulations (order is important)
>
>
> Sgroup types for "drawing and display shortcuts":
>
> SUP - superatoms (can be contracted/expanded for representation purposes)
> MUL - multiple groups (like a repeating superatom, but can only have 0 or 2
> crossing bonds)
> GEN - generic bracketing (does not affect structure)
Thanks for tracking down the reference and distilling the contents for us. :-)
> Rejecting based on the first two categories should be straightforward(?),
> and equally applicable to V2000 and V3000. Ignoring the SUP and MUL types
> will only (I think...) cause issues in 2D layout - so 'handling' could maybe
> be to force the expansion of these groups, then get rid of them and
> regenerate coordinates?
I just checked in a change that causes the Mol file parser to generate
errors for the following S groups:
// polymer sgroups:
"SRU","MON","COP","CRO","GRA","MOD","MER","ANY",
// formulations/mixtures:
"COM","MIX","FOR"
The parser accepts (but ignores) other types of S group.
The new behavior for your sample molecule is:
>>> from rdkit import Chem
>>> m = Chem.MolFromMolFile('heparin.mol')
[05:18:15] Unhandled CTAB feature: S group SRU. Molecule skipped.
>>> m is None
True
I'm assuming that the S groups listed above all indicate a molecule
of a type that the RDKit doesn't properly support, so it's reasonable
to return nothing rather than a possibly wrong structure. This is a
pretty broad brush: there could well be uses of those S groups for
molecules the RDKit represents correctly, but without a library of
examples it is very difficult to tell. The current handling, which
amounts to "it may not be correctly represented, so reject it", is
extremely picky, maybe too picky.
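To help collect such examples, a small pre-scan can report which S group
types a molblock declares without going through the Mol file parser at all.
What follows is only a rough sketch in plain Python, not RDKit code: the
function and constant names are made up for illustration, and it assumes
whitespace-separated "M  STY" lines (V2000) and "index type extindex ..."
lines inside a BEGIN SGROUP / END SGROUP block (V3000).

```python
# Hypothetical helper names; none of this is part of the RDKit API.
POLYMER_TYPES = {"SRU", "MON", "COP", "CRO", "GRA", "MOD", "MER", "ANY"}
MIXTURE_TYPES = {"COM", "MIX", "FOR"}
DISPLAY_TYPES = {"SUP", "MUL", "GEN"}
REJECTED_TYPES = POLYMER_TYPES | MIXTURE_TYPES


def sgroup_types(molblock):
    """Return the set of S group type codes declared in a molblock."""
    found = set()
    in_v3000_sgroups = False
    continued = False  # True when the previous V3000 line ended with "-"
    for line in molblock.splitlines():
        if line.startswith("M  STY"):
            # V2000: a count, then (sgroup number, type) pairs.
            fields = line[6:].split()
            found.update(fields[2::2])
        elif line.startswith("M  V30 "):
            body = line[7:].strip()
            if body == "BEGIN SGROUP":
                in_v3000_sgroups = True
                continued = False
            elif body == "END SGROUP":
                in_v3000_sgroups = False
            elif in_v3000_sgroups:
                fields = body.split()
                # Skip continuation lines and non-numbered lines (e.g. DEFAULT).
                if not continued and len(fields) >= 2 and fields[0].isdigit():
                    found.add(fields[1])
                continued = body.endswith("-")
    return found


def rejected_sgroups(molblock):
    """Return, sorted, the declared types that fall in the rejected categories."""
    return sorted(sgroup_types(molblock) & REJECTED_TYPES)
```

Calling rejected_sgroups() on the text of a Mol file gives the list of
types that would trigger the new error, and an empty list means only
display-style (or unknown) S groups are present. Running it over a
collection of files would show how often each rejected type actually
turns up.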
Any additional feedback?
-greg