Re: [Rdkit-discuss] Polymers, S-Groups, and molblock-parsing (oh my!)
From: James D. <J.D...@ve...> - 2011-10-27 00:49:40
Hi Greg:

James wrote:
>> So I guess the simple question is - should polymers, etc be handled by the
>> parser (maybe if not fully, just partially - eg by deleting the * atoms if
>> the S-Group data are found)?

Greg wrote:
> I'm reluctant to do this since I don't understand the semantics of
> Sgroups well enough to be able to tell if this modification only makes
> sense in this one case or if it's general. In the cases of polymers I
> would tend to say that the correct thing to do is to reject the
> molecule completely since the RDKit is incapable of correctly storing
> what the user intended with the mol block.
> I will try to find the time to grok the CTFile documentation for
> Sgroups, but I would be happy to get input on this from others.

I found some time to have a look into this a bit more myself, and would be
inclined to agree that the best thing to do would be to reject polymers.

From my reading of the CTFILE spec, and the (extremely useful) Gushurst paper
(J. Chem. Inf. Comput. Sci. 1991, 31, 447-454) I would suggest rejecting any
molecule with "polymer" or "components, mixtures, and formulations" Sgroup
data in the molblock; and ignoring or handling the "drawing and displaying
shortcuts" Sgroup data (if there are no Sgroup data in the previous two
categories).

I have copied below my understanding of the categories:

Sgroup types for "polymers":
  SRU - structural repeating unit (for structure-based representation)
  MON - monomer type (for source-based representation)
  COP - copolymer
  CRO - cross-link across two polymers
  GRA - graft (eg terminally-attached) polymer2 on repeat unit of polymer1
  MOD - for representing incomplete(?) modifications
  MER - used when monomer repeat is 1 - ie alternating copolymers
  ANY - (query) for posing more general polymer search queries

Sgroup types for "components, mixtures, and formulations":
  COM - components (members of mixtures/formulations)
  MIX - mixtures (order is not important)
  FOR - formulations (order is important)

Sgroup types for "drawing and display shortcuts":
  SUP - superatoms (can be contracted/expanded for representation purposes)
  MUL - multiple groups (like a repeating superatom, but can only have 0 or 2
        crossing bonds)
  GEN - generic bracketing (does not affect structure)

Rejecting based on the first two categories should be straightforward(?), and
equally applicable to V2000 and V3000. Ignoring the SUP and MUL types will
only (I think...) cause issues in 2D layout - so 'handling' could maybe be to
force the expansion of these groups, then get rid of them and regenerate
coordinates?

Kind regards
James
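
The triage suggested in the message above could be sketched roughly as follows. This is only an illustration in plain Python, not how the RDKit parser behaves: the function names `sgroup_types` and `classify` and the returned labels are hypothetical, and the line scanning assumes well-formed V2000 `M  STY` lines and V3000 `BEGIN SGROUP` / `END SGROUP` blocks. Sgroup types outside the three lists are reported separately rather than guessed at.

```python
# Sgroup categories as listed in the message above.
POLYMER = {"SRU", "MON", "COP", "CRO", "GRA", "MOD", "MER", "ANY"}
MIXTURE = {"COM", "MIX", "FOR"}
DISPLAY = {"SUP", "MUL", "GEN"}


def sgroup_types(molblock):
    """Collect Sgroup type codes from a V2000 or V3000 molblock (simplified)."""
    types = []
    in_v3000_sgroups = False
    pending = ""  # accumulates V3000 lines continued with a trailing '-'
    for line in molblock.splitlines():
        if line.startswith("M  STY"):
            # V2000: "M  STYnn8 sss ttt ..." is a count followed by
            # (Sgroup number, type) pairs, so types sit at every second field.
            fields = line[6:].split()
            types.extend(fields[2::2])
        elif line.startswith("M  V30 "):
            payload = pending + line[7:].strip()
            if payload.endswith("-"):
                # Continued on the next "M  V30 " line: keep collecting.
                pending = payload[:-1]
                continue
            pending = ""
            if payload == "BEGIN SGROUP":
                in_v3000_sgroups = True
            elif payload == "END SGROUP":
                in_v3000_sgroups = False
            elif in_v3000_sgroups:
                # V3000 Sgroup entry: "index type extindex ..."
                fields = payload.split()
                if len(fields) >= 2 and fields[0].isdigit():
                    types.append(fields[1])
    return types


def classify(molblock):
    """Hypothetical policy: reject polymers/mixtures, tolerate display Sgroups."""
    found = {t.upper() for t in sgroup_types(molblock)}
    if found & (POLYMER | MIXTURE):
        return "reject"
    if found - DISPLAY:
        return "unknown"  # a type not in any of the three lists
    if found:
        return "display-only"
    return "none"
```

A caller could run `classify(molblock)` before handing the text to the parser and skip anything that comes back as "reject"; what to do with "display-only" (ignore, or expand and regenerate coordinates) is left open, as in the message.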