Re: [Rdkit-discuss] SD tag reordering follow up
From: Greg L. <gre...@gm...> - 2015-10-22 09:06:32
Hi JW,
On Thu, Oct 22, 2015 at 12:47 AM, JW Feng <fe...@dn...> wrote:
>
> I read a post (link below) about SD tag reordering by Matthew and replied
> by Greg and I have a follow up question. I would like to preserve the
> ordering of SD tags as they appear in the input SD file. I tried getting
> the list of SD tags by mol.GetPropNames() and setting the order with
> sd_writer.SetProps() but that didn't work. Turns out mol.GetPropNames()
> returns a list in alphabetical order instead of order of appearance.
>
I would say instead that they appear in an unspecified,
implementation-dependent order. This may be alphabetical, but that is
certainly not guaranteed.
> Is there a way to preserve SD tag orders?
>
There is currently no way to do this automatically. I have always thought
about those properties as being unordered, so the RDKit doesn't maintain
any record of what order properties are added to a molecule.
As long as you have the original SDMolSupplier, you can pretty easily get
the ordered list of property names from that:
In [22]: suppl = Chem.SDMolSupplier('tmp.sdf')
In [23]: m = suppl[0]
In [25]: list(m.GetPropNames()) # <- here's the non-ordered list
Out[25]:
['PUBCHEM_ATOM_DEF_STEREO_COUNT',
'PUBCHEM_ATOM_UDEF_STEREO_COUNT',
'PUBCHEM_BONDANNOTATIONS',
'PUBCHEM_BOND_DEF_STEREO_COUNT',
'PUBCHEM_BOND_UDEF_STEREO_COUNT',
'PUBCHEM_CACTVS_COMPLEXITY',
'PUBCHEM_CACTVS_HBOND_ACCEPTOR',
'PUBCHEM_CACTVS_HBOND_DONOR',
'PUBCHEM_CACTVS_ROTATABLE_BOND',
'PUBCHEM_CACTVS_SUBSKEYS',
'PUBCHEM_CACTVS_TAUTO_COUNT',
'PUBCHEM_CACTVS_TPSA',
'PUBCHEM_COMPONENT_COUNT',
'PUBCHEM_COMPOUND_CANONICALIZED',
'PUBCHEM_COMPOUND_CID',
'PUBCHEM_COORDINATE_TYPE',
'PUBCHEM_EXACT_MASS',
'PUBCHEM_HEAVY_ATOM_COUNT',
'PUBCHEM_ISOTOPIC_ATOM_COUNT',
'PUBCHEM_IUPAC_CAS_NAME',
'PUBCHEM_IUPAC_INCHI',
'PUBCHEM_IUPAC_INCHIKEY',
'PUBCHEM_IUPAC_NAME',
'PUBCHEM_IUPAC_OPENEYE_NAME',
'PUBCHEM_IUPAC_SYSTEMATIC_NAME',
'PUBCHEM_IUPAC_TRADITIONAL_NAME',
'PUBCHEM_MOLECULAR_FORMULA',
'PUBCHEM_MOLECULAR_WEIGHT',
'PUBCHEM_MONOISOTOPIC_WEIGHT',
'PUBCHEM_OPENEYE_CAN_SMILES',
'PUBCHEM_OPENEYE_ISO_SMILES',
'PUBCHEM_TOTAL_CHARGE',
'PUBCHEM_XLOGP3_AA']
In [26]: txt = suppl.GetItemText(0)
In [27]: pns = re.findall(r'> *<(\w+)>',txt) # <- this gives you the list in order
In [28]: pns
Out[28]:
['PUBCHEM_COMPOUND_CID',
'PUBCHEM_COMPOUND_CANONICALIZED',
'PUBCHEM_CACTVS_COMPLEXITY',
'PUBCHEM_CACTVS_HBOND_ACCEPTOR',
'PUBCHEM_CACTVS_HBOND_DONOR',
'PUBCHEM_CACTVS_ROTATABLE_BOND',
'PUBCHEM_CACTVS_SUBSKEYS',
'PUBCHEM_IUPAC_OPENEYE_NAME',
'PUBCHEM_IUPAC_CAS_NAME',
'PUBCHEM_IUPAC_NAME',
'PUBCHEM_IUPAC_SYSTEMATIC_NAME',
'PUBCHEM_IUPAC_TRADITIONAL_NAME',
'PUBCHEM_IUPAC_INCHI',
'PUBCHEM_IUPAC_INCHIKEY',
'PUBCHEM_XLOGP3_AA',
'PUBCHEM_EXACT_MASS',
'PUBCHEM_MOLECULAR_FORMULA',
'PUBCHEM_MOLECULAR_WEIGHT',
'PUBCHEM_OPENEYE_CAN_SMILES',
'PUBCHEM_OPENEYE_ISO_SMILES',
'PUBCHEM_CACTVS_TPSA',
'PUBCHEM_MONOISOTOPIC_WEIGHT',
'PUBCHEM_TOTAL_CHARGE',
'PUBCHEM_HEAVY_ATOM_COUNT',
'PUBCHEM_ATOM_DEF_STEREO_COUNT',
'PUBCHEM_ATOM_UDEF_STEREO_COUNT',
'PUBCHEM_BOND_DEF_STEREO_COUNT',
'PUBCHEM_BOND_UDEF_STEREO_COUNT',
'PUBCHEM_ISOTOPIC_ATOM_COUNT',
'PUBCHEM_COMPONENT_COUNT',
'PUBCHEM_CACTVS_TAUTO_COUNT',
'PUBCHEM_COORDINATE_TYPE',
'PUBCHEM_BONDANNOTATIONS']
If you pass that list of property names to the SDWriter's SetProps()
method, it will write the properties out in the input order.
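The steps above can be put together in a short script. This is only a minimal sketch under a few assumptions: the function names are hypothetical, the regular expression is a slightly broader variant of the one used in the session (it also accepts tag names containing characters other than \w), and the tag order is taken as the union over all records in order of first appearance rather than from the first record alone. The list is handed to the writer through SDWriter.SetProps(), which is the name this method has in the RDKit Python API.

```python
import re

# Matches SD data header lines such as ">  <TAG>" or "> 25 <TAG>".
# Assumption: broader than the r'> *<(\w+)>' pattern used in the session.
_TAG_HEADER = re.compile(r'^>[^<\n]*<([^>\n]+)>', re.MULTILINE)


def ordered_tag_names(record_text):
    """Return the SD tag names of one record in order of appearance."""
    return _TAG_HEADER.findall(record_text)


def merge_tag_orders(records):
    """Union of tag names across records, keeping first-appearance order."""
    seen = {}  # dicts keep insertion order in Python 3.7+
    for text in records:
        for name in ordered_tag_names(text):
            seen.setdefault(name, None)
    return list(seen)


def copy_sdf_preserving_tag_order(in_path, out_path):
    """Hypothetical helper: rewrite an SD file with tags in input order."""
    # Imported lazily so the two helpers above work without RDKit installed.
    from rdkit import Chem

    suppl = Chem.SDMolSupplier(in_path)
    n = len(suppl)
    # Raw record text comes from the supplier, as in the session above.
    names = merge_tag_orders(suppl.GetItemText(i) for i in range(n))

    writer = Chem.SDWriter(out_path)
    writer.SetProps(names)  # set once, before any molecule is written
    for i in range(n):
        mol = suppl[i]
        if mol is not None:  # skip records RDKit could not parse
            writer.write(mol)
    writer.close()
```

Calling copy_sdf_preserving_tag_order('tmp.sdf', 'out.sdf') would then write the molecules back out with their tags in input order. Note that only the properties named in the list are written, so a tag that the regular expression fails to match would be dropped from the output.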
I hope this helps,
-greg