rdkit-discuss Mailing List for RDKit (Page 5)

Hi,

Perhaps I’m missing something obvious, but is there a way to calculate the number of aromatic atoms in a molecule?

Cheers

Chris
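
One possible approach to the question above, offered as a minimal sketch rather than an official answer: iterate over the atoms and count those that RDKit flags as aromatic via `Atom.GetIsAromatic()`. The helper name `count_aromatic_atoms` and the example SMILES are assumptions made for illustration.

```python
from rdkit import Chem


def count_aromatic_atoms(mol):
    """Count atoms that RDKit's aromaticity perception marks as aromatic."""
    return sum(1 for atom in mol.GetAtoms() if atom.GetIsAromatic())


# Hypothetical example molecule: phenol (six aromatic carbons, one oxygen).
phenol = Chem.MolFromSmiles("c1ccccc1O")
print(count_aromatic_atoms(phenol))
```

The count depends on the aromaticity model applied when the molecule was sanitized, so other toolkits may give different numbers for the same structure.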

Good afternoon,

I am attempting to build Shape-It on Windows using Visual Studio and Anaconda. I believe my variables and environment are configured correctly, although I could of course be wrong. The build files are written to the project folder, but when I build the project I get errors about missing C++ header files that it is looking for in RDKit. I know RDKit is installed in my conda environment, and some of these missing header or source files belong to GraphMol, for example. I have searched my conda environment (including its prefix) and the rest of the Windows machine, and GraphMol is not found anywhere. I am not sure whether I am missing something, so any insight is greatly appreciated. My colleague was able to install Shape-It successfully on Linux with no issues and apparently without changing any environment variables.

Is it even possible to build Shape-It on Windows? Thank you in advance.

Best regards,

Nick Jones
IT Support Engineer

Eurofins Panlabs Inc.
6 Research Park Dr
St. Charles, MO 63304
United States of America
Cell Phone: 636-445-4759 or 636-328-8303
Email: Nic...@eu...
Website: www.eurofinsdiscoveryservices.com and www.discoverx.com

Hi Rocco,

That is exactly what I was looking for. Thanks so much for your kind suggestion!

Massive Thanks,
Amy

From: Rocco Moretti <rmo...@gm...>
Date: Friday, October 27, 2023 at 12:30 PM
To: He, Amy <he...@bu...>
Cc: rdk...@li... <rdk...@li...>
Subject: Re: [Rdkit-discuss] Is there a Smiles library for common amino acids and ligands that can be used for AssignBondOrdersFromTemplate

I'll note that the official definitions for all the chemical entities in the PDB can be found in the wwPDB's Chemical Component Dictionary: https://www.wwpdb.org/data/ccd

That's in mmCIF format, but there are various SMILES and InChI definitions for the residues included in the file. (Your mileage may vary for the quality of those representations, though, especially for the rarer ones, but it should be no worse than the SDFs.)

You should be able to use an mmCIF parser to extract them.

e.g.
from mmcif.core.mmciflib import ParseCifSimple  # py-mmcif from the RCSB: `pip install mmcif`
ccd = ParseCifSimple("components.cif", True, 0, 255, "?", "logfile.txt") # logfile.txt is an arbitrary name

ALA = ccd.GetBlock("ALA")
desc = ALA.GetTable("pdbx_chem_comp_descriptor")
print( desc.GetColumnNames() )
for ii in range(desc.GetNumRows()):
    print( desc.GetRow(ii) )

['comp_id', 'type', 'program', 'program_version', 'descriptor']
['ALA', 'SMILES', 'ACDLabs', '10.04', 'O=C(O)C(N)C']
['ALA', 'SMILES_CANONICAL', 'CACTVS', '3.341', 'C[C@H](N)C(O)=O']
['ALA', 'SMILES', 'CACTVS', '3.341', 'C[CH](N)C(O)=O']
['ALA', 'SMILES_CANONICAL', 'OpenEye OEToolkits', '1.5.0', 'C[C@@H](C(=O)O)N']
['ALA', 'SMILES', 'OpenEye OEToolkits', '1.5.0', 'CC(C(=O)O)N']
['ALA', 'InChI', 'InChI', '1.03', 'InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1']
['ALA', 'InChIKey', 'InChI', '1.03', 'QNAYBMKLOCPYGJ-REOHCLBHSA-N']

The components file is rather large, so parsing time might be a little long at times.
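
Once the descriptor rows are available, picking one SMILES per residue to fill a `ref_smi`-style dictionary could look like the sketch below. It uses only plain Python on rows shaped like the output printed above; the helper name `pick_descriptor` and the default preference for the OpenEye canonical SMILES are assumptions for illustration, not part of py-mmcif or RDKit.

```python
def pick_descriptor(rows, desc_type="SMILES_CANONICAL", program="OpenEye OEToolkits"):
    """Return the first descriptor whose type and program match, else None.

    Each row is assumed to follow the column order shown above:
    [comp_id, type, program, program_version, descriptor].
    """
    for _comp_id, row_type, row_program, _version, descriptor in rows:
        if row_type == desc_type and row_program == program:
            return descriptor
    return None


# A subset of the ALA rows copied from the output above.
ala_rows = [
    ['ALA', 'SMILES', 'ACDLabs', '10.04', 'O=C(O)C(N)C'],
    ['ALA', 'SMILES_CANONICAL', 'CACTVS', '3.341', 'C[C@H](N)C(O)=O'],
    ['ALA', 'SMILES_CANONICAL', 'OpenEye OEToolkits', '1.5.0', 'C[C@@H](C(=O)O)N'],
]

ref_smi = {"ALA": pick_descriptor(ala_rows)}
print(ref_smi)
```

Note that these SMILES describe the free component (for ALA, the whole amino acid including the carboxyl OH), so they may need trimming before they can serve as in-chain residue templates.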


On Fri, Oct 27, 2023 at 10:55 AM He, Amy <he...@bu...>
wrote:

Dear RDKit experts,

I need your advice on finding a source Smiles library for reference, to build the template molecule from Smiles for AssignBondOrdersFromTemplate<https://www.rdkit.org/docs/source/rdkit.Chem.AllChem.html>.

I am using AssignBondOrdersFromTemplate to perceive bonds in a residue-wise manner from an input PDB, using a reference Smiles library like this:

ref_smi = {

    "ALA": "NC(C)C(=O)",
    "GLY": "NCC(=O)",
    "ILE": "NC(C(C)CC)C(=O)",

}

I wonder if there is an open reference library for common amino acids and ligands that are present in PDB files. A previous post on rdkit-discuss (https://rdkit-discuss.narkive.com/JM2IGLQz/pdb-reader-and-bond-perception) pointed me to this website:
ftp://ftp.ebi.ac.uk/pub/databases/msd/pdbechem/files/pdb.tar.gz
and useful links from
http://www.ebi.ac.uk/pdbe-srv/pdbechem/

But I am no longer able to access the contents.

I guess we could always generate SMILES from the standardized SDF files. Still, I am wondering if there is an existing SMILES library (like a reference data file) from which we can retrieve the SMILES string using the residue names of common amino acids and maybe also ligands.

Any comments or suggestions would be greatly appreciated. Thank you for your time and kind support in advance!

Bests,

--
Amy He
Chemistry Graduate Teaching Assistant
Hadad Lab
Ohio State University
he...@os...

Hi all,

If I create a molecule from smiles:
```
from rdkit import Chem

mol = Chem.MolFromSmiles("C")
for atm in mol.GetAtoms():
    print(atm.GetPDBResidueInfo())
```
then the PDB residue info for each atom is None.

However, if I create a molecule from a PDB block, is it possible to delete the PDB
residue info associated with each atom, so that None is still returned? For
illustrative purposes:
```
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("C")

mol = AllChem.AssignBondOrdersFromTemplate(
            mol, Chem.MolFromPDBBlock(Chem.MolToPDBBlock(mol))
        )

## !! This would instead raise "ValueError: MonomerInfo is not a PDB Residue"
for atm in mol.GetAtoms():
    atm.SetMonomerInfo(Chem.AtomMonomerInfo())
    print(atm.GetPDBResidueInfo())
```

Is it possible to set PDBResidueInfo to None? Or will I have to work around
it by, say, writing out an intermediate SDF file and reading it back in?

Thank you

I'm not sure exactly what you're looking for, but all of the code for
reading and writing SMILES is here:
https://github.com/rdkit/rdkit/tree/master/Code/GraphMol/SmilesParse

-greg

On Tue, Oct 24, 2023 at 11:51 AM Eduardo Mayo <edu...@gm...>
wrote:

Hello all,

I hope you all are doing well.

I am struggling to find the code where the SMILES-to-mol and mol-to-SMILES
translation happens. Can someone point me in the right direction?

kind regards,
eduardo

Apologies for not posting code. I'm experiencing this error in a large class object, and it occurs only a few times per thousand objects.

When I call RDKit::MolToSmiles() on an RWMol object (a member of the larger class object), it sometimes returns an empty string. However, I am able to perform other operations on it (e.g. in a chemical reaction).

I am soliciting suggestions on how to approach debugging.

Thanks, 

J  

Thanks, Diogo and Fio. That solved that problem.

Jeremy

On Sun, Oct 8, 2023 at 1:23 AM Diogo Martins <dio...@gm...> wrote:

Hi Jeremy,

Chem.AddHs returns a new molecule, you could reassign the variable:

mol = Chem.AddHs(mol)

Best regards,
Diogo

Hi Jeremy,

iirc you have to write mol = Chem.AddHs(mol).
In your code you are not keeping the object with the added Hs so there are
no explicit Hs to find when you iterate.

Cheers,
Fio
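
To make the point above concrete, here is a minimal sketch showing that `Chem.AddHs` leaves the input molecule untouched and returns a new one, followed by the per-atom isotope assignment the original question was after. Labelling every hydrogen as deuterium (isotope 2) and the variable name `mol_h` are arbitrary choices for illustration.

```python
from rdkit import Chem

mol = Chem.MolFromSmiles("CCO")  # ethanol, hydrogens implicit
mol_h = Chem.AddHs(mol)          # returns a NEW molecule; mol is unchanged

# Iterate over all atoms, now including the explicit hydrogens.
for atom in mol_h.GetAtoms():
    if atom.GetAtomicNum() == 1:
        atom.SetIsotope(2)  # arbitrary example: mark each H as deuterium

print(mol.GetNumAtoms(), mol_h.GetNumAtoms())
```

Keeping the result under a separate name makes it easy to compare the heavy-atom-only molecule with the hydrogen-complete one.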

On Sat, Oct 7, 2023 at 9:36 PM Jeremy Monat <je...@gm...> wrote:

In Python, I'd like to iterate through all the atoms in a molecule,
including hydrogens, so I can assign an isotope to each atom. I haven't
been able to include hydrogens in the iterable of atoms:

from rdkit import Chem

mol = Chem.MolFromSmiles("CCO") # Example molecule: Ethanol (C2H5OH)

# Add explicit hydrogens
Chem.AddHs(mol)

for atom in mol.GetAtoms():
    print(f"Atom Symbol: {atom.GetSymbol()}")
Output:
Atom Symbol: C
Atom Symbol: C
Atom Symbol: O

Similarly, mol.GetAtomWithIdx() works up to an index of only 3, giving C,
C, and O atoms but no hydrogens.

Thanks,
Jeremy
 -- ~ -- ~ --
Jeremy Monat, PhD
LinkedIn: http://www.linkedin.com/in/jemonat
Portfolio: https://bertiewooster.github.io
GitHub: https://github.com/bertiewooster

Thank you Andrew for the information. It is good to know that this is part
of the standard, so I don't need to worry now. And I like the safety
checking part of your code.

Dan, I wrote my email because I did not see any mention of this in the SD
file definition documents that I could find. I could have overlooked it. But
if it really is not part of the definition, it is always possible to
encounter I/O problems. We have encountered several similar situations
with non-conforming files and non-conforming parsers, and I had to check the
format definition to determine which customer support (writer or reader
side) to write to. This is why I am careful now. Updating the software
you use would not solve it; it's not a bug as far as the parsing software
is concerned.

Ling

On Fri., Sep. 29, 2023, 10:07 Dan Nealschneider, <
dan...@sc...> wrote:

I'd also be curious how the index is causing you problems. All SD reading
code that I know about ignores those suffixes. If you're not using RDKit to
read the SD file, maybe it would be best to update whatever it is you *are*
using to parse the file.

dan nealschneider | senior staff developer

*he/him/his*

On Fri, Sep 29, 2023 at 1:08 AM Andrew Dalke <da...@da...>
wrote:

On Sep 26, 2023, at 01:17, Ling Chan <lin...@gm...> wrote:
> >  <pKa>  (1) 
> 4.0999999
  ..
> Just wonder what was the rationale behind this extra "(1)" on the property field lines (pKa and logP in the above example)?
> 
> And is there a way to get rid of these? I am not sure if this extra "(1)" is part of the standard sd format.

RDKit uses the increasing value as a sort of per-file registry number.

This follows the part of the standard which says "External registry numbers must be enclosed in parentheses."

The relevant code is in Code/GraphMol/FileParsers/SDWriter.cpp :

  if (d_molid >= 0) {
    (*dp_ostream) << "(" << d_molid + 1 << ") ";
  }

There is no way to suppress this output. Not only is there no direct way to change d_molid, but d_molid cannot be negative, as Code/GraphMol/FileParsers/MolWriters.h declares it as:

  unsigned int d_molid;      // the number of the molecules we wrote so far

Wim suggested a post-processing approach. Another is to write the SD data items yourself, that is, use MolToMolBlock() to generate the connection table/molfile as a string, then iterate through the properties and generate the data items.

import sys
from rdkit import Chem

def MolToSDFRecord(
        mol,
        includeStereo: bool = True,
        confId: int = -1,
        kekulize: bool = True,
        forceV3000: bool = False):
    mol_block = Chem.MolToMolBlock(mol, includeStereo, confId, kekulize, forceV3000)

    lines = []
    for prop_name in mol.GetPropNames():
        if "\n" in prop_name or ">" in prop_name or "<" in prop_name:
            sys.stderr.write(f"WARNING: Skipping property {prop_name!r} because the "
                             "name includes an unsupported character.\n")
            continue

        prop_value = mol.GetProp(prop_name)
        if "\n" in prop_value:
            if "\n\n" in prop_value or "\r\n\r\n" in prop_value:
                sys.stderr.write(f"WARNING: Skipping property {prop_name!r} because the "
                                 "value includes an embedded blank line.\n")
                continue
            if prop_value.endswith("\r\n"):
                prop_value = prop_value[:-2]
            elif prop_value.endswith("\n"):
                prop_value = prop_value[:-1]

        lines.append(f"> <{prop_name}>\n{prop_value}\n\n")

    lines.append("$$$$\n")

    return mol_block + "".join(lines)

mol = Chem.MolFromSmiles("CCO")
mol.SetProp("pKa","3.3\r\n")
print(MolToSDFRecord(mol))

				Andrew
				da...@da...

Thank you Wim. I shall post-process the SDF as you suggested.
Ling

On Mon, Sep 25, 2023 at 5:11 PM Wim Dehaen <wim...@gm...> wrote:

Why there is a counter between parentheses there, I don't know, but in case
there's no option to remove it, you might just manually remove it using a
regex to remove anything between parentheses on a line that starts with >
for example:

from rdkit import Chem
import re
from io import StringIO
m = Chem.MolFromSmiles("CCC")
m.SetProp("pKa","3.3")
sio = StringIO()
with Chem.SDWriter(sio) as o:
    o.write(m)
sio.seek(0)
with open("temp3.sdf", "w") as f:
    for line in sio.readlines():
        f.write(re.sub(r'^>(.*?)\((.*?)\)', r'>\1', line))

best wishes
wim
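
As a variation on the post-processing idea above, the sketch below strips only a trailing parenthesized number that follows the `<name>` part of a data header line, so parentheses inside a property name or value are left alone. It uses only Python's standard `re` module; the function name `strip_record_numbers` and the sample text are hypothetical, and the pattern assumes header lines shaped like `>  <pKa>  (1)` as in the output quoted in this thread.

```python
import re

# Match a data header such as ">  <pKa>  (1) " and capture the part to keep.
_DATA_HEADER = re.compile(r"^(>[ \t]*<[^>\n]*>)[ \t]*\(\d+\)[ \t]*$", re.MULTILINE)


def strip_record_numbers(sdf_text: str) -> str:
    """Remove the '(n)' suffix from SD data header lines in a text block."""
    return _DATA_HEADER.sub(r"\1", sdf_text)


# Hypothetical fragment of an SD record, for illustration only.
sample = (
    "M  END\n"
    ">  <pKa>  (1) \n"
    "4.0999999\n"
    "\n"
    ">  <note>  (1) \n"
    "value (keep this)\n"
    "\n"
    "$$$$\n"
)

cleaned = strip_record_numbers(sample)
print(cleaned)
```

Anchoring the pattern on the closing `>` of the field name is a deliberate choice: a looser pattern that removes the first parenthesized chunk on any line starting with `>` would also alter a field name such as `<IC50 (nM)>`.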

On Tue, Sep 26, 2023 at 1:20 AM Ling Chan <lin...@gm...> wrote:

> Dear Colleagues,
>
> I noticed that when writing out molecules using SDWriter() , the
> properties fields are followed by something like "(1)" , "(2)". I mean, the
> sdf looks like:
>
> propane
>      RDKit          3D
>
>   3  2  0  0  0  0  0  0  0  0999 V2000
>     0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
>     1.4280    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
>     1.9040    1.3000   -0.3480 C   0  0  0  0  0  0  0  0  0  0  0  0
>   1  2  1  0
>   2  3  1  0
> M  END
> >  <pKa>  (1)
> 4.0999999
>
> >  <logP>  (1)
> 2
>
> $$$$
>
> Just wonder what was the rationale behind this extra "(1)" on the property
> field lines (pKa and logP in the above example)?
>
> And is there a way to get rid of these? I am not sure if this extra "(1)"
> is part of the standard sd format.
>
> Thank you!
>
> Regards,
> Ling
>
>
> ---------------------------------------------------------------------------------------------------
>
> To create an sdf, you can do something like:
>
> >>> from rdkit import Chem
> >>> m = Chem.MolFromSmiles("CCC")
> >>> m.SetProp("pKa","3.3")
> >>> with Chem.SDWriter("temp3.sdf") as o:
> ...   o.write(m)
>
> Or use Chem.SDMolSupplier() to get mols from another sdf.

Dear Colleagues,

I noticed that when writing out molecules using SDWriter(), the property
fields are followed by something like "(1)", "(2)". I mean, the SDF looks
like:

propane
     RDKit          3D

  3  2  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.4280    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9040    1.3000   -0.3480 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0
  2  3  1  0
M  END
>  <pKa>  (1)
4.0999999

>  <logP>  (1)
2

$$$$

I just wonder what the rationale is for this extra "(1)" on the property
field lines (pKa and logP in the above example).

Is there a way to get rid of it? I am not sure whether this extra "(1)"
is part of the standard SD format.

Thank you!

Regards,
Ling

---------------------------------------------------------------------------------------------------

To create an sdf, you can do something like:

>>> from rdkit import Chem
>>> m = Chem.MolFromSmiles("CCC")
>>> m.SetProp("pKa","3.3")
>>> with Chem.SDWriter("temp3.sdf") as o:
...   o.write(m)

Or use Chem.SDMolSupplier() to get mols from another sdf.

Hi all,

I’m working with several chemical structure sources, and there seem to be differences in how implementations specify the stereochemistry of bridged bicyclic compounds, which lead to unlikely structures. For example, one such structure is:

CC(C)(C)OC(=O)N1[C@]2([H])C[C@@](CC2)([H])[C@H]1C3Nc4c(ccc(c4)B5OC(C)(C)C(C)(C)O5)N=3

When I create a molecule from this string with RDKit, the rendering indicates that the structure is incorrectly specified (the hydrogens on the bridgehead carbons have opposite wedge directions). The molecule also fails to produce a 3D conformer with `rdDistGeom.EmbedMolecule` (after AddHs).

The fix seems to be to correct the original SMILES string by changing one of the bridgehead stereochemistry configurations, which then leads to same-wedge renderings and successful 3D conformer generation.

My question is: where do these SMILES strings with “problematic” stereochemistry specifications originate? Are there software implementations of SMILES generation that are internally consistent but incompatible with RDKit’s internal consistency? Does such a disagreement in details originate from ambiguity in the SMILES specification?

Best regards,
Steve Brown
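
The fix described in the message above (inverting the configuration at one stereocentre) can be sketched as a plain string operation, assuming the SMILES only uses the simple `@` / `@@` chiral tags. The helper names `invert_center` and `single_inversions` are hypothetical and not part of RDKit; the sketch only enumerates candidates and does not decide which one is chemically right.

```python
import re

# A chiral tag in a bracket atom: "@@" or "@". Extended forms such as
# @TH1 or @SP2 are not handled by this sketch.
_CHIRAL_TAG = re.compile(r"@@?")


def invert_center(smiles: str, index: int) -> str:
    """Swap @ and @@ on the index-th (0-based) chiral tag in smiles."""
    matches = list(_CHIRAL_TAG.finditer(smiles))
    if not 0 <= index < len(matches):
        raise IndexError(
            f"SMILES has {len(matches)} chiral tags, got index {index}"
        )
    m = matches[index]
    flipped = "@" if m.group() == "@@" else "@@"
    return smiles[: m.start()] + flipped + smiles[m.end():]


def single_inversions(smiles: str) -> list[str]:
    """All variants of smiles with exactly one chiral tag inverted."""
    count = len(_CHIRAL_TAG.findall(smiles))
    return [invert_center(smiles, i) for i in range(count)]


# The SMILES from the message; it carries three chiral tags.
smi = (
    "CC(C)(C)OC(=O)N1[C@]2([H])C[C@@](CC2)([H])[C@H]1"
    "C3Nc4c(ccc(c4)B5OC(C)(C)C(C)(C)O5)N=3"
)
for candidate in single_inversions(smi):
    print(candidate)
```

Each candidate could then be run through the same steps as in the message (`Chem.MolFromSmiles`, `Chem.AddHs`, `rdDistGeom.EmbedMolecule`) to see which ones embed; `EmbedMolecule` returns -1 when it cannot produce a conformer.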

Hi Greg,

Thanks for the info. I'll keep an eye on that then.

Giammy

On Thu, 7 Sept 2023 at 13:08, Greg Landrum <gre...@gm...> wrote:

> Hi Giammy,
>
> We currently only have the Python implementation. Doing a C++ version is
> on my ToDo list, but I'm not sure when we'll get there.
>
> best regards,
> -greg
>
>
> On Thu, Sep 7, 2023 at 1:17 PM Gianmarco Ghiandoni <ghi...@gm...>
> wrote:
>
>> Hello all,
>>
>> I've been testing the Python module from rdkit.Chem import
>> RegistrationHash for some time now and I would like to use it in Java
>> too. I browsed the RDKit repository but I could not find it implemented in
>> C++, and therefore, not available in the Java JARs.
>>
>> Am I missing it from somewhere else or is it just implemented in Python?
>>
>> Thanks,
>>
>> Giammy
>>
>> --
>> *Gianmarco*

-- 
*Gianmarco*

Hi Giammy,

We currently only have the Python implementation. Doing a C++ version is on
my ToDo list, but I'm not sure when we'll get there.

best regards,
-greg

On Thu, Sep 7, 2023 at 1:17 PM Gianmarco Ghiandoni <ghi...@gm...>
wrote:

> Hello all,
>
> I've been testing the Python module from rdkit.Chem import
> RegistrationHash for some time now and I would like to use it in Java
> too. I browsed the RDKit repository but I could not find it implemented in
> C++, and therefore, not available in the Java JARs.
>
> Am I missing it from somewhere else or is it just implemented in Python?
>
> Thanks,
>
> Giammy
>
> --
> *Gianmarco*

Hello all,

I've been testing the Python module RegistrationHash (from rdkit.Chem import
RegistrationHash) for some time now and I would like to use it in Java too. I
browsed the RDKit repository but could not find a C++ implementation, so it
is not available in the Java JARs.

Am I missing it from somewhere else or is it just implemented in Python?

Thanks,

Giammy

-- 
*Gianmarco*

Hello,

Is there a way to query a SQL database (through the cartridge) using a fingerprinting method other than the ones listed below? 

rdkit.morgan_fp_size : the size (in bits) of morgan fingerprints
rdkit.featmorgan_fp_size : the size (in bits) of featmorgan fingerprints
rdkit.layered_fp_size : the size (in bits) of layered fingerprints
rdkit.rdkit_fp_size : the size (in bits) of RDKit fingerprints
rdkit.torsion_fp_size : the size (in bits) of topological torsion bit vector fingerprints
rdkit.atompair_fp_size : the size (in bits) of atom pair bit vector fingerprints
rdkit.avalon_fp_size : the size (in bits) of avalon fingerprints

I am interested in using the "pattern" method right now (rdkit.Chem.rdmolops.PatternFingerprint), but I would be interested in learning about a general method for implementing custom fingerprints in the RDKit cartridge. If such a thing is possible, I would also be interested in using similarity scoring methods other than Tanimoto and Dice at some point.  

Thanks in advance for your help.
-Ken
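
For reference, the two similarity scores named in the question, plus Tversky as one example of an alternative, can be written out on sets of on-bit indices. This is a sketch of the formulas only, in plain Python; it says nothing about what the cartridge exposes or how it is implemented, and the function names and the two example fingerprints are made up for illustration.

```python
def tanimoto(a: set, b: set) -> float:
    """|A & B| / |A | B|; defined as 0.0 here when both sets are empty."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def dice(a: set, b: set) -> float:
    """2 |A & B| / (|A| + |B|); 0.0 here when both sets are empty."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def tversky(a: set, b: set, alpha: float, beta: float) -> float:
    """|A & B| / (|A & B| + alpha |A - B| + beta |B - A|)."""
    common = len(a & b)
    denom = common + alpha * len(a - b) + beta * len(b - a)
    return common / denom if denom else 0.0


# Two made-up fingerprints, given as the indices of their set bits.
fp_a = {1, 2, 3, 4}
fp_b = {3, 4, 5}

print(tanimoto(fp_a, fp_b))
print(dice(fp_a, fp_b))
print(tversky(fp_a, fp_b, 0.9, 0.1))
```

Tversky reduces to Tanimoto for alpha = beta = 1 and to Dice for alpha = beta = 0.5, so it is a convenient way to experiment with asymmetric weighting of the two fingerprints.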

I just learned how to search the archives, and I found this well-titled resource: rdkit.blogspot.com/2017/04/using-custom-fingerprint-in-postgresql.html, which may have all the answers I need.  Apologies for the (hopefully) unnecessary post.

-Ken

On Fri, Sep 1, 2023, at 12:03 PM, Ken wrote:
> Hello,
>
> Is there a way to query a SQL database (through the cartridge) using a 
> fingerprinting method other than the ones listed below? 
>
> rdkit.morgan_fp_size : the size (in bits) of morgan fingerprints
> rdkit.featmorgan_fp_size : the size (in bits) of featmorgan fingerprints
> rdkit.layered_fp_size : the size (in bits) of layered fingerprints
> rdkit.rdkit_fp_size : the size (in bits) of RDKit fingerprints
> rdkit.torsion_fp_size : the size (in bits) of topological torsion bit 
> vector fingerprints
> rdkit.atompair_fp_size : the size (in bits) of atom pair bit vector 
> fingerprints
> rdkit.avalon_fp_size : the size (in bits) of avalon fingerprints
>
> I am interested in using the "pattern" method right now 
> (rdkit.Chem.rdmolops.PatternFingerprint), but I would be interested in 
> learning about a general method for implementing custom fingerprints in 
> the RDKit cartridge. If such a thing is possible, I would also be 
> interested in using similarity scoring methods other than Tanimoto and 
> Dice at some point.  
>
> Thanks in advance for your help.
> -Ken


Open-Source Cheminformatics and Machine Learning

rdkit-discuss — Mailing list for discussion, questions and answers.