Re: [Rdkit-discuss] canonical fragment SMILES

Thank you, Wim. It works. An even simpler solution is to remove all 
atoms except the required ones. I had to guess :)
However, this is a bug in recent RDKit versions. The function 
MolFragmentToSmiles works correctly in the 2023 version, but not in 2024.
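
For reference, a minimal sketch of the atom-removal idea mentioned above. The helper name frag_smiles_by_removal is only illustrative (it is not an RDKit function), and the sketch assumes that writing SMILES from the edited, unsanitized molecule is acceptable for these patterns:

```python
from rdkit import Chem


def frag_smiles_by_removal(mol, frag_atoms):
    """Illustrative helper: delete every atom outside frag_atoms, then write SMILES."""
    keep = set(frag_atoms)
    rw = Chem.RWMol(mol)  # editable copy, the input molecule is left untouched
    # Remove from the highest index down so the remaining indices stay valid.
    for idx in range(mol.GetNumAtoms() - 1, -1, -1):
        if idx not in keep:
            rw.RemoveAtom(idx)
    # No sanitization here: the leftover aromatic atoms are no longer in rings.
    return Chem.MolToSmiles(rw)


mol = Chem.MolFromSmiles('CCn1c2cccc3CCn(c23)c2ccccc12')
smi_a = frag_smiles_by_removal(mol, [1, 2, 3, 17])
smi_b = frag_smiles_by_removal(mol, [9, 10, 11, 12])
print(smi_a)
print(smi_b)
```

Since only the fragment atoms are left when the SMILES is written, the ring membership of the parent molecule should no longer influence the atom ordering, so both calls are expected to give the same string.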

On 28/03/2025 00:10, Wim Dehaen wrote:
> Pavel,
> this is a bit hacky, but you can try the below:
> ```
> from rdkit import Chem
>
> def get_frag_smi(mol, frag_atoms):
>     frag_atoms = set(frag_atoms)
>     if len(frag_atoms) <= 1:
>         # one atom, no canonicalization needed
>         return Chem.MolFragmentToSmiles(mol, list(frag_atoms))
>     # bonds to break: every bond with at least one atom outside the fragment
>     b2b = [b.GetIdx() for b in mol.GetBonds()
>            if b.GetBeginAtomIdx() not in frag_atoms
>            or b.GetEndAtomIdx() not in frag_atoms]
>     # break all bonds except those inside the fragment
>     fmol = Chem.FragmentOnBonds(mol, b2b, addDummies=False)
>     # retain the only fragment with more than one atom in there
>     for smi in Chem.MolToSmiles(fmol).split("."):
>         m = Chem.MolFromSmiles(smi, sanitize=False)
>         if m.GetNumAtoms() > 1:
>             return smi
>     # reached only if the fragment atoms are not bonded to each other
>     raise ValueError("no multi-atom fragment found")
> ```
> it is based on the observation/assumption that FragmentOnBonds() and 
> then MolToSmiles() canonicalizes the fragments cleanly.
> > print(get_frag_smi(mol,[1,2,3,17]))
> > print(get_frag_smi(mol,[9,10,11,12]))
> prints `cN(c)O` twice.
>
> best wishes,
> wim
>
> On Thu, Mar 27, 2025 at 12:23 PM Pavel Polishchuk 
> <pav...@uk...> wrote:
>
>     Hello,
>
>       I encountered an issue with SMILES of fragments. Maybe someone
>     may suggest a workaround.
>       I attached the notebook, but will also reproduce some code here.
>
>       We have a structure with two Ns. We take each N atom and its
>     adjacent atoms to make a fragment SMILES and get different
>     results, although the SMILES represent the same pattern (only the
>     order of atoms is different). I guess this happens due to the
>     canonicalization algorithm, which takes into account some
>     additional information missing from the output SMILES (e.g. ring
>     membership). For instance, if we break the saturated cycle (bond
>     8-9), we get identical SMILES output.
>
>     mol = Chem.MolFromSmiles('CCn1c2cccc3CCn(c23)c2ccccc12')
>
>
>     print(Chem.MolFragmentToSmiles(mol, [1,2,3,17], canonical=True))
>     print(Chem.MolFragmentToSmiles(mol, [9,10,11,12], canonical=True))
>
>     cN(C)c
>     cN(c)C
>
>       So, the question is how to work around this issue. We already
>     have millions of such patterns, so it would be enough to be
>     able to canonicalize them. However, standard canonicalization does
>     not work, because we have to disable sanitization during SMILES
>     parsing: it returns the same output as the input SMILES. Any ideas
>     are appreciated.
>
>     print(Chem.MolToSmiles(Chem.MolFromSmiles('cN(C)c', sanitize=False)))
>     print(Chem.MolToSmiles(Chem.MolFromSmiles('cN(c)C', sanitize=False)))
>
>     cN(C)c
>     cN(c)C
>
>       This issue actually came from code for identifying
>     functional groups.
>
>     Kind regards,
>     Pavel
>     _______________________________________________
>     Rdkit-discuss mailing list
>     Rdk...@li...
>     https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
