Re: [Rdkit-discuss] canonical fragment SMILES
From: Pavel P. <pav...@uk...> - 2025-03-28 07:57:30
Thank you, Wim. It works. An even simpler solution might be to remove all
atoms except the required ones. I should have guessed :)
However, this is a bug in recent RDKit versions. The function
MolFragmentToSmiles works correctly in version 2023, but not in 2024.
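
The "keep only the required atoms" idea can be illustrated without RDKit. The following is only a minimal sketch under stated assumptions: the atom labels and the bond list are transcribed by hand from the SMILES in this thread (so the list order need not match RDKit's bond indices), `induced_fragment` and `star_key` are hypothetical helper names, bond orders are ignored, and the labels are copied from the input string rather than from RDKit's aromaticity perception. It only shows that both fragments reduce to the same "central atom plus neighbours" pattern once atom order is factored out.

```python
# Atom labels copied from 'CCn1c2cccc3CCn(c23)c2ccccc12'
# (lowercase = written as aromatic in the input string).
ATOMS = list("CCncccccCCnccccccc")

# Bonds transcribed by hand from the same SMILES as (begin, end) atom indices.
# Assumption: list order is not guaranteed to match RDKit bond indices.
BONDS = [
    (0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (7, 8),
    (8, 9), (9, 10), (10, 11), (11, 3), (11, 7), (10, 12), (12, 13),
    (13, 14), (14, 15), (15, 16), (16, 17), (17, 2), (17, 12),
]


def induced_fragment(atoms, bonds, frag_atoms):
    """Keep only the requested atoms and the bonds between them."""
    keep = set(frag_atoms)
    kept_bonds = [(a, b) for a, b in bonds if a in keep and b in keep]
    return {i: atoms[i] for i in keep}, kept_bonds


def star_key(atoms, bonds, frag_atoms, center):
    """Order-independent key for a 'central atom + neighbours' fragment."""
    labels, kept_bonds = induced_fragment(atoms, bonds, frag_atoms)
    neighbours = []
    for a, b in kept_bonds:
        if center not in (a, b):
            raise ValueError("fragment is not a star around the given center")
        neighbours.append(labels[b if a == center else a])
    return labels[center], tuple(sorted(neighbours))


key_a = star_key(ATOMS, BONDS, [1, 2, 3, 17], center=2)
key_b = star_key(ATOMS, BONDS, [9, 10, 11, 12], center=10)
print(key_a == key_b)
```

Such a key is only meaningful for star-shaped fragments like the ones discussed here; anything with bonds between the neighbours raises an error, and general fragments still need a real canonicalizer.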
On 28/03/2025 00:10, Wim Dehaen wrote:
> Pavel,
> this is a bit hacky, but you can try the below:
> ```
> from rdkit import Chem
>
> def get_frag_smi(mol, frag_atoms):
>     if len(frag_atoms) > 1:
>         b2b = []   # bonds to break
>         fsmi = ""  # fragment SMILES
>         # collect bonds with at least one atom outside the fragment
>         for b in mol.GetBonds():
>             b_idx = b.GetBeginAtomIdx()
>             e_idx = b.GetEndAtomIdx()
>             if e_idx not in frag_atoms \
>                     or b_idx not in frag_atoms:
>                 b2b.append(b.GetIdx())
>         # break all bonds except those inside the fragment
>         fmol = Chem.FragmentOnBonds(mol, b2b, addDummies=0)
>         smis = Chem.MolToSmiles(fmol).split(".")
>         # keep the only piece that has more than one atom
>         while fsmi == "":
>             smi = smis.pop(0)
>             m = Chem.MolFromSmiles(smi, sanitize=False)
>             if len(m.GetAtoms()) > 1:
>                 fsmi = smi
>     else:  # one atom, no canonicalization needed
>         fsmi = Chem.MolFragmentToSmiles(mol, frag_atoms)
>     return fsmi
> ```
> It is based on the observation/assumption that FragmentOnBonds()
> followed by MolToSmiles() canonicalizes the fragments cleanly.
> print(get_frag_smi(mol,[1,2,3,17]))
> print(get_frag_smi(mol,[9,10,11,12]))
> prints `cN(c)O` twice.
>
> best wishes,
> wim
>
> On Thu, Mar 27, 2025 at 12:23 PM Pavel Polishchuk
> <pav...@uk...> wrote:
>
> Hello,
>
> I encountered an issue with fragment SMILES. Maybe someone can
> suggest a workaround.
> I attached the notebook, but will also reproduce some code here.
>
> We have a structure with two N atoms. For each of them we take the
> N atom and its adjacent atoms to make a fragment SMILES, and we get
> different results even though both SMILES represent the same
> pattern (only the order of atoms is different). I guess this
> happens because the canonicalization algorithm takes into account
> some additional information that is missing from the output SMILES
> (e.g. ring membership). For instance, if we break the saturated
> ring (bond 8-9), we get identical SMILES output.
>
> mol = Chem.MolFromSmiles('CCn1c2cccc3CCn(c23)c2ccccc12')
>
>
> print(Chem.MolFragmentToSmiles(mol, [1,2,3,17], canonical=True))
> print(Chem.MolFragmentToSmiles(mol, [9,10,11,12], canonical=True))
>
> cN(C)c
> cN(c)C
>
> So, the question is how to work around this issue. We already
> have millions of such patterns, so it would be enough to be able
> to canonicalize them. However, standard canonicalization does not
> work, because we have to disable sanitization during SMILES
> parsing: it returns the input SMILES unchanged. Any ideas are
> appreciated.
>
> print(Chem.MolToSmiles(Chem.MolFromSmiles('cN(C)c', sanitize=False)))
> print(Chem.MolToSmiles(Chem.MolFromSmiles('cN(c)C', sanitize=False)))
>
> cN(C)c
> cN(c)C
>
> This issue actually arose in code for identifying functional
> groups.
>
> Kind regards,
> Pavel