Re: [Rdkit-discuss] canonical fragment SMILES
From: Pavel P. <pav...@uk...> - 2025-03-28 07:57:30
Thank you, Wim. It works. An even simpler solution might be to remove all atoms except the required ones. I had to guess :)

However, this is a bug in recent RDKit versions. The function MolFragmentToSmiles works correctly in version 2023, but not in 2024.

On 28/03/2025 00:10, Wim Dehaen wrote:
> Pavel,
> this is a bit hacky, but you can try the below:
> ```
> from rdkit import Chem
>
> def get_frag_smi(mol,frag_atoms):
>     if len(frag_atoms) > 1:
>         b2b = [] # bonds to break
>         fsmi = "" #fragment smiles
>         # get bonds outside of fragment
>         for b in mol.GetBonds():
>             b_idx = b.GetBeginAtomIdx()
>             e_idx = b.GetEndAtomIdx()
>             if e_idx not in frag_atoms\
>             or b_idx not in frag_atoms:
>                 b2b.append(b.GetIdx())
>         # break all bonds except those in fragments
>         fmol = Chem.FragmentOnBonds(mol,b2b,addDummies=0)
>         smis = Chem.MolToSmiles(fmol).split(".")
>         # retain the only fragment with more than one atom in there
>         while fsmi == "":
>             smi = smis.pop(0)
>             m = Chem.MolFromSmiles(smi,sanitize=False)
>             if len(m.GetAtoms()) > 1:
>                 fsmi = smi
>     else: #one atom, no canonicalize needed
>         fsmi = Chem.MolFragmentToSmiles(mol, frag_atoms)
>     return fsmi
> ```
> it is based on the observation/assumption that FragmentOnBonds() and
> then MolToSmiles() canonicalizes the fragments cleanly.
>
> print(get_frag_smi(mol,[1,2,3,17]))
> print(get_frag_smi(mol,[9,10,11,12]))
>
> prints `cN(c)O` twice.
>
> best wishes,
> wim
>
> On Thu, Mar 27, 2025 at 12:23 PM Pavel Polishchuk
> <pav...@uk...> wrote:
>
>     Hello,
>
>     I encountered an issue with SMILES of fragments. Maybe someone
>     can suggest a workaround.
>     I attached the notebook, but will also reproduce some code here.
>
>     We have a structure with two Ns. We take an N atom and its
>     adjacent atoms to make a fragment SMILES and get different
>     results, although the SMILES represent the same pattern (only the
>     order of atoms is different). I guess this happens due to the
>     canonicalization algorithm, which takes into account some
>     additional information missing in the output SMILES (e.g. ring
>     membership). For instance, if we break the saturated cycle (bond
>     8-9), we get identical SMILES output.
>
>     mol = Chem.MolFromSmiles('CCn1c2cccc3CCn(c23)c2ccccc12')
>
>     print(Chem.MolFragmentToSmiles(mol, [1,2,3,17], canonical=True))
>     print(Chem.MolFragmentToSmiles(mol, [9,10,11,12], canonical=True))
>
>     cN(C)c
>     cN(c)C
>
>     So, the question is how to work around this issue. We already
>     have millions of such patterns, so it would be enough if we were
>     able to canonicalize them. However, standard canonicalization does
>     not work, because we have to disable sanitization during SMILES
>     parsing. It returns the same output as the input SMILES. Any ideas
>     are appreciated.
>
>     print(Chem.MolToSmiles(Chem.MolFromSmiles('cN(C)c', sanitize=False)))
>     print(Chem.MolToSmiles(Chem.MolFromSmiles('cN(c)C', sanitize=False)))
>
>     cN(C)c
>     cN(c)C
>
>     This issue actually came from the code for identification of
>     functional groups.
>
>     Kind regards,
>     Pavel
>     _______________________________________________
>     Rdkit-discuss mailing list
>     Rdk...@li...
>     https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
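
For reference, the simpler alternative mentioned at the top of this message (removing all atoms except the required ones) might look roughly like the sketch below. The helper name `frag_smiles_by_atom_removal` is made up for illustration and is not part of RDKit; it only assumes that `Chem.RWMol`, `RemoveAtom` and `Chem.MolToSmiles` behave as usual on an unsanitized copy. Whether the result is canonical for arbitrary atom orderings is not confirmed in this thread, and the unsanitized round-trip check in the quoted message suggests it should be verified on real data first.

```python
from rdkit import Chem


def frag_smiles_by_atom_removal(mol, frag_atoms):
    """Hypothetical helper: SMILES of the substructure spanned by frag_atoms.

    Copies the molecule, deletes every atom that is not in frag_atoms,
    and writes what is left with MolToSmiles.
    """
    keep = set(frag_atoms)
    rwmol = Chem.RWMol(mol)
    to_remove = [a.GetIdx() for a in mol.GetAtoms() if a.GetIdx() not in keep]
    # Delete from the highest index down so the remaining indices stay valid.
    for idx in sorted(to_remove, reverse=True):
        rwmol.RemoveAtom(idx)
    # No sanitization here: aromatic flags on the kept atoms are left untouched.
    return Chem.MolToSmiles(rwmol)


mol = Chem.MolFromSmiles('CCn1c2cccc3CCn(c23)c2ccccc12')
smi_a = frag_smiles_by_atom_removal(mol, [1, 2, 3, 17])
smi_b = frag_smiles_by_atom_removal(mol, [9, 10, 11, 12])
print(smi_a)
print(smi_b)
print(smi_a == smi_b)
```

Compared with breaking bonds via FragmentOnBonds(), this avoids splitting the output on "." and searching for the multi-atom piece, at the cost of copying the whole molecule for every fragment.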