[Rdkit-discuss] canonical fragment SMILES
From: Pavel P. <pav...@uk...> - 2025-03-27 11:19:34
Hello,

I encountered an issue with SMILES of fragments; maybe someone can suggest a workaround. I attached the notebook, but will also reproduce some code here.

We have a structure with two N atoms. For each N, we take the atom and its adjacent atoms to make a fragment SMILES, and we get different results, although both SMILES represent the same pattern (only the order of atoms is different). I guess this happens because the canonicalization algorithm takes into account some additional information that is missing from the output SMILES (e.g. ring membership). For instance, if we break the saturated cycle (bond 8-9), we get identical SMILES output.

    from rdkit import Chem

    mol = Chem.MolFromSmiles('CCn1c2cccc3CCn(c23)c2ccccc12')
    print(Chem.MolFragmentToSmiles(mol, [1,2,3,17], canonical=True))
    print(Chem.MolFragmentToSmiles(mol, [9,10,11,12], canonical=True))

Output:

    cN(C)c
    cN(c)C

So the question is how to work around this issue. We already have millions of such patterns, so it would be enough if we could canonicalize them. However, standard canonicalization does not work, because we have to disable sanitization during SMILES parsing: it returns the same output as the input SMILES. Any ideas are appreciated.

    print(Chem.MolToSmiles(Chem.MolFromSmiles('cN(C)c', sanitize=False)))
    print(Chem.MolToSmiles(Chem.MolFromSmiles('cN(c)C', sanitize=False)))

Output:

    cN(C)c
    cN(c)C

This issue actually came from the code for identification of functional groups.

Kind regards,
Pavel
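One possible workaround, sketched here only for the special case shown above (a central atom plus its directly attached neighbours), is to build the pattern string from sorted neighbour tokens instead of relying on the canonical atom order. The helper name `star_fragment_key` and its input format are hypothetical and not part of RDKit; the sketch assumes the atom tokens (e.g. `c` for an aromatic carbon, `C` for an aliphatic one) and bond symbols have already been extracted from the molecule, and it does not handle larger fragments.

```python
def star_fragment_key(center, neighbors):
    """Build an order-independent, SMILES-like key for a star-shaped fragment.

    center:    SMILES token of the central atom, e.g. "N"   (hypothetical input format)
    neighbors: iterable of (bond_symbol, atom_token) pairs,
               e.g. ("", "c") for a default bond or ("=", "O") for a double bond
    """
    # Sorting the branch strings removes any dependence on atom order.
    branches = sorted(bond + atom for bond, atom in neighbors)
    if not branches:
        return center
    *inner, last = branches
    # All branches except the last go in parentheses, as in ordinary SMILES.
    return center + "".join(f"({b})" for b in inner) + last


# The two fragments from the message differ only in neighbour order.
key_a = star_fragment_key("N", [("", "c"), ("", "C"), ("", "c")])
key_b = star_fragment_key("N", [("", "c"), ("", "c"), ("", "C")])
print(key_a, key_b)
```

With RDKit, the tokens could presumably be derived from `atom.GetSymbol()` and `atom.GetIsAromatic()` for each neighbour of the central atom; that step is left out here so the sketch stays independent of how the fragments were generated.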