[Rdkit-discuss] canonical fragment SMILES
From: Pavel P. <pav...@uk...> - 2025-03-27 11:19:34
Hello,
I encountered an issue with the SMILES generated for fragments. Perhaps
someone can suggest a workaround.
I have attached the notebook, but I will also reproduce some of the code here.
We have a structure with two N atoms. For each N we take the atom and its
adjacent atoms to generate a fragment SMILES, and we get different results,
although both SMILES represent the same pattern (only the order of the
atoms differs). I guess this happens because the canonicalization
algorithm takes into account additional information that is missing from
the output SMILES (e.g. ring membership). For instance, if we break the
saturated ring (bond 8-9), we get identical SMILES output.
from rdkit import Chem
mol = Chem.MolFromSmiles('CCn1c2cccc3CCn(c23)c2ccccc12')
print(Chem.MolFragmentToSmiles(mol, [1,2,3,17], canonical=True))
print(Chem.MolFragmentToSmiles(mol, [9,10,11,12], canonical=True))
cN(C)c
cN(c)C
So the question is how to work around this issue. We already have
millions of such patterns, so it would be enough to be able to
canonicalize them after the fact. However, standard canonicalization does
not work, because we have to disable sanitization when parsing the SMILES:
it returns the input SMILES unchanged. Any ideas are appreciated.
print(Chem.MolToSmiles(Chem.MolFromSmiles('cN(C)c', sanitize=False)))
print(Chem.MolToSmiles(Chem.MolFromSmiles('cN(c)C', sanitize=False)))
cN(C)c
cN(c)C
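To make the goal concrete, here is a minimal sketch of an order-independent key for very small fragments. It is plain Python and does not use RDKit; the function name canonical_key, the (labels, bonds) input format, and the '-' bond label are all assumptions made for illustration. It tries every atom ordering and keeps the smallest description, so it is only practical for fragments of a handful of atoms.

```python
from itertools import permutations


def canonical_key(labels, bonds):
    """Return an order-independent key for a small labelled graph.

    labels: list of atom labels, e.g. ['c', 'N', 'C', 'c'] (hypothetical format)
    bonds:  list of (i, j, bond_label) tuples indexing into labels
    """
    n = len(labels)
    best = None
    for perm in permutations(range(n)):
        # perm[old_index] gives the new index of that atom
        new_labels = [None] * n
        for old, new in enumerate(perm):
            new_labels[new] = labels[old]
        # rewrite every bond with the new indices, smaller index first
        new_bonds = sorted(
            (min(perm[i], perm[j]), max(perm[i], perm[j]), label)
            for i, j, label in bonds
        )
        candidate = (tuple(new_labels), tuple(new_bonds))
        # keep the lexicographically smallest description seen so far
        if best is None or candidate < best:
            best = candidate
    return best


# The two fragments from above, written as atom labels plus bonds.
# The '-' bond label is a placeholder, not something read from RDKit.
frag_a = (['c', 'N', 'C', 'c'], [(0, 1, '-'), (1, 2, '-'), (1, 3, '-')])
frag_b = (['c', 'N', 'c', 'C'], [(0, 1, '-'), (1, 2, '-'), (1, 3, '-')])

print(canonical_key(*frag_a) == canonical_key(*frag_b))
```

Since both fragments are the same labelled graph, the two keys compare equal regardless of the atom order in the input. The cost grows factorially with the number of atoms, so this is a brute-force illustration rather than a replacement for a real canonicalization algorithm. Feeding it from the existing patterns would still require extracting atom labels and bonds from each unsanitized molecule, which is not shown here.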
This issue actually arose in our code for identifying functional
groups.
Kind regards,
Pavel