Re: [Rdkit-discuss] comparing two or more tables of molecules
From: James A. L. <lum...@li...> - 2016-11-29 11:58:51
Hi Steve,

It's definitely not a foolproof method, but we have a KNIME node to compare two tables of molecules. We use it frequently when testing code changes in KNIME development (regression testing nodes).

The node is found in the Community Nodes -> Erlwood Nodes -> Testing -> Molecule Difference Checker (screenshot below). It takes two tables on the two input ports and compares all columns that contain structure data between the two tables. It will fail if there is a different number of structure columns or if they are of a different type (it is also column order sensitive). It will also fail if any given chemical structure changes in a given row between two identically named/typed columns. You can un-check the 'Fail on first error' box so that the node does not fail and instead outputs a view showing only those rows that differ (a view, not a table).

We tried various ways to do these comparisons but ultimately used this Chemaxon method based on a graph comparison, "isSimilarTo" (not similarity or text string based). This performed best for regression testing compared to similarity or string based checking, which are either too relaxed (we care about atom/ring/bond type changes) or too strict (string comparisons fail on minor canonicalization differences). I don't know if there is an RDKit alternative?

https://www.chemaxon.com/marvin/help/developer/beans/api/chemaxon/struc/MoleculeGraph.html#isSimilarTo-chemaxon.struc.MoleculeGraph-

The second screenshot below shows the use of the node when testing the chemical file reader. The output of this file reader node after execution is compared against a reference table saved to disk, and it will (a) pass if the structures are the same, (b) fail if the columns don't match, or (c) fail if the smiles don't match (conversion of input smiles with removal of chirality in this case).

If there was an RDKit equivalent to the isSimilarTo method and it was of use, we could adapt the node to use it.
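
As a rough illustration of the pass/fail rules described above, here is a minimal Python sketch. It is not the node's actual implementation: the names `compare_tables`, `TableMismatch` and `same_structure` are hypothetical, a table is assumed to be an ordered list of (name, type, values) columns, and the row-count check is an added assumption. The structure comparison is pluggable, so a graph-based check could be substituted for the default string equality.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical table model: an ordered list of (column name, column type, values).
Column = Tuple[str, str, Sequence[str]]
Table = Sequence[Column]


class TableMismatch(Exception):
    """Raised when two tables differ in layout or, optionally, in a structure."""


def compare_tables(
    reference: Table,
    candidate: Table,
    same_structure: Callable[[str, str], bool] = lambda a, b: a == b,
    fail_on_first_error: bool = True,
) -> List[Tuple[int, str]]:
    """Return (row index, column name) pairs whose structures differ."""
    if len(reference) != len(candidate):
        raise TableMismatch("different number of structure columns")

    differing: List[Tuple[int, str]] = []
    # zip() pairs columns by position, so the check is column order sensitive.
    for ref_col, new_col in zip(reference, candidate):
        ref_name, ref_type, ref_values = ref_col
        new_name, new_type, new_values = new_col
        if (ref_name, ref_type) != (new_name, new_type):
            raise TableMismatch(f"column mismatch: {ref_name!r} vs {new_name!r}")
        # Assumption (not stated in the post): row counts must also agree.
        if len(ref_values) != len(new_values):
            raise TableMismatch(f"row count differs in column {ref_name!r}")
        for row, (ref_mol, new_mol) in enumerate(zip(ref_values, new_values)):
            if same_structure(ref_mol, new_mol):
                continue
            if fail_on_first_error:
                raise TableMismatch(f"row {row}, column {ref_name!r} differs")
            differing.append((row, ref_name))
    return differing
```

With the default string equality this is the "too strict" case: benzene written as `c1ccccc1` and as `C1=CC=CC=C1` would be reported as different. Passing a structure-aware function as `same_structure` is where a graph comparison would plug in.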
Thanks
James (Eli Lilly / Erl Wood Nodes)

[cid:image001.png@01D24A37.EAB43120]
[cid:image002.png@01D24A37.EAB43120]

From: Stephen O'hagan [mailto:SO...@ma...]
Sent: 28 November 2016 16:25
To: rdk...@li...<mailto:rdk...@li...>
Subject: [Rdkit-discuss] comparing two or more tables of molecules

Has anyone come up with a fool-proof way of matching structurally equivalent molecules?

Unique Smiles or InChI String comparisons don't appear to work, presumably because there are different but equivalent structures, e.g. explicit vs non-explicit H's, Kekule vs Aromatic, isomeric forms vs non-isomeric form, tautomers etc.

I also expect that comparing InChI strings might need something more than just a simple string comparison, such as masking off stereo information when you don't care about stereo isomers.

I assume there are suitable tools within RDKit that can do this?

N.B. I need to collate tables from several sources that have a mix of smiles / InChI / sdf molecular representations. I usually use RDKit via Python and/or Knime.

Cheers,
Steve.
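
The idea of masking off stereo information before comparing InChI strings can be sketched with plain string handling. This is only an illustration under stated assumptions: `mask_stereo` is a hypothetical helper, it assumes InChI strings whose layers are separated by `/`, and it drops every layer after the formula that starts with `b`, `t`, `m` or `s` (the double-bond and tetrahedral stereo layers). It does nothing about tautomers or other sources of mismatch.

```python
# Layer prefixes treated as stereo layers (assumption for this sketch):
# /b double-bond stereo, /t tetrahedral centres, /m and /s stereo flags.
STEREO_PREFIXES = frozenset({"b", "t", "m", "s"})


def mask_stereo(inchi: str) -> str:
    """Return the InChI string with its stereo layers removed."""
    head, *layers = inchi.split("/")
    kept = [
        layer
        for index, layer in enumerate(layers)
        # index 0 is the formula layer; always keep it.
        if index == 0 or layer[:1] not in STEREO_PREFIXES
    ]
    return "/".join([head] + kept)


# Example inputs: the two enantiomers of alanine.
l_alanine = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
d_alanine = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1"

print(l_alanine == d_alanine)                            # False
print(mask_stereo(l_alanine) == mask_stereo(d_alanine))  # True
```

The masked strings could then be used as the join key when collating tables, provided every source has first been converted to InChI by the same tool and settings.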