From: Egon W. <ego...@gm...> - 2012-03-15 15:27:53
|
Hi Kevin, On Wed, Mar 14, 2012 at 6:18 PM, <kev...@sy...> wrote: > I have been having problems with the Deduce Bond System Tool. When I use > the routine FixAromaticBondOrders I find that it quite often times out (100 > seconds) That is because of the AllRingsFinder being used, which is an exhaustive ring searchers and indeed time consuming algorithm. However, it has already been said to not search for rings larger than 7 atoms... so, that combo might indicate a bug. > if the molecule contains 5 or more (possibly unconnected) rings and > one of them has complexities such that a simple solution cannot be found. > On inspecting the source code (version 1.4.2), it appears that all rings in > the molecule being tested are treated combinatorially Which is in fact needed, because you cannot know on beforehand which bonds lack the required 'double' order flag, and which are OK... that is, on the old CDK world... and the Tool has been written in that old world. In fact, this was one of the reasons why I set out to rewrite atom detection in the CDK. > – all combinations of > possible solutions for each ring being tested. It occurred to me that it > would potentially be faster (for complex molecules) to divide the rings up > into groups of interconnected rings and treat these separately. I use this trick somewhere else too, in the aromaticity detector, if not mistaken... > The attached code (which I have been trying out in the LICSS system) does just > this – the core algorithm is unchanged, but is applied sequentially to each > group of interconnected rings rather than to the whole molecule. In my > hands, this makes things much faster. For example, Smiles such as > Oc1c(C2CC(Cc3ccccc23)c4ccc(cc4)c5ccccc5)c(=O)oc6ccccc16 (Difenacoum) is now > configured very quickly with four kekulised aromatic rings, the remaining > pyranone ring (which isn’t recognised as aromatic) is returned unchanged. > With the previous algorithm, this molecule timed out. I was discussing the class with Ola (Bioclipse) this week, and since so many people use input formats where double bond localization is missing (SMILES as above, but also MDL molfiles with query bond type 4), we just have to have a good algorithm. I think your approach is a good one, and wonder why it times out in the first place, because the AllRingsFinder is supposed to stop after finding 7 atoms anyway... that should be enough a termination condition for it the be fast anyway... I think we might have a regression... The other thought is that both input types have more information: which bonds do not have a well-defined bond order. For SMILES this is each bond between two lower case organic subset atoms. For MDL molfiles it is the bonds of type 4. Thus, I think we should not search in just small ring systems (as they might easily get large enough for the AllRingsFinger to explode on anyway), but restrict the searching on just the unknown bits. This idea I was planning to discuss next week Thursday with Ola and Klas (cc-ed), looking into implementing that approach. That should even be faster and I expect more accurate, as we might not fall into the trap changing bond orders that were explicitly single in the SMILES or molfile in the first place... (That may sound obscure, but was in fact one of the issue with the double bond recovering tool we used before the we got this code donated by the EPA...) That set aside... we always have room in the CDK for alternative algorithms, and the thing I am thinking of requires more prior information that your suggestion does... > When I originally tried this approach to grouping the interconnected rings, > I was expecting to have to rewrite the IsStructureOK routine which checks > each structure for being ‘properly’ aromatic. The current 1.4.x DeduceBondSystemTool does not know about the atom types in the structure... how does your code figure out what is aromatic? > In fact, no modification is > necessary in that it quite happily accepts rings with no double bonds as > aromatic (if the atoms are sp2 hybridised). This can lead to problems with > both the existing algorithm (and my variant). For example, anthracene or > acridine are sometimes kekulised with the central ring containing no double > bonds (depending on the exact ordering of the rings detected by the ring > finder – itself dependent, I believe, on how the Smiles parser happens to > set up the IMolecule). If you do decide to look at my suggested approach, > and also fix the aromaticity checking, it will be necessary to make sure > that it is applied only to those rings in each interconnected group. I very much like to work things out... one aspect is, is to isolate a set of good structures that (alternative) implementations should be able to solve in reasonable time... The second thing is how we want to do things. The DeduceBondSystemTool effectively uses it's own atom type code; I personally like to reuse the central atom typer, which we can independently test. Same for aromaticity detection... but, aromaticity itself is not important for deducing double bond locations... it's just the delocalized nature we want to get rid of, and should not even care about what is aromatic, anti-aromatic, or non-aromatic... Thoughts? Egon -- Dr E.L. Willighagen Postdoctoral Researcher Department of Bioinformatics - BiGCaT Maastricht University (http://www.bigcat.unimaas.nl/) Homepage: http://egonw.github.com/ LinkedIn: http://se.linkedin.com/in/egonw Blog: http://chem-bla-ics.blogspot.com/ PubList: http://www.citeulike.org/user/egonw/tag/papers |