From: <ma...@eb...> - 2009-02-20 09:57:28
|
>> I've been testing substructure queries with the ExtendedFingerprinter.
>> Attached are two mol files in *.txt, and their pictures in *.png format)
>>
>> My question is if the query molecule should be considered a subgraph of
>> 35623? (I assume it should)
>
>No, that is not a subgraph. The "color" of both edges and vertices
>counts (bond order and element symbols).
>
>Most sophisticated chemical query systems support wild cards for bonds
>(like "any bond order"). If that were defined for all the bonds in the
>query, 35623 would be a hit. That would, as discussed, be a feature for
>later versions.
>

Thanks Andrew and Chris. I'm still a bit puzzled. First of all, I think
the rings in 35623 are aromatic (I just checked the rings' bond
aromaticity flags using that Java program I attached before).

Secondly Chris, if you draw that query in PubChem for substructure query
searching, you get 1625 hits, of which many look to me suspiciously like
35623.

cheers,
Mark
|
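The quoted rule (element symbols and bond orders both count, unless the query uses an "any bond order" wildcard) can be illustrated with a small sketch. This is plain Python, not CDK code: the function name `is_substructure`, the `ANY_BOND` marker, and the dictionary-based molecule representation are assumptions made for illustration, and the brute-force search is only practical for tiny examples such as a six-membered ring.

```python
from itertools import permutations

ANY_BOND = "any"  # hypothetical wildcard marker, not a CDK constant


def is_substructure(q_atoms, q_bonds, t_atoms, t_bonds):
    """Brute-force labelled subgraph test (tiny examples only).

    atoms: list of element symbols; bonds: {(i, j): order}.
    """
    # Make target bonds searchable in both directions.
    t_lookup = {}
    for (a, b), order in t_bonds.items():
        t_lookup[(a, b)] = order
        t_lookup[(b, a)] = order

    n_q = len(q_atoms)
    for mapping in permutations(range(len(t_atoms)), n_q):
        # Vertex "color": element symbols must agree.
        if any(q_atoms[i] != t_atoms[mapping[i]] for i in range(n_q)):
            continue
        # Edge "color": every query bond needs a target bond of the same
        # order, unless the query bond is the wildcard.
        for (i, j), q_order in q_bonds.items():
            t_order = t_lookup.get((mapping[i], mapping[j]))
            if t_order is None:
                break
            if q_order != ANY_BOND and q_order != t_order:
                break
        else:
            return True
    return False


ring = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
carbons = ["C"] * 6
saturated_query = {bond: 1 for bond in ring}          # all single bonds
wildcard_query = {bond: ANY_BOND for bond in ring}    # "any bond order"
kekule_target = {bond: (2 if k % 2 == 0 else 1) for k, bond in enumerate(ring)}

print(is_substructure(carbons, saturated_query, carbons, kekule_target))  # False
print(is_substructure(carbons, wildcard_query, carbons, kekule_target))   # True
```

With strict bond orders the all-single-bond ring does not match a ring drawn with alternating double bonds; with the wildcard on every query bond it does, which mirrors the behaviour described in the quoted reply.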
From: <ma...@eb...> - 2009-02-20 14:23:11
|
>
>Looking at the structure, there's no place with 3 connected double
>bonds. Or even two for that matter.
>
>I suspect this is a PubChem tweak because aromaticity is an ambiguous
>definition and people sketching the structure might not be as
>sensitive to the problems that can arise.

Thanks again Andrew/Chris, this gives me some stuff to think about.

For the development of OrChem (a CDK-based Oracle plugin) this is
relevant: it means that OrChem's substructure searches will be more
restrictive than PubChem's. Consistently, Rajarshi's new VF2 isomorphism
class also considers my previous example query/target not to be a
substructure case (so that's good).

My concern would be that people get used to PubChem as the default way of
searching, and any user would expect something like OrChem to work with
the same query/result behaviour. But that won't be the case then.

Mark

(PS - resent this e-mail, didn't go through the first time it seems)
|
From: Andrew D. <da...@da...> - 2009-02-20 14:41:56
|
On Feb 20, 2009, at 3:23 PM, ma...@eb... wrote:
> My concern would be that people get used to PubChem as the default
> way of searching, and any user would expect something like Orchem
> to work with the same query/result behaviour. But that won't be the
> case then.

Ask your target users to see what they expect for a few of those
cases. What's the population overlap between them and PubChem users?

Perhaps the answers are "prefer CDK's answer" and "very small."

Andrew
da...@da...
|
From: <ma...@eb...> - 2009-02-20 18:00:53
|
>
>Looking at the structure, there's no place with 3 connected double
>bonds. Or even two for that matter.
>
>I suspect this is a PubChem tweak because aromaticity is an ambiguous
>definition and people sketching the structure might not be as
>sensitive to the problems that can arise.

Thanks again Andrew, this gives me some stuff to think about.

For the development of OrChem (a CDK-based Oracle plugin) this is
relevant: it means that OrChem's substructure searches will be more
restrictive than PubChem's. Consistently, Rajarshi's new VF2 isomorphism
class also considers my previous example query/target not to be a
substructure case (so that's good).

My concern is that people get used to PubChem as the default way of
searching, and any user would expect OrChem to work with the same
query/result behaviour. But that won't be the case then.

Mark
|
From: Rajarshi G. <rg...@in...> - 2009-02-20 18:09:55
|
On Feb 20, 2009, at 6:58 AM, ma...@eb... wrote:

>>
>> Looking at the structure, there's no place with 3 connected double
>> bonds. Or even two for that matter.
>>
>> I suspect this is a PubChem tweak because aromaticity is an ambiguous
>> definition and people sketching the structure might not be as
>> sensitive to the problems that can arise.
>
> Thanks again Andrew, this gives me some stuff to think about.
>
> For the development of OrChem (a CDK based Oracle plugin) this is
> relevant - it means that OrChem's substructure searches will be more
> restrictive than PubChem's. Consistently, Rajarshi's new VF2
> isomorphism class also considers my previous example query/target
> not being a substructure case (so that's good).

I haven't read this whole thread so maybe you've already done this -
but if you create a query using the 'any atom, any bond' style query
containers, you could ignore bond types and thus achieve a 'looser'
matching.

For the case of fingerprints, one possibility is to have a
GraphOnlyFingerprinter based off the ExtendedFingerprinter (since the
current GraphOnlyFingerprinter ignores rings), which would ignore
bond orders and thus just focus on topology.

-------------------------------------------------------------------
Rajarshi Guha <rg...@in...>
GPG Fingerprint: D070 5427 CC5B 7938 929C DD13 66A1 922C 51E7 9E84
-------------------------------------------------------------------
So the Zen master asked the hot-dog vendor,
"Can you make me one with everything?"
- TauZero on Slashdot
|
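The fingerprint idea above (keep the topology, drop the bond orders) can be sketched as follows. This is a simplified stand-in written in plain Python, not the CDK Fingerprinter, ExtendedFingerprinter, or GraphOnlyFingerprinter: `path_features` and `passes_screen` are hypothetical names, features are kept as readable tuples rather than hashed bits, and the ring information that the ExtendedFingerprinter adds is left out.

```python
BOND_SYMBOLS = {1: "-", 2: "=", 3: "#"}


def path_features(atoms, bonds, max_atoms=3, use_bond_orders=True):
    """Collect linear paths of up to max_atoms atoms as tuples of tokens.

    atoms: list of element symbols; bonds: {(i, j): order}.
    With use_bond_orders=False every bond is written as "~" (topology only).
    """
    adjacency = {i: [] for i in range(len(atoms))}
    for (a, b), order in bonds.items():
        adjacency[a].append((b, order))
        adjacency[b].append((a, order))

    features = set()

    def walk(visited, tokens):
        # Store one canonical direction so A-B=C and C=B-A coincide.
        features.add(min(tokens, tokens[::-1]))
        if len(visited) == max_atoms:
            return
        for neighbour, order in adjacency[visited[-1]]:
            if neighbour in visited:
                continue
            bond = BOND_SYMBOLS[order] if use_bond_orders else "~"
            walk(visited + [neighbour], tokens + (bond, atoms[neighbour]))

    for start in range(len(atoms)):
        walk([start], (atoms[start],))
    return features


def passes_screen(query_features, target_features):
    """Fingerprint pre-screen: every query feature must occur in the target."""
    return query_features <= target_features


ring = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
carbons = ["C"] * 6
saturated = {bond: 1 for bond in ring}
kekule = {bond: (2 if k % 2 == 0 else 1) for k, bond in enumerate(ring)}

strict = passes_screen(path_features(carbons, saturated),
                       path_features(carbons, kekule))
loose = passes_screen(path_features(carbons, saturated, use_bond_orders=False),
                      path_features(carbons, kekule, use_bond_orders=False))
print(strict, loose)  # False True
```

The strict screen rejects the pair because the path C-C-C never occurs in the alternating ring, while the topology-only screen lets it through to the (separate) isomorphism check.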
From: Egon W. <ego...@gm...> - 2009-02-20 18:43:53
|
Mark,

On Fri, Feb 20, 2009 at 7:09 PM, Rajarshi Guha <rg...@in...> wrote:
> On Feb 20, 2009, at 6:58 AM, ma...@eb... wrote:
> I haven't read this whole thread so maybe you've already done this -
> but if you create a query using the 'any atom, any bond' style query
> containers, you could ignore bond types and thus achieve a 'looser'
> matching.
>
> For the case of fingerprints, one possibility is to have a
> GraphOnlyFingerprinter based off the ExtendedFingerprinter (since the
> current GraphOnlyFingerprinter ignores rings), which would ignore
> bond orders - and thus just focus on topology

I second these proposals... The original Fingerprinter takes into
account aromaticity... but aromaticity indeed is a tricky thing...
There was also recently a discussion on these matters.

What is more interesting is to encode whether a ring bond is sp2-sp2,
or even delocalized... these things are much better defined than
aromaticity, and not overly difficult to implement...

So, maybe it is interesting to add a variant to the above-mentioned
list, which matches like the current Fingerprinter, with the
difference that not aromatic but sp2-sp2 ring bonds are dealt with in
a special way...

Egon

--
Post-doc @ Uppsala University
http://chem-bla-ics.blogspot.com/
|
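One way to read the sp2-sp2 suggestion is sketched below in plain Python. It is an assumption-laden illustration, not CDK code: the function name `sp2_sp2_ring_bonds` is hypothetical, ring membership is supplied by the caller instead of being perceived, and "sp2" is approximated as "the atom has exactly one double bond", which is only a rough proxy that fits simple carbon skeletons.

```python
def sp2_sp2_ring_bonds(n_atoms, bonds, ring_bonds):
    """Return the ring bonds whose two end atoms both look sp2.

    bonds: {(i, j): order}; ring_bonds: iterable of (i, j) keys that the
    caller already knows to be ring bonds (no ring perception here).
    """
    double_bonds_at = [0] * n_atoms
    for (a, b), order in bonds.items():
        if order == 2:
            double_bonds_at[a] += 1
            double_bonds_at[b] += 1
    # Rough proxy: exactly one double bond on an atom is treated as sp2.
    return {
        frozenset(bond)
        for bond in ring_bonds
        if all(double_bonds_at[atom] == 1 for atom in bond)
    }


ring = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
kekule_a = {bond: (2 if k % 2 == 0 else 1) for k, bond in enumerate(ring)}
kekule_b = {bond: (1 if k % 2 == 0 else 2) for k, bond in enumerate(ring)}
cyclohexene = {bond: (2 if bond == (0, 1) else 1) for bond in ring}

labels_a = sp2_sp2_ring_bonds(6, kekule_a, ring)
labels_b = sp2_sp2_ring_bonds(6, kekule_b, ring)
labels_ene = sp2_sp2_ring_bonds(6, cyclohexene, ring)
print(labels_a == labels_b, len(labels_a), labels_ene)
```

Under these assumptions both ways of drawing the alternating six-membered ring give the same labels for all six bonds, whereas the ring with a single double bond labels only that one bond, so the label does not depend on an aromaticity model.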
From: Andrew D. <da...@da...> - 2009-02-21 02:22:18
|
On Feb 20, 2009, at 7:09 PM, Rajarshi Guha wrote:
> I haven't read this whole thread so maybe you've already done this -
> but if you create a query using the 'any atom, any bond' style query
> containers, you could ignore bond types and thus achieve a 'looser'
> matching.

My guess was that only bonds inside of rings suppressed the difference
between single and double bonds.

Searching for C#CCCC finds plenty of triple-bond containing (and
only triple-bond containing) structures. Therefore, PubChem isn't
doing an "any bond" style query.

Andrew
da...@da...
|
From: Andrew D. <da...@da...> - 2009-02-20 10:25:11
|
On Feb 20, 2009, at 10:57 AM, ma...@eb... wrote:
> Thanks Andrew and Chris. I'm still a bit puzzled - first of all I
> think the rings in 35623 are aromatic (I just checked the rings' bond
> aromaticity flags using that Java program I attached before).

They could be. I only checked the SD file and I'm not so experienced
with SD files as I am with SMILES. But the query structure was
definitely not aromatic.

> Secondly Chris, if you draw that query in Pubchem for substructure
> query searching, you get 1625 hits of which many look to me
> suspiciously like 35623..

That's a different question. Now you're asking "why is the CDK
definition of substructure different than the PubChem definition?"

Here's your query as a SMILES, generated via PubChem's sketcher.

  C1C2C(CCC1)CC4C3C2CCCC3CCN4

Here's an example matched target, CID 2215

  http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=2215&loc=ec_rcs

It has the SMILES

  CN1CCC2=CC=CC3=C2C1CC4=C3C(=C(C=C4)O)O

Clearly the latter contains many double bonds while the former only
contains single bonds.

This is a guess: PubChem might treat single, double, and aromatic
bonds as the same if they are between carbons in a ring. For example,
I did a search for this

  C1=CC=C=C=CC1

which is a 7-member ring containing 3 double-bonds in a row
(C=C=C=C). One of the PubChem matches was

  http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=10330236&loc=ec_rcs

  SMILES C1C=CC=CC2=C1C3=CC=CC=CC3=N2

Looking at the structure, there's no place with 3 connected double
bonds. Or even two for that matter.

I suspect this is a PubChem tweak because aromaticity is an ambiguous
definition and people sketching the structure might not be as
sensitive to the problems that can arise.

Andrew
da...@da...
|
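The guess above can be written down as a bond-comparison rule. The sketch below is plain Python and encodes only that hypothesis; it is not documented PubChem behaviour and not a CDK API. The `Bond` tuple, its `in_ring` flag (assumed to be computed elsewhere), and the function name `bonds_compatible` are all made up for illustration.

```python
from collections import namedtuple

# elements: pair of element symbols; order: 1, 2, 3 or "aromatic";
# in_ring: ring membership, assumed to be determined elsewhere.
Bond = namedtuple("Bond", "elements order in_ring")

RING_EQUIVALENT_ORDERS = {1, 2, "aromatic"}


def bonds_compatible(query, target):
    """Hypothesised rule: carbon-carbon ring bonds match regardless of
    single/double/aromatic order; every other bond must match exactly."""
    if sorted(query.elements) != sorted(target.elements):
        return False
    carbon_ring_bond = (
        query.in_ring
        and target.in_ring
        and all(element == "C" for element in query.elements)
    )
    if (carbon_ring_bond
            and query.order in RING_EQUIVALENT_ORDERS
            and target.order in RING_EQUIVALENT_ORDERS):
        return True
    return query.order == target.order


ring_single = Bond(("C", "C"), 1, True)
ring_double = Bond(("C", "C"), 2, True)
chain_single = Bond(("C", "C"), 1, False)
chain_triple = Bond(("C", "C"), 3, False)

print(bonds_compatible(ring_single, ring_double))    # True
print(bonds_compatible(chain_triple, chain_single))  # False
```

A predicate like this could be plugged into a subgraph matcher in place of a strict bond-order comparison. Under the hypothesis, a saturated ring query would hit unsaturated ring targets while open-chain bonds, including triple bonds, would still have to match exactly.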
From: Christoph S. <ste...@eb...> - 2009-02-20 13:34:39
|
> Secondly Chris, if you draw that query in Pubchem for substructure
> query searching, you get 1625 hits of which many look to me
> suspiciously like 35623..

Mark,

I can only tell you from my experience in chemical searching that if
you search for the skeleton with single bonds, you should get only
structures with this single-bonded substructure. Again, there might be
"switches" to alter that behaviour, but this is the default. Anything
else should be very explicitly stated in the documentation.

Cheers,

Chris

ma...@eb... wrote:
>>> I've been testing substructure queries with the ExtendedFingerprinter.
>>> Attached are two mol files in *.txt, and their pictures in *.png format)
>>>
>>> My question is if the query molecule should be considered a subgraph of
>>> 35623? (I assume it should)
>> No, that is not a subgraph. The "color" of both edges and vertices
>> counts (bond order and element symbols).
>>
>> Most sophisticated chemical query systems support wild cards for bonds
>> (like "any bond order"). If that were defined for all the bonds in the
>> query, 35623 would be a hit. That would, as discussed, be a feature for
>> later versions.
>>
>
> Thanks Andrew and Chris. I'm still a bit puzzled - first of all I think
> the rings in 35623 are aromatic (I just checked the rings' bond
> aromaticity flags using that Java program I attached before).
>
> Secondly Chris, if you draw that query in Pubchem for substructure query
> searching, you get 1625 hits of which many look to me suspiciously like
> 35623..
>
> cheers,
> Mark
>
> _______________________________________________
> Cdk-devel mailing list
> Cdk...@li...
> https://lists.sourceforge.net/lists/listinfo/cdk-devel

--
Dr. Christoph Steinbeck
Head of Chemoinformatics and Metabolism
European Bioinformatics Institute (EBI)
Wellcome Trust Genome Campus
Hinxton, Cambridge CB10 1SD
UK

Phone +44 1223 49 2640

What is man but that lofty spirit - that sense of enterprise.
... Kirk, "I, Mudd," stardate 4513.3..
|