#237 getSubgraphMaps() hangs...

closed
nobody
None
5
2012-10-08
2004-11-04
No

Hi!

Currently I am trying to use the marvelous CDK library
to introduce some order into the KEGG LIGAND database's
compound chaos... The following works fine with
hundreds of molecules, but strangely fails with
carnitine and L-carnitine (ans maybe a few others):

--------8<--------

import java.io.;
import java.util.
;

import org.openscience.cdk.;
import org.openscience.cdk.exception.
;
import org.openscience.cdk.io.;
import org.openscience.cdk.isomorphism.
;
import org.openscience.cdk.smiles.;
import org.openscience.cdk.tools.
;

public class CarnitineTest {

// MDL MOL file content for carnitine,
// as delivered with KEGG LIGAND (entry C00487)
private static final String CARNITINE_MDL_STRING =
"\r\n" +
" ISISHOST03040423022D 1 1.00000 0.00000
469\r\n" +
"\r\n" +
" 11 10 0 0 0 999 V2000\r\n" +
" 0.5379 0.5379 0.0000 C 0 0 0 0 0
0 0 0 0\r\n" +
" 1.2276 0.0793 0.0000 N 0 3 3 0 0
0 0 0 0\r\n" +
" -0.1966 0.1690 0.0000 C 0 0 3 0 0
0 0 0 0\r\n" +
" 1.1759 -0.7414 0.0000 C 0 0 0 0 0
0 0 0 0\r\n" +
" 1.9655 0.4483 0.0000 C 0 0 0 0 0
0 0 0 0\r\n" +
" 2.0069 -0.1828 0.0000 C 0 0 0 0 0
0 0 0 0\r\n" +
" -0.9241 0.5414 0.0000 C 0 0 0 0 0
0 0 0 0\r\n" +
" -0.2034 -0.7276 0.0000 O 0 0 0 0 0
0 0 0 0\r\n" +
" -1.6069 0.0931 0.0000 C 0 0 0 0 0
0 0 0 0\r\n" +
" -2.3655 0.4621 0.0000 O 0 0 0 0 0
0 0 0 0\r\n" +
" -1.6138 -0.6793 0.0000 O 0 0 0 0 0
0 0 0 0\r\n" +
" 1 2 1 0 0 0\r\n" +
" 1 3 1 0 0 0\r\n" +
" 2 4 1 0 0 0\r\n" +
" 2 5 1 0 0 0\r\n" +
" 2 6 1 0 0 0\r\n" +
" 3 7 1 0 0 0\r\n" +
" 3 8 1 4 0 0\r\n" +
" 7 9 1 0 0 0\r\n" +
" 9 10 1 0 0 0\r\n" +
" 9 11 2 0 0 0\r\n" +
"M CHG 1 2 1\r\n" +
"M END\r\n" +
"";

// MDL MOL file content for L-carnitine,
// as delivered with KEGG LIGAND (entry C00318)
private static final String L_CARNITINE_MDL_STRING =
"\r\n" +
" ISISHOST03040423012D 1 1.00000 0.00000
310\r\n" +
"\r\n" +
" 11 10 0 1 0 999 V2000\r\n" +
" 0.5379 0.5379 0.0000 C 0 0 0 0 0
0 0 0 0\r\n" +
" 1.2241 0.0793 0.0000 N 0 3 3 0 0
0 0 0 0\r\n" +
" -0.1966 0.1690 0.0000 C 0 0 2 0 0
0 0 0 0\r\n" +
" 1.1724 -0.7448 0.0000 C 0 0 0 0 0
0 0 0 0\r\n" +
" 1.9655 0.4483 0.0000 C 0 0 0 0 0
0 0 0 0\r\n" +
" 2.0103 -0.1793 0.0000 C 0 0 0 0 0
0 0 0 0\r\n" +
" -0.9241 0.5414 0.0000 C 0 0 0 0 0
0 0 0 0\r\n" +
" -0.1966 -0.7276 0.0000 O 0 0 0 0 0
0 0 0 0\r\n" +
" -1.6103 0.0931 0.0000 C 0 0 0 0 0
0 0 0 0\r\n" +
" -2.3655 0.4621 0.0000 O 0 0 0 0 0
0 0 0 0\r\n" +
" -1.6138 -0.6793 0.0000 O 0 0 0 0 0
0 0 0 0\r\n" +
" 1 2 1 0 0 0\r\n" +
" 1 3 1 0 0 0\r\n" +
" 2 4 1 0 0 0\r\n" +
" 2 5 1 0 0 0\r\n" +
" 2 6 1 0 0 0\r\n" +
" 3 7 1 0 0 0\r\n" +
" 3 8 1 1 0 0\r\n" +
" 7 9 1 0 0 0\r\n" +
" 9 10 1 0 0 0\r\n" +
" 9 11 2 0 0 0\r\n" +
"M CHG 1 2 1\r\n" +
"M END\r\n" +
"";

public static void main(String[] args) {
SmilesGenerator sg = new SmilesGenerator();

// Create CDK Molecules from MDL strings
Molecule carnitine =

createMoleculeFromMDLString(CARNITINE_MDL_STRING);
Molecule L_carnitine =
createMoleculeFromMDLString(L_CARNITINE_MDL_STRING);

// Now adding explicit hydrogens to both molecules 
HydrogenAdder ha = new HydrogenAdder();
try {
  ha.addExplicitHydrogensToSatisfyValency(carnitine);
  ha.addExplicitHydrogensToSatisfyValency(L_carnitine);
}
catch (IOException e) {
  e.printStackTrace();
}
catch (ClassNotFoundException e) {
  e.printStackTrace();
}
catch (CDKException e) {
  e.printStackTrace();
}

// Get a list of overlapping graph strucures (here:

yields 1 overlap)
List overlaps =
UniversalIsomorphismTester.getOverlaps(carnitine,
L_carnitine);
// Iterate through all overlaps
for (int i = 0; i < overlaps.size(); i++) {
// Get the ith overlap
AtomContainer overlap =
(AtomContainer)overlaps.get(i);
System.out.print("Getting map list for overlap "
+ i + "... ");
// Now get a list of SubgraphMaps between
carnitine and the current overlap
List mapList =
UniversalIsomorphismTester.getSubgraphMaps(carnitine,
overlap);
// ...program hangs!!!
System.out.println("done!");
}
}

// Returns a molecule from a passed MDL MOL string
private static Molecule
createMoleculeFromMDLString(String s) {
StringReader sr = new StringReader(s);
MDLReader mr = new MDLReader(sr);
Molecule mol = null;
try {
mol = (Molecule)mr.read(new Molecule());
}
catch (CDKException e) {
e.printStackTrace();
}
return mol;
}

}

--------8<--------

When the program hangs, the stack looks like

Thread [main] (Suspended)
RGraph.projectG1(BitSet) line: 580
RGraph.solution(BitSet) line: 362
RGraph.parseRec(BitSet, BitSet, BitSet) line: 270
RGraph.parseRec(BitSet, BitSet, BitSet) line: 327
RGraph.parseRec(BitSet, BitSet, BitSet) line: 327
RGraph.parseRec(BitSet, BitSet, BitSet) line: 327
RGraph.parseRec(BitSet, BitSet, BitSet) line: 327
RGraph.parseRec(BitSet, BitSet, BitSet) line: 327
RGraph.parseRec(BitSet, BitSet, BitSet) line: 327
RGraph.parseRec(BitSet, BitSet, BitSet) line: 327
RGraph.parseRec(BitSet, BitSet, BitSet) line: 327
RGraph.parseRec(BitSet, BitSet, BitSet) line: 327
RGraph.parseRec(BitSet, BitSet, BitSet) line: 327
RGraph.parseRec(BitSet, BitSet, BitSet) line: 327
RGraph.parseRec(BitSet, BitSet, BitSet) line: 327
RGraph.parseRec(BitSet, BitSet, BitSet) line: 327
RGraph.parseRec(BitSet, BitSet, BitSet) line: 327
RGraph.parseRec(BitSet, BitSet, BitSet) line: 327
RGraph.parseRec(BitSet, BitSet, BitSet) line: 327
RGraph.parseRec(BitSet, BitSet, BitSet) line: 327
RGraph.parseRec(BitSet, BitSet, BitSet) line: 327
RGraph.parseRec(BitSet, BitSet, BitSet) line: 327
RGraph.parseRec(BitSet, BitSet, BitSet) line: 327
RGraph.parseRec(BitSet, BitSet, BitSet) line: 327
RGraph.parseRec(BitSet, BitSet, BitSet) line: 327
RGraph.parseRec(BitSet, BitSet, BitSet) line: 327
RGraph.parseRec(BitSet, BitSet, BitSet) line: 327
RGraph.parseRec(BitSet, BitSet, BitSet) line: 327
RGraph.parseRec(BitSet, BitSet, BitSet) line: 327
RGraph.parse(BitSet, BitSet, boolean, boolean) line: 246
UniversalIsomorphismTester.search(AtomContainer,
AtomContainer, BitSet, BitSet, boolean, boolean) line: 360
UniversalIsomorphismTester.getSubgraphMaps(AtomContainer,
AtomContainer) line: 195
CarnitineTest.main(String[]) line: 106

Any idea what's wrong -- is it a bug, or am I doing
something really stupid?

Regards from Göttingen, Germany --

Torsten

Discussion

  • Rajarshi Guha

    Rajarshi Guha - 2004-11-04

    Logged In: YES
    user_id=349408

    Hi, the code will not hang if you drop the addition of
    hydrogens. I've faced this problem in a slightlly different
    manner: when getOverlaps() is supplied a molecule(s)
    containing H's some of the overlaps include the hydrogens as
    part of the path. As a result if I just supply hydrogen
    excluded structures, the algorithm detects the common
    substructure(s) properly.

    I removed the hydrogen addition from your code and it works fine

    I would also be interested in knowing whether this behavior
    with hydrogen included and excluded structures is a bug or a
    feature :) It appears to make sense that common
    substructures should exclude hydrogens as they can always be
    considered implicitly

     
  • Torsten Crass

    Torsten Crass - 2004-11-04

    Logged In: YES
    user_id=779249

    Hi Rajarshi,

    yes, I know that the prog works fine when skipping the
    hydrogen addition step. However, I need those hydrogens --
    well, at least I guess I need them (I am still playing
    around a bit). Anyway, explicit hydrogens are nothing but
    plain atoms and should not be treated specially by the
    overlap algorithm, should they?

    Regards --

    Torsten

     
  • Rajarshi Guha

    Rajarshi Guha - 2004-11-04

    Logged In: YES
    user_id=349408

    You're right that explicit hydrogens are just atoms. But how
    woul;d the presence/abscence of hydrogens affect the
    substructure search? I've also been playing with the
    substructure seacrch code a bit and from what I see, adding
    hydrogens gives me erroneous hits - or rather, they're not
    erroneous when you consider that the hydrogens are there,
    but they're not what I would be looking for say, when
    searching for a common scaffold in a set of molecules.

    Could you elaborate on why you need the hydrogens?

     
  • Torsten Crass

    Torsten Crass - 2004-11-04

    Logged In: YES
    user_id=779249

    Thanks for asking! Not that you mention it, I really do not
    need the hydrogens in the "getSubgraphMaps" part of my
    program... :-)

    However, I still think this is a bug -- though I have no
    clue what is so special about carnitine that it reveals the
    bug, in contrast to hundreds of other molecules...

    Regards --

    Torsten

     
  • Rajarshi Guha

    Rajarshi Guha - 2004-11-04

    Logged In: YES
    user_id=349408

    An update to the problem: it appears that by adding the
    hydrogens, the program just takes a very long time (56
    seconds on a 1GHz P4, 512MB RAM) to generate the subgraph
    map list.

    In fact by adding the hydrogens the number of matching
    subgraphs is 2048! In the abscence of hydrogens the number
    of matching subgraphs is 6. This would explain why the
    program seems to hang.

    Unfortunately I don't know enough of the details of the
    resolution graph appraoch to see why it should take so long
    - but it appears that the increase in time is because the
    possible subgraphs between carnitine and the overlap
    fragment (which is effectively carnitine itself) will
    include not only paths between heavy atoms, but also paths
    between heavy atoms and hydrogens as well.

     
  • Torsten Crass

    Torsten Crass - 2004-11-04

    Logged In: YES
    user_id=779249

    Still the funny thing is: it works perfect with molecules
    way larger than carnitine...

     
  • Kai Hartmann

    Kai Hartmann - 2004-11-05

    Logged In: YES
    user_id=743500

    As Rajarshi already pointed out, the algorithm works okay.
    Please close this bug.

    Explanation:

    The number of matching subgraphs directly depends on the
    number of equivalent atom positions in the molecule. When
    you have carnithine, the N+(C)C group has three
    equivalent carbon positions. Therefore you get 3! = 6 hits
    from the subgraph search. However, N+([CH3])[CH3]
    has nine equivalent hydrogen positions, resulting in 9! =
    362880 subgraph matches. Because this number can grow
    extremely large, a cut-off has been introduced in the RGraph
    class. It is set to 10000 iterations which results in 2078
    subgraphs in my case.

    On the other hand, when you have a much larger molecule with
    lower local symmetry, the algorithm may finish much faster.

     
  • Torsten Crass

    Torsten Crass - 2004-11-08

    Logged In: YES
    user_id=779249

    I see... thanks for the explanation, sorry for bothering
    you, bug gets closed immediately.

     
  • Egon Willighagen

    Logged In: YES
    user_id=25678

    Torsten, don't worry about it... thanx for raising the issue... at
    least I learned something :)