Learn how easy it is to sync an existing GitHub or Google Code repo to a SourceForge project! See Demo

Close

#648 Speed up of Murcko fragmenter

Accepted
closed
nobody
None
master
1
2013-06-26
2013-06-20
Rajarshi Guha
No

The current Murcko fragmenter takes a very long time (> 30 min) for a complex molecule (http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16129878). This was due to its repeatedly processing duplicate fragments. The current patches fixes by skipping duplicate fragments and drops processing time to 90s or so

1 Attachments

Related

Patches: #648

Discussion

  • John May
    John May
    2013-06-20

    Hi Rajarshi,

    Looks good, a couple of things to change though.
    - The import java.util.*; should be left as explicit imports.
    - Whats the point of the reset() method?
    - you need to add the dependence of fragment on cdk-hash... or even better make the generator constructor injectable - users can then choose how accurate they want to be (see below).
    - My fault on this as I probably was not clear, Long.toHexString is good if you want to print them to be human readable. Otherwise just the the 'Long'. The Set<Long> fragment set an map.

    Strictly speaking, what it needs is an interface to tell molecular identity. The best we have at the moment is the hash code in which case it would looks something like this.

        MolecularHashGenerator generator = ...
        MurckoFragmenter       mf        = new MurckoFragmenter(generator);
    

    90 seconds is also ridiculously long for a single molecule. I suspect AllRingsFinder may play a part in that. It could just be the atom container access and copies which are slow but I wouldn't of thought it was that slow. It would be interesting to see where the bottle neck is.

     
    Last edit: John May 2013-06-20
  • Rajarshi Guha
    Rajarshi Guha
    2013-06-20

    Thanks, I'll update the code.

    Yeah, the reset() method can be dropped.

    The time is actually not for the ring finder - it actually runs in < 1s for this case. The bottleneck is the recursive fragmentation procedure. This molecule recurses ~ 380 times.

     
  • John May
    John May
    2013-06-20

    Okay that makes sense. J

    Sent from my iPhone

    On 20 Jun 2013, at 22:10, "Rajarshi Guha" rajarshi@users.sf.net wrote:

    Thanks, I'll update the code.

    Yeah, the reset() method can be dropped.

    The time is actually not for the ring finder - it actually runs in < 1s for this case. The bottleneck is the recursive fragmentation procedure. This molecule recurses ~ 380 times.

    [patches:#648] Speed up of Murcko fragmenter

    Status: open
    Created: Thu Jun 20, 2013 01:53 AM UTC by Rajarshi Guha
    Last Updated: Thu Jun 20, 2013 08:54 AM UTC
    Owner: nobody

    The current Murcko fragmenter takes a very long time (> 30 min) for a complex molecule (http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16129878). This was due to its repeatedly processing duplicate fragments. The current patches fixes by skipping duplicate fragments and drops processing time to 90s or so

    Sent from sourceforge.net because you indicated interest in https://sourceforge.net/p/cdk/patches/648/

    To unsubscribe from further messages, please visit https://sourceforge.net/auth/subscriptions/

     

    Related

    Patches: #648

  • Rajarshi Guha
    Rajarshi Guha
    2013-06-22

    New set of patches (2 new ones on top of the first one) that address your points. Also cleaned up code to remove unecessary cycle detection. As a result >30min -> 70s -> 14s

     
    Attachments
  • Great work, guys!

    Rajarshi, just wondering... what is the improvement for simple molecules, for which it was fast anyway? (in nanoseconds, please :)

     
  • Rajarshi Guha
    Rajarshi Guha
    2013-06-22

    Actually for small molecules this doesn't really change much - the problem occurs when you have large numbers of repeated units or repeated units + lots of branched paths. For most small molecules, you won't have (too many) repeated fragments.

     
  • John May
    John May
    2013-06-24

    I'm getting tests fails from this. It could be to do with the atom configuration for the hash but some of the fragments are disconnected so I think it might be something else. There seems to be a couple of extra if statements in the original commit which might cause it but I'm not sure.

    Also,
    - the macro molecule test takes too long to run. The unit tests are meant to be run on each build and should be quick. As there is no place for long running tests i would change, '@Test' -> '@Ignore("takes too long")'.
    - should fragment depend on hash? I know it's for convenient constructor... but I'm not sure. Untangling the current dependencies for maven has made me very cynical :(. Thoughts from Egon would be good on this dependency?

    J

     
  • John May
    John May
    2013-06-24

    • Group: Needs_Review --> Needs_Revision
     
  • Rajarshi Guha
    Rajarshi Guha
    2013-06-24

    OK, I'll ignore the macrocycle test. The fails are strange - all tests pass on my machine, based on a clean build.

    As for the hash dependency, the fact that the hash is used in the code implies that the class depends on the hash module, irrespective of the constructor (?)

     
  • John May
    John May
    2013-06-24

    Very strange… could you publish your branch on github so I can run directly without patching.

    Yes and no, the generator is in interfaces module so if you inject via constructor no dependance is needed. I do need the hash module (or another custom implementation) if I'm using Murcko fragmenter.. but if i'm using another one (I think there is only one other one) then I don't need the hash module at all.

    J

    On 24 Jun 2013, at 16:08, Rajarshi Guha rajarshi@users.sf.net wrote:

    OK, I'll ignore the macrocycle test. The fails are strange - all tests pass on my machine, based on a clean build.

    As for the hash dependency, the fact that the hash is used in the code implies that the class depends on the hash module, irrespective of the constructor (?)

    [patches:#648] Speed up of Murcko fragmenter

    Status: open
    Created: Thu Jun 20, 2013 01:53 AM UTC by Rajarshi Guha
    Last Updated: Mon Jun 24, 2013 03:04 PM UTC
    Owner: nobody

    The current Murcko fragmenter takes a very long time (> 30 min) for a complex molecule (http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16129878). This was due to its repeatedly processing duplicate fragments. The current patches fixes by skipping duplicate fragments and drops processing time to 90s or so

    Sent from sourceforge.net because cdk-patches@lists.sourceforge.net is subscribed to https://sourceforge.net/p/cdk/patches/

    To unsubscribe from further messages, a project admin can change settings at https://sourceforge.net/p/cdk/admin/patches/options. Or, if this is a mailing list, you can unsubscribe from the mailing list.


    This SF.net email is sponsored by Windows:

    Build for Windows Store.

    http://p.sf.net/sfu/windows-dev2dev_______
    Cdk-patches mailing list
    Cdk-patches@lists.sourceforge.net
    https://lists.sourceforge.net/lists/listinfo/cdk-patches

     
  • Nice clean patch. I left a few small discretionary questions on GitHub.

     
  • John May
    John May
    2013-06-25

    Nope, same results -

    [intrepid ~/workspace/github/johnmay/cdk]: ant clean dist-all test-dist-all jarTestdata
    ...
    [intrepid ~/workspace/github/johnmay/cdk]: ant -Dmodule=fragment test-module
    Buildfile: /Users/johnmay/workspace/github/johnmay/cdk/build.xml

    checkPlatforms:

    check:

    noJunit:

    test-module:
    [echo] Testing classes for CDK's fragment module.
    [echo] CDK dependencies defined: true
    [echo] Library dependencies defined: true
    [echo] Developer Library dependencies defined: true
    [junit] Running org.openscience.cdk.modulesuites.MfragmentTests
    [junit] Tests run: 33, Failures: 8, Errors: 0, Time elapsed: 159.913 sec
    [junit] Test org.openscience.cdk.modulesuites.MfragmentTests FAILED

    BUILD SUCCESSFUL
    Total time: 2 minutes 41 seconds

    Also looks like it still takes 159 seconds on my machine - didn't you say you got it down more (70s -> 14s)?

     
  • Rajarshi Guha
    Rajarshi Guha
    2013-06-25

    aah crap. let me start over ...

     
  • Rajarshi Guha
    Rajarshi Guha
    2013-06-25

    I pushed some new updates to the fastmurcko branch that address your and Egons comments. Importantly, all tests should run now and the macrocycle runs in 14s - so I kept that in (?)

    Also the method is now threadsafe

    Regarding the dependency on cdk-hash.jar - I don't see how I can avoid that since the fragmenter has to have a default molecular hash function

     
  • John May
    John May
    2013-06-26

    All working good now will sign/push apply later.

    Still apprehensive about the dependency, it's nice to provide the default constructor but I always thought the CDK philosophy was 'building blocks'. There are lots of cases of tools doing things they shouldn't (i.e. SmilesReader does atom typing by default). It's fine as it is though - just wanted to really make sure this it is the 'right' thing.

    The 14 seconds is still quite long - I have a couple of ways to bring this down though. Will add these as a separate patch. Mainly it is:
    - better linker method, 'all paths' is very slow and not needed - can invert the check and find cyclic and terminally reducible vertices (named side chains here) in linear time. You can then get the frameworks by combining linker and framework components.
    - faster fragmenting - IAtomContainer isn't very good for doing these type of things on large molecules.

     
  • Rajarshi Guha
    Rajarshi Guha
    2013-06-26

    Thanks - I agree about the dependency; but removing the use of the default molecule hash generator requires a user to construct one and pass it in. While not a big deal, it adds an extra step on the user side and exposes what should be (IMO) an implementation detail of the fragmentation method.

    As for speedups - yes definitely room for improvement. You're right - profiling the code does show that all paths is a bottleneck currently. But for now definitely better than >30 min :)

     
  • John May
    John May
    2013-06-26

    • status: open --> closed
    • Group: Needs_Revision --> Accepted
     
  • John May
    John May
    2013-06-26

    Applied and pushed.

    One thing that caught by eye, why does it still recurse if it only wants the first fragment (i.e. like the macrocycle test)? I understand the recursion would be needed for the framework combinations but not for the single framework.

     
  • Rajarshi Guha
    Rajarshi Guha
    2013-06-26

    The idea was that subsequent calls to get the rings etc would return immediately