#66 additional fingerprint ECFP4

2.6.1
closed
None
31
6
2017-01-17
2015-09-16
No

A very helpful set of fingerprints is the extended connectivity fingerprint class (ECFP), especially ECFP4. It would be nice to have the possibility to calculate the ECFP4 (and maybe some others from the same class) using "Manage Datasets" and "Calculate Properties".

Discussion

  • Nils Kriege

    Nils Kriege - 2015-09-16

    Unfortunately, the version of CDK we are currently using does not provide these fingerprints. ECFP is implemented in a more recent version of the library (see http://cheminf20.org/2014/02/21/open-source-ecfpfcfp-circular-fingerprints-in-cdk/). Switching to the new CDK version requires some work and will take some time, but is planned as one of the next steps (see Feature request #31).

     
  • Nils Kriege

    Nils Kriege - 2015-09-16
    • Depends On: --> 34
     
  • Till Schäfer

    Till Schäfer - 2015-09-16
    • Depends On: 34 --> 31
     
  • Nils Kriege

    Nils Kriege - 2015-09-19
    • assigned_to: Sven Schrinner
    • Priority: 5 --> 6
     
  • Sven Schrinner

    Sven Schrinner - 2016-02-25
    • status: open --> in-progress
    • Related To: -->
     
  • Sven Schrinner

    Sven Schrinner - 2016-02-26

    I need some information about the correct format of these fingerprints. I read the link from Nils' first comment and, if I understand correctly, an ECFP/FCFP fingerprint consists of an ordered list of 32-bit integers, which are hash codes of some structural information (the CDK calculates the hash codes, so we don't need to bother about that).

    Now we have several options to convert these lists into an internal SH format. The most convenient one in my opinion is to use Numerical Fingerprints, which are comma-separated sequences of integers. The fingerprint generator of the CDK also offers to fold the list into a Bit Fingerprint. It is created as follows: for every hash code with value i in the original fingerprint, the i-th bit of the bit fingerprint is set to 1. The bit fingerprint has a limited length of k, but i can be any number between -2^31 and 2^31 - 1. Therefore the actual position is determined by i modulo k.
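
    The folding step described above can be sketched in plain Java without the CDK. This is only an illustration: the class name FoldSketch, the method fold and the sample values are made up, and using Math.floorMod for negative hash codes is an assumption, since the CDK may map negative values to positions differently.

```java
import java.util.BitSet;

public class FoldSketch {
    /** Folds raw 32-bit hash codes into a bit fingerprint of length k. */
    public static BitSet fold(int[] hashes, int k) {
        BitSet bits = new BitSet(k);
        for (int h : hashes) {
            // Math.floorMod keeps the index in [0, k) even for negative hash codes;
            // Java's % operator would return a negative remainder here.
            bits.set(Math.floorMod(h, k));
        }
        return bits;
    }

    public static void main(String[] args) {
        // Made-up values, not real ECFP hash codes.
        int[] hashes = {5, 1029, -3, 5};
        BitSet fp = fold(hashes, 1024);
        // 5 and 1029 collide on position 5; -3 maps to position 1021.
        System.out.println(fp); // prints {5, 1021}
    }
}
```

    Note that the duplicate 5 and the colliding 1029 leave no trace in the result: a set bit only records that at least one hash code was mapped to that position.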

    The numeric fingerprint would be compared by Jaccard similarity, while the bit fingerprint is compared with Tanimoto distance. Both metrics ignore the ordering of the hash codes, as well as the number of occurrences of each hash code.

    So, finally, it comes down to this difference: numeric fingerprints need much more space, while bit fingerprints are less accurate, because they mix up hash codes that happen to be mapped to the same position by the modulo operation.
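
    To make the trade-off concrete, here is a small sketch that compares two made-up hash code lists once as sets (Jaccard) and once after folding to a deliberately tiny length k = 8 (Tanimoto coefficient on bits). The class and method names are hypothetical, returning 1.0 for two empty fingerprints is an assumed convention, and the folding again assumes Math.floorMod rather than the CDK's actual mapping.

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

public class SimilaritySketch {
    /** Jaccard similarity of two hash code lists, treated as sets. */
    public static double jaccard(int[] a, int[] b) {
        Set<Integer> sa = toSet(a);
        Set<Integer> inter = new HashSet<>(sa);
        inter.retainAll(toSet(b));
        Set<Integer> union = new HashSet<>(sa);
        union.addAll(toSet(b));
        // Assumed convention: two empty fingerprints count as identical.
        return union.isEmpty() ? 1.0 : (double) inter.size() / union.size();
    }

    /** Tanimoto coefficient of two bit fingerprints. */
    public static double tanimoto(BitSet a, BitSet b) {
        BitSet and = (BitSet) a.clone();
        and.and(b);
        BitSet or = (BitSet) a.clone();
        or.or(b);
        return or.isEmpty() ? 1.0 : (double) and.cardinality() / or.cardinality();
    }

    /** Folds hash codes into k bits (position = hash code modulo k). */
    public static BitSet fold(int[] hashes, int k) {
        BitSet bits = new BitSet(k);
        for (int h : hashes) {
            bits.set(Math.floorMod(h, k));
        }
        return bits;
    }

    private static Set<Integer> toSet(int[] values) {
        Set<Integer> s = new HashSet<>();
        for (int v : values) {
            s.add(v);
        }
        return s;
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3};   // made-up hash codes
        int[] b = {1, 2, 11};  // 11 modulo 8 is 3, so it collides with 3 from a
        System.out.println(jaccard(a, b));                    // 0.5
        System.out.println(tanimoto(fold(a, 8), fold(b, 8))); // 1.0
    }
}
```

    With a realistic fingerprint length collisions are much rarer than in this toy case, but the effect is the same: folded fingerprints can only overestimate how many features two molecules share.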

     
  • Sven Schrinner

    Sven Schrinner - 2016-02-26
    • status: in-progress --> needs-info
    • Group: any future version --> next release
     
  • Nils Kriege

    Nils Kriege - 2016-02-26

    I think that the ECFP fingerprint must not be compared with our current implementation of the Jaccard index. The interpretation of a numerical fingerprint (x_1, ..., x_n) in SH is that the i-th feature occurs x_i times. This is not how the ECFP fingerprint should be interpreted. I suggest using the binary version of ECFP (which can be compared with our Tanimoto coefficient implementation) for now.
    Supporting the other format is highly desirable and requires adding a new type of fingerprint with appropriate similarity/distance measures.
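
    A rough sketch of why the two interpretations disagree: below, the same made-up hash codes are compared once as a count vector (position i holds the count of feature i) and once as a set of hash codes. The count-based formula used here, sum of minima over sum of maxima, is one common generalisation of the Jaccard index and only an assumption; the actual SH implementation is not shown in this thread. All names are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

public class CountVectorSketch {
    /** Jaccard generalised to count vectors: sum of minima / sum of maxima. */
    public static double countJaccard(int[] x, int[] y) {
        int n = Math.max(x.length, y.length);
        long minSum = 0;
        long maxSum = 0;
        for (int i = 0; i < n; i++) {
            // Missing positions are treated as a count of 0.
            long xi = i < x.length ? x[i] : 0;
            long yi = i < y.length ? y[i] : 0;
            minSum += Math.min(xi, yi);
            maxSum += Math.max(xi, yi);
        }
        return maxSum == 0 ? 1.0 : (double) minSum / maxSum;
    }

    /** Jaccard on the values themselves, ignoring order and duplicates. */
    public static double setJaccard(int[] a, int[] b) {
        Set<Integer> sa = new HashSet<>();
        for (int v : a) sa.add(v);
        Set<Integer> sb = new HashSet<>();
        for (int v : b) sb.add(v);
        Set<Integer> inter = new HashSet<>(sa);
        inter.retainAll(sb);
        Set<Integer> union = new HashSet<>(sa);
        union.addAll(sb);
        return union.isEmpty() ? 1.0 : (double) inter.size() / union.size();
    }

    public static void main(String[] args) {
        // The same three made-up hash codes, listed in a different order.
        int[] a = {100, 200, 300};
        int[] b = {300, 100, 200};
        System.out.println(countJaccard(a, b)); // 0.5, hash codes misread as counts
        System.out.println(setJaccard(a, b));   // 1.0, identical feature sets
    }
}
```

    Real hash codes can also be negative, which makes a count interpretation meaningless altogether; a dedicated fingerprint type with a set-based measure avoids both problems.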

     
  • Sven Schrinner

    Sven Schrinner - 2016-02-26

    The current state of the new fingerprint calculator is now in the master branch. Do we mean the same thing by "binary version of ECFP" and the bit fingerprint I described in my last comment? The result for SH is a string (with characters), which is interpreted as a bit sequence by the distance calculators.

     
  • Nils Kriege

    Nils Kriege - 2016-02-26

    Yes, that is what I meant by "binary version". This type of fingerprint can be compared in the same way as the Daylight fingerprint we have already integrated, so adding this version does not require any further changes.

     
  • Till Schäfer

    Till Schäfer - 2016-03-07

    I have noticed two things:

    1. The configure panel seems to break the calc dialog layout (see screenshot). This happens whenever I add an ExtendedConnectivityFingerprinter job. It does not happen for any other calc plugin.
    2. "Recalculate 2D-Coordinates" is always greyed out.
     
  • Sven Schrinner

    Sven Schrinner - 2016-03-14
    1. This only occurs on Ubuntu for me. The combo box for the new plugin somehow triggers layout updates, which cause the two lists on the left to consume more vertical space, even if that pushes them beyond the dialog size. This issue has occurred many times in the past with the Form Layout. Still working on this.
    2. This is the case for almost all (or all?) plugins. It is a generalized building block that almost all plugins use. Maybe we should completely hide unused options for all plugins.
     
  • Sven Schrinner

    Sven Schrinner - 2016-03-14
    • status: needs-info --> in-progress
     
  • Sven Schrinner

    Sven Schrinner - 2016-03-18
    1. should already be fixed.
     
  • Sven Schrinner

    Sven Schrinner - 2016-03-18
    • status: in-progress --> needs-info
     
  • Till Schäfer

    Till Schäfer - 2016-03-22

    Regarding 2: if we can detect whether this option is not used at all, I would say it is much cleaner to simply remove it. Otherwise, the user might think that the checkbox is greyed out only because of some other configuration option.

     
  • Sven Schrinner

    Sven Schrinner - 2016-03-25

    is fixed.

     
  • Sven Schrinner

    Sven Schrinner - 2016-03-25
    • status: needs-info --> needs-review
     
  • Till Schäfer

    Till Schäfer - 2016-03-29
    • status: needs-review --> closed
     
  • Till Schäfer

    Till Schäfer - 2017-01-17
    • Group: 2.6.2 --> 2.6.1
     
