A very helpful set of fingerprints is the extended connectivity fingerprint class (ECFP), especially ECFP4. It would be nice to be able to calculate ECFP4 (and maybe some others from the same class) using "Manage Datasets" and "Calculate Properties".
Unfortunately, the version of CDK we are currently using does not provide these fingerprints. ECFP is implemented in a more recent version of the library (see http://cheminf20.org/2014/02/21/open-source-ecfpfcfp-circular-fingerprints-in-cdk/). Switching to the new CDK version requires some work and will take some time, but is planned as one of the next steps (see Feature request #31).
I need some information about the correct format of these fingerprints: I read the link from Nils' first comment and, if I understand correctly, an ECFP/FCFP fingerprint consists of an ordered list of 32-bit integers, which are hashcodes of some structural information (the calculation of the hashcodes is done by the CDK, so we don't need to bother about that).
Now we have several options for converting these lists into an internal SH format. The most convenient one, in my opinion, is to use Numerical Fingerprints, i.e. a comma-separated sequence of integers. The fingerprint generator of the CDK also offers to fold the list into a Bit Fingerprint. It is created as follows: for every hashcode with value i in the original fingerprint, the i-th bit of the bit fingerprint is set to 1. The bit fingerprint has a limited length k, but i can be any number between -2^31 and 2^31 - 1. Therefore the actual position is determined by i modulo k.
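To make the folding step concrete, here is a minimal Python sketch. It is not the CDK code: the function name fold_hashcodes and the default length are made up for illustration, and the way the CDK maps negative hashcodes to a position may differ from the plain modulo used here.

```python
def fold_hashcodes(hashcodes, k=1024):
    """Fold signed 32-bit hashcodes into a bit string of length k.

    Illustrative only: position = hashcode modulo k, as described above.
    """
    bits = ["0"] * k
    for h in hashcodes:
        # Python's % always yields a value in [0, k) for positive k,
        # so negative hashcodes also land on a valid position.
        bits[h % k] = "1"
    return "".join(bits)


# 1 and 9 collide at position 1 when k = 8; -1 maps to position 7.
print(fold_hashcodes([1, 9, -1], k=8))
```

Note that in Java the % operator can return a negative result for negative operands, so a real implementation has to normalise the hashcode before using it as an index.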
The numeric fingerprint would be compared by Jaccard similarity, while the bit fingerprint is compared with Tanimoto distance. Both metrics ignore the ordering of the hashcodes, as well as the number of occurrences of each hashcode.
So, in the end, it comes down to this difference: numeric fingerprints need much more space, while bit fingerprints are less accurate, because they mix up hashcodes that happen to be mapped to the same position by the modulo operation.
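The trade-off can be illustrated with a small Python sketch. The helper names (jaccard, fold, tanimoto_bits) and the tiny length k = 8 are assumptions chosen only to force a collision; this is not the SH or CDK implementation.

```python
def jaccard(a, b):
    """Jaccard similarity on the sets of hashcodes (order and counts ignored)."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def fold(hashcodes, k):
    """Fold hashcodes into a bit string of length k via modulo."""
    bits = ["0"] * k
    for h in hashcodes:
        bits[h % k] = "1"
    return "".join(bits)


def tanimoto_bits(x, y):
    """Tanimoto coefficient of two equally long '0'/'1' strings."""
    both = sum(1 for p, q in zip(x, y) if p == "1" and q == "1")
    either = sum(1 for p, q in zip(x, y) if p == "1" or q == "1")
    return both / either if either else 1.0


a = [1, 2, 3]
b = [1, 2, 11]  # 11 collides with 3 when k = 8

print(jaccard(a, b))                          # on the raw hashcodes
print(tanimoto_bits(fold(a, 8), fold(b, 8)))  # on the folded bits
```

On the raw hashcodes the two fingerprints share 2 of 4 distinct values, while after folding they become identical, which is exactly the loss of accuracy described above. With a realistic length the collisions are rarer, but the effect is the same in kind.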
I think that the ECFP fingerprint must not be compared with our current implementation of the Jaccard index. The interpretation of a numerical fingerprint (x_1, ..., x_n) in SH is that the i-th feature occurs x_i times. This is not how the ECFP fingerprint should be interpreted. I suggest using the binary version of ECFP (which can be compared with our Tanimoto coefficient implementation) for now.
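A small Python sketch of why the count interpretation does not fit. It assumes a generalised (min/max) Jaccard on count vectors as a stand-in; the actual SH implementation may be defined differently, and both function names are hypothetical.

```python
def count_jaccard(x, y):
    """Generalised Jaccard on count vectors: sum of minima / sum of maxima.

    Assumed stand-in for a count-based Jaccard index; x[i] is read as
    'feature i occurs x[i] times'.
    """
    num = sum(min(p, q) for p, q in zip(x, y))
    den = sum(max(p, q) for p, q in zip(x, y))
    return num / den if den else 1.0


def set_jaccard(a, b):
    """Jaccard on the sets of hashcodes, the reading ECFP needs."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


# The same two hashcodes, listed in a different order.
a = [100, 200]
b = [200, 100]

print(count_jaccard(a, b))  # treats the hashcodes as occurrence counts
print(set_jaccard(a, b))    # treats them as feature identifiers
```

Under the count reading the two lists look only half similar, although they contain exactly the same hashcodes. That is the mismatch a dedicated fingerprint type with its own similarity measure would avoid.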
Supporting the other format is highly desirable and requires adding a new type of fingerprint with appropriate similarity/distance measures.
The current state of the new fingerprint calculator is now present in the master branch. Do we mean the same thing by "binary version of ECFP" and the bit fingerprint I described in my last comment? The result for SH is a string (with characters), which is interpreted as a bit sequence by the distance calculators.
Yes, that is what I meant by "binary version". This type of fingerprint can be compared in the same way as the Daylight fingerprint we have already integrated, so adding this version does not require any further changes.
I have noticed two things:
Regarding 2: if we can detect that this option is not used at all, I would say it is much cleaner to simply remove it. Otherwise, the user might assume that the checkbox is greyed out only because of some other configuration options.
is fixed.