From: <ern...@ba...> - 2008-09-18 22:41:30
Hi,

> I've thought about something like this before. The proof will be in how well it actually performs, and the trick there is to find a meaningful metric.

My current metric is how many false duplicates from FP2 (chemically different structures with identical fingerprints) it eliminates when combined with that FP2.

> One suggestion: you could probably cut down from 16 bits to 8 and hardly lose any selectivity by skipping more steps. Does it really matter whether you have 5 or 6 occurrences of a pattern? Does it really matter if you have more than 63 occurrences? I'd try two alternatives, one encoding the whole thing into 8 bits, and the other into 16 bits, and use your performance metric to see how they compare.

I've tried it with 8 and 16 bits, and the outcome so far is that with 8 bits it needs more input patterns to compete with 16. Since setting bits is cheaper than doing substructure searches, I'd choose the 16-bit approach. But my set of input patterns is probably not optimal yet...

BTW: How would "unsaturated non-aromatic nitrogen-containing ring size 3" be translated into a SMARTS pattern? :-)

> Another thing to consider: The distribution of counts for functional groups is wildly different. For example, the difference between three and four nitro groups might be very significant, but the difference between three or four methyls is not. Similarly, the presence of a metal atom is, by itself, all you need to know; there are almost never two, so you're wasting 15 bits if you encode metals the same way you encode nitro groups or methyls. Using 16 bits to encode every functional group is simple, but can waste a lot of space, which translates to slower searching.

Yes, this would lead towards something like the CACTVS keys in PubChem. For OB, the significance of each pattern could be coded in the input patterns file and control the bit-setter.

> By the way, if I understand what you're doing, these bit-vectors are more properly called "structural keys", not fingerprints. "Fingerprint" traditionally means a bit vector in which individual bits have no specific meaning because of the hash process used to encode features. The term "fingerprints" seems to be used these days to describe both, but historically, bitmaps in which each bit (or word, in your case) has a specific meaning have been called "structural keys." Unfortunately, in OpenBabel we use a single C++ class, the FP, to hold both, so the distinction is getting blurred.

I know. Actually it's range-coded molecular statistics, or molecular keys with a hardcoded translation table. But since it's all called fingerprint in OB, I used this term.

Best Regards

Ernst-Georg Schmid
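The 8-bit versus 16-bit comparison and the false-duplicate metric discussed in this message can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the threshold tables, the function names, and the toy records are hypothetical, the FP2 values are plain integers rather than real Open Babel output, and the structure IDs stand in for whatever canonical identifier decides that two molecules are chemically different. It is not the Open Babel API and not necessarily the encoding used in the actual patch.

```python
from collections import defaultdict

# Hypothetical threshold tables; the thread does not give the real values.
THRESHOLDS_16 = tuple(range(1, 17))        # one bit per count 1..16
THRESHOLDS_8 = (1, 2, 3, 4, 6, 8, 16, 32)  # coarser steps, fewer bits


def encode_count(count, thresholds):
    """Range-code a count: bit i is set when count >= thresholds[i]."""
    word = 0
    for i, threshold in enumerate(thresholds):
        if count >= threshold:
            word |= 1 << i
    return word


def encode_counts(counts, thresholds):
    """Pack one range-coded word per pattern into a single integer key."""
    width = len(thresholds)
    key = 0
    for slot, count in enumerate(counts):
        key |= encode_count(count, thresholds) << (slot * width)
    return key


def false_duplicates(pairs):
    """Count structures that share a fingerprint with a different structure.

    `pairs` yields (structure_id, fingerprint). Each group of n distinct
    structures with the same fingerprint contributes n - 1 false duplicates.
    """
    groups = defaultdict(set)
    for structure_id, fingerprint in pairs:
        groups[fingerprint].add(structure_id)
    return sum(len(members) - 1 for members in groups.values())


def eliminated(records, thresholds):
    """False duplicates of FP2 alone minus those of FP2 plus the count key."""
    base = false_duplicates((s, fp2) for s, fp2, _ in records)
    combined = false_duplicates(
        (s, (fp2, encode_counts(counts, thresholds)))
        for s, fp2, counts in records
    )
    return base - combined


# Toy data: (structure_id, fp2_bits, per-pattern occurrence counts).
RECORDS = [
    ("A", 0b1010, (1, 0)),
    ("B", 0b1010, (6, 0)),
    ("C", 0b1010, (7, 0)),
    ("D", 0b0110, (0, 1)),
]

if __name__ == "__main__":
    print("eliminated with 16 bits:", eliminated(RECORDS, THRESHOLDS_16))
    print("eliminated with 8 bits: ", eliminated(RECORDS, THRESHOLDS_8))
```

In this toy set, structures B and C differ only in having 6 versus 7 occurrences of the first pattern. The hypothetical 8-bit table puts both counts in the same range, so that pair stays a false duplicate, while the 16-bit table separates them. Per-pattern threshold tables, as suggested in the quoted text for metals versus methyls, would fit the same functions by passing a different tuple for each pattern.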