From: Chris M. <c.m...@ga...> - 2008-09-23 20:58:14
ern...@ba... wrote:
>> I'm currently experimenting with a modified FP3 fingerprint that uses
>> multiple bits to code not only the existence of a pattern but also its
>> count.
>> Craig A. James wrote:
> I've thought about something like this before.
> Another thing to consider: The distribution of counts for functional groups
> is wildly different. For example, the difference between three and four
> nitro groups might be very significant, but the difference between three
> or four methyls is not. Similarly, the presence of a metal atom is, by
> itself, all you need to know; there are almost never two, so you're wasting
> 15 bits if you encode metals the same way you encode nitro groups or methyls.
> Using 16 bits to encode every functional group is simple, but can waste a lot
> of space, which translates to slower searching.
>
>> Yes, this would lead towards something like the CACTVS keys in PubChem.
>> For OB, the significance of each pattern could be coded in the input
>> patterns file and control the bit-setter.

Following this discussion, I have had a go at modifying finger3.cpp so that PatternFP is more versatile while retaining backward compatibility: FP3 and FP4 are the same as they were. I haven't been able to find any other established schemes (like the CACTVS one mentioned above), so maybe there are things that could be improved.

Each pattern in a datafile like SMARTS_InteLigand.txt can now have two optional parameters, one specifying a number of occurrences (m) and the other the number of bits (n) for that pattern.

If n=1 (the default) or absent, then the pattern's bit is set only when there are more than m matches in the molecule.

If m=0 (the default), then the pattern has n bits (and extra weight in any similarity tests).

If the parameters are n-1 and n, a bit is set for each of the conditions: number of matches >=n, >=n-1, ... , >=1. This can be used to distinguish structures with many similar atoms, like n-alkanes.
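
As a rough illustration, the three cases described above can be sketched in Python. This is only one reading of the rules as stated in this message, not the code in finger3.cpp: the function name `pattern_bits`, the order of the bits within a pattern, and the refusal to handle other parameter combinations are all assumptions made for the example.

```python
def pattern_bits(num_matches, occurrences=0, nbits=1):
    """Return the bits (as a list of 0/1) that one pattern contributes.

    Hypothetical helper: num_matches is the number of times the SMARTS
    pattern matched, occurrences is m and nbits is n from the text above.
    """
    if occurrences == 0:
        # Presence only: all n bits are set together, so n > 1 simply
        # gives the pattern more weight in a similarity comparison.
        return [1 if num_matches > 0 else 0] * nbits
    if nbits == 1:
        # Single bit, set only when there are more than m matches.
        return [1 if num_matches > occurrences else 0]
    if occurrences == nbits - 1:
        # One bit per threshold: matches >= n, >= n-1, ..., >= 1.
        # The left-to-right bit order here is an assumption.
        return [1 if num_matches >= t else 0 for t in range(nbits, 0, -1)]
    # The message says other behaviours are possible but does not spell
    # them out, so this sketch does not guess at them.
    raise NotImplementedError("combination not described in the message")


# Parameters "2 3": three bits for >=3, >=2 and >=1 matches.
one_match = pattern_bits(1, occurrences=2, nbits=3)     # [0, 0, 1]
three_matches = pattern_bits(3, occurrences=2, nbits=3)  # [1, 1, 1]
```

With the n-1, n form, two molecules whose match counts differ (and are both at most n) always get different bit strings, which is what makes the counting variant useful for runs of similar atoms.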
So a datafile with just one line:

  Secondary_carbon: [CX4H2]([#6])[#6] 13 14

will distinguish all n-alkanes up to C14H30. But the parameters can have any positive value and other behaviours are possible.

As with some other plugin classes, fingerprints based on PatternFP can now be defined without recompiling, by making an entry in plugindefines.txt. This looks like

  PatternFP
  MACCS        #ID of this fingerprint type
  MACCS.txt    #File containing the SMARTS patterns

The data for this fingerprint is taken from RDKit and is in yet another file format. (Andrew Dalke won't be pleased.) It has about 160 patterns, but I'm not sure how complete it is. It makes use of the occurrences parameter above.

The new and modified files are in the trunk on SVN.

Chris