From: Christoph S. <ste...@ic...> - 2003-06-12 09:27:41
Hi everybody,

Brett Saunders sent me the following helpful comments on the FingerPrinter, which I would like to share with you:

---start of quote---

I've come across a couple of problems in the fingerprint code as well. I haven't got code fixes for them yet, but probably will in the next couple of days/week. The two problems are:

1) After the DFS path search, symmetric patterns are checked for by using string reversal. This breaks on multi-letter atoms. The paths CCC-Cl and Cl-CCC are the same path; however, when reversed character by character the latter becomes CCC-lC, which no longer matches CCC-Cl. Path generation should be done on the atom ID instead, and only converted to letters prior to hashing. I'm not a Java programmer, but this is what I would do in C++: make a vector of int in place of the string. For each atom to append, I'd push back the atomic number shifted left by 8 bits. For bonds, I'd push back the bond order.

2) In the selection of a bit to set, at least 4, possibly more, bits should be set for a given hash code derived from a path. This is specifically so that two different hash keys whose random number sequences start with the same value still produce different resulting fingerprints. Consider the following hypothetical situation:

PATH    HashCode   RandomNumberSequence starting from HashCode
CCC-C   531        102, 681, 325 ...
CC-C    120        102, 681, 895

Here we require three pseudo-random bits to differentiate between these two "molecular features", because the sequences only diverge at the third number. I also suspect that the fingerprint length should be used to control how many random bits are chosen.

Finally, I think (but am not convinced yet) that atoms that are ring members should be excluded as starting points for the path search. Rings should probably be hashed differently, because paths started from different ring atoms only encode rotations of the same features. However, I'm not sure how to do this yet.

Also, some careful thinking about the probability distribution of the hashing function and the number of bits per key could lead us to a good theoretical metric of "good key size" / chosen bits per path.

---end of quote---

--
Dr. Christoph Steinbeck (e-mail: c.s...@un...)
Groupleader Junior Research Group for Applied Bioinformatics
Cologne University BioInformatics Center (http://www.cubic.uni-koeln.de)
Zülpicher Str. 47, 50674 Cologne
Tel: +49(0)221-470-7426 Fax: +49 (0) 221-470-5092

What is man but that lofty spirit - that sense of enterprise.
... Kirk, "I, Mudd," stardate 4513.3..
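
To make points 1) and 2) of the quote concrete, here is a minimal Java sketch. It is not the actual FingerPrinter code: the class name PathFingerprintSketch and the methods encodePath, canonical and setBits are hypothetical names chosen for illustration, java.util.Random is only a stand-in for whatever generator the FingerPrinter uses, and taking the smaller of a path and its token-wise reversal is one possible way to treat both directions as the same path. The path is kept as a list of ints (atomic number shifted left by 8 bits, bond order for bonds), so reversing it can never split a multi-letter symbol such as Cl.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Illustrative sketch only; all names here are hypothetical, not CDK API.
final class PathFingerprintSketch {

    /** Encodes a path as ints: each atom as (atomicNumber << 8), each bond as its bond order. */
    static List<Integer> encodePath(int[] atomicNumbers, int[] bondOrders) {
        if (bondOrders.length != atomicNumbers.length - 1) {
            throw new IllegalArgumentException("need exactly one bond between neighbouring atoms");
        }
        List<Integer> tokens = new ArrayList<>();
        for (int i = 0; i < atomicNumbers.length; i++) {
            if (i > 0) {
                tokens.add(bondOrders[i - 1]);
            }
            tokens.add(atomicNumbers[i] << 8);
        }
        return tokens;
    }

    /** Returns the lexicographically smaller of the path and its token-wise reversal. */
    static List<Integer> canonical(List<Integer> tokens) {
        List<Integer> reversed = new ArrayList<>(tokens);
        Collections.reverse(reversed); // reverses whole tokens, never single characters
        for (int i = 0; i < tokens.size(); i++) {
            int c = Integer.compare(tokens.get(i), reversed.get(i));
            if (c != 0) {
                return c < 0 ? tokens : reversed;
            }
        }
        return tokens; // the path is its own mirror image
    }

    /** Sets bitsPerPath pseudo-random bits, seeded from the hash of the canonical path. */
    static void setBits(BitSet fingerprint, int size, List<Integer> path, int bitsPerPath) {
        Random random = new Random(canonical(path).hashCode());
        for (int i = 0; i < bitsPerPath; i++) {
            fingerprint.set(random.nextInt(size)); // assumption: plain java.util.Random
        }
    }
}
```

With this encoding, C-C-C-Cl (atomic numbers 6, 6, 6, 17) and Cl-C-C-C give the same canonical token list and therefore the same bits. Calling setBits with bitsPerPath = 4 follows the "at least 4" suggestion in the quote; how that number should scale with the fingerprint length is left open, as it is there.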