From: Rajarshi G. <raj...@gm...> - 2010-10-29 17:41:01
|
Hi,

the CDK has a decent collection of binary fingerprints. However, we are missing other useful ones, such as circular fingerprints. I know that at least one group is working on a circular fingerprint, but more generally, have there been any thoughts on how we can represent arbitrary fingerprints within the API?

Using circular fingerprints as an example, the fingerprint is essentially a list of integer or string values. More generally, one could assume a fingerprint to be a sequence of strings or integers, each with an associated count. Note that even the current hash-based fingerprints can be represented in this format.

While one can always convert the general key,value representation into a fixed-length bit string, it's sometimes useful to work directly with the key,value pairs. Thus, one possible refactoring of IFingerprinter is to have the following methods:

    BitSet calculate(IAtomContainer);
    BitSet calculate(IAtomContainer, int);
    Map<T,Integer> calculateRaw(IAtomContainer);

For cases such as the StandardFingerprinter, there is no change to the current behavior. On the other hand, for a circular fingerprint, the first two methods could be used to convert the raw circular fingerprint to a desired bit string, while one can always access the raw fingerprint with the third method.

Comments welcome

--
Rajarshi Guha
NIH Chemical Genomics Center |
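The three-method proposal above can be sketched as follows. This is only an illustration under stated assumptions: a plain SMILES String stands in for IAtomContainer so the code compiles without the CDK, the raw type is taken to be a Map from key to count, and the names RawFingerprinter and QGramFingerprinter are hypothetical. The toy fingerprinter uses overlapping substrings as keys, in the spirit of LINGO, and is not the implementation on the newfp branch.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the proposed IFingerprinter. A SMILES String
// replaces IAtomContainer so that the sketch needs no CDK classes.
interface RawFingerprinter {
    BitSet calculate(String smiles);                   // default-length bit string
    BitSet calculate(String smiles, int size);         // caller-chosen length
    Map<String, Integer> calculateRaw(String smiles);  // key -> count
}

// Toy fingerprinter: keys are the overlapping substrings of length q.
class QGramFingerprinter implements RawFingerprinter {
    private final int q;

    QGramFingerprinter(int q) {
        if (q < 1) throw new IllegalArgumentException("q must be positive");
        this.q = q;
    }

    @Override
    public Map<String, Integer> calculateRaw(String smiles) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int i = 0; i + q <= smiles.length(); i++) {
            counts.merge(smiles.substring(i, i + q), 1, Integer::sum);
        }
        return counts;
    }

    @Override
    public BitSet calculate(String smiles) {
        return calculate(smiles, 1024); // assumed default length
    }

    @Override
    public BitSet calculate(String smiles, int size) {
        if (size < 1) throw new IllegalArgumentException("size must be positive");
        BitSet bits = new BitSet(size);
        for (String key : calculateRaw(smiles).keySet()) {
            // Fold each key into [0, size); the counts are dropped here.
            bits.set(Math.floorMod(key.hashCode(), size));
        }
        return bits;
    }
}

class QGramDemo {
    public static void main(String[] args) {
        RawFingerprinter fp = new QGramFingerprinter(4);
        System.out.println(fp.calculateRaw("CCCCO")); // {CCCC=1, CCCO=1}
        System.out.println(fp.calculate("CCCCO", 64).cardinality());
    }
}
```

In this sketch the bit string is derived from the raw key,count map, so both views come from the same computation.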
From: Rajarshi G. <raj...@gm...> - 2010-10-29 20:03:57
|
I have implemented the refactoring and an example of a class for non-binary fp (LINGO's) at http://github.com/rajarshi/cdk/tree/newfp -- Rajarshi Guha NIH Chemical Genomics Center |
From: Egon W. <ego...@gm...> - 2010-10-29 20:24:58
|
On Fri, Oct 29, 2010 at 7:40 PM, Rajarshi Guha <raj...@gm...> wrote:
> Hi, the CDK has a decent collection of binary fingerprints. However, we are missing other useful ones, such as circular fingerprints. I know that at least one group is working on a circular fingerprint, but more generally, have there been any thoughts on how we can represent arbitrary fingerprints within the API?

Just for my sanity... this is the kind of fingerprint I used in this post too, correct?

http://chem-bla-ics.blogspot.com/2010/06/qspr-modeling-with-signatures.html

I'm not so very good at remembering exact terminology... but the fingerprint I use in that post in PLS modeling is, I think, a circular fingerprint... agreed?

> Using circular fingerprints as an example, the fingerprint is essentially a list of integer or string values. More generally, one could assume a fingerprint to be a sequence of strings or integers, each with an associated count. Note that even the current hash-based fingerprints can be represented in this format.

One important difference is that the length of this fingerprint is undefined... and for every next molecule you calculate it for, the fingerprint will grow. Moreover, the order of the molecules for which they are calculated indirectly determines the meaning of the bits...

Correct?

> While one can always convert the general key,value representation into a fixed-length bit string, it's sometimes useful to work directly with the key,value pairs.

I had a glance over your patch, but I think I would like the two types of fingerprints to have different interfaces, rather than having unsupported methods and methods returning a size of -1...

> Thus, one possible refactoring of IFingerprinter is to have the following methods
>
> BitSet calculate(IAtomContainer);
> BitSet calculate(IAtomContainer, int);
> Map<T,Integer> calculateRaw(IAtomContainer);
>
> For cases such as the StandardFingerprinter, there is no change to the current behavior.
> ...

<snip>

From your other email:

On Fri, Oct 29, 2010 at 10:03 PM, Rajarshi Guha <raj...@gm...> wrote:
> I have implemented the refactoring and an example of a class for non-binary fp (LINGO's) at http://github.com/rajarshi/cdk/tree/newfp

That sounds useful. I do not know these fingerprints yet, but the support for circular fingerprints is certainly interesting... I would likely implement the interface for the fingerprint described in that blog post too (if no one beats me to it...)

My main argument for a separate interface would be that circular fingerprints are strongly tied to the data set they are calculated for...

Here too, comments appreciated!

Egon

--
Dr E.L. Willighagen
Postdoctoral Research Associate
University of Cambridge
Homepage: http://egonw.github.com/
LinkedIn: http://se.linkedin.com/in/egonw
Blog: http://chem-bla-ics.blogspot.com/
PubList: http://www.citeulike.org/user/egonw/tag/papers |
From: Nina J. <jel...@gm...> - 2010-10-29 20:42:07
|
Hi,

You might have a look at my (rather old) implementation of atom environments (reference inside the code):

https://ambit.svn.sourceforge.net/svnroot/ambit/trunk/ambit2-all/ambit2-descriptors/src/main/java/ambit2/descriptors/AtomEnvironmentDescriptor.java
https://ambit.svn.sourceforge.net/svnroot/ambit/trunk/ambit2-all/ambit2-descriptors/src/main/java/ambit2/descriptors/AtomEnvironment.java

It uses a list of integers and is wrapped as an IMolecularDescriptor. I am fine with moving this code into the CDK, if this is of interest.

Nina |
From: Rajarshi G. <raj...@gm...> - 2010-10-30 04:48:10
|
On Fri, Oct 29, 2010 at 4:24 PM, Egon Willighagen <ego...@gm...> wrote:
> On Fri, Oct 29, 2010 at 7:40 PM, Rajarshi Guha <raj...@gm...> wrote:
> Just for my sanity... this is the kind of fingerprint I used in this post too, correct?
>
> http://chem-bla-ics.blogspot.com/2010/06/qspr-modeling-with-signatures.html

Yes

> I'm not so very good at remembering exact terminology... but the fingerprint I use in that post in PLS modeling is, I think, a circular fingerprint... agreed?

Not sure that I'd call signatures circular fingerprints - but the specific fingerprint terminology doesn't really matter. The point is that many types of fingerprints are basically key,value pairs (even the current path fp's).

> One important difference is that the length of this fingerprint is undefined...

Correct

> and for every next molecule you calculate it for, the fingerprint will grow.

Why would subsequent molecules cause the fingerprint to grow? Each molecule has its own (variable-length) fingerprint.

> Moreover, the order of the molecules for which they are calculated indirectly determines the meaning of the bits...

Not sure what you mean here. For the key,value type fp's, bits do not come into the scene (unless you want to explicitly make a bit string). I don't see where a molecule order comes into this.

> I had a glance over your patch, but I think I would like the two types of fingerprints to have different interfaces, rather than having unsupported methods and methods returning a size of -1...

I like a single interface since it keeps fingerprint implementations together. Furthermore, even some current binary fingerprints could make use of the two methods. The fact that a fp impl returns -1 for the size is simply an indicator that it is of variable length.

> On Fri, Oct 29, 2010 at 10:03 PM, Rajarshi Guha <raj...@gm...> wrote:
>> I have implemented the refactoring and an example of a class for non-binary fp (LINGO's) at http://github.com/rajarshi/cdk/tree/newfp
>
> That sounds useful. I do not know these fingerprints yet,

Well, actually I'm not so sure that they are all that useful - certainly the fragments generated are not necessarily chemically meaningful. It was quick to implement and highlights my proposal. (And one more fp impl can't hurt :)

> but the support for circular fingerprints is certainly interesting... I would likely implement the interface for the fingerprint described in that blog post too (if no one beats me to it...)

How is that interface different from my proposal?

> My main argument for a separate interface would be that circular fingerprints are strongly tied to the data set they are calculated for...

I still don't see how circular fp's are dataset dependent.

--
Rajarshi Guha
NIH Chemical Genomics Center |
From: Egon W. <ego...@gm...> - 2010-10-30 07:47:25
|
Hi,

On Sat, Oct 30, 2010 at 6:48 AM, Rajarshi Guha <raj...@gm...> wrote:
>> I'm not so very good at remembering exact terminology... but the fingerprint I use in that post in PLS modeling is, I think, a circular fingerprint... agreed?
>
> Not sure that I'd call signatures circular fingerprints -

Neither would I :) But with signatures, you can create a circular FP.

> but the specific fingerprint terminology doesn't really matter. The point is that many types of fingerprints are basically key,value pairs (even the current path fp's)

But the [key,value] version of the current path-based FPs would remove the hashing.

<snip>

> Why would subsequent molecules cause the fingerprint to grow? Each molecule has its own (variable-length) fingerprint

Ah, because you do not formalize an order in the Map keys... OK, that makes sense.

>> Moreover, the order of the molecules for which they are calculated indirectly determines the meaning of the bits...
>
> Not sure what you mean here. For the key,value type fp's, bits do not come into the scene (unless you want to explicitly make a bit string)

Indeed... if you consider keys not to be numbered... I see your point. A good one too; should have thought of that... I was looking at this from a QSAR modeling perspective where at some point you must introduce this order...

<snip>

>> I had a glance over your patch, but I think I would like the two types of fingerprints to have different interfaces, rather than having unsupported methods and methods returning a size of -1...
>
> I like a single interface since it keeps fingerprint implementations together. Furthermore, even some current binary fingerprints could make use of the two methods. The fact that a fp impl returns -1 for the size is simply an indicator that it is of variable length.

Indeed. I would just rather formalize that at an interface level. I do not like unimplemented methods.

>> On Fri, Oct 29, 2010 at 10:03 PM, Rajarshi Guha <raj...@gm...> wrote:
>>> I have implemented the refactoring and an example of a class for non-binary fp (LINGO's) at http://github.com/rajarshi/cdk/tree/newfp
>>
>> That sounds useful. I do not know these fingerprints yet,
>
> Well, actually I'm not so sure that they are all that useful - certainly the fragments generated are not necessarily chemically meaningful. It was quick to implement and highlights my proposal. (And one more fp impl can't hurt :)

Indeed.... who knows... we might even compare them ;)

>> but the support for circular fingerprints is certainly interesting... I would likely implement the interface for the fingerprint described in that blog post too (if no one beats me to it...)
>
> How is that interface different from my proposal?

:) "implement the interface" -> in Java terminology: write a class implementing that interface.

>> My main argument for a separate interface would be that circular fingerprints are strongly tied to the data set they are calculated for...
>
> I still don't see how circular fp's are dataset dependent.

I would argue this is another point for having them split... BitSets are just undefined for Map<String,Integer> and can only be defined by introducing this dataset dependency: the length of the BitSet version of a circular fingerprint depends on how many molecules it has been calculated for... or, in terms of the path-based fingerprint... more molecules means more unique paths, means more keys in the Map, means more bits in the BitSet...

Egon |
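The dataset dependency described in the last paragraph can be made concrete with a small sketch. It assumes the simplest possible scheme, in which every previously unseen key is given the next free column, so the vector length grows with the data set. The class name KeyVocabulary and the example keys are hypothetical, and this is not claimed to be how any particular toolkit builds its fixed-length representation.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: assigns one column per distinct key seen so far.
class KeyVocabulary {
    private final Map<String, Integer> columnOf = new LinkedHashMap<>();

    // Registers unseen keys; column numbers follow first-seen order, so
    // they depend on which molecules were added and in which order.
    void add(Map<String, Integer> rawFingerprint) {
        for (String key : rawFingerprint.keySet()) {
            columnOf.putIfAbsent(key, columnOf.size());
        }
    }

    int size() {
        return columnOf.size();
    }

    // Converts one raw fingerprint into a count vector over the vocabulary.
    // Keys that were never registered are ignored.
    int[] toCountVector(Map<String, Integer> rawFingerprint) {
        int[] vector = new int[columnOf.size()];
        for (Map.Entry<String, Integer> entry : rawFingerprint.entrySet()) {
            Integer column = columnOf.get(entry.getKey());
            if (column != null) {
                vector[column] = entry.getValue();
            }
        }
        return vector;
    }
}

class VocabularyDemo {
    public static void main(String[] args) {
        Map<String, Integer> first = new TreeMap<>();
        first.put("C-C", 2);
        first.put("C-O", 1);
        Map<String, Integer> second = new TreeMap<>();
        second.put("C-N", 3);
        second.put("C-O", 1);

        KeyVocabulary vocabulary = new KeyVocabulary();
        vocabulary.add(first);
        System.out.println(vocabulary.size()); // 2
        vocabulary.add(second);
        System.out.println(vocabulary.size()); // 3
        System.out.println(Arrays.toString(vocabulary.toCountVector(first)));  // [2, 1, 0]
        System.out.println(Arrays.toString(vocabulary.toCountVector(second))); // [0, 1, 3]
    }
}
```

Adding the second molecule lengthens every vector, including the one for the first molecule, which is the effect being argued about. The raw key,count maps themselves are unaffected by the data set.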
From: Rajarshi G. <raj...@gm...> - 2010-10-30 13:58:51
|
On Sat, Oct 30, 2010 at 3:46 AM, Egon Willighagen <ego...@gm...> wrote:
> Hi,
>
> On Sat, Oct 30, 2010 at 6:48 AM, Rajarshi Guha <raj...@gm...> wrote:
>>> I'm not so very good at remembering exact terminology... but the fingerprint I use in that post in PLS modeling is, I think, a circular fingerprint... agreed?
>>
>> Not sure that I'd call signatures circular fingerprints -
>
> Neither would I :) But with signatures, you can create a circular FP.
>
>> but the specific fingerprint terminology doesn't really matter. The point is that many types of fingerprints are basically key,value pairs (even the current path fp's)
>
> But the [key,value] version of the current path-based FPs would remove the hashing.

I don't mean that we'd replace the binary, hashed version of the fingerprint. Just that the path,count representation (which I regard as the 'raw' representation) would/could be accessible.

> Indeed... if you consider keys not to be numbered... I see your point. A good one too; should have thought of that... I was looking at this from a QSAR modeling perspective where at some point you must introduce this order...

Actually, for QSAR, just ordering the keys would not be sufficient, due to the variable length. One approach, employed by Pipeline Pilot, is to convert key,value type fingerprints into a fixed-length representation. This indeed introduces the dataset dependency. See https://community.accelrys.com/message/2695

> Indeed. I would just rather formalize that at an interface level. I do not like unimplemented methods.

Actually, the NotImplementedException was just to get a working prototype out. Fingerprinters that cannot or do not want to return a key,value representation could return an empty Map.

> "implement the interface" -> in Java terminology: write a class implementing that interface.

Got it :)

> I would argue this is another point for having them split... BitSets are just undefined for Map<String,Integer> and can only be defined by introducing this dataset dependency: the length of the BitSet version of a circular fingerprint depends on how many molecules it has been calculated for... or, in terms of the path-based fingerprint... more molecules means more unique paths, means more keys in the Map, means more bits in the BitSet...

Not necessarily - one can simply convert them to a fixed-length fingerprint if desired by hashing each key into a fixed-length bit string. This is exactly what the current hashed fingerprints (Standard, Extended, GraphOnly) are doing.

The fact that one can also generate a dataset-dependent bit-string representation (via the link I provided) is a separate issue and can be a decision left to the user. The key thing is that the user has access to the raw key,value representation.

--
Rajarshi Guha
NIH Chemical Genomics Center |
From: Egon W. <ego...@gm...> - 2010-10-30 14:21:46
|
On Sat, Oct 30, 2010 at 3:58 PM, Rajarshi Guha <raj...@gm...> wrote:
> I don't mean that we'd replace the binary, hashed version of the fingerprint. Just that the path,count representation (which I regard as the 'raw' representation) would/could be accessible.

Sure, but can't it just do:

    class StandardFingerprinter implements IFingerprinter, ICircularFingerprinter {}

?

>> Indeed... if you consider keys not to be numbered... I see your point. A good one too; should have thought of that... I was looking at this from a QSAR modeling perspective where at some point you must introduce this order...
>
> Actually, for QSAR, just ordering the keys would not be sufficient, due to the variable length. One approach, employed by Pipeline Pilot, is to convert key,value type fingerprints into a fixed-length representation. This indeed introduces the dataset dependency. See https://community.accelrys.com/message/2695

And likely similar to what I reported in my blog...

>> Indeed. I would just rather formalize that at an interface level. I do not like unimplemented methods.
>
> Actually, the NotImplementedException was just to get a working prototype out. Fingerprinters that cannot or do not want to return a key,value representation could return an empty Map.

I'm not convinced yet...

>> I would argue this is another point for having them split... BitSets are just undefined for Map<String,Integer> and can only be defined by introducing this dataset dependency: the length of the BitSet version of a circular fingerprint depends on how many molecules it has been calculated for... or, in terms of the path-based fingerprint... more molecules means more unique paths, means more keys in the Map, means more bits in the BitSet...
>
> Not necessarily - one can simply convert them to a fixed-length fingerprint if desired by hashing each key into a fixed-length bit string. This is exactly what the current hashed fingerprints (Standard, Extended, GraphOnly) are doing.

True.

> The fact that one can also generate a dataset-dependent bit-string representation (via the link I provided) is a separate issue and can be a decision left to the user. The key thing is that the user has access to the raw key,value representation.

What about having fingerprint implementations just implement two orthogonal interfaces?

Egon |
From: Rajarshi G. <raj...@gm...> - 2010-10-30 14:41:44
|
On Sat, Oct 30, 2010 at 10:20 AM, Egon Willighagen <ego...@gm...> wrote:
> On Sat, Oct 30, 2010 at 3:58 PM, Rajarshi Guha <raj...@gm...> wrote:
>> I don't mean that we'd replace the binary, hashed version of the fingerprint. Just that the path,count representation (which I regard as the 'raw' representation) would/could be accessible.
>
> Sure, but can't it just do:
>
> class StandardFingerprinter implements IFingerprinter, ICircularFingerprinter {}
>
> ?

I suppose so, but it just seems redundant. Most of your reasons for a separate interface arise mainly due to downstream usage of the fingerprint.

>> The fact that one can also generate a dataset-dependent bit-string representation (via the link I provided) is a separate issue and can be a decision left to the user. The key thing is that the user has access to the raw key,value representation.
>
> What about having fingerprint implementations just implement two orthogonal interfaces?

That is one way out - but the behavior is not completely orthogonal. The StandardFingerprinter, for example, generates a set of String,Integer pairs which are employed to generate the final binary fingerprint.

Also, if I want to access the binary fp and the raw fp in a general manner, how would I do that?

Obviously,

    IFingerprinter fp = new LingoFingerprinter();

To access the raw fp, I have to cast to ICircularFingerprinter manually, which seems to defeat the purpose.

Basically, I see the binary fp being derived from the raw representation - and I would say this is true for all fingerprints - keyed, path-based, circular etc.

--
Rajarshi Guha
NIH Chemical Genomics Center |
From: Egon W. <ego...@gm...> - 2010-10-30 15:00:14
|
On Sat, Oct 30, 2010 at 4:41 PM, Rajarshi Guha <raj...@gm...> wrote:
> That is one way out - but the behavior is not completely orthogonal. The StandardFingerprinter, for example, generates a set of String,Integer pairs which are employed to generate the final binary fingerprint.

What about substructure fingerprints? For those based on SMARTS you indeed still have a string...

If the interfaces were split, we could also do things like write wrappers that use different approaches for mapping Map<String,Integer> onto BitSet, without having to implement the same mapping for each fingerprint type...

    ICircularFingerprinter cfp = new LingoFingerprinter();
    IFingerprinter fp = new MD5Fingerprinter(cfp);

> Also, if I want to access the binary fp and the raw fp in a general manner, how would I do that?
>
> Obviously,
>
> IFingerprinter fp = new LingoFingerprinter();
>
> To access the raw fp, I have to cast to ICircularFingerprinter manually, which seems to defeat the purpose.

Why? If you would want to have access to the ICFP one, why not just do:

    ICircularFingerprinter cfp = new LingoFingerprinter();

Or do you mean that you would not have access to the other methods then?

> Basically, I see the binary fp being derived from the raw representation - and I would say this is true for all fingerprints - keyed, path-based, circular etc.

What about this wrapping approach?

(I'm going to keep making up arguments until I have a good feeling about the outcome ;)

Egon |
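A rough sketch of the wrapping approach suggested above, under several assumptions: the two interfaces are hypothetical stand-ins for ICircularFingerprinter and IFingerprinter, a SMILES String replaces IAtomContainer, and the MD5-based folding is only one guess at what an "MD5Fingerprinter" might do. None of these names or signatures are taken from the CDK.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.BitSet;
import java.util.Map;
import java.util.TreeMap;

// Stand-in for the key,count interface (ICircularFingerprinter in the thread).
interface KeyCountFingerprinter {
    Map<String, Integer> calculate(String smiles);
}

// Stand-in for the bit-string interface (IFingerprinter in the thread).
interface BitFingerprinter {
    BitSet calculate(String smiles);
    int getSize();
}

// Wrapper: turns any key,count fingerprinter into a fixed-length bit
// fingerprinter by hashing every key with MD5 and folding it to a bit index.
class Md5FoldingFingerprinter implements BitFingerprinter {
    private final KeyCountFingerprinter wrapped;
    private final int size;

    Md5FoldingFingerprinter(KeyCountFingerprinter wrapped, int size) {
        if (size < 1) throw new IllegalArgumentException("size must be positive");
        this.wrapped = wrapped;
        this.size = size;
    }

    // Gives callers access to the raw fingerprinter behind the wrapper.
    KeyCountFingerprinter getWrapped() {
        return wrapped;
    }

    @Override
    public int getSize() {
        return size;
    }

    @Override
    public BitSet calculate(String smiles) {
        BitSet bits = new BitSet(size);
        for (String key : wrapped.calculate(smiles).keySet()) {
            bits.set(indexOf(key));
        }
        return bits;
    }

    private int indexOf(String key) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(key.getBytes(StandardCharsets.UTF_8));
            // Interpret the digest as a non-negative integer, reduce modulo size.
            return new BigInteger(1, digest).mod(BigInteger.valueOf(size)).intValue();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }
}

class WrapperDemo {
    public static void main(String[] args) {
        // Toy key,count source: every character of the SMILES string is a key.
        KeyCountFingerprinter raw = smiles -> {
            Map<String, Integer> counts = new TreeMap<>();
            for (char c : smiles.toCharArray()) {
                counts.merge(String.valueOf(c), 1, Integer::sum);
            }
            return counts;
        };
        BitFingerprinter folded = new Md5FoldingFingerprinter(raw, 128);
        System.out.println(raw.calculate("CCO")); // {C=2, O=1}
        System.out.println(folded.calculate("CCO").cardinality());
    }
}
```

With this split, the mapping from keys to bits lives in one wrapper class and can be swapped without touching the individual fingerprinters, at the price of an extra object for the common case.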
From: Rajarshi G. <raj...@gm...> - 2010-10-30 15:15:46
|
On Sat, Oct 30, 2010 at 10:59 AM, Egon Willighagen <ego...@gm...> wrote:
> On Sat, Oct 30, 2010 at 4:41 PM, Rajarshi Guha <raj...@gm...> wrote:
>> That is one way out - but the behavior is not completely orthogonal. The StandardFingerprinter, for example, generates a set of String,Integer pairs which are employed to generate the final binary fingerprint.
>
> What about substructure fingerprints? For those based on SMARTS you indeed still have a string...

That can be represented as a key,value representation: SMARTS,1 or SMARTS,0

> If the interfaces were split, we could also do things like write wrappers that use different approaches for mapping Map<String,Integer> onto BitSet, without having to implement the same mapping for each fingerprint type...
>
> ICircularFingerprinter cfp = new LingoFingerprinter();
> IFingerprinter fp = new MD5Fingerprinter(cfp);

But this does not require separate interfaces. It would work with my proposed (i.e. combined) interface. I like your approach in that it favors composition over hierarchy, but it seems uneconomical to have an interface for a single method - especially when that extra method can be considered common to all fingerprint types.

>> To access the raw fp, I have to cast to ICircularFingerprinter manually, which seems to defeat the purpose.
>
> Why? If you would want to have access to the ICFP one, why not just do:
>
> ICircularFingerprinter cfp = new LingoFingerprinter();
>
> Or do you mean that you would not have access to the other methods then?

Correct

> (I'm going to keep making up arguments until I have a good feeling about the outcome ;)

No problem :) Hopefully it will be more than just you and me :)

--
Rajarshi Guha
NIH Chemical Genomics Center |
From: Egon W. <ego...@gm...> - 2010-10-30 15:26:29
|
On Sat, Oct 30, 2010 at 5:15 PM, Rajarshi Guha <raj...@gm...> wrote:
>> ICircularFingerprinter cfp = new LingoFingerprinter();
>> IFingerprinter fp = new MD5Fingerprinter(cfp);
>
> But this does not require separate interfaces. It would work with my proposed (i.e. combined) interface. I like your approach in that it favors composition over hierarchy, but it seems uneconomical to have an interface for a single method -

Well, but I think it beats having methods that are not implemented or not useful. IMHO.

> especially when that extra method can be considered common to all fingerprint types

That I do not mind so much... it's the other way around that I do not like... fingerprints that are Map<String,Integer> only to have getSize() and getBitSet() methods...

If a fingerprinter basically just has one "Map<String,Integer> calculate()" method, then I have little trouble with that...

>> Or do you mean that you would not have access to the other methods then?
>
> Correct

That would be overcome with the wrapping approach...

    (SomeIFingerprinter.getWrappedIFingerprinter() instanceof ICircularFingerprinter) == true

Egon |
From: Rajarshi G. <raj...@gm...> - 2010-10-30 16:53:05
|
On Sat, Oct 30, 2010 at 11:25 AM, Egon Willighagen <ego...@gm...> wrote:
> On Sat, Oct 30, 2010 at 5:15 PM, Rajarshi Guha <raj...@gm...> wrote:
>>> ICircularFingerprinter cfp = new LingoFingerprinter();
>>> IFingerprinter fp = new MD5Fingerprinter(cfp);
>>
>> But this does not require separate interfaces. It would work with my proposed (i.e. combined) interface. I like your approach in that it favors composition over hierarchy, but it seems uneconomical to have an interface for a single method -
>
> Well, but I think it beats having methods that are not implemented or not useful. IMHO.

OK, but our whole descriptor package design is based on this idea :)

Some descriptors have params and some do not. When a descriptor takes no params, null values from the get param name/type methods indicate that this is the case.

>> especially when that extra method can be considered common to all fingerprint types
>
> That I do not mind so much... it's the other way around that I do not like... fingerprints that are Map<String,Integer> only to have getSize() and getBitSet() methods...
>
> If a fingerprinter basically just has one "Map<String,Integer> calculate()" method, then I have little trouble with that...

My problem with having another level of wrappers is that it complicates the API and its usage for a very commonly used function. Specialization and experimentation can easily be done via subclassing or wrapping, but common usage should be a single method call (at least for the purpose of fingerprint generation).

--
Rajarshi Guha
NIH Chemical Genomics Center |
From: Egon W. <ego...@gm...> - 2010-10-30 21:16:00
|
On Sat, Oct 30, 2010 at 6:52 PM, Rajarshi Guha <raj...@gm...> wrote:
> On Sat, Oct 30, 2010 at 11:25 AM, Egon Willighagen <ego...@gm...> wrote:
>> On Sat, Oct 30, 2010 at 5:15 PM, Rajarshi Guha <raj...@gm...> wrote:
>> Well, but I think it beats having methods that are not implemented or not useful. IMHO.
>
> OK, but our whole descriptor package design is based on this idea :)
>
> Some descriptors have params and some do not. When a descriptor takes no params, null values from the get param name/type methods indicate that this is the case.

I think having zero parameters is something different from this does-not-apply-whatsoever case... Having a zero-length fingerprint is something else than a fingerprint which just doesn't have a size at all...

>>> especially when that extra method can be considered common to all fingerprint types
>>
>> That I do not mind so much... it's the other way around that I do not like... fingerprints that are Map<String,Integer> only to have getSize() and getBitSet() methods...
>>
>> If a fingerprinter basically just has one "Map<String,Integer> calculate()" method, then I have little trouble with that...
>
> My problem with having another level of wrappers is that it complicates the API and its usage for a very commonly used function. Specialization and experimentation can easily be done via subclassing or wrapping, but common usage should be a single method call (at least for the purpose of fingerprint generation).

Well, what's not complicated about having to check if getStringIntegerMap() returns null, and then thinking if you could use a BitSet instead... (or the other way around)?

Egon |
From: Rajarshi G. <raj...@gm...> - 2010-10-30 22:23:29
|
On Sat, Oct 30, 2010 at 5:15 PM, Egon Willighagen <ego...@gm...> wrote:
> On Sat, Oct 30, 2010 at 6:52 PM, Rajarshi Guha <raj...@gm...> wrote:
>> On Sat, Oct 30, 2010 at 11:25 AM, Egon Willighagen <ego...@gm...> wrote:
>>> On Sat, Oct 30, 2010 at 5:15 PM, Rajarshi Guha <raj...@gm...> wrote:
>>> Well, but I think it beats having methods that are not implemented or not useful. IMHO.
>>
>> OK, but our whole descriptor package design is based on this idea :)
>>
>> Some descriptors have params and some do not. When a descriptor takes no params, null values from the get param name/type methods indicate that this is the case.
>
> I think having zero parameters is something different from this does-not-apply-whatsoever case... Having a zero-length fingerprint is something else than a fingerprint which just doesn't have a size at all...

I disagree. The current behavior of the descriptor package is such that if the descriptor takes no parameters it returns null. For the fingerprints, a size of -1 just indicates a variable-length fingerprint. In both cases, we use an arbitrary condition to indicate something.

>>> That I do not mind so much... it's the other way around that I do not like... fingerprints that are Map<String,Integer> only to have getSize() and getBitSet() methods...

I think it maintains the generality - a given feature fingerprint could certainly have a method to convert it to a bit string - why remove that possibility?

> Well, what's not complicated about having to check if getStringIntegerMap() returns null, and then thinking if you could use a BitSet instead... (or the other way around)?

The code just seems unnecessary - but that's my personal preference.

--
Rajarshi Guha
NIH Chemical Genomics Center |
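To make the disagreement concrete, here is a small sketch of what a caller has to do under the single-interface design, with a size of -1 marking a variable-length fingerprint and an empty Map marking "no raw form". The interface, its method names and both implementations are hypothetical and exist only for this illustration; they are not the CDK API or the code on the newfp branch.

```java
import java.util.BitSet;
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical combined interface using the sentinel conventions discussed.
interface CombinedFingerprinter {
    int getSize();                                          // -1 means variable length
    BitSet getBitFingerprint(String smiles);
    Map<String, Integer> getRawFingerprint(String smiles);  // empty if unsupported
}

// A purely binary fingerprinter: it has a size but no raw form.
class FixedOnly implements CombinedFingerprinter {
    public int getSize() {
        return 8;
    }

    public BitSet getBitFingerprint(String smiles) {
        BitSet bits = new BitSet(8);
        bits.set(Math.floorMod(smiles.hashCode(), 8));
        return bits;
    }

    public Map<String, Integer> getRawFingerprint(String smiles) {
        return Collections.emptyMap();
    }
}

// A purely key,count fingerprinter: it has a raw form but no size.
class VariableLength implements CombinedFingerprinter {
    public int getSize() {
        return -1;
    }

    public BitSet getBitFingerprint(String smiles) {
        throw new UnsupportedOperationException("no fixed-length form in this sketch");
    }

    public Map<String, Integer> getRawFingerprint(String smiles) {
        Map<String, Integer> counts = new TreeMap<>();
        for (char c : smiles.toCharArray()) {
            counts.merge(String.valueOf(c), 1, Integer::sum);
        }
        return counts;
    }
}

class SentinelDemo {
    // The caller inspects the sentinels to find out which form is usable.
    static String describe(CombinedFingerprinter fp, String smiles) {
        Map<String, Integer> raw = fp.getRawFingerprint(smiles);
        if (!raw.isEmpty()) {
            return "raw: " + raw.size() + " keys";
        }
        if (fp.getSize() >= 0) {
            return "bits: length " + fp.getSize();
        }
        return "nothing usable";
    }

    public static void main(String[] args) {
        System.out.println(describe(new FixedOnly(), "CCO"));      // bits: length 8
        System.out.println(describe(new VariableLength(), "CCO")); // raw: 2 keys
    }
}
```

One limitation of this sketch is that an empty Map cannot distinguish "raw form not supported" from "no features found", which is close to the zero-length versus no-size distinction raised earlier in the thread.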
From: Egon W. <ego...@gm...> - 2010-10-30 22:50:53
|
On Sun, Oct 31, 2010 at 12:23 AM, Rajarshi Guha <raj...@gm...> wrote: <snip> You got my chat message? If no one has anything to say, I'll accept your code in its current form on Monday. Egon -- Dr E.L. Willighagen Postdoctoral Research Associate University of Cambridge Homepage: http://egonw.github.com/ LinkedIn: http://se.linkedin.com/in/egonw Blog: http://chem-bla-ics.blogspot.com/ PubList: http://www.citeulike.org/user/egonw/tag/papers |
From: Rajarshi G. <raj...@gm...> - 2010-10-30 22:52:25
|
On Sat, Oct 30, 2010 at 6:50 PM, Egon Willighagen <ego...@gm...> wrote: > On Sun, Oct 31, 2010 at 12:23 AM, Rajarshi Guha <raj...@gm...> wrote: > > <snip> > > You got my chat message? No - what was in it? -- Rajarshi Guha NIH Chemical Genomics Center |
From: Egon W. <ego...@gm...> - 2010-10-30 22:55:49
|
On Sun, Oct 31, 2010 at 12:52 AM, Rajarshi Guha <raj...@gm...> wrote: > On Sat, Oct 30, 2010 at 6:50 PM, Egon Willighagen > <ego...@gm...> wrote: >> On Sun, Oct 31, 2010 at 12:23 AM, Rajarshi Guha <raj...@gm...> wrote: >> >> <snip> >> >> You got my chat message? > > No - what was in it? Google chat... might come as email later... Egon -- Dr E.L. Willighagen Postdoctoral Research Associate University of Cambridge Homepage: http://egonw.github.com/ LinkedIn: http://se.linkedin.com/in/egonw Blog: http://chem-bla-ics.blogspot.com/ PubList: http://www.citeulike.org/user/egonw/tag/papers |
From: <nl...@us...> - 2010-10-30 23:01:58
|
Recently, I have been working on a similar problem. The solution I've implemented uses two interfaces, FeatureStorage and FeatureExtractor. FeatureStorage is implemented e.g. by a class HashKeyFingerprint or CountHashTable which essentially stores the "raw data" mentioned here. The FeatureExtractor interface is implemented by a PathExtractor, CycleExtractor and SubtreeExtractor. This way different data types can easily be created by combining different types of features defined by the user. The implementation does not need to store the memory-consuming raw data if only a fingerprint is required. This technique works well for me and is quite flexible. However, I'm not sure if this approach is also applicable for the CDK. I just thought this might be of interest. You may check out the source code at: http://scaffoldhunter.svn.sourceforge.net/viewvc/scaffoldhunter/trunk/src/subsearch/index/features/ Best regards Nils On Friday 29 October 2010 19:40:55 Rajarshi Guha wrote: > Hi, the CDK has a decent collection of binary fingerprints. However > we are missing other useful ones such as circular etc. I know that > at least one group is working on a circular fingerprint, but more > generally, have there been any thoughts on how we can represent > arbitrary fingerprints within the API? Using circular fingerprints > as an example, the fingerprint is essentially a list of integer or > string values. More generally, one could assume a fingerprint to > be a sequence of strings or integers, each with an associated > count. Note that, even the current hash based fingerprints can be > represented in this format. > > While one can always convert the general key,value representation > into a fixed length bit string, it's sometimes useful to directly > work with the key,value pairs. 
Thus, one possible refactoring of > IFingerprinter is to have the following methods > > BitSet calculate(IAtomContainer); > BitSet calculate(IAtomContainer, int); > HashSet<T,Integer> calculateRaw(IAtomContainer); > > For cases such as the StandardFingerprinter, there is no change to > the current behavior. On the other hand, for a circular > fingerprint, the first two methods could be used to convert the > raw circular fingerprint to a desired bitstring, while one can > always access the raw fingerprint with the third method. > > Comments welcome |
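The extractor/storage split described in the message above could look roughly like the Java sketch below. Only the names FeatureStorage, FeatureExtractor, HashKeyFingerprint and CountHashTable come from that message; every method signature here is an assumption and may differ from the Scaffold Hunter sources. ToySubstringExtractor is a hypothetical stand-in for the path, cycle and subtree extractors, and a String again stands in for a molecule.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Interface names follow the message; the method signatures are assumed.
interface FeatureStorage {
    void add(String feature);
}

interface FeatureExtractor {
    void extract(String molecule, FeatureStorage sink);
}

/** Hashes each feature straight into a bit; no raw data is retained. */
final class HashKeyFingerprint implements FeatureStorage {
    private final BitSet bits;
    private final int size;

    HashKeyFingerprint(int size) {
        if (size <= 0) {
            throw new IllegalArgumentException("size must be positive");
        }
        this.size = size;
        this.bits = new BitSet(size);
    }

    @Override
    public void add(String feature) {
        bits.set(Math.floorMod(feature.hashCode(), size));
    }

    BitSet bits() {
        return (BitSet) bits.clone();
    }
}

/** Keeps the raw feature -> count map. */
final class CountHashTable implements FeatureStorage {
    private final Map<String, Integer> counts = new HashMap<>();

    @Override
    public void add(String feature) {
        counts.merge(feature, 1, Integer::sum);
    }

    Map<String, Integer> counts() {
        return counts;
    }
}

/** Hypothetical extractor: emits overlapping substrings of length q. */
final class ToySubstringExtractor implements FeatureExtractor {
    private final int q;

    ToySubstringExtractor(int q) {
        if (q <= 0) {
            throw new IllegalArgumentException("q must be positive");
        }
        this.q = q;
    }

    @Override
    public void extract(String molecule, FeatureStorage sink) {
        for (int i = 0; i + q <= molecule.length(); i++) {
            sink.add(molecule.substring(i, i + q));
        }
    }
}
```

Because the extractor pushes features into whichever storage it is given, the same extractor can fill a bit fingerprint or a count table, and the bit fingerprint case never materialises the raw map.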