No problem,

There will be a blog post at some point but the main entry point is: HashGeneratorMaker.
You build up the configuration by chaining the parameters; the last call in the chain defines what type of hash you want (per atom or per molecule). The middle calls define which properties you want to consider. The 'depth' indicates how much of its environment each atom should 'feel' (like signatures). 'perturbed()' breaks ties by systematically testing equivalent atoms. This of course takes more time, so a good approach is to use a simple hash first, followed by a more complete one to double check.

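MoleculeHashGenerator generator = new HashGeneratorMaker().depth(8)
                                                          .elemental()  // property to consider (just an example choice)
                                                          .molecular(); // last call: one hash per molecule
// the generator's state is fixed, so you only need to make it once and then reuse it
for (IAtomContainer m : ms) {
    long hash = generator.generate(m);
}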

You can then put these in a Map<Long,IAtomContainer>, or even Multimap<Long,IAtomContainer> so you could do the double checking if needed.
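
A rough sketch of that bucketing and double checking, in case it helps. This is plain Java with no CDK in it: the HashBuckets class and its methods are made up for illustration, a Map of Lists stands in for the Multimap, and Strings with toy hash functions stand in for molecules and the two generators.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToLongFunction;

// Hypothetical helper, not part of CDK: buckets items by a cheap hash, then
// re-checks only the buckets with more than one member using a stronger hash.
class HashBuckets {

    // Group items by hash value; a Map of Lists plays the role of a Multimap.
    static <T> Map<Long, List<T>> group(Iterable<T> items, ToLongFunction<T> hasher) {
        Map<Long, List<T>> buckets = new LinkedHashMap<>();
        for (T item : items) {
            buckets.computeIfAbsent(hasher.applyAsLong(item), k -> new ArrayList<>()).add(item);
        }
        return buckets;
    }

    // Candidate duplicate groups: items that share both the cheap and the strong hash.
    static <T> List<List<T>> duplicates(Iterable<T> items,
                                        ToLongFunction<T> cheap,
                                        ToLongFunction<T> strong) {
        List<List<T>> result = new ArrayList<>();
        for (List<T> bucket : group(items, cheap).values()) {
            if (bucket.size() < 2) continue; // unique under the cheap hash, nothing to confirm
            for (List<T> confirmed : group(bucket, strong).values()) {
                if (confirmed.size() > 1) result.add(confirmed);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Strings stand in for molecules; length is a deliberately weak "hash".
        List<String> ms = List.of("CCO", "OCC", "CCO", "CCN");
        System.out.println(duplicates(ms, s -> (long) s.length(), s -> (long) s.hashCode()));
    }
}
```

With real molecules you would pass something like m -> simple.generate(m) and m -> complete.generate(m) for the two hash functions, where simple and complete are two generators you have configured yourself. Only the buckets that collide under the cheap hash pay for the expensive one.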

Of course, as it's a hash code you're playing a game of chance, so depending on your configuration and the odds you may get a collision. Strictly speaking you should check with an isomorphism test; unfortunately the current implementations are less specific than the hash code (e.g. the hash code can discriminate the 9 isomers of inositol). I do mean to write a faster isomorphism checker (not general to subgraph) but haven't got round to that yet.

Oh, and to print the hash, 'Long.toHexString(hash)' is best, or base64.
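
For what it's worth, here is a small sketch of both print forms using only the standard library. The HashPrinting class and its method names are made up for the example; the big-endian byte order for the base64 form is my own assumption, any consistent choice would do.

```java
import java.nio.ByteBuffer;
import java.util.Base64;

// Hypothetical helper for printing 64-bit hash values compactly.
class HashPrinting {

    // Unsigned hexadecimal, no leading zeros (so the width varies).
    static String toHex(long hash) {
        return Long.toHexString(hash);
    }

    // Fixed-width variant: always 16 hex digits, easier to line up and sort.
    static String toPaddedHex(long hash) {
        return String.format("%016x", hash);
    }

    // Base64 of the 8 bytes of the long (big-endian), without '=' padding.
    static String toBase64(long hash) {
        byte[] bytes = ByteBuffer.allocate(Long.BYTES).putLong(hash).array();
        return Base64.getEncoder().withoutPadding().encodeToString(bytes);
    }

    public static void main(String[] args) {
        long hash = 255L; // stand-in for generator.generate(m)
        System.out.println(toHex(hash));
        System.out.println(toPaddedHex(hash));
        System.out.println(toBase64(hash));
    }
}
```

The base64 form is always 11 characters for a long, against up to 16 for hex, which adds up if you are writing out a lot of them.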

The HashCodeScenarios provide a couple of use cases which will give some more details.


On 19 Jun 2013, at 15:48, Rajarshi Guha <> wrote:

On Wed, Jun 19, 2013 at 10:34 AM, John May <> wrote:
Yes, the hash code should help with this.

Could you point me to the relevant class / example?
Also, not exactly clear what Murcko provides but to get the ring systems you can use, This will give you biconnected components in linear time and separate isolated cycles from the fused systems.

I'll take a look at this 

Rajarshi Guha |
NIH Center for Advancing Translational Science