Re: [Open Babel] [OpenBabel-Devel] Symmetry versus OB's valence model


On Nov 28, 2006, at 8:55 PM, Craig A. James wrote:

>   CC(=O)[O-].[Na+]
> In nature, the two oxygens are symmetrically equivalent.  But the  
> valence model of chemistry has no way to represent half charges, so  
> Open Babel represents this as an asymmetrical molecule.

Well, the valence model of chemistry has no problem with half  
charges. Computer representations of the valence model? That's a  
different story. I think most chemists are happy to draw "1.5" bond  
orders and such. To some degree, that's the valence bond definition  
of conjugated bonds.

Yes, it's sometimes hard to put that in a computer. The valence bond  
model can become "tricky" because it's really a way of qualitatively  
describing quantum mechanics.

Your C(=O)[O-] example is only one such case. Another is the nitro
group: N(=O)=O vs. [N+]([O-])=O. Any other conjugated system could
cause the same problem.

> However, in OB's symmetry analysis, OBMol::GetGIDVector(), OB  
> considers the two oxygen atoms of this molecule to be symmetrically  
> equivalent.  ...  The result is two atoms that are declared  
> "identical" when they plainly are not.

You said this was a "philosophical" discussion. So let's stick to  
philosophy. Clearly the two oxygen atoms in this ion are *chemically*  
identical. The carboxylate is, in fact, symmetric. IMHO,
GetGIDVector() is, in fact, returning the chemically correct answer.

> There is a disastrous practical consequence of this  
> "philosophical" debate.  Many algorithms "walk the graph" of a  
> molecule to find features, and use a symmetry analysis for  
> efficiency to cut down on redundant traversal.  Imagine, for  
> example, a fingerprinting algorithm that enumerates short paths.

OK, but in my humble philosophical opinion (IMHPO), if your algorithm  
is supposed to fingerprint molecules and it doesn't have some element  
of a chemical expert system, it may in fact make chemically incorrect  
statements.

> It walks down the C-C=O path and adds it to the fingerprint.  Then  
> it looks at the other oxygen and says, "hey, that's identical to  
> the one I just looked at, so I'll skip it!"  Then, the next time  
> this substructure shows up, it might just happen that the  
> fingerprinter walks down the "C-C-[O-]" path first, and skips the  
> "C-C=O" path.  You have two identical functional groups, but two  
> different fingerprints.

While I agree that using symmetry analysis can improve efficiency,  
you've hit on a key point. Sometimes forcing bonds to have integer  
orders and two end points results in asymmetric *bond*  
representations of symmetric molecules. (Which is why most QM
programs ignore the concept of bonds altogether.)
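
A minimal sketch of the failure mode described above, using a toy
acetate graph in plain Python rather than Open Babel. Everything here is
an assumption for illustration: the ATOMS/BONDS tables, the hard-coded
SYM_CLASS map (standing in for what a symmetry analysis might report,
not actual GetGIDVector() output), and the path_strings() helper. It
enumerates three-atom paths and skips any neighbor whose symmetry class
has already been explored.

```python
# Toy acetate: atom index -> (element, formal charge); atom pair -> bond order.
ATOMS = {0: ("C", 0), 1: ("C", 0), 2: ("O", 0), 3: ("O", -1)}
BONDS = {(0, 1): 1, (1, 2): 2, (1, 3): 1}
# Assumed symmetry classes: the two oxygens share a class.
SYM_CLASS = {0: 0, 1: 1, 2: 2, 3: 2}
BOND_SYMBOL = {1: "-", 2: "="}


def label(atom):
    """Render an atom SMILES-style; only charges of +1/-1 are handled."""
    element, charge = ATOMS[atom]
    if charge == 0:
        return element
    return f"[{element}{'+' if charge > 0 else '-'}]"


def bond_order(a, b):
    return BONDS.get((a, b)) or BONDS[(b, a)]


def neighbors(atom, visit_order):
    """Neighbors of `atom`, listed in the caller-supplied visiting order."""
    return [n for n in visit_order
            if n != atom and ((atom, n) in BONDS or (n, atom) in BONDS)]


def path_strings(start, visit_order, prune_by_symmetry):
    """Collect all 3-atom paths from `start` as strings."""
    found = set()

    def walk(path, text):
        if len(path) == 3:
            found.add(text)
            return
        seen_classes = set()
        for nxt in neighbors(path[-1], visit_order):
            if nxt in path:
                continue
            if prune_by_symmetry:
                # Skip atoms "identical" to one already explored from here.
                if SYM_CLASS[nxt] in seen_classes:
                    continue
                seen_classes.add(SYM_CLASS[nxt])
            step = BOND_SYMBOL[bond_order(path[-1], nxt)] + label(nxt)
            walk(path + [nxt], text + step)

    walk([start], label(start))
    return found


forward = path_strings(0, [0, 1, 2, 3], prune_by_symmetry=True)
backward = path_strings(0, [3, 2, 1, 0], prune_by_symmetry=True)
unpruned = path_strings(0, [0, 1, 2, 3], prune_by_symmetry=False)
```

With pruning on, the result depends on which oxygen is visited first;
with pruning off, both paths are always recorded. The symmetry classes
are not wrong here. The path labels are what differ, because they are
built from the asymmetric bond orders and charges.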

Now if you're saying that the OBBond class is really limited, I'd  
agree. (Let's not get into multi-center bonds, hydrogen bonds,  
agostic interactions...) If you're saying that the integer formal  
charges in OBMol are limited, I'd agree. But these are also common  
limitations of chemical formats (SMILES and SDF spring to mind,
although these are hardly alone).

If you want my personal philosophy, it's that GetGIDVector() is, in  
fact, representing the correct chemistry and your fingerprint  
algorithm needs to be careful. For example, perform an initial pass  
to make sure carboxylate, nitro, or other groups are in a canonical  
form.
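
One possible reading of such an initial pass, sketched in plain Python
on a toy graph rather than on OBMol (which, as noted, stores integer
formal charges). The normalize_carboxylates() helper and the dict-based
molecule layout are hypothetical. The choice of "canonical form" is also
an assumption: each C(=O)[O-] is rewritten so that both C-O bonds have
order 3/2 and both oxygens carry charge -1/2. Nitro and other groups
would need their own rules.

```python
from fractions import Fraction


def normalize_carboxylates(atoms, bonds):
    """Return copies of atoms/bonds with each simple C(=O)[O-] made symmetric.

    atoms: index -> (element, formal charge); bonds: (i, j) -> bond order.
    """
    atoms = dict(atoms)
    bonds = {frozenset(pair): order for pair, order in bonds.items()}

    def degree(atom):
        return sum(1 for pair in bonds if atom in pair)

    for carbon, (element, _) in list(atoms.items()):
        if element != "C":
            continue
        oxo, oxide = [], []
        for pair, order in bonds.items():
            if carbon not in pair:
                continue
            (other,) = pair - {carbon}
            # Only terminal oxygens can belong to a carboxylate.
            if atoms[other][0] != "O" or degree(other) != 1:
                continue
            if order == 2 and atoms[other][1] == 0:
                oxo.append(other)
            elif order == 1 and atoms[other][1] == -1:
                oxide.append(other)
        # Deliberately narrow: exactly one =O and one [O-] on this carbon.
        if len(oxo) == 1 and len(oxide) == 1:
            for oxygen in (oxo[0], oxide[0]):
                atoms[oxygen] = ("O", Fraction(-1, 2))
                bonds[frozenset((carbon, oxygen))] = Fraction(3, 2)
    return atoms, bonds


# The same acetate drawn two ways: the double bond on atom 2, or on atom 3.
form_a = ({0: ("C", 0), 1: ("C", 0), 2: ("O", 0), 3: ("O", -1)},
          {(0, 1): 1, (1, 2): 2, (1, 3): 1})
form_b = ({0: ("C", 0), 1: ("C", 0), 2: ("O", -1), 3: ("O", 0)},
          {(0, 1): 1, (1, 2): 1, (1, 3): 2})

normalized_a = normalize_carboxylates(*form_a)
normalized_b = normalize_carboxylates(*form_b)
```

After this pass the two drawings compare equal, so a path walker that
skips symmetry-equivalent oxygens would see the same labels whichever
one it visits first.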

Just my $0.02, but it's a good discussion to continue.

Cheers,
-Geoff