Learn how easy it is to sync an existing GitHub or Google Code repo to a SourceForge project! See Demo

Close

#386 Atom types

Accepted
closed
master (162)
5
2012-10-28
2011-07-12
Asad
No

Please find attached missing atom type patches for the CDK and few correction have been made in the existing ones.

The patches reflect the blog http://chembioinfo.wordpress.com/2011/07/06/improved-atom-typing-cdk/

Discussion

  • Your patch seems to cause 16 regressions in the CDKAtomTypeMatcherTest after I applied it to cdk-1.4.x. Is it critical to apply to master instead of cdk-1.4.x?

     
  • Otherwise, I discussed the patch with Gilleain, in particular regarding splitting up the patch into smaller patches. Particularly, because the current patch also makes whitespace changes (don't make whitespace and {} changes in the same patch is functional changes), it is hard to see what is going on. For example, the patch feels like it changes atom type IDs? Why is that? That is going to break things...

    Please also put one atom type in one unit test, rather than all in one test method, such as is now with e.g. the method fix_Fe().

    This new test class should also be included in McoreTests.java.

    And several atom types miss lone pair count and/or hybridization information.

     
  • Asad
    Asad
    2011-07-14

    Updated patch as required

     
    Attachments
  • Asad
    Asad
    2011-07-14

    Hi Egon, thanks for the review.

    We have updated the patch.

    Whats DONE:

    a) Old IDs in the ontology files were reverted as suggested.
    b) New ids and the updated ontology was added
    c) New mol files were added for 128 test case for the missing atom types
    d) Test cases was synchronised with CDKAtomTypeMatcherTest.java (CDKAtomTypeMatcherTest.java extends CDKAtomTypeMatcherEBITest.java)
    e) Hence AtomTypeMatcher was also updated.

    Here are the test results from the result-core.txt

    Testsuite: org.openscience.cdk.modulesuites.McoreTests
    Tests run: 503, Failures: 0, Errors: 1, Time elapsed: 6.815 sec


    Testcase: test4Sulphur(org.openscience.cdk.atomtype.CDKAtomTypeMatcherSMILESTest): Caused an ERROR
    null
    java.lang.NullPointerException
    at org.openscience.cdk.atomtype.CDKAtomTypeMatcherSMILESTest.test4Sulphur(CDKAtomTypeMatcherSMILESTest.java:169)

    Note: This case is rightly received by the CDK atom type matcher as this configuration of sulphur is not possible. SO its an invalid error.

    Whats NOT Done:

    We have added 128 missing test cases and clubbed them as ~50 cases according to the element types. If you would like to break them into smaller pieces, please feel free to do so.

    Hope this helps.

    Many thanks,

    Asad

     
  • Thanx Asad!

    I know that fixing a patch of this size is a lot of work. That's why small patches, which you can submit soon, are so good. Easy to review, quick feedback, which you can use for later patches.

    I will look at this patch today.

    I think I have not been clear in what I wanted with CDKAtomTypeMatcherTest... in that class, I do not want any tests that depend on IO... I should add that to the class JavaDoc. The reason for that is that I want those unit tests to do their work with the least possible dependencies, to make sure that when they fail, it's because the atom typing fails, and not, for example, some ChemObjectReader got broken (which happened in the past).

    Anyway, I'll look at it as soon as possible.

     
  • Asad
    Asad
    2011-07-14

    Well testOla28 in CDKAtomTypeMatcherFilesTest uses IO...

    The other thing is many missing tests now exist, and all pass. "Tests run: 503, Failures: 0,"

     
  • Yes, CDKAtomTypeMatcherTest and CDKAtomTypeMatcherFilesTest are different. I am saying that tests with files are not good; they are. I just want them in a different class.

    But, at the same time, I also prefer to have all atom types to have unit tests without IO.

    (There is an easy way to convert MDL molfiles to CDK source code; you can do the same thing I do when creating unit tests based on PubChem entries... See http://chem-bla-ics.blogspot.com/2010/09/using-pubchem-to-create-cdk-unit-tests.html)

    But please wait for me to review the current patch.

    BTW, does you new patch you uploaded today also add the missing hybridization information?

     
  • Asad
    Asad
    2011-07-14

    We have fixed hybridization to best of our knowledge.

    To our my knowledge Hybridization is defined for atoms which are covalently connected with other atoms, so in cases like ionic bonds, metallic atoms(just the atom as it is) such as Li,Na,K,W,Ni.metallic,Ni.2plus,Pu,Th,Al,Fe.metallic,In,Co.metallic,Cd.metallic,Hg.metallic, Pb.neutral, Zn etc will not have any hybridization.

    While cases such as S.2minus without connected though belonging to P-block elements I am not sure whether they have any hybridization.

    Asad & Nimish

     
  • Asad
    Asad
    2011-07-14

    We have fixed hybridization to best of our knowledge.

    To our my knowledge Hybridization is defined for atoms which are
    covalently connected with other atoms, so in cases like ionic bonds,
    metallic atoms(just the atom as it is) such as
    Li,Na,K,W,Ni.metallic,Ni.2plus,Pu,Th,Al,Fe.metallic,In,Co.metallic,Cd.metallic,Hg.metallic,
    Pb.neutral, Zn etc will not have any hybridization.

    While in cases such as S.2minus without connected atoms (though it belongs to P-block
    elements), we are not sure if it has any hybridization.

    Asad & Nimish

     
  • Please file a bug report for the test4Sulphur() indicating you believe that test is wrong.

    The patch is too large for me to review tonight.. I already spent an hour on looking at it...

    What I did spot so far is:

    • String[] expectedTypes = {"Se.3", "H", "H"};
    • String[] expectedTypes = {"Se.2", "H", "H"}; // Se.3 changed to Se.2

    (there still is one atom type renamed)

    This causes a regression in the Sybyl atom type matcher.

    and comments that are no longer valid, now that many atom types are no longer renamed:

    • String[] expectedTypes = {"F", "I.minus.5", "F"};
    • String[] expectedTypes = {"F", "I.minus.5", "F"}; // I.minus.5 to I.minus.2

    I do not know if metallic atoms, or ions without any valence electrons show hybridization either.
    I spotted at least these atom types without hybridization which does have neighbors: N.minus, S.2minus

    BTW, and this is a request, not q requirement,... if you can remove those "// Fix" comments, that would be appreciated.

    Can you explain the hybridization of the S.sp3.4 atom type, which seems to have 12 electrons in four sp3 orbitals? Same for S.sp3d1 which has 12 electrons in five sp3d1 orbitals?

    I am not sure I understand the F.minus.1 atom type... can you explain that one? E.g. give a PubChem or ChemSpider entry with a compound containing such an atom type?

    I also like to hear why you changed the definition of this atom type:

    <at:AtomType rdf:ID="Cl.chlorate">
    <at:formalCharge>0</at:formalCharge>
    <at:hasElement rdf:resource="&amp;elem;Cl"/>
    <at:formalNeighbourCount>3</at:formalNeighbourCount>
    - <at:lonePairCount>0</at:lonePairCount>
    + <at:lonePairCount>1</at:lonePairCount>
    <at:piBondCount>2</at:piBondCount>
    - <at:hybridization rdf:resource="&amp;at;sp2"/>
    + <at:hybridization rdf:resource="&amp;at;sp3"/>
    </at:AtomType>

    OK, not done yet. But I am also wondering about the Ca.2plus.2 atom type... that doesn't look like a valid atom type to me... can you explain this one too?

    That leaves me with 16% of the patch to go...

     
  • OK, I just have been rethinking the situation... the patch is huge and ugly... that troubles me:

    1. I spend too much time on understanding what happens: whitespace is not the only problem, but also rearrangements
    2. it has temporary comments like "// fix X" which are meaning less without any information on what fix X is
    3. the whitespace changes ruin any change of merging atom type fixes easily from cdk-1.4.x to master (and worries me a lot, because we'll have those in parallel for at least 9 months)
    4. I rather see the patch aimed at the cdk-1.4.x branch. That way we can make many people happy soon with the new atom types, rather than in 9 months or more
    5. the patches not only add stuff, the change stuff too, which makes it rather difficult to oversee the consequences
    6. this code is so deep inside the CDK, I am extra hard on getting this right

    I really want to see small patches, without whitespace changes, without rearrangements, and without brackets that were not there before, and basically only add code.

    Please split the three patches in one patch for one element where that element already has atom types. Put all elements which did not have atom types yet into one big patch. And remove all comments which refer to your development process (like '// Fix foo').

    Please make such a patch for cupper or so, with the full test file. Make CDKAtomTypeMatcherTest not extends CDKAtomTypeMatcherEBITest. Instead, add a unit test for each atom type for the first without using IO (if you need help on automating creation of such unit tests for the test files you have, please let me know; I think I can do that in an hour or so, which is half the time I spent tonight on reviewing this huge patch).

    As a reference patch, look at this one:

    https://github.com/cdk/cdk/commit/90371fd74808b28d8a819e4c8bf9fd3788a15678

    This is what I like to have a patch look like: it only adds code. That way, it is crystal clear what changes, and then an atom type can be reviewed in a few minutes, including compiling and checking for remote regressions.

     
  • Asad
    Asad
    2011-07-22

    We have resubmitted the patches.

    KEGG Atom type patches - ID: 3374133

    Hope this is works out for the reviewers.

    Asad