peter murray-rust wrote:
> At 22:56 24/02/2006, Craig A. James wrote:
>> I just read the MDL CTFile Format spec, and the "M END" line is not
>> optional.
>
> I agree. We have worked extensively on this, including support from MDL,
> and this line is mandatory.
>
>> I would *strongly* recommend against implementing "forgiving" file
>> parsers that read incorrect file formats without complaining. Or if
>> we must do this (and sometimes we just need to get that molecule in,
>> right now!), we should use a "strict/nostrict" option that is "strict"
>> by default and rejects errors.
>
> The latest nightly builds of JUMBO now test a wide range of CML files
> for conformance - there are several hundred valid examples on
> http://cvs.sourceforge.net/viewcvs.py/cml/schema23/examples
> JUMBO now has ca 1700 unit tests which check the correctness of the
> schema, the namespaces and the examples. Although it has taken a long
> time to get there, it is now a usable system for validation. It can be
> used to validate CML input and output to OB and also serves as an
> example of conformance which could be incorporated into Babel.
> Ultimately I think that OB will have the ability to test for conformance
> - the difficulty is that only relatively few formats (CIF, CML...) use
> syntaxes which allow formal conformance although others (SMILES, MOL)
> have enough documentation to come up with reasonable implementations.
>
>> I'm having a problem with OpenBabel right now because it accepts
>> SMILES that are plainly wrong, and the users don't get the answers
>> they expect. Type in "C18" into http://www.chmoogle.com, and what do
>> you get? The wrong answer, because OpenBabel accepts "C18" as
>> identical to "C", and ignores the "dangling" bond closures, when
>> plainly what I meant was some 18-carbon molecule. Or worse, type in
>> "c1cc2ccccc2cc". Clearly you meant naphthalene, but forgot the final
>> "2", but OpenBabel interprets this as "CCc1cccc1CC", which is just
>> plain wrong. Another example: "c1cccc1" is plainly wrong, it can't be
>> aromatic, but OpenBabel recasts it to cyclopentane, and the search
>> results are just plain wrong.
>>
>> In all of these cases, the user would be better served by an error
>> message stating what's wrong with the SMILES (or, in the case quoted
>> above, that the "M END" line is missing).
>>
>> Accepting incorrect file formats is a short-term hack that leads to
>> long-term problems. If we must do this, PLEASE add a "nostrict"
>> option, and only accept the erroneous file formats if the user
>> explicitly specifies "nostrict".
OpenBabel's approach has been to convert where it can, not to insist
on strict conformance, and to be very sparing with error messages. I
would be in favour of continuing that approach because I think OB
should be a useful tool, not a means of correcting deviant
behaviour. I *do* agree that there should be more warning
messages, and how this could be done is discussed below.
The intent in SD files without M END lines is clear, and if they
are commonly found or produced by a mainstream program (as in this
case) I think OB should handle them. Hassan should not be
prevented from doing useful work because of Accelrys's
misjudgements on file formatting.
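
As a minimal sketch of the "handle it, but say so" behaviour argued for
here (Python for brevity; ensure_m_end is a hypothetical helper, not an
OpenBabel function, and it assumes a single bare mol block with no SD
data items after it), a reader could repair the block and return a
warning by default, and reject it only when strictness is requested:

```python
def ensure_m_end(block, strict=False):
    """Return (block, warnings) for one mol block.

    Hypothetical helper for illustration only. If the terminating
    "M  END" line is present the block is returned unchanged. If it is
    missing, strict mode raises ValueError; otherwise the line is
    appended and a warning is returned alongside the repaired block.
    """
    lines = block.splitlines()
    # Tolerate any run of whitespace between "M" and "END".
    if any(line.split() == ["M", "END"] for line in lines):
        return block, []
    if strict:
        raise ValueError("mol block has no 'M  END' line")
    lines.append("M  END")
    return "\n".join(lines) + "\n", ["missing 'M  END' line was added"]
```

Whether strict should default to True or False is exactly the point
under discussion; the sketch only shows that both behaviours can share
one code path.
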
On a facetious note, if I were being "strict" I would have to
reject Craig James's paragraph on naphthalene as nonsense, since
"c1cc2ccccc2cc2" (after including "the final 2") is not
naphthalene. (It should be a final 1.) But, since I prefer a laxer
style of interpretation, I can see that he has made a good point.
Incidentally, OB will interpret "c1cc2ccccc2cc" as
"C=Cc1ccccc1C=C" because it is using a non-standard extension of
SMILES mainly aimed at representing radicals (c1ccccc1c is the
benzyl radical). Provided such extensions do not lead to ambiguity
I think they should be accepted on input, but of course would need
to be explicitly enabled on output, where, as Peter points out, the
default should be "strict".
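
To make the SMILES examples concrete, here is a rough sketch (Python,
not OpenBabel code; dangling_ring_closures is a hypothetical name) that
reports ring-closure labels which are opened but never closed. It
assumes that digits outside square brackets, and %nn pairs, are
ring-closure labels, and it does not look at aromaticity or valence at
all, so it says nothing about a case like "c1cccc1". A caller could
turn the result into a warning or a rejection as preferred:

```python
DIGITS = "0123456789"

def dangling_ring_closures(smiles):
    """Return the sorted list of ring-closure labels left unmatched."""
    open_labels = set()
    in_bracket = False
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        label = None
        if ch == "[":
            in_bracket = True          # digits in [...] are not closures
        elif ch == "]":
            in_bracket = False
        elif not in_bracket:
            pair = smiles[i + 1:i + 3]
            if ch == "%" and len(pair) == 2 and all(c in DIGITS for c in pair):
                label = int(pair)      # two-digit closure such as %12
                i += 2
            elif ch in DIGITS:
                label = int(ch)
        if label is not None:
            # First sighting opens the label, the second one closes it.
            if label in open_labels:
                open_labels.remove(label)
            else:
                open_labels.add(label)
        i += 1
    return sorted(open_labels)

print(dangling_ring_closures("C18"))             # [1, 8]
print(dangling_ring_closures("c1cc2ccccc2cc"))   # [1]
print(dangling_ring_closures("c1cc2ccccc2cc2"))  # [1, 2]
print(dangling_ring_closures("c1cc2ccccc2cc1"))  # []
```
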
I doubt whether the MDL formats are sufficiently defined to be
used in a "strict" interpretation. In this group last year there
was discussion of 0D stereochemistry in mol files. The up and down
properties on bonds are now commonly used for tetrahedral
stereochemistry, although I suspect they were originally intended
for double bond cis/trans isomers. There is now apparently no way
to represent these in a mol file without atom coordinates. The
V3000 specification, which I presume was an attempt to clean up
the inconsistencies in V2000, even fails to say what the CFG
property on bonds means. OB has had to go along with the current
practice.
Error messages from parsers can often be daunting and
incomprehensible to a novice because they examine the details
rather than the meaning. It's often better that the quality
control should be the human checking the meaning, rather than the
computer being picky with the details. The novice user typing
"C18" into OB's Windows GUI would see that the converted
output (even in an unfamiliar format) was not what he wanted, since
the output is always displayed. Even better for chemists is a
graphical representation of the molecule, as in the Mac GUI and
other interfaces under development. I think imaginative ways of
presenting the chemistry to the user for validation are the
preferred approach. Modern cheminformatics programs do it this way...
> Here's an additional suggestion - a BabelTidy, similar to HTMLTidy. This
> would read broken formats (or non-broken ones) and issue warnings and
> errors. Then the user could have the option to heed them. The final
> output would ALWAYS be correct syntactically (although always the chance
> of corruption through the errors). For example:
> openbabel -tidy -ismi a/*.smi -osmi b/*.smi
> could be used to clean the input files into the output files.
> there could be options like:
> -strict (all output from OB should pass this!)
> -MDL_END add missing M END
> -MDL_dollars add missing $$$$
> etc.
> Ideally the output files would contain metadata to say they were
> converted, but only CML (possibly CIF) has a mechanism for adding metadata.
>
> This is a labour of love, but I think it would continue to establish OB
> as a leading high-quality component.
>
OpenBabel already has a multilevel warning/error message system
which gives the opportunity to provide comprehensive and possibly
verbose commentary on the quality of the input format, while
giving the user control over how much of this is displayed. I
think the current framework can handle the tidying proposed above
by just converting to the same format - the output is always
"strict" unless otherwise requested. So (for Windows):
babel a/*.smi b/*.smi --errorlevel 3
cleans up the files and describes the defects.
To make this properly useful, we need many more warning messages
about defects in input files (indeed a labour of love), rather
than just accepting or rejecting the file as is often the case at
present. Also possibly useful would be a refinement in the error
levels and maybe a maximum severity addition to the output
message, e.g. "2 molecules converted (with level 3 warnings)". The
user can be aware of how much of a risk he is taking while not being
impeded in getting the work done.
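
A sketch of what such a summary line might look like (Python for
illustration; conversion_summary is a hypothetical function, and the
assumption that a lower level number means a more severe message is
mine, so swap min for max if the scale runs the other way):

```python
def conversion_summary(n_converted, message_levels):
    """Build the final status line from the levels of all messages raised.

    message_levels is a list of integer levels, one per message.
    Assumption for this sketch: lower number = more severe.
    """
    noun = "molecule" if n_converted == 1 else "molecules"
    summary = f"{n_converted} {noun} converted"
    if message_levels:
        worst = min(message_levels)   # most severe level seen
        summary += f" (with level {worst} warnings)"
    return summary

print(conversion_summary(2, [3, 3]))  # 2 molecules converted (with level 3 warnings)
print(conversion_summary(1, []))      # 1 molecule converted
```
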
Chris