Re: [Cdk-devel] RFC #15: what IO should and should not do

Hello,

> This is a key point. But so often a new author creates their own format 
> because they don't know of the toolkits. It makes sense to add CDK/OB as 
> input and output adapters for new authors.
I've already worked with a JNI adapter to OELib/OpenBabel, so it's 
technically possible. But you also get all the additional dynamic library 
path problems. OK, you can read system properties using the Ant helper 
classes, but that's definitely a problem for standard users!
Object transparency across the JNI connection (Java/C/C++) is not that 
easy, but it's definitely possible with some work and by avoiding JNI 
pitfalls.
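
To make the library path problem concrete, here is a minimal sketch of a
check one could run before calling System.loadLibrary. The class name
NativeLibraryLocator and the library name used with it are hypothetical;
only System.mapLibraryName and the java.library.path system property are
standard Java. It is an illustration, not the adapter I actually used.

```java
import java.io.File;
import java.util.Optional;

/** Hypothetical helper: looks for a native library before JNI tries to load it. */
final class NativeLibraryLocator {

    /**
     * Searches each directory of a path-separator-delimited search path
     * for the platform-specific file name of the given library.
     */
    static Optional<File> find(String libName, String searchPath) {
        // e.g. "foo" becomes "libfoo.so" on Linux or "foo.dll" on Windows
        String fileName = System.mapLibraryName(libName);
        for (String dir : searchPath.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue; // skip empty entries in the search path
            }
            File candidate = new File(dir, fileName);
            if (candidate.isFile()) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }
}
```

Calling find("openbabel", System.getProperty("java.library.path", ""))
and reporting a readable message when the result is empty would at least
spare standard users a bare UnsatisfiedLinkError. The library name
"openbabel" here is an assumption.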

I would also favour establishing PDB input under JOELib (the OpenBabel 
analogue; export is more or less available), because it can change IO 
types dynamically, which is not possible under OpenBabel.
I've set up an alternative SDF reader/writer for my project partner, 
which can be switched on or off simply in the property file!
This includes functionality for specific formats, parsing tasks, 
error logs and so on ...
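
A minimal sketch of the property-file switch, to show the idea only. All
names here (MoleculeReader, ReaderRegistry, the key "io.sdf.reader") are
hypothetical and are not the JOELib API; the only real library class is
java.util.Properties.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import java.util.function.Supplier;

/** Hypothetical reader abstraction; not the JOELib or CDK interface. */
interface MoleculeReader {
    String describe();
}

final class DefaultSdfReader implements MoleculeReader {
    public String describe() { return "default SDF reader"; }
}

final class AlternativeSdfReader implements MoleculeReader {
    public String describe() { return "alternative SDF reader"; }
}

/** Picks a reader implementation per format from a property file entry. */
final class ReaderRegistry {
    private final Map<String, Supplier<MoleculeReader>> known = new HashMap<>();
    private final Properties config;

    ReaderRegistry(Properties config) {
        this.config = config;
        known.put("default", DefaultSdfReader::new);
        known.put("alternative", AlternativeSdfReader::new);
    }

    /** Looks up "io.<format>.reader"; falls back to "default" if unset. */
    MoleculeReader readerFor(String format) {
        String choice = config.getProperty("io." + format + ".reader", "default");
        Supplier<MoleculeReader> factory = known.get(choice);
        if (factory == null) {
            throw new IllegalArgumentException(
                "Unknown reader '" + choice + "' for format " + format);
        }
        return factory.get();
    }
}
```

With this layout, a single line such as io.sdf.reader=alternative in the
property file swaps the implementation without recompiling anything.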

Regards, Joerg

> 
>> And XYZ
>> is especially bad since it has so little actual data capabilities!
>>
>> I know of many stories similar to yours. And that's not counting the
>> untold hours people have had to "repair" molecular data that was misread
>> by some program.
>>
>> > unacceptable. I agree that it is not the primary role of a volunteer group
>> > to solve these problems when the software and informatics industry is so
>> > broken, but we should try to agree some level of acceptability.
>>
>> I agree. I think it's as good a place as any to start fixing the problems.
>> Writing files with non-published specs is risky. But there are some de
>> facto standards in certain circumstances (i.e. XYZ seems to have been set
>> by Babel 1.x as far as I can tell).
>>
>> > >  so I'm
>> > >making some educated guesses.
>> > This can be dangerous if the algorithms are not clear. The XML community
>> > took a very early decision that error recovery was never possible and that
>> > any error completely invalidates the file. This means the user must find a
>> > way of getting the errors corrected by the suppliers.
>>
>> Fair enough, but what if there are no suppliers?
> 
> 
> "suppliers" meant whoever produced the files. I agree they may have 
> disappeared!
> 
> 
>> > >  Check the output carefully."
>> > Unfortunately the errors often propagate in ways that are difficult to spot.
>>
>> True as well.
>>
>> But let's take an academic environment. Grad student A writes an in-house
>> FORTRAN program (source code now lost) that spits out
>> incorrectly-formatted PDB files. Grad student B, now 10 years later, is
>> supposed to resurrect those files and do some research with them.
>>
>> Should Babel or CDK simply say "no, I refuse to read that file -- it's not
>> really a PDB file?" or shouldn't it have a mode that allows the user to
>> say "I know I might get garbage output, but I want you to do your best?"
> 
> 
> Here's a suggestion. It's similar to HTML Tidy
> 
> *IFF* it is known that the o/p is from this program there is little 
> problem. It would be possible to convert from BrokenPDBA to proper PDB. 
> And maybe CDK/OB is the right way to do that.
> 
> *IF* it is known that the file is in a collection of files written by 
> the *same* program then it is acceptable to do document analysis to 
> guess the likely format and meaning
> 
> *IF* it is possible to recognise the format as a known mutant, then 
> inform the user, in much the same way as HTML Tidy works (it guesses the 
> likely HTML spec). But Tidy can still mangle bad HTML (we had one 
> recently where 60% of the file disappeared - my fault).
> 
> ELSE you are on very dangerous ground. You can't guess the originator 
> and you can misinterpret atoms and atom properties so badly.
> 
> HOWEVER. Perhaps we could aim for "MOLTidy" along the lines of HTMLTidy. 
> The program would read the file and say:
> "This is conformant to PDB V2.0" - no problems.
> "This appears to be PDB-MSI" parse on that basis
> "This is proprietary PDB" - i.e. you take your chances.
> There could also be PDB-OB/CDK where OB/CDK/JOELib/otherOS agreed on a 
> subset of PDB that they would accept
> 
> I would be very happy to settle for MOLTidy. It's achievable - though 
> hard work.
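
A rough sketch of how such a MOLTidy verdict could be computed for PDB
coordinate records. The fixed columns for x, y and z (31-38, 39-46,
47-54) are taken from the PDB format; everything else is an assumption
for illustration: the class name, the three verdicts, and above all the
"free-form" variant, which is an invented stand-in for a known mutant
such as PDB-MSI, not a description of any real dialect.

```java
import java.util.List;

/** Hypothetical sketch of a MOLTidy-style conformance check for PDB atoms. */
final class MolTidySketch {
    enum Verdict { CONFORMANT, KNOWN_VARIANT, UNRECOGNISED }

    /** True if x, y, z parse from the fixed PDB columns 31-38, 39-46, 47-54. */
    static boolean hasFixedColumnCoordinates(String line) {
        if (line.length() < 54) {
            return false;
        }
        return isNumber(line.substring(30, 38))
            && isNumber(line.substring(38, 46))
            && isNumber(line.substring(46, 54));
    }

    /** Invented "mutant": whitespace-separated fields with >= 3 decimals. */
    static boolean hasFreeFormCoordinates(String line) {
        int decimals = 0;
        for (String token : line.trim().split("\\s+")) {
            if (token.contains(".") && isNumber(token)) {
                decimals++;
            }
        }
        return decimals >= 3;
    }

    static boolean isNumber(String text) {
        try {
            Double.parseDouble(text.trim());
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    /** Classifies a file by checking every ATOM/HETATM record. */
    static Verdict classify(List<String> lines) {
        boolean sawAtom = false;
        boolean allFixed = true;
        boolean allFreeForm = true;
        for (String line : lines) {
            if (!line.startsWith("ATOM") && !line.startsWith("HETATM")) {
                continue; // only coordinate records are inspected here
            }
            sawAtom = true;
            allFixed &= hasFixedColumnCoordinates(line);
            allFreeForm &= hasFreeFormCoordinates(line);
        }
        if (!sawAtom) return Verdict.UNRECOGNISED;
        if (allFixed) return Verdict.CONFORMANT;
        if (allFreeForm) return Verdict.KNOWN_VARIANT;
        return Verdict.UNRECOGNISED;
    }
}
```

The point of the design is that the verdict is reported before any
parsing happens, so the user learns which interpretation is about to be
applied. A file that mixes record styles falls through to UNRECOGNISED
rather than being guessed at.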
> 
> 
>> These are not easy problems and there are no easy answers. But there are
>> many cases where "get correct files from your suppliers" isn't an option.
>> It's easy to do this with XML, which is setting a present standard. But
>> with PDB and XYZ chemical formats (among others), there's a vast amount of
>> chemical data in "non-compliant" files. We didn't create the problem, but
>> can we turn our backs on it?
>>
>> My $0.02 goes something like this:
>> * write standard files
>>  - where no published standards exist, search for a de facto standard
>> * read standard files
>> * test thoroughly
>>
>> Everybody agrees with this, as far as I can tell. Then we reach the messy
>> region. IMHO, we cannot simply refuse to read non-compliant files. We must
>> educate and inform, but if we refuse, then users will turn to other solutions
>> which may be worse than our best effort. Maybe "exit gracefully" is the
>> default, but we still need an override feature for informed users to say
>> "I know this is risky, but I can't let this data disappear."
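
The "exit gracefully by default, override for informed users" idea could
look roughly like this. It is a sketch under stated assumptions: it only
handles the atom lines of an XYZ-style file ("element x y z"), it treats
any line that does not have exactly four fields as malformed, and the
names LenientXyzSketch, Mode and Result are hypothetical rather than part
of CDK, OpenBabel or JOELib.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: strict reading by default, best effort on request. */
final class LenientXyzSketch {
    enum Mode { STRICT, BEST_EFFORT }

    /** Accepted atom lines plus a warning for every line that was skipped. */
    static final class Result {
        final List<String> atoms = new ArrayList<>();
        final List<String> warnings = new ArrayList<>();
    }

    static Result readAtomLines(List<String> lines, Mode mode) {
        Result result = new Result();
        for (int i = 0; i < lines.size(); i++) {
            String[] fields = lines.get(i).trim().split("\\s+");
            if (isAtomRecord(fields)) {
                result.atoms.add(String.join(" ", fields));
                continue;
            }
            String problem = "line " + (i + 1) + ": expected 'element x y z'";
            if (mode == Mode.STRICT) {
                // default behaviour: refuse the file, name the offending line
                throw new IllegalArgumentException(problem);
            }
            // override: keep going, but never silently
            result.warnings.add(problem);
        }
        return result;
    }

    private static boolean isAtomRecord(String[] fields) {
        if (fields.length != 4) {
            return false;
        }
        try {
            Double.parseDouble(fields[1]);
            Double.parseDouble(fields[2]);
            Double.parseDouble(fields[3]);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }
}
```

In BEST_EFFORT mode the caller gets back every warning alongside the
data, so the cost/benefit decision stays with the informed user instead
of being hidden inside the reader.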
>>
>> Don't get me wrong. Legacy data in sketchy formats can lead to disaster.
>> But I think there has to be a possibility for an informed user
>> cost/benefit analysis. Otherwise we're turning our backs on helping get
>> that data into reliable, modern, standard-compliant form.
>>
>> As I said, just my $0.02.
>>
>> Cheers,
>> -Geoff
> 
> 
> Best
> 
> P.
> 
> 
> 

-- 
Dipl. Chem. Joerg K. Wegner
Univ. Tuebingen, Computer Architecture, Sand 1, D-72076 Tuebingen, Germany
Tel. (+49/0) 7071 29 78970, Fax (+49/0) 7071 29 5091
E-Mail: mailto:we...@in...
WWW:    http://www-ra.informatik.uni-tuebingen.de