Re: [Joelib-devel] Cheminformatics meta project

Hi all,

>>Are we (QSAR, CDK, JOELib, Octet, Jumbo) trying to do too much at the same
>>time?
> Maybe, but I think that things are going fine as they go now... we approach 
> things step by step... I guess we are mostly just glueing existing tools 
> together...
Maybe ... step by step ... and we first need a common, merged 
interface before any concrete implementation can help us improve the 
current design.

>>(3) Wouldn't it be even more useful if project we're planning interacted
>>with a single "standard" Java API for accessing and manipulating Molecular
>>information?
>>(4) Yes it would, 
> focus on chemical entities only... very difficult to make the 'single
> standard API'... Chemistry is too fuzzy, too broad...
> But this does not mean that we can define 'a standard Java API' which 
> glues together a few existing projects...
Let's start with the 'glued' interface; if people plan to write their 
own implementation, they can do that. But first we must find a common 
interface... combining the currently available open source projects may 
become interesting at a later stage.
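
To make the 'glued' interface idea a bit more concrete, here is a 
minimal sketch in Java. Everything in it is an assumption for 
illustration only: CommonMolecule, ToolkitAMolecule, ToolkitBMolecule, 
the two adapters and HeavyAtomCounter are hypothetical names, not 
classes from CDK, JOELib or any other project. It only shows the 
pattern: a small common interface, one thin adapter per existing 
toolkit, and client code that depends on the interface alone.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical common interface: only what every toolkit can offer.
interface CommonMolecule {
    int getAtomCount();
    String getAtomSymbol(int index);
}

// Stand-in for one toolkit's native molecule class (not real project code).
class ToolkitAMolecule {
    final List<String> symbols = new ArrayList<>();
}

// Stand-in for a second toolkit with a different internal representation.
class ToolkitBMolecule {
    String[] elements = new String[0];
}

// Adapter that glues toolkit A's class to the common interface.
class ToolkitAAdapter implements CommonMolecule {
    private final ToolkitAMolecule mol;
    ToolkitAAdapter(ToolkitAMolecule mol) { this.mol = mol; }
    public int getAtomCount() { return mol.symbols.size(); }
    public String getAtomSymbol(int index) { return mol.symbols.get(index); }
}

// Adapter that glues toolkit B's class to the common interface.
class ToolkitBAdapter implements CommonMolecule {
    private final ToolkitBMolecule mol;
    ToolkitBAdapter(ToolkitBMolecule mol) { this.mol = mol; }
    public int getAtomCount() { return mol.elements.length; }
    public String getAtomSymbol(int index) { return mol.elements[index]; }
}

// Client code (for example a simple descriptor) written against the interface only.
class HeavyAtomCounter {
    static int count(CommonMolecule m) {
        int n = 0;
        for (int i = 0; i < m.getAtomCount(); i++) {
            if (!"H".equals(m.getAtomSymbol(i))) {
                n++;
            }
        }
        return n;
    }
}
```

With such a layout a descriptor is written once and runs on either 
toolkit, and anyone who prefers their own implementation only has to 
implement the interface. Whether a real common interface could stay 
this small is exactly the open question.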

>>but such a thing doesn't exist! How can we ensure that
>>the new API will be general enough, robust, and useful? 
> I don't think we can...
At the moment, I don't think we have the time ... hey, these are open 
source projects, so we can always refactor things in the future ...

>>My point is this: would it be useful to tackle the problem of developing a
>>single standard Molecular API separately from the development of a QSAR
>>framework?
> Interesting, but I don't think we can easily come up with the solution to this 
> problem... (if it was easy, it was already done...)
Correct. Of course refactoring is much easier than developing 
functionality, but there are still some really nasty problems out there. 
I'm optimistic that we can iterate towards a common interface and a 
common API, but this will need time ... it's still hard enough to 
maintain the currently available projects, because they still have some 
open performance problems and bad designs in them.

And simply 'merging' the functionality is difficult, because it may 
demand a difficult refactoring. You surely know the current lines of 
code (LOC):
ChemicalMarkupLanguage: 30285
CDK: 43772
JOELib: 63761
http://pmd.sourceforge.net/scoreboard.html
So, assuming that a good developer reads 1000 LOC/day and understands 
them and all the dependencies, he will need 30+44+64=138 days (about 
4 1/2 months) to understand all the projects. Only then can he start 
with refactoring and testing, so ... hope you get paid for one year of 
producing nothing :-) So are LOC a good measure of productivity? No, 
but ... that's another problem, and outside the focus of the QSAR 
project.
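
For anyone who wants to redo the estimate with other numbers, the 
arithmetic above can be sketched in a few lines of Java. The class and 
method names (ReadingEstimate, projects, totalLoc, daysToRead) are made 
up for this mail; the LOC figures and the 1000 LOC/day reading speed are 
the ones quoted above, and the reading speed is of course only an 
assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ReadingEstimate {
    // Assumed reading speed from the estimate above: 1000 lines of code per day.
    static final int LOC_PER_DAY = 1000;

    // LOC figures as quoted in this mail.
    static Map<String, Integer> projects() {
        Map<String, Integer> loc = new LinkedHashMap<>();
        loc.put("ChemicalMarkupLanguage", 30285);
        loc.put("CDK", 43772);
        loc.put("JOELib", 63761);
        return loc;
    }

    // Sum of the lines of code over all projects.
    static int totalLoc(Map<String, Integer> loc) {
        int total = 0;
        for (int lines : loc.values()) {
            total += lines;
        }
        return total;
    }

    // Days needed to read the given number of lines, rounded up to whole days.
    static int daysToRead(int totalLoc) {
        return (totalLoc + LOC_PER_DAY - 1) / LOC_PER_DAY;
    }
}
```

Calling ReadingEstimate.daysToRead(ReadingEstimate.totalLoc(ReadingEstimate.projects())) 
gives 138 days for the 137818 lines in total, which matches the rough 
30+44+64 sum.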

> Interesting, too... OpenBabel is struggling with atom types in file conversion 
> (i.e., I think they still are...)... which indicates only part of the 
> problems...
I've discussed this topic with Geoff, but as always ... there are other 
things to do. We do have exactly the same chemistry 'kernels', but this 
was checked 'by hand', because our assignment algorithms are partially 
hard-coded, so it is still suboptimal.

> Jakarta is a much simpler working area... all the results are artificial... 
> that is, they don't have to match with nature... so they don't really care on 
> how things should be interpreted, only that they work...
I agree ... chemoinformatics is still strongly connected to science, 
because we still need standards, which are in progress ... CML, 'expert 
systems', interfaces, ...

Unfortunately, as already criticized by Kubinyi (or at least cited by 
him), the pharmaceutical industry could contribute more to helping set 
a standard.
So, refactoring does not help me publish papers and does not help the 
pharmaceutical industry reduce its data piles. Of course it can be 
helpful for the future, but the financial pressure might be high for 
them and for us ... so who cares about a good hypothetical future 
standard that facilitates maintenance? So let's work with 
shell scripts, they are fast and have a built-in copy protection, but 
that's unrealistic :-)

As Egon already said ... let's iterate ... step by step ... nothing 
is excluded ... but nothing should be included too early either ...

Kind regards, Joerg

-- 
Dipl. Chem. Joerg K. Wegner
Center of Bioinformatics Tuebingen (ZBIT)
Department of Computer Architecture
Univ. Tuebingen, Sand 1, D-72076 Tuebingen, Germany
Phone: (+49/0) 7071 29 78970
Fax: (+49/0) 7071 29 5091
E-Mail: mailto:we...@in...
WWW:    http://www-ra.informatik.uni-tuebingen.de
--
Never mistake motion for action.
                                     (E. Hemingway)

Never mistake action for meaningful action.
                                (Hugo Kubinyi,2004)