From: Joerg W. <we...@in...> - 2004-04-19 16:54:07
|
Greetings,

> well, then the first question - what about Weka performance?
> (It eats a lot of memory when working with large data sets)
>
> > R is similar and a long time ago I used the interface under Java very
> > briefly ... we're Matlab based!
>
> I like using Matlab and it is quite useful;
> but Matlab itself is not open source, it could be an obstacle

The same holds for the representation of molecules. WHY?

1. Weka splits everything into attributes and instances, and also into nominal and numeric attributes. This costs memory, but it is quite useful, because it is not clear from a series such as 1,NaN,3,4,2,1 whether this is a nominal classification or a numeric regression problem! I understand your point; in fact I've implemented a DescriptorMatrix class for JOELib (joelib.desc.data) which holds only the matrix with descriptor names and molecules, but this causes a lot of problems for algorithm development, because the interface cannot distinguish the above series by default. I simply used a matrix2weka mapping tool. That's why a student of mine implemented a second interface, to have both possibilities, which also holds the molecules directly in a Weka-related context. For my current problem I need a wild mix of nominal and numeric attributes, and it is clearer if the attributes already hold this information, so that I do not always have to implement helper classes for both cases.

2. In general it is useful to cache data sets (already available as DescriptorMatrixCache) to avoid multiple copies in memory. The cross-validation can be fetched from the cached versions. Furthermore, optimization algorithms need a common DB-like interface or caching mechanism to load required data sets only once (singleton class interface).

3. It is not possible to compete with fast matrix operations; for those, R or Matlab should be used, because specifically optimized code is needed. Java has Jama and COLT, and some Weka add-ons use them, but this can never be compared to assembler-optimized code.

Kind regards, Joerg

Joerg Kurt Wegner
Center of Bioinformatics Tuebingen (ZBIT)
Department of Computer Architecture
Univ. Tuebingen, Sand 1, D-72076 Tuebingen, Germany
Phone: (+49/0) 7071 29 78970
Fax: (+49/0) 7071 29 5091
E-Mail: mailto:we...@in...
WWW: http://www-ra.informatik.uni-tuebingen.de
--
Never mistake motion for action. (E. Hemingway)
Never mistake action for meaningful action. (Hugo Kubinyi,2004)
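
The point about the series 1,NaN,3,4,2,1 in the message above can be made concrete with a small sketch. It is not Weka or JOELib code; the names `Attribute`, `parse_column` and `learning_task` are hypothetical and only illustrate, under the assumption that each column carries a declared type, how the same raw cells lead to two different learning problems.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass(frozen=True)
class Attribute:
    """A named column that carries its own type (hypothetical, Weka-like in spirit)."""
    name: str
    kind: str  # either "numeric" or "nominal"


def parse_column(attr: Attribute, raw: List[str]) -> List[Optional[Union[float, str]]]:
    """Interpret raw string cells according to the declared attribute type.

    Missing cells ("NaN" or empty) become None for both kinds.
    """
    values: List[Optional[Union[float, str]]] = []
    for cell in raw:
        cell = cell.strip()
        if cell == "" or cell.lower() == "nan":
            values.append(None)
        elif attr.kind == "numeric":
            values.append(float(cell))
        elif attr.kind == "nominal":
            values.append(cell)  # keep the label unchanged, no ordering implied
        else:
            raise ValueError(f"unknown attribute kind: {attr.kind}")
    return values


def learning_task(target: Attribute) -> str:
    """A numeric target means regression, a nominal target means classification."""
    return "regression" if target.kind == "numeric" else "classification"


# The same raw series, read under two different declarations.
raw_cells = "1,NaN,3,4,2,1".split(",")
numeric_target = Attribute("activity", "numeric")
nominal_target = Attribute("activity", "nominal")
as_numeric = parse_column(numeric_target, raw_cells)
as_nominal = parse_column(nominal_target, raw_cells)
```

A bare matrix of numbers has no place to store the `kind` field, which is the limitation of a plain descriptor matrix described above.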
From: Joerg W. <we...@in...> - 2004-04-19 16:08:14
|
Hi Egon,

so add me as developer and, if the others agree, as admin. My focus lies on extensive documentation (DocBook) and on merging some code, because my (internal) code will go public anyway. We can also write an open source review about QSAR ... mhh ... fine, I like this idea.

For DocBook I would prefer SGML (jade), because I'm more familiar with that. I know that you are using XML for Jmol. But let's discuss things in more detail on the QSAR list.

Kind regards, Joerg

Dipl. Chem. Joerg K. Wegner
Center of Bioinformatics Tuebingen (ZBIT)
Department of Computer Architecture
Univ. Tuebingen, Sand 1, D-72076 Tuebingen, Germany
Phone: (+49/0) 7071 29 78970
Fax: (+49/0) 7071 29 5091
E-Mail: mailto:we...@in...
WWW: http://www-ra.informatik.uni-tuebingen.de
--
Never mistake motion for action. (E. Hemingway)
Never mistake action for meaningful action. (Hugo Kubinyi,2004)
From: E.L. W. <eg...@sc...> - 2004-04-19 07:42:29
|
Good morning all,

Last Friday I requested a new project, QSAR - http://www.sf.net/projects/qsar, which got approved. This project's goal is to guide the discussion and software development that were discussed on the cdk...@li... and joe...@li... lists last week.

I would like to stress that this new SF project does not intend to reinvent the wheel at all, but is aimed at:

- writing down a requirement analysis
- developing a GUI that uses CDK, JChemPaint, Jmol, JOELib (alphabetical order) and other projects for QSAR model building

(More details are available in the thread and soon on the website.)

Furthermore, keep in mind that though I set up this project, it is not my intent to 'lead' the project such that my vote counts more than others'.

I've set up a mailing list (it still has to be approved), which can be subscribed to at this page:

http://lists.sourceforge.net/lists/listinfo/qsar-devel

If you'd like to join (which I hope), please send me your SF account name, so that I can add you to the project. I would also like to repeat Peter's suggestion to join the IRC chat channel (for newbies: XChat is a very good IRC client which runs on most platforms) at #cdk on the irc.freenode.net server. (Note that when joining a channel the '#' is part of the name.)

Egon

--
eg...@sc...
PhD on Molecular Representation in Chemometrics
Nijmegen University
http://www.cac.sci.kun.nl/people/egonw/
GPG: 1024D/D6336BA6
From: Joerg W. <we...@in...> - 2004-04-18 10:45:55
|
Hi all,

> > I do not agree with opening a project of its own, there is much code
> > out there: Weka, YALE (includes Weka interface) and XML, commercial
> > stuff with Weka interface (Xalopy or what was the correct name ?)
> A new project does not mean that available pieces cannot be used...

So who can decide which classes are the best main classes? Do you know a critical mass of Weka and JOELib classes you can use? Do I know all the CDK classes I can use? What about R, Yale, JavaNNS (SNNS successor), JavaEVA (EVA successor), libSVM, 'feature extraction', clustering ...

> > I think we do not want to invent a new data mining standard, such
> > discussions are more useful for the Weka mailing list and all
> > available Matlab algorithm providers (toolboxes !!!) ...
> Not everyone prefers to work with Matlab... Matlab is not free, neither
> is the PLS Toolbox... What's the URL for Weka?

Google: Weka, Java, Data Mining

That's irrelevant; I have plenty of 'feature extraction' methods, and you do not have to buy commercial toolboxes: there is a lot of free stuff, or use R ...
... the problem is mixing it all together ... I use these things and I'm far from feeling experienced enough to define a common interface!
I think this is more of an evolutionary process: use it and then you find ways to facilitate the usage, but a facilitated usage causes a more complex interface, so ... every new API requires time to understand its approach ... and can save development time ...

> > - the MaximumCommonSubstructure (MCS) algorithms
> Is this an improved algorithm, or similar to that in CDK?

1. I can assign different chemical graph labels
   1.1. basic atom types
   1.2. general PATTY
   1.3. atom properties threshold
   1.4. atom properties difference
2. MCS by clique detection
   2.1. Bron-Kerbosch (exact)
   2.2. DFMax (fast heuristic, non-exact)
3. multiple MCS
   3.1. HSCS (Sheridan approach)
   3.2. stochastic version
4. feature reduction step available for 1.1.-1.3.

Besides these things, there is also the incremental graph isomorphism algorithm for SMARTS matching (Ullmann variant with backtracking).

> > Sorry, CDK for descriptors is not obvious to me, please explain. As you
> > can tell, I do not agree for several reasons, as already discussed
> > previously, e.g. missing atom typer and missing substructure search !
> CDK *has* substructure search, implemented in a rather flexible way.

Graph isomorphism is not the same as substructure search! (See the definition of Subgraph/Substructure by Rücker/Rücker.) Or which expert systems do you use to assign the graph labels of the 'attributed graph'? (In general: things I criticize in my submitted paper!) In fact, nearly every software package uses its own 'labelling', so which one is correct? Standard? The isomorphism is not the problem, because we are talking about exact matching; of course there are other kinds of matching, like ... here you will need an optimization algorithm, like our JavaEVA library ...

> > Descriptor dependencies
> > are NOT all linear 2D dependencies as already excellently mentioned by
> > Nikolova/Jaworska. So where is the advantage in showing them in 2D or 3D ?
> > That's mainly irrelevant and misleading ! A 2D plot is only one
> > possibility for the model quality, and not always the best one !!!
> What kind of 2D are you talking about here?

E.g. plain correlation plots between descriptorXYZ and predictedVALUE. Such things can be helpful, but such an approach is similar to visual 'feature selection' on one feature, and it is well known that important features are not the best ones from the standpoint of generalization ability (see Eibe/Witten or my submitted paper, if accepted :-)

> I have no idea what a data mining API is... data mining is a rather vague
> term... like chemometrics API.

That's the point !!!

I'm more interested in implementing all required methods and extensions in JOELib/CDK, because the hypothetical interface will access these methods anyway! Furthermore, I'm more interested in implementing access/algorithm speed-ups. That's what I call 'maintenance' problems. The libraries are still complex, so I'm more interested in writing more examples, more tutorials, including more literature references, ...

Kind regards, Joerg

Dipl. Chem. Joerg K. Wegner
Center of Bioinformatics Tuebingen (ZBIT)
Department of Computer Architecture
Univ. Tuebingen, Sand 1, D-72076 Tuebingen, Germany
Phone: (+49/0) 7071 29 78970
Fax: (+49/0) 7071 29 5091
E-Mail: mailto:we...@in...
WWW: http://www-ra.informatik.uni-tuebingen.de
--
Never mistake motion for action. (E. Hemingway)
Never mistake action for meaningful action. (Hugo Kubinyi,2004)
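
For readers unfamiliar with the clique-detection step named in the list above ("MCS by clique detection", "Bron-Kerbosch (exact)"), here is a minimal sketch of the basic Bron-Kerbosch enumeration of maximal cliques, without pivoting. It is not the JOELib implementation. It assumes the compatibility graph of the two molecules (whose largest clique corresponds to a maximum common substructure) has already been built and is given as an adjacency dictionary; building that graph from atom labels is omitted, and the function name `bron_kerbosch` is chosen here for illustration.

```python
from typing import Dict, List, Set


def bron_kerbosch(adj: Dict[int, Set[int]]) -> List[Set[int]]:
    """Return all maximal cliques of an undirected graph (basic variant, no pivoting)."""
    cliques: List[Set[int]] = []

    def expand(r: Set[int], p: Set[int], x: Set[int]) -> None:
        # r: current clique, p: candidates that extend r, x: vertices already handled
        if not p and not x:
            cliques.append(set(r))  # r cannot be extended, so it is maximal
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p.remove(v)  # v is done: exclude it from further extensions of r
            x.add(v)

    expand(set(), set(adj), set())
    return cliques


# Toy graph: a triangle 0-1-2 with an extra edge 2-3.
toy_graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
maximal_cliques = bron_kerbosch(toy_graph)
largest_clique = max(maximal_cliques, key=len)
```

The enumeration is exact but has exponential worst-case cost, which is the usual motivation for faster heuristic alternatives on larger graphs.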
From: Egon W. <eg...@sc...> - 2004-04-17 19:46:18
|
On Saturday 17 April 2004 19:25, Joerg Wegner wrote:
> > > I suggest starting not with deciding what program to write but with
> > > what the components of a QSAR system are and then deciding who
> > > wants to be involved, what we have got, and setting some realistic
> > > scope to what is achievable
>
> Of course, I like QSAR ... but time is scarce and who will implement
> things ... you know that's my default comment ...
>
> Egon, I've read your mail ... and yes, I'm still on holiday ... and I do
> check e-mails and I have been working on QSAR for 3 years ... so holiday
> means I can read fantasy books and can do things I like, e.g. read some
> QSAR papers !:-)
> Holiday and spare time are curious things ... aren't they :-)

:)

> > It seems there is general agreement that an SF project in this area is
> > valuable and I'll make a few comments which I hope are helpful. Please
> > ignore if they aren't.
>
> I do not agree with opening a project of its own, there is much code out
> there: Weka, YALE (includes Weka interface) and XML, commercial stuff
> with Weka interface (Xalopy or what was the correct name ?)

A new project does not mean that available pieces cannot be used...

> I think we do not want to invent a new data mining standard, such
> discussions are more useful for the Weka mailing list and all
> available Matlab algorithm providers (toolboxes !!!) ...

Not everyone prefers to work with Matlab... Matlab is not free, neither is the PLS Toolbox... What's the URL for Weka?

> ... and such discussions are not new (see Weka mailing list) !!!
> I think we are interested in providing the most usable approach
> with implemented algorithms available, so let's use the already
> available ones and extend them !!!

Absolutely. If that has not been clear so far, I prefer to use existing stuff as much as possible, but I do prefer some tools over others, which is the case in general too, so we need to develop wrappers using a unified interface...

> IMHO:
> !!! The problem is not the missing 'data mining' standard. The problem
> is the misuse of
> 1. a general molecular-structure-coding with these standard algorithms !!!
> 2. applying these algorithms correctly
> So let's focus on this problem first !!!

Not everyone agrees on how methods/algorithms should be applied... but I agree that there is plenty of weird use of methods in some QSAR research... I think that providing people with an easy-to-use, clear and well-defined program will make it much easier to teach others what things should be taken into account when making models...

> This is a problem of CDK and JOELib
> and only once we have solved this can we solve the next one.
> Furthermore I will publish in the near future:
> - the extended Weka interface

Looking forward to reading that...

> - the MaximumCommonSubstructure (MCS) algorithms

Is this an improved algorithm, or similar to that in CDK?

> - The Metric-Interface is still available and is used by the AtomPair
>   descriptor
> Weka-Clusterers with Molecular-Metrics are planned and will be
> implemented next. The Cluster-Matlab-Molecule connection is too difficult
> at the moment, because the similarity metric must be coded under Matlab
> or we use indices ...

Not sure what you mean here...

> So again, I'm using a lot of interfaces and I do not want another one !!!

Fine. I don't think we will need to reinvent what you did. I am, and I guess others are too, fine with using interfaces similar or identical to yours...

> Would it not be easier to add CDK and JOELib plug-ins?
> Do not make the algorithms too easy for chemists, probably they think
> hypothesis testing is an easy task and the molecular structure is the
> most important thing ... IMHO ... that's badly wrong !!!

Mmm... not sure I agree here... chemists are our target... likely even biologists (no offense... :)

> So force them
> to read the data mining/interface manual carefully.

Ok, can you explain to me what the goal is here? I.e. what should they learn from understanding the interface?

> Descriptor dependencies
> are NOT all linear 2D dependencies as already excellently mentioned by
> Nikolova/Jaworska. So where is the advantage in showing them in 2D or 3D ?
> That's mainly irrelevant and misleading ! A 2D plot is only one
> possibility for the model quality, and not always the best one !!!

What kind of 2D are you talking about here?

> > A. Current QSAR practice has severe problems. They include:
> > - almost all codes are closed. Many are not free.
>
> Exactly:
> Descriptors: Dragon, MolConnZ, ...
> Algorithms: often unpublished code hiding most of the parameters,
> also important ones
>
> > - it is impossible to repeat any experiment. Therefore QSAR ceases to be
> > scientific but relies on reputation, trust and power
> > - the objects used are badly designed, irreproducible and have variable
> > interpretation
> > - data selection is arbitrary. There are few (no?) standard test sets.
> > It is impossible to verify whether data have been modified consciously
> > or unconsciously to increase apparent success
> > - algorithms are closed, even if the data are well defined.
>
> Agree fully, four times !
> Oh, I've some nice slides I can present for these points ... :-)
>
> > B. The mainstream QSAR community is not taking effective steps to remedy
> > the errors. Our current group believes that through an OpenSource
> > approach we can catalyse a change in thinking and practice. We do this
> > by creating a system and practice that demonstrates the increased
> > **quality** available through OpenSource. IMO quality is the most
> > important - more so than platform, language, ease of use, performance,
> > etc. If it is easier and faster to create more garbage on every
> > platform what have we achieved?
>
> 1. Correct, but surely you know the No-Free-Lunch theorem ... I know that
> not everybody likes this theorem (still a priori) ... BUT ... now we have
> a huge number of algorithms ... which one to pick ? It's 'easy' to find
> one algorithm and one feature set to explain one data set perfectly !
>
> 2. And we are not all algorithm developers, so use the existing libraries
> which the mainstream user can use. There is still enough room to make
> errors, even if we do not have to reimplement algorithms !!!
>
> 3. A QSAR framework is not easy, because there are a lot of different
> opinions:

Correct. Hence the proposed new SF project to discuss and implement these things...

> 3.1. how to represent structures, e.g. CDK<->JOELib
> 3.2. models (hypothesis building algorithms) are really abstract, and do
> not forget the nested and highly interesting meta algorithms with
> recursive character, so let's forget the C++ libraries and concentrate
> on the Java and Matlab (Java GUI) libraries (R?) with their flexible
> reflection mechanism!
> 3.3. results ... uhhh ... cross-validation, feature selection, data set
> splitting ?
> Do not forget that we are talking about molecular structures, so ...
> 3.4. Big descriptor files with normalized descriptors, missing values (if
> numeric descriptors are unstable or depend on molecule size), ...
> 3.5. Are we working in memory or on files ??? For hypothesis building we
> are hopefully working in memory, but the preprocessing steps are
> not subject to this restriction.

Much of this has already been discussed in the thread. True nevertheless.

> Sorry, CDK for descriptors is not obvious to me, please explain. As you
> can tell, I do not agree for several reasons, as already discussed
> previously, e.g. missing atom typer and missing substructure search !

CDK *has* substructure search, implemented in a rather flexible way.

> (molecular-structure-coding ... is restricted to applied expert systems)
>
> Why do we need yet another new project,

As said above, a new project does not equal starting from scratch.

> do we not have enough interface
> maintenance 'problems' with the current projects !?
> 1. I think the standard should be a file format or CML, but this does not
> help at all, this can only save time by using more space !
> You know: time-space complexity

I have not seen the Heisenberg relation for this yet...

> 2. Often on-the-fly calculations are required, so this will require
> JOELib or CDK or an external JOELib module (which exist already: Corina,
> Petra, XLogP, ...)
> So we need a molecule data structure, so which one to use ?
> Again implement a new interface ? Why ? I can't see the advantage ?

See thread.

> 2.1. Interface to molecules:
> - JOELib (available)
> - CDK (available)
> - Ghemical/Mopac (available in JOELib)
> - OpenBabel (JNI, same object structure as JOELib, but is this
>   useful ?)
> - Tinker
>
> 2.2. Interface to data mining packages
> - Weka (available in JOELib/JCompChem)
> - JavaNNS (SNNS successor, available in JOELib/JCompChem)
> - LibSVM (available in JOELib/JCompChem)
> - Matlab and its 1001 free packages (available in JOELib/JCompChem)

Too bad Matlab itself is not...

> - Yale uses Weka
> - Data mining API

Let's use a chemometrics API. :) I have no idea what a data mining API is... data mining is a rather vague term... like chemometrics API.

> - ... too much such stuff ... all mostly incompatible ... let's use
>   Weka, that's the most seriously used OpenSource approach.
>   Data miners will implement their algorithms for it, we can use them !
> - let's use Matlab and/or R

Let's have that pluggable, so that anyone can choose whatever program they like.

> 3. Visualization:
> 3.1. Molecules: Can be done with CDK, and with JOELib also highlighted
> SMARTS substructures:
> 2D layout CDK
> 3D layout JOELib (Corina, Ghemical, orYourInterface)
> 3.2. Data: what, histograms, plots, 3D plots, ...
> no interest in implementing such things, that's boring and does not
> help at all, because Weka, Matlab, R all have their own tools
> and which one do you prefer ?
> What about independent packages, like libSVM, our JavaNNS
> (SNNS successor), ...
> So we need an interface for all, that's nearly impossible in a short
> time period.
> I most often use the Java->Matlab interface, this is nothing special,
> only the adapted JMatLink connection.
>
> ... and another advantage of holiday and weekend ... I can write really
> long e-mails :-)

Thanks for this analysis. And don't spend too much of your holiday on these kinds of emails... though it is difficult not to respond. :)

> Kind regards, Joerg

Have a nice continuation of your holiday!

Egon
From: Joerg W. <we...@in...> - 2004-04-17 17:25:49
|
Hi all,

> > I suggest starting not with deciding what program to write but with what
> > the components of a QSAR system are and then deciding who wants to be
> > involved, what we have got, and setting some realistic scope to what is
> > achievable

Of course, I like QSAR ... but time is scarce and who will implement things ... you know that's my default comment ...

Egon, I've read your mail ... and yes, I'm still on holiday ... and I do check e-mails and I have been working on QSAR for 3 years ... so holiday means I can read fantasy books and can do things I like, e.g. read some QSAR papers !:-)
Holiday and spare time are curious things ... aren't they :-)

> It seems there is general agreement that an SF project in this area is
> valuable and I'll make a few comments which I hope are helpful. Please
> ignore if they aren't.

I do not agree with opening a project of its own; there is much code out there: Weka, YALE (includes Weka interface) and XML, commercial stuff with Weka interface (Xalopy or what was the correct name?)

I think we do not want to invent a new data mining standard; such discussions are more useful for the Weka mailing list and all available Matlab algorithm providers (toolboxes !!!) ...
... and such discussions are not new (see Weka mailing list) !!!
I think we are interested in providing the most usable approach with implemented algorithms available, so let's use the already available ones and extend them !!!

IMHO:
!!! The problem is not the missing 'data mining' standard. The problem is the misuse of
1. a general molecular-structure-coding with these standard algorithms !!!
2. applying these algorithms correctly
So let's focus on this problem first !!!
This is a problem of CDK and JOELib, and only once we have solved this can we solve the next one.

Furthermore I will publish in the near future:
- the extended Weka interface
- the MaximumCommonSubstructure (MCS) algorithms
- The Metric-Interface is still available and is used by the AtomPair descriptor
Weka-Clusterers with Molecular-Metrics are planned and will be implemented next. The Cluster-Matlab-Molecule connection is too difficult at the moment, because the similarity metric must be coded under Matlab or we use indices ...

So again, I'm using a lot of interfaces and I do not want another one !!!
Would it not be easier to add CDK and JOELib plug-ins?
Do not make the algorithms too easy for chemists; probably they think hypothesis testing is an easy task and the molecular structure is the most important thing ... IMHO ... that's badly wrong !!! So force them to read the data mining/interface manual carefully. Descriptor dependencies are NOT all linear 2D dependencies, as already excellently mentioned by Nikolova/Jaworska. So where is the advantage in showing them in 2D or 3D? That's mainly irrelevant and misleading! A 2D plot is only one possibility for the model quality, and not always the best one !!!

> A. Current QSAR practice has severe problems. They include:
> - almost all codes are closed. Many are not free.

Exactly:
Descriptors: Dragon, MolConnZ, ...
Algorithms: often unpublished code hiding most of the parameters, also important ones

> - it is impossible to repeat any experiment. Therefore QSAR ceases to be
> scientific but relies on reputation, trust and power
> - the objects used are badly designed, irreproducible and have variable
> interpretation
> - data selection is arbitrary. There are few (no?) standard test sets. It
> is impossible to verify whether data have been modified consciously or
> unconsciously to increase apparent success
> - algorithms are closed, even if the data are well defined.

Agree fully, four times!
Oh, I've some nice slides I can present for these points ... :-)

> B. The mainstream QSAR community is not taking effective steps to remedy
> the errors. Our current group believes that through an OpenSource approach
> we can catalyse a change in thinking and practice. We do this by creating a
> system and practice that demonstrates the increased **quality** available
> through OpenSource. IMO quality is the most important - more so than
> platform, language, ease of use, performance, etc. If it is easier and
> faster to create more garbage on every platform what have we achieved?

1. Correct, but surely you know the No-Free-Lunch theorem ... I know that not everybody likes this theorem (still a priori) ... BUT ... now we have a huge number of algorithms ... which one to pick? It's 'easy' to find one algorithm and one feature set to explain one data set perfectly!

2. And we are not all algorithm developers, so use the existing libraries which the mainstream user can use. There is still enough room to make errors, even if we do not have to reimplement algorithms !!!

3. A QSAR framework is not easy, because there are a lot of different opinions:
3.1. how to represent structures, e.g. CDK<->JOELib
3.2. models (hypothesis building algorithms) are really abstract, and do not forget the nested and highly interesting meta algorithms with recursive character, so let's forget the C++ libraries and concentrate on the Java and Matlab (Java GUI) libraries (R?) with their flexible reflection mechanism!
3.3. results ... uhhh ... cross-validation, feature selection, data set splitting? Do not forget that we are talking about molecular structures, so ...
3.4. Big descriptor files with normalized descriptors, missing values (if numeric descriptors are unstable or depend on molecule size), ...
3.5. Are we working in memory or on files ??? For hypothesis building we are hopefully working in memory, but the preprocessing steps are not subject to this restriction.

Sorry, CDK for descriptors is not obvious to me, please explain. As you can tell, I do not agree for several reasons, as already discussed previously, e.g. missing atom typer and missing substructure search!
(molecular-structure-coding ... is restricted to applied expert systems)

Why do we need yet another new project, do we not have enough interface maintenance 'problems' with the current projects !?

1. I think the standard should be a file format or CML, but this does not help at all, this can only save time by using more space! You know: time-space complexity

2. Often on-the-fly calculations are required, so this will require JOELib or CDK or an external JOELib module (which exist already: Corina, Petra, XLogP, ...)
So we need a molecule data structure, so which one to use? Again implement a new interface? Why? I can't see the advantage?

2.1. Interface to molecules:
- JOELib (available)
- CDK (available)
- Ghemical/Mopac (available in JOELib)
- OpenBabel (JNI, same object structure as JOELib, but is this useful ?)
- Tinker

2.2. Interface to data mining packages
- Weka (available in JOELib/JCompChem)
- JavaNNS (SNNS successor, available in JOELib/JCompChem)
- LibSVM (available in JOELib/JCompChem)
- Matlab and its 1001 free packages (available in JOELib/JCompChem)
- Yale uses Weka
- Data mining API
- ... too much such stuff ... all mostly incompatible ... let's use Weka, that's the most seriously used OpenSource approach. Data miners will implement their algorithms for it, we can use them!
- let's use Matlab and/or R

3. Visualization:
3.1. Molecules: Can be done with CDK, and with JOELib also highlighted SMARTS substructures:
2D layout CDK
3D layout JOELib (Corina, Ghemical, orYourInterface)
3.2. Data: what, histograms, plots, 3D plots, ...
no interest in implementing such things, that's boring and does not help at all, because Weka, Matlab, R all have their own tools, and which one do you prefer?
What about independent packages, like libSVM, our JavaNNS (SNNS successor), ...
So we need an interface for all, that's nearly impossible in a short time period.
I most often use the Java->Matlab interface; this is nothing special, only the adapted JMatLink connection.

... and another advantage of holiday and weekend ... I can write really long e-mails :-)

Kind regards, Joerg

> C. The OpenSource community has made some small, useful steps in this
> direction. They now wish to pool their efforts and produce a single point
> of contact for their own development and to show to the world. This does
> NOT necessarily mean a single program. IMO it is much more likely to mean
> an infrastructure on which a variety of operations can be carried out
> ("glueware"?). They wish to create a project at SF which leads to:
> - active constructive discussion
> - agreed representation of objects
>   * molecules, atoms, fragments, etc.
>   * descriptors
>   * properties
> - creation, cataloguing, annotating, high-quality information objects:
>   * dictionaries
>   * properties (e.g. of atoms)
>   * datasets
> - creation, cataloguing, annotation of algorithms related to QSAR
>   * chemical perception
>   * statistics, optimisation, etc
> - creation of software:
>   * as toolkit components
>   * as demonstrators of the *quality* of the system
>
> That is as far as I have got...
>
> I think it's important to be inclusive and I would therefore suggest that
> we review the current OpenSource efforts in this area. My knowledge
> extends to:
> - CDK, etc.
> - JOELib
> - OpenBabel
> - Weka
> - Nina's work (does this have a label?)
>
> In projects of this sort everyone has something to contribute and also
> something to give up. For example I did a lot of work on visual display of
> CML (Jumbo3) - and some of this functionality is not provided by other
> sources. Nevertheless I decided to give up JUMBO3 and use JCP and Jmol for
> display. JUMBO4.3 has now developed in a more structured form as a flexible
> XML DOM and Tools library which can be reconfigured easily and rapidly. It
> is component based rather than application based.
>
> I suggest starting not with deciding what program to write but with what
> the components of a QSAR system are and then deciding who wants to be
> involved, what we have got, and setting some realistic scope to what is
> achievable.
>
> Best
>
> P.

Dipl. Chem. Joerg K. Wegner
Center of Bioinformatics Tuebingen (ZBIT)
Department of Computer Architecture
Univ. Tuebingen, Sand 1, D-72076 Tuebingen, Germany
Phone: (+49/0) 7071 29 78970
Fax: (+49/0) 7071 29 5091
E-Mail: mailto:we...@in...
WWW: http://www-ra.informatik.uni-tuebingen.de
--
Never mistake motion for action. (E. Hemingway)
Never mistake action for meaningful action. (Hugo Kubinyi,2004)
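
The list of mutually incompatible data mining packages in the message above (Weka, JavaNNS, LibSVM, Matlab, R) is the kind of situation a thin common interface with exchangeable backends is meant to address. The sketch below only illustrates that idea and is not taken from JOELib, CDK or Weka: `ModelBackend`, `MeanBaseline`, `register_backend` and `create_backend` are hypothetical names, and the single toy backend merely stands in for a wrapper around a real package.

```python
from typing import Callable, Dict, List, Protocol, Sequence


class ModelBackend(Protocol):
    """Hypothetical minimal contract every wrapped package would have to satisfy."""

    def fit(self, X: Sequence[Sequence[float]], y: Sequence[float]) -> None: ...

    def predict(self, X: Sequence[Sequence[float]]) -> List[float]: ...


class MeanBaseline:
    """Toy backend that always predicts the mean of the training targets."""

    def __init__(self) -> None:
        self._mean = 0.0

    def fit(self, X: Sequence[Sequence[float]], y: Sequence[float]) -> None:
        self._mean = sum(y) / len(y)

    def predict(self, X: Sequence[Sequence[float]]) -> List[float]:
        return [self._mean for _ in X]


# Registry mapping a backend name to a factory, so callers never import a backend directly.
_BACKENDS: Dict[str, Callable[[], ModelBackend]] = {}


def register_backend(name: str, factory: Callable[[], ModelBackend]) -> None:
    _BACKENDS[name] = factory


def create_backend(name: str) -> ModelBackend:
    if name not in _BACKENDS:
        raise ValueError(f"no backend registered under {name!r}")
    return _BACKENDS[name]()


register_backend("mean-baseline", MeanBaseline)

# Calling code depends only on the registry and on fit/predict.
model = create_backend("mean-baseline")
model.fit([[0.1], [0.2], [0.3]], [1.0, 2.0, 3.0])
predictions = model.predict([[0.5], [0.9]])
```

The trade-off raised in the thread still applies: the narrower the shared contract, the easier it is to wrap a package, and the more package-specific features (nominal attributes, meta algorithms, built-in validation) are hidden behind it.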
From: Egon W. <eg...@sc...> - 2004-04-17 15:54:20
|
On Saturday 17 April 2004 15:02, Peter Murray-Rust wrote: > I suggest starting not with deciding what program to write but with what > the components of a QSAR system are and then deciding what who wants to be > involved, we have got and setting some realistic scope to what is > achievable. Agreed. This is why we need to set up a SF project where we can write these things down. Here's a list: - building a molecule database - read from file/internet - draw yourself one by one, or insert from smiles - browsing the database with 2D and possible 3D structures - associate activities/properties with those molecules - preprocessing - get mathematical (or other) descriptions of the molecules in the database - selection of wanted descriptions - ability to use external programs for this - descriptor value preprocessing - statistical analysis of the database (outliers, diversity, etc) - model building - chosing method, and method parameters - model validation - visual validation -> plots - statistical validation I've requested a new SF project ('qsar') yesterday after getting positive reactions to my proposal earlier. Joerg, I did not direct you personaly yet, because I vaguely remembered you stating to be on holiday (?), but I might very well be confused here... I see JOELib as an important part of the new program: it has many descriptors implemented, already uses CML2 for storing results, and has an interface to Weka. I also see an important part for CDK: 2D editing/display is a very important feature here. And, I expect, some descriptors will be implemented in CDK later this year, though this will likely not conflict with those in JOELib. The reason why I propose CDK's core classes must be obvious. Hopefully, the QSAR SF project will be approved early next week, and then I will start adding requirements, analyses, etc to documentation, hopefully together with the others interested. Then we will see how the available OS parts fit together. Egon |
From: Peter Murray-R. <pm...@ca...> - 2004-04-17 13:05:50
|
At 09:52 17/04/2004 +0200, Joerg Wegner wrote:
>Dear Nina Nikolova,
>Dear All,
>
>please reply also to the JOELib mailing list and ... i've already
>published three papers about QSAR and our group has its main focus on
>data mining and optimization algorithms, so i think i've some experience
>in this area, too.
>http://www-ra.informatik.uni-tuebingen.de/

It seems there is general agreement that an SF project in this area is valuable and I'll make a few comments which I hope are helpful. Please ignore if they aren't.

A. Current QSAR practice has severe problems. They include:
- almost all codes are closed. Many are not free.
- it is impossible to repeat any experiment. Therefore QSAR ceases to be scientific but relies on reputation, trust and power
- the objects used are badly designed, irreproducible and have variable interpretation
- data selection is arbitrary. There are few (no?) standard test sets. It is impossible to verify whether data have been modified consciously or unconsciously to increase apparent success
- algorithms are closed, even if the data are well defined.

B. The mainstream QSAR community is not taking effective steps to remedy the errors. Our current group believes that through an OpenSource approach we can catalyse a change in thinking and practice. We do this by creating a system and practice that demonstrates the increased **quality** available through OpenSource. IMO quality is the most important - more so than platform, language, ease of use, performance, etc. If it is easier and faster to create more garbage on every platform what have we achieved?

C. The OpenSource community has made some small, useful steps in this direction. They now wish to pool their efforts and produce a single point of contact for their own development and to show to the world. This does NOT necessarily mean a single program. IMO it is much more likely to mean an infrastructure on which a variety of operations can be carried out ("glueware"?).

They wish to create a project at SF which leads to:
- active constructive discussion
- agreed representation of objects
  * molecules, atoms, fragments, etc.
  * descriptors
  * properties
- creation, cataloguing, annotating, high-quality information objects:
  * dictionaries
  * properties (e.g. of atoms)
  * datasets
- creation, cataloguing, annotation of algorithms related to QSAR
  * chemical perception
  * statistics, optimisation, etc.
- creation of software:
  * as toolkit components
  * as demonstrators of the *quality* of the system

That is as far as I have got...

I think it's important to be inclusive and I would therefore suggest that we review the current OpenSource efforts in this area. My knowledge extends to:
- CDK, etc.
- JOELib
- OpenBabel
- Weka
- Nina's work (does this have a label?)

In projects of this sort everyone has something to contribute and also something to give up. For example I did a lot of work on visual display of CML (Jumbo3) - and some of this functionality is not provided by other sources. Nevertheless I decided to give up JUMBO3 and use JCP and Jmol for display. JUMBO4.3 has now developed in a more structured form as a flexible XML DOM and Tools library which can be reconfigured easily and rapidly. It is component based rather than application based.

I suggest starting not with deciding what program to write but with what the components of a QSAR system are, and then deciding who wants to be involved, what we have got, and setting some realistic scope to what is achievable.

Best

P.

Peter Murray-Rust
Unilever Centre for Molecular Informatics
Chemistry Department, Cambridge University
Lensfield Road, CAMBRIDGE, CB2 1EW, UK
Tel: +44-1223-763069 |
From: Joerg W. <we...@in...> - 2004-04-17 08:16:16
|
Dear Nina Nikolova, Dear All,

yes of course, it is definitely recommended to introduce a model validation facility, as already discussed in my last two papers and also heavily criticized by Agrafiotis. Model comparison without data comparison is not really useful, so both the models and the data must be stored; i would prefer a benchmark database or at least a public web page. I discussed this topic also with the JCICS editor, but ... you know chemists and their data.

SO, for a useful model validation we first need benchmark data sets, because we can only compare the hypotheses if the data sets are the same. Furthermore a basic 'guideline' must be available to avoid over-/underfitted models, especially when applying feature selection. See the feature selection papers:
http://www-ra.informatik.uni-tuebingen.de/software/joelib/users.html
The first paper contains two benchmark data sets with nearly 3000 descriptors ?

For models i recommend using Weka, because these models can be stored as Java objects, which is transparent enough; if possible, an XML mapping tool can be used instead. For our JavaNNS interface a text export still exists, also for the libSVM interface, ... For Matlab things can be stored in Matlab objects.

No, sorry, i'm not at the ADMET conference, but i'm going to:
- Chemoinformatics, Sheffield, just to talk to others, and
- Analytica, Munich, lecture: 'Model quality' !!!

Kind regards, Joerg

Dipl. Chem. Joerg K. Wegner
Center of Bioinformatics Tuebingen (ZBIT)
Department of Computer Architecture
Univ. Tuebingen, Sand 1, D-72076 Tuebingen, Germany
Phone: (+49/0) 7071 29 78970
Fax: (+49/0) 7071 29 5091
E-Mail: mailto:we...@in...
WWW: http://www-ra.informatik.uni-tuebingen.de
--
Never mistake motion for action. (E. Hemingway)
Never mistake action for meaningful action. (Hugo Kubinyi, 2004) |
From: Joerg W. <we...@in...> - 2004-04-17 07:52:28
|
Dear Nina Nikolova, Dear All,

please reply also to the JOELib mailing list and ... i've already published three papers about QSAR and our group has its main focus on data mining and optimization algorithms, so i think i've some experience in this area, too.
http://www-ra.informatik.uni-tuebingen.de/

I've read your QSAR 'similarity review' paper. Excellent !!! Furthermore i've already submitted a paper criticizing the missing standard for chemical expert systems and all following steps, like descriptor calculation, atom type assignment, ...

Descriptor import is no problem with JOELib, e.g. MOE, MolConnZ, Petra, ... BTW, i do not like non-transparent programs like Dragon, although their book is excellent ...

Machine learning suites: Weka, YACC, Matlab, R, ... all fine ! For Weka and Matlab you can ask me, i've implemented interfaces and am using them heavily. R is similar, and a long time ago i used its interface from Java very briefly ... we're matlab based !

The JOELib IDE already has basic statistic and entropy interfaces !!! A student of mine has extended the Weka data mining model for molecules; i recommend using this data model ! The main interest is to also use user-defined similarity metrics for clustering ! BTW, the JOELib-Weka version also contains some add-on packages !

For CML and descriptors i've already started a discussion on an HTML page; please reply to the CDK and JOELib discussion lists:
http://www-ra.informatik.uni-tuebingen.de/mitarb/wegner/private/cml/cmlAndDescriptors.htm

Please discuss things heavily and avoid inventing the wheel twice ...

Kind regards, Joerg |
From: Joerg W. <we...@in...> - 2004-04-15 08:37:04
|
Hi all,

the kekulizing method in JOEMol is still available and is now used to:
- visualize structures without aromatic rings in Kekule mode
- export images and PDF
- export MDL SDF without the aromatic bond flag

The changes have been added to CVS and the tutorial has been updated. The flags in the properties file now work correctly:

1. joelib.gui.render.Renderer2DModel.useKekuleStructure=true
2. #SD Files
   joelib.io.types.MDLSD.writeAromaticityAsConjugatedSystem=false

Kind regards, Joerg |
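For readers who want to check how the two flags above are read, here is a minimal sketch using only the standard `java.util.Properties` class. It is not JOELib's own property loader; the class name `KekuleFlags`, the helper `flag`, and the fallback values are assumptions made for illustration. Only the two property keys come from the message.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper, not part of JOELib: reads boolean flags from properties text.
final class KekuleFlags {
    // The two keys quoted in the message above.
    static final String USE_KEKULE =
        "joelib.gui.render.Renderer2DModel.useKekuleStructure";
    static final String WRITE_CONJUGATED =
        "joelib.io.types.MDLSD.writeAromaticityAsConjugatedSystem";

    /** Returns the boolean value stored under key, or fallback if the key is absent. */
    static boolean flag(String propertiesText, String key, boolean fallback) {
        Properties props = new Properties();
        try {
            // Lines starting with '#' (such as "#SD Files") are treated as comments.
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String raw = props.getProperty(key);
        return raw == null ? fallback : Boolean.parseBoolean(raw.trim());
    }

    public static void main(String[] args) {
        String text = String.join("\n",
            USE_KEKULE + "=true",
            "#SD Files",
            WRITE_CONJUGATED + "=false");
        System.out.println(USE_KEKULE + " -> " + flag(text, USE_KEKULE, false));
        System.out.println(WRITE_CONJUGATED + " -> " + flag(text, WRITE_CONJUGATED, true));
    }
}
```

The fallback argument only matters when a key is missing from the file; which default JOELib itself applies in that case is not stated in the message.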
From: Peter Murray-R. <pm...@ca...> - 2004-04-03 13:42:24
|
At 11:02 29/03/2004 +0200, Joerg K. Wegner wrote:
>Hi all,
>
>which features do you prefer for the next JOELib release ?
>
>Voting:
>1. Tautomers
>   - Based on SMARTS Transformation rules, analogue to pH
>     correction module in JOELib, just use combinatorial
>     generation and not only one rule.
>     So this is analogue to other approaches, like
>     Agent2 (no SMARTS?, hard coded?), Docking programs,
>     Daylight, CACTVS
>   - BTW, i would be happy if you can submit requested
>     tautomer patterns in SMARTS notation.

Assuming this is (a) the only Open Java code for tautomers and (b) mirrors an accepted practice (e.g. SMARTS) I would favour this.

P.

>2. Rotamers
>   - Porting the OELib rotamer generation to Java,
>     60-75% already finished.
>
>No voting:
>A. MCS based pharmacophore detection is finished and
>   will be probably published in quartal 4 of this year.
>
>Kind regards, Joerg |
From: Joerg K. W. <we...@in...> - 2004-03-29 09:01:26
|
Hi all,

which features do you prefer for the next JOELib release ?

Voting:
1. Tautomers
   - Based on SMARTS transformation rules, analogous to the pH correction module in JOELib, but using combinatorial generation and not only one rule. So this is analogous to other approaches, like Agent2 (no SMARTS?, hard coded?), docking programs, Daylight, CACTVS
   - BTW, i would be happy if you can submit requested tautomer patterns in SMARTS notation.
2. Rotamers
   - Porting the OELib rotamer generation to Java, 60-75% already finished.

No voting:
A. MCS based pharmacophore detection is finished and will probably be published in the fourth quarter of this year.

Kind regards, Joerg |
From: E.L. W. <eg...@sc...> - 2004-03-24 13:27:43
|
On Friday 19 March 2004 12:33, Joerg K. Wegner wrote:
> 4. CDK CML core wishlist (Egon):
> 4.1. Please move (from MDL CDO) or add in any form of stereochemistry
> support to core CML and the RSS plugin. This is important for
> visualizing drugs!

Please elaborate... MDL style wedge bonds are read from CML... what are you referring to? CML fragment please...

> 4.2. Please accept RSS entries without structures without throwing an
> error.

Cannot reproduce that behaviour. I don't have that problem. Can you tell me how to reproduce the bug?

> Please crosscheck the GPLed Java RSS viewer on SF.net:
> https://sourceforge.net/projects/rssview/
> They can also add and edit RSS feed properties and have proxy support,
> additionally all informations are stored in a XML file.

Thanx for the ideas... this does not have a high priority ...

> 4.3. Visualization for activity and ADMET data in this RSS feed example
> (xsd:string, xsd:double, xsd:integer descriptors).

Yes, that is being worked out... Will report later...

Egon

--
eg...@sc...
PhD on Molecular Representation in Chemometrics
Nijmegen University
http://www.cac.sci.kun.nl/people/egonw/
GPG: 1024D/D6336BA6 |
From: Joerg K. W. <we...@in...> - 2004-03-23 19:49:40
|
Hi all,

here is the proposed CML2 bugfix release. BTW, i've added:
- Amber prep
- Mopac out

And fixed:
- Sybyl partial charges are no longer overwritten by JOELib's partial charges when a file is read, so you can now store this information in CML2. All partial charges in CML2 now contain vendor information.

Kind release regards, Jörg |
From: Peter Murray-R. <pm...@ca...> - 2004-03-22 09:18:40
|
At 16:17 21/03/2004 +0100, Joerg Wegner wrote:
>Hi all,
>
>1. I've added Amber,Mopac import to JOELib.
>2. The partial charges entries in Mopac and Sybyl are now not any more
>overwritten by JOELib's Gasteiger-Marsili partial charges. So they can be
>forwarded to e.g. CML2.
>I've added two test files. Please try:
>sh convert.sh src/joelib/test/1bhf-ligand.mol2 sybyl.cml
>sh convert.sh src/joelib/test/Ethanol.mopout mopac.cml

Thanks very much. I reply before looking at the details...

>Please attend that at the actual implementation the partial charges are
>used to calculate the hashcode and the molecule ID for CML (hopefully
>nearly unique), if you do not use the canonical SMILES hashcode, you will
>get different results for different partial charges.
>To avoid this, we simply use the canonicalized SMILES hashcode in
>CMLIDCreator.
>
>So we get:
><cml:scalar units="units:electron" dataType="xsd:float"
>dictRef="MOPAC">-0.0192</cml:scalar>

This is valid CML but we are gently moving towards more structured dictRefs. I suggest:

    <molecule xmlns:MOPAC="http://www.tuebingen.de/dict/MOPAC">...
    <cml:scalar units="units:electron" dataType="xsd:float"
        dictRef="MOPAC:partialCharge73">-0.0192</cml:scalar>

The MOPAC:partialCharge73 is an XML QName. You are in control of it. It acts as a pointer to an ID in a dictionary - i.e. somewhere you have something like:

    <dictionary id="mopacDict" dictRef="MOPAC"
        xmlns:MOPAC="http://www.tuebingen.de/dict/MOPAC">
      <entry id="partialCharge73" term="partial charge from MOPAC">
        <appinfo>
          <scalar type="xsd:float"/>
        </appinfo>
        <description>The partial charge as defined on p. 333 of the manual...</description>
      </entry>
    </dictionary>

I am writing this without looking at the details but to give the idea. The appinfo stuff is new and still being developed. Your input will be important. The main thing is that you have a dictionary defining the concept. The actual ID doesn't matter - it could be JW:id376 as long as this was linked to a JW namespace and there was an entry id376.

><cml:scalar units="units:electron" dataType="xsd:float"
>dictRef="MMFF94_CHARGES">0.569</cml:scalar>

I would suggest either JW:MMFF94_CHARGES or, if you have a large list of MMFF94 terms, MMFF94:charges

>If no charges are defined we get, as previously:
><cml:scalar units="units:electron" dataType="xsd:float"
>dictRef="joelib:partialCharge">-0.2686233439216685</cml:scalar>
>
>3. Peter: Can these dictRef entries be used ? If not what then, because
>Sybyl itself contains a string line where the 'partial charge vendor' is
>already given. I've simply replaced ' ' by '_'.

If manufacturers have IDs that conform to XML we can use them. If not there has to be a mapping.

In haste

P.

>Kind regards, Joerg |
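The QName mechanism described above can be sketched in a few lines of Java: a structured dictRef is split into a prefix and an entry ID, and the prefix is mapped to the namespace URI declared on the molecule. This is only an illustration under stated assumptions. The class `DictRefResolver` and its methods are hypothetical and not part of any CML library, and treating an unprefixed value such as `MOPAC` as an error is a choice made for the sketch (the message says such a value is still valid CML).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: splits "prefix:entryId" dictRefs and resolves the prefix.
final class DictRefResolver {
    /** Returns the prefix of a structured dictRef, e.g. "MOPAC". */
    static String prefixOf(String dictRef) {
        int colon = dictRef.indexOf(':');
        if (colon <= 0 || colon == dictRef.length() - 1) {
            throw new IllegalArgumentException("not of the form prefix:id: " + dictRef);
        }
        return dictRef.substring(0, colon);
    }

    /** Returns the dictionary entry ID, e.g. "partialCharge73". */
    static String entryIdOf(String dictRef) {
        return dictRef.substring(prefixOf(dictRef).length() + 1);
    }

    /** Looks up the namespace URI bound to the dictRef's prefix. */
    static String namespaceOf(String dictRef, Map<String, String> declaredPrefixes) {
        String uri = declaredPrefixes.get(prefixOf(dictRef));
        if (uri == null) {
            throw new IllegalArgumentException("undeclared prefix in " + dictRef);
        }
        return uri;
    }

    public static void main(String[] args) {
        // Mirrors xmlns:MOPAC="http://www.tuebingen.de/dict/MOPAC" from the example.
        Map<String, String> prefixes = new HashMap<>();
        prefixes.put("MOPAC", "http://www.tuebingen.de/dict/MOPAC");

        String dictRef = "MOPAC:partialCharge73";
        System.out.println("namespace: " + namespaceOf(dictRef, prefixes));
        System.out.println("entry id:  " + entryIdOf(dictRef));
    }
}
```

A consumer would then look for an `entry` with that ID in the dictionary identified by the namespace; how the dictionary itself is located and fetched is left open here, as it is in the message.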
From: Joerg W. <we...@in...> - 2004-03-21 15:18:11
|
Hi all,

1. I've added Amber and Mopac import to JOELib.

2. The partial charge entries in Mopac and Sybyl files are no longer overwritten by JOELib's Gasteiger-Marsili partial charges, so they can be forwarded to e.g. CML2. I've added two test files. Please try:
sh convert.sh src/joelib/test/1bhf-ligand.mol2 sybyl.cml
sh convert.sh src/joelib/test/Ethanol.mopout mopac.cml

Please note that in the current implementation the partial charges are used to calculate the hashcode and the molecule ID for CML (hopefully nearly unique); if you do not use the canonical SMILES hashcode, you will get different results for different partial charges. To avoid this, we simply use the canonicalized SMILES hashcode in CMLIDCreator.

So we get:
<cml:scalar units="units:electron" dataType="xsd:float" dictRef="MOPAC">-0.0192</cml:scalar>
<cml:scalar units="units:electron" dataType="xsd:float" dictRef="MMFF94_CHARGES">0.569</cml:scalar>

If no charges are defined we get, as previously:
<cml:scalar units="units:electron" dataType="xsd:float" dictRef="joelib:partialCharge">-0.2686233439216685</cml:scalar>

3. Peter: Can these dictRef entries be used ? If not, what then ? Sybyl itself contains a string line where the 'partial charge vendor' is already given; I've simply replaced ' ' by '_'.

Kind regards, Joerg |
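The point about molecule IDs can be illustrated with a small sketch. It does not show how CMLIDCreator is actually implemented, which the message does not describe; the class `MoleculeIdSketch`, both method names, and the way the charges are mixed into the hash are assumptions for illustration. The idea shown is only that an ID derived from the canonical SMILES string alone stays the same when the partial charges change, while an ID that also hashes the charges does not.

```java
import java.util.Arrays;

// Hypothetical sketch, not JOELib code: two ways to derive a molecule ID.
final class MoleculeIdSketch {
    /** ID that depends only on the (already canonicalized) SMILES string. */
    static int idFromSmiles(String canonicalSmiles) {
        return canonicalSmiles.hashCode();
    }

    /** ID that additionally mixes in the per-atom partial charges. */
    static int idFromSmilesAndCharges(String canonicalSmiles, double[] partialCharges) {
        return 31 * canonicalSmiles.hashCode() + Arrays.hashCode(partialCharges);
    }

    public static void main(String[] args) {
        // A SMILES string assumed to be canonical already; the charge values are made up.
        String smiles = "CCO";
        double[] chargesFromFileA = {0.0, 0.0, 0.0};
        double[] chargesFromFileB = {1.0, 0.0, 0.0};

        System.out.println("SMILES only:      " + idFromSmiles(smiles));
        System.out.println("with charges (A): " + idFromSmilesAndCharges(smiles, chargesFromFileA));
        System.out.println("with charges (B): " + idFromSmilesAndCharges(smiles, chargesFromFileB));
    }
}
```

With the SMILES-only variant, the same structure imported once with Mopac charges and once with Sybyl charges maps to one ID, which matches the behaviour the message describes as the way to avoid differing results.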
From: Joerg K. W. <we...@in...> - 2004-03-20 16:11:16
|
Hi,

sorry, the sequential CML2 reader (for uncompressed files ONLY!!!) contains a parsing bug for complex descriptors like arrays, matrices, atom-pair, and whatever ... I've fixed it and checked it in. I will not build a new release, so check out the sources if you want to try it, or compress your CML2 file before loading it :-) Or wait until the next release.

BTW, i've updated the homepage and added P. Murray-Rust to the acknowledgements, because of the valuable CML discussion.

Kind regards, Joerg |
From: Peter Murray-R. <pm...@ca...> - 2004-03-20 12:05:02
|
Greetings all,

I have been away from my mail for 24 hours and this all seems to be exploding! Fantastic. This is written before reading everything...

At 14:03 19/03/2004 +0100, E.L. Willighagen wrote:
>On Friday 19 March 2004 12:33, Joerg K. Wegner wrote:
> > 1. I've again released a new JOELib version with sequential SAX parser
> > for HUGE CML files for uncompressed and compressed (ZIP) files (i've
> > found a workaround)!!!
>
>Excellent!
>
> > I think after this 'stable' CML standard support i can switch from SDF
> > to CML2 as standard format, because it can be compressed and is much
> > more verbose, which is recommended for storing descriptors (WITH KERNEL
> > informations).
> > https://sourceforge.net/projects/joelib/
>
>/me is happy

This sounds great. We are happy to explore how CML can support these types of applications. The challenge is not to write CML (which is easy) but to create a structure which is usable by others. So imagine that you got a CML file from someone you hadn't talked to - could you use it?

A good example is properties. Let's say Joerg creates something that looks like:

    <molecule xmlns:joerg="http://de.uni-tuebingen.informatik/wegner/dict">
      <atomsSnipped/>
      <propertyList>
        <property>
          <scalar type="xsd:float" dictRef="joerg:clogp" title="cLogP">1.23</scalar>
        </property>
      </propertyList>
    </molecule>

What should RssViewer do? Options might be:
1. display the molecule and omit the propertyList completely
2. notify the user that there were properties but that RSS viewer couldn't view them
3. list the values along with the dictRef, title *and* the xmlns. This identifies the property but doesn't explain it
4. visit joerg's website and retrieve the dictionary entry.

I would suggest we aim for at least 2. However we are going into completely fresh semantic web territory - very exciting.

> > 2. I've added the 'attributearray' CML writer which is conform to the
> > OpenBabel array, only because its better layout in RSS files. I would be
> > happy if you will add a CML generation section for JOELib (analogue to
> > OpenBabel) to your CMLRSS documentation. Two properties in
> > joelib.properties need to be set:
> > joelib.io.types.ChemicalMarkupLanguage.output=attributearray
> > ## a first step to 'reproducible' descriptor calculation algorithms
> > joelib.io.types.ChemicalMarkupLanguage.output.storeChemistryKernelInfo=false
> >
> > 3.1. I've published a 'QSAR, LBDD and SBDD'-RSS feed. I plan to release
> > from time to time (monthly) some algorithmic and QSAR, LBDD and SBDD
> > topics with literature references.

This is wonderful. Joerg, please liaise with YY so that (a) we notify people and (b) can include your name in the RSS providers at our ACS presentation. (This applies to any other list members - if you have a CMLRSS feed and let YY/Henry know in the next 2-3 days we will promote it.)

Again, we are promoting CMLRSS at the NeSC meeting - as announced earlier. Egon - please can you liaise with YY - within the next 3-4 days as to what version we can put on. (We can always announce a later one at the meeting.)

Hope to post more later

P. |
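Option 3 in the list above (show each value with its dictRef, title and the namespace its prefix is bound to) can be sketched with the DOM parser that ships with Java. This is not RssViewer code; the class `PropertyLister`, the method `describe`, and the output format are assumptions made for illustration. Namespace-aware parsing is switched on so that `lookupNamespaceURI` can resolve the dictRef prefix.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical sketch of "option 3": list properties without interpreting them.
final class PropertyLister {
    /** Returns one line per scalar: "title [dictRef, xmlns=URI] = value". */
    static List<String> describe(String cml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for prefix resolution
            Document doc = factory.newDocumentBuilder().parse(
                new ByteArrayInputStream(cml.getBytes(StandardCharsets.UTF_8)));

            List<String> lines = new ArrayList<>();
            // "*" matches scalar elements in any namespace, or in none.
            NodeList scalars = doc.getElementsByTagNameNS("*", "scalar");
            for (int i = 0; i < scalars.getLength(); i++) {
                Element scalar = (Element) scalars.item(i);
                String dictRef = scalar.getAttribute("dictRef");
                int colon = dictRef.indexOf(':');
                String uri = colon > 0
                    ? scalar.lookupNamespaceURI(dictRef.substring(0, colon))
                    : null;
                lines.add(scalar.getAttribute("title")
                    + " [" + dictRef + ", xmlns=" + uri + "] = "
                    + scalar.getTextContent().trim());
            }
            return lines;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse CML", e);
        }
    }

    public static void main(String[] args) {
        String cml =
            "<molecule xmlns:joerg=\"http://de.uni-tuebingen.informatik/wegner/dict\">"
            + "<propertyList><property>"
            + "<scalar type=\"xsd:float\" dictRef=\"joerg:clogp\" title=\"cLogP\">1.23</scalar>"
            + "</property></propertyList></molecule>";
        describe(cml).forEach(System.out::println);
    }
}
```

Going from here to option 4 would mean fetching a dictionary from the namespace URI, which depends on conventions the thread has not settled yet.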
From: Joerg W. <we...@in...> - 2004-03-19 16:42:41
|
Hi Egon,

an example is given at:
http://joelib.sourceforge.net/rss/index.xml
Please use a text viewer, not a browser !

For stereo i've used bondStereo and hope this conforms to the standard ? Anyway, if not, i can change to 'correctStereoTag'.

Error in JChemPaint (Jmol does not work, as mentioned in the previous mail):
'Model does not have bonds. Cannot depict contents.' [Ok-Button] [virtual mouse-click] :-)

Kind regards, Joerg |
From: E.L. W. <eg...@sc...> - 2004-03-19 13:03:22
|
On Friday 19 March 2004 12:33, Joerg K. Wegner wrote:
> 1. I've again released a new JOELib version with sequential SAX parser
> for HUGE CML files for uncompressed and compressed (ZIP) files (i've
> found a workaround)!!!

Excellent!

> I think after this 'stable' CML standard support i can switch from SDF
> to CML2 as standard format, because it can be compressed and is much
> more verbose, which is recommended for storing descriptors (WITH KERNEL
> informations).
> https://sourceforge.net/projects/joelib/

/me is happy

> 2. I've added the 'attributearray' CML writer which is conform to the
> OpenBabel array, only because its better layout in RSS files. I would be
> happy if you will add a CML generation section for JOELib (analogue to
> OpenBabel) to your CMLRSS documentation. Two properties in
> joelib.properties need to be set:
> joelib.io.types.ChemicalMarkupLanguage.output=attributearray
> ## a first step to 'reproducible' descriptor calculation algorithms
> joelib.io.types.ChemicalMarkupLanguage.output.storeChemistryKernelInfo=false
>
> 3.1. I've published a 'QSAR, LBDD and SBDD'-RSS feed. I plan to release
> from time to time (monthly) some algorithmic and QSAR, LBDD and SBDD
> topics with literature references.
> http://joelib.sourceforge.net/rss/index.xml
> Please see also attached rssviewer.props
>
> 3.2. Two entries in the feed contain structures WITH solubility and
> activity data taken from the literature ! Because this are the most
> interesting values for pharmaceutical companies and medicinal research i
> think such entries could be really interesting for other companies and
> database suppliers also ! I would it find also interesting if there
> could be a larger supplier which can offer such an open RSS feed, where
> others can submit some things.
> Eventually an open Wiki approach with mathematical equations (latex
> support exists) AND chemical structures (not that i know) can be set up?
>
> 4. CDK CML core wishlist (Egon):
> 4.1. Please move (from MDL CDO) or add in any form of stereochemistry
> support to core CML and the RSS plugin. This is important for
> visualizing drugs!

Ok, can you give me a CML fragment? That's much easier for me than finding the source code bits that deal with it...

> 4.2. Please accept RSS entries without structures without throwing an
> error. Please crosscheck the GPLed Java RSS viewer on SF.net:
> https://sourceforge.net/projects/rssview/

It should, and did... what error do you get?

> They can also add and edit RSS feed properties and have proxy support,
> additionally all informations are stored in a XML file.

Adding is supported in the plugin too... info is taken from the feed itself... Proxy support is very easy... do you need it?

About storing the info... I think we are going to use OPML... which that program might be using too...

> 4.3. Visualization for activity and ADMET data in this RSS feed example
> (xsd:string, xsd:double, xsd:integer descriptors).

Again, can you give me some CML fragment? Is this atom based? Then Jmol can do it with a minor customization...

Egon |
From: Joerg K. W. <we...@in...> - 2004-03-19 11:32:42
|
Hi all,

1. I've again released a new JOELib version with a sequential SAX parser for HUGE CML files, both uncompressed and compressed (ZIP) (I've found a workaround)!!! I think that with this 'stable' CML standard support I can switch from SDF to CML2 as the standard format, because it can be compressed and is much more verbose, which is recommended for storing descriptors (WITH KERNEL information).
https://sourceforge.net/projects/joelib/

2. I've added the 'attributearray' CML writer, which conforms to the OpenBabel array, only because of its better layout in RSS files. I would be happy if you would add a CML generation section for JOELib (analogous to OpenBabel) to your CMLRSS documentation. Two properties in joelib.properties need to be set:
joelib.io.types.ChemicalMarkupLanguage.output=attributearray
## a first step to 'reproducible' descriptor calculation algorithms
joelib.io.types.ChemicalMarkupLanguage.output.storeChemistryKernelInfo=false

3.1. I've published a 'QSAR, LBDD and SBDD' RSS feed. I plan to release, from time to time (monthly), some algorithmic and QSAR, LBDD and SBDD topics with literature references.
http://joelib.sourceforge.net/rss/index.xml
Please see also the attached rssviewer.props

3.2. Two entries in the feed contain structures WITH solubility and activity data taken from the literature! Because these are the most interesting values for pharmaceutical companies and medicinal research, I think such entries could be really interesting for other companies and database suppliers as well! I would also find it interesting if a larger supplier could offer such an open RSS feed, to which others can submit things. Perhaps an open Wiki approach with mathematical equations (LaTeX support exists) AND chemical structures (none that I know of) could be set up?

4. CDK CML core wishlist (Egon):
4.1. Please move (from MDL CDO) or add some form of stereochemistry support to core CML and the RSS plugin. This is important for visualizing drugs!
4.2. Please accept RSS entries without structures without throwing an error. Please cross-check against the GPLed Java RSS viewer on SF.net:
https://sourceforge.net/projects/rssview/
It can also add and edit RSS feed properties and has proxy support; additionally, all information is stored in an XML file.
4.3. Visualization of activity and ADMET data in this RSS feed example (xsd:string, xsd:double, xsd:integer descriptors).

Kind CML2, RSS and developer regards, Joerg
--
Dipl. Chem. Joerg K. Wegner
Center of Bioinformatics Tuebingen (ZBIT)
Department of Computer Architecture
Univ. Tuebingen, Sand 1, D-72076 Tuebingen, Germany
Phone: (+49/0) 7071 29 78970
Fax: (+49/0) 7071 29 5091
E-Mail: mailto:we...@in...
WWW: http://www-ra.informatik.uni-tuebingen.de
--
Never mistake motion for action.
(E. Hemingway)
Never mistake action for meaningful action.
(Hugo Kubinyi,2004) |
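A minimal sketch, assuming the two settings from point 2 are kept in ordinary Java properties syntax, of how a program could read them with java.util.Properties. The class CmlOutputSettingsSketch and its methods are hypothetical and not part of JOELib; treating a missing storeChemistryKernelInfo key as false is also an assumption, since the real default is not stated in the message.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

/** Hypothetical helper, not part of JOELib: reads the two CML output settings. */
final class CmlOutputSettingsSketch {
    // Keys copied from the message above.
    static final String OUTPUT_KEY =
            "joelib.io.types.ChemicalMarkupLanguage.output";
    static final String KERNEL_KEY =
            "joelib.io.types.ChemicalMarkupLanguage.output.storeChemistryKernelInfo";

    /** Parses properties text; lines starting with '#' are comments. */
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new IllegalStateException("unreadable properties text", e);
        }
        return props;
    }

    /** True if the 'attributearray' CML writer is selected. */
    static boolean usesAttributeArray(Properties props) {
        return "attributearray".equals(props.getProperty(OUTPUT_KEY));
    }

    /** Assumption: a missing key is read as false. */
    static boolean storesKernelInfo(Properties props) {
        return Boolean.parseBoolean(props.getProperty(KERNEL_KEY, "false"));
    }

    public static void main(String[] args) {
        Properties props = parse(
                OUTPUT_KEY + "=attributearray\n"
                + "## a first step to 'reproducible' descriptor calculation algorithms\n"
                + KERNEL_KEY + "=false\n");
        System.out.println("attributearray writer: " + usesAttributeArray(props));
        System.out.println("store kernel info: " + storesKernelInfo(props));
    }
}
```

The '##' line is skipped by the loader because it starts with '#', so only the two key/value pairs end up in the Properties object.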
From: Peter Murray-R. <pm...@ca...> - 2004-03-17 10:17:59
|
At 10:46 16/03/2004 +0100, Joerg K. Wegner wrote:
>Hi Geoff,
>
>I've used Egon's code to establish CML2 support for JOELib. (Thank's again
>!!!). This includes full stereochemistry support, also for MDL SD were the
>missing lines were added to the writer (see attachments)!
>pipe: original.sdf -> test.cml -> test.sdf
>
>As already discussed, this can be a step to 'reproducible'
>conversion/descriptorCalculation results.
>
>CML (metainformations) and SDF (comment) contains now both the ID for the
>used chemistry kernel (expert systems).
>This number is the hash code for all hard- and soft-coded expert system
>informations.
>I've modified the JOEGlobalDataBase and all text file definitions. They
>contain now all CVS tags:
>
>VENDOR: http://joelib.sf.net
>RELEASE_VERSION: $Revision: 1.4 $
>RELEASE_DATE: $Date: 2004/03/15 13:33:42 $
>
>So this is not platform independant, BUT if we can find a way to assign
>the same vendor, version and date tag (independent standard organization
>or just a combined standard web page for JOELib/OpenBabel!!!) we can get
>the same hash codes !
>BTW, the hash code uses the standard hash code calculation for string in Java.

This looks very exciting - haven't worked through the details.

>So descriptors contain now also a reference to the used kernel, e.g.:
>
><cml:scalar dataType="xsd:integer" dictRef="joelib:kernel:715333816"
>title="Number_of_HBD_2">0</cml:scalar>

Yes - apart from the dictRef - this is great. At present it is required to add dataType="xsd:integer" if you wish to inform a generic processing engine that this is an integer. Otherwise it defaults to an xsd:string. JOELib itself knows the implied semantics and could omit it. The title is optional - it is for humans - machines simply replicate it. So if you are happy to refer to this by joelib:kernel:715333816 you have everything.

The string joelib:kernel:715333816 is an XML QName, so it can't have two colons. The prefix has no semantic value and maps to a URI.
So the full spec could be:

<cml:cml xmlns:cml="http://www.xml-cml.org/schema/cml2/core"
  xmlns:jk="http://joelib.sf.net/joelib/kernel/dict">
  <cml:scalar dataType="xsd:integer" dictRef="jk:a715333816" title="Number_of_HBD_2">0</cml:scalar>
</cml:cml>

Note that a QName must have both components starting with a letter, hence: a715333816

>Peter, is this now correct ?
>
>I will release a new JOELib version the next hours, then i'm going home,
>i'm sick.

Hope you get better soon. Best wishes

P.

MORE comments below:

>Kind regards, Joerg
>
><?xml version="1.0" encoding="ISO-8859-1"?>
><!DOCTYPE molecule SYSTEM "cml.dtd" []>
><cml:molecule xmlns:cml="http://www.xml-cml.org/schema/cml2/core"
>title="Gonan derivate with stereochemistry" id="-2114331329">

IDs must start with a letter

><cml:metadataList title="generated automatically from JOELib">
><cml:metadata name="dc:creator" content="Used JOELib chemistry kernel ID
>is 715333816 and the used CML writer is
>joelib.io.types.cml.MoleculeHuge(version 1.2)"></cml:metadata>
><cml:metadata name="dc:description" content="Conversion of legacy filetype
>to CML"></cml:metadata>
><cml:metadata name="dc:identifier" content="unknown"></cml:metadata>
><cml:metadata name="dc:content"></cml:metadata>
><cml:metadata name="dc:rights" content="unknown"></cml:metadata>
><cml:metadata name="dc:type" content="chemistry"></cml:metadata>
><cml:metadata name="dc:contributor" content="see http://joelib.sf.net for
>a full list of contributors"></cml:metadata>
><cml:metadata name="dc:date" content="16 Mar 2004 08:53:52
>GMT"></cml:metadata>
><cml:metadata name="cmlm:structure" content="yes"></cml:metadata>

Looks great

></cml:metadataList>
><cml:scalar dataType="xsd:string" dictRef="joelib:kernel"
>title="joelib:kernel:715333816:softCoded:joelib.data.JOEAromaticTyper"
>id="joelib:kernel:single:1345017519">joelib.data.JOEAromaticTyper
>http://joelib.sf.net joelib/data/plain/aromatic.txt 1.4
>2004-03-15_13-33-42</cml:scalar>

This is a compound data field. We are working on how dictionaries can support this. The fields are, I think:
classname URI id datetime
XSD can support complexContent which describes and validates these. We are looking at how the dictionaries can support this

><cml:scalar dataType="xsd:string" dictRef="joelib:kernel"
>title="joelib:kernel:715333816:softCoded:joelib.data.JOEAtomTyper"
>id="joelib:kernel:single:331687181">joelib.data.JOEAtomTyper
>http://joelib.sf.net joelib/data/plain/atomtype.txt 1.8
>2004-03-15_13-33-42</cml:scalar>
><cml:scalar dataType="xsd:string" dictRef="joelib:kernel"
>title="joelib:kernel:715333816"
>id="joelib:kernel:715333816">joelib:kernel:single:1345017519,
>joelib:kernel:single:331687181, joelib:kernel:single:-349822152,
>joelib:kernel:single:1305056389, joelib:kernel:single:2066602532,
>joelib:kernel:single:1443728660, joelib:kernel:single:2056739992,
>joelib:kernel:single:-465953241, joelib:kernel:single:-752030822,
>joelib:kernel:single:-647783807, joelib:kernel:single:1562896312,
>joelib:kernel:single:1380529384, joelib:kernel:single:2132350161,
>joelib:kernel:single:-1630937171, joelib:kernel:single:862707252,
>joelib:kernel:single:-94117937</cml:scalar>

I would make this an array and use prefixes:

><cml:array dataType="xsd:QName" dictRef="joelib:kernel"
>title="joelib:kernel:715333816" id="joelib:kernel:715333816">
jks:a1345017519 jks:a331687181 ...</cml:array>

But it looks messy. Do you actually want to process these as integers?
If so it can be redesigned

><cml:atom id="-2114331329:a1">
><cml:string builtin="elementType">C</cml:string>
><cml:float builtin="x2">13.916500091552734</cml:float>
><cml:float builtin="y2">-7.770199775695801</cml:float>

Looks good
This is CML1 - I assume you can now use CML2

><cml:scalar units="units:electron" dataType="xsd:float"
>dictRef="joelib:partialCharge">-0.022420638985526313</cml:scalar>

Yes, this is great. I am hoping to get some hierarchy for units so they don't need constant reiteration.
And if space matters you can use a CML2 array

><cml:integer builtin="hydrogenCount">5</cml:integer>
></cml:atom>
><cml:atom id="-2114331329:a2">
><cml:string builtin="elementType">C</cml:string>
><cml:float builtin="x2">13.916500091552734</cml:float>
><cml:name convention="trivial">Gonan derivate with stereochemistry</cml:name>
><cml:scalar dataType="xsd:double" dictRef="joelib:kernel:715333816"
>title="Fraction_of_rotatable_bonds">4.3478260869565216E-2</cml:scalar>
><cml:scalar dataType="xsd:double" dictRef="joelib:kernel:715333816"
>title="Geometrical_shape_coefficient">9.202824265150003</cml:scalar>
><cml:scalar dataType="xsd:double" dictRef="joelib:kernel:715333816"
>title="Molecular_weight">2.452599936723709E2</cml:scalar>
><cml:scalar dataType="xsd:string" title="RDF"><![CDATA[Gasteiger_Marsili
>50<1.5940199183282508E-16,3.5320351902369007E-12,1.0787560326377212E-8,4.562926679481138E-6,2.6872124085373483E-4,2.2069047752468125E-3,2.303245111884276E-3,-5.436356512771221E-4,-3.2478281395112424E-4,4.918651334382971E-5,1.536251324560617E-3,2.277435169742313E-3,1.300816511980075E-3,6.126140107135232E-4,-4.1824716100300705E-4,-1.368863814540861E-3,-8.208339192751625E-4,5.3376461771753E-4,1.534072562138197E-3,1.7908914501277423E-3,5.507175720486667E-6,-5.763211262243748E-4,1.4132473231521047E-4,8.686437677244833E-4,1.5090055350379477E-3,1.6546296948737362E-3,8.647678978194836E-4,2.500403657505191E-4,-8.328990776669339E-5,3.4797339389094526E-4,1.0482630671704465E-3,3.57933385516601E-4,-9.243630247315901E-5,-1.6374118268106688E-4,-6.744686206966648E-4,-7.3563670708480915E-6,1.0783852781926621E-3,1.1659437942465932E-3,5.3669116363927E-4,-1.1314694263345528E-4,-6.928623677582942E-4,-4.119554183357553E-4,2.042136965559031E-5,2.119917715225437E-4,2.94531451807724E-4,-6.505797814441567E-4,-2.9527826356305804E-4,6.187696472975766E-5,1.1298076477827559E-4,9.012712526200434E-5>]]></cml:scalar>

You could use array and xsd:float for this.
CDATA isn't needed unless you expect < or & in your content

><cml:scalar dataType="xsd:integer" dictRef="joelib:kernel:715333816"
>title="Number_of_rotatable_bonds">1</cml:scalar>
><cml:scalar dataType="xsd:integer" dictRef="joelib:kernel:715333816"
>title="Number_of_HBD_2">0</cml:scalar>
><cml:scalar dataType="xsd:integer" dictRef="joelib:kernel:715333816"
>title="Number_of_HBD_1">0</cml:scalar>
><cml:scalar dataType="xsd:double" dictRef="joelib:kernel:715333816"
>title="MolarRefractivity">9.199900000000002E1</cml:scalar>
><cml:scalar dataType="xsd:integer" dictRef="joelib:kernel:715333816"
>title="Number_of_aromatic_bonds">0</cml:scalar>
><cml:scalar dataType="xsd:double" dictRef="joelib:kernel:715333816"
>title="Zagreb_group_index_2">1.8E2</cml:scalar>
><cml:scalar dataType="xsd:double" dictRef="joelib:kernel:715333816"
>title="Zagreb_group_index_1">1.47E2</cml:scalar>
><cml:scalar dataType="xsd:double" dictRef="joelib:kernel:715333816"
>title="PolarSurfaceArea">0.0</cml:scalar>

This needs structuring...

><cml:scalar dataType="xsd:string" title="Topological_atom_pair"><![CDATA[42
>1
>Atom_valence
>C
>1.0
>5.0
>C
>3.0
>2
>C
>2.0

I have deleted the rest for convenience but will be happy to help with the design. Much of this can be made compact and semantically rich with array and table.

It's somewhat inconvenient working in email - would be better to have attachments. This is actually an excellent thing for a Wiki

Best

P.

Peter Murray-Rust
Unilever Centre for Molecular Informatics
Chemistry Department, Cambridge University
Lensfield Road, CAMBRIDGE, CB2 1EW, UK
Tel: +44-1223-763069 |
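The QName rule from the reply above can be sketched in Java. This is only an illustration under stated assumptions: KernelQNameSketch and its methods are hypothetical names, the 'a' prefix follows the a715333816 example, and the check covers only an ASCII subset of what XML allows in a prefixed name (the real NCName rule admits many more characters).

```java
/** Hypothetical helper, not part of JOELib or CML: builds dictRef values. */
final class KernelQNameSketch {

    // ASCII-only approximation of an XML name part: a letter or '_' first,
    // then letters, digits, '.', '_' or '-'. Colons are not allowed inside.
    private static final String NAME_PART = "[A-Za-z_][A-Za-z0-9._-]*";

    /** Turns an integer hash into a local name that starts with a letter. */
    static String localName(int kernelHash) {
        return "a" + kernelHash; // negative hashes give e.g. "a-349822152"
    }

    /** Builds "prefix:localName", e.g. "jk:a715333816". */
    static String dictRef(String prefix, int kernelHash) {
        return prefix + ":" + localName(kernelHash);
    }

    /** True if the string is exactly one prefix, one colon and one local name. */
    static boolean isSimpleQName(String candidate) {
        return candidate.matches(NAME_PART + ":" + NAME_PART);
    }

    public static void main(String[] args) {
        System.out.println(dictRef("jk", 715333816));
        System.out.println(isSimpleQName("joelib:kernel:715333816"));
    }
}
```

A hyphen is permitted after the first character of a name, so a negative hash code still yields a usable local name once the letter prefix is added. The same prefixing would also address the "IDs must start with a letter" remark for values such as -2114331329.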
From: Joerg K. W. <we...@in...> - 2004-03-16 10:43:56
|
Hi all,

a new JOELib release with ZIP import/export is available. It also accepts multiple files in one ZIP file, except for multiple CML files, because the SAX parser closes the stream after the first CML file. If anyone finds a workaround, I would be happy to hear about it.

So this is a really GREAT (or SMALL :-) way to save disk space for SDF descriptor files!

Kind regards, Joerg |
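One common way around the stream-closing problem described above is to hand the parser a wrapper whose close() does nothing, so that the underlying ZipInputStream stays open for the next entry. The sketch below uses only standard JDK classes; ZipSaxSketch, NonClosingStream and the in-memory test data are hypothetical, and the thread does not say whether this is the workaround JOELib later adopted.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch, not JOELib code: SAX-parse every XML entry of a ZIP. */
final class ZipSaxSketch {

    /** Wrapper whose close() is a no-op, so the parser cannot close the ZIP. */
    static final class NonClosingStream extends FilterInputStream {
        NonClosingStream(InputStream in) {
            super(in);
        }

        @Override
        public void close() {
            // intentionally empty: the caller closes the ZIP stream itself
        }
    }

    /** Parses all entries and returns the total number of start tags seen. */
    static int countElements(byte[] zipBytes) {
        final int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                count[0]++;
            }
        };
        SAXParserFactory factory = SAXParserFactory.newInstance();
        try (ZipInputStream zip =
                     new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            // Each entry ends with its own end-of-stream, so the parser
            // sees one complete document per entry.
            while (zip.getNextEntry() != null) {
                factory.newSAXParser().parse(new NonClosingStream(zip), handler);
            }
        } catch (Exception e) {
            throw new IllegalStateException("could not parse ZIP entries", e);
        }
        return count[0];
    }

    /** Builds a small in-memory ZIP with one XML document per entry. */
    static byte[] zipOf(String... documents) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(bytes)) {
            for (int i = 0; i < documents.length; i++) {
                out.putNextEntry(new ZipEntry("mol" + i + ".cml"));
                out.write(documents[i].getBytes(StandardCharsets.UTF_8));
                out.closeEntry();
            }
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] zip = zipOf("<molecule><atom/></molecule>", "<molecule/>");
        System.out.println("elements seen: " + countElements(zip));
    }
}
```

The design choice here is to leave the parser untouched and control the stream lifetime from the outside, which keeps the approach independent of the SAX implementation in use.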
From: Joerg K. W. <we...@in...> - 2004-03-16 09:45:55
|
Hi Geoff,

I've used Egon's code to establish CML2 support for JOELib (thanks again!!!). This includes full stereochemistry support, also for MDL SD, where the missing lines were added to the writer (see attachments)!
pipe: original.sdf -> test.cml -> test.sdf

As already discussed, this can be a step towards 'reproducible' conversion/descriptorCalculation results.

CML (meta information) and SDF (comment) now both contain the ID of the chemistry kernel used (expert systems). This number is the hash code of all hard- and soft-coded expert system information. I've modified the JOEGlobalDataBase and all text file definitions. They now all contain CVS tags:

VENDOR: http://joelib.sf.net
RELEASE_VERSION: $Revision: 1.4 $
RELEASE_DATE: $Date: 2004/03/15 13:33:42 $

So this is not platform independent, BUT if we can find a way to assign the same vendor, version and date tags (an independent standards organization or just a combined standard web page for JOELib/OpenBabel!!!) we can get the same hash codes!
BTW, the hash code uses the standard hash code calculation for strings in Java.

So descriptors now also contain a reference to the kernel used, e.g.:

<cml:scalar dataType="xsd:integer" dictRef="joelib:kernel:715333816" title="Number_of_HBD_2">0</cml:scalar>

Peter, is this correct now?

I will release a new JOELib version in the next few hours, then I'm going home, I'm sick.

Kind regards, Joerg |
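The kernel ID idea above can be sketched as follows. The message only states that Java's standard String hash code is used; the field order, the space separator and the way the per-component IDs are combined are assumptions made for illustration, and KernelHashSketch is a hypothetical class, not JOELib's implementation. The sketch is therefore not expected to reproduce the value 715333816.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch, not JOELib code: hash-based kernel identifiers. */
final class KernelHashSketch {

    /** Joins the identifying fields with spaces (assumed layout) and hashes them. */
    static int singleKernelId(String className, String vendor, String resource,
                              String version, String date) {
        String key = String.join(" ", className, vendor, resource, version, date);
        // String.hashCode() is defined by a fixed formula in the Java API,
        // so equal strings give equal IDs on every JVM.
        return key.hashCode();
    }

    /** Combines the per-component IDs into one overall ID (assumed scheme). */
    static int combinedKernelId(List<Integer> singleIds) {
        StringBuilder joined = new StringBuilder();
        for (int id : singleIds) {
            joined.append(id).append(',');
        }
        return joined.toString().hashCode();
    }

    public static void main(String[] args) {
        int aromatic = singleKernelId("joelib.data.JOEAromaticTyper",
                "http://joelib.sf.net", "joelib/data/plain/aromatic.txt",
                "1.4", "2004-03-15_13-33-42");
        int atomType = singleKernelId("joelib.data.JOEAtomTyper",
                "http://joelib.sf.net", "joelib/data/plain/atomtype.txt",
                "1.8", "2004-03-15_13-33-42");
        System.out.println("single IDs: " + aromatic + ", " + atomType);
        System.out.println("combined ID: "
                + combinedKernelId(Arrays.asList(aromatic, atomType)));
    }
}
```

Because the hash depends on every character of the vendor, version and date tags, two toolkits would only arrive at the same ID if they agree on those tags exactly, which is the point made in the message about a shared standard.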