Re: [Cdk-devel] Biopolymers in CML

Greetings and thanks for your interest. I have copied this to the CML list 
as well.

Historically (about 10 years ago) CML addressed biopolymers, and JUMBO1 and 3
had working PDB readers. We stopped work on this about 3-4 years ago because
we thought that mmCIF and other approaches were better suited to the
community. However, CML can currently manage many aspects of biopolymers
quite well.

At 16:39 03/05/2005, Ola Spjuth wrote:
>Ctd on CDK-devel:
>
>On Tue, 2005-05-03 at 12:32, Egon Willighagen wrote:
> > Peter, Gemma, and Martin,
> >
> > I've copied you in with questions related to this cdk-user thread. Please
> > read, and look for your name. Thanx!
> >
> > On Tuesday 03 May 2005 11:59 am, Ola Spjuth wrote:
> > > I have not looked into the produced CML yet but intend to.
> >
> > Yes, please do, and post a CML atom output here...
>
>OK, here is one example:
>
>Line in PDB:
>---------------
>ATOM      1  O5*   A A   1      12.718  25.492  40.619  1.00 
>20.74     1ZAA  61

I comment after each line.

>In CML:
>---------
><atom elementType="O" id="O5*" x3="12.718" y3="25.492" z3="40.619">

IDs cannot carry punctuation such as "*". There will also be many atoms with
this string, which is really a label. Every atom must have a unique id
(within the molecule), and ids should only use [A-Za-z0-9_.-]. It would be
possible to call it (say) id="gly23.7".
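
As a rough illustration of the id rule, here is a minimal Python sketch. The
helper names (make_atom_id, is_valid_id) are hypothetical, and the pattern
(a letter followed by letters, digits, ".", "-" or "_") is an assumption
based on the examples in this message rather than a quotation from the
schema. The PDB label itself stays in a separate label element.

```python
import re

# Assumed id pattern: a letter, then letters, digits, '.', '-' or '_'.
ID_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9._-]*")


def make_atom_id(res_name, res_seq, index_in_residue):
    """Build an id such as 'gly23.7' from residue name, number and atom index."""
    # Keep only letters and digits from the residue name.
    clean = re.sub(r"[^A-Za-z0-9]", "", res_name).lower() or "res"
    # Ids are assumed to start with a letter.
    if not clean[0].isalpha():
        clean = "r" + clean
    return f"{clean}{res_seq}.{index_in_residue}"


def is_valid_id(candidate):
    """Return True if the whole string matches the assumed id pattern."""
    return ID_PATTERN.fullmatch(candidate) is not None
```

Note that residue name plus residue number is only unique within one chain,
so a real writer would also need to fold in the chain (and any insertion
code) or fall back to a running counter such as "a12345".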

><scalar title="pdb.record"
>xmlns="http://www.xml-cml.org/schema/cml2/core">ATOM      1  O5*   A A
>1      12.718  25.492  40.619  1.00 20.74      1ZAA  61</scalar>

You shouldn't need this at all. You can build it from the other components.

><scalar title="pdb.chainID"
>xmlns="http://www.xml-cml.org/schema/cml2/core">A</scalar>

You shouldn't need the namespace. It should be on molecule. However the way 
of describing the semantics is through a dictRef, which is namespaced. So 
you should write:

<molecule xmlns="http://www.xml-cml.org/schema/cml2/core"
          xmlns:pdb="http://www.farmbio.uu.se/dict/pdb">
   <atomArray>
     <atom elementType="O" id="a12345" x3="12.718" y3="25.492" z3="40.619">
       <label dictRef="pdb:atomlabel" value="O5*"/>
       <scalar dictRef="pdb:chainID">A</scalar>
       <scalar dictRef="pdb:serial">1</scalar>
...

You need to create your own dictionary (namespace prefix pdb) which defines 
the semantics of your components. Maybe at a later stage these terms will 
be defined by PDB.
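
For anyone generating this programmatically, a minimal Python sketch of the
same layout follows. It is only an illustration: build_molecule and the
dictionary keys it expects are hypothetical, and the pdb dictionary URI is
the placeholder used in the example above. The point is that the CML
namespace and the pdb prefix are written once, on molecule, and every PDB
field is tied to its meaning through a dictRef.

```python
import xml.etree.ElementTree as ET

CML_NS = "http://www.xml-cml.org/schema/cml2/core"
PDB_NS = "http://www.farmbio.uu.se/dict/pdb"  # placeholder dictionary URI


def build_molecule(atoms):
    """Serialise atoms (a list of dicts of strings) to a CML string."""
    # Writing the xmlns attributes by hand puts both declarations on the
    # root element only; children inherit the default namespace.
    molecule = ET.Element("molecule", {"xmlns": CML_NS, "xmlns:pdb": PDB_NS})
    atom_array = ET.SubElement(molecule, "atomArray")
    for a in atoms:
        atom = ET.SubElement(atom_array, "atom", {
            "elementType": a["element"], "id": a["id"],
            "x3": a["x"], "y3": a["y"], "z3": a["z"],
        })
        # The PDB atom name is a label, not an id.
        ET.SubElement(atom, "label",
                      {"dictRef": "pdb:atomlabel", "value": a["label"]})
        ET.SubElement(atom, "scalar", {"dictRef": "pdb:chainID"}).text = a["chain"]
        ET.SubElement(atom, "scalar", {"dictRef": "pdb:serial"}).text = a["serial"]
    return ET.tostring(molecule, encoding="unicode")


example = build_molecule([{
    "element": "O", "id": "a12345", "label": "O5*",
    "x": "12.718", "y": "25.492", "z": "40.619",
    "chain": "A", "serial": "1",
}])
```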

><scalar title="pdb.serial"
>xmlns="http://www.xml-cml.org/schema/cml2/core">1</scalar>
><scalar title="hetatm"
>xmlns="http://www.xml-cml.org/schema/cml2/core">0</scalar>
><scalar title="pdb.resName"
>xmlns="http://www.xml-cml.org/schema/cml2/core">A</scalar>
><scalar title="pdb.resSeq"
>xmlns="http://www.xml-cml.org/schema/cml2/core">1</scalar>
><scalar title="pdb.altLoc"
>xmlns="http://www.xml-cml.org/schema/cml2/core"></scalar>
><scalar title="org.jmol.adapter.cdk.ATOM_SET_INDEX"
>xmlns="http://www.xml-cml.org/schema/cml2/core">0</scalar>
><scalar title="InvariancePair"
>xmlns="http://www.xml-cml.org/schema/cml2/core">1       </scalar>

Note that whitespace in XML is fragile. Do you really need the spaces? If
so, suggest a different mechanism.

><scalar title="pdb.iCode"
>xmlns="http://www.xml-cml.org/schema/cml2/core"></scalar>
><scalar title="pdb.tempFactor"
>xmlns="http://www.xml-cml.org/schema/cml2/core">20.74</scalar>
><scalar title="CanonicalLable"
>xmlns="http://www.xml-cml.org/schema/cml2/core">1</scalar>
><scalar title="pdb.charge"
>xmlns="http://www.xml-cml.org/schema/cml2/core">61</scalar>
><scalar title="pdb.element"
>xmlns="http://www.xml-cml.org/schema/cml2/core"></scalar>
><scalar title="pdb.occupancy"
>xmlns="http://www.xml-cml.org/schema/cml2/core">1.0</scalar>

Occupancy is already in CML, so you don't need a separate scalar for it.

><scalar title="pdb.segID"
>xmlns="http://www.xml-cml.org/schema/cml2/core">1ZAA</scalar>
><scalar title="oxt"
>xmlns="http://www.xml-cml.org/schema/cml2/core">0</scalar>
><scalar title="pdb.name"
>xmlns="http://www.xml-cml.org/schema/cml2/core">O5*</scalar>
></atom>
>
>Some things seem wrong to me, like the pdb.charge=61 (where 61 is the
>line number in the pdb-file?).
>Some important parts of the PDB-header are missing, like Helix and Sheet
>information. I guess this must be taken into account if we are to
>display the protein in Jmol.
>
>The current solution makes files VERY large. My 112K PDB-file is 2.3M in
>CML. Hopefully we can come up with a more condensed solution.

Yes. The suggestions above reduce it somewhat (the namespace needs to be
declared only once). However, there is an array format which is actually more
compact than PDB itself. It would look like:

<atomArray elementTypeArray="N C O S N N C O ..." x3Array="1.2 2.3 3.4
-5.6 ...">
   <!-- put here the other non-standard material -->
   <array dictRef="pdb:resname">A A A G G G G Y Y ...</array>
etc.

This requires a modest amount of parsing but is compact and precise. If you
need whitespace inside the values (as in PDB atom names), use delimiters:
   <array dictRef="pdb:atomname" delim="/">/CA/ CA/</array>
which would be a calcium followed by a carbon (C-alpha).
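
The "modest amount of parsing" can be sketched in a few lines of Python.
This is an illustration only: join_array and split_array are hypothetical
helpers, and the convention assumed is the one shown above, with the
delimiter both leading and trailing the content.

```python
def join_array(values, delim=None):
    """Serialise values as the text content of an <array> element."""
    if not values:
        return ""
    if delim is None:
        # Whitespace-separated arrays cannot carry blanks inside a value.
        if any(v == "" or any(ch.isspace() for ch in v) for v in values):
            raise ValueError("use a delimiter for empty or spaced values")
        return " ".join(values)
    if any(delim in v for v in values):
        raise ValueError("delimiter occurs inside a value")
    # Leading and trailing delimiters protect whitespace at both ends.
    return delim + delim.join(values) + delim


def split_array(text, delim=None):
    """Recover the list of values from the text content of an <array>."""
    if delim is None:
        return text.split()
    # Drop the empty strings before the first and after the last delimiter.
    return text.split(delim)[1:-1]
```

With delim="/", the two atom names "CA" (calcium) and " CA" (carbon) survive
a round trip, whereas plain whitespace splitting would make them identical.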

>Note: This CML might have been constructed with the CMLWriter in
>Jmol.jar (according to the line with
>"org.jmol.adapter.cdk.ATOM_SET_INDEX") since it is conflicting with my
>cdk-libio-cml.jar. Convertor is from the latter though, compiled with
>JDK1.5.
>
> >
> > > > > To sum up: I/O of molecules and biopolymers in CML is central to our
> > > > > chemo/bio-informatic application so we really hope this will be given a
> > > > > priority.

This is great news. We have recently agreed with MSD that the ligands will 
be available in CML. So that should save you considerable trouble. As you 
know we work very closely with CDK and see it as a reference 
implementation. Thus if it doesn't already manage the suggestions above I 
am sure the group will be sympathetic!

P.

>We also hope that CML will be extended to comprise
> > > > > Biopolymers (preferably with strands according to Martin Eklunds
> > > > > submitted patch)
> > > >
> > > > That patch has been added to CVS.
> > >
> > > That was indeed good news. How do I check out and apply the patch?
> >
> > You'll have to check out CDK from CVS.
>
>Already done. Didn't realize that it was there. Will look at it.
>
> >
> > > We would be happy to test it.
> >
> > That would be very nice.
> >
> > > And is the CdkJmolAdapter also updated so we
> > > can send Biopolymers to Jmol?
> >
> > Haven't tested yet.
>
>Hmmm, this doesn't seem to work. Something for Martin to look into.
>
> >
> > Martin?
> >
> > > > > in
> > > > > the near future, but this is obviously out-of-list here :) This
> > > > > together would make CDK and CML very powerful for bioinformatics
> > > > > applications as well!
> > > >
> > > > Ok, good to hear that you are making progress... I haven't had time yet
> > > > to convert a PDB to CML, and see how the PDB fields are put into CML...
> > > > It would be nice to define how a PDB atom would look in CML, i.e. will
> > > > all PDB fields... when we have done this, the Convertor can be
> > > > modified...
> > >
> > > Maybe we can assist you with this. Or does it require deep knowledge
> > > about CML? That we have none...
> >
> > No, it does not require much intelligence whatsoever...
> >
> > In PDB an atom is stored as something like:
> >
> > ATOM      1  N   THR     1      17.047  14.099   3.625  1.00 13.79
> >
> > And all this info needs to be transferred into CML, so we need something like
> >
> > <atom elementType='N' x3='17.047' y3='14.099' z3='3.625'>
> >   <scalar dataType="xsd:string" dictRef="pdb:residueType">THR</scalar>
> > </atom>
> >
> > Here the <scalar> element defines extra information that cannot be stored
> > in other more core CML things. The element defines the content type, a string,
> > and a reference to a dictionary (pdb) item (residueType)...
> >
> > And defining this dictionary is the main thing we need to do...
> >
> > Gemma, do you have a dictionary for PDB information, like found in PDB files?
> >
> > > From what I remember, all PDB fields required to import CML as a
> > > Biopolymer are put in CML but are not used by the CMLReader or
> > > Convertor. Martin Eklund created a class to convert a Molecule to a
> > > Biopolymer, but then you need to know that it is a biopolymer
> > > beforehand. Also, Strands are not supported in this since CML doesn't
> > > support it (yet).
> >
> > I propose we go forward in the way PDB stores this information too:
> > have a field associated with the atoms to define to which strand they
> > belong...
> >
> > Alternatively, though it involves more thinking, there is this setup in CML:
> >
> > <molecule dictRef="pdb:compound" id="1CRN">
> >   <molecule dictRef="pdb:strand" id="strand1">
> >   </molecule>
> >   <molecule dictRef="pdb:strand" id="strand2">
> >   </molecule>
> >   <!-- etc -->
> > </molecule>
> >
> > Peter, are there other initiatives for PDB into CML-X conversion?
> >
> > Egon
>
>
>Cheers
>
>    .../Ola
>

Peter Murray-Rust
Unilever Centre for Molecular Informatics
Chemistry Department, Cambridge University
Lensfield Road, CAMBRIDGE, CB2 1EW, UK
Tel: +44-1223-763069 Fax: +44 1223 763076