From: Peter Murray-R. <pm...@ca...> - 2004-02-02 10:25:09
This sounds great. Shouldn't be too difficult. We have parsed a number of pages for chemical content, and I know that Henry has too. So we could put together a collaborative list of sites that we can scrape for chemistry. The results would all be in CML with appropriate metadata. This would solve most of the remaining technical problems - then only the legal ones remain.

P.

At 10:34 02/02/2004 +0100, E.L. Willighagen wrote:
>-----BEGIN PGP SIGNED MESSAGE-----
>Hash: SHA1
>
>On Monday 02 February 2004 10:17, Peter Murray-Rust wrote:
> > At 22:23 01/02/2004 +0100, Egon Willighagen wrote:
> > >dadml://pdb/?1CRN
> > >
> > >or dadml://any/pdbid?1CRN
> > >
> > >The second will try any mirror that can return information based on the
> > >pdbid...
> >
> > Presumably someone enters these mirrors and keeps their addresses and
> > templates up to date.
>
>Yes. The nice thing about the DADML system is that the maintenance can be
>done by website developers, much like the real domain name server system...
>
> > Is there a cascade - if mirror 1 fails does mirror2
> > get called? And what is returned - the actual file?
> >
> > If so we have something like:
> >
> > User -> PDBCode -> server
> > server -> munged URL (format1)-> mirror1 -> success/error
> >     success -> PDB file -> user
> >     failure
> > server -> munged URL (format2)-> mirror2 -> success/error
> > and so on
> >
> > is this the model?
>
>Yes, more or less. An HTTP 404 is easily detected, but the system can also
>detect things like a returned webpage which states that no information is
>available...
>
> > >The DADML system also supports retrieving information in other formats, not
> > >just chemical/x-pdb or chemical/x-cml, but also text/html etc..
> > >I'm not sure if we want to be able to do that sort of thing too, so for
> > > now it only supports reading chemical formats...
> >
> > The attraction of chemical/x-* is that the information contained within
> > each is (relatively?!) consistent and structured. For an arbitrary web site
> > producing HTML the structure could be anything and a separate parser has to
> > be written for each. (For example we have written parsers for 2 of the main
> > sites offering small molecule information and they obviously are completely
> > different.
>
>DADML does not deal with interpretation of the returned format... the
>cdk.internet.dadml.DADMLReader does a bit... it can read molecules from
>chemical/x-mdl-mol and chemical/x-cml and others... actually, it completely
>disregards the MIME system, and just uses the cdk.io.ReaderFactory and looks
>at the contents of the stream...
>
> > Moreover the structure of the pages changes regularly. For
> > example the *text/html* on the RCSB site will be completely different from
> > that on the EBI site even though the actual PDB file is presumably the same
> > or closely related. It is the consistency of chemical/x-* that makes it
> > useful for machines to parse.
>
>So far the DADML has only been used to read clear chemical formats, and
>display HTML as is... without any interpretation step... It would be very
>nice to have a web service at WWMM that accepts a URL or DADML URI
>(dadml://nist-html/cas/50-00-0) and converts the HTML into a CML stream...
>
>Something like: dadml://wwmm-nist-bridge/cas/50-00-0
>
>Egon
>
>- --
>eg...@sc...
>PhD on Molecular Representation in Chemometrics
>Nijmegen University
>http://www.cac.sci.kun.nl/people/egonw/
>GPG: 1024D/D6336BA6
>
>-----BEGIN PGP SIGNATURE-----
>Version: GnuPG v1.0.7 (SunOS)
>
>iD8DBQFAHhm2d9R8I9Yza6YRAnBNAJwICKAnGbYiu0lOSQvQuk/FySQxGACgp8aT
>HR1eqfmcCDb6D4uCpzE7GD0=
>=Idqz
>-----END PGP SIGNATURE-----

Peter Murray-Rust
Unilever Centre for Molecular Informatics
Chemistry Department, Cambridge University
Lensfield Road, CAMBRIDGE, CB2 1EW, UK
Tel: +44-1223-763069
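
The mirror cascade discussed in the quoted exchange (try mirror 1, fall back to mirror 2, and treat both an HTTP 404 and a "no information available" page as failure) can be sketched roughly as below. This is only an illustration of the idea, not the DADML or CDK implementation: the mirror URL templates, the marker phrases, and the `fetch` callable are all hypothetical, and the fetcher is injected so the logic can be tried without network access.

```python
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical URL templates; a real registry would be maintained by the sites.
MIRRORS = [
    "http://mirror1.example.org/pdb/{id}.pdb",
    "http://mirror2.example.org/fetch?pdbid={id}",
]

# Hypothetical phrases marking a page that loads fine but carries no data.
NO_DATA_MARKERS = ("no information available", "no entry found")

# A fetcher maps a URL to (HTTP status code, response body).
Fetch = Callable[[str], Tuple[int, str]]


def looks_empty(body: str) -> bool:
    """Return True if the body is blank or is a 'nothing here' page."""
    lowered = body.lower()
    return not body.strip() or any(m in lowered for m in NO_DATA_MARKERS)


def resolve(identifier: str, fetch: Fetch,
            mirrors: Iterable[str] = MIRRORS) -> Optional[str]:
    """Try each mirror in order; return the first usable body, else None."""
    for template in mirrors:
        url = template.format(id=identifier)
        try:
            status, body = fetch(url)
        except OSError:
            continue  # mirror unreachable: fall through to the next one
        if status == 200 and not looks_empty(body):
            return body
    return None


def fake_fetch(url: str) -> Tuple[int, str]:
    """Stand-in fetcher: mirror 1 returns 404, mirror 2 returns content."""
    if "mirror1" in url:
        return 404, "Not Found"
    return 200, "HEADER stand-in PDB content for 1CRN"


if __name__ == "__main__":
    print(resolve("1CRN", fake_fetch))
```

Keeping the "is this page really data?" check separate from the HTTP status check mirrors the point made in the thread: a 404 is easy to detect, while a page that merely says nothing is available needs its own site-specific test.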