At 15:02 +0000 29/10/06, S Page wrote:
>>I wanted to define an attribute Smiles:=string, where the string could contain
>>quite a few characters (it's more or less 1 character per atom in a
>>molecule, and some molecules can easily have 1000+ atoms). Is there a limit to how long
>>such a string could be?
>Yes, the Type:String code limits strings to 255 characters. There's a warning in the factbox when you exceed this. It was discussed earlier: http://sourceforge.net/mailarchive/message.php?msg_id=36371763 . You could probably work around the MediaWiki limitations by editing the PHP code and altering your database, but there's a performance hit.
>Can you explain or link to the underlying property for "Smiles" that you're trying to represent? Is it a chemical formula like NaCl? It's nice to understand what people are trying to do.
With the advent of an XML serialisation of chemistry, CML (a project Peter
Murray-Rust and I started in 1995, shortly after the inaugural WWW
conference at CERN, and following discussions with TimBL and
DR), the need for (globally) unique chemical IDs became apparent. SMILES
had been the first successful attempt to serialise a molecular structure into
a unique canonical string. Unfortunately, it was proprietary, and the
company that owned it, although they published the original spec,
changed that spec in their own implementation. Around 8 "varieties"
of SMILES emerged as a result. The newer InChI (International
Chemical Identifier) has emerged as a solution to this problem.
Typically, a "small" molecule is defined as having fewer than
1000 atoms. The InChI generated for such molecules has
up to 2000 characters. It could be hashed down to a much shorter
string of course, but this brings with it other problems, including
information loss (the InChI can currently be used to restore
the atoms and connectivity of the original molecule, whereas a hash cannot).
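
To make the trade-off concrete, here is a minimal Python sketch of hashing a long identifier down to a fixed-length key. The choice of SHA-256, the name short_key and the synthetic 2000-character string are my assumptions for illustration only; the 255 limit is the one quoted above.

```python
import hashlib

# Length limit of Type:String, as quoted earlier in this thread.
STRING_LIMIT = 255


def short_key(identifier: str) -> str:
    """Return a fixed-length SHA-256 hex digest of an identifier (hypothetical helper)."""
    return hashlib.sha256(identifier.encode("utf-8")).hexdigest()


# Synthetic placeholder of roughly the length discussed above; NOT a real InChI.
long_identifier = "InChI=1S/" + "C" * 2000

key = short_key(long_identifier)

print(len(long_identifier) > STRING_LIMIT)  # True: too long for Type:String
print(len(key))                             # 64: the digest always has this length
```

The key fits comfortably under the limit and the same input always gives the same key, but nothing in those 64 characters lets you rebuild the structure. The full string would still have to be stored somewhere else.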
So SMILES, and now InChI, represent an attribute of a molecule
which constitutes a unique canonical (algorithmic) identifier for it. We use the
InChI in an earlier semantic chemical project (using Sesame as the
triple store/logic engine), and were hoping to use Semantic MediaWiki as
the authoring/input environment for e.g. Sesame via SPARQL endpoints.
The InChI in particular is key to this.
I understand the reason for limiting an attribute to 255 characters.
Perhaps we should define an entirely new datatype (InChI) which
raises this limit (and put up with the performance hit?). Is
this a sensible approach? (We can think of a number of other chemical
datatypes, for example molecular formula.)
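
As a rough illustration of what such a datatype might check, here is a Python sketch that validates the "InChI=" prefix and splits the string into its "/"-separated layers, with no length cap. The function name parse_inchi is hypothetical and this is not Semantic MediaWiki code. It also assumes the first layer after the version is the molecular formula, which holds for ordinary molecules but not for every InChI.

```python
def parse_inchi(value: str) -> dict:
    """Split an InChI string into version, formula and remaining layers.

    Hypothetical helper: assumes the first layer after the version is
    the molecular formula.
    """
    prefix = "InChI="
    if not value.startswith(prefix):
        raise ValueError("not an InChI string: missing 'InChI=' prefix")
    parts = value[len(prefix):].split("/")
    if len(parts) < 2 or not parts[1]:
        raise ValueError("InChI string has no formula layer")
    return {"version": parts[0], "formula": parts[1], "layers": parts[2:]}


# Ethanol, written in the current standard InChI format.
ethanol = parse_inchi("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3")
print(ethanol["formula"])
print(ethanol["layers"])
```

A molecular formula attribute could then be derived from the stored InChI rather than entered separately, so the two cannot drift apart.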
+44 (020) 7594 5774 (Voice); +44 (0870) 132 3747 (eFax); rzepahs@... (iChat)
http://www.ch.ic.ac.uk/rzepa/ Dept. Chemistry, Imperial College London, SW7 2AZ, UK.