From: Egon W. <ego...@gm...> - 2009-04-29 07:49:47
On Thu, Apr 23, 2009 at 11:26 AM, Paula de Matos <pm...@eb...> wrote:
> Thanks very much for pointing us to this dictionary.
>
> One of our curators has had a look at your dictionary and the consensus is
> that we would very much like to incorporate your synonyms list.

Great. I have emailed the two other former contributors of the project,
Micha and Geert, and will send a separate email about approval in a sec...

> I have a couple of questions. Regarding the encoding, do you have a plain
> UTF-8 encoded file - there seem to be some unknown Unicode characters?

Yes, that single woclist.xml file is actually generated from individual
files... I'll look them up today.

> Also, is there a name for the project? In all data that we source externally
> we indicate where it has come from through the source attribute. In this case
> what would this project be called?

"Woordenboek Organische Chemie"

>> I am not sure that matching the English name is the best method here but
>> why not try. There are also CAS numbers and SMILES. But there are some
>> concepts which definitely have no CAS or structure, e.g. "ion" or
>> "functional groups".
>>
>> A few things I've noticed:
>>
>> 9 <LANG ID="EN">1,1,1-trichloorethane</LANG>
>>
>> Wrong! There is no double "o" in 1,1,1-trichloroethane

My apologies for the typos. The r and o had been switched.

> I think where we can't do an exact match to an English name, we should not
> import it.

I would love to receive a list of the names that cannot be matched, so that
I can fix possible problems.

>> 17895 <NAME CLASS="TRIVIAL">
>> 17896   <LANG ID="NL">isoalkanen</LANG>
>> 17897   <LANG ID="EN">isoalkenes</LANG>
>> 17898   <LANG ID="DE">Isoalkane</LANG>
>> 17899 </NAME>
>>
>> "isoalkanes" and "isoalkenes" are not synonyms.

Oops, not sure. Well, even curators make mistakes?
>> 27578 <ITEM ID="WOC00000866" NAME="pyrethro�den"
>> CODE="pyrethroiden">
>>
>> Unknown Unicode character in the NAME
>>
>> I see there are *lots* of these, afaics these appear instead of accented
>> vowels. I wonder if the xml file with "real" unicodes is available.

Originally we had each WOC item in a separate XML file. I will look these
up.

> Yeah, we noticed these when we were looking at the file. We would need to
> convert them to our special character based on the UTF-8.
>>
>> Some names are classed as "IUPAC" and some as "TRIVIAL". E.g.
>>
>> 16189 <NAME CLASS="TRIVIAL">
>> 16190   <LANG ID="NL">heptaan</LANG>
>> 16191   <LANG ID="EN">heptane</LANG>
>> 16192   <LANG ID="DE">Heptan</LANG>
>> 16193   <LANG ID="FY">heptaan</LANG>
>> 16194 </NAME>
>>
>> (heptane IS an IUPAC name)

The convention we used here was that there was one single preferred IUPAC
name, and all other names were classed as general TRIVIAL names. It was a
bit arbitrary, but we did not want to list more than one IUPAC name.

(Not sure, though, why heptane was not our preferred IUPAC name... this too
may be an artifact of the script that creates the single woclist.xml... I'll
get back to you on the individual .xml files.)

>> but
>>
>> 16233 <NAME CLASS="IUPAC">
>> 16234   <LANG ID="EN">heptanoic acid</LANG>
>> 16235   <LANG ID="NL">heptaanzuur</LANG>
>> 16236 </NAME>
>
> I don't think he claims that they are official IUPAC Names, I think they
> might just be source IUPAC. He specifically mentioned that they were not
> official IUPAC Names.

The IUPAC rules are indeed not strict enough to rule out multiple valid
IUPAC names for a single compound. There is no such thing as a unique,
official IUPAC name.

>>> Let me know if this is of good enough quality to go directly into ChEBI
>>> and if so what status.
>>>
>>
>> I think these are valuable and it could be good to have them in ChEBI. (I
>> am talking only about Dutch terms now; not sure about German and French
>> but they probably also may come in unless we already have them.) What will
>> be "source" of these?

We used various sources, a couple of books. The German names come from the
books:

"Abitur Wissen - Chemie", F. Gold, A. Hammann, O. Vogl, Fischer Taschenbuch
Verlag GmbH, Frankfurt am Main, 1992

"Bio-organische Chemie", J-H. Fuhrhop, Georg Thieme Verlag, Stuttgart, New
York, 1982

French and Frisian translations were sourced from the internet.

> I agree, I say we import the Dutch names matched exactly to the English as
> CHECKED and the German and French as OK.
>
> That's a good point, perhaps just the project name like WOC?
> https://sourceforge.net/projects/woc/

This actually brings up an interesting project... translating ChEBI in
general... Have you considered using crowdsourcing for this? For example,
using the '.po' file mechanism that open source projects use to translate
software? Translation can then be done via a website, e.g. as we do for
JChemPaint:

https://translations.launchpad.net/jchempaint/trunk/+pots/keys

Egon

--
Post-doc @ Uppsala University
http://chem-bla-ics.blogspot.com/
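For reference, the exact-match import and the "list of things it cannot match" discussed above could be sketched roughly as follows. This is only an illustration under assumptions: the <WOC> root element, the nesting of <NAME> inside <ITEM>, the sample item IDs, and the match_names helper are hypothetical. Only the ITEM, NAME and LANG elements and their attributes appear in the thread, and the list of known English names stands in for whatever ChEBI would supply.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: the <WOC> root, the nesting and the item IDs are
# assumptions made for this sketch, not the real woclist.xml layout.
SAMPLE = """<WOC>
  <ITEM ID="WOC00000001" NAME="heptaan" CODE="heptaan">
    <NAME CLASS="TRIVIAL">
      <LANG ID="NL">heptaan</LANG>
      <LANG ID="EN">heptane</LANG>
      <LANG ID="DE">Heptan</LANG>
    </NAME>
  </ITEM>
  <ITEM ID="WOC00000002" NAME="isoalkanen" CODE="isoalkanen">
    <NAME CLASS="TRIVIAL">
      <LANG ID="NL">isoalkanen</LANG>
      <LANG ID="EN">isoalkenes</LANG>
    </NAME>
  </ITEM>
</WOC>"""


def match_names(xml_text, known_english):
    """Split NAME groups into exact English matches and unmatched ones."""
    known = set(known_english)
    matched, unmatched = [], []
    for item in ET.fromstring(xml_text).iter("ITEM"):
        for name in item.iter("NAME"):
            # Map language code -> name text for this NAME group.
            langs = {lang.get("ID"): (lang.text or "").strip()
                     for lang in name.iter("LANG")}
            english = langs.get("EN")
            if english is None:
                continue  # nothing to match against
            record = (item.get("ID"), english, langs.get("NL"))
            # Exact string comparison only; no fuzzy matching.
            (matched if english in known else unmatched).append(record)
    return matched, unmatched


matched, unmatched = match_names(SAMPLE, ["heptane", "isoalkanes"])
print("import:", matched)
print("report back:", unmatched)
```

The unmatched list is what would be sent back to the WOC maintainers, so that entries such as the "isoalkenes" mix-up can be corrected at the source rather than silently dropped.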
From: Geert J. <Gee...@da...> - 2009-04-29 10:19:03
Hi Egon, Paula,

Some minor comments on Egon's lengthy response (inserted below).

> > I have a couple of questions. Regarding the encoding, do you have a
> > plain UTF-8 encoded file - there seem to be some unknown
> > Unicode characters?
>
> Yes, that single woclist.xml file is actually generated from
> individual files... I'll look them up today.

I don't think that the separate source files are in UTF-8, but in
ISO-8859-1. Or at least a mixture of both. They may accidentally have been
read with the wrong encoding when they were joined together. That would
explain the nonexistent numeric character references for simple characters
like i with umlaut.

> > Yeah, we noticed these when we were looking at the file. We would need
> > to convert them to our special character based on the UTF-8.
> >>
> >> Some names are classed as "IUPAC" and some as "TRIVIAL". E.g.
> >>
> >> 16189 <NAME CLASS="TRIVIAL">
> >> 16190   <LANG ID="NL">heptaan</LANG>
> >> 16191   <LANG ID="EN">heptane</LANG>
> >> 16192   <LANG ID="DE">Heptan</LANG>
> >> 16193   <LANG ID="FY">heptaan</LANG>
> >> 16194 </NAME>
> >>
> >> (heptane IS an IUPAC name)
>
> The convention we used here was that there was one single
> preferred IUPAC name, and all other names were classed as general
> TRIVIAL names. It was a bit arbitrary, but we did not want to list
> more than one IUPAC name.
>
> (Not sure, though, why heptane was not our preferred IUPAC
> name... this too may be an artifact of the script that creates
> the single woclist.xml... I'll get back to you on the individual
> .xml files.)

I recall that IUPAC has one officially preferred term for each chemical
compound, less preferred ones, and also some trivial names that have been
promoted to 'official' IUPAC alternative names as well. In this particular
case one could argue that the official name is n-heptane and not heptane,
since there is also iso-heptane, for instance. But to be honest, it was
nearly 10 years ago that I last did real chemistry.
(I have been in the IT business since then, though I still do lots of XML
processing.)

> >> but
> >>
> >> 16233 <NAME CLASS="IUPAC">
> >> 16234   <LANG ID="EN">heptanoic acid</LANG>
> >> 16235   <LANG ID="NL">heptaanzuur</LANG>
> >> 16236 </NAME>
> >
> > I don't think he claims that they are official IUPAC Names, I think
> > they might just be source IUPAC. He specifically mentioned that they
> > were not official IUPAC Names.
>
> The IUPAC rules are indeed not strict enough to rule out multiple
> valid IUPAC names for a single compound. There is no such thing as
> a unique, official IUPAC name.

Oh great, that contradicts my remarks above. IUPAC does give guidance for
arriving at a consistent name, such as starting with the longest chain of
C atoms. But indeed, there are compounds for which this still leaves room
for multiple names.

(no more comments)

Kind regards,
Geert Josten

Drs. G.P.H. Josten
Consultant
Daidalos BV
http://www.daidalos.nl/
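The mixed-encoding theory above can be illustrated with a short sketch. It is not the script that built woclist.xml; decode_mixed is a hypothetical helper, and the spelling "pyrethroïden" is assumed from the CODE attribute and the "i with umlaut" remark. The idea is to decode each source file strictly as UTF-8 and fall back to ISO-8859-1 only when that fails, so the joined file can be written out as clean UTF-8.

```python
def decode_mixed(raw: bytes) -> str:
    """Decode bytes as UTF-8 when valid, otherwise as ISO-8859-1.

    Heuristic only: ISO-8859-1 text containing accented characters is
    almost never valid UTF-8, so a strict UTF-8 failure is taken as a
    sign that the file is ISO-8859-1.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("iso-8859-1")


# Assumed spelling of the damaged name from the thread.
latin1_bytes = "pyrethroïden".encode("iso-8859-1")
utf8_bytes = "pyrethroïden".encode("utf-8")

# Both encodings are recovered to the same text.
print(decode_mixed(latin1_bytes))
print(decode_mixed(utf8_bytes))

# Reading ISO-8859-1 bytes as UTF-8 with lenient error handling puts the
# replacement character U+FFFD where the accented letter used to be.
damaged = latin1_bytes.decode("utf-8", errors="replace")
print(damaged)
```

Once the replacement character has been written into the joined file, the original letter cannot be recovered from it, which is why going back to the individual source files is the safer route.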