From: Egon W. <ego...@gm...> - 2009-04-29 07:49:47
On Thu, Apr 23, 2009 at 11:26 AM, Paula de Matos <pm...@eb...> wrote:
> Thanks very much for pointing us to this dictionary.
>
> One of our curators has had a look at your dictionary and the consensus is
> that we would very much like to incorporate your synonyms list.

Great. I have emailed the two other former contributors of the project, Micha and Geert, and will send a separate email about approval in a sec...

> I have a couple of questions. Regarding the encoding, do you have a plain
> UTF-8 encoding file - there seem to be some unknown unicode characters?

Yes, that single woclist.xml file is actually generated from individual files... I'll look them up today.

> Also is there a name for the project. In all data that we source externally
> we indicate where it's come from through the source attribute. In this case
> what would this project be called?

"Woordenboek Organische Chemie"

>> I am not sure that matching the English name is the best method here but
>> why not try. There are also CAS numbers and SMILES. But there are some
>> concepts which definitely have no CAS or structure, e.g. "ion" or
>> "functional groups".
>>
>> A few things I've noticed:
>>
>> 9 <LANG ID="EN">1,1,1-trichloorethane</LANG>
>>
>> Wrong! There is no double "o" in 1,1,1-trichloroethane

My apologies for the typos. The r and o had been switched.

> I think where we can't do an exact match to an English name, we should not
> import it.

I would love to receive a list of the things it cannot match, so that I can fix possible problems.

>> 17895 <NAME CLASS="TRIVIAL">
>> 17896 <LANG ID="NL">isoalkanen</LANG>
>> 17897 <LANG ID="EN">isoalkenes</LANG>
>> 17898 <LANG ID="DE">Isoalkane</LANG>
>> 17899 </NAME>
>>
>> "isoalkanes" and "isoalkenes" are not synonyms.

Oops, no sure. Well, even curators make mistakes?
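A minimal sketch of how such a list of unmatched entries could be produced. It assumes that each NAME block sits inside its ITEM element and that the English names are the LANG elements with ID="EN", as the quoted snippets suggest; the WOCLIST root element, the EX1/EX2 item IDs and the unmatched_items helper are made up for illustration, and the real woclist.xml may be laid out differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-item sample in the shape of the quoted snippets.
# The root element name and the item IDs are assumptions.
SAMPLE = """<WOCLIST>
  <ITEM ID="EX1" NAME="heptaan" CODE="heptaan">
    <NAME CLASS="TRIVIAL">
      <LANG ID="NL">heptaan</LANG>
      <LANG ID="EN">heptane</LANG>
    </NAME>
  </ITEM>
  <ITEM ID="EX2" NAME="1,1,1-trichloorethaan" CODE="1,1,1-trichloorethaan">
    <NAME CLASS="TRIVIAL">
      <LANG ID="NL">1,1,1-trichloorethaan</LANG>
      <LANG ID="EN">1,1,1-trichloorethane</LANG>
    </NAME>
  </ITEM>
</WOCLIST>"""


def unmatched_items(xml_text, known_english_names):
    """Return (item ID, English names) for every ITEM that has no
    English name exactly equal to one of the known names."""
    known = set(known_english_names)
    root = ET.fromstring(xml_text)
    report = []
    for item in root.iter("ITEM"):
        # Collect all English names listed anywhere below this ITEM.
        english = [
            lang.text.strip()
            for lang in item.iter("LANG")
            if lang.get("ID") == "EN" and lang.text
        ]
        # Exact, case-sensitive comparison: no fuzzy matching.
        if not any(name in known for name in english):
            report.append((item.get("ID"), english))
    return report


if __name__ == "__main__":
    for item_id, names in unmatched_items(
        SAMPLE, {"heptane", "1,1,1-trichloroethane"}
    ):
        print(item_id, names)
```

With the sample above, only the item carrying the misspelled "1,1,1-trichloorethane" ends up in the report, which is the kind of entry that would need fixing on the WOC side.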
>> 27578 <ITEM ID="WOC00000866" NAME="pyrethro�den"
>> CODE="pyrethroiden">
>>
>> Unknown Unicode character in the NAME
>>
>> I see there are *lots* of these, afaics these appear instead of accented
>> vowels. I wonder if the xml file with "real" unicodes is available.

Originally we had each WOC item in a separate XML file. I will look these up.

> Yeh we noticed these when we were looking at the file. We would need to
> convert them to our special character based on the UTF-8.
>>
>> Some names are classed as "IUPAC" and some as "TRIVIAL". E.g.
>>
>> 16189 <NAME CLASS="TRIVIAL">
>> 16190 <LANG ID="NL">heptaan</LANG>
>> 16191 <LANG ID="EN">heptane</LANG>
>> 16192 <LANG ID="DE">Heptan</LANG>
>> 16193 <LANG ID="FY">heptaan</LANG>
>> 16194 </NAME>
>>
>> (heptane IS an IUPAC name)

The convention we used here was that there is one single preferred IUPAC name, and all other names are classed as general TRIVIAL names. It was a bit arbitrary, but we did not want to list more than one IUPAC name. (Not sure, though, why heptane was not our preferred IUPAC name... this too may be an artifact of the script that creates the single woclist.xml... I'll get back to the individual .xml files)

>> but
>>
>> 16233 <NAME CLASS="IUPAC">
>> 16234 <LANG ID="EN">heptanoic acid</LANG>
>> 16235 <LANG ID="NL">heptaanzuur</LANG>
>> 16236 </NAME>
>
> I don't think he claims that they are official IUPAC Names, I think they
> might just be source IUPAC. He specifically mentioned that they were not
> official IUPAC Names.

The IUPAC rules are indeed not clean enough to rule out multiple valid IUPAC names for a single compound. There is no such thing as a unique, official IUPAC name.

>>> Let me know if this is of good enough quality to go directly into ChEBI
>>> and if so what status.
>>>
>>
>> I think these are valuable and it could be good to have them in ChEBI. (I
>> am talking only about Dutch terms now; not sure about German and French
>> but they probably also may come in unless we already have them.)
>> What will be "source" of these?

We used various sources, a couple of books. The German names come from the books:

"Abitur Wissen - Chemie", F. Gold, A. Hammann, O. Vogl, Fischer Taschenbuch Verlag GmbH, Frankfurt am Main, 1992
"Bio-organische Chemie", J-H. Fuhrhop, Georg Thieme Verlag, Stuttgart, New York, 1982

French and Frisian translations were sourced from the internet.

> I agree, I say we import the Dutch names matched exactly to the English as
> CHECKED and the German and French as OK.
>
> That's a good point, perhaps just the project name like WOC?

https://sourceforge.net/projects/woc/

This actually brings up an interesting project... translating ChEBI in general... Have you considered using crowdsourcing for this? For example, using the '.po' file mechanism that open source projects use to translate software? Translation can then be done via a website, e.g. as we do for JChemPaint:

https://translations.launchpad.net/jchempaint/trunk/+pots/keys

Egon

--
Post-doc @ Uppsala University
http://chem-bla-ics.blogspot.com/