From: Giuseppe C. <g.c...@gm...> - 2006-03-30 14:05:56
|
I have an off-topic question to propose to this list. I'm working with eXist to store terminology records that contain terms in Arabic and Chinese, among other languages. With UTF-8 encoding I have no problem in the XML files, but I have to export these records to text files in order to import them into another, proprietary application. I do this with an XSLT transformation, but when I import the text files the Arabic and Chinese terms are all transformed into unrecognized characters and rendered as a series of ?. I know that a text file is not encoded in UTF-8 but in a machine-dependent encoding, but how can I maintain the original character sequence even for the Arabic and Chinese terms? This is my XSLT:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
    <xsl:output method="text" version="2.0" indent="no" encoding="UTF-8"/>
    <xsl:strip-space elements="*"/>
    <xsl:variable name="newline">
        <xsl:text>
</xsl:text>
    </xsl:variable>
    <xsl:template match="/items">
        <xsl:for-each select="termEntry">
            <xsl:apply-templates select="."/>
        </xsl:for-each>
    </xsl:template>
    <xsl:template match="termEntry">
        <xsl:text>**</xsl:text>
        <xsl:for-each select="descrip">
            <xsl:apply-templates select="."/>
        </xsl:for-each>
        <xsl:for-each select="langSet">
            <xsl:sort select="@xml:lang"/>
            <xsl:apply-templates select="."/>
        </xsl:for-each>
        <xsl:text>**</xsl:text>
        <xsl:value-of select="$newline"/>
    </xsl:template>
    <xsl:template match="descrip">
        <xsl:if test="node()!=''">
            <xsl:text>&lt;</xsl:text>
            <xsl:value-of select="@type"/>
            <xsl:text>&gt;</xsl:text>
            <xsl:apply-templates/>
            <xsl:value-of select="$newline"/>
        </xsl:if>
    </xsl:template>
    <xsl:template match="langSet">
        <xsl:for-each select="tig">
            <xsl:sort select="term"/>
            <xsl:apply-templates select="term">
                <xsl:with-param name="lang" select="parent::langSet/@xml:lang"/>
            </xsl:apply-templates>
            <xsl:apply-templates select="descrip"/>
        </xsl:for-each>
    </xsl:template>
    <xsl:template match="term">
        <xsl:param name="lang"/>
        <xsl:choose>
            [ ... ]
            <xsl:when test="$lang = 'en'">
                <xsl:text>&lt;English&gt;</xsl:text>
            </xsl:when>
            <xsl:when test="$lang = 'ar'">
                <xsl:text>&lt;Arabic&gt;</xsl:text>
            </xsl:when>
            <xsl:when test="$lang = 'zh'">
                <xsl:text>&lt;Chinese&gt;</xsl:text>
            </xsl:when>
        </xsl:choose>
        <xsl:apply-templates/>
        <xsl:value-of select="$newline"/>
    </xsl:template>
</xsl:stylesheet>

Thanks for any help, Giuseppe Corrarello. |
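[Editorial note: the behaviour Giuseppe describes can be reproduced outside XSLT. A minimal Python sketch, with sample strings of my own choosing rather than his data, showing that a plain text file can perfectly well be UTF-8 and that the '?' characters appear only when text is re-encoded into a charset lacking the needed characters:]

```python
# Sketch: UTF-8 round-trips Arabic and Chinese losslessly, while forcing
# the text into a single legacy codepage destroys one script or the other.
arabic = 'زراعة'       # Arabic for "agriculture" (sample, not from the thread)
chinese = '精准农业'    # Chinese for "precision agriculture" (sample)

# A UTF-8 "text file" carries both scripts without loss.
blob = (arabic + '\n' + chinese).encode('utf-8')
assert blob.decode('utf-8') == arabic + '\n' + chinese

# Re-encoding into cp1256 (Arabic Windows) keeps the Arabic but has no
# Chinese at all: every ideograph is replaced by a literal '?'.
assert chinese.encode('cp1256', errors='replace') == b'????'
```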
From: Gary L. <gar...@en...> - 2009-04-14 14:03:41
|
Hi,

I'm not sure how I ended up with non-UTF-8 characters in a query result: ? (0x96), ë (0xEB). Is there a default XQuery encoding, or does it need to be defined explicitly?

declare option exist:serialize "encoding=UTF-8";

Thanks for any help; I have no Dutch examples to test with right now. If it's not that, then it must be occurring during the load (using the XML:DB API).

gary |
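[Editorial note: the two byte values Gary quotes are themselves a strong hint. Neither 0x96 nor 0xEB is valid as a standalone byte in UTF-8, but both are meaningful in Windows-1252. A small Python check, my own sketch rather than anything from the thread:]

```python
# Sketch: 0x96 and 0xEB cannot stand alone in UTF-8 (both are
# continuation bytes there), but in Windows-1252 they are an en dash
# and 'ë' respectively - suggesting the data is cp1252, not UTF-8.
raw = bytes([0x96, 0xEB])

# Decoding as UTF-8 fails outright.
try:
    raw.decode('utf-8')
    assert False, 'should not decode as UTF-8'
except UnicodeDecodeError:
    pass

# Decoding as Windows-1252 yields sensible characters.
assert raw.decode('cp1252') == '\u2013ë'   # en dash followed by 'ë'
```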
From: Gary L. <gar...@en...> - 2009-04-14 16:10:49
|
This is most likely happening somewhere in Cocoon. I'm creating dynamic CForms from the query results, and it looks like the encoding is being dropped after the query, though I don't know where yet.

Sorry, gary

_____
From: Gary Larsen [mailto:gar...@en...]
Sent: Tuesday, April 14, 2009 10:05 AM
To: exi...@li...
Subject: [Exist-open] character encoding |
From: Pierrick B. <pie...@cu...> - 2006-03-30 14:19:41
|
Hi,

Giuseppe Corrarello wrote:
> I have an off topic question to propose at this list.

Indeed.

> I know that a text file is not encoded in UTF-8

Why not? A text file can be encoded in UTF-8, of course. Isn't your problem rather related to the text editor you use to check your text file? Please make sure before we investigate further.

Cheers,

--
Pierrick Brihaye, informaticien
Service régional de l'Inventaire / DRAC Bretagne
mailto:pie...@cu... / tél : +33 (0)2 99 29 67 78
Have you read http://usenet-fr.news.eu.org/fr-chartes/rfc1855.html ? |
From: Giuseppe C. <g.c...@gm...> - 2006-03-30 14:44:15
|
Hi Pierrick, Andy and all,

I know that in a text editor (maybe not all of them) I can't "see" the Arabic and Chinese characters, but the proprietary application imports text files and can show all of these characters. This application reads only text files encoded in ANSI or ASCII format. I have tried to set the xsl:output element's encoding attribute to US-ASCII, and in this way I lose all the characters like à, è etc. With the UTF-8 attribute I get the correct visualization of these characters, but still not of the Chinese and Arabic ones.

Thanks for your help, Giuseppe. |
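[Editorial note: the US-ASCII behaviour Giuseppe reports is expected, since ASCII simply has no code points for à or è; a serializer forced to ASCII can only drop, replace, or escape them. A quick Python illustration, mine rather than the thread's:]

```python
# Sketch: US-ASCII cannot represent à or è at all, so forcing ASCII
# output loses them, exactly as Giuseppe observed.
assert 'à'.encode('ascii', errors='replace') == b'?'
assert 'è'.encode('ascii', errors='replace') == b'?'

# UTF-8, by contrast, represents each of them as a two-byte sequence.
assert 'à'.encode('utf-8') == b'\xc3\xa0'
assert 'è'.encode('utf-8') == b'\xc3\xa8'
```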
From: Pierrick B. <pie...@cu...> - 2006-03-30 14:44:15
|
Hi,

Giuseppe Corrarello wrote:
> This application reads only text files encoded in ANSI or ASCII format,
> but I have tried to set the xsl:output element's encoding attribute to
> US-ASCII and in this way I have lost all the characters like à, è etc.

Sure.

> With the UTF-8 attribute I have the correct visualization of these
> characters but not yet for the Chinese and Arabic.

IMHO, the character encoding is correct, but your text editor seems to be unable to represent the non-Latin characters.

Maybe an output encoded in ISO-8859-6 would give some results with Arabic? Find some test documents there ;-)

http://cvs.savannah.nongnu.org/viewcvs/aramorph/src/java/gpl/pierrick/brihaye/aramorph/test/testdocs/?root=aramorph

Cheers,

--
Pierrick Brihaye, informaticien
Service régional de l'Inventaire / DRAC Bretagne
mailto:pie...@cu... / tél : +33 (0)2 99 29 67 78
Have you read http://usenet-fr.news.eu.org/fr-chartes/rfc1855.html ? |
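[Editorial note: Pierrick's ISO-8859-6 suggestion is easy to try programmatically, but it only helps if the importing application expects that exact codepage. A hedged Python sketch with a sample word of my choosing, showing that ISO-8859-6 and the Windows Arabic codepage (cp1256) produce different bytes for the same text:]

```python
# Sketch: transcoding UTF-8 Arabic into ISO-8859-6 works for the basic
# letters, but the bytes differ from cp1256 (Windows Arabic), so the
# target application must expect the right one of the two.
word = 'زراعة'  # "agriculture" (sample word, not from Giuseppe's data)

iso = word.encode('iso-8859-6')
cp = word.encode('cp1256')

# Each encoding round-trips its own bytes...
assert iso.decode('iso-8859-6') == word
assert cp.decode('cp1256') == word

# ...but the byte sequences are not identical, e.g. the letter ع sits
# at a different position in the two codepages.
assert iso != cp
```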
From: Giuseppe C. <g.c...@gm...> - 2006-03-30 15:03:21
|
> > With the UTF-8 attribute I have the correct visualization of these
> > characters but not yet for the Chinese and Arabic.
>
> IMHO, the character encoding is correct, but your text editor seems to
> be unable to represent the non-Latin characters.

I too think that this is the correct encoding, but in fact it isn't. Anyway, I don't open my result file in a text editor; I use it for importing the terminology records into an old version of Multiterm by Trados.

> Maybe an output encoded in ISO-8859-6 would give some results with Arabic?

I have tried to set this encoding, but nothing changed; I still can't see the Arabic characters.

> Find some test documents there ;-)
> http://cvs.savannah.nongnu.org/viewcvs/aramorph/src/java/gpl/pierrick/brihaye/aramorph/test/testdocs/?root=aramorph

Thanks again, Giuseppe. |
From: Thomas Z. <tho...@un...> - 2006-03-31 09:49:55
|
Giuseppe Corrarello wrote:
> I think too that this is the correct encoding, but in fact isn't;
> anyway I don't open my result file in a text editor, I use it for
> importing the terminology records into an old version of Multiterm
> by Trados software.
>
> I have tried to set this encoding but nothing is changed, I can't
> see the Arabic characters.

Arabic *and* Chinese characters in one file..?? Puh, I don't know if there is any text editor or text component which is able to display something like that correctly... Am I right that Arabic is read from right to left and Chinese from top to bottom...?

But perhaps some useful hints:

Did you try to open your file in a web browser, like Firefox, setting the encoding manually via "View / Character Encoding"? Try to use a font which really contains all Unicode characters:

http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&id=encore-ipa

Greetings,

Tom

--
Thomas Zastrow
Seminar fuer Sprachwissenschaft
Universitaet Tuebingen
http://www.sfs.uni-tuebingen.de/dialectometry/
Wilhelm Str. 19
D-72074 Tübingen
Tel.: 07071/29-73968
Fax: 07071/29-5214 |
From: Giuseppe C. <g.c...@gm...> - 2006-03-31 11:17:02
|
Hi Thomas,

> Arabic *and* Chinese characters in one file..?? Puh, I don't know if
> there is any text editor or text component which is able to display
> something like that correctly... Am I right that Arabic is read from
> right to left and Chinese from top to bottom...?

Yes, you are right, but this is not my problem. Let me try to explain my problem better. I have an application (Multiterm) that stores terminology records in a proprietary format, in all languages, even Chinese, Arabic, Russian etc. This application can export and import terminology records *only* in text files. When I export a record from Multiterm, this is the result:

**
<Entry Number>59877
<Status>Temporary
<Reliability>1
<Source Language>en
<Category>Terminology
<Subject>AGRICULTURE
<English>site-specific management
<Remarks>Short denomination.
<English>SSCM
<Form>Abbreviation
<French>agriculture de précision
<Remarks>Dénomination alternative.
<Source>^INRA^, France, 2005.
<French>aménagement spécifique à un site
<Remarks>Dénomination alternative.
<Source>^INRA^, France, 2005.
<Spanish>manejo sitio específico
<Source>Sitio Web de la Agricultura de precisión, 2005 (http://www.agriculturadeprecision.org/presfut/GlosarioAgPrec.htm)
<Spanish>ordenación específica para cada lugar
<Source>Unasylva No. 177, FAO, 1994 (T2230)
<Spanish>agricultura de precisión
<Arabic>ÒÑÇÚÉ ãõÍúßãÉ
<Arabic Source>International Centre for Agricultural Research in the Dry Areas, ICARDA, FAO Terminology Project, 2003.
<Chinese>¾«×¼Å©Òµ
<Chinese>¾«È·Å©Òµ
<Italian>agricoltura di precisione
<Source>Istituto sperimentale per la meccanizzazione agricola (ISMA), Italia, 2005 (http://www.ingegneriaagraria.it/home/index.php?interno=ricerca.php&idricerca=19); GB, FAO, 2005.
<Italian>coltivazione in funzione della superficie
<Remarks>Traduzione.
<Source>GB, FAO, 2005.
**

In this case the Arabic and the Chinese entries are *not* visible in a text editor; I see strange characters, because the encoding of the text editor is wrong.

Well, now I have a terminology database stored in XML format and encoded in UTF-8. I have defined my XSLT transformation to import my XML records into Multiterm, but after the transformation the record that I see in a text editor, and in Multiterm too, is in this form:

**
<Entry Number>59877
<Status>Temporary
<Reliability>1
<Source Language>en
<Category>Terminology
<Subject>AGRICULTURE
<English>site-specific management
<Remarks>Short denomination.
<English>SSCM
<Form>Abbreviation
<French>agriculture de précision
<Remarks>Dénomination alternative.
<Source>^INRA^, France, 2005.
<French>aménagement spécifique à un site
<Remarks>Dénomination alternative.
<Source>^INRA^, France, 2005.
<Spanish>manejo sitio específico
<Source>Sitio Web de la Agricultura de precisión, 2005 (http://www.agriculturadeprecision.org/presfut/GlosarioAgPrec.htm)
<Spanish>ordenación específica para cada lugar
<Source>Unasylva No. 177, FAO, 1994 (T2230)
<Spanish>agricultura de precisión
<Arabic>????????????
<Arabic Source>International Centre for Agricultural Research in the Dry Areas, ICARDA, FAO Terminology Project, 2003.
<Chinese>????????
<Chinese>?????????
<Italian>agricoltura di precisione
<Source>Istituto sperimentale per la meccanizzazione agricola (ISMA), Italia, 2005 (http://www.ingegneriaagraria.it/home/index.php?interno=ricerca.php&idricerca=19); GB, FAO, 2005.
<Italian>coltivazione in funzione della superficie
<Remarks>Traduzione.
<Source>GB, FAO, 2005.
**

I know that the question marks are not *really* question marks; anyway this record is different from the first one, and the difference is noticed by Multiterm, which displays all the Arabic and Chinese entries as ??????.

Maybe now my problem is clearer. Thanks a lot, Giuseppe. |
From: Pierrick B. <pie...@cu...> - 2006-03-31 10:24:27
|
Hi,

Giuseppe Corrarello wrote:
> > Arabic *and* Chinese characters in one file..?? Puh, I don't know if
> > there is any text editor or text component which is able to display
> > something like that correctly... Am I right that Arabic is read from
> > right to left and Chinese from top to bottom...?
>
> yes, you are right but this is not my problem.

We're definitely getting off-topic....

> When I export a record from Multiterm this is the result:
> <Arabic>ÒÑÇÚÉ ãõÍúßãÉ
> <Arabic Source>International Centre for Agricultural Research in the Dry

This one is OK in CP-1256 (not in ISO-8859-6).

> Well, now I have a terminology database stored in XML format and encoded
> in UTF-8. I have defined my XSLT transformation to import my XML records
> into Multiterm, but after the transformation the record that I see in a
> text editor and in Multiterm too is in this form:
> <Arabic>????????????

No way.

> I know that the question marks are not *really* question marks

I'm afraid they are. For some reason, some software seems to have replaced "unknown" characters with question marks. A hexadecimal dump (for example) would tell.

Cheers,

--
Pierrick Brihaye, informaticien
Service régional de l'Inventaire / DRAC Bretagne
mailto:pie...@cu... / tél : +33 (0)2 99 29 67 78
Have you read http://usenet-fr.news.eu.org/fr-chartes/rfc1855.html ? |
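[Editorial note: Pierrick's hex-dump suggestion settles the question quickly: if the file really contains 0x3F bytes, the characters were replaced before the file was ever written. A possible Python sketch, with an invented sample line:]

```python
# Sketch: a hex dump distinguishes real '?' characters (byte 0x3F)
# from mojibake (which would show other, high byte values).
line = b'<Arabic>????'   # invented sample of the suspect output

dump = ' '.join(f'{b:02x}' for b in line)
# Literal question marks show up as 0x3F bytes in the dump:
assert dump.endswith('3f 3f 3f 3f')
```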
From: <sju...@ko...> - 2006-03-31 12:25:56
|
[To read this e-mail correctly, your e-mail client must be able to handle and display Unicode (UTF-8) correctly]

On 31 March 2006, at 13:16, Giuseppe Corrarello wrote:

> Hi Thomas,
>
> Arabic *and* Chinese characters in one file..?? Puh, I don't know if
> there is any text editor or text component which is able to display
> something like that correctly... Am I right that Arabic is read from
> right to left and Chinese from top to bottom...?

Not correctly in the strict sense, obeying all typographical features of the script, but the characters represented by the corresponding codepoints? That shouldn't be a problem for any application that can deal with Unicode. Seeing the characters in the character stream as such is not the same as displaying them correctly, I know, but it's still a lot better than a row of question marks :-) (the output below was given by a standard Mac OS X editor)

> <Arabic>ÒÑÇÚÉ ãõÍúßãÉ
> <Chinese>¾«×¼Å©Òµ

By taking these strings as Windows Latin 1 representations of the byte sequences making up the original string, I asked my editor to reinterpret the byte sequence as Arabic (Windows) and Chinese (GB 18030) respectively. The result is the following:

زراعة مُحْكمة
精准农业

I am not able to judge whether this is correct Arabic or Chinese, but it looks pretty close to me :-) (and what you see depends on your e-mail client - cross your fingers!)

(I don't know what Arabic (Windows) means - there's no further info in my editor - but you can probably give it an educated guess?)

That is: Multiterm is mixing encodings in the same (exported) document, and you need to convert the encoding of each element separately into UTF-8.

By the way: the data you quoted isn't XML, but you're probably aware of that.

Regards,
Sjur |
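[Editorial note: Sjur's manual reinterpretation can be scripted. The codepage names below are my reading of his description, taking "Arabic (Windows)" as cp1256 and the Chinese as GB 18030; a hedged Python sketch:]

```python
# Sketch of Sjur's reinterpretation: the mojibake strings are the
# original cp1256 / GB 18030 bytes misread as Windows Latin 1 (cp1252),
# so encoding them back to cp1252 and re-decoding with the right
# codepage recovers the original text.
def fix(mojibake: str, real_encoding: str) -> str:
    return mojibake.encode('cp1252').decode(real_encoding)

assert fix('ÒÑÇÚÉ', 'cp1256') == 'زراعة'        # Arabic "agriculture"
assert fix('¾«×¼Å©Òµ', 'gb18030') == '精准农业'  # "precision agriculture"
```

This matches Sjur's conclusion: each field would need its own conversion, since the exported file mixes at least two legacy encodings in one document.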