From: klaus e. w. <ke...@po...> - 2003-04-25 09:05:38
|
Dear eXist folks,

I have been working with eXist for about a week and like it very much. I found, by chance, that UTF-8 encoded files load nicely, but that the output of the standard (default) web interface, and also of the XML:DB GUI, ignores the Unicode, changing a LATIN SMALL LETTER SHARP S "ß" (U+00DF, the German sz) into the ignominious "Ã?" ...

I'm talking about queries both using the index and not: //record[lcn[contains(., 'ß')]] and //record[lcn[contains(., '*ß*')]].

Does anybody know whether this is the fault of some bad configuration on my part (maybe one of the stylesheets that return the XML as visible HTML) or an unresolved issue? I did not find any information on the exist-db.org website (or I was too stupid to detect it ...).

Thanks for your response,

--
klaus e. werner <ke...@po...>

"Without question, the introduction of the computer into our already highly technological society has [ ] merely reinforced and extended the earlier pressures that have driven man to an ever more rationalistic view of his society and an ever more mechanistic image of himself." Joseph Weizenbaum, Computer Power and Human Reason (1976) |
From: Wolfgang M. <me...@if...> - 2003-04-28 12:06:59
|
Hi,

I frequently had difficulties with character encodings when using eXist and Cocoon. I still do not completely understand how Cocoon determines the final character encoding it uses for output. However, to display German characters correctly on my Linux box, I sometimes had to set LANG=de in bin/startup.sh (in particular with Tomcat) or set the file.encoding system property explicitly (see my previous message).

Concerning the XML:DB GUI, I have not yet checked it with German encodings, but I will try...

Wolfgang |
From: klaus e. w. <ke...@po...> - 2003-04-28 15:10:53
|
Dear Wolfgang,

many thanks for your tip! As for Cocoon and Tomcat: although I am a newbie, I got them running quite well with full Unicode encoding (we have a lot of texts with transcriptions of Arabic titles, Slavic books etc.). Basing every XSLT on UTF-8 encoding did in fact help. That is on a Windows box, unfortunately.

Many thanks for the great program!

--
klaus e. werner <ke...@po...> |
From: Michael B. <mbn...@mb...> - 2003-04-28 12:24:31
|
> Now I found, by chance, that UTF-8 encoded files load nicely, but that the
> response on the web (standard default) web interface and also on the
> XML:DB GUI ignores the Unicode, changing a LATIN SMALL LETTER SHARP S "ß"
> (U+000DF aka German-SZ) into the ignominious "Ã?" ...

In this instance the Unicode isn't being "ignored" and nothing's being "changed". You see this whenever correctly encoded UTF-8 is rendered by a browser or editor that's expecting ISO-8859-n. So the data is OK. You need to tell your browser that the encoding is UTF-8 and the ß will appear.

This has nothing to do with eXist or its associated packages, and is quite different from cases (referred to by Wolfgang) where either Java's output libraries or Cocoon's serializer decide for one reason or another to transcode UTF-8 to something else.

Michael Beddow |
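The effect described here is easy to reproduce outside eXist. The following is a minimal sketch in Python (chosen only for illustration; nothing in it is eXist, Cocoon or browser code): it encodes "ß" as UTF-8 and then reads the two resulting bytes as ISO-8859-1. The second byte becomes the control character U+009F, which has no visible glyph; that it is then displayed as "?" is an assumption about the viewer.

```python
# Encode "ß" (U+00DF) as UTF-8: this yields two bytes, 0xC3 0x9F.
text = "ß"
utf8_bytes = text.encode("utf-8")

# A viewer that expects ISO-8859-1 maps each byte to one character:
# 0xC3 -> "Ã" (U+00C3), 0x9F -> U+009F (a C1 control character, no glyph).
misread = utf8_bytes.decode("iso-8859-1")

# The data itself is intact: reversing the wrong decoding step and
# decoding as UTF-8 recovers the original character.
repaired = misread.encode("iso-8859-1").decode("utf-8")

print(utf8_bytes)      # raw UTF-8 bytes
print(repr(misread))   # what the mis-configured viewer "sees"
print(repaired)        # the original text
```

Because the bytes are untouched, the fix in this case belongs on the reading side (declare or select UTF-8), not in the stored documents.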
From: klaus e. w. <ke...@po...> - 2003-04-28 14:28:46
|
Dear Michael,

many thanks, and I understand well what you're saying. But the error (at least this Monday morning) seems to be something else. Could you try to run a query for "ß" (German double s, ALT+0223 on Windows systems) on Wolfgang's library test page?

http://130.83.186.203:8080/exist/library/bibquery.xml?field1=any&term1=%C3%9F&mode1=contains&operator=and&field2=any&term2=&mode2=contains&howmany=15

I get the following result:

Query: document(*)//rdf:Description[.&='Ã?' ].
Found 1 hits in 0 ms.
Transzendenz von e und ã
by Hessenberg, Gerhard

which is definitely wrong. Is it possible that higher encodings get garbled on system *input*?

Sorry if I take up your time, folks,

--
klaus e. werner <ke...@po...> |
From: Michael B. <mbn...@mb...> - 2003-04-28 15:25:45
|
> Could you try to make a query for "ß"
> (german double s, ALT+0223 on Win systems)
> on Wolfgang's Lbrary test page?
>
> http://130.83.186.203:8080/exist/library/bibquery.xml?field1=any&term1=%C3%9F&mode1=contains&operator=and&field2=any&term2=&mode2=contains&howmany=15
>
> I get the following result:
>
> Query: document(*)//rdf:Description[.&='Ã?' ].
> Found 1 hits in 0 ms.
> Transzendenz von e und ã
> by Hessenberg, Gerhard
>
> which is definitely wrong.

It sure is. Something is obviously broken in that particular wrap-up of the eXist core. But I can assure you that eXist itself behaves very well with pure UTF-8. Such problems as there are come when people insist on keeping their documents in other encodings, and even there I suspect Wolfgang has now squashed all the eXist-specific bugs that were around that area.

Michael Beddow |
From: klaus e. w. <ke...@po...> - 2003-04-29 08:22:24
Attachments:
Clipboard.jpg
|
Dear Michael, dear Wolfgang,

thank you for trying to understand me. I'm sure we'll find out the reason for this particular behaviour. You're right about the eXist core, which supports UTF-8 very well: I patched the startup.bat of the XML:DB GUI with the encoding parameters Wolfgang gave me, and the small editor window shows Unicode encodings perfectly.

Best wishes,

--
klaus e. werner <ke...@po...> |
From: Wolfgang M. <me...@if...> - 2003-04-28 15:40:53
|
I just tried your query and you're right: somehow, the url-encoded request parameter sent by the form is decoded using the wrong character encoding, so the query engine doesn't get the original query string.

I finally found that the file.encoding property in bin/startup.sh is not correctly set on my server. If I change it to -Dfile.encoding=ISO-8859-1 (line 43 in startup.sh) and restart the server, queries containing sz suddenly work:

http://130.83.186.203:8080/exist/library/bibquery.xml?field1=any&term1=au%DFen*&mode1=contains&operator=and&field2=any&term2=&mode2=contains&howmany=15

Obviously, the servlet engine expects request parameters to be encoded with the system's default encoding.

Wolfgang |
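What happens to a url-encoded parameter when the decoder assumes a particular charset can be sketched in a few lines. This is an illustration in Python using the standard `urllib.parse.unquote`, not the servlet engine's actual code; the variable names are made up for the example, and the two parameter values are the ones that appear in the URLs in this thread.

```python
from urllib.parse import unquote

# Parameter values as they appear in the two URLs from this thread.
utf8_param = "%C3%9F"      # "ß" percent-encoded from its UTF-8 bytes
latin1_param = "au%DFen*"  # "außen*" percent-encoded from ISO-8859-1

# A decoder that assumes ISO-8859-1 handles the second form correctly...
ok = unquote(latin1_param, encoding="iso-8859-1")

# ...but turns the UTF-8 form into two wrong characters ("Ã" + U+009F),
# so the query engine never sees the original "ß".
garbled = unquote(utf8_param, encoding="iso-8859-1")

# The same bytes decoded with the matching charset give the intended term.
intended = unquote(utf8_param, encoding="utf-8")

print(ok)
print(repr(garbled))
print(intended)
```

The percent-escapes only carry bytes; which characters those bytes stand for depends entirely on the charset the receiving side assumes.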
From: Michael B. <mbn...@mb...> - 2003-04-28 15:58:16
|
> If I change it to -Dfile.encoding=ISO-8859-1
> (line 43 in startup.sh) and restart the server, queries containing sz
> suddenly work:

http://130.83.186.203:8080/exist/library/bibquery.xml?field1=any&term1=au%DFen*&mode1=contains&operator=and&field2=any&term2=&mode2=contains&howmany=15

Hmm, but in that example the url-encoded query term is itself in ISO-8859-1 (%DF). In the example Klaus posted, the url-encoded value was UTF-8 (%C3%9F).

If I repost Klaus's query to your restarted server, I get the same garbage out, whereas your query in ISO-8859-1 does indeed return the correct value (in UTF-8).

Aren't encoding problems fun??

Michael Beddow |
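The difference between the two URLs comes down to which bytes the client percent-encodes. A small Python sketch of this follows. The helper `decode_param` is hypothetical and only illustrates one possible heuristic (try strict UTF-8 first, fall back to ISO-8859-1); it is an assumption for the example, not what the servlet engine, Cocoon or eXist does.

```python
from urllib.parse import quote, unquote_to_bytes

# The same character yields different escapes depending on the charset
# the client used before percent-encoding.
as_utf8 = quote("ß", encoding="utf-8")         # two bytes -> two escapes
as_latin1 = quote("ß", encoding="iso-8859-1")  # one byte  -> one escape


def decode_param(raw: str) -> str:
    """Hypothetical helper: decode a url-encoded value of unknown charset.

    Strict UTF-8 is tried first because bytes produced by ISO-8859-1 text
    with non-ASCII characters are rarely valid UTF-8; if that fails, the
    bytes are read as ISO-8859-1, which accepts every byte value.
    """
    data = unquote_to_bytes(raw)
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode("iso-8859-1")


print(as_utf8, as_latin1)
print(decode_param("%C3%9F"))    # value from the first query in the thread
print(decode_param("au%DFen*"))  # value from the second query in the thread
```

Such a fallback is only a heuristic: a short ISO-8859-1 sequence can happen to be valid UTF-8, so agreeing on one charset for both form and server remains the more robust choice.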
From: jon w. <jo...@sh...> - 2003-04-30 00:35:20
|
i've been intermittently getting Cocoon errors like this:

Exception in ServerPagesGenerator.generate()

More precisely:

org.apache.cocoon.ProcessingException: Exception in ServerPagesGenerator.generate(): java.lang.RuntimeException: org.exist.storage.NativeBroker@1609c13: tid 0 not found on page: 28; file = dom.dbx; address = 1d000; page header = 64; data start = 1d040. Loading -1

and then the expected output follows (also intermittent) ???

:j |
From: jon w. <jo...@sh...> - 2003-04-30 00:44:53
|
eXist doesn't have a problem with long filenames or anything, right? :j |