From: <ja...@av...> - 2004-04-09 05:12:10
|
I'm working on UTF-8 support. On a correctly set up UTF-8 enabled Linux box (running JOE in a UTF-8 xterm, for example), do you get UTF-8 encoded characters from the keyboard (I hope...)? Or do you get the old Latin-1 characters in the 160 - 255 range as single characters, which I then have to translate to UTF-8 (yuck)?

If you type 'cat' and start entering characters from the local character set, do they echo back to you on the screen in a UTF-8 xterm? How about the Linux console screen? I'm hoping it's all UTF-8: then cut and paste will even work. Otherwise it's a big mess.

Sorry about this basic question: everything here is English and I have no experience running Linux in a different language. [I'll probably check in the first go at UTF-8 level 1 support tomorrow.] |
From: Koblinger E. <eg...@uh...> - 2004-04-09 05:59:18
|
Hi,

> I'm working on UTF-8 support.

Great! Nice to hear it!

> On a correctly set up UTF-8 enabled linux box (running JOE in a UTF-8
> xterm for example), do you get UTF-8 encoded characters from the
> keyboard (I hope...)?

Yes.

> If you type 'cat' and start entering characters from the local
> character set, do they echo back to you on the screen in a UTF-8
> xterm? How about the linux console screen?

There's no concept of a "local character set" in the UTF-8 world. You can type anything (kanji, Hebrew text(*), etc.) and it is echoed back exactly as it is, so you can see everything is okay.

(*) = I have no clue how right-to-left is handled.

I have no experience with the Linux console and xterm; I use gnome-terminal from Gnome 2.4. There it is just one mouse click to switch the terminal's behaviour from locale-specific 8-bit to UTF-8 and back, even at runtime. Great feature, I often use it.

> I'm hoping it's all UTF-8: then cut and paste will even work.
> Otherwise it's a big mess.

In gnome-terminal it clearly works in UTF-8, not only inside the terminal, but between all UTF-8 aware applications (e.g. all of Gnome and KDE).

> Sorry about this basic question: everything here is english and I have
> no experience running linux in a different language.

I use my Linux in Hungarian, which is not even Latin-1 but Latin-2, so I've seen lots of funny things already :-)) Feel free to ask; I'll try to answer as best I can, since I'd _really_ need UTF-8 support in joe.

Please note: a default glibc installation doesn't support UTF-8 locales, e.g. it has hu_HU but doesn't have hu_HU.UTF-8. Hence some distros don't ship these, while others do. Some UTF-8 aware applications don't need them, while others do. E.g. if you're planning automatic 8-bit/UTF-8 support in joe depending on the ctype category of the locale, you will most likely need them.

Note #2: many GNU utilities support UTF-8 if your glibc does, e.g. I was surprised just a few days ago that "wc -L" (the length of the longest line in the file) counts UTF-8 byte sequences as one column if you run it with a UTF-8 locale. Nice ;)

-- Egmont |
From: Koblinger E. <eg...@uh...> - 2004-04-09 10:05:05
|
Hi,

Just a few words about UTF-8, lest you do the wrong thing (which is easy for those who are new to Unicode :-)))

First of all, this is a very good article explaining how Unicode is not just one more character set, but a completely new way of thinking: let's not think in bytes, but in human-readable letters, characters, symbols, etc.:

http://joelonsoftware.com/articles/Unicode.html

It's a must-read for everyone coding anything related to charsets or Unicode :-))

In the old way of thinking, and in current joe, letters = bytes; that is, you read a byte from the terminal and put exactly that byte into the file, or read a byte from a file and put that one onto the screen. No conversion is done. The strange thing is that if you plan proper Unicode support, you won't need just one conversion: you'll need two conversions between a keypress arriving in the file and the content of the file appearing on the screen.

Consider a graphical text editor, e.g. gedit, kwrite, kate, gvim, etc. When you save a file you have the choice of an encoding (Latin-1, UTF-8, etc.). This is clear. Also, when you open a previously saved file, you can choose too. Here the program may help you in the decision based on heuristics (some character-frequency checks) or e.g. the fact that not all files are valid UTF-8. So a graphical text editor has to convert in one place: from the file charset to Unicode (UCS-4, UTF-8, god knows which one the internal representation of the file is), and the inverse of this conversion when saving the file.

In a terminal text editor there's one more place where a conversion has to happen. This one corresponds to the terminal's behaviour: whether it sends you Latin-1 or Latin-2 or UTF-8 characters, and whether it expects Latin-1 or Latin-2 or UTF-8, etc. for displaying. I don't know if this property can be asked of the terminal itself (I don't think so), but this is exactly what the ctype locale category is for. Every user/distribution/etc. is expected to set ctype (the LC_ALL, LC_CTYPE, LANG env variables, in this order) corresponding to the old-fashioned 8-bit behaviour of applications. For example, running an 8-bit terminal with LC_CTYPE=hu_HU is okay. Running a UTF-8 terminal with LC_CTYPE=hu_HU.UTF-8 is also okay. Mixing these (hu_HU and a UTF-8 terminal, or hu_HU.UTF-8 and a Latin-2 terminal) is a misconfiguration of the system, where applications are allowed to misbehave. That is, any application can assume that setlocale() followed by nl_langinfo(CODESET) returns the encoding (for both input and output) that the terminal uses.

+-------------+        +--------------------------+        +----------+
| file stored | -----> | internal representation  | -----> |          |
| on the disk | <----- | of the file in joe       | <----- | terminal |
|             |        | most likely ucs4 or utf8 |        |          |
+-------------+        +--------------------------+        +----------+

All the arrows here show a place where a character set conversion might be necessary.

To explain this whole thing in another way:

I may want to edit a Latin-2 encoded file in a Latin-2 terminal.
I may want to edit a UTF-8 encoded file in a Latin-2 terminal. (Some characters might not show up correctly, but those that are part of Latin-2 must be shown properly.)
I may want to edit a Latin-2 encoded file in a UTF-8 terminal.
I may want to edit a UTF-8 encoded file in a UTF-8 terminal.

In all these cases the application has to do its best to display all the possible characters correctly.

bye,
Egmont

Ps. Based on your email address, don't you happen to be the original author of joe? |
From: <ja...@av...> - 2004-04-09 19:15:31
|
Thanks for this. I just checked in the first version of UTF-8 support for JOE. Display works; input is broken (you get control characters displayed until the full UTF-8 character is entered, and then you have to hit refresh screen). Likely there are many bugs.

Joe keeps the files in UTF-8 format internally. When UTF-8 mode is enabled, it will emit UTF-8 characters and keep track of character widths properly. Basically JOE can now do this:

> I may want to edit a Latin-2 encoded file in a Latin-2 terminal.

-asis mode

> I may want to edit a UTF-8 encoded file in an UTF-8 terminal.

-utf8 mode

But it cannot yet do these, though it probably should:

> I may want to edit a UTF-8 encoded file in a Latin-2 terminal. (Some
> characters might not show up correctly, but those that are part of
> Latin-2 must be shown properly.)
> I may want to edit a Latin-2 encoded file in an UTF-8 terminal.

I need to research conversion between byte-wide and UTF-8. Also, JOE cannot do this:

> when you save a file you have the choice to choose an encoding
> (latin-1, utf-8 etc.).

It seems like there must be a UNIX command line utility which does this?

> bye,
> Egmont
> Ps. Based on your email address, don't you happen to be the original
> author of joe?

Yes I am. |
From: Preston A. E. <pr...@ne...> - 2004-04-09 20:14:22
|
On Fri, 2004-04-09 at 15:16, ja...@av... wrote:

> It seems like there must be a UNIX command line utility which does this?

There is: it's called 'recode', and it translates between quite a number of encodings.

http://www.gnu.org/software/recode/

--
PreZ
Founder
The Neuromancy Society
http://www.neuromancy.net |
From: Koblinger E. <eg...@uh...> - 2004-04-09 21:33:14
|
On Fri, 9 Apr 2004 ja...@av... wrote:

> It seems like there must be a UNIX command line utility which does this?

Utilities: iconv and recode are both available. iconv is part of libc and is a simple wrapper around the iconv() library call; recode is a more complex, more feature-rich, but actually less standard application. Recode 3.6 is a little bit buggy; it's recommended to apply this patch, otherwise you might get false output:

http://cvs.mandrakesoft.com/cgi-bin/cvsweb.cgi/SPECS/recode/recode-3.6-various-fix.patch

In joe, you'll probably want the libc interface instead of external applications, so see the man pages of iconv(), iconv_open() and iconv_close(). The libc info pages provide even more details. iconv() is a little bit overcomplicated (a do-everything-in-one-function call), but I managed to understand the man page after reading it for about the tenth time :-))

Also, glib provides nice conversion functions with automatic memory management and stuff like that which make life much easier, but they add another runtime dependency, which isn't a nice thing for a small&fast&ultimate text editor.

> > Ps. Based on your email address, don't you happen to be the original
> > author of joe?
>
> Yes I am.

Oh, it's nice to hear from you... you seemed to disappear for the last several years... but that's your business, not mine, so whatever you did, I'm really glad that you're back and working on joe :-))

-- Egmont |