From: Koblinger E. <eg...@cs...> - 2004-04-20 12:19:18
Hi,

(Sorry, I haven't yet tried the utf8 version from cvs; I'm only thinking about how I imagine utf8 support...)

Previously I wrote sentences that sounded like "unicode changed the way of thinking: let's not think in bytes but rather in human-readable letters", "I can't see the reason for separating utf8 mode from non-utf8 mode", "non-utf8 is not a charset" and stuff like this... Now that I've thought more about it, I have to revise my opinion: it seems to me that Allen was right about separating utf8 and nonutf8 mode.

There are two independent stories: the way the terminal behaves (utf8 or classical) and the way the files should be interpreted.

The terminal is the easier story. Actually I believe that the world will shift to utf8 real soon. Some Linux distributions have already switched their terminal to utf8. It'll take 2-3 years, maybe 5, till all the modern systems have a utf8 terminal. For text files the change will probably be much slower; imho we'll keep on seeing plenty of non-utf8 text files even 20 years from now.

And the other side of the story, which I simply forgot about, is that joe is not only a wonderful text editor, it's also a wonderful binary editor. And of course binary files will never be valid utf8 and should not be interpreted as text files.

So now I believe that joe should achieve two goals: be a good text editor (knowing about old-fashioned charsets and utf8) and at the same time a good binary editor.

When editing utf8 text files, I expect ^K space to report the code of the multibyte sequence the cursor stands over (either in a U+0151-like form or as the UTF-8 encoding, i.e. a sequence of decimal or hexadecimal numbers). In overtype mode, if I stand on an "a" and type "á" to replace it, the size of the file changes, but who cares...

When editing binary files, I expect every byte to be shown as a glyph, ^K space should also show one byte, in overtype mode I shouldn't be able to change the size of the file, etc... exactly how old joe works.
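A minimal sketch of the two ^K space behaviours described above, written in Python purely for illustration (joe itself is written in C). The function name `describe_at` and the exact output format are assumptions, not anything taken from joe's source.

```python
def describe_at(buf: bytes, offset: int, utf8_mode: bool) -> str:
    """Report what sits at byte `offset`, roughly as ^K space might."""
    if not utf8_mode:
        # nonutf8 (binary) mode: always report exactly one byte.
        return f"byte 0x{buf[offset]:02X} ({buf[offset]})"

    # utf8 mode: step back over continuation bytes (10xxxxxx) to the
    # lead byte. A valid sequence has at most 3 continuation bytes.
    start = offset
    while start > 0 and offset - start < 3 and (buf[start] & 0xC0) == 0x80:
        start -= 1

    # The lead byte encodes the length of the sequence.
    lead = buf[start]
    if lead < 0x80:
        length = 1
    elif lead >> 5 == 0b110:
        length = 2
    elif lead >> 4 == 0b1110:
        length = 3
    elif lead >> 3 == 0b11110:
        length = 4
    else:
        length = 1  # stray continuation byte or invalid lead byte

    seq = buf[start:start + length]
    try:
        ch = seq.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8 here: fall back to reporting the raw byte.
        return f"invalid UTF-8, byte 0x{buf[offset]:02X}"

    hex_bytes = " ".join(f"{b:02X}" for b in seq)
    return f"U+{ord(ch):04X} (UTF-8: {hex_bytes})"
```

For the buffer `"aő".encode("utf-8")`, asking about offset 1 or offset 2 in utf8 mode reports the same character, U+0151 with bytes C5 91, while nonutf8 mode reports each byte separately. The overtype remark follows from the same encoding: "a" is one byte in UTF-8 and "á" is two, so replacing one with the other grows the file by a byte.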
We can call these two modes utf8 and nonutf8 mode, or text and binary mode. Let's stick to utf8 and nonutf8. The four basic possibilities:

- nonutf8 mode over a nonutf8 terminal. This is how joe works now. This behavior should remain reachable.

- nonutf8 mode over a utf8 terminal. All that has to be done is to emulate an old-fashioned terminal in a utf8 one. IMHO it's not a big job. Every time a character is to be drawn, it's iconv()ed to its multibyte sequence and the multiple bytes are written to stdout, while expecting the cursor to have moved only one character. A similar reverse conversion is done for the input. If the received utf8 character doesn't fit the old-fashioned locale, telling so in the bottom line would be nice (just as when you try to type into a read-only file). The local charset to use could default to the one determined by Allen's hack (remove .UTF-8 from the locale name and find its charset), but it would be nice if it were selectable from the command line and even changeable at run time, on a per-buffer basis.

  A use-case scenario for this: I have LANG=hu_HU.UTF-8. I try to edit a French latin-1 text. When I open it, joe assumes it is in latin-2 and hence puts the wrong accents on top of some characters. I tell joe that it's latin-1 instead of latin-2. Joe now refreshes the screen, and this time the French accents are shown correctly. If I try to type a Hungarian accented letter, these are the possibilities I can imagine:

  - joe complains that the input doesn't fit latin-1 and the file remains the same. When I save it, it is saved in latin-1 encoding.
  - joe offers to convert the buffer to utf8 mode. If I accept, a proper conversion is done, joe keeps on working in utf8 mode, and at the end the file will be saved as utf8.

- utf8 mode over a utf8 terminal. There's not much to explain about it... As stated above, I expect that ^K space reports the UCS or UTF-8 code of the character I'm standing on. It's a good question whether the file offset or the number of characters should be shown.
- utf8 mode over a nonutf8 terminal. We have to emulate a utf8 terminal over a nonutf8 one. Actually I don't mind if this possibility is not implemented at all. Here joe should try to display utf8 files in a nonutf8 terminal, which implies that lots of letters are displayed as question marks or other symbols. IMHO that's not really suitable for any serious work. And as I stated above, IMHO terminals will quickly turn to utf8 everywhere, so if this mode is not implemented, it won't be a problem after some years. Anyway, it's not that hard to implement once utf8 mode over a utf8 terminal is done: input characters are converted from 8bit to utf8, and output characters are converted from utf8 to the locale-specific charset or, if that's not possible, a special symbol (e.g. a question mark in a bright color) is shown.

To summarize: previously I was wrong when I thought that a UTF-8 mode could emulate everything, since UTF-8 is all about human-readable text, but there are not only text files in the world, there are binary files as well, where the old way of thinking in bytes (rather than in glyphs) has to be preserved.

bye,
Egmont
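
The charset conversions described for the two mixed cases above (buffer charset differing from terminal charset) can be sketched roughly as follows. This is an illustration in Python rather than joe's C with iconv(); the helper names `reinterpret`, `to_terminal` and `accept_input` are hypothetical, and the plain "?" placeholder stands in for whatever highlighted symbol an editor would really draw.

```python
def reinterpret(raw: bytes, charset: str) -> str:
    """Decode the same buffer bytes under a different per-buffer charset."""
    return raw.decode(charset)


def to_terminal(text: str, term_charset: str, placeholder: str = "?") -> bytes:
    """Convert characters for output; unrepresentable ones become a placeholder."""
    out = []
    for ch in text:
        try:
            out.append(ch.encode(term_charset))
        except UnicodeEncodeError:
            # The terminal charset has no such character.
            out.append(placeholder.encode("ascii"))
    return b"".join(out)


def accept_input(ch: str, buffer_charset: str):
    """Return the bytes to store in the buffer, or None if `ch` does not
    fit the buffer's charset (the editor could then complain on the
    bottom line or offer to convert the buffer to utf8)."""
    try:
        return ch.encode(buffer_charset)
    except UnicodeEncodeError:
        return None
```

In the scenario from the post, the byte 0xEA in a French latin-1 file decodes as "ę" under iso-8859-2 but as "ê" under iso-8859-1, which is the "wrong accents" effect. Typing the Hungarian "ő" into a latin-1 buffer makes `accept_input` return None, since latin-1 has no such letter, whereas a latin-2 buffer stores it as the single byte 0xF5.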