From: Koblinger E. <eg...@cs...> - 2004-04-20 12:19:18

Hi,

(sorry, I haven't yet tried the utf8 version from cvs, I'm only thinking about how I imagine utf8 support...)

Previously I wrote sentences like "unicode changed the way of thinking: let's not think in bytes but rather in human-readable letters", "I can't see the reason for separating utf8 mode from non-utf8 mode", "non-utf8 is not a charset" and so on... Now that I've thought more about it, I have to revise my opinion: it seems to me that Allen was right about separating utf8 and nonutf8 mode.

There are two independent stories: the way the terminal behaves (utf8 or classical) and the way the files should be interpreted.

The terminal is the easier story. I believe that the world will shift to utf8 real soon. Some Linux distributions have already switched their terminal to utf8. It'll take 2-3 years, maybe 5, till all modern systems have a utf8 terminal. For text files the change will probably be much slower; imho we'll keep on seeing plenty of non-utf8 text files even 20 years from now.

The other side of the story, which I simply forgot about, is that joe is not only a wonderful text editor, it's also a wonderful binary editor. And of course binary files will never be valid utf8 and should not be interpreted as text files.

So now I believe that joe should achieve two goals: be a good text editor (knowing about old-fashioned charsets and utf8) and at the same time a good binary editor.

When editing utf8 text files, I expect ^K space to report the code of the multibyte sequence the cursor stands on (either in the U+0151 style or as the UTF-8 encoding, a sequence of decimal or hexadecimal numbers). When in overtype mode, if I stand on an "a" and type "á" to replace it, the size of the file changes, but who cares...

When editing binary files, I expect every byte to be shown as a glyph, ^K space should also show one byte, in overtype mode I shouldn't be able to change the size of the file etc... exactly how old joe works.

We can call these two modes utf8 and nonutf8 mode, or text and binary mode. Let's stick to utf8 and nonutf8.

The four basic possibilities:

- nonutf8 mode over a nonutf8 terminal. This is how joe works now. This behavior should remain reachable.

- nonutf8 mode over a utf8 terminal. All that has to be done is to emulate an old-fashioned terminal in a utf8 one. IMHO it's not a big job. Every time a character is to be drawn, it's iconv()ed to its multibyte sequence and the multiple bytes are written to stdout, while expecting the cursor to have moved only one character. A similar reverse conversion applies to the input. If the received utf8 character doesn't fit the old-fashioned locale, saying so in the bottom line would be nice (just as when you try to type into a read-only file). The local charset could default to the one determined by Allen's hack (remove .UTF-8 from the locale name and find its charset), but it would be nice if it were selectable from the command line and even changeable at run time, on a per-buffer basis.

  A use-case scenario for this: I have LANG=hu_HU.UTF-8. I try to edit a French latin-1 text. When I open it, joe assumes it is latin-2 and hence puts the wrong accents on some characters. I tell joe that it's latin-1 instead of latin-2. Joe now refreshes the screen, and this time the French accents are shown correctly. If I try to type a Hungarian accented letter, these are the possibilities I can imagine:

  - joe complains that the input doesn't fit latin-1 and the file remains the same. When I save it, it is saved in latin-1 encoding.
  - joe offers to convert the buffer to utf8 mode. If I accept, a proper conversion is done, joe keeps on working in utf8 mode, and at the end the file is saved as utf8.

- utf8 mode over a utf8 terminal. There's not much to explain about it... As stated above, I expect ^K space to report the UCS or UTF-8 code of the character I'm standing on. It's a good question whether the file offset or the number of characters should be shown.

- utf8 mode over a nonutf8 terminal. We have to emulate a utf8 terminal over a nonutf8 one. Actually I don't mind if this possibility is not implemented at all. Here joe should try to display utf8 files in a nonutf8 terminal, which implies that lots of letters are displayed as question marks or other symbols. IMHO not really suitable for any serious work. And as I stated above, IMHO terminals will quickly turn to utf8 everywhere, so if this mode is not implemented, it won't be a problem after some years. Anyway, it's not that hard to implement once utf8 mode over a utf8 terminal is done: input characters are converted from 8bit to utf8, output characters are converted from utf8 to the locale-specific charset, or if that's not possible, a special symbol (e.g. a question mark in a bright color) is shown.

To summarize: previously I was wrong when I thought that a UTF-8 mode can emulate everything. UTF-8 is all about human-readable text, but there are not only text files in the world; there are binary files as well, where the old way of thinking in bytes (rather than in glyphs) has to be preserved.

bye,

Egmont

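The latin-1/latin-2 scenario and the two emulation directions described in this message can be sketched roughly as follows. This is only an illustrative Python sketch, not JOE's code: Python's built-in codecs stand in for iconv(), and the function names `draw_byte`, `key_to_byte` and `draw_char_on_legacy` are made up for the example.

```python
def draw_byte(byte, charset):
    """nonutf8 buffer on a utf8 terminal: one buffer byte -> UTF-8 bytes to write."""
    return bytes([byte]).decode(charset).encode("utf-8")


def key_to_byte(typed, charset):
    """Reverse direction: a character typed on the utf8 terminal -> one buffer byte.

    Returns None when the character does not exist in the buffer's charset,
    which is the point where an editor could complain in its status line.
    """
    try:
        return typed.encode(charset)[0]
    except UnicodeEncodeError:
        return None


def draw_char_on_legacy(ch, charset):
    """utf8 buffer on a nonutf8 terminal: unrepresentable characters become '?'."""
    return ch.encode(charset, errors="replace")


# The same buffer byte is a different letter depending on the assumed charset.
for cs in ("iso8859_2", "iso8859_1"):
    ch = draw_byte(0xF5, cs).decode("utf-8")
    print(cs, "U+%04X" % ord(ch))
```

Byte 0xF5 reads as U+0151 under latin-2 and as U+00F5 under latin-1, which is the "wrong accents" effect; typing a character that has no latin-1 byte makes `key_to_byte` return None, leaving it to the editor to either refuse the input or offer a conversion to utf8 mode.
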
From: <ja...@av...> - 2004-04-20 15:05:57

Koblinger Egmont <eg...@cs...> wrote:

>Hi,
>(sorry, I haven't yet tried the utf8 version from cvs, only thinking about
>how I imagine utf8 support...)

Try it! :-)

>When editing utf8 text files, I expect ^K space to report the
>code of the multibyte sequence the cursor stands over (either in U+0151
>like way or in the UTF-8 code as a sequence of some decimal or hexadecimal
>numbers). When in overtype mode, if I stand on an "a" and type "á" to
>replace it, the size of the file changes, but who cares...

It does this. Also you can hit ` x fe00 <return> to enter an ISO-10646 character.

Which leads me to a question: do people use "digraphs" to enter characters (for example, in VIM you can type ^K Co to get the copyright symbol), or are all of the characters on the keyboard? Basically, does JOE need to support digraphs?

>When editing binary files, I expect every byte to be shown as a glyph,
>^K space should also show one byte, in overtype mode I shouldn't be able
>to change the size of the file etc... exactly how old joe works.

The one problem I'm working on is malformed UTF-8 sequences in UTF-8 files.

>The four basic possibilities:
> - nonutf8 mode over a nonutf8 terminal. This is how joe works now. This
> behavior should remain reachable.

Yes.

> - nonutf8 mode over an utf8 terminal.

Yes, but only the "hack"-determined encoding is used. I still need to add a prompt to set the encoding.

> - utf8 mode over an utf8 terminal.

Yes.

> - utf8 mode over a nonutf8 terminal. We have to emulate an utf8 terminal
> over a nonutf8 one.

I use iconv for this. It works OK if you have a latin-1 terminal and a UTF-8 file which only has latin-1 characters in it. Basically it does whatever iconv() does.

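As an illustration of the "^K space" report and the hex entry mentioned in this message, here is a small Python sketch. It is not how JOE implements either feature; `describe_at` and `parse_hex_entry` are hypothetical names, and the report format is just one plausible choice.

```python
def describe_at(buf, offset):
    """Report the character that starts at byte `offset` of a UTF-8 buffer."""
    lead = buf[offset]
    if lead < 0x80:
        length = 1
    elif lead >> 5 == 0b110:
        length = 2
    elif lead >> 4 == 0b1110:
        length = 3
    elif lead >> 3 == 0b11110:
        length = 4
    else:
        return "byte 0x%02X (not the start of a UTF-8 sequence)" % lead
    seq = buf[offset:offset + length]
    try:
        ch = seq.decode("utf-8")  # strict decoding rejects truncated/overlong forms
    except UnicodeDecodeError:
        return "byte 0x%02X (malformed sequence)" % lead
    hexbytes = " ".join("%02X" % b for b in seq)
    return "U+%04X (UTF-8: %s)" % (ord(ch), hexbytes)


def parse_hex_entry(text):
    """Turn typed hex digits such as 'fe00' into the character with that code point."""
    return chr(int(text, 16))


print(describe_at("\u0151".encode("utf-8"), 0))
```

For the U+0151 example from the thread this reports both forms at once, the code point and the two-byte UTF-8 sequence C5 91.
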
From: Koblinger E. <eg...@cs...> - 2004-04-20 15:22:34

On Tue, 20 Apr 2004 ja...@av... wrote:

> >(sorry, I haven't yet tried the utf8 version from cvs, only thinking about
> >how I imagine utf8 support...)
>
> Try it! :-)

Yes, this seems to be the easiest way :-))) As soon as I have some time I'll try it.

> The one problem I'm working on is malformed UTF-8 sequences in UTF-8 files.

Nice problem... I found this site just a few hours ago:

http://www.cl.cam.ac.uk/~mgk25/unicode.html

and on it the "UTF-8 decoder stress test file" link, which points to a really nice invalid UTF-8 file... I immediately sent a bug report to the opera folks, as opera doesn't really seem to be able to cope with that file. ;)

bye,

Egmont

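One possible way to cope with malformed sequences, sketched in Python: walk the buffer, keep every well-formed character, and fall back to showing a stray byte on its own whenever strict decoding fails. This is an assumption about a reasonable policy, not a description of what JOE ended up doing; `utf8_len` and `scan` are hypothetical names.

```python
def utf8_len(lead):
    """Expected sequence length for a lead byte, or 0 if it cannot start one."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0


def scan(buf):
    """Split a buffer into ('char', str) and ('byte', int) items without losing data."""
    items = []
    i = 0
    while i < len(buf):
        n = utf8_len(buf[i])
        chunk = buf[i:i + n]
        try:
            if n == 0 or len(chunk) < n:
                raise ValueError("stray or truncated sequence")
            # Strict decoding also rejects overlong forms and surrogates.
            items.append(("char", chunk.decode("utf-8")))
            i += n
        except ValueError:  # UnicodeDecodeError is a subclass of ValueError
            items.append(("byte", buf[i]))
            i += 1
    return items
```

Because a bad byte consumes exactly one position, the editor can resynchronize on the next byte and the file can be written back unchanged.
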
From: Koblinger E. <eg...@uh...> - 2004-04-20 16:51:29

On Tue, 20 Apr 2004, Carlos wrote:

> > characters (for example, in VIM you can type ^K Co to get the copyright
> > symbol), or are all of the characters on the keyboard? Basically does JOE
> > need to support digraphs?
>
> In my case, I can input almost all the diacritics and common
> typographic symbols from my keyboard, so for the special cases like
> greek characters and things like that, it's the same for me to
> remember the digraph or the hex code. I don't know if there are people
> who routinely input characters not in their keyboards (for these,
> digraphs would be useful).

My $0.02:

There are plenty of symbols that are not on my keyboard. For example the official Hungarian layout knows about the euro symbol, but doesn't know about copyright, em-dash, opening and closing quotation marks etc. This is strange, since we won't have the euro as our official currency for a while, but em-dash and quotation marks have been part of our written language for a long time. Most likely it's because all these "can think better than humans" type word processors automagically convert simple - or " characters to these. Ouch. I hate them...

Is "^K Co" configurable or hard-wired in vim? If you plan to hard-wire them (or at least provide a reasonable amount of default bindings) then you can never stop: people will ask for more and more symbols, making joe's config file grow bigger and bigger... If I have to manually configure my favorite ones, then I'd rather set up my keyboard layout, as that has two advantages: it's faster to type (altGr + a letter is faster than ` Co or similar) and it even works outside joe.

AFAIK current systems provide various input methods. I don't know any details about the unicode-aware console, xterm or KDE konsole, I only know about gnome-terminal. Here I have at least three ways to enter strange characters that are not present on my keyboard, and these methods work in all gtk2 applications. I can right-click and choose Input Method -> Unicode charmap to get a chart where I can pick any unicode character. I can press Ctrl+Shift and type the Unicode code in hexadecimal. I can copy-paste from other applications (and there are utilities that I can teach my favourite characters, so that I just click on them to copy them to the clipboard)... Many many possibilities. Joe would just add another one, which is fine to have and easier to use when I'm not on my usual machine, but IMHO it shouldn't have high priority :-))))

Actually, I'd sometimes prefer a "compose" feature, so that I type a' and it becomes á, while a` becomes an a with a grave accent and so on...

--
Egmont

From: Carlos <an...@qu...> - 2004-04-20 17:48:34

[Koblinger Egmont <eg...@uh...>, 2004-04-20 18.51 CEST]
[...]
> Is "^K Co" configurable or hard-wired in vim? If you plan to hard-wire (or
> at least provide a reasonable amount of default bindings) then you can
> never stop, people will ask for more and more symbols, making joe's config
> file grow bigger and bigger... If I have to manually configure my favorite
> ones, then I'd rather set my keyboard layout, as it has two advantages:
> faster to type (altGr + a letter is faster than ` Co or similar) and it
> even works outside joe.

The digraphs are probably hardwired in vim, but they are documented in RFC 1345. I suppose joe would use the same combinations.

http://www.faqs.org/rfcs/rfc1345.html

> AFAIK current systems provide various input methods. I don't know any
> details about unicode-aware console, xterm or KDE konsole, I only know
> about gnome-terminals. Here I have at least three ways to enter strange
> characters that are not present on my keyboard, and these methods work in
> all gtk2 applications. I can right-click, choose Input Method -> Unicode
> charmap to get a chartable where I can choose any unicode character. I can
> press Ctrl+Shift and type the Unicode code in hexadecimal. I can
> copy-paste from other applications (and there are some utilities for those
> I can teach my favourite characters and just click on them to copy them to
> the clipboard)... Many many possibilities. Joe would just add another one,
> which is fine to have, easier to use if I don't use my usual machine, but
> IMHO it shouldn't have high priority :-)))) Actually, I'd sometimes prefer
> a "compose" feature so that I type a' and it becomes á, while a` becomes
> an a with an accent grave and so on...

Well, I'm against that proposal because my quote keys are already silent :). ('a → á, `a → à, "a → ä, [AltGr =] o → ő, etc.).

I concur that digraphs are not high priority, at least for me. In fact, I'm very happy with UTF8 as it is working right now :) (I compiled joe only a few hours ago, so no bug has hit me yet).

Greetings.

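To make the digraph idea concrete, here is a tiny Python sketch of a lookup table keyed by two typed characters. The four entries are mnemonics as I know them from RFC 1345; the table layout and the name `lookup_digraph` are purely illustrative and say nothing about how vim stores its digraphs or how JOE would.

```python
# A handful of two-character mnemonics from RFC 1345; a real table is far longer.
DIGRAPHS = {
    "Co": "\u00a9",  # copyright sign
    "a'": "\u00e1",  # a with acute
    "a!": "\u00e0",  # a with grave
    "a:": "\u00e4",  # a with diaeresis
}


def lookup_digraph(first, second):
    """Return the character for a typed pair, or None if the pair is unknown."""
    return DIGRAPHS.get(first + second)


print("U+%04X" % ord(lookup_digraph("C", "o")))
```

Whether such a table is hard-wired or read from a config file is exactly the maintenance question raised above; the lookup itself stays the same either way.
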
From: Koblinger E. <eg...@uh...> - 2004-04-20 20:32:56
Attachments:
kword.desktop.gz

> Try it! :-)

I tried it :-))))

Some bugs, just to begin with something... ;)

Take the attached file, gunzip it, open it with joe -utf8 on a utf8 terminal (I have an 80x24 gnome-terminal from Gnome 2.4 switched to utf8 mode, LANG=hu_HU.UTF-8), and press pagedown 2 or 3 times. Two-letter country codes somehow become four letters long. Pressing ^R causes the screen to refresh correctly, that is, half of the screen moves two characters to the left so that everything becomes okay. It seems to me that the GenericName[lo] line is causing the problem.

In this file, I copy one or more double-width characters with my mouse from the GenericName[zh_**] lines and then paste them with the mouse or shift-insert. Pasting is usually okay, but if the cursor stands at the end of a line, then the pasted character is not shown correctly: it appears under the cursor instead of on its left, and a space is shown on the left side of the cursor. ^R repairs the screen.

I made the top bar show the percentage of where I am in the file. Currently it counts the number of bytes in the UTF-8 encoding. I'd rather see it counting the number of characters.

bye,

Egmont

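The difference between a byte-based and a character-based percentage can be shown with a short Python sketch. It assumes a well-formed UTF-8 buffer and counts characters by skipping continuation bytes; the function names are made up for the example, and JOE's real buffer structure is certainly not a flat byte string.

```python
def count_chars(buf):
    """Count UTF-8 characters by skipping continuation bytes (10xxxxxx)."""
    return sum(1 for b in buf if b & 0xC0 != 0x80)


def percent_by_bytes(buf, offset):
    """Position in percent, measured in bytes."""
    return 100 * offset // len(buf) if buf else 0


def percent_by_chars(buf, offset):
    """Position in percent, measured in characters up to the byte offset."""
    total = count_chars(buf)
    return 100 * count_chars(buf[:offset]) // total if total else 0


# Three 3-byte characters followed by three 1-byte characters.
buf = ("\u4e2d" * 3 + "abc").encode("utf-8")
print(percent_by_bytes(buf, 9), percent_by_chars(buf, 9))
```

Standing after the third wide character, the byte-based figure is 75 while the character-based one is 50. Note that the character count needs a pass over the data (or a cached count kept up to date on every edit), whereas the byte offset is available for free.
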
From: <ja...@av...> - 2004-04-20 22:26:06

Koblinger Egmont <eg...@uh...> wrote:

>Some bugs, just to begin with something... ;)

Thanks, this is very helpful.

>Take the attached file, gunzip it, open with joe -utf8 on an utf8 terminal
>(I have a 80x24 size gnome-terminal from Gnome 2.4 switched to utf8 mode,
>LANG=hu_HU.UTF-8), and press pagedown 2 or 3 times. Two-letter country
>codes somehow get four-letter long. Pressing ^R causes the screen to
>refresh correctly, that is, half of the screen goes two characters to the
>left so that everything gets okay. Seems to me that the GenericName[lo]
>line is causing the problem.

The problem is that gnome-terminal and JOE do not agree on the widths of the characters after GenericName[lo]. JOE works on uxterm, and VIM has the same problem as JOE for this file (hit A in VIM on that line and you'll see).

This is going to be an issue: Uxterm, JOE and VIM use mk_wcwidth() from Markus Kuhn (the guy who wrote the UTF-8 FAQ) to get character widths (wcwidth in glibc actually crashed). This is basically stupid: really I should query the font to get character widths, but there is no way to do it from a terminal emulator. I'm guessing that this is what gnome-terminal does. But it's strange, because the default font in gnome-terminal didn't have those characters... did the version of gnome-terminal you use have them? (I'm using version 2.2.1, which is what was on Slackware 9.0.)

>In this file, I copy one or more double-width characters with my mouse
>from the GenericName[zh_**] lines and then paste them with mouse or
>shift-insert. Pasting is usually okay, but if the cursor stands at the end
>of a line, then it's not shown correctly, it appears under the cursor
>instead of on its left, and a space is shown on the left side of the cursor.
>^R repairs the screen.

I fixed this and checked it in. When you type at the ends of lines (the most common activity), JOE tries to avoid doing a full screen update and just sends the character to the screen instead, but the code was assuming all characters were 1 wide.

>I made the top bar show the percent where I'm inside the file. Currently
>it counts the number of bytes in UTF-8 encoding. I'd rather see it
>counting the number of characters.

Unfortunately this is a big change (but not impossible). I'll put it on the list.

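For readers who want to see what a table-driven width function looks like, here is a coarse Python sketch built on the standard `unicodedata` module. It is not mk_wcwidth() and will disagree with it in corner cases (for instance it does not return -1 for control characters); `char_width` and `text_width` are hypothetical names.

```python
import unicodedata


def char_width(ch):
    """Rough number of terminal columns for one character: 0, 1 or 2."""
    # Combining marks and format characters take no column of their own.
    if unicodedata.combining(ch) or unicodedata.category(ch) in ("Mn", "Me", "Cf"):
        return 0
    # East Asian Wide and Fullwidth characters take two columns.
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1


def text_width(text):
    """Columns needed for a whole string under the same rough rules."""
    return sum(char_width(ch) for ch in text)


print(text_width("a\u0301\u4e2d"))
```

The end-of-line fast path described above would then advance the cursor column by `char_width(ch)` rather than by 1. The terminal still has the final say, which is exactly the disagreement seen with the GenericName[lo] line.
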
From: Koblinger E. <eg...@uh...> - 2004-04-20 22:54:54

> The problem is that gnome-terminal and JOE do not agree on the widths of the
> characters after GenericName[lo]. JOE works on uxterm, and VIM has the same
> problem as JOE for this file (hit A in VIM on that line and you'll see).

Yes, it seems that the two characters that joe believes to be 0-width are 1-width for gnome-terminal.

> This is going to be an issue: Uxterm, JOE and VIM use mk_wcwidth() from
> Markus Kuhn (the guy who wrote the UTF-8 FAQ) to get character widths
> (wcwidth in glibc actually crashed). This is basically stupid: really I
> should query the font to get character widths, but there is no way to do it
> from a terminal emulator. I'm guessing that this is what gnome-terminal
> does.

Yes, this is fundamentally stupid... Isn't it somehow possible to query the cursor's position? Then just print a character and see what the cursor does.

> But it's strange, because the default font in gnome-terminal didn't
> have those characters... did the version of gnome-terminal you use have
> them? (I'm using version 2.2.1 which is what was on Slackware 9.0).

I do have a font for this [lo] language in gnome-terminal. In uxterm I just see squares, but in gnome-terminal I see a font whose look reminds me of Hebrew text (it is completely different, though).

I have gnome-terminal 2.4.2, but AFAIK this doesn't really matter: gnome-terminal is just a GUI around the terminal emulator widget called 'vte' (optionally zvt in earlier gnome editions). My vte is 0.11.10.

konsole from kde 3.2 seems to do the right job. It shows this text properly; zero-width characters are really zero-width.

So, most likely we're facing a gnome-terminal/vte bug?

> >I made the top bar show the percent where I'm inside the file. Currently
> >it counts the number of bytes in UTF-8 encoding. I'd rather see it
> >counting the number of characters.
>
> Unfortunately this is a big change (but not impossible). I'll put it on the
> list.

Okay, this one is not important at all :-))

bye,

Egmont

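On the question of querying the cursor position: VT100-style terminals answer the "Device Status Report" request `ESC [ 6 n` with a reply of the form `ESC [ row ; col R`. The Python sketch below only covers building the request and parsing the reply; actually talking to the terminal (raw mode, reading the reply from stdin, timeouts) is left out, and the helper names are invented for this example.

```python
import re

CPR_QUERY = b"\x1b[6n"  # request: ask the terminal where the cursor is
_CPR_REPLY = re.compile(rb"\x1b\[(\d+);(\d+)R")  # reply: ESC [ row ; col R


def parse_cursor_report(reply):
    """Extract (row, col), both 1-based, from a reply, or None if absent."""
    m = _CPR_REPLY.search(reply)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))


def measured_width(before, after):
    """Columns the cursor advanced between two reports taken on the same row."""
    (row0, col0), (row1, col1) = before, after
    return col1 - col0 if row0 == row1 else None


print(parse_cursor_report(b"\x1b[12;40R"))
```

Measuring a character this way costs a round trip to the terminal, so an editor would presumably want to do it rarely and cache the result rather than probe on every redraw.
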
From: Koblinger E. <eg...@uh...> - 2004-04-21 18:45:58

On Tue, 20 Apr 2004 ja...@av... wrote:

Hi,

> >In this file, I copy one or more double-width characters with my mouse
> >from the GenericName[zh_**] lines and then paste them with mouse or
> >shift-insert. Pasting is usually okay, but if the cursor stands at the end
> >of a line, then it's not shown correctly, it appears under the cursor
> >instead of on its left, and a space is shown on the left side of the cursor.
> >^R repairs the screen.
>
> I fixed this and checked it in. When you type at the ends of lines (the
> most common activity), JOE tries to avoid doing a full screen update and
> just sends the character to the screen instead, but the code was assuming
> all characters were 1 wide.

This one is really fixed, but I just found a similar bug (maybe it was introduced by this fix): when the cursor stands to the right of the end of a row (press End on a longer line and then move up or down), newly inserted characters (even single-width ones) are shown incorrectly.

bye,

egmont

From: <ja...@av...> - 2004-04-21 19:49:11

Yeah, I screwed up the previous fix. It should be working now.

Koblinger Egmont <eg...@uh...> wrote:

>This one is really fixed, but I just found a similar bug (maybe it is
>introduced by this fix): when the cursor stands right from the last column
>in a row (you press End in a longer line and then use up or down) then
>newly inserted characters (even single-width ones) are shown incorrectly.

From: <ja...@av...> - 2004-04-22 05:01:29

Egmont, I fixed the bug you entered last year concerning the interaction of overtype mode with wordwrap mode.

Also for overtype mode: TAB now only inserts when you are past the end of the line; otherwise it's just a cursor motion key. Likewise, Enter is only a cursor motion key unless it's pressed at the end of the file.

From: Koblinger E. <eg...@uh...> - 2004-04-22 18:41:43

On Tue, 20 Apr 2004 ja...@av... wrote:

> This is going to be an issue: Uxterm, JOE and VIM use mk_wcwidth() from
> Markus Kuhn (the guy who wrote the UTF-8 FAQ) to get character widths
> (wcwidth in glibc actually crashed).

Just out of curiosity (it's not joe-related):

man wcwidth says:

    The behaviour of wcwidth depends on the LC_CTYPE category of the
    current locale.

(and the same for wcswidth)

Why does it depend on LC_CTYPE? Isn't wide char locale-independent?

(The "Function Index" of the glibc info page lacks these functions.)

--
Egmont

From: Pawel K. <pk...@be...> - 2004-04-22 18:51:04

On Thu, 22 Apr 2004, Koblinger Egmont wrote:

> On Tue, 20 Apr 2004 ja...@av... wrote:
>
> > This is going to be an issue: Uxterm, JOE and VIM use mk_wcwidth() from
> > Markus Kuhn (the guy who wrote the UTF-8 FAQ) to get character widths
> > (wcwidth in glibc actually crashed).
>
> Just for curiosity (it's not joe-related):
>
> man wcwidth says:
> The behaviour of wcwidth depends on the LC_CTYPE category of the
> current locale.
> (and same for wcswidth)
>
> Why does it depend on LC_CTYPE? Isn't wide char locale-independent?

My understanding is that it's not. It depends on your locale, because e.g. Chinese locales may result in wchar_t of size 3 or 4. And that is not true for the Polish locale (which has wchar_t of either 1 or 2).

pkot
--
mailto:pk...@be... http://www.gnokii.org/

From: Koblinger E. <eg...@uh...> - 2004-04-22 19:07:59

On Thu, 22 Apr 2004, Pawel Kot wrote:

> My understanding is that it's not. It depends on your locale, because eg.
> Chinese locales may result in wchar_t of size 3 or 4. And it is not true
> for Polish locale (which has wchar_t either 1 or 2).

The size of wchar_t is a compile-time issue. LANG is a run-time setting. So I don't think that's the explanation. And the main point of things like unicode and wide chars is that even if you have a Polish locale, you can still use Chinese characters with no problem.

--
Egmont

From: <ja...@av...> - 2004-04-22 19:23:39

The only thing I could guess is that it returns -1 for characters which are not part of the locale. I'll take a look at the glibc source at some point.

From: Pawel K. <pk...@be...> - 2004-04-22 21:16:07

On Thu, 22 Apr 2004, Koblinger Egmont wrote:

> On Thu, 22 Apr 2004, Pawel Kot wrote:
>
> > My understanding is that it's not. It depends on your locale, because eg.
> > Chinese locales may result in wchar_t of size 3 or 4. And it is not true
> > for Polish locale (which has wchar_t either 1 or 2).
>
> The size of wchar_t is a compile time issue.

I mean the effective size -- the result of wcwidth(). Sorry for the miswording.

> LANG is run-time. So I don't
> think it's okay. And the main point in stuff like unicode and wide char:
> even if you have Polish locale, you can still use Chinese characters with
> no problem.

This is not really true. If I use the pl_PL.utf8 locale, I can use Chinese characters, but if I use pl_PL.iso-8859-2 (the default locale), Chinese characters do not make sense. See wctomb(3). I think wcwidth() is the size of the resulting multibyte sequence.

pkot
--
mailto:pk...@be... http://www.gnokii.org/

From: Koblinger E. <eg...@uh...> - 2004-04-23 09:06:32

I asked the maintainer of the man-pages package, and this is what he replied:

    Who says that wide characters are Unicode characters?
    LC_CTYPE may well specify an entirely different encoding.

    Andries

... so there are multibyte encodings other than unicode.

--
Egmont

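For what it's worth, two different quantities got mixed up in this subthread: the number of bytes a character occupies in a given encoding (what wctomb() reports in C) and the number of terminal columns it occupies (what wcwidth() is specified to report). The Python sketch below illustrates the distinction, with Python's codecs standing in for locale encodings; the function names are made up for the example.

```python
import unicodedata


def encoded_length(ch, encoding):
    """Bytes needed for `ch` in `encoding`, or None if it cannot be represented."""
    try:
        return len(ch.encode(encoding))
    except UnicodeEncodeError:
        return None


def is_wide(ch):
    """True for characters that normally take two terminal columns."""
    return unicodedata.east_asian_width(ch) in ("W", "F")


# One Chinese character: its byte length varies by encoding, its width does not.
for enc in ("utf-8", "gb2312", "big5", "iso8859_2"):
    print(enc, encoded_length("\u4e2d", enc))
```

The same character needs 3 bytes in UTF-8, 2 in GB2312 or Big5, and cannot be encoded in ISO-8859-2 at all, while its display width stays at two columns. This fits both the reply quoted above (there are multibyte encodings besides Unicode ones) and the earlier guess that a character outside the locale's charset has no meaningful width there.
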