[joe] utf8 again...

Hi,

(sorry, I haven't yet tried the utf8 version from cvs, I'm only thinking
about how I imagine utf8 support...)

Previously I wrote sentences that sounded like "unicode changed the way of
thinking: let's not think in bytes but rather in human-readable letters",
"I can't see the reason for separating utf8 mode from non-utf8 mode",
"non-utf8 is not a charset" and stuff like this...

Now that I've thought more about it, I have to revise my opinion: it seems
to me that Allen was right about separating utf8 and nonutf8 mode.

There are two independent stories: the way the terminal behaves (utf8 or
classical) and the way the files should be interpreted.

The terminal is the easier story. Actually I believe that the world will
shift to utf8 real soon. Some Linux distributions have already switched
their terminal to utf8. It'll take 2-3 years, maybe 5, till all the modern
systems have utf8 terminal.

For text files the change will probably be much slower; imho we'll keep on
seeing plenty of non-utf8 text files even 20 years from now.

And the other side of the story, which I simply forgot about, is that joe
is not only a wonderful text editor, it's also a wonderful binary editor.
And of course binary files will never be valid utf8, and joe should not
try to interpret them as text files.

So now I believe that joe should achieve two goals: be a good text editor
(knowing about old-fashioned charsets and utf8) and at the same time a
good binary editor.

When editing utf8 text files, I expect ^K space to report the code of the
multibyte sequence the cursor stands over (either in the U+0151 style or
as the UTF-8 code, i.e. a sequence of decimal or hexadecimal numbers).
When in overtype mode, if I stand on an "a" and type "á" to replace it,
the size of the file changes, but who cares...
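
A minimal sketch of what such a report could look like, written in Python
rather than joe's C. The function name describe_char_at and the exact
output format are made up for illustration; this is not how joe does it.
It steps back from the cursor's byte offset to the lead byte of the
sequence, then prints both the U+XXXX form and the raw UTF-8 bytes.

```python
def describe_char_at(buf: bytes, offset: int) -> str:
    """Describe the UTF-8 character covering byte `offset` of `buf`."""
    # Step back over continuation bytes (bit pattern 10xxxxxx) to the lead byte.
    start = offset
    while start > 0 and (buf[start] & 0xC0) == 0x80:
        start -= 1

    # The lead byte tells how long the sequence is supposed to be.
    lead = buf[start]
    if lead < 0x80:
        length = 1
    elif lead >> 5 == 0b110:
        length = 2
    elif lead >> 4 == 0b1110:
        length = 3
    elif lead >> 3 == 0b11110:
        length = 4
    else:
        return f"invalid byte 0x{lead:02X}"

    seq = buf[start:start + length]
    try:
        ch = seq.decode("utf-8")
    except UnicodeDecodeError:
        # Truncated or malformed sequence: show the raw bytes only.
        return f"invalid sequence {seq.hex(' ').upper()}"
    return f"U+{ord(ch):04X}  UTF-8: {seq.hex(' ').upper()}"


if __name__ == "__main__":
    text = "a\u0151".encode("utf-8")      # "a" followed by U+0151
    print(describe_char_at(text, 0))
    print(describe_char_at(text, 2))      # cursor on the second byte of U+0151
```

Whether the status line shows the code point, the byte sequence, or both
is a design choice; the sketch simply shows both.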

When editing binary files, I expect every byte to be shown as a glyph,
^K space should also show one byte, in overtype mode I shouldn't be able
to change the size of the file etc... exactly how old joe works.

We can call these two modes utf8 and nonutf8 mode, or text and binary
mode. Let's stick to utf8 and nonutf8.

The four basic possibilities:

 - nonutf8 mode over a nonutf8 terminal. This is how joe works now. This
   behavior should remain reachable.

 - nonutf8 mode over a utf8 terminal. All that has to be done is to
   emulate an old-fashioned terminal in a utf8 one. IMHO it's not a big
   job. Every time a character is to be drawn, it's iconv()ed to its
   multibyte sequence, and the multiple bytes are written to stdout, while
   expecting the cursor to have moved only one character. A similar
   reverse conversion is done for the input. If the received utf8
   character doesn't fit the old-fashioned locale, saying so in the bottom
   line would be nice (just as when you try to type into a read-only
   file). The local charset to use could default to the one determined by
   Allen's hack (remove .UTF-8 from the locale name and find its charset),
   but it would be nice if it were selectable from the command line and
   even changeable at run time, on a per-buffer basis.

   A use-case scenario for this: I have LANG=hu_HU.UTF-8. I try to edit a
   French latin-1 text. When I open it, joe assumes it is in latin-2 and
   hence puts the wrong accents on top of some characters. I tell joe that
   it's latin-1 instead of latin-2. Joe now refreshes the screen, and this
   time the French accents are shown correctly. If I try to type a
   Hungarian accented letter, these are the possibilities I can imagine:
    - joe complains that the input doesn't fit latin-1 and the file
      remains the same. When I save it, it is saved in latin-1 encoding.
    - joe offers to convert the buffer to utf8 mode. If I accept, a
      proper conversion is done, joe keeps on working in utf8 mode, and
      at the end the file is saved as utf8.

 - utf8 mode over a utf8 terminal. There's not much to explain about
   it... As stated above, I expect ^K space to report the UCS or UTF-8
   code of the character I'm standing on. It's a good question whether
   the file offset or the number of characters should be shown.

 - utf8 mode over a nonutf8 terminal. We have to emulate a utf8 terminal
   over a nonutf8 one. Actually I don't mind if this possibility is not
   implemented at all. Here joe should try to display utf8 files in a
   nonutf8 terminal, which implies that lots of letters are displayed as
   question marks or other symbols. IMHO that's not really suitable for
   any serious work. And as I stated above, IMHO terminals will quickly
   turn to utf8 everywhere, so if this mode is not implemented, it won't
   be a problem after some years. Anyway, it's not that hard to implement
   once utf8 mode over a utf8 terminal is done: input characters are
   converted from 8bit to utf8, output characters are converted from utf8
   to the locale-specific charset, or if that's not possible, a special
   symbol (e.g. a question mark in a bright color) is shown.
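
The two emulation cases in the list above boil down to a handful of small
conversions, sketched here in Python instead of C with iconv(). Everything
named below is hypothetical: the LEGACY_CHARSETS table is a tiny stand-in
for a real locale database, the latin-1 fallback is an arbitrary
assumption, and none of the function names come from joe.

```python
from typing import Optional

# Illustrative stand-in for a locale database; a real program would ask the
# system instead of hard-coding this.
LEGACY_CHARSETS = {"hu_HU": "iso8859-2", "fr_FR": "iso8859-1"}


def charset_from_locale(locale_name: str, default: str = "iso8859-1") -> str:
    """Drop the '.UTF-8' part of a locale name and look up its old charset."""
    base = locale_name.split(".", 1)[0]
    return LEGACY_CHARSETS.get(base, default)


def byte_to_utf8_terminal(byte: int, charset: str) -> bytes:
    """nonutf8 mode, utf8 terminal: one buffer byte -> bytes to write out."""
    return bytes([byte]).decode(charset).encode("utf-8")


def utf8_key_to_byte(key: bytes, charset: str) -> Optional[int]:
    """Reverse direction for input; None means 'does not fit the charset'."""
    try:
        return key.decode("utf-8").encode(charset)[0]
    except UnicodeEncodeError:
        return None


def utf8_to_legacy_terminal(text: str, charset: str) -> bytes:
    """utf8 mode, nonutf8 terminal: unrepresentable letters become '?'."""
    return text.encode(charset, errors="replace")


if __name__ == "__main__":
    charset = charset_from_locale("hu_HU.UTF-8")
    # The same byte 0xF5 is drawn differently depending on the charset chosen.
    print(charset, byte_to_utf8_terminal(0xF5, charset))
    print("iso8859-1", byte_to_utf8_terminal(0xF5, "iso8859-1"))
    # Typing U+0151 into a latin-1 buffer is rejected.
    print(utf8_key_to_byte("\u0151".encode("utf-8"), "iso8859-1"))
```

A None result from utf8_key_to_byte is the point where the editor could
either complain in the bottom line or offer to convert the buffer to utf8
mode, as in the use case above.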

To summarize: previously I was wrong when I thought that a UTF-8 mode can
emulate everything. UTF-8 is all about human-readable text, but there are
not only text files in the world, there are binary files as well, where
the old way of thinking in bytes (rather than in glyphs) has to be
preserved.

bye,

Egmont