EXMH - Extensible MH Email Interface / Bugs / #8 UTF-8 messages not composed correctly

#8 UTF-8 messages not composed correctly

Milestone: Bug_(general_misbehavior)

Status: open

Owner: nobody

Labels: Sedit (mail composition) (4)

Priority: 7

Updated: 2001-11-20

Created: 2001-05-06

Creator: Markus Kuhn

Private: No

A) When I reply to a UTF-8 message and quote its
content, I get garbled text displayed in sedit that
looks like UTF-8 text decoded as ISO 8859-1 text. If I
send the message, it gets sent out labeled as
ISO-8859-1 and will appear on the recipient's screen as
garbled as in the sedit window. If I manually append
"; charset=UTF-8" to text/plain in the MIME header, the
resulting message will still be correct.
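
The symptom in A) can be reproduced outside exmh. The following is a minimal sketch in Python, not exmh's Tcl code; the sample string and the helper names show_as_latin1 and send_as_latin1 are made up for illustration, and it assumes the garbling is exactly "UTF-8 bytes decoded as ISO 8859-1". Under that assumption the bytes survive the round trip unchanged, which would explain why relabeling the message as UTF-8 is enough to make it correct.

```python
def show_as_latin1(utf8_bytes: bytes) -> str:
    """Decode UTF-8 bytes with the wrong charset (the assumed sedit behaviour)."""
    return utf8_bytes.decode("iso-8859-1")


def send_as_latin1(displayed: str) -> bytes:
    """Encode the displayed text as ISO 8859-1 for sending (also an assumption)."""
    return displayed.encode("iso-8859-1")


original = "Grüße"                      # hypothetical quoted text
wire_bytes = original.encode("utf-8")   # bytes of the received UTF-8 message
garbled = show_as_latin1(wire_bytes)    # what appears in the composer
resent = send_as_latin1(garbled)        # bytes of the outgoing reply

print(repr(garbled))                      # two garbage characters per umlaut
print(resent == wire_bytes)               # True: the bytes are unchanged
print(resent.decode("utf-8") == original) # True: correct once labeled UTF-8
```

ISO 8859-1 maps every byte value to exactly one character, so the wrong decoding loses no information; only the charset label on the outgoing message is wrong.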

B) When I cut&paste non-ASCII text characters from a
received UTF-8 message into an sedit window, they
are displayed correctly. Apparently, Tk is able to
cut&paste arbitrary UTF-8 text between its widgets
correctly. However, when I send off the message, sedit
appends "charset=us-ascii" in the MIME header, and in
the outgoing message, all the non-ASCII characters are
replaced by question marks. If I manually add
"; charset=UTF-8" to the MIME header first, the
non-ASCII characters still get replaced by question
marks.

In other words, Unicode characters that are correctly
displayed in sedit cannot be sent out correctly as a
UTF-8 message.
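
The question marks in B) are what a lossy conversion to US-ASCII produces. The sketch below is in Python rather than exmh's Tcl, and the helper name send_as_ascii is hypothetical; it only assumes that the buffer is forced through an ASCII encoder with a replacement character, which is a guess at the cause rather than a confirmed diagnosis.

```python
def send_as_ascii(body: str) -> bytes:
    """Encode as US-ASCII, replacing every unencodable character with '?'."""
    return body.encode("ascii", errors="replace")


body = "Grüße"               # hypothetical text pasted into the composer
sent = send_as_ascii(body)

print(sent)                  # b'Gr??e'
print(sent.decode("utf-8"))  # Gr??e: relabeling as UTF-8 cannot restore the text
```

Once the characters have been replaced, the information is gone from the byte stream, so changing the charset label in the MIME header afterwards has no effect.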

I am not sure exactly what is going on, but I suspect
that exmh applies mechanisms that might have been
appropriate when TCL encoded strings in some selectable
8-bit encoding, before it switched everything to
Unicode in version 8.1. It seems the way exmh handles
character sets needs serious reconsideration in the
light of TCL's Unicode support.

What should happen today when an sedit message is sent
is the following:

a) Check which Unicode characters are found in the
buffer.

b) Test if only characters in the range U0000..U007F
are found, and if yes, add charset=US-ASCII to the MIME
header and send out the buffer unmodified.

c) Test if only characters in the range U0000..U00FF
are found, and if yes, add charset=ISO-8859-1 to the
MIME header and send out the buffer after passing it
through a UTF-8->ISO-8859-1 converter.

d) Optional: Test if only characters in the range of
some optionally specifiable legacy encoding XXX (like
ISO 8859-7) are found, and if yes, add the name of this
encoding XXX to the MIME header and send out the buffer
after sending it through a UTF-8->XXX converter.

e) In all other cases, add charset=UTF-8 to the MIME
header and send out the buffer unmodified.

It seems b) is already implemented, but there is at the
moment no handling of the situation where the sedit
buffer can only be sent out as UTF-8 to prevent loss of
information.

Also, much of the exmh mechanics to select different
fonts for different MIME encodings has become obsolete.
Exmh should handle everything in Unicode only and leave
the font encoding selection entirely to Tk.

If you need to be sent a correct UTF-8 test message to
experiment, just let me know (mgk25@cl.cam.ac.uk) and
I'll email you one.

Markus Kuhn
University of Cambridge

Discussion

Markus Kuhn - 2001-11-20

priority: 5 --> 7