#69 Major errors with charsets support and display

Status: open
Owner: nobody
Labels: None
Priority: 5
Updated: 2006-09-15
Created: 2006-09-15
Creator: Anonymous
Private: No

Extended characters in header fields such as From, To and
Subject that are correctly encoded in an 8-bit encoding
(using either Quoted-Printable or Base64) are not shown
correctly. It seems that the encoded-words are decoded to
8-bit bytes but are then output raw, as if they were
assumed to be in the system's Windows encoding. It also
seems that the interface does not use Unicode and can
display only characters from the Windows charset selected
by the system's regional settings. UTF-7 is not decoded at
all (which is the least of the problems, because its use
is discouraged).

I have attached a file showing an actual screen dump of
how some characters were presented (top part) and how they
are supposed to look (bottom part; it is the same image as
the top part, but the characters have been corrected in a
graphics editing program). Charsets used (from the top):
1. ISO-8859-13; 2. UTF-8; 3. ISO-8859-2; 4. UTF-7. My
system uses Windows-1250 as its system encoding.

Now a few notes. Firstly: many Westerners (mainly, but not
only, from English-speaking countries) assume that all ISO
charsets are subsets of the corresponding Windows
charsets, which is not true.
While ISO-8859-1 may be displayed raw using Windows-1252,
in the case of ISO-8859-2 several characters in the range
160-191 are at positions different from Windows-1250 and
will not be shown correctly just by decoding them from
QP/B64 and piping them raw onto the display. ISO-8859-16
contains about 2/3 of ISO-8859-1 (at the same positions as
in Windows-1252) and half of ISO-8859-2 (mostly at
positions different from Windows-1250). If you don't take
care of these problems first, the results in many, many
languages will be as illustrated.
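
To make the position mismatch concrete, here is a minimal
Python sketch (an illustration only, not code from the
program itself). The function name mismatched_positions and
its parameters are my own, and it relies on Python's
built-in iso8859_2 and cp1250 codecs. It lists every byte
in the range 160-191 that means one character in
ISO-8859-2 but is shown as another when piped raw onto a
Windows-1250 display.

```python
def mismatched_positions(intended="iso8859_2", assumed="cp1250",
                         first=160, last=191):
    """List bytes that decode differently under the two charsets.

    Returns (byte value, intended character, character actually shown)
    for every byte in first..last where the two decodings disagree.
    """
    diffs = []
    for value in range(first, last + 1):
        raw = bytes([value])
        wanted = raw.decode(intended)
        # errors="replace" keeps the loop going if the assumed charset
        # leaves a byte undefined.
        shown = raw.decode(assumed, errors="replace")
        if wanted != shown:
            diffs.append((value, wanted, shown))
    return diffs


if __name__ == "__main__":
    for value, wanted, shown in mismatched_positions():
        print(f"byte {value}: expected {wanted!a}, displayed {shown!a}")
```

For example, byte 161 is A-ogonek (U+0104) in ISO-8859-2
but a caron (U+02C7) in Windows-1250. Calling the function
with "iso8859_1" and "cp1252" over 160-255 returns an empty
list, which is why the raw approach happens to work for
Western European mail.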

Secondly: you can't assume that everybody always
communicates in one language, the one his/her system is
set to. Many people speak more languages than their mother
tongue and may receive e-mails in, say, Polish, German and
Russian. There are also people who either prefer or have
to use the English version of the operating system even
though they speak a non-Western European language. This is
another reason why piping raw 8-bit characters onto the
display is the wrong approach.

Thirdly: the characters used in some languages can be
encoded in various charsets, and they may be at different
positions in each of them, as happens with Polish in
ISO-8859-2, ISO-8859-13 and ISO-8859-16.

I understand that providing true Unicode support may
require a substantial rewrite of the software, so a good
short-term solution would be to make sure that all
characters that exist in the Windows encoding used on the
host system are shown correctly.
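
The short-term solution could look roughly like the Python
sketch below. It is only an illustration under my own
assumptions, not the program's actual code: the function
names decode_header_value and to_system_charset are
hypothetical, the fallback to latin-1 for unknown charset
labels is an arbitrary choice, and cp1250 stands in for
whatever the host system uses. The idea is to decode each
encoded-word using the charset it declares, and only then
convert the result to the system charset.

```python
from email.header import decode_header


def decode_header_value(value):
    """Decode RFC 2047 encoded-words into a Unicode string."""
    parts = []
    for chunk, charset in decode_header(value):
        if isinstance(chunk, str):
            # No encoded-words were found; the text is already a str.
            parts.append(chunk)
            continue
        try:
            # Honour the charset declared in the encoded-word itself.
            parts.append(chunk.decode(charset or "ascii", errors="replace"))
        except LookupError:
            # Charset label unknown to Python: assumed fallback for this sketch.
            parts.append(chunk.decode("latin-1"))
    return "".join(parts)


def to_system_charset(text, system_charset="cp1250"):
    """Transcode to the host charset; unmappable characters become '?'."""
    return text.encode(system_charset, errors="replace")
```

With this, the ISO-8859-2 encoded-word
=?ISO-8859-2?Q?=A1=B1?= decodes to A-ogonek and a-ogonek,
which to_system_charset then writes as bytes A5 and B9 for
a Windows-1250 display instead of the raw A1 and B1. A
character that Windows-1250 lacks, such as a Cyrillic
letter, comes out as "?" rather than as an unrelated
glyph.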

Contact me: http://nowazelandia.prv.pl/becky_en.html

Discussion

  • How certain characters look and how they are expected to look

     
    Attachments