Our NGO gets a *lot* of international e-mail. Until the world realizes it should use UTF-8 capable mail software, we would like to be able to sort that mail by character encoding. That way we won't have to tell our stupid e-mail software the character set of each message in order to view it correctly. We could just set a magnet for each character encoding and view the mail in batches, or better still, tag it so the mail software can pick the charset automatically. This is particularly important for languages that have a choice of encodings in addition to Unicode, which turns out to be nearly all of them.
POPFile can sort by language to a degree using word statistics, but it does not seem to store and analyze character encoding names correctly.
I understand that you don't want to store the charset parameter in the POPFile database. I am asking only for the ability to search correctly for encoding identifiers in the From:, To:, and Subject: fields.
Currently, POPFile does not search for identifiers containing numerals or other non-alphabetic characters, and does not classify them correctly. In the following example it treats GB2312 as GB.
[unclassified] ??QQ????-????-????????-??
From: 3132y3u4@qyuy.com
To:
POPFile has quarantined a message. It is attached to this email.
Quarantined Message Detail
Original From: 3132y3u4@qyuy.com
Original To:
Original Subject: ????????QQ????????????????-?????????????-???????????????????¶-????????
To examine the email open the attachment. To change this mail's classification go to http://127.0.0.1:7070/jump_to_message?view=439668
The first 20 words found in the email are:
GB QUQ To Content Type text plain charset GB Content Transfer Encoding bit Date Tue Feb Priority Mailer Microsoft Outlook
Encapsulated message
??QQ????-????-????????-??
From: 3132y3u4@qyuy.com
To:
=?GB2312?B?QUQ=?=
To: cherlin@pacbell.net
Content-Type: text/plain;charset="GB2312"
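To illustrate what I think is happening, here is a minimal Python sketch. It is not POPFile's actual tokenizer (which I have not read); the two regular expressions are my own assumptions. The first keeps only runs of letters, which reproduces the "charset GB" seen in the report above. The second is a hypothetical alternative that keeps digits, hyphens, and underscores inside a token.

```python
import re

header = 'Content-Type: text/plain;charset="GB2312"'

# Assumption: a letters-only tokenizer, chosen because it reproduces the
# "Content Type text plain charset GB" word list in the quarantine report.
alpha_only = re.findall(r"[A-Za-z]+", header)

# Hypothetical alternative: a token starts with a letter or digit and may
# continue with letters, digits, hyphens, or underscores.
charset_aware = re.findall(r"[A-Za-z0-9][A-Za-z0-9_-]*", header)

print(alpha_only)     # ['Content', 'Type', 'text', 'plain', 'charset', 'GB']
print(charset_aware)  # ['Content-Type', 'text', 'plain', 'charset', 'GB2312']
```

With the second pattern the identifier GB2312 survives as a single word, so a magnet or the word statistics could match on it.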
Other examples of what we need recognized:
Subject: =?windows-1251?B?IsL77+vg8uAgxOji6OTl7eTu4jogyuDqIO3lIO/u7+Dx8vwg4iDt4Ovu4+7i++kg?=
=?windows-1251?B?yuDv6uDtIg==?=
(Russian Cyrillic)
Subject: =?KOI8-R?Q?=E2=C5=D3=D0=CC=C1=D4=CE=CF=C5 =D0=D2=CF=C4=D7=C9=D6=C5=CE=C9=C5 =D3=C1=CA=
(Russian Cyrillic)
Subject: =?iso-2022-jp?B?GyRCJVEhPCVGJSMhPCQ3JF4kORsoQg==?=
From: =?shift-jis?B?eXVyaQ==?= <risa_s841@yahoo.fr
(Yes, two different Japanese character encodings in the same message)
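In all of these headers the encoding name sits in the same place: between the leading =? and the next ?, followed by B or Q. A rough Python sketch of pulling it out is below. The function name charset_labels and the pattern are my own illustration, not anything in POPFile. The pattern deliberately matches only the start of each encoded word, so it still works on headers that are truncated or not strictly well formed.

```python
import re

# Matches the start of an encoded word: =?charset?B? or =?charset?Q?
ENCODED_WORD_PREFIX = re.compile(r"=\?([^?\s]+)\?[BbQq]\?")

def charset_labels(header_value):
    """Return the lowercased charset labels found in a header, in order."""
    return [m.group(1).lower() for m in ENCODED_WORD_PREFIX.finditer(header_value)]

headers = [
    "Subject: =?iso-2022-jp?B?GyRCJVEhPCVGJSMhPCQ3JF4kORsoQg==?=",
    "From: =?shift-jis?B?eXVyaQ==?= <risa_s841@yahoo.fr",
    "=?GB2312?B?QUQ=?=",
]

labels = [charset_labels(h) for h in headers]
print(labels)  # [['iso-2022-jp'], ['shift-jis'], ['gb2312']]
```

Lowercasing matters because the same encoding shows up as GB2312, gb2312, and so on, depending on the sending software.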
We need support at least for the following, which we get routinely:
GB2312 Simplified Chinese
Big5 Traditional Chinese
GB18030 All Chinese
EUC-KR Korean
JOHAB Korean
ISO-2022-JP Japanese
Shift_JIS Japanese (appears in headers as shift-jis, as in the example above)
Windows-1251 Russian Cyrillic
all of the KOI8-* encodings for Cyrillic (Russian, Ukrainian, and others)
all of the ISO 8859-* encodings for European, Hebrew, Arabic, Greek, and Cyrillic
Of these only JOHAB is correctly recognized.
Other people may need IBM, Mac, and country-specific encodings. See the View: Character Encoding menu in Mozilla/Firefox for more of the common options, or read the info and man pages for iconv for a reasonably complete list.
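For what it's worth, the list of encodings we need could be handled with a small lookup plus two wildcard rules for the KOI8 and ISO 8859 families. The Python below is only a sketch of that idea. The names EXACT and describe are hypothetical, and the language descriptions are just the ones from my list, not an official registry.

```python
import re

# Exact labels from the list in this request, lowercased.
EXACT = {
    "gb2312": "Simplified Chinese",
    "big5": "Traditional Chinese",
    "gb18030": "All Chinese",
    "euc-kr": "Korean",
    "johab": "Korean",
    "iso-2022-jp": "Japanese",
    "shift-jis": "Japanese",
    "windows-1251": "Russian Cyrillic",
}

def describe(label):
    """Map a charset label to a description, or None if it is not listed."""
    # Normalize case and treat underscores like hyphens (Shift_JIS vs shift-jis).
    key = label.strip().lower().replace("_", "-")
    if key in EXACT:
        return EXACT[key]
    if re.fullmatch(r"koi8-[a-z]+", key):
        return "Cyrillic (KOI8 family)"
    if re.fullmatch(r"iso-8859-\d{1,2}", key):
        return "ISO 8859 family"
    return None

print(describe("GB2312"))      # Simplified Chinese
print(describe("Shift_JIS"))   # Japanese
print(describe("KOI8-U"))      # Cyrillic (KOI8 family)
print(describe("ISO-8859-5"))  # ISO 8859 family
```

Anything that returns None would simply fall through to whatever POPFile does today.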