Display UTF-8 / non-ASCII characters in trn (2-line patch)

I recently started using trn (Debian trn4 = trn-4.0-test77), and
after taking the time to learn how to use it, am liking it a lot.
One annoyance, however, was the garbled display of non-ASCII characters.
A curly single quote would display as @^Y, for example.

Not sure if there's some work-around or configuration I overlooked,
but what ended up allowing UTF-8 text to render was recompiling with
this patch [1]. It's not clear to me if this could introduce other
problems, but it's been working fine so far. I pasted a bunch of
Cyrillic, Arabic and IPA characters into the body of a message on my
local spool and they all showed up perfectly in trn. Curly quotes are
coming through as curly quotes, etc. It's a simple two-line edit, and
I see another patch [2] (against an older trn 3.6) that's
more involved, so I don't entirely understand the meaning of something
like "UTF-8 aware" in this context... but personally I can't see using
trn without it.

Anyhow, if this two-line patch might cause an issue in certain
circumstances, perhaps the behavior around it could be toggled with
something like the "_C Switch to next available charset conversion"
command via a more involved patch.
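
As a rough idea of what such a toggle could look like, here is a
minimal sketch built on the two macros quoted in [1]. The flag
strip_parity_bit, the CHAR_MASK() helper and set_seven_bit_mode() are
hypothetical names of mine, not anything in trn, and how this would be
wired to a key command is not shown.

```c
#include <assert.h>

typedef unsigned char Uchar;

/* Hypothetical flag (not in trn): nonzero means "treat article text
 * as 7-bit with parity", zero means "treat it as 8-bit". */
static int strip_parity_bit = 0;

/* Hypothetical setter that a command handler could call. */
static void set_seven_bit_mode(int on)
{
    strip_parity_bit = on ? 1 : 0;
}

/* Mask applied before the control-character test:
 * 0x7F reproduces the original behaviour, 0xFF the patched one. */
#define CHAR_MASK() (strip_parity_bit ? 0x7F : 0xFF)

#define AT_GREY_SPACE(s) ((*(Uchar*)(s) & CHAR_MASK()) <= ' ' || *(s) == '\177')
#define AT_NORM_CHAR(s) ((*(Uchar*)(s) & CHAR_MASK()) >= ' ' && *(s) != '\177')
```

With the flag off the macros behave like the patched versions; calling
set_seven_bit_mode(1) brings back the original top-bit stripping.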

Thoughts?

John

[1] https://groups.google.com/d/msg/comp.sys.raspberry-pi/7Z37Hdrm0DM/6aqD-reXFzAJ
... when trn checks for control characters in an article it assumes
everything might be 7-bit with parity, so it strips off the top bit
before checking! As a result it turns every extended byte between
0x80 and 0x9F into a control character which it then escapes! Hence
the various garbage characters in any utf-8 or ISO-8859 article text.
I ended up with a simple fix, which anyone who's compiled their own
should be able to quickly replicate. In util.h there are two macros:

    #define AT_GREY_SPACE(s) ((*(Uchar*)(s) & 0x7F) <= ' ' || *(s) == '\177')
    #define AT_NORM_CHAR(s) ((*(Uchar*)(s) & 0x7F) >= ' ' && *(s) != '\177')

which I switched to:

    #define AT_GREY_SPACE(s) (*(Uchar*)(s) <= ' ' || *(s) == '\177')
    #define AT_NORM_CHAR(s) (*(Uchar*)(s) >= ' ' && *(s) != '\177')

and I'm now getting my utf-8 text unadulterated ... Of course I'll be
in trouble if I ever encounter any post that is actually 7-bit with
parity, but I wonder if this can ever happen these days? In the olden
times of UUCP, I suppose that was standard, but I think everything is
8-bit now.
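
To make the effect of the mask concrete, here is a small standalone
sketch (not trn code). It copies the two versions of AT_NORM_CHAR
under the made-up names OLD_AT_NORM_CHAR and NEW_AT_NORM_CHAR, and
count_rejected() is a hypothetical helper that counts the bytes a
given version would treat as control characters. A curly single quote
(U+2019) is the bytes E2 80 99 in UTF-8; masked with 0x7F, the last
two become 0x00 and 0x19, i.e. ^@ and ^Y, which would be consistent
with the kind of garbage described above.

```c
#include <assert.h>
#include <stdio.h>

typedef unsigned char Uchar;

/* Original test: the top bit is stripped before comparing. */
#define OLD_AT_NORM_CHAR(s) ((*(Uchar*)(s) & 0x7F) >= ' ' && *(s) != '\177')
/* Patched test: the full 8-bit value is compared. */
#define NEW_AT_NORM_CHAR(s) (*(Uchar*)(s) >= ' ' && *(s) != '\177')

/* Hypothetical helper: count the bytes of a NUL-terminated string
 * that the chosen macro does NOT accept as normal characters. */
static int count_rejected(const char *s, int use_old)
{
    int n = 0;
    for (; *s; s++) {
        int ok = use_old ? OLD_AT_NORM_CHAR(s) : NEW_AT_NORM_CHAR(s);
        if (!ok)
            n++;
    }
    return n;
}

/* Print the comparison for one string. */
static void report(const char *label, const char *s)
{
    printf("%s: old rejects %d byte(s), new rejects %d byte(s)\n",
           label, count_rejected(s, 1), count_rejected(s, 0));
}
```

Calling report("curly quote", "\xE2\x80\x99") shows the old macro
rejecting two of the three bytes and the new one rejecting none, while
a real control character such as a tab is rejected by both.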

[2] http://maradns.samiam.org/download/non-maradns/trn-3.6-utf8.hack.patch

-- 
John Magolske
http://b79.net/contact