Display UTF-8 / non-ASCII characters in trn (2-line patch)
From: John M. <lis...@b7...> - 2015-11-04 21:29:12
I recently started using trn (Debian trn4 = trn-4.0-test77) and, after taking the time to learn how to use it, am liking it a lot. One annoyance, however, was the garbled display of non-ASCII characters. A curly single quote would display as @^Y, for example. I'm not sure if there's some workaround or configuration I overlooked, but what ended up allowing UTF-8 text to render was recompiling with this patch [1].

It's not clear to me if this could introduce other problems, but it's been working fine so far. I pasted a bunch of Cyrillic, Arabic and IPA characters into the body of a message on my local spool and they all showed up perfectly in trn. Curly quotes are coming through as curly quotes, etc.

It's a simple two-line edit, and I see another patch [2] (against an older trn 3.6) that's more involved, so I don't entirely understand the meaning of something like "UTF-8 aware" in this context... but personally I can't see using trn without it.

Anyhow, if this two-line patch might cause an issue in certain circumstances, perhaps the behavior around it could be toggled with something like the "_C Switch to next available charset conversion" command via a more involved patch.

Thoughts?

John

[1] https://groups.google.com/d/msg/comp.sys.raspberry-pi/7Z37Hdrm0DM/6aqD-reXFzAJ

... when trn checks for control characters in an article it assumes everything might be 7-bit with parity, so it strips off the top bit before checking! As a result it turns every extended byte between 0x80 and 0x9F into a control character which it then escapes! Hence the various garbage characters in any utf-8 or ISO-8859 article text. I ended up with a simple fix, which anyone who's compiled their own should be able to quickly replicate.
In util.h there are two macros:

    #define AT_GREY_SPACE(s) ((*(Uchar*)(s) & 0x7F) <= ' ' || *(s) == '\177')
    #define AT_NORM_CHAR(s) ((*(Uchar*)(s) & 0x7F) >= ' ' && *(s) != '\177')

which I switched to:

    #define AT_GREY_SPACE(s) (*(Uchar*)(s) <= ' ' || *(s) == '\177')
    #define AT_NORM_CHAR(s) (*(Uchar*)(s) >= ' ' && *(s) != '\177')

and I'm now getting my utf-8 text unadulterated ...

Of course I'll be in trouble if I ever encounter any post that is actually 7-bit-with parity, but I wonder if this can ever happen these days? In the olden time of UUCP, I suppose that was standard, but I think everything is 8-bit these days.

[2] http://maradns.samiam.org/download/non-maradns/trn-3.6-utf8.hack.patch

-- 
John Magolske
http://b79.net/contact
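
To make the effect of the two-line change concrete, here is a small standalone sketch. It is not trn code: it only copies the AT_NORM_CHAR test quoted above under the hypothetical names OLD_AT_NORM_CHAR and NEW_AT_NORM_CHAR, assumes Uchar is a typedef for unsigned char, and adds two made-up helpers (count_rejected_old, count_rejected_new) that count how many bytes of a string each version would refuse to treat as a normal character.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: trn's Uchar is an unsigned char. Redefined here so the
 * sketch compiles on its own. */
typedef unsigned char Uchar;

/* Original test as quoted from util.h: the top bit is masked off
 * before the comparison. The OLD_ prefix is only for this sketch. */
#define OLD_AT_NORM_CHAR(s) ((*(Uchar*)(s) & 0x7F) >= ' ' && *(s) != '\177')

/* Patched test: the byte is compared as-is. The NEW_ prefix is only
 * for this sketch. */
#define NEW_AT_NORM_CHAR(s) (*(Uchar*)(s) >= ' ' && *(s) != '\177')

/* Hypothetical helper: count the bytes of a NUL-terminated string that
 * the original macro does NOT accept as normal characters. */
size_t count_rejected_old(const char *s)
{
    size_t n = 0;
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (!OLD_AT_NORM_CHAR(s))
            n++;
    }
    return n;
}

/* Hypothetical helper: the same count using the patched macro. */
size_t count_rejected_new(const char *s)
{
    size_t n = 0;
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (!NEW_AT_NORM_CHAR(s))
            n++;
    }
    return n;
}
```

For the UTF-8 encoding of a right curly single quote (bytes 0xE2 0x80 0x99), count_rejected_old("\xE2\x80\x99") returns 2, because 0x80 masks down to 0x00 and 0x99 masks down to 0x19 (Ctrl-Y, consistent with the ^Y in the garbled output described above), while count_rejected_new on the same string returns 0. Genuine ASCII control characters such as tab and DEL are still rejected by both versions.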