
#56 w3mman2html.cgi doesn't correctly underline UTF-8 characters

Status: open
Owner: nobody
Labels: None
Priority: 5
Updated: 2013-11-10
Created: 2010-11-22
Private: No

Software version: CVS

w3mman2html.cgi does not handle underlined UTF-8 text correctly: every single byte is underlined separately. For example, the backspace escape code _^Hé is transformed into the HTML code <u>\0xC3</u>\0xA9, which splits é into two invalid 1-byte UTF-8 sequences.
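The failure can be reproduced outside the CGI with a short Python sketch. This is not the script's Perl code; the one-byte substitution rule below is a hypothetical stand-in that only mimics the behaviour described above.

```python
import re

# "_\bé" as the UTF-8 byte stream emitted for an underlined é.
raw = "_\bé".encode("utf-8")  # b'_\x08\xc3\xa9'

# Hypothetical byte-wise rule: underline exactly one byte after "_\b".
bytewise = re.sub(rb"_\x08(.)", rb"<u>\1</u>", raw, flags=re.DOTALL)


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(bytewise)                 # the lead byte 0xC3 ends up alone inside <u>
print(is_valid_utf8(bytewise))  # the result is no longer valid UTF-8
```

The closing tag lands between the lead byte 0xC3 and the continuation byte 0xA9, so the two bytes can no longer be decoded together as é.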

In fact, the script assumes that backspace escape codes are of the form __^Hé or é^H__ (with two underscores). However, as far as I could test with the Ubuntu man program, only one underscore is generated, regardless of the length of the UTF-8 encoding of that letter (man version 2.5.7-4, groff version 1.20.1-10).

As far as I could see, the bold and italic escape codes contain only one backspace character, so the match for multiple backspace characters is unnecessary, though harmless.

I am submitting a patch that should handle bold and underline escapes correctly, regardless of the byte length of the UTF-8 character. Until now, only 2-byte characters were taken into account.
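For illustration only, here is a minimal Python sketch of length-independent handling. It is not the submitted patch and not the CGI's Perl code: it assumes the input is valid UTF-8, decodes it first so that the regular expression sees whole characters, and uses the hypothetical names `TOKEN` and `overstrike_to_html`.

```python
import re
from html import escape
from itertools import groupby

# One token per character: "_\bX" is underlined X, "X\bX" is bold X,
# anything else is a plain character. Matching happens on decoded text,
# so "." is a whole character whatever its UTF-8 byte length.
TOKEN = re.compile(r"_\x08(?P<u>.)|(?P<b>.)\x08(?P=b)|(?P<plain>.)", re.DOTALL)


def overstrike_to_html(raw: bytes) -> str:
    """Convert backspace overstrike sequences in UTF-8 input to <u>/<b> HTML."""
    text = raw.decode("utf-8")
    tokens = []
    for m in TOKEN.finditer(text):
        if m.group("u") is not None:
            tokens.append(("u", m.group("u")))
        elif m.group("b") is not None:
            tokens.append(("b", m.group("b")))
        else:
            tokens.append((None, m.group("plain")))
    parts = []
    # Merge runs of equally styled characters into a single element.
    for style, run in groupby(tokens, key=lambda t: t[0]):
        chunk = escape("".join(ch for _, ch in run))
        parts.append(f"<{style}>{chunk}</{style}>" if style else chunk)
    return "".join(parts)


print(overstrike_to_html("_\bé_\bt_\bé".encode("utf-8")))
print(overstrike_to_html("_\b€".encode("utf-8")))
```

Decoding before matching is a design choice of this sketch; a byte-level pattern that consumes a lead byte together with its continuation bytes would be an alternative that avoids assuming the input encoding up front.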

If the man page is in a single-byte encoding instead of UTF-8, the underline matching code may match too many characters, for example for the combination _^Hé. Such sequences should be very rare, however, since usually only whole words are underlined, so a backspace escape code will be followed either by a space or by another backspace escape code.
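The over-matching risk can be illustrated with a hypothetical byte-level pattern in Python. The pattern is an assumption made for this sketch and is not taken from the patch: after "_\b" it accepts one ASCII byte, or a byte in the range 0xC0-0xFF followed by any bytes in the range 0x80-0xBF.

```python
import re

# Hypothetical pattern: one ASCII byte, or a lead byte (0xC0-0xFF) plus any
# continuation-range bytes (0x80-0xBF) that happen to follow it.
UNDERLINE = re.compile(rb"_\x08([\x00-\x7F]|[\xC0-\xFF][\x80-\xBF]*)")


def underline_bytes(raw: bytes) -> bytes:
    """Wrap each underlined unit in <u>...</u> without decoding the input."""
    return UNDERLINE.sub(rb"<u>\1</u>", raw)


# UTF-8 input: both bytes of é are kept together.
utf8_ok = underline_bytes("_\bé".encode("utf-8"))

# Latin-1 input, usual case: é (0xE9) is followed by a space, nothing extra matches.
latin1_common = underline_bytes("_\bé fin".encode("latin-1"))

# Latin-1 input, rare case: é is followed by » (0xBB), which looks like a
# continuation byte and is swallowed into the underlined unit.
latin1_rare = underline_bytes("_\bé»".encode("latin-1"))

print(utf8_ok, latin1_common, latin1_rare)
```

In the rare case the » is underlined along with the é even though only the é carried the escape code, which is the kind of over-match described above.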

Discussion

  • Piotr P. Karwasz

    Correct underline processing and more UTF-8 support

     
