Software version: CVS
W3mman2html doesn't handle underlined UTF-8 text correctly: every single byte is underlined separately. For example, the backspace escape code _^Hé is transformed into the HTML code <u>\0xC3</u>\0xA9, which contains two invalid 1-byte UTF-8 sequences.
In fact it assumes that backspace escape codes are of the form __^Hé or é^H__ (with two underscores). However, as far as I could test with the Ubuntu man program, only one underscore is generated, independently of the length of the UTF-8 encoding for that letter (man version 2.5.7-4, groff version 1.20.1-10).
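To illustrate why the output above is broken, here is a small Python check (not part of w3mman2html, which is a Perl script). The helper name is_valid_utf8 is only for this illustration. It shows that splitting the two bytes of é across the <u> tag leaves bytes that no longer decode as UTF-8, while keeping the whole character inside the tag does.

```python
# Bytes that man emits for an underlined "é" in UTF-8:
# underscore, backspace, then the two bytes 0xC3 0xA9.
underlined = b"_\x08" + "é".encode("utf-8")

# Output described in the report: only the first byte ends up inside <u>.
broken = b"<u>\xc3</u>\xa9"

# What a correct conversion should produce: the whole character inside <u>.
expected = b"<u>\xc3\xa9</u>"


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (illustrative helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(is_valid_utf8(broken))    # the split sequence does not decode
print(is_valid_utf8(expected))  # the intact character does
```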
As far as I could see, the bold and italic escape codes contain only one backspace character, so matching multiple backspace characters is unnecessary, although harmless.
I submit a patch that should correctly deal with bold and underline escapes regardless of the length of the UTF-8 character. Until now only 2-byte characters were taken into account.
If the man page is in a single-byte encoding instead of UTF-8, the underline matching code may match too many characters, for example with the combination _^Hé. Such sequences should however be very rare, since usually only whole words are underlined and a backspace escape code will be followed either by a space or by another backspace escape code.
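The idea behind the patch can be sketched as follows. This is a Python illustration under my own assumptions, not the actual patch (w3mman2html is a Perl script); the names UTF8_CHAR, UNDERLINE, BOLD and overstrike_to_html are hypothetical. It matches one underscore, one backspace and then a complete UTF-8 character of 1 to 4 bytes, so the whole character ends up inside the tag. HTML escaping and the ambiguous case _^H_ are deliberately left out.

```python
import re

# One complete UTF-8 character: printable ASCII, or a lead byte followed by
# the matching number of continuation bytes (2-, 3- or 4-byte sequences).
UTF8_CHAR = (
    rb"(?:[\x20-\x7e]"
    rb"|[\xc2-\xdf][\x80-\xbf]"
    rb"|[\xe0-\xef][\x80-\xbf]{2}"
    rb"|[\xf0-\xf4][\x80-\xbf]{3})"
)

# Underline: "_", one backspace, then the character.
UNDERLINE = re.compile(rb"_\x08(" + UTF8_CHAR + rb")")
# Bold: the character, one backspace, then the same character again.
BOLD = re.compile(rb"(" + UTF8_CHAR + rb")\x08\1")


def overstrike_to_html(data: bytes) -> bytes:
    """Convert backspace overstrike sequences to <u>/<b> tags (sketch)."""
    data = UNDERLINE.sub(rb"<u>\1</u>", data)
    data = BOLD.sub(rb"<b>\1</b>", data)
    # Merge adjacent tags so an underlined word becomes a single <u> element.
    return data.replace(b"</u><u>", b"").replace(b"</b><b>", b"")


sample = "_\bé_\bt_\bé".encode("utf-8")
print(overstrike_to_html(sample).decode("utf-8"))
```

With a single-byte encoding the byte for é (0xE9 in Latin-1) looks like a 3-byte lead byte here, which is the over-matching risk mentioned above.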
Correct underline processing and more UTF-8 support
Fixed in Debian w3m 0.5.3-10.
http://anonscm.debian.org/gitweb/?p=collab-maint/w3m.git