#261 Win1252 chars get mangled in comments

closed-invalid
5
2003-03-29
2002-10-11
Björn Höhrmann
No

See
http://www.w3.org/2002/02/mid/Pine.LNX.4.44.0207121021530.10810-100000@drizzle.com

<!-- <o:tag>Coulomb's law</o:tag> -->

Replace ' with chr(146).

Tidy outputs the same as above but replaces chr(146)
with chr(25), for no apparent reason. Since comments
do not allow character references, characters in
comments must be representable in the output character
encoding. If that encoding is, e.g., US-ASCII or
ISO-8859-1, the character in question cannot be output.
What should Tidy do in this case? Fail completely? Replace
the character with '?'? Remove the character? Change the
output encoding to UTF-8?
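
The options raised above can be made concrete with a small sketch. This is not Tidy's code: the names OutEncoding, Fallback, is_representable and filter_comment are hypothetical and exist only for illustration. The sketch assumes the comment text is already decoded to Unicode code points and shows the "fail", "replace by '?'" and "remove" policies side by side.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical output encodings for this sketch; not Tidy's own constants. */
typedef enum { ENC_US_ASCII, ENC_ISO_8859_1, ENC_UTF_8 } OutEncoding;

/* Hypothetical fallback policies, matching the options raised in the report. */
typedef enum { POLICY_FAIL, POLICY_QUESTION_MARK, POLICY_REMOVE } Fallback;

/* Returns 1 if the Unicode code point c can be written directly in enc. */
static int is_representable(unsigned long c, OutEncoding enc)
{
    switch (enc) {
    case ENC_US_ASCII:   return c <= 0x7F;
    case ENC_ISO_8859_1: return c <= 0xFF;
    case ENC_UTF_8:      return c <= 0x10FFFF && !(c >= 0xD800 && c <= 0xDFFF);
    }
    return 0;
}

/*
 * Copies n code points from in to out for use inside a comment, where
 * character references are not available. Returns the number of code
 * points written, or -1 if the policy is POLICY_FAIL and a character
 * cannot be represented. out must have room for n code points.
 */
static long filter_comment(const unsigned long *in, size_t n,
                           unsigned long *out,
                           OutEncoding enc, Fallback policy)
{
    size_t i, w = 0;

    assert(in != NULL && out != NULL);
    for (i = 0; i < n; i++) {
        if (is_representable(in[i], enc)) {
            out[w++] = in[i];
        } else if (policy == POLICY_QUESTION_MARK) {
            out[w++] = '?';
        } else if (policy == POLICY_FAIL) {
            return -1;
        }
        /* POLICY_REMOVE: drop the character silently. */
    }
    return (long)w;
}
```

The fourth option, switching the output encoding to UTF-8, corresponds to calling filter_comment with ENC_UTF_8, in which case U+2019 passes through unchanged.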

Discussion

  • Terry Teague
    2002-10-11

    Logged In: YES
    user_id=225318

    FYI: chr(146) in Win1252 is mapped to U+2019 (right single quotation
    mark) in the Win2Unicode table in tidy.c, and pprint.c contains:

    /*
    Filters from Word and PowerPoint often use smart
    quotes resulting in character codes between 128
    and 159. Unfortunately, the corresponding HTML 4.0
    entities for these are not widely supported. The
    following converts dashes and quotation marks to
    the nearest ASCII equivalent. My thanks to
    Andrzej Novosiolov for his help with this code.
    */

    if ( (MakeClean && AsciiChars) || MakeBare )
    {
        if (c >= 0x2013 && c <= 0x201E)
            switch (c) {
            ...
            case 0x2018: /* left single quotation mark */
            ...
                c = '\'';
                break;
            ...
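
For reference, the two steps described in this comment can be sketched in a self-contained form. This is a reconstruction under assumptions, not a copy of tidy.c or pprint.c: the names win1252_high, win1252_to_unicode, fold_to_ascii and low_byte are hypothetical, and the fold cases are my own reading of "dashes and quotation marks", not the elided lines of the excerpt. In the standard Windows-1252 table, byte 146 (0x92) is U+2019 and byte 145 (0x91) is U+2018. It may also be worth noting, without claiming it as the diagnosis, that 25 is exactly the low byte of 0x2019.

```c
#include <assert.h>

/*
 * Windows-1252 bytes 0x80..0x9F mapped to Unicode. Unassigned bytes
 * (0x81, 0x8D, 0x8F, 0x90, 0x9D) are mapped to 0 in this sketch.
 */
static const unsigned int win1252_high[] = {
    0x20AC, 0x0000, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x0000, 0x017D, 0x0000,
    0x0000, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x0000, 0x017E, 0x0178
};

/* Decodes one Windows-1252 byte to a Unicode code point (0 if unassigned). */
static unsigned int win1252_to_unicode(unsigned char b)
{
    /* Guard against a miscounted table: 0x80..0x9F is 32 entries. */
    assert(sizeof win1252_high / sizeof win1252_high[0] == 32);

    if (b >= 0x80 && b <= 0x9F)
        return win1252_high[b - 0x80];
    return b;  /* 0x00..0x7F and 0xA0..0xFF match ISO-8859-1 */
}

/* Folds dashes and quotation marks in U+2013..U+201E to ASCII. */
static unsigned int fold_to_ascii(unsigned int c)
{
    switch (c) {
    case 0x2013: case 0x2014:              /* en dash, em dash */
        return '-';
    case 0x2018: case 0x2019: case 0x201A: /* single quotation marks */
        return '\'';
    case 0x201C: case 0x201D: case 0x201E: /* double quotation marks */
        return '"';
    default:
        return c;
    }
}

/* Low byte of a code point: what a narrowing cast to unsigned char keeps. */
static unsigned char low_byte(unsigned int c)
{
    return (unsigned char)(c & 0xFF);
}
```

With these helpers, fold_to_ascii(win1252_to_unicode(146)) yields an ASCII apostrophe, while low_byte(0x2019) yields 25, the value seen in the reported output.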

     
    • assigned_to: nobody --> hoehrmann
    • status: open --> closed-invalid
     
  • Logged In: YES
    user_id=188003

    Closing this bug, since Tidy works as designed. I will send
    mail to the mailing list about the encoding issues.