#261 Win1252 chars get mangled in comments



<!-- <o:tag>Coulomb's law</o:tag> -->

Replace the ' with chr(146).

Tidy outputs the same as above but replaces the chr(146)
with chr(25) for whatever reason. Since comments
do not allow character references, characters in
comments must be representable in the output character
encoding. If the character encoding is e.g. US-ASCII or
ISO-8859-1, the mentioned character cannot be output.
What should Tidy do in this case? Fail completely? Replace the
character with '?'? Remove the character? Change the
output encoding to UTF-8?
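
To make the options above concrete, here is a minimal sketch in C of three of them (fail, replace with '?', remove). It is not Tidy's code: the names `CommentPolicy` and `encode_comment` are hypothetical, and the sketch assumes a single-byte output encoding described only by its highest representable code point (0x7F for US-ASCII, 0xFF for ISO-8859-1).

```c
#include <stddef.h>

/* Hypothetical policies for a code point the output encoding cannot hold. */
typedef enum { POLICY_FAIL, POLICY_REPLACE, POLICY_DROP } CommentPolicy;

/*
 * Encode n Unicode code points from in[] as single bytes into out[].
 * max_cp is the highest code point the (assumed single-byte) output
 * encoding can represent: 0x7F for US-ASCII, 0xFF for ISO-8859-1.
 * Returns the number of bytes written, or -1 if the policy is
 * POLICY_FAIL and an unrepresentable code point is found, or if
 * out[] is too small.
 */
long encode_comment(const unsigned long *in, size_t n, unsigned long max_cp,
                    CommentPolicy policy, unsigned char *out, size_t out_cap)
{
    size_t written = 0;

    for (size_t i = 0; i < n; ++i) {
        unsigned long c = in[i];

        if (c > max_cp) {
            if (policy == POLICY_FAIL)
                return -1;      /* give up on the whole comment */
            if (policy == POLICY_DROP)
                continue;       /* silently remove the character */
            c = '?';            /* POLICY_REPLACE */
        }
        if (written >= out_cap)
            return -1;
        out[written++] = (unsigned char)c;
    }
    return (long)written;
}
```

The fourth option, switching the output encoding to UTF-8, is not covered here because it changes the whole document rather than a single comment.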


  • Anonymous - 2002-10-11

    FYI: chr(146) in Win1252 is mapped to U+2019 (right single quotation
    mark; U+2018 is chr(145)) by the Win2Unicode table in tidy.c, and
    pprint.c handles these quotation marks as follows:

    Filters from Word and PowerPoint often use smart
    quotes resulting in character codes between 128
    and 159. Unfortunately, the corresponding HTML 4.0
    entities for these are not widely supported. The
    following converts dashes and quotation marks to
    the nearest ASCII equivalent. My thanks to
    Andrzej Novosiolov for his help with this code.

    if ( (MakeClean && AsciiChars) || MakeBare )
        if (c >= 0x2013 && c <= 0x201E)
            switch (c) {
            case 0x2018: /* left single quotation mark */
                c = '\'';
                break;
            /* ... */
            }

  • Björn Höhrmann

    • assigned_to: nobody --> hoehrmann
    • status: open --> closed-invalid
  • Björn Höhrmann

    Closing this bug, since Tidy works as designed. Will send
    mail to mailing list for encoding issues.
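
For reference, the Windows-1252 mapping discussed in the first comment can be sketched as a small lookup table in C. The values are the standard Windows-1252 assignments for bytes 0x80 to 0x9F; the names `win1252_high` and `win1252_to_unicode` are hypothetical and not taken from tidy.c. In this table byte 146 (0x92) corresponds to U+2019 and byte 145 (0x91) to U+2018. The low byte of U+2019 is 0x19, i.e. 25, which matches the chr(25) in the report; whether such a truncation is the actual cause is an assumption that this thread does not confirm.

```c
/* Windows-1252 bytes 0x80..0x9F mapped to Unicode; 0 marks an undefined byte. */
static const unsigned int win1252_high[32] = {
    0x20AC, 0x0000, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x0000, 0x017D, 0x0000,
    0x0000, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x0000, 0x017E, 0x0178
};

/* Map one Windows-1252 byte to a Unicode code point (0 if undefined). */
unsigned int win1252_to_unicode(unsigned char b)
{
    if (b >= 0x80 && b <= 0x9F)
        return win1252_high[b - 0x80];
    return b;   /* all other bytes coincide with ISO-8859-1 */
}
```

Calling `win1252_to_unicode(146)` yields 0x2019, and masking that value with 0xFF leaves 0x19.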

