#261 Win1252 chars get mangled in comments

Status: closed-invalid
Priority: 5
Updated: 2003-03-29
Created: 2002-10-11
Private: No

See
http://www.w3.org/2002/02/mid/Pine.LNX.4.44.0207121021530.10810-100000@drizzle.com

<!-- <o:tag>Coulomb's law</o:tag> -->

In the example above, replace the ' with chr(146) (byte 0x92 in Windows-1252).

Tidy outputs the same as above, but for some reason replaces
chr(146) with chr(25). Since comments do not allow character
references, characters in comments must be representable in
the character encoding. If the character encoding is, e.g.,
US-ASCII or ISO-8859-1, the character in question cannot be
output. What should Tidy do in this case? Fail completely?
Replace the character with '?'? Remove the character? Change
the output encoding to UTF-8?
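One possible reading of the chr(25) result, offered as an assumption rather than a confirmed diagnosis: in the standard Windows-1252 table, byte 146 (0x92) corresponds to U+2019, and keeping only the low byte of 0x2019 gives 0x19, which is 25. The sketch below reproduces that arithmetic and illustrates one of the options raised above, replacing an unrepresentable character with '?'. The function names `win1252_to_unicode`, `truncate_to_byte`, and `encode_latin1_with_fallback` are hypothetical and are not part of Tidy.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of Tidy: map one Windows-1252 byte to a
   Unicode code point. Only 0x92 (chr 146) is special-cased here; other
   bytes are returned unchanged, which is not correct for the rest of
   the 0x80-0x9F range. */
unsigned int win1252_to_unicode(unsigned char b)
{
    return (b == 0x92) ? 0x2019u : (unsigned int)b;
}

/* Keep only the low 8 bits of a code point. For U+2019 this yields
   0x19, i.e. 25, the value mentioned in the report. */
unsigned char truncate_to_byte(unsigned int cp)
{
    return (unsigned char)(cp & 0xFFu);
}

/* Hypothetical fallback policy: write code points as ISO-8859-1 bytes,
   substituting '?' for anything above U+00FF. Returns the number of
   bytes written, which always equals n. */
size_t encode_latin1_with_fallback(const unsigned int *cps, size_t n,
                                   unsigned char *out)
{
    size_t i;

    assert(cps != NULL && out != NULL);
    for (i = 0; i < n; i++)
        out[i] = (cps[i] <= 0xFFu) ? (unsigned char)cps[i]
                                   : (unsigned char)'?';
    return n;
}
```

Under this policy, the code points for `b`, U+2019, and `s` would be written as the three bytes `b?s`. The other options listed above (failing, dropping the character, or switching the output to UTF-8) would each need a different policy in place of the '?' substitution.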

Discussion

  • Anonymous - 2002-10-11

    Logged In: YES
    user_id=225318

    FYI: chr(146) in Win1252 is mapped to U+2019 in the Win2Unicode table
    in tidy.c, and pprint.c contains:

    /*
    Filters from Word and PowerPoint often use smart
    quotes resulting in character codes between 128
    and 159. Unfortunately, the corresponding HTML 4.0
    entities for these are not widely supported. The
    following converts dashes and quotation marks to
    the nearest ASCII equivalent. My thanks to
    Andrzej Novosiolov for his help with this code.
    */

    if ( (MakeClean && AsciiChars) || MakeBare )
    {
        if (c >= 0x2013 && c <= 0x201E)
            switch (c) {
            ...
            case 0x2018: /* left single quotation mark */
            ...
                c = '\'';
                break;
            ...

     
  • Björn Höhrmann

    • assigned_to: nobody --> hoehrmann
    • status: open --> closed-invalid
     
  • Björn Höhrmann

    Logged In: YES
    user_id=188003

    Closing this bug, since Tidy works as designed. Will send
    mail to mailing list for encoding issues.
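
The pprint.c excerpt quoted in the discussion is elided, so the following is only a sketch of the general technique it describes (mapping typographic dashes and quotation marks to the nearest ASCII character), not a reconstruction of Tidy's source. The function name `to_nearest_ascii` and its boolean parameters are hypothetical stand-ins for the `MakeClean`, `AsciiChars`, and `MakeBare` options named in the excerpt, and the set of code points handled is an assumption drawn from the U+2013 to U+201E range the excerpt tests. Whether this path is applied to comment text is not stated in the thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sketch, not Tidy's source: map a few typographic
   punctuation code points to an ASCII stand-in when the (assumed)
   option flags ask for it. Unlisted code points pass through. */
unsigned int to_nearest_ascii(unsigned int c, bool make_clean,
                              bool ascii_chars, bool make_bare)
{
    assert(c <= 0x10FFFFu); /* caller must pass a Unicode code point */

    /* Same gating condition as in the quoted excerpt. */
    if (!((make_clean && ascii_chars) || make_bare))
        return c;

    switch (c) {
    case 0x2013: /* en dash */
    case 0x2014: /* em dash */
        return '-';
    case 0x2018: /* left single quotation mark */
    case 0x2019: /* right single quotation mark */
    case 0x201A: /* single low-9 quotation mark */
        return '\'';
    case 0x201C: /* left double quotation mark */
    case 0x201D: /* right double quotation mark */
    case 0x201E: /* double low-9 quotation mark */
        return '"';
    default:
        return c;
    }
}
```

With these assumptions, U+2019 becomes an ASCII apostrophe only when the bare option, or both the clean and ASCII options, are enabled; otherwise the code point is returned unchanged and still has to be representable in the output encoding.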

     
