Re: [SM-DEVEL] htmlspecialchars borking non-utf8 encodings

On Wed, May 26, 2021 11:50 pm, Alexey Shpakovsky via squirrelmail-devel
wrote:
> Hi SquirrelMail devs,
>
> Today, while investigating a decoding failure in a spam message, I found
> unexpected behaviour in the htmlspecialchars function that affects
> SquirrelMail.

I didn't read your mail in detail, because I believe this issue is solved
by the patch you can find here:

https://sourceforge.net/p/squirrelmail/bugs/2806/?page=3

Patch for 1.4.x:  quoted_printable_fix-1.4.x-version_3.diff

If not, please make sure you're using the newest 1.4.23-svn snapshot with
this patch.

Cheers,
-- 
Paul Lesniewski
SquirrelMail Team
Please support Open Source Software by donating to SquirrelMail!
http://squirrelmail.org/donate_paul_lesniewski.php

> Consider this sample code (you can paste it into the w3schools
> PHP tryit editor if you don't want to run it on your machine):
>
> <?php
> $str=base64_decode('1OXk5fDg6/zt4P8g8evz5uHgIO/uIPLw8+TzIOgg5+Dt//Lu8fLo');
> echo 'before: ' . bin2hex($str) . '<br>';
> $esc=htmlspecialchars($str, ENT_COMPAT | ENT_SUBSTITUTE, 'utf-8');
> echo 'after:  ' . bin2hex($esc) . '<br>';
> ?>
>
> The "before" and "after" lines show hex representations of the same string
> (each byte is encoded as two hex characters) before and after
> it is processed by the htmlspecialchars function. They are vastly
> different:
>
> before: d4e5e4e5f0e0ebfc...
> after: efbfbdefbfbdefbfbd...
>
> The base64-decoded string is a valid cp1251 string; you can see some
> Cyrillic letters by adding this line to the code above:
>
> echo 'actual: ' . iconv('cp1251', 'utf-8', $str) . '<br>';
>
> No HTML special characters are present in that string. However, because of
> the ENT_SUBSTITUTE flag and the final 'utf-8' argument, htmlspecialchars
> replaces every byte sequence that is invalid in utf-8 with a Unicode
> Replacement Character (U+FFFD, encoded as the bytes ef bf bd). And a valid
> cp1251 string is full of such sequences!
>
> To fix it in the code above, one should specify 'cp1251' instead of
> 'utf-8' as the last argument to the htmlspecialchars function.
>
>
>
> How does this affect SquirrelMail?
>
>
>
> In version 1.4, the charset_decode function calls
> sm_encode_html_special_chars without passing any character encoding to it:
>
> https://sourceforge.net/p/squirrelmail/code/HEAD/tree/branches/SM-1_4-STABLE/squirrelmail/functions/i18n.php#l187
>
> which makes sm_encode_html_special_chars default to $default_encoding
> (utf-8 in most cases, I believe), which is later passed to
> htmlspecialchars:
>
> https://sourceforge.net/p/squirrelmail/code/HEAD/tree/branches/SM-1_4-STABLE/squirrelmail/functions/strings.php#l1559
>
> And charset_decode is called by decodeHeader when it encounters a
> base64-encoded header:
>
> https://sourceforge.net/p/squirrelmail/code/HEAD/tree/branches/SM-1_4-STABLE/squirrelmail/functions/mime.php#l727
>
> Example of the affected header (spammer's email redacted):
>
> From:
> "=?windows-1251?B?1OXk5fDg6/zt4P8g8evz5uHgIO/uIPLw8+TzIOgg5+Dt//Lu8fLo?="
> <sp...@ex...>
>
>
>
> I believe this can be fixed by having charset_decode pass an encoding
> argument to sm_encode_html_special_chars: either a hardcoded
> "ISO-8859-1", which is likely to have all characters allowed, or the actual
> charset passed to charset_decode by its caller. Note, however, that
> htmlspecialchars appears to support fewer encodings than charset_decode
> does, so the first approach is probably better.
>
>
> Thanks for reading so far!
>
> Any thoughts?
>
> Alexey.
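
For reference, here is a minimal sketch of the two fixes proposed in the quoted message, combined: pass the caller's charset to htmlspecialchars when PHP is known to accept it, and fall back to ISO-8859-1 otherwise. The helper name pick_escape_charset and its lookup table are hypothetical and not part of SquirrelMail; the table lists only a few of the charsets documented for htmlspecialchars. The sketch assumes PHP 7 or later.

```php
<?php
// Hypothetical helper (not SquirrelMail code): map a MIME charset name to an
// encoding name that htmlspecialchars() is documented to accept. Anything
// unknown falls back to ISO-8859-1, a single-byte encoding with no invalid
// byte sequences.
function pick_escape_charset(string $charset): string
{
    // Deliberately incomplete table, for illustration only.
    $known = [
        'utf-8'        => 'UTF-8',
        'iso-8859-1'   => 'ISO-8859-1',
        'windows-1251' => 'cp1251',
        'cp1251'       => 'cp1251',
        'windows-1252' => 'cp1252',
        'cp1252'       => 'cp1252',
        'koi8-r'       => 'KOI8-R',
    ];
    return $known[strtolower($charset)] ?? 'ISO-8859-1';
}

// Same sample bytes as in the quoted message (a cp1251 string).
$str = base64_decode('1OXk5fDg6/zt4P8g8evz5uHgIO/uIPLw8+TzIOgg5+Dt//Lu8fLo');
$flags = ENT_COMPAT | ENT_SUBSTITUTE;

// Escaping as UTF-8 substitutes U+FFFD for the invalid byte sequences.
$utf8 = htmlspecialchars($str, $flags, 'UTF-8');

// Escaping with the header's own charset leaves the bytes untouched.
$cp1251 = htmlspecialchars($str, $flags, pick_escape_charset('windows-1251'));

// An unrecognised charset falls back to ISO-8859-1, which also leaves
// the bytes untouched.
$latin1 = htmlspecialchars($str, $flags, pick_escape_charset('x-unknown'));

var_dump($utf8 === $str);
var_dump($cp1251 === $str);
var_dump($latin1 === $str);
```

With the sample string, the first comparison should be false and the other two true, since the string contains no HTML special characters and a single-byte encoding has nothing to substitute. Whether the patch referenced above takes this approach is not something this sketch claims.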