From: Alexey S. <al...@sh...> - 2021-05-27 00:08:48
Hi SquirrelMail devs,

While investigating a decoding failure in a spam message today, I found an unexpected behaviour of the htmlspecialchars function which affects SquirrelMail. Consider this sample code (you can paste it into the w3schools PHP tryit editor if you don't want to run it on your machine):

<?php
$str=base64_decode('1OXk5fDg6/zt4P8g8evz5uHgIO/uIPLw8+TzIOgg5+Dt//Lu8fLo');
echo 'before: ' . bin2hex($str) . '<br>';
$esc=htmlspecialchars($str, ENT_COMPAT | ENT_SUBSTITUTE, 'utf-8');
echo 'after: ' . bin2hex($esc) . '<br>';
?>

The "before" and "after" lines show hex representations of the string (two hex characters per byte) before and after it is processed by htmlspecialchars. They are vastly different:

before: d4e5e4e5f0e0ebfc...
after: efbfbdefbfbdefbfbd...

The base64-decoded string is a valid cp1251 string; you can see the Cyrillic letters by adding this line to the code above:

echo 'actual: ' . iconv('cp1251', 'utf-8', $str) . '<br>';

There are no HTML special characters in that string. However, because of the ENT_SUBSTITUTE flag and the 'utf-8' last argument, htmlspecialchars replaces every byte sequence that is invalid in UTF-8 with the Unicode Replacement Character (bytes ef bf bd in the output above). And a valid cp1251 string is full of such sequences!

To fix it in the code above, specify 'cp1251' instead of 'utf-8' as the last argument to htmlspecialchars.

How does this affect SquirrelMail?

In version 1.4, the function charset_decode calls sm_encode_html_special_chars without passing any character encoding to it:

https://sourceforge.net/p/squirrelmail/code/HEAD/tree/branches/SM-1_4-STABLE/squirrelmail/functions/i18n.php#l187

This makes sm_encode_html_special_chars default to $default_encoding, which is, I believe, utf-8 in most cases, and which later gets passed to htmlspecialchars:

https://sourceforge.net/p/squirrelmail/code/HEAD/tree/branches/SM-1_4-STABLE/squirrelmail/functions/strings.php#l1559

And charset_decode is called by decodeHeader when it encounters a base64-encoded header:

https://sourceforge.net/p/squirrelmail/code/HEAD/tree/branches/SM-1_4-STABLE/squirrelmail/functions/mime.php#l727

Example of an affected header (spammer's email redacted):

From: "=?windows-1251?B?1OXk5fDg6/zt4P8g8evz5uHgIO/uIPLw8+TzIOgg5+Dt//Lu8fLo?=" <sp...@ex...>

I believe this can be fixed by having charset_decode pass an encoding argument to sm_encode_html_special_chars: either a hardcoded "ISO-8859-1", which is likely to accept all bytes, or the actual charset passed to charset_decode by its caller. Note, however, that htmlspecialchars appears to support fewer encodings than charset_decode does, so the first approach is probably better.

Thanks for reading so far!

Any thoughts?

Alexey.
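
The fix proposed in the message above could be sketched roughly as follows. This is only an illustration, not SquirrelMail code: the helper name escape_for_charset and the short charset map are assumptions made for the example (the PHP manual has the full list of encodings htmlspecialchars accepts). It combines both suggested approaches: pass the real charset when htmlspecialchars knows it, otherwise fall back to ISO-8859-1.

```php
<?php
// Hypothetical helper (not part of SquirrelMail): escape HTML special
// characters without damaging bytes of a non-UTF-8 string.
function escape_for_charset($str, $charset)
{
    // Assumption: a deliberately partial map from lowercase charset names
    // to names that htmlspecialchars() accepts.
    $supported = array(
        'utf-8'        => 'UTF-8',
        'iso-8859-1'   => 'ISO-8859-1',
        'cp1251'       => 'cp1251',
        'windows-1251' => 'cp1251',
        'koi8-r'       => 'KOI8-R',
    );

    $key = strtolower($charset);
    // Unknown charset: fall back to the single-byte ISO-8859-1, in which
    // no byte sequence is considered invalid.
    $encoding = isset($supported[$key]) ? $supported[$key] : 'ISO-8859-1';

    return htmlspecialchars($str, ENT_COMPAT | ENT_SUBSTITUTE, $encoding);
}

$str = base64_decode('1OXk5fDg6/zt4P8g8evz5uHgIO/uIPLw8+TzIOgg5+Dt//Lu8fLo');

// With the charset taken from the header, the bytes pass through untouched.
$esc = escape_for_charset($str, 'windows-1251');
echo 'cp1251: ' . ($esc === $str ? 'unchanged' : 'modified') . "\n";

// With 'utf-8' the cp1251 bytes are replaced, as described in the report.
$bad = htmlspecialchars($str, ENT_COMPAT | ENT_SUBSTITUTE, 'UTF-8');
echo 'utf-8:  ' . ($bad === $str ? 'unchanged' : 'modified') . "\n";
```

The fallback keeps the escaping of &, <, > and " intact while leaving high bytes alone, which is why the hardcoded "ISO-8859-1" approach works even for charsets that htmlspecialchars does not know.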
From: Paul L. <pa...@sq...> - 2021-05-27 01:32:08
On Wed, May 26, 2021 11:50 pm, Alexey Shpakovsky via squirrelmail-devel wrote:
> Hi SquirrelMail devs,
>
> Today investigating a decoding failure on one of spam messages I found an
> unexpected behaviour of htmlspecialchars function which affects
> SquirrelMail.

I didn't read your mail in detail, because I believe this issue is solved by the patch you can find here:

https://sourceforge.net/p/squirrelmail/bugs/2806/?page=3

Patch for 1.4.x: quoted_printable_fix-1.4.x-version_3.diff

If not, please make sure you're using the newest 1.4.23-svn snapshot with this patch.

Cheers,

-- 
Paul Lesniewski
SquirrelMail Team
Please support Open Source Software by donating to SquirrelMail!
http://squirrelmail.org/donate_paul_lesniewski.php
From: Alexey S. <al...@sh...> - 2021-05-27 15:10:43
Thanks!

Yes, that patch indeed solves my issue. I can confirm that it applies cleanly on top of today's stable version snapshot (1.4.23-svn), and the issue is gone.

I only wonder why it hasn't been added to SVN :) But it's not a problem for me - I can just use the patch file.

Also, there is a small typo in the patch file: the $htmlspecialchars_charsets array has 'koi8-R' with an uppercase R, while the code below it expects all values in this array to be lowercase.

And thanks for all your hard work on SquirrelMail!

Alexey.
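
To illustrate why the 'koi8-R' entry matters, here is a minimal sketch. It assumes the patched code lowercases the charset name and then looks it up with in_array; the array contents below are made up for the example, and only the variable name comes from the patch.

```php
<?php
// Hypothetical excerpt: the real array in the patch may hold other entries.
$htmlspecialchars_charsets = array('utf-8', 'cp1251', 'koi8-R');

// Assumption: the charset name from the message is lowercased before lookup.
$charset = strtolower('KOI8-R');

// in_array() compares strings case-sensitively, so 'koi8-r' does not
// match the mixed-case entry 'koi8-R'.
$found = in_array($charset, $htmlspecialchars_charsets, true);
var_dump($found);

// Lowercasing the entries makes the same lookup succeed.
$normalized = array_map('strtolower', $htmlspecialchars_charsets);
$found_normalized = in_array($charset, $normalized, true);
var_dump($found_normalized);
```

With the mixed-case entry, a KOI8-R message would presumably be treated as an unsupported charset, so spelling the entry 'koi8-r' in the array is the simpler fix.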