From: Alexey S. <al...@sh...> - 2021-05-27 00:08:48
Hi SquirrelMail devs,

Today, while investigating a decoding failure on a spam message, I found unexpected behaviour in the htmlspecialchars function that affects SquirrelMail. Consider this sample code (you can paste it into the w3schools PHP tryit editor if you don't want to run it on your machine):

    <?php
    $str=base64_decode('1OXk5fDg6/zt4P8g8evz5uHgIO/uIPLw8+TzIOgg5+Dt//Lu8fLo');
    echo 'before: ' . bin2hex($str) . '<br>';
    $esc=htmlspecialchars($str, ENT_COMPAT | ENT_SUBSTITUTE, 'utf-8');
    echo 'after: ' . bin2hex($esc) . '<br>';
    ?>

The "before" and "after" lines show hex representations of the same string (each byte is encoded as two hex characters) before and after it is processed by htmlspecialchars. They are vastly different:

    before: d4e5e4e5f0e0ebfc...
    after: efbfbdefbfbdefbfbd...

The base64-decoded string is a valid cp1251 string; you can see the Cyrillic letters by adding this line to the code above:

    echo 'actual: ' . iconv('cp1251', 'utf-8', $str) . '<br>';

There are no HTML special characters in that string. However, because of the ENT_SUBSTITUTE flag and the final 'utf-8' argument, htmlspecialchars replaces every byte sequence that is invalid in UTF-8 with the Unicode replacement character (U+FFFD, which is ef bf bd in UTF-8). And a valid cp1251 string is full of such sequences! To fix the code above, specify 'cp1251' instead of 'utf-8' as the last argument to htmlspecialchars.

How does this affect SquirrelMail?
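As an aside before the SquirrelMail specifics, the fix for the sample above can be checked with a minimal sketch like the following. It runs the same bytes through htmlspecialchars twice, once declared as 'utf-8' and once as 'cp1251', and reports whether the bytes changed. The variable names are only illustrative.

```php
<?php
// Same cp1251 sample bytes as in the code above.
$str = base64_decode('1OXk5fDg6/zt4P8g8evz5uHgIO/uIPLw8+TzIOgg5+Dt//Lu8fLo');

// Declared as UTF-8: invalid sequences are substituted with U+FFFD.
$as_utf8 = htmlspecialchars($str, ENT_COMPAT | ENT_SUBSTITUTE, 'utf-8');

// Declared as cp1251: the string has nothing to escape.
$as_cp1251 = htmlspecialchars($str, ENT_COMPAT | ENT_SUBSTITUTE, 'cp1251');

// Report whether each call altered the original bytes.
echo 'utf-8 changed:  ' . var_export($as_utf8 !== $str, true) . "\n";
echo 'cp1251 changed: ' . var_export($as_cp1251 !== $str, true) . "\n";
```

With 'cp1251' the bytes should come back untouched, since the string contains none of the characters htmlspecialchars escapes.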
In version 1.4, charset_decode calls sm_encode_html_special_chars without passing any character encoding to it:

https://sourceforge.net/p/squirrelmail/code/HEAD/tree/branches/SM-1_4-STABLE/squirrelmail/functions/i18n.php#l187

This makes sm_encode_html_special_chars default to $default_encoding (which, I believe, is utf-8 in most cases), and that value is later passed to htmlspecialchars:

https://sourceforge.net/p/squirrelmail/code/HEAD/tree/branches/SM-1_4-STABLE/squirrelmail/functions/strings.php#l1559

charset_decode, in turn, is called by decodeHeader when it encounters a base64-encoded header:

https://sourceforge.net/p/squirrelmail/code/HEAD/tree/branches/SM-1_4-STABLE/squirrelmail/functions/mime.php#l727

Example of an affected header (spammer's email redacted):

    From: "=?windows-1251?B?1OXk5fDg6/zt4P8g8evz5uHgIO/uIPLw8+TzIOgg5+Dt//Lu8fLo?=" <sp...@ex...>

I believe this can be fixed by having charset_decode pass an encoding argument to sm_encode_html_special_chars. That could be either a hardcoded "ISO-8859-1", which is likely to have all characters allowed, or the actual charset passed to charset_decode by its caller. Note, however, that htmlspecialchars appears to support fewer encodings than charset_decode does, so the first approach is probably better.

Thanks for reading this far! Any thoughts?

Alexey.
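For discussion, here is a rough sketch of how the two options could be combined: use the header's own charset when htmlspecialchars is known to accept it, and fall back to ISO-8859-1 otherwise. The helper pick_escape_encoding is hypothetical and not part of SquirrelMail, and the charset list is only an illustrative subset of the names the PHP manual documents for htmlspecialchars.

```php
<?php
// Hypothetical helper, not part of SquirrelMail: choose the encoding name
// to hand to htmlspecialchars() for a header declared in $charset.
function pick_escape_encoding($charset)
{
    // Illustrative subset of charset names documented for htmlspecialchars().
    $supported = array(
        'utf-8', 'iso-8859-1', 'iso-8859-5', 'iso-8859-15',
        'cp1251', 'windows-1251', 'cp1252', 'windows-1252', 'koi8-r',
    );
    $name = strtolower(trim($charset));
    // Unknown charsets fall back to the hardcoded single-byte default.
    return in_array($name, $supported, true) ? $name : 'iso-8859-1';
}

// Same sample bytes as the affected header, declared as windows-1251.
$raw = base64_decode('1OXk5fDg6/zt4P8g8evz5uHgIO/uIPLw8+TzIOgg5+Dt//Lu8fLo');
$enc = pick_escape_encoding('windows-1251');
$escaped = htmlspecialchars($raw, ENT_COMPAT | ENT_SUBSTITUTE, $enc);

echo $enc . ': ' . bin2hex($escaped) . "\n";
```

Whether something like this belongs in charset_decode or in sm_encode_html_special_chars is an open question; the sketch only shows the selection logic.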