Thread: [SM-DEVEL] htmlspecialchars borking non-utf8 encodings

Hi SquirrelMail devs,

Today, while investigating a decoding failure on a spam message, I found an
unexpected behaviour of the htmlspecialchars function which affects
SquirrelMail. Consider this sample code (you can paste it into the w3schools
PHP tryit editor if you don't want to run it on your machine):

<?php
$str=base64_decode('1OXk5fDg6/zt4P8g8evz5uHgIO/uIPLw8+TzIOgg5+Dt//Lu8fLo');
echo 'before: ' . bin2hex($str) . '<br>';
$esc=htmlspecialchars($str, ENT_COMPAT | ENT_SUBSTITUTE, 'utf-8');
echo 'after:  ' . bin2hex($esc) . '<br>';
?>

The "before" and "after" lines show hex representations of the same string
(each byte of the string is encoded as two hex characters) before and after
it gets processed by the htmlspecialchars function. And they are vastly
different:

before: d4e5e4e5f0e0ebfc...
after: efbfbdefbfbdefbfbd...

The base64-decoded string is a valid cp1251 string; you can see some
Cyrillic letters by adding this line to the code above:

echo 'actual: ' . iconv('cp1251', 'utf-8', $str) . '<br>';

There are no HTML special characters in that string. However, because of
the ENT_SUBSTITUTE flag and the last 'utf-8' argument, the htmlspecialchars
function replaces every byte sequence that is invalid in utf-8 with a
Unicode Replacement Character (U+FFFD, bytes ef bf bd in utf-8). And a valid
cp1251 string is full of such sequences!

To fix it in the code above, one should specify 'cp1251' instead of
'utf-8' as the last argument to the htmlspecialchars function.
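
A minimal sketch of that change, reusing the sample string from above. It
assumes the PHP build accepts 'cp1251' as a charset name for
htmlspecialchars (the PHP manual lists it); the variable names are only for
illustration:

```php
<?php
// Same cp1251 bytes as in the sample above.
$str = base64_decode('1OXk5fDg6/zt4P8g8evz5uHgIO/uIPLw8+TzIOgg5+Dt//Lu8fLo');

// Declared as utf-8: invalid sequences are replaced with U+FFFD (ef bf bd).
$as_utf8 = htmlspecialchars($str, ENT_COMPAT | ENT_SUBSTITUTE, 'utf-8');

// Declared as cp1251: a single-byte charset, so no byte sequence is rejected
// and only the ASCII special characters (none here) would be rewritten.
$as_cp1251 = htmlspecialchars($str, ENT_COMPAT | ENT_SUBSTITUTE, 'cp1251');

echo 'utf-8:  ' . bin2hex($as_utf8) . "\n";
echo 'cp1251: ' . bin2hex($as_cp1251) . "\n";
echo 'unchanged with cp1251: ' . var_export($as_cp1251 === $str, true) . "\n";
```

With 'cp1251' the escaped string is byte-for-byte identical to the input,
which is what one wants for a string without HTML special characters.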

How does this affect SquirrelMail?

In version 1.4, the charset_decode function calls
sm_encode_html_special_chars without passing any character encoding to it:

https://sourceforge.net/p/squirrelmail/code/HEAD/tree/branches/SM-1_4-STABLE/squirrelmail/functions/i18n.php#l187

which makes the sm_encode_html_special_chars function default to
$default_encoding, which is, I believe, utf-8 in most cases, and which later
gets passed to htmlspecialchars:

https://sourceforge.net/p/squirrelmail/code/HEAD/tree/branches/SM-1_4-STABLE/squirrelmail/functions/strings.php#l1559

And charset_decode is called by decodeHeader when it encounters a
base64-encoded header:

https://sourceforge.net/p/squirrelmail/code/HEAD/tree/branches/SM-1_4-STABLE/squirrelmail/functions/mime.php#l727

Example of the affected header (spammer's email redacted):

From:
"=?windows-1251?B?1OXk5fDg6/zt4P8g8evz5uHgIO/uIPLw8+TzIOgg5+Dt//Lu8fLo?="
<sp...@ex...>
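
For reference, the encoded-word in that header can be unpacked by hand. This
is a minimal sketch and not SquirrelMail's decodeHeader: the regular
expression only handles a single base64 ("B") encoded-word, the address part
of the header is left out, and the variable names are my own:

```php
<?php
$header = '=?windows-1251?B?1OXk5fDg6/zt4P8g8evz5uHgIO/uIPLw8+TzIOgg5+Dt//Lu8fLo?=';

$charset = null;
$raw = null;

// =?charset?B?payload?=  (RFC 2047 encoded-word, base64 variant only)
if (preg_match('/^=\?([^?]+)\?B\?([^?]+)\?=$/i', $header, $m)) {
    $charset = $m[1];             // charset declared by the sender
    $raw = base64_decode($m[2]);  // raw bytes in that charset

    echo 'charset:   ' . $charset . "\n";
    echo 'raw bytes: ' . bin2hex($raw) . "\n";
    // Requires the iconv extension, as in the 'actual:' line earlier.
    echo 'as utf-8:  ' . iconv($charset, 'utf-8', $raw) . "\n";
}
```

The charset needed to interpret the bytes is right there in the header; it
just never reaches htmlspecialchars.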

I believe this can be fixed by having charset_decode pass an encoding
argument to sm_encode_html_special_chars: either a hardcoded "ISO-8859-1",
which should accept any byte value, or the actual charset passed to
charset_decode by its caller. Note, however, that htmlspecialchars appears
to support fewer encodings than charset_decode does, so the first approach
is probably better.
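
The two options could also be combined: use the message charset when
htmlspecialchars is known to support it, and fall back to ISO-8859-1
otherwise. A rough sketch of the idea follows. The function name
html_charset_for and the list of supported names are my own assumptions (the
list is a subset of what the PHP manual documents), not existing
SquirrelMail code:

```php
<?php
// Hypothetical helper: choose the charset name to hand to htmlspecialchars.
function html_charset_for($charset)
{
    // Assumed subset of the charsets documented for htmlspecialchars.
    $supported = array(
        'utf-8', 'iso-8859-1', 'iso-8859-15',
        'cp1251', 'windows-1251', 'cp1252', 'windows-1252', 'koi8-r',
    );
    $charset = strtolower($charset);
    // Anything not in the list falls back to ISO-8859-1.
    return in_array($charset, $supported, true) ? $charset : 'iso-8859-1';
}

$str = base64_decode('1OXk5fDg6/zt4P8g8evz5uHgIO/uIPLw8+TzIOgg5+Dt//Lu8fLo');

// Charset taken from the header is supported, so it is used as is.
$esc = htmlspecialchars($str, ENT_COMPAT | ENT_SUBSTITUTE,
                        html_charset_for('Windows-1251'));
echo 'escaped bytes unchanged: ' . var_export($esc === $str, true) . "\n";

// A charset outside the list falls back to ISO-8859-1.
echo 'iso-8859-2 maps to: ' . html_charset_for('iso-8859-2') . "\n";
```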

Thanks for reading this far!

Any thoughts?

Alexey.
