Koolwired.Imap / Discussion / Help: "bugs" decoding Non-ASCII headers

I made a couple of simple bug fixes:

1)
<<
   An encoded-word may not be more than 75 characters long, including
   charset, encoding, encoded-text, and delimiters. If it is desirable
   to encode more text than will fit in an encoded-word of 75
   characters, multiple encoded-words (separated by CRLF SPACE) may be
   used.
>> (see: http://www.faqs.org/rfcs/rfc1522.html)

When a header contains multi-line encoded text, the input parameter of ImapDecode.Decode(string input) has a SPACE (but no CRLF) between the encoded-words. That space is only a separator and is not supposed to appear in the decoded string.

Therefore I added the following lines of code:

if (matches.Count > 1)
ret = ret.Replace("?= =?", "?==?");

There is still a slight chance of false positives with a mixed encoded/unencoded header (I have hardly seen any) whose unencoded text happens to contain the substring "?= =?", but that is probably a risk we can take.

The modified function is:

        internal static string Decode(string input)
        {
            if (input == "" || input == null)
                return "";

            // Matches one encoded-word: =?charset?method?text?=
            Regex regex = new Regex(@"=\?(?<Encoding>[^\?]+)\?(?<Method>[^\?]+)\?(?<Text>[^\?]+)\?=");
            MatchCollection matches = regex.Matches(input);

            string ret = input;

            //added lines
            if (matches.Count > 1)
                ret = ret.Replace("?= =?", "?==?");

            foreach (Match match in matches)
            {
                string encoding = match.Groups["Encoding"].Value;
                string method = match.Groups["Method"].Value;
                string text = match.Groups["Text"].Value;
                string decoded;
                // The method letter is case-insensitive: "B"/"b" is Base64, anything else is treated as "Q".
                if (method == "B" || method == "b")
                {
                    byte[] bytes = Convert.FromBase64String(text);
                    Encoding enc = Encoding.GetEncoding(encoding);
                    decoded = enc.GetString(bytes);
                }
                else
                    decoded = Decode(text, Encoding.GetEncoding(encoding));
                // Swap the whole encoded-word for its decoded text.
                ret = ret.Replace(match.Groups[0].Value, decoded);
            }
            return ret;
        }
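
As a variation on the Replace("?= =?", "?==?") fix, here is a minimal sketch that removes the separator only when what follows really looks like the start of another encoded-word, and that also copes with tabs or a folded CRLF. The class and method names (EncodedWordWhitespace, Collapse) are made up for illustration and are not part of Koolwired.Imap. It narrows the false-positive case but does not eliminate it, since the "?=" on the left could still come from unencoded text.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of Koolwired.Imap.
internal static class EncodedWordWhitespace
{
    // Whitespace that sits between the end of one encoded-word ("?=")
    // and the start of the next one ("=?charset?B?" or "=?charset?Q?").
    private static readonly Regex Gap = new Regex(
        @"(?<=\?=)[ \t\r\n]+(?==\?[^?\s]+\?[BbQq]\?)");

    // Removes the separator whitespace between adjacent encoded-words
    // and leaves every other character of the header untouched.
    internal static string Collapse(string header)
    {
        if (string.IsNullOrEmpty(header)) return string.Empty;
        return Gap.Replace(header, string.Empty);
    }
}
```

Calling Collapse on the header before the foreach loop would take the place of the two added lines; a plain-text "?= =?" that is not followed by "charset?B?" or "charset?Q?" is left alone.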

2)
<<
       The 8-bit hexadecimal value 20 (e.g., ISO-8859-1 SPACE) may be
       represented as "_" (underscore, ASCII 95.). (This character may
       not pass through some internetwork mail gateways, but its use
       will greatly enhance readability of "Q" encoded data with mail
       readers that do not support this encoding.) Note that the "_"
       always represents hexadecimal 20, even if the SPACE character
       occupies a different code position in the character set in use.
>> (see: http://www.faqs.org/rfcs/rfc1522.html)

In this case, I simply replaced the '_' character with a space in ImapDecode.Decode(string input, Encoding enc):

        internal static string Decode(string input, Encoding enc)
        {
            if (input == "" || input == null)
                return "";
            string decoded;
            byte[] bytes;

            //added line
            input = input.Replace("_", " ");
            MatchCollection matches = Regex.Matches(input, @"\=(?<num>[0-9A-Fa-f]{2})");// Substring(input.IndexOf('=') + 1, 2);

            foreach (Match match in matches) //while (input.Contains("="))
            {
                //string ttr = Regex.Match("input", @"=(?<num>[0-9A-Fa-f]{2})").Groups[num].Substring(input.IndexOf('=') + 1, 2);
                //int i = int.Parse(ttr, System.Globalization.NumberStyles.HexNumber);
                int i = int.Parse(match.Groups["num"].Value, System.Globalization.NumberStyles.HexNumber);
                char str = (char)i;
                input = input.Replace(match.Groups[0].Value, str.ToString());
            }
            bytes = System.Text.Encoding.Default.GetBytes(input);
            decoded = enc.GetString(bytes);
            return decoded;
        }

Great library!

Ciao
Stefano

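This code will still fail to decode non-ASCII characters.
The problem is that
char str = (char)i;
followed by
bytes = System.Text.Encoding.Default.GetBytes(input);
does not always give back the original byte value "i". The (char) cast maps the value straight to the Unicode code point with the same number, while GetBytes() re-encodes that character in the system default encoding, so values above 0x7F can come back as a different byte or as more than one byte.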
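
To make the mismatch concrete, here is a small sketch of the same cast-then-encode round trip. RoundTripDemo and CastThenEncode are names invented for this example. It passes Encoding.UTF8 and Encoding.ASCII explicitly rather than Encoding.Default, because Encoding.Default depends on the platform (the system ANSI code page on .NET Framework, UTF-8 on .NET Core and later).

```csharp
using System.Text;

// Hypothetical demo class; not part of Koolwired.Imap.
internal static class RoundTripDemo
{
    // Casts a byte value to char (the Unicode code point with that number)
    // and then encodes the resulting character with the given encoding.
    internal static byte[] CastThenEncode(byte value, Encoding enc)
    {
        char c = (char)value;
        return enc.GetBytes(c.ToString());
    }
}
```

CastThenEncode(0xE9, Encoding.UTF8) returns the two bytes C3 A9 and CastThenEncode(0xE9, Encoding.ASCII) returns the single byte 3F ('?'), so in neither case does the original E9 survive. Only values up to 0x7F, such as 0x41, come back unchanged.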

I would write the procedure as follows:

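        internal static string Decode(string input, Encoding enc)
        {
            if (string.IsNullOrEmpty(input)) return string.Empty;

            char[] chars = input.ToCharArray();
            // The decoded data is never longer than the encoded text.
            byte[] bytes = new byte[chars.Length];

            // i indexes the encoded chars, j the decoded bytes.
            int j = 0;
            for (int i = 0; i < chars.Length; i++, j++)
            {
                if (chars[i] == '=')
                {
                    // "=XX": parse the two hex digits straight into the byte buffer.
                    i++;
                    if (chars.Length >= i + 2 &&
                        byte.TryParse(new string(chars, i, 2), System.Globalization.NumberStyles.HexNumber, System.Globalization.CultureInfo.InvariantCulture, out bytes[j]))
                        i++;
                    else
                        j--; // malformed escape: no byte is emitted for it
                }
                else if (chars[i] == '_')
                    bytes[j] = (byte)' '; // "_" always stands for 0x20
                else
                    bytes[j] = (byte)chars[i];
            }
            // Convert the raw bytes with the declared charset only once, at the end.
            return new string(enc.GetChars(bytes, 0, j));
        }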
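
The procedure above drops the '=' and the character after it when an escape is malformed. If you would rather keep such text as it is, a variant along these lines might do. This is only a sketch: QDecodeSketch is a hypothetical name, and it assumes the encoded text itself is plain ASCII, as it should be inside an encoded-word.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical variant for illustration; not the library's code.
internal static class QDecodeSketch
{
    internal static string Decode(string input, Encoding enc)
    {
        if (string.IsNullOrEmpty(input)) return string.Empty;

        using (var buffer = new MemoryStream(input.Length))
        {
            for (int i = 0; i < input.Length; i++)
            {
                char c = input[i];
                if (c == '=' && i + 2 < input.Length
                    && Uri.IsHexDigit(input[i + 1]) && Uri.IsHexDigit(input[i + 2]))
                {
                    // Well-formed "=XX": store the raw byte and skip both digits.
                    buffer.WriteByte(Convert.ToByte(input.Substring(i + 1, 2), 16));
                    i += 2;
                }
                else if (c == '_')
                    buffer.WriteByte(0x20); // "_" always stands for 0x20
                else
                    buffer.WriteByte((byte)c); // assumes ASCII; a malformed '=' is kept as is
            }
            // Apply the declared charset once, to the collected bytes.
            return enc.GetString(buffer.ToArray());
        }
    }
}
```

With Encoding.GetEncoding("ISO-8859-1"), the input "caf=E9_au_lait" decodes to "café au lait"; with Encoding.UTF8, "=C3=A9" decodes to "é", and a malformed "a=ZZb" is returned unchanged.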