I made a couple of simple bug fixes:
1)
<<
An encoded-word may not be more than 75 characters long, including
charset, encoding, encoded-text, and delimiters. If it is desirable
to encode more text than will fit in an encoded-word of 75
characters, multiple encoded-words (separated by CRLF SPACE) may be
used.
>> (see: http://www.faqs.org/rfcs/rfc1522.html)
When parsing a header with multiline encoded text, the input parameter of ImapDecode.Decode(string input) contains a SPACE (but no CRLF) between the encoded-words. That space must not appear in the decoded string.
Therefore I added the following lines of code:
    if (matches.Count > 1)
        ret = ret.Replace("?= =?", "?==?");
There is still a slight chance of false positives: a mixed encoded/unencoded header (I found hardly any) could happen to contain the "?= =?" substring in its unencoded text. That is probably a risk we can take.
The modified function is:
internal static string Decode(string input)
{
    if (string.IsNullOrEmpty(input))
        return "";
    Regex regex = new Regex(@"=\?(?<Encoding>[^\?]+)\?(?<Method>[^\?]+)\?(?<Text>[^\?]+)\?=");
    MatchCollection matches = regex.Matches(input);
    string ret = input;
    //added lines: drop the single space that separates adjacent encoded-words
    if (matches.Count > 1)
        ret = ret.Replace("?= =?", "?==?");
    foreach (Match match in matches)
    {
        string encoding = match.Groups["Encoding"].Value;
        string method = match.Groups["Method"].Value;
        string text = match.Groups["Text"].Value;
        string decoded;
        // RFC 1522: encoding names are case-independent ("B" or "b")
        if (string.Equals(method, "B", StringComparison.OrdinalIgnoreCase))
        {
            byte[] bytes = Convert.FromBase64String(text);
            Encoding enc = Encoding.GetEncoding(encoding);
            decoded = enc.GetString(bytes);
        }
        else
            decoded = Decode(text, Encoding.GetEncoding(encoding));
        ret = ret.Replace(match.Groups[0].Value, decoded);
    }
    return ret;
}
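The false-positive risk mentioned above could probably be narrowed by removing whitespace only when it sits between two complete encoded-words. The following is a minimal sketch of that idea, not part of the library: the class name EncodedWordWhitespace and the method Collapse are hypothetical, and the pattern assumes that an encoded-word contains neither '?' nor whitespace inside its three fields.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the library discussed in this thread.
public static class EncodedWordWhitespace
{
    // One complete encoded-word: =?charset?encoding?encoded-text?=
    // Assumption: no '?' and no whitespace inside the three fields.
    private const string Word = @"=\?[^?\s]+\?[^?\s]+\?[^?\s]+\?=";

    // An encoded-word followed by whitespace, matched only when another
    // complete encoded-word starts right after that whitespace.
    private static readonly Regex Gap =
        new Regex("(" + Word + @")\s+(?=" + Word + ")");

    // Removes the separating whitespace between adjacent encoded-words
    // and leaves all other text untouched.
    public static string Collapse(string header)
    {
        if (string.IsNullOrEmpty(header)) return string.Empty;
        return Gap.Replace(header, "$1");
    }
}
```

With this sketch, Collapse("=?iso-8859-1?Q?a?= =?iso-8859-1?Q?b?=") joins the two encoded-words, while a plain string such as "plain ?= =? text" is returned unchanged because neither side of the space is a complete encoded-word.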
2)
<<
The 8-bit hexadecimal value 20 (e.g., ISO-8859-1 SPACE) may be
represented as "_" (underscore, ASCII 95.). (This character may
not pass through some internetwork mail gateways, but its use
will greatly enhance readability of "Q" encoded data with mail
readers that do not support this encoding.) Note that the "_"
always represents hexadecimal 20, even if the SPACE character
occupies a different code position in the character set in use.
>> (see: http://www.faqs.org/rfcs/rfc1522.html)
In this case, I simply replaced the '_' character with a space in ImapDecode.Decode(string input, Encoding enc):
internal static string Decode(string input, Encoding enc)
{
    if (string.IsNullOrEmpty(input))
        return "";
    string decoded;
    byte[] bytes;
    //added line
    input = input.Replace("_", " ");
    MatchCollection matches = Regex.Matches(input, @"\=(?<num>[0-9A-Fa-f]{2})");
    foreach (Match match in matches)
    {
        int i = int.Parse(match.Groups["num"].Value, System.Globalization.NumberStyles.HexNumber);
        char str = (char)i;
        input = input.Replace(match.Groups[0].Value, str.ToString());
    }
    bytes = System.Text.Encoding.Default.GetBytes(input);
    decoded = enc.GetString(bytes);
    return decoded;
}
Great library!
Ciao
Stefano
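The code will still fail to decode non-ASCII characters.
The problem is that
    char str = (char)i;
    bytes = System.Text.Encoding.Default.GetBytes(input);
will not always convert back to the byte value "i": the (char) cast maps the value to a Unicode code point, while GetBytes() re-encodes that character with the system default encoding, so the two conversions are not inverses of each other.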
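To make the mismatch concrete, here is a small sketch that repeats the cast-then-encode sequence for a single byte. The class RoundTripDemo and the method CastThenEncode are hypothetical names used only for illustration, and fixed encodings are passed in because the result of Encoding.Default depends on the runtime and the system settings.

```csharp
using System;
using System.Text;

// Hypothetical demo class, not part of the library discussed in this thread.
public static class RoundTripDemo
{
    // Casts a byte value to char, then encodes that char back to bytes,
    // mirroring the (char)i / GetBytes() sequence quoted above.
    public static byte[] CastThenEncode(byte value, Encoding encoding)
    {
        char c = (char)value;                   // value becomes code point U+00xx
        return encoding.GetBytes(c.ToString()); // re-encoded, not copied back
    }
}
```

For the byte 0xE9, CastThenEncode returns the two bytes C3 A9 with Encoding.UTF8, the single byte 3F ('?') with Encoding.ASCII, and the original E9 only with Encoding.GetEncoding("iso-8859-1"). The round trip therefore survives only when the default encoding happens to map code points 0 to 255 onto the same byte values.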
I would write the procedure as follows:
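internal static string Decode(string input, Encoding enc)
{
    if (string.IsNullOrEmpty(input)) return string.Empty;
    char[] chars = input.ToCharArray();
    // The output never needs more bytes than the input has characters.
    byte[] bytes = new byte[chars.Length];
    int j = 0;
    for (int i = 0; i < chars.Length; i++, j++)
    {
        if (chars[i] == '=')
        {
            // "=XX" escape: parse the two hex digits after '=' into one byte.
            i++;
            if (chars.Length >= i + 2 &&
                byte.TryParse(new string(chars, i, 2), System.Globalization.NumberStyles.HexNumber, System.Globalization.CultureInfo.InvariantCulture, out bytes[j]))
                i++; // both hex digits consumed
            else
                j--; // malformed escape: emit no byte (cancels the loop's j++)
        }
        else if (chars[i] == '_')
            bytes[j] = (byte)' '; // "_" always stands for hexadecimal 20
        else
            bytes[j] = (byte)chars[i];
    }
    // Decode the collected raw bytes with the charset named in the encoded-word.
    return new string(enc.GetChars(bytes, 0, j));
}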
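For anyone who wants to try the byte-based approach in isolation, below is a compact variant written as a standalone sketch. It is not the library's code: the class name QDecodeSketch is hypothetical, it collects bytes in a MemoryStream instead of a preallocated array, and it makes one different choice that is purely an assumption of this sketch, namely that a malformed "=" escape is kept as a literal '=' instead of being dropped.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical standalone variant, not the library's implementation.
public static class QDecodeSketch
{
    public static string Decode(string input, Encoding enc)
    {
        if (string.IsNullOrEmpty(input)) return string.Empty;
        using (var buffer = new MemoryStream(input.Length))
        {
            for (int i = 0; i < input.Length; i++)
            {
                char c = input[i];
                if (c == '=' && i + 2 < input.Length
                    && Uri.IsHexDigit(input[i + 1]) && Uri.IsHexDigit(input[i + 2]))
                {
                    // "=XX": store the raw byte value, no char round trip.
                    buffer.WriteByte(Convert.ToByte(input.Substring(i + 1, 2), 16));
                    i += 2;
                }
                else if (c == '_')
                    buffer.WriteByte(0x20); // "_" always means hexadecimal 20
                else
                    buffer.WriteByte((byte)c); // assumes ASCII input; also keeps a stray '='
            }
            // Interpret the raw bytes with the charset of the encoded-word.
            return enc.GetString(buffer.ToArray());
        }
    }
}
```

As a quick check, Decode("Caf=E9_au_lait", Encoding.GetEncoding("iso-8859-1")) gives "Café au lait", and Decode("=C3=A9", Encoding.UTF8) gives "é", which is the non-ASCII case the earlier version could not handle reliably.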
Well done, thank you.
Thanks, the changes you guys made should be in the current version.