
#1676 WM detects invalid UTF-8 sequence as UTF-8

Trunk
closed-fixed
None
5
2008-02-24
2008-02-24
No

When comparing the attached file, WinMerge detects it as UTF-8, but the file is actually CP-932.

I know nobody can make a perfect UTF-8 detection system. But it is possible to detect this file as non-UTF-8 because it contains invalid UTF-8 sequences.

For example, the first double-byte character in the attached file is 0x8376, which is an invalid UTF-8 sequence: 0x83 (10000011) is a continuation byte and cannot start a sequence.

[cp-932.txt]
0000000: 8376 838d 8367 835e 8343 8376 90e9 8cbe
0000010: 0d0a 6161 6161 6161 6161 6161 6161 6161
0000020: 6161 6161 610d 0a0d 0a0d 0a

[UTF-8 bit pattern]
0xxxxxxx (00-7f)
110xxxxx 10xxxxxx (c0-df)(80-bf)
1110xxxx 10xxxxxx 10xxxxxx (e0-ef)(80-bf)(80-bf)
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (f0-f7)(80-bf)(80-bf)(80-bf)
111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx (f8-fb)(80-bf)(80-bf)(80-bf)(80-bf)
1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx (fc-fd)(80-bf)(80-bf)(80-bf)(80-bf)(80-bf)
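
The bit patterns above can be turned into a structural check. The following is only a minimal sketch, not the WinMerge code: the names Utf8SequenceLength and HasInvalidUtf8 are hypothetical, and it makes two assumptions of its own. It accepts only the 1- to 4-byte forms (the current UTF-8 definition, RFC 3629, no longer allows the 5- and 6-byte forms), and it treats a sequence cut off at the end of the buffer as invalid. It does not reject overlong encodings or surrogates.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not part of WinMerge): expected sequence length for a
// lead byte, or 0 if the byte cannot start a sequence in this sketch.
static std::size_t Utf8SequenceLength(std::uint8_t b)
{
    if ((b & 0x80) == 0x00) return 1; // 0xxxxxxx
    if ((b & 0xE0) == 0xC0) return 2; // 110xxxxx
    if ((b & 0xF0) == 0xE0) return 3; // 1110xxxx
    if ((b & 0xF8) == 0xF0) return 4; // 11110xxx
    return 0; // continuation byte (10xxxxxx) or a 5-/6-byte lead byte
}

// Returns true if the buffer contains a sequence that does not match the
// bit patterns, false if every sequence is structurally well formed.
bool HasInvalidUtf8(const std::uint8_t* buf, std::size_t size)
{
    std::size_t i = 0;
    while (i < size)
    {
        const std::size_t len = Utf8SequenceLength(buf[i]);
        if (len == 0)
            return true; // byte cannot start a sequence
        if (len > size - i)
            return true; // assumption: a truncated sequence counts as invalid
        for (std::size_t k = 1; k < len; ++k)
        {
            if ((buf[i + k] & 0xC0) != 0x80)
                return true; // trailing byte is not 10xxxxxx
        }
        i += len;
    }
    return false;
}
```

Called on the first bytes of cp-932.txt (83 76 83 8d ...), this check reports an invalid sequence at the very first byte, because 0x83 matches none of the lead-byte patterns.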

Here is the fix:
Index: Utf8FileDetect.cpp
===================================================================
--- Utf8FileDetect.cpp (revision 5074)
+++ Utf8FileDetect.cpp (working copy)
@@ -50,7 +50,9 @@
bool bUTF8 = false;
for (int i = 0; i < (size - 3); ++i)
{
- if ((*pVal2 & 0xE0) == 0xC0)
+ if ((*pVal2 & 0x80) == 0x00)
+ ;
+ else if ((*pVal2 & 0xE0) == 0xC0)
{
pVal2++;
i++;
@@ -58,7 +60,7 @@
return true;
bUTF8 = true;
}
- if ((*pVal2 & 0xF0) == 0xE0)
+ else if ((*pVal2 & 0xF0) == 0xE0)
{
pVal2++;
i++;
@@ -70,7 +72,7 @@
return true;
bUTF8 = true;
}
- if ((*pVal2 & 0xF8) == 0xF0)
+ else if ((*pVal2 & 0xF8) == 0xF0)
{
pVal2++;
i++;
@@ -86,6 +88,8 @@
return true;
bUTF8 = true;
}
+ else
+ return true;
pVal2++;
}
if (bUTF8)

Discussion

  • Kimmo Varis

    Kimmo Varis - 2008-02-24

    Logged In: YES
    user_id=631874
    Originator: NO

    Ok.

     
  • Takashi Sawanaka

    • assigned_to: nobody --> sdottaka
    • status: open --> closed-fixed
     
  • Takashi Sawanaka

    Logged In: YES
    user_id=954028
    Originator: YES

    Committed to SVN trunk at revision 5076.

     
