#2129 UTF8 detection fails if newline is missing

Andrew Scheller

WinMerge can auto-detect whether a text file is saved in Windows-1252 or UTF-8 format. However, this detection seems to fail if there are no newline characters between the first UTF-8 character and the end of the file (i.e. the UTF-8 characters are only on the last line of the file).

Hopefully the test-case files in the attachment illustrate this clearly:
* files with the same name in the corresponding "dirA" and "dirB" directories are identical apart from the first character being either 'A' or 'B' (so that when doing a folder-comparison WinMerge always spots each pair of files as being different)
* apart from the right4.txt files (which are Windows-1252), all the rest of the files are in UTF-8 format

WinMerge guesses the file-encoding correctly for the right?.txt files, but for the wrong?.txt files it guesses "1252" when the files are actually "UTF-8". This means WinMerge displays 'Â£' instead of '£'. The only difference within each pair (wrong1.txt vs right1.txt, and wrong2.txt vs right2.txt) is the addition of a newline.
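
To illustrate the symptom, here is a minimal Python sketch of what happens when the UTF-8 bytes for a pound sign are decoded with the wrong codepage. It uses an assumed sample of bytes rather than the actual contents of the test files, and it says nothing about how WinMerge itself decodes text.

```python
# "A £" encoded as UTF-8: 'A', space, then the two-byte sequence C2 A3 for '£'.
# (Assumed sample bytes, not read from the attached test files.)
raw = bytes([0x41, 0x20, 0xC2, 0xA3])

as_utf8 = raw.decode("utf-8")      # the intended interpretation
as_cp1252 = raw.decode("cp1252")   # what a wrong "1252" guess produces

print(repr(as_utf8))    # 'A £'
print(repr(as_cp1252))  # 'A Â£'
```

In Windows-1252 the byte 0xC2 is 'Â' and 0xA3 is '£', which is why the mis-detected files show a stray 'Â' in front of every pound sign.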

I guess there's probably an end-of-line callback (or something) that isn't getting triggered?

WinMerge configuration log

Saved to: C:\Users\Andrew\Documents\WinMerge\WinMerge.txt
Please add this information (or attach this file)
when reporting bugs.
Module names prefixed with tilda (~) are currently loaded in WinMerge process.

Version information:
Build config: UNICODE _UNICODE
Command Line: "D:\WinMergeTest\dirA" "D:\WinMergeTest\dirB"
Windows: Microsoft Windows 7 Home Edition Service Pack 1 (Build 7601)
~COMCTL32.dll dllversion=6.16 dllbuild=7601
~shlwapi.dll dllversion=6.01 dllbuild=7601
MergeLang.dll version=2.14 build=0000
ShellExtension.dll version=0.00 build=0000
ShellExtensionU.dll version=0.00 build=0000
ShellExtensionX64.dll version=1.16 build=0008

WinMerge configuration:
Compare settings:
Ignore blank lines: No
Ignore case: No
Ignore carriage return differences: Yes
Whitespace compare: Ignore change
Detect moved blocks: Yes
Compare method: 0
Stop after first diff: No

Other settings:
Automatic rescan: No
Simple EOL: Yes
Automatic scroll to 1st difference: No
Backup original file: No

Folder compare:
Identical files: No
Different files: Yes
Left Unique files: Yes
Right Unique files: Yes
Binary files: Yes
Skipped files: Yes
Tree-mode enabled: No

File compare:
Preserve filetimes: No
Match similar lines: No

Editor settings:
View Whitespace: No
Merge Mode enabled: No
Show linenumbers: No
Wrap lines: No
Syntax Highlight: Yes
Tab size: 4
Insert tabs: Yes

Font facename: Courier New
Font charset: 0 (Ansi)

System settings:
codepage settings:
ANSI codepage: 1252
OEM codepage: 850
Locale (Thread):
Def ANSI codepage: 1252
Def OEM codepage: 437
Country: United States
Language: English
Language code: 0409
ISO Language code: en
Locale (User):
Def ANSI codepage: 1252
Def OEM codepage: 850
Country: United Kingdom
Language: English
Language code: 0809
ISO Language code: en
Locale (System):
Def ANSI codepage: 1252
Def OEM codepage: 850
Country: United Kingdom
Language: English
Language code: 0809
ISO Language code: en
Detect codepage automatically for RC and HTML files: No
unicoder codepage: 1252

Plugins enabled: No
CompareMSExcelFiles.dll [C:\Program Files (x86)\WinMerge\MergePlugins\CompareMSExcelFiles.dll]
CompareMSWordFiles.dll [C:\Program Files (x86)\WinMerge\MergePlugins\CompareMSWordFiles.dll]
DisplayXMLFiles.dll [C:\Program Files (x86)\WinMerge\MergePlugins\DisplayXMLFiles.dll]
WatchBeginningOfLog.dll [C:\Program Files (x86)\WinMerge\MergePlugins\WatchBeginningOfLog.dll]
WatchEndOfLog.dll [C:\Program Files (x86)\WinMerge\MergePlugins\WatchEndOfLog.dll]
IgnoreColumns.dll [C:\Program Files (x86)\WinMerge\MergePlugins\IgnoreColumns.dll]
IgnoreCommentsC.dll [C:\Program Files (x86)\WinMerge\MergePlugins\IgnoreCommentsC.dll]
IgnoreFieldsComma.dll [C:\Program Files (x86)\WinMerge\MergePlugins\IgnoreFieldsComma.dll]
IgnoreFieldsTab.dll [C:\Program Files (x86)\WinMerge\MergePlugins\IgnoreFieldsTab.dll]
IgnoreLeadingLineNumbers.dll [C:\Program Files (x86)\WinMerge\MergePlugins\IgnoreLeadingLineNumbers.dll]
Editor scripts:
editor addin.sct [C:\Program Files (x86)\WinMerge\MergePlugins\editor addin.sct]
insert datetime.sct [C:\Program Files (x86)\WinMerge\MergePlugins\insert datetime.sct]

Archive support:
Enable: 1
7-Zip software installed on your computer: 0.00
7-Zip components for standalone operation: 0.00
Merge7z plugins on path:


  • Been reading a bit more about the UTF-8 file format. The bug detailed above still stands, but if I have a UTF-8 BOM at the start of the file (e.g. after saving the file in Notepad), then WinMerge correctly identifies it as UTF-8 regardless of whether there are any newlines or not.
    But of course the majority of UTF8 files don't have BOM markers...

  • Okay, so looking at the fix made to winmerge2011 shows that all my comments about newlines were a bit of a red herring ;-)
    Turns out that the problem is simply having the only UTF-8 characters too close to the end of the file (i.e. I only triggered this bug purely because I was creating such small test files!).
    So, in WinMerge the raw bytes '0x41 0x20 0xC2 0xA3' or '0x41 0x20 0xC2 0xA3 0x35' get mis-identified as "1252" (and displayed wrong), but the raw bytes '0x41 0x20 0xC2 0xA3 0x35 0x35' (A £55) get identified as "UTF-8" (and displayed correctly).

    New zipfile attached which includes some extra testcases (in the zipfile right4.txt is Windows-1252, right5.txt is pure-ASCII, and the rest are UTF-8).

  • Hmmm, looking again at the winmerge2011 fix, is there a good reason for the string to be iterated over twice? Wouldn't it be more efficient to simply include the test from the first while loop inside the second while loop instead?
    I have to confess I don't entirely follow the UTF8-detection algorithm, but would combining the two while loops make a folder-comparison (with a lot of files) quicker, or would it just make the code more complex with no noticeable increase in speed?
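
For reference, a single-pass UTF-8 check can be sketched as below. This is only an illustration in Python under stated assumptions: `looks_like_utf8` is a hypothetical helper, not the WinMerge or winmerge2011 code, and it is simplified (it validates lead and continuation bytes but does not reject every overlong or surrogate encoding). It walks the buffer once and accepts a multi-byte sequence that ends exactly at the last byte.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if data contains at least one multi-byte
    sequence and every non-ASCII byte belongs to a well-formed sequence."""
    i = 0
    n = len(data)
    seen_multibyte = False
    while i < n:
        b = data[i]
        if b < 0x80:            # plain ASCII, nothing to check
            i += 1
            continue
        if 0xC2 <= b <= 0xDF:   # lead byte of a 2-byte sequence
            need = 1
        elif 0xE0 <= b <= 0xEF: # lead byte of a 3-byte sequence
            need = 2
        elif 0xF0 <= b <= 0xF4: # lead byte of a 4-byte sequence
            need = 3
        else:                   # stray continuation byte or invalid lead byte
            return False
        if i + need >= n:       # sequence is cut off by the end of the buffer
            return False
        for k in range(1, need + 1):
            if data[i + k] & 0xC0 != 0x80:  # continuation bytes are 10xxxxxx
                return False
        seen_multibyte = True
        i += need + 1
    return seen_multibyte


# The byte sequences quoted in the comment above, plus two non-UTF-8 cases.
print(looks_like_utf8(bytes([0x41, 0x20, 0xC2, 0xA3])))              # True
print(looks_like_utf8(bytes([0x41, 0x20, 0xC2, 0xA3, 0x35])))        # True
print(looks_like_utf8(bytes([0x41, 0x20, 0xC2, 0xA3, 0x35, 0x35])))  # True
print(looks_like_utf8(bytes([0x41, 0x20, 0xA3])))                    # False (Windows-1252 '£')
print(looks_like_utf8(b"A 55"))                                      # False (pure ASCII)
```

Whether folding the two loops together would make a large folder-comparison measurably faster is a separate question that this sketch does not answer; it only shows that one pass is enough to classify the short test inputs consistently.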