When comparing two folders using Content or QuickContent comparison, this code short circuits the byte comparison if the files have different sizes. In that case, we know the contents also differ.
Patch created against revision 7130.
Short circuit the content comparison when file sizes differ
Thanks for the patch.
But this is unfortunately wrong. We cannot trust the file size. Think about different Unicode encodings: the same file content in UTF-8 and UTF-16 results in different file sizes. We could optimize some of the cases where we know the encodings are the same.
Another case is a present / missing UTF-8 BOM. So again the same content and the same encoding, but the file size may differ by three bytes.
Then the interesting point in Unicode is that there are different ways to represent the same data. And there might be variance in length between those representations. So we simply cannot determine anything from the size of Unicode files.
So the only case we can for sure optimize like this would be two ANSI files.
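To make the size argument concrete, here is a small sketch (not WinMerge code; all function names are made up for illustration) that spells out the raw bytes of the single character "é" in a few representations. The same text ends up with four different byte counts.

```cpp
#include <cstddef>
#include <string>

// Raw bytes of the text "é" (U+00E9) in several representations.
// Function names are hypothetical and only serve this illustration.

// UTF-8, precomposed form (NFC): C3 A9
inline std::string Utf8Nfc() { return std::string("\xC3\xA9", 2); }

// UTF-8, decomposed form (NFD): 'e' followed by U+0301 -> 65 CC 81
inline std::string Utf8Nfd() { return std::string("\x65\xCC\x81", 3); }

// UTF-8 (NFC) preceded by the three BOM bytes EF BB BF
inline std::string Utf8NfcWithBom() { return std::string("\xEF\xBB\xBF\xC3\xA9", 5); }

// UTF-16LE with BOM FF FE, then the code unit E9 00
inline std::string Utf16LeWithBom() { return std::string("\xFF\xFE\xE9\x00", 4); }

// A size-only check would call these "different" although the text is the same.
inline bool SizesDiffer(const std::string& a, const std::string& b)
{
    return a.size() != b.size();
}
```

So a size check alone would report a difference for any pair of these, even though a text comparison that understands encodings could treat them as identical.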
Yes, there is that file size compare method. It suffers from the shortcomings I've explained. And I think we should really document those shortcomings better, as currently it may give users the wrong impression about results.
Thank you for the comments, kimmov.
> So only case we for sure can optimize like this would be two ANSI files.
Binary files present another case we could optimize like this.
> Binary files present another case we could optimize like this.
True. There for sure are chances for this kind of optimization. The problem is to find a good place to do it.
Yes, I appreciate some of the difficulties.
It might help if the software and documentation made an explicit distinction between the two types of content comparisons:
(1) Text-content comparisons (insensitive to character encodings and UTF-8 BOM and possibly even white space and upper/lower case), and
(2) Binary-content comparisons (or byte-content comparisons).
As a short circuit, the size-comparison would apply only to binary-content comparisons.
A bonus is that if this distinction is explicit (i.e., offered as distinct user options), it should help alert the user to the shortcomings of the size-comparison method. This is in addition to the benefit of giving the user an option to specify what type of content-comparison matters.
Problem is that there is no technical distinction between text and binary files. It is totally dependent on the application. A pretty general "rule" is to treat files containing zero bytes as binary files. And that is what WinMerge does also, with the exception of files that can be recognized as Unicode files. But then there are special cases like PDF files which don't contain zero bytes but don't have any text content either...
You also need to add UCS-2/UTF-16 to your (1), which means zero bytes are in the content. You only know whether a file is binary or UTF-16 from the BOM bytes. Or the other way around: if a file has zero bytes but no BOM bytes, it must be a binary file. I don't remember how we currently do this detection, but after that we could use this short circuiting.
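A rough sketch of the rule as stated above, just to pin it down. This is not the actual WinMerge detection code (which is not shown in this thread); the enum and function names are invented for the example.

```cpp
#include <cstddef>
#include <string>

// Hypothetical classification, following only the rule described above:
// BOM found -> Unicode text; zero byte but no BOM -> binary; else 8-bit text.
enum class GuessedKind { Utf8Bom, Utf16LeBom, Utf16BeBom, Binary, EightBitText };

inline GuessedKind GuessKind(const std::string& buf)
{
    const auto b = [&buf](std::size_t i) { return static_cast<unsigned char>(buf[i]); };

    // Check the BOM first, because UTF-16 content legitimately contains zero bytes.
    if (buf.size() >= 3 && b(0) == 0xEF && b(1) == 0xBB && b(2) == 0xBF)
        return GuessedKind::Utf8Bom;
    if (buf.size() >= 2 && b(0) == 0xFF && b(1) == 0xFE)
        return GuessedKind::Utf16LeBom;  // note: a UTF-32LE BOM also starts FF FE
    if (buf.size() >= 2 && b(0) == 0xFE && b(1) == 0xFF)
        return GuessedKind::Utf16BeBom;

    // No BOM: any zero byte marks the buffer as binary under this rule.
    if (buf.find('\0') != std::string::npos)
        return GuessedKind::Binary;

    return GuessedKind::EightBitText;
}
```

Under this rule a UTF-16 buffer without a BOM is classified as binary, and a file like a PDF with no zero bytes in the inspected part is classified as 8-bit text, which matches the special cases mentioned above.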
Good points.
What I'm imagining is another set of items in the Compare Method drop down list in Edit > Options > Compare > Folder. Currently, the list offers
Full Contents
Quick Contents
Modified Date
Modified Date & Size
Size
I'm imagining this list expanding to something like
Full Contents
Full Binary Contents
Quick Contents
Quick Binary Contents
Modified Date
Modified Date & Size
Size
With the existing options, WinMerge continues to behave as it does now. In particular, with the options Full Contents and Quick Contents, it continues to try to perform a text comparison when possible and does not use the short circuit size comparison.
When the user selects either of the two new options, Full Binary Contents or Quick Binary Contents, then WinMerge does a binary comparison of all file types, including text files, and it uses the short circuit size comparison for all file types. (There are probably better ways to label these new options).
In addition to making it a user option whether to compare text files as text or raw bytes, this should also help make clear to the user the distinction between comparison types and may help educate the user about the shortcomings of the size-comparison method. Users who appreciate that a text comparison is not the same as a byte comparison can more easily grasp that the file size isn't necessarily a meaningful property when comparing text.
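In code, the proposal amounts to something like the following sketch. The enum values simply mirror the proposed drop-down list; none of these identifiers are taken from the WinMerge source, so treat them as placeholders.

```cpp
#include <cstdint>

// Hypothetical enum mirroring the proposed Compare Method list.
enum class CompareMethod {
    FullContents,
    FullBinaryContents,
    QuickContents,
    QuickBinaryContents,
    ModifiedDate,
    ModifiedDateAndSize,
    Size
};

// Only the two proposed "binary" methods compare raw bytes for every file type.
inline bool IsBinaryMethod(CompareMethod m)
{
    return m == CompareMethod::FullBinaryContents
        || m == CompareMethod::QuickBinaryContents;
}

// True when the files can be declared different from their sizes alone,
// i.e. without reading any content. The existing methods never take this path.
inline bool CanShortCircuitOnSize(CompareMethod m,
                                  std::uint64_t leftSize,
                                  std::uint64_t rightSize)
{
    return IsBinaryMethod(m) && leftSize != rightSize;
}
```

With equal sizes the function returns false for every method, so the byte comparison still has to run; the shortcut only ever proves "different", never "identical".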
>Or other way around, if file has zero bytes but no BOM bytes it must
>be a binary file.
Wrong, we can have UTF-8 and also UTF-16LE without a BOM.
WinCE handles text files (ini) as UTF-16LE without a BOM.
> wrong [about UTF-16 files without BOM]
UTF-16BE and UTF-16LE files without BOM bytes are a totally different can of worms and not the subject of this patch. UTF-16 files must have BOM bytes to avoid guesswork in processing them.
> I'm imagining this list expanding to something like ...
Interesting idea. In principle I'm still against adding new compare methods/options. It appears users don't understand our current methods, and many questions and bug reports are due to wrong expectations.
After years of pondering these problems I'm beginning to think we just can't do this file type detection well enough without help from the user. Or rather, we can't make it fast enough without reading a whole gigabyte-sized file.
This is a race between reliability of results and speed (as usual). But it is hard to decide beforehand (without knowing the file type) what to do. And when you know, it is too late. So I think we need to ask users for some help. The current (slower) automation for cases where the user doesn't know or care should of course remain.
One simple solution might be to just create lists of filename/path patterns and match those to file types. In most systems .ISO, .EXE, .JPG files etc. are binaries. There are some "quite certain" extensions we can use as defaults. Then, depending on the user's environment, one could add to or modify that list to suit the needs. If one knows all .doc files are text files and .myext files are text files, one could add those to the mappings list.
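A minimal sketch of such a mappings list, assuming plain extension matching rather than full filename/path patterns. The class and its defaults are hypothetical; only the three default extensions come from the comment above.

```cpp
#include <algorithm>
#include <cctype>
#include <map>
#include <string>

enum class FileType { Unknown, Text, Binary };

// Hypothetical user-editable table mapping file extensions to file types.
class ExtensionMap
{
public:
    // A few "quite certain" defaults; users could add or override entries.
    ExtensionMap()
        : m_map{{".iso", FileType::Binary},
                {".exe", FileType::Binary},
                {".jpg", FileType::Binary}}
    {
    }

    void Set(const std::string& ext, FileType type) { m_map[Lower(ext)] = type; }

    // Returns Unknown when the path has no extension or no mapping exists,
    // in which case the caller would fall back to the slower auto-detection.
    FileType Lookup(const std::string& path) const
    {
        const auto slash = path.find_last_of("/\\");
        const auto dot = path.find_last_of('.');
        if (dot == std::string::npos ||
            (slash != std::string::npos && dot < slash))
            return FileType::Unknown;  // the dot belongs to a folder name
        const auto it = m_map.find(Lower(path.substr(dot)));
        return it == m_map.end() ? FileType::Unknown : it->second;
    }

private:
    static std::string Lower(std::string s)
    {
        std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
            return static_cast<char>(std::tolower(c));
        });
        return s;
    }

    std::map<std::string, FileType> m_map;
};
```

A user who knows their environment would then call something like `Set(".myext", FileType::Text)`, and every file without a mapping keeps going through the existing detection.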
Sorry about getting quite off-topic in my previous comment about file detection and such.
We already have one "short circuit" option in compare options for stopping the quick contents compare after the first different byte is found. It is also documented that it can have side effects. But sometimes the win in speed is more important than e.g. wrong statistics.
I figure this size shortcutting might be used in a similar way - as an advanced option for users who know what they are doing. And alter the compare process a bit by doing fast UTF-8/UTF-16 file detection first.
Fast UTF-8/16 detection would mean checking for BOM bytes and scanning a few kilobytes from the beginning for possible non-UTF-8 chars (to see if it is UTF-8 or 8-bit ASCII).
If we detect the file as a text file (non-Unicode) or a binary file, then we can use this size compare shortcut.
Won't be perfect but still would be a big help in many cases. For WinMerge 3.x we could then think about a better solution to this whole problem...
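To illustrate the idea, here is a rough sketch of the fast detection step and the decision that follows it. It is not existing WinMerge code and all names are invented. It also assumes that a sample with no bytes above 0x7F is treated as 8-bit text, which the description above leaves open.

```cpp
#include <cstddef>
#include <string>

enum class QuickKind { Unicode, Binary, EightBitText };

// Classify a file from its first few kilobytes ("head").
// The UTF-8 check is loose: it validates lead/continuation byte structure
// only and does not reject overlong forms or surrogates.
inline QuickKind QuickDetect(const std::string& head)
{
    const std::size_t n = head.size();
    const auto b = [&head](std::size_t i) { return static_cast<unsigned char>(head[i]); };

    // 1) BOM bytes: UTF-8, UTF-16LE, UTF-16BE.
    if (n >= 3 && b(0) == 0xEF && b(1) == 0xBB && b(2) == 0xBF) return QuickKind::Unicode;
    if (n >= 2 && ((b(0) == 0xFF && b(1) == 0xFE) || (b(0) == 0xFE && b(1) == 0xFF)))
        return QuickKind::Unicode;

    // 2) Zero bytes without a BOM: treat as binary.
    if (head.find('\0') != std::string::npos) return QuickKind::Binary;

    // 3) Scan for multi-byte sequences; any malformed one means "not UTF-8".
    bool sawMultiByte = false;
    for (std::size_t i = 0; i < n;) {
        const unsigned char c = b(i);
        if (c < 0x80) { ++i; continue; }
        const std::size_t len = (c & 0xE0) == 0xC0 ? 2
                              : (c & 0xF0) == 0xE0 ? 3
                              : (c & 0xF8) == 0xF0 ? 4 : 0;
        if (len == 0) return QuickKind::EightBitText;  // invalid lead byte
        for (std::size_t k = 1; k < len; ++k) {
            if (i + k >= n) break;  // sequence cut off by the end of the sample
            if ((b(i + k) & 0xC0) != 0x80) return QuickKind::EightBitText;
        }
        sawMultiByte = true;
        i += len;
    }
    return sawMultiByte ? QuickKind::Unicode : QuickKind::EightBitText;
}

// The size shortcut is only considered when neither side looks like Unicode.
inline bool CanUseSizeShortcut(QuickKind left, QuickKind right)
{
    return left != QuickKind::Unicode && right != QuickKind::Unicode;
}
```

Because only the head of the file is inspected, a file whose first kilobytes are plain ASCII but which contains UTF-8 further on would still be misjudged, so this fits the "advanced option" framing rather than a default.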