#2942 Short circuit content comparison when sizes differ

Not Scheduled
open-later
nobody
5
2013-02-28
2010-04-23
Greg Bullock
No

When comparing two folders using Content or QuickContent comparison, this code short circuits the byte comparison if the files have different sizes. In that case, we know the contents also differ.
Patch created against revision 7130.
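
For readers without the patch at hand, the idea can be sketched as follows. This is an illustration in Python, not the WinMerge C++ code from the patch; the function name `files_differ_bytewise` and the chunk size are assumptions made for the example.

```python
import os


def files_differ_bytewise(path_a, path_b, chunk_size=64 * 1024):
    """Return True if the two files differ as raw byte sequences."""
    # Short circuit: files of different sizes cannot be byte-identical,
    # so there is no need to read either of them.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return True

    # Same size: fall back to a chunked byte comparison.
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            block_a = fa.read(chunk_size)
            block_b = fb.read(chunk_size)
            if block_a != block_b:
                return True
            if not block_a:  # both files reached EOF together
                return False
```

Note that this answers only whether the raw bytes differ; it says nothing about whether two text files decode to the same characters.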

Discussion

  • Kimmo Varis
    2010-04-23

    Thanks for the patch.

    But this is unfortunately wrong. We cannot trust the file size. Think about different Unicode encodings: the same file content in UTF-8 and UTF-16 results in different file sizes. We could optimize some of the cases where we know the encodings are the same.

    Another case is a present or missing UTF-8 BOM. So again the content and the encoding are the same, but the file size may differ by three bytes.

    Then the interesting point in Unicode is that there are different ways to represent the same data, and the length may vary between those representations. So we simply cannot determine anything from the size of Unicode files.

    So the only case we can optimize like this for sure would be two ANSI files.
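
The three effects described above (different encodings, a present or missing BOM, and different representations of the same character) can be shown with a small Python sketch. The helper name `encoded_sizes` and the sample string are assumptions chosen for illustration.

```python
import codecs
import unicodedata


def encoded_sizes(text):
    """Byte sizes of the same text under several encodings and forms."""
    nfc = unicodedata.normalize("NFC", text)  # precomposed characters
    nfd = unicodedata.normalize("NFD", text)  # base + combining marks
    return {
        "utf-8": len(nfc.encode("utf-8")),
        "utf-8 with BOM": len(codecs.BOM_UTF8 + nfc.encode("utf-8")),
        "utf-16-le": len(nfc.encode("utf-16-le")),
        "utf-8, NFD form": len(nfd.encode("utf-8")),
    }


# "caf" followed by U+00E9 (e with acute accent), written as an escape
# so the example does not depend on the source file's own encoding.
sizes = encoded_sizes("caf\u00e9")
for label, size in sizes.items():
    print(f"{label}: {size} bytes")
```

All four byte sequences stand for the same visible text, yet the UTF-8, UTF-8 with BOM, and decomposed UTF-8 sizes are 5, 8, and 6 bytes respectively.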

     
  • Kimmo Varis
    2010-04-23

    Yes, there is that file size compare method. It suffers from the shortcomings I've explained. And I think we should really document those shortcomings better, as currently it may give users the wrong impression about the results.

     
  • Greg Bullock
    2010-04-23

    Thank you for the comments, kimmov.

    > So only case we for sure can optimize like this would be two ANSI files.

    Binary files present another case we could optimize like this.

     
  • Kimmo Varis
    2010-04-23

    > Binary files present another case we could optimize like this.
    True. There are certainly chances for this kind of optimization. The problem is finding a good place to do it.

     
  • Greg Bullock
    2010-04-23

    Yes, I appreciate some of the difficulties.

    It might help if the software and documentation made an explicit distinction between the two types of content comparison:
    (1) Text-content comparisons (insensitive to character encodings and UTF-8 BOM and possibly even white space and upper/lower case), and
    (2) Binary-content comparisons (or byte-content comparisons).

    As a short circuit, the size-comparison would apply only to binary-content comparisons.

    A bonus is that if this distinction is explicit (i.e., offered as distinct user options), it should help alert the user to the shortcomings of the size-comparison method. This is in addition to the benefit of giving the user an option to specify what type of content-comparison matters.

     
  • Kimmo Varis
    2010-04-29

    The problem is that there is no technical distinction between text and binary files. It is totally dependent on the application. A pretty general "rule" is to treat files containing zero (NUL) bytes as binary files, and that is what WinMerge does too, with the exception of files that can be recognized as Unicode files. But then there are special cases like PDF files, which don't contain zero bytes but don't have any text content either...

    You also need to add UCS-2/UTF-16 to your (1), which means zero bytes are in the content. You only know whether a file is binary or UTF-16 from the BOM bytes. Or the other way around: if a file has zero bytes but no BOM bytes, it must be a binary file. I don't remember how we currently do this detection, but after it we could use this short circuiting.
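
The rule of thumb described in this comment can be sketched in Python as below. This is not WinMerge's actual detection code, which the comment itself does not recall; the function name `classify_prefix` and the returned labels are assumptions, and UTF-32 BOMs are deliberately left out to keep the sketch short.

```python
import codecs

# BOMs discussed in the thread, paired with an illustrative label.
BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def classify_prefix(prefix: bytes) -> str:
    """Classify a file from its first bytes using the BOM / zero-byte rule."""
    # A recognized BOM wins: the file is treated as Unicode text
    # even though UTF-16 content will contain zero bytes.
    for bom, label in BOMS:
        if prefix.startswith(bom):
            return label
    # No BOM but a zero byte present: treated as binary.
    if b"\x00" in prefix:
        return "binary"
    # Otherwise assume 8-bit text.
    return "text"
```

The sketch inherits the weaknesses of the rule: a PDF header such as `b"%PDF-1.4\n"` is classified as "text", and UTF-16 content without a BOM is classified as "binary".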

     
  • Greg Bullock
    2010-04-30

    Good points.

    What I'm imagining is another set of items in the Compare Method drop down list in Edit > Options > Compare > Folder. Currently, the list offers

    Full Contents
    Quick Contents
    Modified Date
    Modified Date & Size
    Size

    I'm imagining this list expanding to something like

    Full Contents
    Full Binary Contents
    Quick Contents
    Quick Binary Contents
    Modified Date
    Modified Date & Size
    Size

    With the existing options, WinMerge continues to behave as it does now. In particular, with the options Full Contents and Quick Contents, it continues to try to perform a text comparison when possible and does not use the short circuit size comparison.

    When the user selects either of the two new options, Full Binary Contents or Quick Binary Contents, then WinMerge does a binary comparison of all file types, including text files, and it uses the short circuit size comparison for all file types. (There are probably better ways to label these new options).

    In addition to making it a user option whether to compare text files as text or raw bytes, this should also help make clear to the user the distinction between comparison types and may help educate the user about the shortcomings of the size-comparison method. Users who appreciate that a text comparison is not the same as a byte comparison can more easily grasp that the file size isn't necessarily a meaningful property when comparing text.

     
  • Matthias
    2010-05-01

    >Or other way around, if file has zero bytes but no BOM bytes it must
    >be a binary file.
    Wrong: we can have UTF-8 and also UTF-16LE without a BOM.
    WinCE handles text files (ini) as UTF-16LE without a BOM.

     
  • Kimmo Varis
    2010-05-01

    > wrong [about UTF-16 files without BOM]

    UTF-16BE and UTF-16LE files without BOM bytes are a totally different can of worms and not the subject of this patch. UTF-16 files must have BOM bytes to avoid guesswork in processing them.

    > I'm imagining this list expanding to something like ...

    Interesting idea. In principle I'm still against adding new compare methods/options. It appears users don't understand our current methods, and many questions and bug reports are due to wrong expectations.

    After years of pondering these problems I'm beginning to think we just can't do this file type detection well enough without help from the user. Or rather, we can't make it fast enough without reading a whole gigabyte-sized file.

    This is a race between reliability of results and speed (as usual). But it is hard to decide beforehand (without knowing the file type) what to do, and by the time you know, it is too late. So I think we need to ask users for some help. The current (slower) automation for cases where the user doesn't know or care should of course remain.

    One simple solution might be to just create lists of filename/path patterns and map those to file types. On most systems .ISO, .EXE, .JPG files etc. are binaries. There are some "quite certain" extensions we can use as defaults. Then, depending on the user's environment, one could add to or modify that list to suit one's needs. If one knows that all .doc files and .myext files are text files, one could add those to the mappings list.
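
A minimal Python sketch of such a pattern-to-type mapping follows. The names `DEFAULT_TYPE_PATTERNS` and `file_type_for`, and the rule that user entries are checked before the defaults, are assumptions for illustration; nothing here reflects an existing WinMerge setting.

```python
from fnmatch import fnmatchcase

# "Quite certain" defaults taken from the extensions named in the comment.
DEFAULT_TYPE_PATTERNS = [
    ("*.iso", "binary"),
    ("*.exe", "binary"),
    ("*.jpg", "binary"),
]


def file_type_for(filename, user_patterns=(), defaults=DEFAULT_TYPE_PATTERNS):
    """Return the mapped file type, or None if no pattern matches."""
    name = filename.lower()
    # User-supplied mappings are checked first so they can override defaults.
    for pattern, file_type in list(user_patterns) + list(defaults):
        if fnmatchcase(name, pattern.lower()):
            return file_type
    # Unknown type: the caller would fall back to automatic detection.
    return None
```

For example, `file_type_for("notes.myext", [("*.myext", "text")])` returns `"text"`, while a file that matches nothing returns `None` and would go through the existing, slower automatic detection.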

     
  • Kimmo Varis
    2010-05-04

    Sorry about getting quite off-topic in my previous comment about file detection and such..

    We already have one "short circuit" option in the compare options for stopping the quick contents compare after the first differing byte is found. It is also documented that it can have side effects. But sometimes the win in speed is more important than e.g. wrong statistics.

    I figure this size shortcutting might be used in a similar way: as an advanced option for users who know what they are doing. We would also alter the compare process a bit by doing fast UTF-8/UTF-16 file detection first.

    Fast UTF-8/16 detection would mean checking for BOM bytes and scanning a few kilobytes from the beginning for possible non-UTF-8 chars (to see if it is UTF-8 or 8-bit ASCII).

    If we detect the file as a (non-Unicode) text file or a binary file, then we can use this size compare shortcut.

    It won't be perfect but would still be a big help in many cases. For WinMerge 3.x we could then think about a better solution to this whole problem...
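
The fast detection proposed in this comment might look roughly like the Python sketch below. It is one reading of the proposal, not existing WinMerge behavior: the names `SNIFF_BYTES`, `sniff_kind` and `can_use_size_shortcut` and the 4096-byte window are assumptions. A prefix that is valid UTF-8 (which includes pure ASCII) is conservatively treated as Unicode, so the shortcut is not applied to it.

```python
import codecs

SNIFF_BYTES = 4096  # "a few kilobytes"; the exact value is an assumption

_BOMS = (codecs.BOM_UTF8, codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)


def sniff_kind(prefix: bytes) -> str:
    """Classify the first SNIFF_BYTES of a file as unicode/binary/8bit-text."""
    if prefix.startswith(_BOMS):
        return "unicode"
    if b"\x00" in prefix:
        return "binary"
    # Incremental decoding with final=False tolerates a multi-byte
    # sequence that is cut off at the end of the sniffed window.
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        decoder.decode(prefix, final=False)
    except UnicodeDecodeError:
        return "8bit-text"  # contains bytes that are not valid UTF-8
    return "unicode"  # valid UTF-8 so far (pure ASCII lands here too)


def can_use_size_shortcut(kind_a: str, kind_b: str) -> bool:
    """Size may decide the result only if neither side is Unicode text."""
    safe = {"binary", "8bit-text"}
    return kind_a in safe and kind_b in safe
```

A caller would read `SNIFF_BYTES` from each file, classify both, and skip the byte comparison on a size mismatch only when `can_use_size_shortcut` returns `True`. As the comment says, this cannot be perfect: content beyond the sniffed window may contradict the classification.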

     
  • Christian List
    2012-12-31

    • status: open --> open-later
     
  • Christian List
    2013-02-28

    • milestone: Trunk --> Not Scheduled