Menu

#1313 Add binary file check to quick contents compare

Trunk
closed-accepted
5
2005-07-27
2005-07-12
Kimmo Varis
No

RFE #1234813 Compare by quick contents should do binary
file check

This patch adds new function to loop through file
buffers and check if there are zero-bytes. If
zero-bytes are found from file-buffer we add
binary-flag to result and reset all compare ignore
settings.

I can't say since I havent' profiled, but I think this
is a small speed gain for compare. When we reset ignore
settings for binary files, that makes comparing them
faster.

Also, we currently have 64-byte buffers, so they fit
very well to processor cache and finding zeros is very
fast.

Discussion

  • Kimmo Varis

    Kimmo Varis - 2005-07-12

    Original and altered files

     
  • Kimmo Varis

    Kimmo Varis - 2005-07-13

    Logged In: YES
    user_id=631874

    Applied to CVS trunk (2.3). Closing.

    Checking in ByteComparator.cpp;
    /cvsroot/winmerge/WinMerge/Src/ByteComparator.cpp,v <--
    ByteComparator.cpp
    new revision: 1.3; previous revision: 1.2
    done
    Checking in ByteComparator.h;
    /cvsroot/winmerge/WinMerge/Src/ByteComparator.h,v <--
    ByteComparator.h
    new revision: 1.2; previous revision: 1.1
    done
    Checking in DiffWrapper.cpp;
    /cvsroot/winmerge/WinMerge/Src/DiffWrapper.cpp,v <--
    DiffWrapper.cpp
    new revision: 1.53; previous revision: 1.52
    done

     
  • Kimmo Varis

    Kimmo Varis - 2005-07-13
    • assigned_to: nobody --> kimmov
    • status: open --> closed-accepted
     
  • elsapo

    elsapo - 2005-07-21

    Logged In: YES
    user_id=1195173

    This means processing the whole buffer twice instead of
    once; won't this make processing text files slower?

    My guess is that most people process mostly text files, so
    slowing down all text files in order to speed up the binary
    files may not be a great trade-off?

    (I'm debugging something and just ran into this code.)

     
  • Kimmo Varis

    Kimmo Varis - 2005-07-21

    Logged In: YES
    user_id=631874

    That kind of byte-check in loop is very fast, just think
    about how it is in assembler:
    BEGIN:
    load byte to register
    jump to BEGIN if register value is zero

    These all (<10 real instructions) are (very) fast
    operations, byte loading can take most of the time if we
    don't hit cache.

    Now, compare this to case where you apply whitespace ignore
    logic to big binary files.

    So, we lose small amount of speed in text files, but save a
    lot of time for binary files.

    Also, avoiding user frustation when quick compare marks all
    files as text files and user then tries to open .ico or .bmp
    files (we have those in WinMerge source tree for example)
    gets message WinMerge cannot open binary files and still
    dirview shows them as text files.

    So, in my opinion, small time we lose in text file compare
    is big total win, in user experience and compare speed in
    real-life source trees (which have binaries too).

    And I specifically did it that way (simple) loop since I
    think (unfortunately don't have numbers) going for simple
    byte compare for binary files as fast as possible is biggest
    win in compare time.

    And, while working with dir compare patches, I think more
    time goes to other tasks than actual compare with relatively
    small text files (~few KB). UI update etc takes a lot of
    time related to actual compare.

     
  • elsapo

    elsapo - 2005-07-21

    Logged In: YES
    user_id=1195173

    Ok.

    That doesn't solve either of the outstanding text/binary
    problems tho.

    #1) diffutils uses a partial binary algorithm on top of file

    #2) This algorithm that you modified bails out if it finds a
    difference, and quits checking binary vs. text, so it may
    not find that a binary file is binary if it finds a
    difference early.

     
  • elsapo

    elsapo - 2005-07-21

    Logged In: YES
    user_id=1195173

    In fact, we can't really solve the binary problem (that is,
    the issue that WinMerge may show a binary file as text) and
    have a fast diff, because they're incompatible goals:

    - To really tell if a file is binary, we have to go through
    every byte

    - To have a fast diff, we have to bail out when we hit a
    difference

    But, we can improve what we can. I think my PATCH [ 1242008
    ] (Update status to binary if open fails because binary) is
    along the direction of improving the experience when this
    happens.

    A really big help would be some way from the DirView to
    compare binary files -- that is a gap in our GUI, that the
    user cannot select the required prediffer
    (DisplayBinaryFiles.dll), which greatly annoys me, but not
    enough yet for me to create a patch to fix it.

     
  • Kimmo Varis

    Kimmo Varis - 2005-07-23

    Logged In: YES
    user_id=631874

    I really don't care about showing binary files. We are not
    binary editor and can't be unless somebody is ready to
    implement binary file diffing.

    What I (and users) want is we can compare ALL files, not
    just files up to say 100 MB. Thats pretty essential for
    directory compare if one wants to use it to synch directories.

    Yes, we might not be able to detect if 200 MB file is binary
    or text. But who cares? If we can't open that file, what
    does it matter if it is text or binary file? Nothing.

    Good point about quick contents compare to stop comparing
    after it finds first diff. I can revert my patches for
    faster binary compare.

    But still I'd like to see we use full potential of your
    quick contents compare, since it has no upper limit for file
    sizes.

    Couple of choices I can think of:
    1) Alter current 2MB limit (bigger files compared with quick
    contents) to say 50 MB and consider files bigger than that
    always as binary files (or add new category for them) so
    they cannot be opened to file compare
    2) Alter current quick contents compare to look through full
    file if its binary or not

     
  • elsapo

    elsapo - 2005-07-23

    Logged In: YES
    user_id=1195173

    I care a lot about "binary" files because my primary use
    these days is that I have an automated script running, which
    monitors schema in several databases. For some reason, a few
    of the data dump files wind up with a few binary 0s in them.
    They are really text files -- they look textual anyway, and
    have line returns -- but the WinMerge algorithm calls them
    binary. I guess this is my problem with that optimization of
    yours -- it makes sense for most people, but I'm in a small
    group that is using WinMerge text compare on "binary" files.
    That is, I need to ignore blank lines in these files (due to
    obscure bugs in SQL Server which generate spurious line
    returns).

    I guess my problem is that I have text files which have a
    few binary zeros in them.

    It is an arcane problem, but one that plagues me :)

    What I'd really like is an option to use whitespace rules
    for binary files as well -- or else, some way to force these
    files to be treated as text (eg, a switch to treat all files
    found as text, or to use extension override, so that it will
    treat all .txt files as text even if they have a binary zero
    or a handful).

    Also, btw, re: "If we can't open it, who cares if its binary
    or text"

    Except, we can still say whether it has changed or not,
    whether or not our WinMerge editor can open it.

    Its starting to get difficult for me to keep straight all
    the scenarios/use cases, frankly :)

     
  • Kimmo Varis

    Kimmo Varis - 2005-07-23

    Logged In: YES
    user_id=631874

    Thanks a lot for your explanation. I understand this now a
    lot better.

    Did I understand correctly that your main problem is I
    disabled whitespace options for binary files? Yes it was an
    optimization I thought was safe to do.

    So, we don't really have good defaults here. So maybe we
    just have to add new compare options so we can really handle
    these situations too.

    One nasty situation also just came into my mind: we start
    comparing two files, and default to check for whitespace
    ignores. We find couple of differences, but ighnored by
    whitespace ignores. Then we detect file as binary. Now, we
    already have ignored differences so we think files as
    identical, even though they propably should be different, as
    they can be some binary data files..

    Ugly.

     
  • Kimmo Varis

    Kimmo Varis - 2005-07-23
    • status: closed-accepted --> open-accepted
     
  • elsapo

    elsapo - 2005-07-23

    Logged In: YES
    user_id=1195173

    Wow, thats ugly, where we mark binary file as identical on
    the first pass, but if they refresh, we'll do strict binary
    comparison and show it as different after refresh.

    I think my problem with my "text"/"binary" files is that I'm
    in the gray area between text and binary. I finally just
    thought of this, that there are two different attributes,
    which are so similar that I've had them confused all this time.

    *** #1) Line-delimited files

    This is files with line delimiters spaced through them.

    We don't care which kind of line delimiters, or even if they
    change, but these line delimiters make the file much more
    legible in text editors, and furthermore, the WinMerge
    (MergeEditView & crystal) really is only good for
    line-delimited files, because it does not ever wrap lines.

    So, lets say 1=line delimited files and 2=non-line delimited
    files

    *** A) text files

    By text files, I mean ones displayable (and editable?) in a
    text editor. A file is displayable as text if it has no
    binary zeros. It is editable if it has no control characters
    (bytes under 0x20, besides tab & line terminators).

    Let's say
    A = text files with no control characters (except tab &
    line ends)
    B = text files with no binary zeros (but control characters)
    Z = files with binary zeros

    Possibilities

    == 1A: line-delimited files with no binary zeros or control
    characters

    This is the usual case "text files", and works great

    == 1B: line-delimited files with no binary zeros but control
    characters

    We display this pretty well, but I don't know if editing
    them will be lossy. I don't know if anyone uses these either.

    == 1Z: line-delimited files with binary zeros embedded

    This is a weird case that I happen to have. We can compare
    these ok, but display is problematic, and editing probably
    impossible. These are definitely to be called "binary" files.

    == 2A: no line-delimiters text files

    These compare ok, but our display/editor is not really
    suitable for these. We've not really addressed these, I
    think. Line breaking is needed to view/edit these, but line
    breaking isn't really compatible with the WinMerge display
    of differences. I think these files may fall outside our scope.

    == 2B no line delimiters and control characters

    Like (2A), only editing is more problematic; I think
    out-of-scope.

    == 2C no line delimiters and binary zeros

    These are what I envision as classical "binary files", and
    definitely out-of-scope for our viewer/editor.

     
  • elsapo

    elsapo - 2005-07-24

    Logged In: YES
    user_id=1195173

    I propose that we close this patch (you must have reopened
    it) -- that is, we keep the code -- and we add a new option
    "Use whitespace rules for binary files". Obviously
    implementing that option will involve modifying this patch code.

     
  • Kimmo Varis

    Kimmo Varis - 2005-07-26

    Logged In: YES
    user_id=631874

    Yes, I reopened it so I can easier find it.

    Implementing that new compare option solves problem with
    text files with zeros.

    I have a radical suggestion:
    what if we stop using 'binary file' term in our UI. And
    redefine binary/text file categories to:
    1) 'Editable files' we can open to file compare
    2) 'Non-editable files' we cannot open to file compare

    This way we get rid of this binary/text problem - we just
    detect files we can open to editor. After all, thats the
    reason we have those catetegories.

    This would require a lot of UI changes so it will be
    WinMerge 3.0 change anyway. But something to consider after
    we branch for 2.4 stable.

     
  • elsapo

    elsapo - 2005-07-26

    Logged In: YES
    user_id=1195173

    I like that idea (editable/noneditable), in theory, but I
    think we're still glossing over the problem (?) of control
    characters.

    A file with control characters (1B in my list below) is
    viewable, but is only somewhat editable -- I mean, I think
    WinMerge will preserve the control characters until you edit
    them, and once you touch them you can't get them back -- but
    that is actually speculation -- I'm not sure what will
    happen really.

     
  • Kimmo Varis

    Kimmo Varis - 2005-07-26

    Logged In: YES
    user_id=631874

    Yes, it was radical suggestion we have to discuss a lot
    before implementing it... It would be different for what all
    other programs do, big improvement for how users see files.

    Now, back to the topic. I just remembered I tested my patch,
    I compared one of my (humor) video directory with > 600 Mb
    of videos to itself. So all files were identical. It took
    ~12 second from cold cache. And since all files were
    identical, WinMerge had to sweep through all files from
    first byte to last byte. >1,2 GB of data to read. I wouldn't
    say that is slow. And remember without these patches I
    couldn't compare those files at all...

    So for binary file detection, my suggestion is to check for
    full files for zero bytes, even though we can stop searcing
    for other differences after first one. It gets us
    predictable results with pretty small speed loss.

     
  • elsapo

    elsapo - 2005-07-26

    Logged In: YES
    user_id=1195173

    That is fast.

    But, some users may run WinMerge against network-mapped
    drives :(

    This is a hard topic, because getting binary test wrong is
    undesired, but I'm afraid insisting on scanning every byte
    of every file may sometimes be undesired as well.

     
  • Kimmo Varis

    Kimmo Varis - 2005-07-26

    Logged In: YES
    user_id=631874

    Yes, for that kind of code disk I/O is limiting factor.
    Speed wouldn't be in same level for compact flash card or
    USB v1.. :)

    I think we have to prioritize for correct results here.
    Giving wrong results may lead users to bad decisions,
    imagine somebody wants to delete identical file based on
    WinMerge compare (even if it really is not)...

    Now, we could add option to stop comparing after first
    difference, but we must have big red lights blinking that
    results may not be correct.

     
  • Kimmo Varis

    Kimmo Varis - 2005-07-27
    • status: open-accepted --> closed-accepted
     

Log in to post a comment.

Want the latest updates on software, tech news, and AI?
Get latest updates about software, tech news, and AI from SourceForge directly in your inbox once a month.