RFE #1234813 Compare by quick contents should do binary
file check
This patch adds new function to loop through file
buffers and check if there are zero-bytes. If
zero-bytes are found from file-buffer we add
binary-flag to result and reset all compare ignore
settings.
I can't say since I havent' profiled, but I think this
is a small speed gain for compare. When we reset ignore
settings for binary files, that makes comparing them
faster.
Also, we currently have 64-byte buffers, so they fit
very well to processor cache and finding zeros is very
fast.
Original and altered files
Logged In: YES
user_id=631874
Applied to CVS trunk (2.3). Closing.
Checking in ByteComparator.cpp;
/cvsroot/winmerge/WinMerge/Src/ByteComparator.cpp,v <--
ByteComparator.cpp
new revision: 1.3; previous revision: 1.2
done
Checking in ByteComparator.h;
/cvsroot/winmerge/WinMerge/Src/ByteComparator.h,v <--
ByteComparator.h
new revision: 1.2; previous revision: 1.1
done
Checking in DiffWrapper.cpp;
/cvsroot/winmerge/WinMerge/Src/DiffWrapper.cpp,v <--
DiffWrapper.cpp
new revision: 1.53; previous revision: 1.52
done
Logged In: YES
user_id=1195173
This means processing the whole buffer twice instead of
once; won't this make processing text files slower?
My guess is that most people process mostly text files, so
slowing down all text files in order to speed up the binary
files may not be a great trade-off?
(I'm debugging something and just ran into this code.)
Logged In: YES
user_id=631874
That kind of byte-check in loop is very fast, just think
about how it is in assembler:
BEGIN:
load byte to register
jump to BEGIN if register value is zero
These all (<10 real instructions) are (very) fast
operations, byte loading can take most of the time if we
don't hit cache.
Now, compare this to case where you apply whitespace ignore
logic to big binary files.
So, we lose small amount of speed in text files, but save a
lot of time for binary files.
Also, avoiding user frustation when quick compare marks all
files as text files and user then tries to open .ico or .bmp
files (we have those in WinMerge source tree for example)
gets message WinMerge cannot open binary files and still
dirview shows them as text files.
So, in my opinion, small time we lose in text file compare
is big total win, in user experience and compare speed in
real-life source trees (which have binaries too).
And I specifically did it that way (simple) loop since I
think (unfortunately don't have numbers) going for simple
byte compare for binary files as fast as possible is biggest
win in compare time.
And, while working with dir compare patches, I think more
time goes to other tasks than actual compare with relatively
small text files (~few KB). UI update etc takes a lot of
time related to actual compare.
Logged In: YES
user_id=1195173
Ok.
That doesn't solve either of the outstanding text/binary
problems tho.
#1) diffutils uses a partial binary algorithm on top of file
#2) This algorithm that you modified bails out if it finds a
difference, and quits checking binary vs. text, so it may
not find that a binary file is binary if it finds a
difference early.
Logged In: YES
user_id=1195173
In fact, we can't really solve the binary problem (that is,
the issue that WinMerge may show a binary file as text) and
have a fast diff, because they're incompatible goals:
- To really tell if a file is binary, we have to go through
every byte
- To have a fast diff, we have to bail out when we hit a
difference
But, we can improve what we can. I think my PATCH [ 1242008
] (Update status to binary if open fails because binary) is
along the direction of improving the experience when this
happens.
A really big help would be some way from the DirView to
compare binary files -- that is a gap in our GUI, that the
user cannot select the required prediffer
(DisplayBinaryFiles.dll), which greatly annoys me, but not
enough yet for me to create a patch to fix it.
Logged In: YES
user_id=631874
I really don't care about showing binary files. We are not
binary editor and can't be unless somebody is ready to
implement binary file diffing.
What I (and users) want is we can compare ALL files, not
just files up to say 100 MB. Thats pretty essential for
directory compare if one wants to use it to synch directories.
Yes, we might not be able to detect if 200 MB file is binary
or text. But who cares? If we can't open that file, what
does it matter if it is text or binary file? Nothing.
Good point about quick contents compare to stop comparing
after it finds first diff. I can revert my patches for
faster binary compare.
But still I'd like to see we use full potential of your
quick contents compare, since it has no upper limit for file
sizes.
Couple of choices I can think of:
1) Alter current 2MB limit (bigger files compared with quick
contents) to say 50 MB and consider files bigger than that
always as binary files (or add new category for them) so
they cannot be opened to file compare
2) Alter current quick contents compare to look through full
file if its binary or not
Logged In: YES
user_id=1195173
I care a lot about "binary" files because my primary use
these days is that I have an automated script running, which
monitors schema in several databases. For some reason, a few
of the data dump files wind up with a few binary 0s in them.
They are really text files -- they look textual anyway, and
have line returns -- but the WinMerge algorithm calls them
binary. I guess this is my problem with that optimization of
yours -- it makes sense for most people, but I'm in a small
group that is using WinMerge text compare on "binary" files.
That is, I need to ignore blank lines in these files (due to
obscure bugs in SQL Server which generate spurious line
returns).
I guess my problem is that I have text files which have a
few binary zeros in them.
It is an arcane problem, but one that plagues me :)
What I'd really like is an option to use whitespace rules
for binary files as well -- or else, some way to force these
files to be treated as text (eg, a switch to treat all files
found as text, or to use extension override, so that it will
treat all .txt files as text even if they have a binary zero
or a handful).
Also, btw, re: "If we can't open it, who cares if its binary
or text"
Except, we can still say whether it has changed or not,
whether or not our WinMerge editor can open it.
Its starting to get difficult for me to keep straight all
the scenarios/use cases, frankly :)
Logged In: YES
user_id=631874
Thanks a lot for your explanation. I understand this now a
lot better.
Did I understand correctly that your main problem is I
disabled whitespace options for binary files? Yes it was an
optimization I thought was safe to do.
So, we don't really have good defaults here. So maybe we
just have to add new compare options so we can really handle
these situations too.
One nasty situation also just came into my mind: we start
comparing two files, and default to check for whitespace
ignores. We find couple of differences, but ighnored by
whitespace ignores. Then we detect file as binary. Now, we
already have ignored differences so we think files as
identical, even though they propably should be different, as
they can be some binary data files..
Ugly.
Logged In: YES
user_id=1195173
Wow, thats ugly, where we mark binary file as identical on
the first pass, but if they refresh, we'll do strict binary
comparison and show it as different after refresh.
I think my problem with my "text"/"binary" files is that I'm
in the gray area between text and binary. I finally just
thought of this, that there are two different attributes,
which are so similar that I've had them confused all this time.
*** #1) Line-delimited files
This is files with line delimiters spaced through them.
We don't care which kind of line delimiters, or even if they
change, but these line delimiters make the file much more
legible in text editors, and furthermore, the WinMerge
(MergeEditView & crystal) really is only good for
line-delimited files, because it does not ever wrap lines.
So, lets say 1=line delimited files and 2=non-line delimited
files
*** A) text files
By text files, I mean ones displayable (and editable?) in a
text editor. A file is displayable as text if it has no
binary zeros. It is editable if it has no control characters
(bytes under 0x20, besides tab & line terminators).
Let's say
A = text files with no control characters (except tab &
line ends)
B = text files with no binary zeros (but control characters)
Z = files with binary zeros
Possibilities
== 1A: line-delimited files with no binary zeros or control
characters
This is the usual case "text files", and works great
== 1B: line-delimited files with no binary zeros but control
characters
We display this pretty well, but I don't know if editing
them will be lossy. I don't know if anyone uses these either.
== 1Z: line-delimited files with binary zeros embedded
This is a weird case that I happen to have. We can compare
these ok, but display is problematic, and editing probably
impossible. These are definitely to be called "binary" files.
== 2A: no line-delimiters text files
These compare ok, but our display/editor is not really
suitable for these. We've not really addressed these, I
think. Line breaking is needed to view/edit these, but line
breaking isn't really compatible with the WinMerge display
of differences. I think these files may fall outside our scope.
== 2B no line delimiters and control characters
Like (2A), only editing is more problematic; I think
out-of-scope.
== 2C no line delimiters and binary zeros
These are what I envision as classical "binary files", and
definitely out-of-scope for our viewer/editor.
Logged In: YES
user_id=1195173
I propose that we close this patch (you must have reopened
it) -- that is, we keep the code -- and we add a new option
"Use whitespace rules for binary files". Obviously
implementing that option will involve modifying this patch code.
Logged In: YES
user_id=631874
Yes, I reopened it so I can easier find it.
Implementing that new compare option solves problem with
text files with zeros.
I have a radical suggestion:
what if we stop using 'binary file' term in our UI. And
redefine binary/text file categories to:
1) 'Editable files' we can open to file compare
2) 'Non-editable files' we cannot open to file compare
This way we get rid of this binary/text problem - we just
detect files we can open to editor. After all, thats the
reason we have those catetegories.
This would require a lot of UI changes so it will be
WinMerge 3.0 change anyway. But something to consider after
we branch for 2.4 stable.
Logged In: YES
user_id=1195173
I like that idea (editable/noneditable), in theory, but I
think we're still glossing over the problem (?) of control
characters.
A file with control characters (1B in my list below) is
viewable, but is only somewhat editable -- I mean, I think
WinMerge will preserve the control characters until you edit
them, and once you touch them you can't get them back -- but
that is actually speculation -- I'm not sure what will
happen really.
Logged In: YES
user_id=631874
Yes, it was radical suggestion we have to discuss a lot
before implementing it... It would be different for what all
other programs do, big improvement for how users see files.
Now, back to the topic. I just remembered I tested my patch,
I compared one of my (humor) video directory with > 600 Mb
of videos to itself. So all files were identical. It took
~12 second from cold cache. And since all files were
identical, WinMerge had to sweep through all files from
first byte to last byte. >1,2 GB of data to read. I wouldn't
say that is slow. And remember without these patches I
couldn't compare those files at all...
So for binary file detection, my suggestion is to check for
full files for zero bytes, even though we can stop searcing
for other differences after first one. It gets us
predictable results with pretty small speed loss.
Logged In: YES
user_id=1195173
That is fast.
But, some users may run WinMerge against network-mapped
drives :(
This is a hard topic, because getting binary test wrong is
undesired, but I'm afraid insisting on scanning every byte
of every file may sometimes be undesired as well.
Logged In: YES
user_id=631874
Yes, for that kind of code disk I/O is limiting factor.
Speed wouldn't be in same level for compact flash card or
USB v1.. :)
I think we have to prioritize for correct results here.
Giving wrong results may lead users to bad decisions,
imagine somebody wants to delete identical file based on
WinMerge compare (even if it really is not)...
Now, we could add option to stop comparing after first
difference, but we must have big red lights blinking that
results may not be correct.
Logged In: YES
user_id=631874
I submitted patch to scan full files:
#1245773 Scan full files in quick contents compare
https://sourceforge.net/tracker/index.php?func=detail&aid=1245773&group_id=13216&atid=313216
I close this item again.