Kimmo Varis wrote:
Thanks for your comments.
Yes, the filtering seems to be something elsapo/Perry and I really can't
agree on. But that's nothing new...
I've been reading the wiki and through some of the source code, and it's
been pretty interesting. I'd like to bring several questions up for
discussion. The first question is that of filtering, which seems to be
pretty core to the design. (There's been some discussion of this already.)
The idea that elsapo advocated was that transformations should happen
before the diff. If I understand correctly, a diff would then consist of
a transformation pass over each input followed by a plain comparison of
the transformed text.
Maybe it is best to first define what we mean when we talk about
filtering. When I'm working with files in [any OS] I usually do the file
matching using wildcards. Let's think of that as simple file filtering.
If I want to see the .txt files in a folder, I type $[command] *.txt. So
that filtering gives me a *subset* of the original set. I hope this kind
of mental model is good when we talk about filtering in WinMerge: we
give the user a subset of the original set. We don't change the original
set. And the subset must be inside the original set; it can't include
anything not in the original set.
Transformations are not filtering. If we alter the original set, then
that is some other feature.
I know some people want to think of these as the same feature, but I
don't. I'd like to keep filtering an easy feature to use and approach.
Think about the file matching example I wrote above. It is easy to think
of filtering your results by some criterion (*.txt). But when you start
thinking about transforming them, it easily gets a lot more complex. It
is no trick to give a command like
$cat file.txt | grep todo
to find the lines containing the word todo. But it is a lot more complex
to think about commands changing those todo words to, e.g., done words.
(I won't even try to write that command here.)
By doing this, the whole mess surrounding line endings could be kept out
of the diff code. Case-insensitive compares could be done by converting
the entire input string to lower case. Whitespace-agnostic compares
could be done by collapsing whitespace. This would also allow multi-line
transformations.
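To make that concrete, here is a minimal sketch of such a pre-diff
transformation (my own illustration; NormalizeForCompare is a made-up
name, nothing like this exists in WinMerge today):

    #include <cctype>
    #include <string>

    // Lower-case the text and collapse whitespace runs before diffing,
    // so the diff engine itself knows nothing about either option.
    std::string NormalizeForCompare(const std::string &line)
    {
        std::string out;
        bool pendingSpace = false;
        for (std::string::size_type i = 0; i < line.size(); ++i)
        {
            unsigned char ch = (unsigned char)line[i];
            if (std::isspace(ch))
            {
                pendingSpace = true;   // remember, but emit at most one space
                continue;
            }
            if (pendingSpace && !out.empty())
                out += ' ';
            pendingSpace = false;
            out += (char)std::tolower(ch);
        }
        return out;
    }

Both inputs would be run through this first, and the diff would then
compare the transformed lines byte for byte.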
The EOL issue is a good point. But we don't need transformations to
solve it. We always give line data to diffutils, so we can unify the EOL
bytes before diffutils sees them. Regexps don't understand different EOL
styles. It is unfortunate that the current line filtering is implemented
using the diffutils regexp code, so doing this is not trivial; if we
unify the regexp handling, this would be easy to solve. (And in fact
WinMerge already stores the lines and their EOL bytes separately, but
that separation doesn't carry through into diffutils.)
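Unifying the EOL bytes is a very small operation, something like this
(just a sketch of the idea, not actual WinMerge code):

    #include <string>

    // Strip the trailing EOL bytes so diffutils sees identical data
    // whether a file uses \n, \r\n or \r line endings. The stripped
    // bytes can stay stored separately, as WinMerge already keeps them.
    std::string UnifyEol(std::string line)
    {
        while (!line.empty() &&
               (line[line.size() - 1] == '\r' || line[line.size() - 1] == '\n'))
            line.erase(line.size() - 1, 1);
        return line;
    }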
Case-insensitivity and whitespace are very good points. I've never
thought about doing them this way. But I'm afraid these are special
cases too: they are pretty easy to hard-code and optimize, and since
they are very common cases I'd say we also want to optimize them. It is
faster to do tolower() in the compare code than to go through some
custom transformation step.
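For example, a hard-coded case-insensitive compare can stop at the first
differing char and never copies a line (again just a sketch, not the
real compare code):

    #include <cctype>
    #include <string>

    bool EqualIgnoreCase(const std::string &a, const std::string &b)
    {
        // tolower() keeps the length, so different lengths mean different lines.
        if (a.size() != b.size())
            return false;
        for (std::string::size_type i = 0; i < a.size(); ++i)
        {
            // One differing char is enough to judge the lines different.
            if (std::tolower((unsigned char)a[i]) !=
                std::tolower((unsigned char)b[i]))
                return false;
        }
        return true;
    }

A transformation step would instead build lower-cased copies of both
lines before the compare even starts.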
But if we do generalize this transformation idea, do we end up with
something along the lines of the current plugins? What these
transformations should and could do is something I haven't really
thought through yet.
Anyway, as I've said, I'd like to keep filtering and transformations as
separate features.
This distinction between transformations and filters is a good one to
make, and I also see them as two separate features. Your point on
performance is well taken. I would enjoy seeing numbers on the
performance of using regular expressions for this, however.
What are the disadvantages of this approach? What are the advantages of
the alternative?
The major thing I have against pre-filtering is that we are writing
real-world software. Speed is very important. Remember that users have
large files, and folders with lots of files to compare. And we all know
data sizes grow.
These are assumptions we have to make (and hopefully can agree on):
1) Regexps are slow; they are advanced string matching, and they just
can't compete with a plain compare.
2) Comparing is faster than matching: for a compare you only need to
find differing chars, while for matching you need to find exactly the
same chars. I mean, you only need to find one differing char to judge
two lines different, but you must check the whole line against the
matching rules to say whether it matches or not. OK, this is not obvious
and can be false too.
3) Most of the time, matching every line of the file is a waste of time
(as most lines are not differences).
4) The normal case is that most of the lines are not differences. I'd
say it is an exception to have files where more than 50% of the lines
are different.
5) Oh, and filtering only changes different lines into ignored
differences. Nothing more.
Think about it this way: we have two files, with lines like this:

File 1:
1 #include "statlink.h"
2 /**
3  * @brief About-dialog class.
4  */
5 class CAboutDlg : public CDialog

File 2:
1 #include "statlink.h"
2 /*
3  * @brief About-dialog class and something else.
4  */
5 class CAboutDlg : public CDialog
Lines 2 and 3 are different. Let's assume you have filters that match
lines beginning with /* and lines containing the word @brief. So your
filters would match lines 2 and 3.
The only changes filtering can make are to turn lines 2 and 3 into
ignored differences. Filtering cannot add differences to other lines. So
even trying to match the other lines is a waste of time.
So by comparing first, we compare 5 lines and then match 2 lines. By
matching first, we match 5 lines and then compare 5 lines. I'd bet the
first case is faster.
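In rough C++ the compare-first order could look like this (DiffBlock and
the other names are invented for the sketch; the real WinMerge
structures differ):

    #include <regex>
    #include <string>
    #include <vector>

    struct DiffBlock { int begin, end; };  // line range of one difference

    // Run the (slow) filter regexps only on lines that are already
    // inside a difference; with the files above, the three identical
    // lines never meet a regexp at all.
    bool IsIgnorableDiff(const std::vector<std::string> &lines,
                         const DiffBlock &diff,
                         const std::vector<std::regex> &filters)
    {
        for (int i = diff.begin; i <= diff.end; ++i)
        {
            bool matched = false;
            for (std::vector<std::regex>::size_type f = 0;
                 f < filters.size(); ++f)
            {
                if (std::regex_search(lines[i], filters[f]))
                {
                    matched = true;
                    break;
                }
            }
            if (!matched)
                return false;  // a changed line no filter matches: keep the diff
        }
        return true;           // every changed line matched: mark the diff ignored
    }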
I keep talking about speed, but speed is what users see in real life. It
is a big difference whether comparing two XML files takes 2 seconds or
10 seconds. (Just some numbers; I don't claim that is the actual
difference.) For a file compare we rescan (and so filter/transform) the
files pretty often: every time a difference is merged, and every time
the files are changed (if automatic rescan is enabled).
Sorry if this sounds like I'm repeating myself. Long day at work..
I think the choice between filters and transformations simply comes down
to the question, "How will WinMerge be used?" Most of the diffing I do
is from my source control, so I'm generally diffing files under 100KB,
and I can spare a couple of processor instructions for that! As you
mentioned, however, this might not scale to directory compares or to
large files, especially since there's a significant algorithmic
difference between the two approaches.
Filters, as they are implemented today, are more of a UI feature than a
comparison feature (except when exporting patches). This makes them
very easy to understand and use, but it also limits their usefulness.
Possible limitations of filters are the lack of support for intra-line
diffs and for moved-block detection, if that might be affected. (Please
correct me if I'm wrong.)

When I'm refactoring code, I often correct several variable names, and I
want to ignore those differences when reviewing my changes. However, I
don't want to simply ignore all lines with the new variable names, since
I might have introduced a typo in an inline comment. I need to see those
typos!
Rather than simply filtering the changes, I want to enter a regular
expression that describes the changes I made so that I don't see those
changes any more. This would still have the advantage of applying the
regular expressions to only part of the file, but it would require that
those sections be compared again.
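As a sketch of what I mean (the names are invented; this is not a
proposal for the actual code):

    #include <regex>
    #include <string>

    // If rewriting the old line with the user's rule reproduces the new
    // line exactly, the change is the rename I described and can be
    // hidden; anything else (like a typo in a comment) still shows up.
    bool ChangeExplainedBy(const std::string &oldLine,
                           const std::string &newLine,
                           const std::regex &pattern,
                           const std::string &replacement)
    {
        return std::regex_replace(oldLine, pattern, replacement) == newLine;
    }

For a variable rename from m_cnt to m_count, something like
ChangeExplainedBy(oldLine, newLine, std::regex("m_cnt"), "m_count")
would hide exactly those changes.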
To summarize, possible techniques are:
-Transform the entire file (useful for binary/XML/Word diffing)
-Transform text within changes (useful for ignoring only parts of
changes--Perry mentioned ignoring columns in a log file)
-Filter specific lines within changes (useful for ignoring specific
lines, as the current line filters do)
As mentioned on the forum, the cost of implementing these might
outweigh any benefits.
By the way, I'm curious what a common use case is for line filters.
Hope these ideas make sense.