Kimmo Varis wrote:
Thanks for your comments.

Yes, filtering seems to be something I really can't agree on with 
elsapo/Perry. But that's nothing new...

I've been reading the wiki and looking through some of the source code, 
and it's been pretty interesting. I'd like to bring several questions up for 
discussion. The first question is that of filtering, which seems to be 
pretty core to the design. (There's been some discussion at 
http://sourceforge.net/forum/message.php?msg_id=3940907 and 
http://sourceforge.net/forum/message.php?msg_id=3941584.)

The idea that elsapo advocated was that transformations should happen 
before the diff. If I understand correctly, a diff would consist of 
three steps:

Maybe it is best to first define what we mean when we talk about 
filtering. When I'm working with files in [any OS] I usually do the file 
matching using wildcards. Let's think of that as simple file filtering. 
If I want to see the .txt files in a folder, I type $[command] *.txt. So 
that filtering gives me a *subset* of the original set. I hope this kind 
of mental model is good when we talk about filtering in WinMerge: we 
give the user a subset of the original set. We don't change the original 
set. And the subset must be inside the original set; it can't include 
anything not in the original set.

Transformations are not filtering. If we alter the original set, then it 
is some other feature.

I know some people want to think of these as the same feature, but I 
don't. I'd like to keep filtering an easy-to-use, easy-to-approach 
feature. Think about the file matching example I wrote above. It is easy 
to think of filtering your results by some criteria (*.txt). But when 
you start thinking about transforming them, it easily gets a lot more 
complex. It is no trick to give a command like
  $cat file.txt | grep todo
to find lines containing the word "todo". But it is a lot more complex 
to think about commands changing those "todo" words to e.g. "done" 
words. (I won't even try to write the command for that now.)


By doing this, the whole mess surrounding line endings could be kept out 
of the diff code. Case-insensitive compare could be done by converting 
the entire input string to lower case. Whitespace-agnostic compares 
could be done by collapsing whitespace. This would also allow multi-line 
filtering.

The EOL issue is a good point. But we don't need transformations to 
solve it. We always give line data to diffutils, so we can unify the EOL 
bytes before diffutils. Regexps don't understand different EOL styles. 
It is unfortunate that the current line filtering is implemented using 
the diffutils regexp code, so doing this is not trivial. If we unify the 
regexps then this would be easy to solve. (And in fact WinMerge already 
stores the lines and their EOL bytes separately, but that separation 
doesn't go down to the diffutils level.)
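
To sketch the kind of unification I mean (just a sketch; the helper name 
is made up, and I'm assuming lines arrive as std::strings with their EOL 
bytes still attached):

  #include <string>

  // Replace any trailing "\r\n", "\r" or "\n" with a plain '\n' so that
  // diffutils and the regexps never see mixed EOL styles.
  std::string UnifyEol(std::string line)
  {
      std::string::size_type len = line.length();
      while (len > 0 && (line[len - 1] == '\r' || line[len - 1] == '\n'))
          --len;
      line.resize(len);
      line += '\n';
      return line;
  }

Something like this would run once per line on the way into diffutils.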

Case-insensitivity and whitespace are very good points. I've never 
thought about doing them this way. But I'm afraid these are special 
cases too. They are pretty easy to hard-code and optimize, and since 
they are very common cases I'd say we also want to optimize them. It is 
faster to do tolower() in the compare code than to go through some 
custom transforming code.
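
Roughly what I have in mind, as a sketch (not the actual compare code):

  #include <cctype>

  // Case-insensitive "are these lines different?" done inline in the
  // compare loop: we can stop at the first differing character and we
  // never build lower-cased copies of the lines.
  bool LinesDiffer(const char *s1, const char *s2)
  {
      while (*s1 != '\0' && *s2 != '\0')
      {
          if (std::tolower((unsigned char)*s1) !=
              std::tolower((unsigned char)*s2))
              return true;
          ++s1;
          ++s2;
      }
      return *s1 != *s2; // lines of different length differ
  }

A pre-diff transformation pass would instead lower-case whole copies of 
both files before the compare even starts, touching every character 
whether or not the lines differ.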

But if we do generalize this transformation idea, then we have something 
along the lines of the current plugins? What these transformations 
should and could do is something I haven't really thought about yet.

Anyway, as I've said, I'd like to keep filtering and transformations as 
separate features.

This distinction between transformations and filters is a good one to 
make, and I also see them as two separate features. Your point about 
performance is well taken. I would enjoy seeing numbers on the 
performance of using regular expressions for this, however.


What are the disadvantages of this approach? What are the advantages of 
post-diff filtering?

The major reason I'm against pre-filtering is that we are writing 
real-world software. Speed is very important. Remember that users have 
large files and folders with lots of files to compare. And we all know 
data sizes grow with time.

These are assumptions we have to make (and hopefully can agree on):
1) Regexps are slow; they are advanced string matching, and they just 
can't be fast.
2) Comparing is faster than matching: for a compare you only need to 
find differing chars, while for matching you need to find exactly the 
same chars. I mean, you need to find only one differing char in the 
lines to judge them different, but you must check the whole line against 
the matching rules to say whether it matches or not. OK, this is not 
obvious and can be false too.
3) Most of the time, matching every line of the file is a waste of time 
(as most lines are not differences).
4) Assume the normal case is that most of the lines are not differences. 
I'd say it is an exception to have files where more than 50% of the 
lines are different.
5) Oh, and filtering only changes different lines into ignored 
differences. Nothing more.

Think about it this way: we have two files with lines like this:
file a:
---
1 #include "statlink.h"
2 /**
3  * @brief About-dialog class.
4  */
5 class CAboutDlg : public CDialog
---
file b:
---
1 #include "statlink.h"
2 /***
3  * @brief About-dialog class and something else.
4  */
5 class CAboutDlg : public CDialog
---

Lines 2 and 3 are different. Let's assume you have filters that match 
lines beginning with /* and lines containing the word @brief. So your 
filters would match lines 2 and 3.

The only change filtering can make is to turn lines 2 and 3 into ignored 
differences. Filtering cannot add differences to other lines. So even 
trying to match the other lines is a waste of time.

So by comparing first, we compare 5 lines and then match 2 lines. By 
matching first, we match 5 lines and then compare 5 lines. I'd bet the 
first case is faster.
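
In code the post-diff order I'm arguing for looks something like this (a 
sketch only; the Difference struct and the filter callback are invented 
for the example, not WinMerge code):

  #include <string>
  #include <vector>

  struct Difference { int begin, end; bool ignored; };

  // Run the line-filter regexps only on lines that are already inside
  // differences; unchanged lines are never matched at all.
  void MarkIgnoredDiffs(std::vector<Difference> &diffs,
                        const std::vector<std::string> &lines,
                        bool (*matchesAnyFilter)(const std::string &))
  {
      for (std::vector<Difference>::size_type i = 0; i < diffs.size(); ++i)
      {
          bool allMatch = true;
          for (int ln = diffs[i].begin; ln <= diffs[i].end; ++ln)
              if (!matchesAnyFilter(lines[ln])) // regexp cost paid here only
                  allMatch = false;
          if (allMatch)
              diffs[i].ignored = true; // demote to an ignored difference
      }
  }

In the example above this would run the regexps against lines 2 and 3 
only; the other three lines never see a regexp.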

I keep talking about speed, but speed is what users see in real life. It 
is a big difference whether comparing two XML files takes 2 seconds or 
10 seconds. (Just some numbers; I don't claim that is the actual speed 
difference.) For a file compare we rescan (and so filter/transform) the 
files pretty often: every time a difference is merged, and every time 
the files are changed (if automatic rescan is enabled).

Sorry if this sounds like I'm repeating myself. Long day at work..

Regards,
Kimmo

I think the choice between filters and transformations simply comes down 
to the question, "How will WinMerge be used?" Most of the diffing I do 
is from my source control, so I'm generally diffing files under 100 KB, 
and I can spare a couple of processor instructions for that! As you 
mentioned, however, this might not scale to directory compares or to 
large files, especially since there's a significant algorithmic 
difference between the two approaches.

Filters, as they are implemented today, are more of a UI feature than a 
comparison feature (except when exporting patches). This makes them very 
easy to understand and use, but it also limits their usefulness.

Possible limitations of filters are the lack of support for intra-line 
diffs and for moved-block detection, if that might be affected. (Please 
correct me if I'm wrong.) When I'm refactoring code, I often correct 
several variable names, and I want to ignore those differences when 
reviewing my changes. However, I don't want to simply ignore all lines 
with the new variable names, since I might have introduced a typo in an 
inline comment. I need to see those typos! Rather than simply filtering 
the changes, I want to enter a regular expression that describes the 
changes I made so that I don't see those changes any more. This would 
still have the advantage of applying the regular expressions to only 
part of the file, but it would require that those sections be compared 
again.
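
As a rough sketch of what I mean (the helper name is mine, and I'm using 
std::regex purely for illustration):

  #include <regex>
  #include <string>

  // Apply the user's "this is the change I made" pattern to both sides
  // of a difference and re-compare. Only lines inside differences are
  // transformed, so the regexp never touches the unchanged bulk of the
  // file.
  bool ChangeIsExplained(const std::string &left, const std::string &right,
                         const std::regex &pattern, const std::string &repl)
  {
      return std::regex_replace(left, pattern, repl)
          == std::regex_replace(right, pattern, repl);
  }

For the variable-rename case that would be something like 
ChangeIsExplained(oldLine, newLine, std::regex("m_oldName"), 
"m_newName"): renamed lines compare equal and disappear, while a typo in 
a comment still shows up as a real difference.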

To summarize, possible techniques are:
-Transform the entire file (useful for binary/XML/Word diffing)
-Transform text within changes (useful for ignoring only parts of changes--Perry mentioned ignoring columns in a log file)
-Filter specific lines within changes (useful for ignoring specific lines)

As mentioned on the forum, the cost of implementing these might outweigh any benefits.

By the way, I'm curious what a common use case is for line filters right now.

Hope these ideas make sense.

Regards,
Matthias Miller