Thread: [Winmerge-development] Wiki page for compare lib added
From: Kimmo V. <ki...@wi...> - 2006-10-17 17:53:21
I added a new Wiki page for the compare lib: http://winmerge.org/Wiki/index.php?title=Compare_Library

Feel free to add new content there if you have ideas and suggestions. But discussion should happen on this mailing list.

Regards,
Kimmo
From: Matthias M. <Blog@OutOfHanwell.com> - 2006-10-17 19:09:31
Kimmo Varis wrote:
> I added a new Wiki-page for compare lib:
> http://winmerge.org/Wiki/index.php?title=Compare_Library

I've been reading the wiki and through some of the source code, and it's been pretty interesting. I'd like to bring several questions up for discussion.

The first question is that of filtering, which seems to be pretty core to the design. (There's been some discussion at http://sourceforge.net/forum/message.php?msg_id=3940907 and http://sourceforge.net/forum/message.php?msg_id=3941584.)

The idea that elsapo advocated was that transformations should happen before the diff. If I understand correctly, a diff would consist of three steps:

1. Transform the file for diffing. The library would save deltas of insertions and deletions.
2. Diff the file.
3. Adjust diff positions for the transformation that happened in #1.

By doing this, the whole mess surrounding line endings could be kept out of the diff code. Case-insensitive compare could be done by converting the entire input string to lower case. Whitespace-agnostic compares could be done by collapsing whitespace. This would also allow multi-line filtering.

Additionally, I think the file needs to be transformed before diffing so that it can be displayed meaningfully. This would be used to diff Excel, Word, XML files, and generic binary files. Optionally, the file could be transformed back to the original file format to allow editing of binary data.

What are the disadvantages of this approach? What are the advantages of post-diff filtering?

Also, I think I would instinctively favor returning line+char positions to allow stringdiffs to be moved into the comparison library. That way, the GUI doesn't have to duplicate effort that the library could (I think) easily do.

Regards,
Matthias Miller
From: Kimmo V. <ki...@wi...> - 2006-10-17 22:15:42
Thanks for your comments.

Yes, the filtering seems to be something I really can't agree on with elsapo/Perry. But that's nothing new...

> I've been reading the wiki and through some of the source code, and it's
> been pretty interesting. I'd like to bring several questions up for
> discussion. The first question is that of filtering, which seems to be
> pretty core to the design. (There's been some discussion at
> http://sourceforge.net/forum/message.php?msg_id=3940907 and
> http://sourceforge.net/forum/message.php?msg_id=3941584.)
>
> The idea that elsapo advocated was that transformations should happen
> before the diff. If I understand correctly, a diff would consist of
> three steps:

Maybe it is best to first define what we mean when we talk about filtering. When I'm working with files in [any OS] I usually do the file matching using wildcards. Let's think of that as the simplest file filtering. If I want to see .txt files in a folder, I type $[command] *.txt. That filtering gives me a *subset* of the original set. I hope this kind of mental model works when we talk about filtering in WinMerge: we give the user a subset of the original set. We don't change the original set. And the subset must be inside the original set; it can't include something not in the original set.

Transformations are not filtering. If we alter the original set, then that is some other feature.

I know some people want to think of these as the same feature, but I don't. I'd like to keep filtering an easy-to-use, easy-to-approach feature. Think about the file matching example I wrote above. It is easy to think of filtering your results by some criteria (*.txt). But when you start thinking about transforming them, it easily gets a lot more complex. It is no trick to give a command like

$ cat file.txt | grep todo

to find lines having the word "todo". But it is a lot more complex to think about commands changing those "todo" words to, e.g., "done" words. (I won't even try to write the command for it now.)
> By doing this, the whole mess surrounding line endings could be kept out
> of the diff code. Case-insensitive compare could be done by converting
> the entire input string to lower case. Whitespace-agnostic compares
> could be done by collapsing whitespace. This would also allow multi-line
> filtering.

The EOL issue is a good point. But we don't need transformations to solve it. We always give line data to diffutils, so we can unify the EOL bytes before diffutils. Regexps don't understand different EOL styles. It is unfortunate that the current line filtering is implemented using the diffutils regexp, so solving this is not trivial. If we unify the regexps then it would be easy to solve. (And in fact WinMerge already stores the lines and their EOL bytes separately, but that separation doesn't go down to the diffutils level.)

Case-insensitivity and whitespace are very good points. I've never thought about doing them this way. But I'm afraid these are special cases too. They are pretty easy to hard-code and optimize, and since they are very common cases, I'd say we also want to optimize them. It is faster to do tolower() in the compare code than to go through some custom transforming code.

But if we do generalize this transformation idea, then we have something along the lines of the current plugins? What these transformations should and could do is something I haven't really thought through yet.

Anyway, as I've said, I'd like to keep filtering and transformations as separate features.

> What are the disadvantages of this approach? What are the advantages of
> post-diff filtering?

The major thing I have against pre-filtering is that we are building real-world software. Speed is very important. Remember that users have large files, and folders with lots of files to compare. And we all know data sizes grow with time.

These are assumptions we have to make (and hopefully can agree on):
1) Regexps are slow; they are advanced string matching, and they just can't be fast.
2) Compare is faster than matching: for compare you only need to find different chars; for matching you need to find exactly the same chars. I mean, you only need to find one differing char in the lines to judge them different, but you must check the whole line against the matching rules to say whether it matches or not. OK, this is not obvious and can be false too.
3) Most of the time, matching every line of the file is a waste of time (as most lines are not differences).
4) Assume the normal case is that most of the lines are not differences. I'd say it is an exception to have files where more than 50% of the lines are different.
5) Oh, and filtering only changes different lines into ignored differences. Nothing more.

Think about it this way: we have two files, with lines like this:

file a:
---
1  #include "statlink.h"
2  /**
3   * @brief About-dialog class.
4   */
5  class CAboutDlg : public CDialog
---

file b:
---
1  #include "statlink.h"
2  /***
3   * @brief About-dialog class and something else.
4   */
5  class CAboutDlg : public CDialog
---

Lines 2 and 3 are different. Let's assume you have filters to match lines beginning with /* and lines having the @brief word. So your filters would match lines 2 and 3.

The only changes filtering can make are to turn lines 2 and 3 into ignored differences. Filtering cannot add differences to other lines. So even trying to match the other lines is a waste of time.

So by comparing first, we compare 5 lines and then match 2 lines. By matching first, we match 5 lines and then compare 5 lines. I'd bet the first case is faster.

I keep talking about speed. But it is what users see in real life. It is a big difference whether comparing two XML files takes 2 seconds or 10 seconds. (Just some numbers; I don't claim that is the speed difference.) For file compare we rescan (and so filter/transform) the files pretty often: every time a difference is merged, and every time the files are changed (if automatic rescan is enabled).

Sorry if this sounds like I'm repeating myself. Long day at work..
Regards,
Kimmo
From: Matthias M. <Blog@OutOfHanwell.com> - 2006-10-19 04:01:52
Kimmo Varis wrote:
> Maybe it is best first to define what we think when we talk about
> filtering. [...] we give user a subset of original set. We don't change
> the original set.
>
> Transformations are not filtering. If we alter the original set, then it
> is some another feature.
> [...]
> Anyway, as I've said, I'd like to keep filtering and transformations as
> separate features.

This distinction between transformations and filters is a good one to make, and I also see them as two separate features. Your point on performance is well taken. I would enjoy seeing numbers regarding the performance of using regular expressions for this, however.

>> What are the disadvantages of this approach? What are the advantages of
>> post-diff filtering?
>
> Major thing I'm against pre-filtering is we are doing a real-world
> software. Speed is very important. [...]
> I keep talking about speed. But it is what users see in real life.

I think the choice between filters and transformations simply comes down to the question, "How will WinMerge be used?" Most of the diffing I do is from my source control, so I'm generally diffing <100KB, and I can spare a couple of processor instructions for that! As you mentioned, however, this might not scale to directory compares or to large files, especially since there's a significant algorithmic difference between the two approaches.

Filters, as they are implemented today, are more of a UI feature than a comparison feature (except when exporting patches). This makes them very easy to understand and use, but it also limits their usefulness. Possible limitations of filters are lack of support for intra-line diffs and for moved-block detection, if that might be affected. (Please correct me if I'm wrong.)

When I'm refactoring code, I often correct several variable names, and I want to ignore those differences when reviewing my changes. However, I don't want to simply ignore all lines with the new variable names, since I might have introduced a typo in an inline comment. I need to see those typos! Rather than simply filtering the changes, I want to enter a regular expression that describes the changes I made so that I don't see those changes any more. This would still have the advantage of applying the regular expressions to only part of the file, but it would require that those sections be compared again.

To summarize, the possible techniques are:
- Transform the entire file (useful for binary/XML/Word diffing)
- Transform text within changes (useful for ignoring only parts of changes; Perry mentioned ignoring columns in a log file)
- Filter specific lines within changes (useful for ignoring specific lines)

As mentioned on the forum, the cost of implementing these might outweigh any benefits.

By the way, I'm curious what a common use case is for line filters right now.

Hope these ideas make sense.

Regards,
Matthias Miller
From: Kimmo V. <ki...@wi...> - 2006-10-19 16:32:20
> This distinction between transformations and filters is a good one to
> make, and I also see them as two separate features. Your point on
> performance is well taken. I would enjoy seeing numbers regarding the
> performance of using regular expressions for this, however.

Yes, it would be good to get some real-life numbers. Though I don't know if they really mean anything, since users all have their own usages. So even if we can say that in a case or two filtering costs this much, it means nothing for other cases. The other way around, it would be good to have even a couple of simple test cases for this in the repository, to keep track of possible regressions in compare speed. It could be something simple like comparing two files with lots of differences.

> I think the choice between filters and transformations simply comes down
> to the question, "How will WinMerge be used?" Most of the diffing I do
> is from my source control, so I'm generally diffing <100KB, and I can
> spare a couple of processor instructions for that! As you mentioned,
> however, this might not scale to directory compares or to large files,
> especially since there's a significant algorithmic difference between
> the two approaches.

And this is really something I have to remind people of regularly. There are lots and lots of different ways to use WinMerge. We can't say what is the correct way to use WinMerge, or that one way is more correct than another. But we should try not to add restrictions on any usage. So we should be thinking more about not adding restrictions, and about adding new possibilities. This is why I have been so much against pre-filtering, as I see it easily adds a restriction (loss of speed) in some cases.

> Filters, as they are implemented today, are more of a UI feature than a
> comparison feature (except when exporting patches). This makes them very
> easy to understand and use, but it also limits their usefulness.

I assume you mean only line filters there. I personally could not use WinMerge anymore without file filters, as they let me compare and see only source files, not all kinds of other related files. But true, line filters are about making compare/merge easier. We can't really hide parts of files currently (maybe there could be collapsing blocks later). File filtering, on the other hand, is about selecting the files/folders to compare. We don't compare ignored files/folders at all, so the speed difference in compare can be huge.

One new idea about line filters: what if we allowed inclusive filtering for them? Just as with file filters you can define the files you want to see, how about letting the user define the differences he wants to see? It might not make much sense for code files, but for lots of data/log/XML files it really does. Think about XML: with a simple "<tag>" rule I could restrict WinMerge to show only differences where a certain tag is changed...

> Possible limitations of filters are lack of support for intra-line diffs
> and for moved-block detection, if that might be affected. (Please
> correct me if I'm wrong.)

You are correct. We can only ignore whole lines.

> When I'm refactoring code, I often correct several variable names, and I
> want to ignore those differences when reviewing my changes. However, I
> don't want to simply ignore all lines with the new variable names, since
> I might have introduced a typo in an inline comment. I need to see those
> typos! Rather than simply filtering the changes, I want to enter a
> regular expression that describes the changes I made so that I don't see
> those changes any more. This would still have the advantage of applying
> the regular expressions to only part of the file, but it would require
> that those sections be compared again.

I've thought about this a few times. But I don't know how to even start with it. If we want to hide only some changes within lines, we must hide those changes from the compare engine (diffutils). I was thinking about one way to do it: replacing the ignored texts with some placeholder text, "IGNOREME". If we replace the ignored texts in both files with the same text, diffutils does not see the difference. But then we need to show the files.. Stringdiffing sees the original files, so we'd need to filter the files again for stringdiffing. So there seem to be lots of problems with this approach.

> To summarize, the possible techniques are:
> - Transform the entire file (useful for binary/XML/Word diffing)
> - Transform text within changes (useful for ignoring only parts of
>   changes; Perry mentioned ignoring columns in a log file)
> - Filter specific lines within changes (useful for ignoring specific
>   lines)
>
> As mentioned on the forum, the cost of implementing these might outweigh
> any benefits.

We really can't know, since there are numerous different use cases. But we can't take the risk of making some other use cases considerably slower (think of large log files with few changes). I do see your point that whole-file (or pre-) filtering would be useful. Your list above is good: we really have several different areas where we want to apply filters (whole file, difference, line). So how about just adding this application area as a parameter/type of the filter? A filter definition would then have:
a) filter rule (regexp)
b) application area (file, difference, line)
Each filter type would be handled in a different compare phase. But we can handle that, I hope.

> By the way, I'm curious what a common use case is for line filters right
> now.

Good question. I really can't say. But if I have to say something, I'd say it is simple filtering of comments, CVS tags etc. Complex filters are hard to write and easily give wrong results.

Anyway, I think this shows we still really have one *filtering* case, where we filter lines out of diffs. The other two cases (filtering whole files and transforming lines) are more like transformations using filter rules (=regexps). Filtering uses regexps, but that does not mean everything using regexps is about filtering. :)

Regards,
Kimmo