Thread: [Winmerge-development] Wiki page for compare lib added
From: Kimmo V. <ki...@wi...> - 2006-10-17 17:53:21
I added a new Wiki page for the compare lib: http://winmerge.org/Wiki/index.php?title=Compare_Library

Feel free to add new content there if you have ideas and suggestions. But discussion should happen on this mailing list.

Regards,
Kimmo
From: Matthias M. <Blog@OutOfHanwell.com> - 2006-10-17 19:09:31
Kimmo Varis wrote:
> I added a new Wiki-page for compare lib:
> http://winmerge.org/Wiki/index.php?title=Compare_Library

I've been reading the wiki and through some of the source code, and it's been pretty interesting. I'd like to bring several questions up for discussion.

The first question is that of filtering, which seems to be pretty core to the design. (There's been some discussion at http://sourceforge.net/forum/message.php?msg_id=3940907 and http://sourceforge.net/forum/message.php?msg_id=3941584.)

The idea that elsapo advocated was that transformations should happen before the diff. If I understand correctly, a diff would consist of three steps:

1. Transform the file for diffing. The library would save deltas of insertions and deletions.
2. Diff the file.
3. Adjust diff positions for the transformation that happened in #1.

By doing this, the whole mess surrounding line endings could be kept out of the diff code. Case-insensitive compare could be done by converting the entire input string to lower case. Whitespace-agnostic compares could be done by collapsing whitespace. This would also allow multi-line filtering.

Additionally, I think the file needs to be transformed before diffing so that it can be displayed meaningfully. This would be used to diff Excel, Word, XML files, and generic binary files. Optionally, the file could be transformed back to the original file format to allow editing of binary data.

What are the disadvantages of this approach? What are the advantages of post-diff filtering?

Also, I think I would instinctively favor returning line+char positions to allow stringdiffs to be moved into the comparison library. That way, the GUI doesn't have to duplicate effort that the library could (I think) easily do.

Regards,
Matthias Miller
From: Kimmo V. <ki...@wi...> - 2006-10-17 22:15:42
Thanks for your comments.

Yes, the filtering seems to be something I really can't agree on with elsapo/Perry. But that's nothing new...

> I've been reading the wiki and through some of the source code, and it's
> been pretty interesting. I'd like to bring several questions up for
> discussion. The first question is that of filtering, which seems to be
> pretty core to the design. (There's been some discussion at
> http://sourceforge.net/forum/message.php?msg_id=3940907 and
> http://sourceforge.net/forum/message.php?msg_id=3941584.)
>
> The idea that elsapo advocated was that transformations should happen
> before the diff. If I understand correctly, a diff would consist of
> three steps:

Maybe it is best to first define what we mean when we talk about filtering. When I'm working with files in [any OS] I usually do the file matching using wildcards. Let's think of that as the simplest file filtering. If I want to see .txt files in a folder, I type $[command] *.txt. That filtering gives me a *subset* of the original set. I hope this kind of mental model works when we talk about filtering in WinMerge: we give the user a subset of the original set. We don't change the original set. And the subset must be inside the original set; it can't include something not in the original set.

Transformations are not filtering. If we alter the original set, then that is some other feature.

I know some people want to think of these as the same feature, but I don't. I'd like to keep filtering an easy-to-use, easy-to-approach feature. Think about the file matching example I wrote above. It is easy to think of filtering your results by some criteria (*.txt). But when you start thinking about transforming them, it easily gets a lot more complex. It is no trick to give a command like

$ cat file.txt | grep todo

to find lines having the word "todo". But it is a lot more complex to think about commands changing those "todo" words to, e.g., "done" words. (I won't even try to write the command for it now.)
> By doing this, the whole mess surrounding line endings could be kept out
> of the diff code. Case-insensitive compare could be done by converting
> the entire input string to lower case. Whitespace-agnostic compares
> could be done by collapsing whitespace. This would also allow multi-line
> filtering.

The EOL issue is a good point. But we don't need transformations to solve it. We always give line data to diffutils, so we can unify the EOL bytes before diffutils. Regexps don't understand different EOL styles. It is unfortunate that the current line filtering is implemented using the diffutils regexp, so solving this is not trivial. If we unify the regexps then it would be easy to solve. (And in fact WinMerge already stores the lines and their EOL bytes separately, but that separation doesn't go down to the diffutils level.)

Case-insensitivity and whitespace are very good points. I've never thought about doing them this way. But I'm afraid these are special cases too. They are pretty easy to hard-code and optimize, and since they are very common cases, I'd say we also want to optimize them. It is faster to do tolower() in the compare code than to go through some custom transforming code.

But if we do generalize this transformation idea, then we have something along the lines of the current plugins? What these transformations should and could do is something I haven't really thought through yet.

Anyway, as I've said, I'd like to keep filtering and transformations as separate features.

> What are the disadvantages of this approach? What are the advantages of
> post-diff filtering?

The major thing I have against pre-filtering is that we are building real-world software. Speed is very important. Remember that users have large files, and folders with lots of files to compare. And we all know data sizes grow with time.

These are assumptions we have to make (and hopefully can agree on):
1) Regexps are slow; they are advanced string matching, and they just can't be fast.
2) Compare is faster than matching: for compare you only need to find different chars; for matching you need to find exactly the same chars. I mean, you only need to find one differing char in the lines to judge them different, but you must check the whole line against the matching rules to say whether it matches or not. OK, this is not obvious and can be false too.
3) Most of the time, matching every line of the file is a waste of time (as most lines are not differences).
4) Assume the normal case is that most of the lines are not differences. I'd say it is an exception to have files where more than 50% of the lines are different.
5) Oh, and filtering only changes different lines into ignored differences. Nothing more.

Think about it this way: we have two files, with lines like this:

file a:
---
1  #include "statlink.h"
2  /**
3   * @brief About-dialog class.
4   */
5  class CAboutDlg : public CDialog
---

file b:
---
1  #include "statlink.h"
2  /***
3   * @brief About-dialog class and something else.
4   */
5  class CAboutDlg : public CDialog
---

Lines 2 and 3 are different. Let's assume you have filters to match lines beginning with /* and lines having the @brief word. So your filters would match lines 2 and 3.

The only changes filtering can make are to turn lines 2 and 3 into ignored differences. Filtering cannot add differences to other lines. So even trying to match the other lines is a waste of time.

So by comparing first, we compare 5 lines and then match 2 lines. By matching first, we match 5 lines and then compare 5 lines. I'd bet the first case is faster.

I keep talking about speed. But it is what users see in real life. It is a big difference whether comparing two XML files takes 2 seconds or 10 seconds. (Just some numbers; I don't claim that is the speed difference.) For file compare we rescan (and so filter/transform) the files pretty often: every time a difference is merged, and every time the files are changed (if automatic rescan is enabled).

Sorry if this sounds like I'm repeating myself. Long day at work..
Regards,
Kimmo
From: Matthias M. <Blog@OutOfHanwell.com> - 2006-10-19 04:01:52
Kimmo Varis wrote:
> Maybe it is best first to define what we think when we talk about
> filtering. [...] we give user a subset of original set. We don't change
> the original set.
>
> Transformations are not filtering. If we alter the original set, then it
> is some another feature.
> [...]
> Anyway, as I've said, I'd like to keep filtering and transformations as
> separate features.

This distinction between transformations and filters is a good one to make, and I also see them as two separate features. Your point on performance is well taken. I would enjoy seeing numbers regarding the performance of using regular expressions for this, however.

>> What are the disadvantages of this approach? What are the advantages of
>> post-diff filtering?
>
> Major thing I'm against pre-filtering is we are doing a real-world
> software. Speed is very important. [...]
> I keep talking about speed. But it is what users see in real life.

I think the choice between filters and transformations simply comes down to the question, "How will WinMerge be used?" Most of the diffing I do is from my source control, so I'm generally diffing <100KB, and I can spare a couple of processor instructions for that! As you mentioned, however, this might not scale to directory compares or to large files, especially since there's a significant algorithmic difference between the two approaches.

Filters, as they are implemented today, are more of a UI feature than a comparison feature (except when exporting patches). This makes them very easy to understand and use, but it also limits their usefulness. Possible limitations of filters are lack of support for intra-line diffs and for moved-block detection, if that might be affected. (Please correct me if I'm wrong.)

When I'm refactoring code, I often correct several variable names, and I want to ignore those differences when reviewing my changes. However, I don't want to simply ignore all lines with the new variable names, since I might have introduced a typo in an inline comment. I need to see those typos! Rather than simply filtering the changes, I want to enter a regular expression that describes the changes I made so that I don't see those changes any more. This would still have the advantage of applying the regular expressions to only part of the file, but it would require that those sections be compared again.

To summarize, the possible techniques are:
- Transform the entire file (useful for binary/XML/Word diffing)
- Transform text within changes (useful for ignoring only parts of changes; Perry mentioned ignoring columns in a log file)
- Filter specific lines within changes (useful for ignoring specific lines)

As mentioned on the forum, the cost of implementing these might outweigh any benefits.

By the way, I'm curious what a common use case is for line filters right now.

Hope these ideas make sense.

Regards,
Matthias Miller
From: Kimmo V. <ki...@wi...> - 2006-10-19 16:32:20
> This distinction between transformations and filters is a good one to
> make, and I also see them as two separate features. Your point on
> performance is well taken. I would enjoy seeing numbers regarding the
> performance of using regular expressions for this, however.

Yes, it would be good to get some real-life numbers. Though I don't know if they really mean anything, since users all have their own usages. So even if we can say that in a case or two filtering costs this much, it means nothing for other cases. The other way around, it would be good to have even a couple of simple test cases for this in the repository, to keep track of possible regressions in compare speed. It could be something simple like comparing two files with lots of differences.

> I think the choice between filters and transformations simply comes down
> to the question, "How will WinMerge be used?" Most of the diffing I do
> is from my source control, so I'm generally diffing <100KB, and I can
> spare a couple of processor instructions for that! As you mentioned,
> however, this might not scale to directory compares or to large files,
> especially since there's a significant algorithmic difference between
> the two approaches.

And this is really something I have to remind people of regularly. There are lots and lots of different ways to use WinMerge. We can't say what is the correct way to use WinMerge, or that one way is more correct than another. But we should try not to add restrictions on any usage. So we should be thinking more about not adding restrictions, and about adding new possibilities. This is why I have been so much against pre-filtering, as I see it easily adds a restriction (loss of speed) in some cases.

> Filters, as they are implemented today, are more of a UI feature than a
> comparison feature (except when exporting patches). This makes them very
> easy to understand and use, but it also limits their usefulness.

I assume you mean only line filters there. I personally could not use WinMerge anymore without file filters, as they let me compare and see only source files, not all kinds of other related files. But true, line filters are about making compare/merge easier. We can't really hide parts of files currently (maybe there could be collapsing blocks later). File filtering, on the other hand, is about selecting the files/folders to compare. We don't compare ignored files/folders at all, so the speed difference in compare can be huge.

One new idea about line filters: what if we allowed inclusive filtering for them? Just as with file filters you can define the files you want to see, how about letting the user define the differences he wants to see? It might not make much sense for code files, but for lots of data/log/XML files it really does. Think about XML: with a simple "<tag>" rule I could restrict WinMerge to show only differences where a certain tag is changed...

> Possible limitations of filters are lack of support for intra-line diffs
> and for moved-block detection, if that might be affected. (Please
> correct me if I'm wrong.)

You are correct. We can only ignore whole lines.

> When I'm refactoring code, I often correct several variable names, and I
> want to ignore those differences when reviewing my changes. However, I
> don't want to simply ignore all lines with the new variable names, since
> I might have introduced a typo in an inline comment. I need to see those
> typos! Rather than simply filtering the changes, I want to enter a
> regular expression that describes the changes I made so that I don't see
> those changes any more. This would still have the advantage of applying
> the regular expressions to only part of the file, but it would require
> that those sections be compared again.

I've thought about this a few times. But I don't know how to even start with it. If we want to hide only some changes within lines, we must hide those changes from the compare engine (diffutils). I was thinking about one way to do it: replacing the ignored texts with some placeholder text, "IGNOREME". If we replace the ignored texts in both files with the same text, diffutils does not see the difference. But then we need to show the files.. Stringdiffing sees the original files, so we'd need to filter the files again for stringdiffing. So there seem to be lots of problems with this approach.

> To summarize, the possible techniques are:
> - Transform the entire file (useful for binary/XML/Word diffing)
> - Transform text within changes (useful for ignoring only parts of
>   changes; Perry mentioned ignoring columns in a log file)
> - Filter specific lines within changes (useful for ignoring specific
>   lines)
>
> As mentioned on the forum, the cost of implementing these might outweigh
> any benefits.

We really can't know, since there are numerous different use cases. But we can't take the risk of making some other use cases considerably slower (think of large log files with few changes). I do see your point that whole-file (or pre-) filtering would be useful. Your list above is good: we really have several different areas where we want to apply filters (whole file, difference, line). So how about just adding this application area as a parameter/type of the filter? A filter definition would then have:
a) filter rule (regexp)
b) application area (file, difference, line)
Each filter type would be handled in a different compare phase. But we can handle that, I hope.

> By the way, I'm curious what a common use case is for line filters right
> now.

Good question. I really can't say. But if I have to say something, I'd say it is simple filtering of comments, CVS tags etc. Complex filters are hard to write and easily give wrong results.

Anyway, I think this shows we still really have one *filtering* case, where we filter lines out of diffs. The other two cases (filtering whole files and transforming lines) are more like transformations using filter rules (=regexps). Filtering uses regexps, but that does not mean everything using regexps is about filtering. :)

Regards,
Kimmo