Re: [Winmerge-development] Filtering
From: Matthias M. <Blog@OutOfHanwell.com> - 2006-10-19 04:01:52
Kimmo Varis wrote:
> Thanks for your comments.
>
> Yes, the filtering seems to be something I really can't agree on with
> elsapo/Perry. But that's nothing new...
>
>> I've been reading the wiki and through some of the source code, and it's
>> been pretty interesting. I'd like to bring several questions up for
>> discussion. The first question is that of filtering, which seems to be
>> pretty core to the design. (There's been some discussion at
>> http://sourceforge.net/forum/message.php?msg_id=3940907 and
>> http://sourceforge.net/forum/message.php?msg_id=3941584.)
>>
>> The idea that elsapo advocated was that transformations should happen
>> before the diff. If I understand correctly, a diff would consist of
>> three steps:
>
> Maybe it is best to first define what we mean when we talk about
> filtering. When I'm working with files in [any OS] I usually do the file
> matching using wildcards. Let's think of that as simple file filtering.
> If I want to see .txt files in a folder, I type $[command] *.txt. So
> that filtering gives me a *subset* of the original set. I hope this kind
> of mental model works when we talk about filtering in WinMerge: we give
> the user a subset of the original set. We don't change the original set,
> and the subset must be inside the original set; it can't include
> anything that is not in the original set.
>
> Transformations are not filtering. If we alter the original set, then it
> is some other feature.
>
> I know some people want to think of these as the same feature, but I
> don't. I'd like to keep filtering an easy-to-use and easy-to-approach
> feature. Think about the file matching example I wrote above. It is easy
> to think of filtering your results by some criteria (*.txt). But when
> you start thinking about transforming them, it easily gets a lot more
> complex. It is no trick to give a command like
> $cat file.txt | grep todo
> to find lines having the word todo. But it is a lot more complex to
> think about commands changing those todo-words to e.g. done-words. (I
> don't even try to write the command for it now.)
>
>> By doing this, the whole mess surrounding line endings could be kept out
>> of the diff code. Case-insensitive compare could be done by converting
>> the entire input string to lower case. Whitespace-agnostic compares
>> could be done by collapsing whitespace. This would also allow multi-line
>> filtering.
>
> The EOL issue is a good point. But we don't need transformations to
> solve it. We always give line data to diffutils, so we can unify the EOL
> bytes before diffutils. Regexps don't understand different EOL styles.
> It is unfortunate that the current line filtering is implemented using
> the diffutils regexp, so doing this is not trivial. If we unify the
> regexps then this would be easy to solve. (In fact WinMerge already
> stores the lines and their EOL bytes separately, but that separation
> doesn't reach the diffutils level.)
>
> Case-insensitivity and whitespace are very good points. I've never
> thought about doing them this way. But I'm afraid these are special
> cases too. They are pretty easy to hard-code and optimize. Since these
> are very common cases, I'd say we also want to optimize them. It is
> faster to do tolower() in the compare code than to go through some
> custom transforming code.
>
> But if we do generalize this transformation idea, then we have something
> along the lines of the current plugins? What these transformations
> should and could do is something I haven't really thought about yet.
>
> Anyway, as I've said, I'd like to keep filtering and transformations as
> separate features.

This distinction between transformations and filters is a good one to
make, and I also see them as two separate features. Your point on
performance is well taken. I would enjoy seeing numbers regarding the
performance of using regular expressions for this, however.

>> What are the disadvantages of this approach? What are the advantages of
>> post-diff filtering?
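Kimmo's point above, that EOL bytes can simply be unified (and case folded, like a hard-coded tolower()) before lines ever reach the compare, can be sketched roughly like this. This is a hypothetical Python illustration, not WinMerge code; the function name and file contents are made up:

```python
def normalize_line(line: str, ignore_case: bool = False) -> str:
    """Unify EOL bytes before comparing: strip CRLF/CR/LF so that
    files differing only in line-ending style compare as identical."""
    line = line.rstrip("\r\n")
    if ignore_case:
        # Analogous to doing tolower() directly in the compare code.
        line = line.casefold()
    return line

# Same content, different EOL styles (Windows vs. Unix).
windows_file = ['#include "statlink.h"\r\n', "/**\r\n"]
unix_file    = ['#include "statlink.h"\n',   "/**\n"]

same = all(normalize_line(a) == normalize_line(b)
           for a, b in zip(windows_file, unix_file))
print(same)  # the EOL difference disappears -> True
```

Because the normalization is done once per line before the compare, the diff engine itself never needs to know about EOL styles or case rules.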
> The major thing I have against pre-filtering is that we are building
> real-world software. Speed is very important. Remember that users have
> large files, and folders with lots of files to compare. And we all know
> data sizes grow with time.
>
> These are assumptions we have to make (and hopefully can agree on):
> 1) Regexps are slow; they are advanced string matching, they just can't
> be fast.
> 2) Comparing is faster than matching. For a compare you only need to
> find differing chars; for matching you need to find exactly the same
> chars. I mean, you only need to find one differing char to judge two
> lines different, but you must check the whole line against the matching
> rules to say whether it matches or not. (OK, this is not obvious and
> could be false too.)
> 3) Most of the time, matching every line of the file is a waste of time
> (as most lines are not differences).
> 4) Assume the normal case is that most of the lines are not
> differences. I'd say it is an exception to have files where more than
> 50% of the lines are different.
> 5) Oh, and filtering only changes different lines into ignored
> differences. Nothing more.
>
> Think about it this way: we have two files, having lines like this:
> file a:
> ---
> 1 #include "statlink.h"
> 2 /**
> 3 * @brief About-dialog class.
> 4 */
> 5 class CAboutDlg : public CDialog
> ---
> file b:
> ---
> 1 #include "statlink.h"
> 2 /***
> 3 * @brief About-dialog class and something else.
> 4 */
> 5 class CAboutDlg : public CDialog
> ---
>
> Lines 2 and 3 are different. Let's assume you have filters matching
> lines beginning with /* and lines containing the word @brief. So your
> filters would match lines 2 and 3.
>
> The only change filtering can make is to turn lines 2 and 3 into
> ignored differences. Filtering cannot add differences to other lines.
> So even trying to match the other lines is a waste of time.
>
> So by comparing first, we first compare 5 lines and then match 2 lines.
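Kimmo's compare-first ordering on his five-line example can be sketched as below. This is a hypothetical Python illustration (not WinMerge's actual code), and it pairs lines positionally with zip for simplicity, whereas a real diff would align lines first:

```python
import re

# Hypothetical line filters, as in Kimmo's example: lines beginning
# with /* and lines containing the word @brief.
FILTERS = [re.compile(r"^\s*/\*"), re.compile(r"@brief")]

def classify(file_a, file_b):
    """Compare first; run the (slow) regexes only on differing lines."""
    result = []
    for a, b in zip(file_a, file_b):
        if a == b:
            result.append("same")        # cheap compare, no regex cost
        elif all(any(f.search(s) for f in FILTERS) for s in (a, b)):
            result.append("ignored")     # a difference, but filtered out
        else:
            result.append("different")
    return result

file_a = ['#include "statlink.h"', "/**",
          " * @brief About-dialog class.",
          " */", "class CAboutDlg : public CDialog"]
file_b = ['#include "statlink.h"', "/***",
          " * @brief About-dialog class and something else.",
          " */", "class CAboutDlg : public CDialog"]

print(classify(file_a, file_b))
# -> ['same', 'ignored', 'ignored', 'same', 'same']
```

Only the two differing lines ever reach the regex engine; the other three are settled by the plain (fast) comparison, which is exactly the cost argument being made.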
> And by matching first, we first match 5 lines and then compare 5 lines.
> I'd bet the first case is faster.
>
> I keep talking about speed, but it is what users see in real life. It
> is a big difference whether comparing two XML files takes 2 seconds or
> 10 seconds. (Just some numbers; I don't claim that is the actual speed
> difference.) For file compare we rescan (and so filter/transform) the
> files pretty often: every time a difference is merged, and every time
> the files are changed (if automatic rescan is enabled).
>
> Sorry if this sounds like I'm repeating myself. Long day at work..
>
> Regards,
> Kimmo

I think the choice between filters and transformations simply comes down
to the question, "How will WinMerge be used?" Most of the diffing I do
is from my source control, so I'm generally diffing <100KB and I can
spare a couple of processor instructions for that! As you mentioned,
however, this might not scale to directory compares or to large files,
especially since there's a significant algorithmic difference between
the two approaches.

Filters, as they are implemented today, are more of a UI feature than a
comparison feature (except when exporting patches). This makes them very
easy to understand and use, but it also limits their usefulness.
Possible limitations of filters are the lack of support for intra-line
diffs and for moved-block detection, if that might be affected. (Please
correct me if I'm wrong.)

When I'm refactoring code, I often correct several variable names, and I
want to ignore those differences when reviewing my changes. However, I
don't want to simply ignore all lines with the new variable names, since
I might have introduced a typo in an inline comment. I need to see those
typos! Rather than simply filtering the changes, I want to enter a
regular expression that describes the changes I made so that I don't see
those changes any more.
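The refactoring scenario described above could work roughly like this. A hypothetical sketch only (the rename, variable names, and helper are all made up for illustration): a substitution regex describes the rename, it is applied to the changed line, and the result is re-compared, so only differences that are *not* the rename remain visible:

```python
import re

# Hypothetical change description: the variable was renamed from
# oldCount to newCount during the refactoring.
RENAME = (re.compile(r"\bnewCount\b"), "oldCount")

def still_different(old_line: str, new_line: str) -> bool:
    """Undo the described change on the new line, then re-compare.
    Differences that are exactly the rename disappear; anything else
    (e.g. a typo in a comment) is still reported."""
    pattern, replacement = RENAME
    return pattern.sub(replacement, new_line) != old_line

# The pure rename is hidden...
assert not still_different("int oldCount = 0;", "int newCount = 0;")
# ...but a typo introduced elsewhere on a changed line still shows.
assert still_different("// count of items", "// cuont of items")
```

This matches the trade-off noted below: the regex only runs against changed regions, but those regions have to be compared a second time after the substitution.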
This would still have the advantage of applying the regular expressions
to only part of the file, but it would require that those sections be
compared again.

To summarize, possible techniques are:
- Transform the entire file (useful for binary/XML/Word diffing)
- Transform text within changes (useful for ignoring only parts of
  changes; Perry mentioned ignoring columns in a log file)
- Filter specific lines within changes (useful for ignoring specific
  lines)

As mentioned on the forum, the cost of implementing these might outweigh
any benefits. By the way, I'm curious what a common use case is for line
filters right now.

Hope these ideas make sense.

Regards,
Matthias Miller