#216 Partial loading/compare large files

Status: open
Owner: nobody
Labels: None
Priority: 5
Updated: 2004-01-23
Created: 2003-11-20
Private: No

For a large program/database conversion project,
we have large text-files (export from old database,
for import to new). To compare the difference between
export/import text files, I tried winmerge. But it is
very slow!
Directory compare: slow, can also be done by file
size/date?
File compare: slow, whole file is compared.

Because we have many large text files (20 - 50mb),
and it does a complete thorough compare (which is
not intelligent and unnecessary), it is very slow.

Please add some better handling, for example: partial
loading/compare for large files.

Discussion

  • André Mussche

    André Mussche - 2003-11-20
    • summary: Parial loading/compare large files --> Partial loading/compare large files
     
  • Anonymous - 2003-11-20

    Logged In: YES
    user_id=60964

    Which version were you using ?

    Also, if you could spare the time to test several versions,
    that would be really wonderful.

    Eg,

    - experimental release 2.1.3.9
    - beta release 2.1.2.0
    - stable release 2.0
    - stable release 1.7

     
  • Anonymous - 2003-11-20

    Logged In: YES
    user_id=60964

    > Directory compare: slow, can also be done by file size/date?

    It will use file size if you select options equivalent to
    binary identical:

    - Disable "Ignore Blank Lines"
    - Disable "Ignore Case"
    - Enable "Compare Whitespace"
    - Enable "Sensitive to EOL"

    Otherwise, files with differing sizes may be identical, and
    the only way to tell is to actually process them.

    It will never use file date (file date is simply not
    reliable -- we assume nobody wants speedy but incorrect
    results).
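A minimal sketch of the size shortcut described above, written in Python purely for illustration. `CompareOptions` and `differ_by_size` are hypothetical names that mirror the four options listed in this comment; they are not WinMerge's actual code. The point is that a size mismatch only proves a difference when the options amount to a binary compare.

```python
import os
from dataclasses import dataclass


@dataclass
class CompareOptions:
    # Hypothetical mirror of the four options named in the thread.
    ignore_blank_lines: bool = False
    ignore_case: bool = False
    compare_whitespace: bool = True
    sensitive_to_eol: bool = True

    def is_binary_equivalent(self) -> bool:
        """True when the options amount to a byte-for-byte compare."""
        return (not self.ignore_blank_lines
                and not self.ignore_case
                and self.compare_whitespace
                and self.sensitive_to_eol)


def differ_by_size(path_a, path_b, options):
    """Return True if size alone proves the files differ, else None (unknown).

    None means the contents still have to be read: either the sizes match,
    or the options allow differently sized files to count as identical.
    """
    if (options.is_binary_equivalent()
            and os.path.getsize(path_a) != os.path.getsize(path_b)):
        return True
    return None
```

Note that the function never returns "identical": equal sizes say nothing about the contents.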

     
  • Anonymous - 2003-11-20

    Logged In: YES
    user_id=60964

    > Please some better handling, for example: partial
    loading/compare for large files.

    If you can think this through some more, or explain it more
    fully, please would you post it over on the RFE (request for
    feature enhancement) list ?

    But, if you have many large files which are identical except
    for whitespace changes, and you have told WinMerge to ignore
    whitespace changes, then it is going to have to read every
    byte of every file -- how else will it be able to tell that
    they are actually identical ? At least, I can't think of a
    way around this.
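To illustrate why every byte has to be read in that case, here is a small sketch of a whitespace-insensitive compare. It is an assumption-laden stand-in, not what diffutils does internally: `equal_ignoring_whitespace` is a hypothetical helper that simply drops all whitespace inside each line before comparing.

```python
from itertools import zip_longest


def equal_ignoring_whitespace(path_a, path_b):
    """Compare two text files line by line, ignoring whitespace within lines."""

    def normalized_lines(path):
        with open(path, "r", encoding="utf-8", errors="replace") as handle:
            for line in handle:
                # split() with no argument discards every run of whitespace.
                yield "".join(line.split())

    # zip_longest pads the shorter file with None, which never equals a
    # string, so a differing line count is reported as a difference.
    pairs = zip_longest(normalized_lines(path_a), normalized_lines(path_b))
    return all(left == right for left, right in pairs)
```

Two files of different sizes can come out identical here, which is why the size check cannot be used once whitespace is ignored.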

     
  • ganier

    ganier - 2003-11-20

    Logged In: YES
    user_id=804270

    It is slow: is that a bug or an RFE?

    Directory compare by size/date:
    Someone is already working on this RFE: [ 826652 ] Compare
    Files By Modified Date. Once that RFE works, adding
    size will probably be easy.

    File compare :
    Perry, if I am not mistaken :
    - Disable "Ignore Blank Lines"
    - Disable "Ignore Case"
    - Enable "Compare Whitespace"
    - Enable "Sensitive to EOL"
    is the same as binary compare. But it is not like comparing file
    size only.

    André: what do you mean by partial loading/compare?
    Is there a permanent section of the file that you never
    compare ?

     
  • Anonymous - 2003-11-20

    Logged In: YES
    user_id=60964

    I thought that diffutils internally uses filesize if the
    options are set equivalent to binary compare. However, I may
    be wrong. Plus, we still do all the plugin stuff to get in.

    A good RFE would be to set the status to different in DirScan if the
    options are binary-compare equivalent and the file sizes are different.

    I personally have almost no respect for file date, because
    file dates are different very often, for identical files.
    But, indeed, they could both be implemented in the same
    place in DirScan.

     
  • André Mussche

    André Mussche - 2003-11-20

    Logged In: YES
    user_id=661937

    OK, I understand it a little more...

    But wouldn't it be an option to switch to binary mode if the
    files are larger than 1mb, for example? Then you can say: file
    size is different, so file is different. That saves a lot of scanning
    time! Or a popup question/warning, whatever.

    With partial I mean: load the first 500 lines and compare it
    (file compare, not folder compare).
    When the user scrolls down, read the next 500 lines.
    Or something like that. If 2 files of 50mb are different in the
    header, I know enough. After all, I don't want to totally
    compare that 50mb!

    But, If no changes in the first 500 lines, yeah... ehm...
    Keep loading? Warning? That's up to you folks! :-)

    Oh yeah, maybe a progress bar while loading/comparing?
    It helps to know whether 90% of the 50mb is already done or just 1% :-)
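The "first 500 lines" idea can be sketched roughly as follows. This is only an illustration in Python under simple assumptions (exact line-by-line comparison, no ignore options); `first_difference_in_head` is a hypothetical name, not a WinMerge function.

```python
from itertools import islice, zip_longest


def first_difference_in_head(path_a, path_b, max_lines=500):
    """Return the 1-based number of the first differing line among the
    first max_lines lines, or None if no difference was found there.

    None does NOT mean the files are identical, only that the part
    that was read is.
    """
    with open(path_a, "rb") as file_a, open(path_b, "rb") as file_b:
        # zip_longest pads the shorter file with None, so a file that
        # ends early counts as different at that line.
        head = islice(zip_longest(file_a, file_b), max_lines)
        for number, (line_a, line_b) in enumerate(head, start=1):
            if line_a != line_b:
                return number
    return None
```

The open question from the post remains: when this returns None, the tool still has to decide whether to keep loading or to warn the user.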

     
  • Anonymous - 2003-11-20

    Logged In: YES
    user_id=60964

    #1) Only do partial comparisons, and/or ignore options for
    large files

    It sounds like maybe you're asking for a really big change
    in behavior.

    Right now, WinMerge figures out which files are different,
    and shows the status in the dirview.

    If you want it to just partially test them, that is a big
    change. (It can't show you which files are different, if it
    doesn't really know.)

    As a user myself, I'm not clear how I would use that. Also,
    I don't (as a user) want WinMerge to ignore my options for
    big files -- at least, that worries me, because it sounds
    confusing.

    #2) status bar

    What about the status bar floating dialogbar that we have
    now ? Do you mean you don't like it ?

    *****************
    PS: I think you still haven't said which version(s) you've
    tested this with ? Please, if you have time, do that -- at
    least, do say what version you're discussing. Version 2.0 is
    pretty old, and Version 1.7 is *very* old, and if you're
    talking about really old stuff, and we don't know that, it
    may cause miscommunication :)

     
  • Anonymous - 2003-11-20

    Logged In: YES
    user_id=60964

    I just posted an untested patch#845981 (Add filesize
    optimization to DirScan). I would welcome anyone willing to
    test and/or troubleshoot it! :)

     
  • Kimmo Varis

    Kimmo Varis - 2003-11-20

    Logged In: YES
    user_id=631874

    Partially comparing would be good for big files. But
    implementing it is anything but easy. It's not just "take
    prev/next 100 kb and compare them". Actually just running
    part of file through diff-engine is the easy part.

    But if we only load and analyse part of file we are missing
    lot of important information about files. Well, just total
    amount of different lines is very important missing
    information. How could we know when files are identical?
    What about going to next diff at end of this partially
    loaded file?

     
  • Anonymous - 2003-11-20

    Logged In: YES
    user_id=60964

    A cheaper solution might be to have DirScan queue any large
    files (say, files over 10megs by default), and enable the dirview
    display (press any key to continue) when it
    finishes everything under 10megs. Then it continues to work
    on the big ones. They are all set to status "Not finished"
    in display.

    This is a lot of work to implement, I think, but it might be
    feasible. It might be a nice usability improvement.

    I mean, if the file is "Not finished", you *can not* open it
    in mergeview -- we don't allow that until file is finished.
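A rough sketch of that two-pass idea, in Python for illustration only. The names `scan_in_two_passes`, `compare` and `getsize` are assumptions made for this example and do not correspond to the real DirScan code; the 10 MB default is the figure suggested above.

```python
import os

LARGE_FILE_THRESHOLD = 10 * 1024 * 1024  # "files over 10megs by default"


def scan_in_two_passes(pairs, compare, threshold=LARGE_FILE_THRESHOLD,
                       getsize=os.path.getsize):
    """Yield two status snapshots for a list of (left, right) path pairs.

    The first snapshot is ready once every small pair has been compared;
    large pairs are marked "Not finished". The second snapshot follows
    after the deferred large pairs have been compared as well.
    """
    status = {}
    deferred = []
    for left, right in pairs:
        if max(getsize(left), getsize(right)) > threshold:
            status[(left, right)] = "Not finished"
            deferred.append((left, right))
        else:
            status[(left, right)] = compare(left, right)
    yield dict(status)  # the display could be enabled at this point

    for left, right in deferred:
        status[(left, right)] = compare(left, right)
    yield dict(status)
```

A caller would show the first snapshot immediately and refuse to open any pair still marked "Not finished", as described above.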

     
  • ganier

    ganier - 2003-11-20

    Logged In: YES
    user_id=804270

    I am convinced of the usefulness of a better dir compare for large
    files (like Perry proposes).

    But I am doubtful about a 500 lines by 500 lines file comparison.

    Is it for long logs ? Then the interesting part will be as often
    in the end as in the middle as in the beginning. With a simple
    incremental read, you will wait half the time for the middle,
    and full time for the end. Are you really interested in the
    beginning of the file ?
    Or are you rather thinking of an application like Word? That is, move
    quickly anywhere in the file (if you don't care about page
    numbers), with only the active portion loaded. That is very
    different from incremental load, so please give more
    details.

     
  • André Mussche

    André Mussche - 2003-11-21

    Logged In: YES
    user_id=661937

    I understand the difficulties of it. That's almost always
    with programming: saying it is a lot easier than
    implementing it :-).

    Well, I don't really mind how it is done. But for now, it is
    not *working* with large files. Maybe add an option to scan
    the files very *light* (when files > 1mb?). And of course,
    less accurate. That's the choice the user has to make.
    For now, I want a fast rough compare of many large files.
    Maybe later a more thorough scan, but then I'm willing to
    wait.

    Partial/incremental load. Just see it more as some more
    feedback to the user. Now, it loads the
    entire file, and then shows the difference. So I can't see
    if there are already some differences found.
    In this case, I'm only interested in the beginning of the
    files. So complete loading is not necessary.
    And after waiting a long time of loading, and seeing that
    the difference was not in the beginning (or not
    interesting), I've wasted my time :-(.

    I know, I have some exotic wishes. At least, please think
    about how to optimize it with large files!
    I don't mind how, maybe some more options to choose from, or
    a check and popup in case of large files, whatever!

    I've used Araxis Merge, which has at least a much faster
    dirscan. But the same problem with comparing :-). But that's
    the power of open source: discuss and bring in new/better
    features! :-)

    By the way, I've used version 1.7 and the newest 2.1.2 of
    WinMerge.
    (It's a nice program, good job! But there are still things
    that can be improved :-) )

     
  • Kimmo Varis

    Kimmo Varis - 2003-11-21

    Logged In: YES
    user_id=631874

    You should try new beta 2.1.4.0. Beta 2.1.2.0 was indeed
    slow in dir compare, because we tried to read version
    information from all files. 2.1.4.0 should do a lot better
    in dir compare speed.

    Yes, it can always be done faster. :)

    Perry's idea of delaying bigger files could be one solution.
    Another could be to first only compare files, not bother
    reading all other information, dates, attributes etc.
    Difference is the important info, dates etc can be delayed
    to read after that. That could help with large directories.

     
  • Anonymous - 2003-11-21

    Logged In: YES
    user_id=60964

    I think patch#845981 is going to help the dirview problem a
    lot -- it avoids loading plugins and file, when the size is
    different (and binary equivalent options). I would guess
    that this is the problem, not reading version & dates,
    because for a 50meg file, reading version and date is
    probably insignificant compared to costs of messing with
    buffers (plugin & diffutil loading) ?

     
  • ganier

    ganier - 2003-11-21

    Logged In: YES
    user_id=804270

    For file comparison, if that is a repetitive need, if you are
    interested in the beginning of the file only, if you just want to
    diff (and not merge), if you program in VB or VC++, then
    there may be a solution.
    You may create a plugin which just copies the file and
    truncates the copy before diffing.
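The copy-and-truncate step itself is small. The sketch below shows it in Python only to make the idea concrete; an actual WinMerge plugin would be written in one of the languages named in this thread and would use the plugin interface, which is not shown here. `truncated_copy` and its 64 KB default are hypothetical.

```python
import tempfile


def truncated_copy(path, max_bytes=64 * 1024):
    """Copy at most max_bytes from the start of path into a new temporary
    file and return the temporary file's path.

    The original file is left untouched; the caller diffs the copies and
    is responsible for deleting them afterwards.
    """
    with open(path, "rb") as source:
        head = source.read(max_bytes)
    with tempfile.NamedTemporaryFile(delete=False, suffix=".head") as target:
        target.write(head)
        return target.name
```

Because only the copies are truncated, this suits diffing but not merging, which matches the conditions listed above.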

     
  • André Mussche

    André Mussche - 2003-11-22

    Logged In: YES
    user_id=661937

    Maybe a stupid question, but I can't find the patch.
    What is the best way to test the patch/new version?
    Do I have to compile it myself? Or can someone send me
    it, so I can test it?

    Thanks a lot!

    (hmm, a plugin to truncate... Sounds interesting. In which
    language do I have to program it? If I have some sparetime, I
    will take a look :-) )

     
  • ganier

    ganier - 2003-11-22

    Logged In: YES
    user_id=804270

    > Maybe a stupid question, but I can't find the patch.
    > What is the best way to test the patch/new version?
    We are probably going to build an experimental version soon.
    Perry may even want to build one now.
    For my part, I haven't had time to look at it yet, and I'd
    prefer to take a look before building the experimental.

    > hmm, a plugin to truncate... Sounds interesting. In which
    > language do I have to program it? If I have some
    > sparetime, I will take a look :-)
    There are some docs and examples in the CVS (CVS->plugins).
    The language is to your preference amongst VB, VC and
    Delphi.
    Please read the docs first once and report any difficulty, as
    the docs probably need to be improved a lot.

     
  • Anonymous - 2003-11-25

    Logged In: YES
    user_id=60964

    The patch was source code only.

    But, it isn't really good enough to apply, because if we
    skip files without scanning them, we don't know if they are
    text or binary.

    - We don't know what icon to use

    - Our file opening code doesn't allow for this case (of not
    knowing whether text or binary)

     
  • Anonymous - 2004-01-23

    Logged In: YES
    user_id=60964

    This doesn't belong on the bug list anyway, but on the RFE
    list, I think. Moving to RFE list.

     
  • Anonymous - 2004-01-23
    • labels: 559476 -->
    • milestone: 102449 -->
     
  • Kimmo Varis

    Kimmo Varis - 2004-02-02

    Logged In: YES
    user_id=631874

    Latest experimental (2.1.5.8) build includes support for
    alternative compare method for directory compare, comparing
    by modified date. If puddle is still interested, we could
    get this compare by size patch in too. Patch would be pretty
    simple, new function to compare sizes to dirscan.cpp and new
    combobox item to options.

    Thinking of this.. Maybe we can allow adding custom
    *directory* compare methods through plugins? Directory
    compare is pretty simple anyway, two files are identical or
    different. Yes, default diffutils compare does a lot of
    work, but sometimes all that work and information is not needed.
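One way to picture pluggable directory compare methods is a small registry of functions that each answer "identical or different" for a pair of files. This Python sketch is an assumption about how such a design could look, not the WinMerge plugin API; `COMPARE_METHODS` and `compare_method` are made-up names.

```python
import filecmp
import os

# Maps a method name (as it might appear in an options combobox)
# to a function returning True when two files count as identical.
COMPARE_METHODS = {}


def compare_method(name):
    """Decorator that registers a directory compare method under a name."""
    def register(func):
        COMPARE_METHODS[name] = func
        return func
    return register


@compare_method("modified date")
def same_modified_date(path_a, path_b):
    return os.path.getmtime(path_a) == os.path.getmtime(path_b)


@compare_method("size")
def same_size(path_a, path_b):
    return os.path.getsize(path_a) == os.path.getsize(path_b)


@compare_method("full contents")
def same_contents(path_a, path_b):
    # shallow=False forces a byte-by-byte comparison of the contents.
    return filecmp.cmp(path_a, path_b, shallow=False)
```

The cheap methods can be wrong in either direction for some files, which is the trade-off discussed earlier in this thread.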

     
  • Anonymous - 2004-02-02

    Logged In: YES
    user_id=60964

    Not only does diffutils do a lot of work for all the various
    compare options which complicate the picture beyond simple
    binary comparison, but also it does a lot of work to produce
    a list of block changes; this list is nearly useless to us
    in dirview. We used to only use it to deduce simple
    different/same flag. Now we may also use it to get count of
    differences.
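For comparison, when only the same/different flag is needed and the options amount to a binary compare, a check can stop at the first mismatch without building any list of changes. The following is a minimal Python sketch of that idea; `files_identical` is a hypothetical helper and not part of diffutils or WinMerge.

```python
def files_identical(path_a, path_b, chunk_size=1024 * 1024):
    """Return True if both files have exactly the same bytes.

    Reads both files in chunks and returns as soon as a chunk differs,
    so two large files that differ near the start are rejected quickly.
    """
    with open(path_a, "rb") as file_a, open(path_b, "rb") as file_b:
        while True:
            chunk_a = file_a.read(chunk_size)
            chunk_b = file_b.read(chunk_size)
            if chunk_a != chunk_b:
                return False
            if not chunk_a:  # both files ended at the same point
                return True
```

This yields no count of differences, so it covers the older use (the simple flag) but not the newer one mentioned above.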

     
  • ganier

    ganier - 2004-03-25

    Logged In: YES
    user_id=804270

    Two new plugins are available : see patch [ 923044 ] Partial
    compare large files.

     
