Support for large file diffs
Brought to you by:
alnd
Just wondering if any benchmarks exist for what file sizes cloc can handle when running diffs?
In my environment I am comparing releases consisting of around 1K files containing 10.5 million blank lines, 30K comment lines, and 18 million lines of XML; I killed the diff after 25 hours.
There is no mention of bdiff or large-file support in Algorithm::Diff, so I'm wondering if it's even possible.
Currently testing with GNU diff's '--speed-large-files' option; if that proves feasible I may write a wrapper (or perhaps Algorithm::Diff could be enhanced).
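A rough sketch of the wrapper idea, using GNU diff's --speed-large-files switch; the file names and contents below are tiny made-up stand-ins for real release files, not anything from my environment:

```shell
# Sketch: count added/removed lines with GNU diff's --speed-large-files,
# which trades a minimal diff for speed on very large inputs.
# old.xml/new.xml are throwaway samples for illustration only.
dir=$(mktemp -d)
printf 'a\nb\nc\n'    > "$dir/old.xml"
printf 'a\nx\nc\nd\n' > "$dir/new.xml"

added=$(diff --speed-large-files "$dir/old.xml" "$dir/new.xml" | grep -c '^>')
removed=$(diff --speed-large-files "$dir/old.xml" "$dir/new.xml" | grep -c '^<')
echo "added=$added removed=$removed"   # added=2 removed=1
```

A real wrapper would presumably run this as a fast pre-pass over each file pair and fall back to Algorithm::Diff only where a finer-grained comparison is needed.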
Anonymous
1.8e7 lines of XML is a lot. Could you let cloc run to completion, perhaps over a weekend, just to establish the baseline performance, which can hopefully be improved later? Better yet, run cloc with a Perl profiler (http://www.perl.org/about/whitepapers/perl-profiling.html) to see where the time is spent. The results would be most valuable.
As a point of reference my development box can do a straight count of gcc-4.5.3.tar.bz2 in 242 seconds, or 31679.1 lines/s. I'll report back later on how long it takes to do a diff between gcc-4.5.2.tar.bz2 and gcc-4.5.3.tar.bz2.
More performance info:
time cloc --diff gcc-4.5.2.tar.bz2 gcc-4.5.3.tar.bz2
57298 text files.
57414 text files.
7029 files ignored.
[...output trimmed...]
SUM:
  same        53076      0  1653021  4902805
  modified      320      0      165      996
  added         116    630      799     4943
  removed         1     83      151     1134
real 29m4.001s
user 23m56.697s
sys 3m59.772s
So a diff of two archives, each with roughly 6.5e+6 lines of code and comments, took just under half an hour.
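For anyone checking the arithmetic, the 6.5e+6 figure follows from the "same" row of the SUM table above (comment plus code lines; the modified/added/removed counts are negligible by comparison):

```shell
# 'same' comment lines + 'same' code lines from the SUM table
awk 'BEGIN { s = 1653021 + 4902805; printf "%d (~%.1e lines)\n", s, s }'
# prints: 6555826 (~6.6e+06 lines)
```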
Hardware is Intel Core2 Quad CPU Q6600 @ 2.40GHz; runs Fedora 14.
Changing to pending; will close in two weeks unless more discussion is needed.