Hi, I tried to use dump3 on a large set (about 6000) of large (average size
~500KB) text files. However, with the text algorithms that dump3 uses, it
tests a representative subsample of ~200 files in the same amount of time
that sim_text (part of similarity-tester) needs to check all of them (about
an hour). Assuming O(n^2) algorithms, that is a factor of (6000/200)^2 ≈
1000 in speed! I also found that sim_text gives a better signal-to-noise
ratio (even after playing around for a while with the parameters in dump3's
config file, I could not get dump3 to give similarly nice results), i.e. it
gives fewer false positives and detects more genuinely similar files! I have
no idea whether it would be at all possible, but it would be great if you
could incorporate the sim_text code into dump3!
Do you know where I can find the source of sim_text?
You can find the source here: http://www.cs.vu.nl/~dick/sim.html
Sorry to resurrect an old thread.
I have been doing similar comparisons for years (since 2003).
Originally I used a program known as wdiff. This was simply a wrapper around
the existing diff program: it pre-parses the text files into one-word-per-
line files, runs them through diff, and then generates a merged differences
file formatted using the second file given. It works very well but, as you
can imagine, is quite slow.
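The basic idea can be sketched in a few lines of Python, using difflib in
place of the external diff (the whitespace splitting is an assumption about
how the preprocessing tokenized; the [-...-]/{+...+} markers follow wdiff's
output convention):

```python
import difflib

def word_diff(a, b):
    # Split both texts into words, then diff word-by-word instead of
    # line-by-line -- the same effect as wdiff's one-word-per-line trick.
    wa, wb = a.split(), b.split()
    sm = difflib.SequenceMatcher(None, wa, wb)
    merged = []
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == "equal":
            merged.extend(wa[i1:i2])
        else:
            # Mark deletions and insertions in wdiff's style.
            if i2 > i1:
                merged.append("[-" + " ".join(wa[i1:i2]) + "-]")
            if j2 > j1:
                merged.append("{+" + " ".join(wb[j1:j2]) + "+}")
    return " ".join(merged)
```

Real wdiff spawns diff as a subprocess and works through temporary files,
which is where most of the slowness comes from.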
Because of this I ended up writing a perl script to find and locate
'keywords' in the text files, to try and reduce the number of comparisons
from a full quadratic set to something much, much smaller. Initial sets of
keywords were created by hand (people's names, actions, objects and other
things) but later were created using 'rare word' type searches across the
files, ignoring a set of 200 or so 'extremely common' words (like 'the').
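A minimal sketch of that 'rare word' selection, in Python rather than perl
(the stopword list and the rarity threshold here are placeholders, not the
ones the actual script used):

```python
from collections import Counter

# Placeholder stopword set; the real script ignored ~200 common words.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it"}

def rare_keywords(texts, max_doc_freq=2):
    # Count in how many documents each word appears.
    doc_freq = Counter()
    for text in texts:
        for word in set(text.lower().split()):
            if word not in STOPWORDS:
                doc_freq[word] += 1
    # Words appearing in only a few documents make good keywords:
    # two files sharing such a word are candidates for a full compare.
    return {w for w, n in doc_freq.items() if n <= max_doc_freq}
```

Only file pairs that share a rare keyword then go through the expensive
word-by-word comparison, which is what cuts the quadratic cost down.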
wdiff was later improved by the use of 'context'. That is, each word in the
preprocessed file was wrapped with the one or two words before and after it.
This use of context removed 'synchronization' problems where diff failed to
compare large blocks of similar text after a large block of dissimilar text.
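The wrapping step amounts to emitting each word together with its
neighbours in a sliding window (the window radius and separator here are
assumptions for illustration):

```python
def with_context(words, radius=1):
    # Wrap each word with `radius` neighbours on either side, so diff
    # matches word-in-context rather than bare words; this stops it from
    # falsely re-synchronizing on common words like 'the' after a large
    # dissimilar block.
    out = []
    for i in range(len(words)):
        lo = max(0, i - radius)
        hi = min(len(words), i + radius + 1)
        out.append(" ".join(words[lo:hi]))
    return out
```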
This worked tremendously well, and even later, more exhaustive searches of
the texts on faster machines (taking tens of days) found very few extra,
previously unknown matches.
I also discovered sim_text, and modified it to completely ignore anything
that was not a word or number, that is, to ignore punctuation as well as
layout. The result is a "sim_words" variant. Testing showed it to be a LOT
faster, though I needed to adjust my perl wrapper to collect and group
batches of files that were to be compared.
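The tokenization change boils down to keeping only runs of letters and
digits (a sketch of the idea; sim itself is written in C and its actual
lexer differs):

```python
import re

def word_tokens(text):
    # Keep only words and numbers; punctuation and layout vanish, so
    # reformatting or repunctuating a text no longer defeats the compare.
    return re.findall(r"[A-Za-z0-9]+", text)
```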
The percentage of matching or common text reported by "sim" is, however,
far lower than that of a context wdiff. But as it keeps compressed
pre-processed texts in memory while comparing multiple files, avoiding
temporary file IO and process spawning, "sim" is a lot, lot faster than
wdiff.
Typically, when a new large set of texts comes in, I now do a first run
with sim_words to find initial duplicates, and then a second, slower run
using wdiff to finish the task.