similarity-tester

Algorithms
pardyu
2008-09-27
2012-07-19
  • pardyu
    pardyu
    2008-09-27

    Hi, I tried to use dump3 on a large (about 6000) set of large (average size ~500KB) text files. However, with the text algorithms that dump3 uses, it took as long to test a representative subsample of ~200 files as sim_text (part of similarity-tester) takes to check all of them (about an hour). Assuming O(n^2) algorithms, this means a factor of ~1000 in speed! I also found that sim_text gives a better signal-to-noise ratio (even after playing around for a while with the parameters in the config file in dump3 I could get dump3 to give similarly nice results), i.e. it gives fewer false positives and detects more actually similar files! I have no idea if it would be at all possible, but it would be great if you could incorporate the sim_text code in dump3!
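
    A rough back-of-the-envelope check of that speed factor, assuming both tools do a plain all-pairs (O(n^2)) comparison and using the file counts above:

        # 6000 and 200 are the file counts quoted above.
        full_set = 6000    # total number of text files
        subsample = 200    # files dump3 got through in the same wall-clock time
        print((full_set / subsample) ** 2)   # 900.0, i.e. roughly the factor of ~1000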

     
    • pardyu
      pardyu
      2008-09-27

      Oops! I meant: "(even after playing around for a while with the parameters in the config file in dump3 I could NOT get dump3 to give similarly nice results)"

       
    • Hi
      Do you know where I can find the source of sim_text?
      Thanks
      Alex

       
      • pardyu
        pardyu
        2008-10-03

        Hi,
        You can find the source here: http://www.cs.vu.nl/~dick/sim.html

         
  • Anthony
    Anthony
    2011-12-13

    Sorry to resurrect an old thread.

    I have been doing similar comparisons for years (since 2003).

    Originally I used a program known as wdiff. This was simply a wrapper around
    the existing diff program: it pre-parses the text files into one-word-per-
    line files, runs them through diff, and then generates a merged differences
    file formatted using the second file given. It works very well but, as you
    can imagine, is quite slow.
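
    A minimal Python sketch of the one-word-per-line idea (this is not wdiff
    itself, which wraps diff directly; the file names are just placeholders):

        import difflib

        def words_per_line(path):
            # Pre-parse a text file into one word per line.
            with open(path, encoding="utf-8", errors="replace") as f:
                return f.read().split()

        # Placeholder file names, purely for illustration.
        a = words_per_line("text_a.txt")
        b = words_per_line("text_b.txt")

        # Run the word lists through a diff; the real tool then merges the
        # result back into a single annotated text, which is omitted here.
        for line in difflib.unified_diff(a, b, lineterm=""):
            print(line)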

    Because of this I ended up writing a perl script to find and locate
    'keywords' in the text files, to try and reduce the number of file pairs to
    be compared from a full quadratic set to something much, much smaller.
    Initial sets of keywords were created by hand (people's names, actions,
    objects and other things), but later they were created using 'rare word'
    type searches across each file, ignoring a set of 200 or so 'extremely
    common' words (like 'the').
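
    In Python rather than perl, the idea is roughly this (the stop-word list
    and the rarity cut-off below are made-up illustrations, not the values I
    actually use):

        import re
        from collections import Counter

        # Tiny stand-in for the ~200 'extremely common' words.
        STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it"}

        def rare_keywords(text, how_many=20):
            # Count the words, drop the extremely common ones, and keep the
            # rarest few as the file's 'keywords'.
            words = re.findall(r"[a-z0-9']+", text.lower())
            counts = Counter(w for w in words if w not in STOP_WORDS)
            rare_first = sorted(counts.items(), key=lambda kv: kv[1])
            return {w for w, _ in rare_first[:how_many]}

        def candidate_pairs(texts):
            # Only compare pairs of files that share a keyword, instead of
            # the full quadratic set of pairs. 'texts' maps name -> contents.
            keys = {name: rare_keywords(body) for name, body in texts.items()}
            names = sorted(texts)
            return [(a, b) for i, a in enumerate(names)
                    for b in names[i + 1:] if keys[a] & keys[b]]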

    wdiff was later improved by the use of 'context'. That is, each word in the
    preprocessed file was wrapped with the one or two words before and after
    it. This use of context removed 'synchronization' problems, where diff
    failed to compare large blocks of similar text after a large block of
    dissimilar text.
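
    The wrapping itself is simple; a rough Python sketch of the idea, with one
    word of context on each side purely as an illustration:

        def with_context(words, n=1):
            # Wrap each word with the n words before and after it, so diff
            # only matches a word when its neighbourhood matches too.
            padded = ["^"] * n + words + ["$"] * n
            return [" ".join(padded[i - n:i + n + 1])
                    for i in range(n, n + len(words))]

        # "the cat sat" -> ['^ the cat', 'the cat sat', 'cat sat $']
        print(with_context("the cat sat".split()))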

    This worked tremendously well, and even later, more exhaustive searches of
    the texts on faster machines (taking tens of days) found very few extra,
    previously unknown duplicates.

    I also discovered sim_text, and modified it to completely ignore anything
    that was not a word or a number; that is, it ignores punctuation as well as
    layout. The result is a "sim_words" variant. Testing showed it to be a LOT
    faster, though I needed to adjust my 'perl wrapper' to collect and group
    batches of files that were to be compared.
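
    The tokenization change amounts to something like this (a Python
    illustration of the idea only; the actual change lives in sim's C code):

        import re

        def sim_words_tokens(text):
            # Keep only words and numbers; punctuation and layout (newlines,
            # spacing) are ignored entirely.
            return re.findall(r"[A-Za-z0-9]+", text)

        print(sim_words_tokens("Hello,   world!\n(It was 1984.)"))
        # ['Hello', 'world', 'It', 'was', '1984']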

    The percentage of matching or common text reported by "sim" is, however,
    far lower than that of a context wdiff. But because it keeps compressed,
    pre-processed texts in memory while comparing multiple files, avoiding
    temporary file IO and process spawning, "sim" is a lot, lot faster than
    wdiff.

    Typically, when a new large set of texts comes in, I now do a first run
    with sim_words to find initial duplicates, and then a second, slower run
    using wdiff to finish the task.