Option to compare file name

Tong Sun
2010-01-30
2013-04-02
  • Tong Sun
    2010-01-30

    Hi,

    I think that for most people, over 99% of identical files have the same name. I know that files with different names can be identical, but I believe the chance is rather low. I only care about identical files with the same name, and I don't want the program to busily compute and compare md5sums just because some files happen to be the same size.

    So I hope there could be an option that limits dupmerge to comparing only files with the same name. I know this means giving up the quicksort cmp algorithm, but please consider the speed improvement: only a linear comparison is needed, and only a fraction of the md5sums have to be calculated.

    Thanks

     
  • Freitag, Rolf
    2010-04-10

    Hi,

    It's possible to add an option and a line in the cmp function to check for the same name (without the path).
    Because most identical files have different names and I don't care about the file name, I did not implement it. But you are free to do so.
    It should also be possible to use a script that finds duplicate file names and to feed dupmerge the file names it finds (with the path, and zero-terminated):
    http://code.activestate.com/recipes/364953-find-duplicate-file-names/
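
    The pre-filtering idea above could be sketched roughly as follows. This is not the linked recipe and not part of dupmerge; the function names same_name_groups and write_zero_terminated are made up for illustration. It only collects regular files whose base name occurs more than once and writes their paths zero-terminated.

```python
import os
from collections import defaultdict


def same_name_groups(root):
    """Return {base name: sorted paths} for base names seen more than once under root."""
    by_name = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Assumption: only regular files are of interest; symlinks are skipped.
            if os.path.isfile(path) and not os.path.islink(path):
                by_name[name].append(path)
    return {name: sorted(paths) for name, paths in by_name.items() if len(paths) > 1}


def write_zero_terminated(groups, out):
    """Write every path in groups to the binary stream out, each followed by a NUL byte."""
    for name in sorted(groups):
        for path in groups[name]:
            out.write(os.fsencode(path) + b"\0")
```

    Calling write_zero_terminated(same_name_groups("."), sys.stdout.buffer) would print the candidate list, which could then be piped to dupmerge. Files that share a name but differ in size or content are still sorted out by dupmerge itself.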

    A workaround is to run dupmerge in nodo mode (option -n) and then look for duplicates with equal file names.

    This dupmerge compares files directly instead of comparing hash values, because I have an MD5 collision, SHA-1 is nearly broken, and sooner or later longer hashes such as SHA-256 will be broken too.
    Another point is random hash collisions and collision attacks: if the hash values of n files are equal, you can never be absolutely sure that the files are equal. For deduplication, comparing hash values is therefore only the first step, and (direct) file comparison is the last.
    So using the file size as the hash value (as this dupmerge does) is a good and efficient compromise.
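
    A minimal sketch of that "size first, direct comparison last" idea, in Python rather than in dupmerge's own code. It does not reproduce dupmerge's sort-based cmp function; the name duplicate_sets is hypothetical, and filecmp.cmp with shallow=False stands in for the byte-by-byte comparison.

```python
import filecmp
import os
from collections import defaultdict


def duplicate_sets(paths):
    """Group paths into lists of files with identical content."""
    # Step 1: the file size acts as a cheap "hash"; files of different size cannot be equal.
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    result = []
    for candidates in by_size.values():
        if len(candidates) < 2:
            continue
        # Step 2: among files of equal size, compare contents directly.
        classes = []
        for path in candidates:
            for cls in classes:
                if filecmp.cmp(cls[0], path, shallow=False):
                    cls.append(path)
                    break
            else:
                classes.append([path])
        result.extend(cls for cls in classes if len(cls) > 1)
    return result
```

    No hash function is involved, so there is no collision to worry about; the price is that files of equal size are read in full whenever they do not differ early on.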

     
  • Tong Sun
    2010-04-10

    Thanks for the explanation.