Recommended hash for 400GB of .flac files?

  • What hash would you recommend for 400GB of .flac files (~19000 files)?  I chose MD5 and it's been running for 70 hours with an estimated 17 remaining (on a server tuned for background processes).  I'll need to repeat the run on a mirror, so if there's a faster hash I might finish sooner overall by restarting with it.

  • Tom Bramer

    The rate at which the files are being hashed in this case is likely lower than the raw rate of even the most expensive algorithm, so switching to a different hash is unlikely to help.  To see this, you can use the benchmark option of FVC.  Here's an example:

    C:\> fvc -b
    Performing algorithm benchmark.  Number of bytes to be hashed = 32.0 MiB
    CRC16: 261.7 MiB/s.
    CRC32: 294.5 MiB/s.
    BZIP2 CRC: 285.4 MiB/s.
    MPEG2 CRC: 286.2 MiB/s.
    JamCRC: 282.1 MiB/s.
    Posix CRC: 285.0 MiB/s.
    ADLER32: 1.5 GiB/s.
    MD4: 650.7 MiB/s.
    MD5: 452.9 MiB/s.
    EDONKEY2K: 652.4 MiB/s.
    RMD128: 257.9 MiB/s.
    RMD160: 181.2 MiB/s.
    RMD256: 283.3 MiB/s.
    RMD320: 194.9 MiB/s.
    SHA1: 245.9 MiB/s.
    SHA224: 130.4 MiB/s.
    SHA256: 134.7 MiB/s.
    SHA384: 197.5 MiB/s.
    SHA512: 200.6 MiB/s.
    WHIRLPOOL: 42.0 MiB/s.
    WHIRLPOOL-T: 40.0 MiB/s.
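
For a rough cross-check outside FVC, a similar in-memory benchmark can be sketched in Python. This is only an illustration and not part of FV/FVC: the `benchmark` function and the selection of algorithms are assumptions made for the example, and the absolute numbers will differ from the FVC figures above because the implementations and hardware differ.

```python
import hashlib
import time
import zlib


def benchmark(size_mib=32):
    """Return {algorithm: MiB/s} for hashing one in-memory buffer."""
    # Zero-filled buffer held in RAM, so no disk I/O is measured.
    data = bytes(size_mib * 1024 * 1024)
    candidates = {
        "CRC32": lambda buf: zlib.crc32(buf),
        "ADLER32": lambda buf: zlib.adler32(buf),
        "MD5": lambda buf: hashlib.md5(buf).digest(),
        "SHA1": lambda buf: hashlib.sha1(buf).digest(),
        "SHA256": lambda buf: hashlib.sha256(buf).digest(),
        "SHA512": lambda buf: hashlib.sha512(buf).digest(),
    }
    rates = {}
    for name, fn in candidates.items():
        start = time.perf_counter()
        fn(data)
        # Guard against a zero reading on very coarse timers.
        elapsed = max(time.perf_counter() - start, 1e-9)
        rates[name] = size_mib / elapsed
    return rates


if __name__ == "__main__":
    for name, rate in benchmark().items():
        print(f"{name}: {rate:.1f} MiB/s")
```

If every algorithm here runs far faster than the rate observed on the real files, the hash choice is probably not the bottleneck.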

    How fragmented is the filesystem in which the files being hashed reside?

    You might want to compare the processing rate of FVC vs. FV.  I found a possible lock contention issue that becomes evident on SMP systems and made a change in v0.6.6 to mitigate the problem, but have never had any feedback on whether or not the change solved another user's reported slowdown.

  • Fragmentation might be the issue - will need to look into that.  I'll also check out FVC vs FV. 

    The stats on the first fv run came out to:
    Processing 18957 files.
    Processing 398.3 GiB total.
    Operation complete: 18957 files processed, 0 files skipped, 0 files not processed due to errors.
    Elapsed time: 94:04:24.903
    Average processing rate: 1.2 MiB/s.

    Just kicked off the run on the mirror and will check back in a week.

  • I don't think disk fragmentation is the root cause.  I just saw feature request ID: 3041789, and I think the performance I'm experiencing is related to what the person there saw and attributed to external processes.  The first 50GB were processed quite quickly, but as each hour passes the processing speed keeps decreasing (it went from hundreds of MiB/s to less than 5 MiB/s in 5 hours).  The fv process appears to be CPU bound rather than disk bound (25% CPU usage on an i5 with 4 logical cores, which corresponds to one core fully busy); memory usage does not appear to change.

  • Tom Bramer

    In any case, you should also test against version v0.6.6.6050, because a change was made there in response to another user's report of performance drops.

  • Night and day difference.  The run that I kicked off yesterday was only about 22% complete after 20 hours.  I stopped it, uninstalled that version, installed v0.6.6.6050, and restarted a verify on the mirror:
        Processing 18971 files.
        Processing 398.6 GiB total.
        Operation result: 18971 good files, 0 bad files, 0 files not processed.
        Elapsed time: 2:04:40.327
        Average processing rate: 54.6 MiB/s.

    It was no longer CPU bound.
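
As a sanity check on the two reported runs, the average rates can be recomputed from the totals and elapsed times quoted in this thread. The helper names below are made up for this sketch, and it assumes the elapsed time is printed as H:MM:SS.fff. Note that the first run was a calculation and the second a verify, so the comparison is only approximate.

```python
def parse_elapsed(text):
    """Convert an 'H:MM:SS.fff' elapsed time string to seconds."""
    hours, minutes, seconds = text.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def rate_mib_per_s(total_gib, elapsed):
    """Average rate in MiB/s for total_gib GiB processed in `elapsed`."""
    return total_gib * 1024 / parse_elapsed(elapsed)


# Figures quoted in the posts above.
first = rate_mib_per_s(398.3, "94:04:24.903")   # original run
second = rate_mib_per_s(398.6, "2:04:40.327")   # verify with v0.6.6.6050

print(f"first run:  {first:.1f} MiB/s")    # 1.2 MiB/s
print(f"second run: {second:.1f} MiB/s")   # 54.6 MiB/s
print(f"speedup:    {second / first:.0f}x")  # 45x
```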


  • Tom Bramer

    Thanks for reporting back.  This is the first report back on the fix applied to v0.6.6.6050 concerning the performance drop.

  • drm2525

    Subject: Performance drops
    I've noticed some discussion here regarding performance drops.  I am also experiencing this.  I've got 2TB of data for which I'm trying to calculate MD5 hashes.  I'm running the latest beta version, v0.6.6.6050, on 64-bit Windows 7 with a dual-core Intel processor and 4 GB of onboard memory.  The files I'm checking are on a USB drive.  They were newly copied to this drive, so disk fragmentation is not an issue.  The only antivirus software I run is Microsoft Security Essentials, and disabling it had no effect.

    When I start calculating MD5 hashes, the processing rate begins at about 32 MB/s.  In the Windows Resource Monitor, CPU usage is only 25% and the disk is active 90% of the time.  After about an hour the rate drops to 25 MB/s, an hour later to about 20, and it continues to drop.  After about 4 hours it is processing at 9 MB/s.  As the processing rate declines, so does the disk usage of the USB drive.  The CPU usage remains at about 25%.

    I've tried using CRC32 hashes instead of MD5 to see whether this affected the drop in throughput.  The CRC32 hash calculations experience the same drop.

    I've got an 8 GB USB flash drive that I sometimes use for ReadyBoost to increase available memory.  Enabling this has no impact on the problem.

    If I stop fv and then restart it, it again starts out at 32 MB/s and then the throughput begins to drop again.  So right now, I stop the processing every few hours and restart fv.

    I think if I were only calculating hashes on 20 or 30 GB of data, I would never have noticed the performance drop.  It is very noticeable, though, when I'm working on 2 TB of data.

  • drm2525

    Subject: Performance Drops
    I've done some additional investigation and am providing this additional information.
    1.  After a few hours my processing throughput decreases from about 32 MB/s to about 9 MB/s, per my previous post.

    2.  If I completely exit fv and restart the program, I always revert to a 32 MB/s throughput.  However, if I only stop fv and restart the scan (without exiting the program), the throughput restarts at about 30 MB/s and degrades rapidly within 40 to 60 minutes (whereas it degrades much more slowly if I completely exit fv before the restart).

    3.  The USB hard disk that I am calculating hashes on is a 2 TB Western Digital with USB 3.0 capability.  My laptop has USB 2.0 capability.  Yesterday, I purchased a $15 AKE USB 3.0 card for my laptop.  Now, using the USB 3.0 connection, my throughput while calculating hashes starts out at 110 MB/s and over a few hours degrades to about 63 MB/s, where it stabilizes.  So the degradation problem seems to be related to disk throughput rather than CPU or memory.

    4.  Using the USB 3.0 connection during a verify operation, it's interesting that the rate starts out at about 75 MB/s and remains fairly constant.  It sometimes goes up to 80 MB/s and sometimes drops to 70 MB/s, but it is still in this range after 5 hours.  So hash verification is much more stable than hash calculation.

    Now that I'm using the USB 3.0 connection, I'm quite satisfied with the speed of hash creation/verification.  It takes about 6 hours to process 1.5 TB of data.
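
One way to tell whether a gradual slowdown like this comes from the tool or from the drive is to run the same read-and-hash workload outside fv and log the rate as it goes. The following is a minimal sketch under stated assumptions, not how fv works internally: `hash_tree`, the 1 MiB read size, and the 1 GiB reporting window are all choices made for this example.

```python
import hashlib
import os
import time

CHUNK = 1024 * 1024  # read size; an arbitrary choice for this sketch


def hash_tree(root, report_every=1024 ** 3, log=print):
    """MD5 every file under root, logging the rate per window of bytes read.

    Returns {path: hex digest}. The logged rate covers reading plus
    hashing, so a steady decline points at I/O or the system, not at fv.
    """
    digests = {}
    window_bytes = 0
    window_start = time.perf_counter()
    for dirpath, _dirs, files in os.walk(root):
        for name in sorted(files):
            path = os.path.join(dirpath, name)
            md5 = hashlib.md5()
            with open(path, "rb") as fh:
                while chunk := fh.read(CHUNK):
                    md5.update(chunk)
                    window_bytes += len(chunk)
                    if window_bytes >= report_every:
                        now = time.perf_counter()
                        elapsed = max(now - window_start, 1e-9)
                        log(f"{window_bytes / 2 ** 20 / elapsed:.1f} MiB/s")
                        window_bytes, window_start = 0, now
            digests[path] = md5.hexdigest()
    return digests

# Example (hypothetical path): hash_tree(r"E:\music")
```

If this script holds a steady rate over several hours on the same drive while fv degrades, the drop is more likely in the application; if both degrade, the drive or connection is the more likely cause.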

  • Tom Bramer

    You might want to compare fv to fvc for the same calculation operation.  During the investigation of the performance drops found in v0.6.5.6000, it was found that fvc's performance was fairly consistent and higher than that of fv.  The inconsistency had to do with the thread synchronization that happens in fv and doesn't happen in fvc.