| Name | Modified | Size |
|------|----------|------|
| fdf.tar.xz | 2018-11-17 | 8.5 kB |
| fdf_ss2.gif | 2018-11-17 | 8.5 kB |
| fdf3l1.c | 2018-11-17 | 23.7 kB |
| readme.md | 2018-11-17 | 3.1 kB |
| Makefile | 2018-11-17 | 82 Bytes |
| Totals: 5 items | | 44.0 kB |

Express version: run `fdf` (scans everything; not fast) or `fdf /path/to/scan`.

It's best run as root to be sure you have access to everything. Sudo will probably work.

I remember using a duplicate file finder under MS-DOS in the 1990s. This has no real connection to it; I never saw the source, it just tries to work like it. Written under Linux, tested under OpenBSD; it shouldn't have any dependencies. No libraries, plain C, no GUI.

It works by climbing a directory tree and recording information about each file. The files are then sorted by size; any that match in size are potential duplicates, so they're passed on to the next stage, which takes a CRC32 of each one. Any unique files drop out. The remainder are sorted again, by size and CRC, for grouping in the output, biggest files first.
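
For anyone curious what that pipeline looks like in code, here's a minimal sketch in plain C. It is not the fdf source: the directory walk is stood in for by paths given on the command line, the CRC-32 is the simple bitwise form rather than fdf's slice-by-8, and error handling is minimal.

```c
/* sketch.c -- illustrative sketch of the pipeline described above, not the
 * fdf source: sort by size, CRC32 only the size-matched files, then group. */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <inttypes.h>
#include <sys/stat.h>

struct entry {
    const char *path;
    uint64_t    size;
    uint32_t    crc;    /* only computed for files that match another in size */
};

/* Plain bitwise CRC-32 of a whole file (fdf itself uses slice-by-8). */
static uint32_t crc32_file(const char *path)
{
    unsigned char buf[1 << 16];
    uint32_t crc = 0xFFFFFFFFu;
    size_t n;
    FILE *f = fopen(path, "rb");
    if (!f) return 0;                     /* unreadable; fine for a sketch */
    while ((n = fread(buf, 1, sizeof buf, f)) > 0)
        for (size_t i = 0; i < n; i++) {
            crc ^= buf[i];
            for (int k = 0; k < 8; k++)
                crc = (crc >> 1) ^ ((crc & 1) ? 0xEDB88320u : 0);
        }
    fclose(f);
    return ~crc;
}

/* First pass: ascending size, so equal sizes end up adjacent. */
static int by_size(const void *a, const void *b)
{
    const struct entry *x = a, *y = b;
    return (x->size > y->size) - (x->size < y->size);
}

/* Output order: biggest files first, then by CRC so duplicates group. */
static int by_size_desc_then_crc(const void *a, const void *b)
{
    const struct entry *x = a, *y = b;
    if (x->size != y->size)
        return (y->size > x->size) - (y->size < x->size);
    return (x->crc > y->crc) - (x->crc < y->crc);
}

static void report_duplicates(struct entry *files, size_t n)
{
    qsort(files, n, sizeof *files, by_size);

    /* Only files sharing a size with a neighbour get read and CRC'd. */
    for (size_t i = 0; i < n; i++) {
        int prev = i > 0     && files[i].size == files[i - 1].size;
        int next = i + 1 < n && files[i].size == files[i + 1].size;
        files[i].crc = (prev || next) ? crc32_file(files[i].path) : 0;
    }

    qsort(files, n, sizeof *files, by_size_desc_then_crc);

    /* Print each size+CRC group of two or more as commented-out rm lines. */
    for (size_t i = 0; i < n; ) {
        size_t j = i + 1;
        while (j < n && files[j].size == files[i].size
                     && files[j].crc  == files[i].crc)
            j++;
        if (j - i > 1) {
            printf("# CRC: %08" PRIX32 " size: %" PRIu64 "\n",
                   files[i].crc, files[i].size);
            for (size_t k = i; k < j; k++)
                printf("#rm %s\n", files[k].path);
            putchar('\n');
        }
        i = j;
    }
}

/* Toy driver: the command-line arguments stand in for the directory walk. */
int main(int argc, char **argv)
{
    struct entry *files = calloc((size_t)argc, sizeof *files);
    struct stat st;
    size_t n = 0;

    if (!files) return 1;
    for (int i = 1; i < argc; i++)
        if (stat(argv[i], &st) == 0 && S_ISREG(st.st_mode)) {
            files[n].path = argv[i];
            files[n].size = (uint64_t)st.st_size;
            n++;
        }
    report_duplicates(files, n);
    free(files);
    return 0;
}
```

Compile with something like `cc -O2 sketch.c -o sketch` and hand it a handful of paths; the real program finds the files itself by walking the tree.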

The only command line parameter it knows is a starting path; if there isn't one it defaults to /. From there it will search everything that's mounted and reachable.
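
In code, that default boils down to something like the following (an illustration of the argument handling only, not a quote from fdf's source):

```c
#include <stdio.h>

/* Take the starting path from argv[1] if one was given, else default to "/". */
int main(int argc, char **argv)
{
    const char *start = (argc > 1) ? argv[1] : "/";
    printf("would start scanning at: %s\n", start);  /* the walk would begin here */
    return 0;
}
```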

The output is in the form of a script to feed to your shell, but by default every line is commented out so it does nothing. I don't know which copy you want to keep.

It looks like this:

    # CRC: B486C5EF size: 34
    #rm /usr/programs/c/backdir/Makefile
    #rm /usr/programs/c/backdir/backups/Makefile_2013-05-02_2000
    #rm /usr/programs/c/backdir/old_backups/Makefile_2013-05-02_2000

You work through the file and uncomment (delete the #) the lines for the files you want to get rid of. The biggest ones come first, so you can free the most space quickest by starting at the top. It doesn't have to be one line at a time if your editor can do search-and-replace-all. You could search for

    #rm /usr/freebie_backup

and replace with `rm /usr/freebie_backup` to delete all the duplicates in a given directory. But only the duplicates; anything unique would still be left.

As drives get bigger, one of the pitfalls is that it's possible to make bigger messes that take more work to clean up, if you get into stashing spare copies of files here and there for safekeeping. This will find them. Whether you're wading through 3 million files on a 1 TB drive or trying to fit more on a small SD card, this can help. If you run out of time or patience partway through, it's not a big deal; just run what you've got and you've cleaned up the worst of it.

If it's really necessary to have those files in all those places for some reason, consider keeping just one of them and replacing the rest with symlinks to it. See man ln (ln -s), or mc makes them pretty painlessly. A symlink is just a tiny file, roughly as many bytes as the path it stores, and for most purposes it works the same as the real file: opening it opens the file it points to. Symlinks are ignored by this program, BTW.
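
As an aside, telling a symlink apart from the file it points to comes down to an lstat() call; the little illustration below (not fdf's code) also prints the symlink's own size, which is simply the length of the path it stores.

```c
/* linkcheck.c -- illustration only: distinguish symlinks from regular files
 * with lstat(), which reports on the link itself rather than its target. */
#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
    struct stat st;
    for (int i = 1; i < argc; i++) {
        if (lstat(argv[i], &st) != 0) {
            perror(argv[i]);
        } else if (S_ISLNK(st.st_mode)) {
            /* A symlink's size is just the length of the path it stores. */
            printf("symlink, %lld bytes (would be ignored): %s\n",
                   (long long)st.st_size, argv[i]);
        } else if (S_ISREG(st.st_mode)) {
            printf("regular file, %lld bytes: %s\n",
                   (long long)st.st_size, argv[i]);
        }
    }
    return 0;
}
```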

My apologies for what seems to be the slowness of the CRC32 calculations. I used a slice-by-8 algorithm, which didn't seem to help much. Then I experimented with putting a bunch of gettimeofday() calls in to read times at microsecond resolution, and found the calculation time is dwarfed by the reading time, maybe 20:1. My 3/4-full 1 TB drive takes about 1.5 days for 3 million files. That will depend on the device it's reading from and how it's connected.
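
To reproduce that kind of measurement, it's enough to bracket the reads and the checksum updates separately with gettimeofday(); here's a rough sketch (again not fdf's instrumentation, and again using the plain bitwise CRC-32 rather than slice-by-8):

```c
/* time_crc.c -- sketch of timing reads and CRC updates separately with
 * gettimeofday() at microsecond resolution; not taken from fdf's source. */
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
#include <sys/time.h>

/* Microseconds since the Epoch; good enough for interval timing. */
static long long usec_now(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (long long)tv.tv_sec * 1000000LL + tv.tv_usec;
}

int main(int argc, char **argv)
{
    unsigned char buf[1 << 16];
    uint32_t crc = 0xFFFFFFFFu;
    long long read_us = 0, crc_us = 0, t;
    size_t n;
    FILE *f;

    if (argc < 2 || !(f = fopen(argv[1], "rb"))) {
        fprintf(stderr, "usage: time_crc <file>\n");
        return 1;
    }
    for (;;) {
        t = usec_now();
        n = fread(buf, 1, sizeof buf, f);
        read_us += usec_now() - t;
        if (n == 0)
            break;

        t = usec_now();
        for (size_t i = 0; i < n; i++) {      /* bitwise CRC-32 update */
            crc ^= buf[i];
            for (int k = 0; k < 8; k++)
                crc = (crc >> 1) ^ ((crc & 1) ? 0xEDB88320u : 0);
        }
        crc_us += usec_now() - t;
    }
    fclose(f);
    printf("CRC32 %08" PRIX32 "  read: %lld us  crc: %lld us\n",
           (uint32_t)~crc, read_us, crc_us);
    return 0;
}
```

With this naive CRC the calculation side will look worse than fdf's slice-by-8, and the read time only says much about the drive when the file isn't already in the page cache, but the shape of the measurement is the same.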
