I made a short dupmerge tutorial:
The original dupmerge author, Phil Karn, has rewritten dupmerge from scratch with a new algorithm: http://www.ka9q.net/code/dupmerge/ .
So since December 2008 there are two current versions (forks) of dupmerge: the Karn version and the Freitag version.
You can also find this dupmerge in the grml repository:
I found out that dupmerge can reach the system limit of hard links while dupmerging database directories (dupmerge output):
lstat: No such file or directory
Files linked: 208674 of 959741, Disk blocks reclaimed: 27286496
Minimum of found hard links: 1, Maximum: 32000.
When the hard link limit (32000 under ext3/Debian Lenny) is reached, dupmerge's unlink of the duplicate file still succeeds, but the subsequent hard link fails. So when the limit is reached, the duplicate files are lost!
I'm working on a version with "ln -f" which should preserve the duplicates when the limit is reached. Future versions should determine the system limit and merge the duplicates beyond the limit into a second file, until the limit is reached a second time, and so on.
Because that is not easy (it needs additional data structures, checks, etc.) and because more than 32000 identical files are uncommon, I will put it on the todo list.
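To illustrate the failure mode described above, here is a minimal Python sketch (not dupmerge's actual C code) of one way to avoid losing a duplicate: create the new hard link under a temporary name first and only then rename it over the duplicate. If the link call fails because the limit is reached (EMLINK), the duplicate has not been touched. The function name and the ".dupmerge-" prefix are hypothetical, chosen only for this example.

```python
import errno
import os
import tempfile


def replace_with_hard_link(original, duplicate):
    """Replace `duplicate` with a hard link to `original`.

    Returns True if both names now refer to the same file, and False if
    the hard link limit was hit (EMLINK); in that case `duplicate` is
    left untouched.
    """
    if os.path.samefile(original, duplicate):
        return True  # already linked, nothing to do
    directory = os.path.dirname(os.path.abspath(duplicate))
    # Reserve a unique temporary name in the same directory, then free it
    # so that os.link() can create the new link under that name.
    # (Another process could grab the name in between; a real tool
    # would retry in that case.)
    fd, tmp_name = tempfile.mkstemp(prefix=".dupmerge-", dir=directory)
    os.close(fd)
    os.unlink(tmp_name)
    try:
        os.link(original, tmp_name)
    except OSError as exc:
        if exc.errno == errno.EMLINK:
            return False  # link limit reached, duplicate still intact
        raise
    # The rename replaces the duplicate in one step, so its name never
    # disappears from the directory.
    os.replace(tmp_name, duplicate)
    return True
```

The point of the ordering is that the only step that can fail at the link limit happens before anything is removed.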
I made some minor changes, added some comments and checked with coreutils-5.2.1 (find ./ -type f -print0 | dupmerge) and my archive (about 500 GB in one million files on an encrypted partition).
In the release (zip) is an executable (dupmerge: sticky ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), for GNU/Linux 2.6.4, dynamically linked (uses shared libs), for GNU/Linux 2.6.4, stripped).
I made some minor changes, added Cygwin support and a new feature:
With the s option, soft links are used instead of hard links!
So dupmerge can do linking on more filesystems, e.g. FAT.
I've used this to get more free space on USB memory sticks and memory cards.
dupmerge was first used under Unix and I use it often under Linux, but it also works under MS-Windows with Cygwin (http://www.cygwin.com/). I tested only with 32-bit MS-Windows XP, but it should work under every MS-Windows version.
The only problem under MS-Windows is that compilation (with gcc in Cygwin) stops because mlockall is not available on this platform. If you comment out or delete that line (number 320), it compiles and works without problems.
Cygwin is always needed for dupmerge under MS-Windows, because it needs the find program for the input.
The new version 1.6 is now in a zip archive with the changelog and readme.
Added changelog and several file checks because fread does not distinguish between end-of-file and error.
Fixed the "started" message, which was still printed in quiet mode.
Because the old versions of dupmerge use fgets to read the file names, file names containing newlines cannot be read. Therefore, since version 1.5, fgets is replaced by fread and the old separator '\n' is replaced by '\0'; the input file names must now be separated by a zero byte and not by a newline. This is robust because a file name is a zero-terminated string and so can never contain a zero byte itself.
So instead of
find ./ -type f -print | dupmerge <options or not>
now you have to call
find ./ -type f -print0 | dupmerge <options or not>
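As an illustration of why the zero byte is the safer separator, here is a small Python sketch (not taken from dupmerge, which is written in C) that splits "find -print0" style input into file names. The function name is hypothetical.

```python
def split_nul_separated(data):
    """Split bytes as produced by `find -print0` into file names."""
    # Every name is followed by a NUL byte, so the piece after the last
    # NUL is empty and is dropped. Newlines stay part of the names.
    return [name for name in data.split(b"\0") if name]
```

For example, split_nul_separated(b"./a\nb\0./c\0") returns [b"./a\nb", b"./c"]: the name containing a newline survives as a single entry, whereas splitting on newlines would have cut it in two.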
And there was another little bug:
Version 1.4 showed the old version number 1.3.
Version 1.4 now has a deletion mode (option -d), which deletes duplicate files.
This mode can be used e.g. for cleaning up movie and picture archives.
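The idea of a deletion mode can be sketched in Python as follows. This is only an illustration under my own assumptions, not dupmerge's algorithm: files are grouped by size, compared byte by byte with filecmp.cmp, and all but the first file of each identical set are removed. The function names and the dry_run parameter are hypothetical.

```python
import filecmp
import os


def find_duplicates(paths):
    """Return (kept, duplicate) pairs for files with identical content."""
    by_size = {}
    for path in paths:
        by_size.setdefault(os.path.getsize(path), []).append(path)
    pairs = []
    for group in by_size.values():
        kept = []  # one representative per distinct content
        for path in group:
            for rep in kept:
                if os.path.samefile(rep, path):
                    break  # same file under another name: not a copy
                if filecmp.cmp(rep, path, shallow=False):
                    pairs.append((rep, path))
                    break
            else:
                kept.append(path)
    return pairs


def delete_duplicates(paths, dry_run=True):
    """Delete every duplicate found; with dry_run=True only report them."""
    pairs = find_duplicates(paths)
    if not dry_run:
        for _kept, duplicate in pairs:
            os.remove(duplicate)
    return pairs
```

Defaulting to dry_run=True is a deliberate choice for a sketch that removes files: the caller can inspect the returned pairs before anything is deleted.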
Tested with the coreutils and kernel sources and with a movie and picture archive.