First, thanks for a very nice and useful tool which I have used on several occasions :)
I think that one feature is missing though: the ability to specify or control in which direction the link is created. At the moment you give dupmerge a bunch of files, which dupmerge then nicely links together. The problem, as far as I can see, is that dupmerge has no preference for the direction of the linking. E.g. if you ask dupmerge to link content in directory 01 and content in directory 02, for some files the inode of the file in directory 01 will be used and for other files the inode of the file in directory 02 will be used.
This behaviour is not very good if you depend on a certain directory (e.g. the inodes within it) staying intact. One real-life example is when you're using rsync (with the --link-dest argument), rsnapshot or another backup utility to do differential backups of data. If the old source dir specified with the --link-dest option gets new inodes, rsync will think that these are different files and therefore copy all files with different inodes once again, even though they are the same.
Let's take an example: we're backing up 10G of data on a daily basis, which in this example doesn't change. However, due to errors in source file systems, upgrading of servers, etc., the source data is sometimes moved around, causing full backups instead of differential backups. We would like to use dupmerge to recreate these missing hardlinks by reconnecting identical data. A du -sm on the root of the backup directory would look something like:
10000mb backup01 (initial backup)
0mb backup02 (no changes in data, same inodes)
0mb backup03 (no changes in data, same inodes)
10000mb backup04 (backup source changed, causing rsync/rsnapshot to copy all data again)
10000mb backup05 (backup source changed, causing rsync/rsnapshot to copy all data again)
0mb backup06 (no changes in data, same inodes)
So we have 30GB of data, which optimally could be represented with 10GB of data by linking the identical files. Running dupmerge on these directories will bring the total size of backup01-06 down to 10GB, but dupmerge will also change the inodes of backup06, causing the upcoming backup07 to be a new (almost) full 10GB backup, leaving us with 20GB of total data instead of the intended 10GB. Running dupmerge again will just postpone the problem until the next time we do a backup.
As far as I can see, there's no way to get around this in dupmerge at the moment; you'll always end up with a total backup size of up to 2 x datasize + diffsize instead of 1 x datasize + diffsize. If it were possible to tell dupmerge to use a "master-directory" when creating the hardlinks, so that dupmerge uses the inodes of this directory on a file match, it would be possible to solve this issue with dupmerge :)
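To make the request a bit more concrete, here is a rough Python sketch of what I mean by a "master-directory". It is not how dupmerge works internally, just an illustration under my own assumptions: the function names (link_to_master, walk_files, file_digest) are made up, files are matched by size plus SHA-256 only, and ownership, permissions and mod-times of the replaced files are ignored.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def walk_files(root):
    """Yield every regular, non-symlink file below root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                yield path


def link_to_master(master_dir, other_dirs):
    """Hard-link duplicates in other_dirs to the files in master_dir.

    The inodes in master_dir are never replaced; only matching files
    in other_dirs are swapped for links to the master copy.
    Assumes everything lives on one file system (hard links cannot
    cross file systems). Returns the number of files relinked.
    """
    # Index the master directory by (size, digest).
    masters = {}
    for path in walk_files(master_dir):
        key = (os.path.getsize(path), file_digest(path))
        masters.setdefault(key, path)

    relinked = 0
    for root in other_dirs:
        for path in walk_files(root):
            key = (os.path.getsize(path), file_digest(path))
            keeper = masters.get(key)
            # Skip files without a match or already linked to the master.
            if keeper is None or os.path.samefile(keeper, path):
                continue
            # Link under a temporary name first, then rename over the
            # duplicate, so the path never disappears in between.
            tmp = path + ".relink-tmp"
            os.link(keeper, tmp)
            os.replace(tmp, path)
            relinked += 1
    return relinked
```

In the example above, something like link_to_master("backup06", ["backup01", "backup02", "backup03", "backup04", "backup05"]) would leave the inodes of backup06 untouched, so the next backup could still link against it.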
Another utility called Freedup (http://freedup.org), which is somewhat similar to dupmerge, has implemented such a feature, but unfortunately the program is too buggy when handling large amounts of data, making it unusable in real life.
I hope this could be a feature in a new version of dupmerge :)
Thanks for your effort on the program.
> rsync will think that these are different files and therefore copy all files with different inodes once again, even though they are the same.
rsync uses mod-time & size and not the inode, but the mod-times of equal files are often different, so the mod-time often gets changed as well when the inode gets changed (by hard linking).
Using a "master-directory" would be possible, but currently I don't have enough time to do it.
But you can use rsync with the option --checksum, so mod-times (and inodes) are not checked by rsync.
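To illustrate the difference between the two comparisons, here is a rough Python sketch. It is not rsync's actual code, only an approximation: the function names are made up, the quick check is reduced to size plus mod-time in whole seconds, and SHA-256 stands in for whatever checksum rsync really uses.

```python
import hashlib
import os


def same_by_quick_check(path_a, path_b):
    """Approximate the default check: equal size and equal mod-time."""
    a, b = os.stat(path_a), os.stat(path_b)
    # The inode number (st_ino) plays no role in this comparison.
    return a.st_size == b.st_size and int(a.st_mtime) == int(b.st_mtime)


def same_by_checksum(path_a, path_b, chunk_size=1 << 20):
    """Approximate a --checksum style check: compare content only."""
    def digest(path):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    # Different sizes can never have equal content, so skip hashing.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    return digest(path_a) == digest(path_b)
```

Two files with identical content but different mod-times fail the first check and pass the second, which is why --checksum helps here, at the cost of reading every file.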