README for finddupe.c
Copyright (c) 2009 Scott Mitchell Jennings
Released under the terms of the GNU General Public License
Sun Jan 18 16:12:26 PST 2009

* Preamble / Justification

The overriding principle in the development of finddupe is speed.

There are many programs available for managing and cataloging
various archives of data, and many for finding duplication within
archives. Many are extremely featureful, and have amazing
graphical interfaces. When I began trying them all, I found that
none of them were effectively usable on the extremely large
archives now easily possible with the availability of large and
inexpensive disk drives. Most were so slow merely opening their
catalogs that I usually found it was faster to locate the file
using /usr/bin/find, if the disk drive was online.

Eventually I just used shell scripts which used "find" and "ls"
to produce text files showing what files were on what disks. Then
I'd find what I had by name or by size using grep. Somewhat
tedious, not exactly fast, but still faster than anything else
I'd found. No program to load, just a quick grep. The slow part
was keeping the text files up to date, as I organized my data.

Meanwhile, I had been using an old utility called "finddupe"
basically forever. It was extremely fast at establishing that
there were no duplicate files in large repositories, but when
there were duplicates it slowed way down, as it did a byte-wise
compare between all potentially duplicate files.  I began to teach
myself C by hacking on it. My first hack was to make it ignore my
.xvpics directories. My next hack was to make it report
duplicates in quotes, so I could feed its output to shell
scripts. Eventually, the added features made the new finddupe
virtually unrecognizable as the old finddupe.
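
The speed comes from staging the work: files of different sizes
cannot be duplicates, so only files that share a size are worth
hashing or comparing at all. Here is a minimal sketch of that
cheap first stage (an illustration, not finddupe's actual source):

    /* candidates.c - group the named files by size.  Only groups
     * of two or more would go on to the MD5 and byte-wise stages. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/stat.h>

    struct entry { const char *name; off_t size; };

    static int by_size(const void *a, const void *b)
    {
        off_t sa = ((const struct entry *)a)->size;
        off_t sb = ((const struct entry *)b)->size;
        return (sa > sb) - (sa < sb);
    }

    int main(int argc, char **argv)
    {
        struct entry *e = calloc(argc, sizeof *e);
        struct stat st;
        int i, n = 0;

        for (i = 1; i < argc; i++)
            if (stat(argv[i], &st) == 0 && S_ISREG(st.st_mode)) {
                e[n].name = argv[i];
                e[n].size = st.st_size;
                n++;
            }
        qsort(e, n, sizeof *e, by_size);

        for (i = 0; i < n; ) {        /* report runs of equal size */
            int j = i + 1;
            while (j < n && e[j].size == e[i].size)
                j++;
            if (j - i > 1) {
                printf("%ld bytes:\n", (long)e[i].size);
                for (; i < j; i++)
                    printf("  \"%s\"\n", e[i].name);
            }
            i = j;
        }
        free(e);
        return 0;
    }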

* Invocation

Finddupe accepts multiple command line switches in either long
(two dashes) or short forms. (--hardlink-dups could be seen as
dangerous, and has no short form.) It then expects a list of paths.

Paths which are directories are searched recursively for all
regular files on the same filesystem.
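
One way to implement such a walk is with nftw(3), where FTW_MOUNT
keeps the search on one filesystem and FTW_PHYS keeps it from
following symbolic links. This is only a sketch of the technique;
finddupe's own traversal code may differ:

    /* walk.c - recurse into each named directory, printing the
     * path and size of every regular file found on the starting
     * filesystem. */
    #define _XOPEN_SOURCE 500
    #include <ftw.h>
    #include <stdio.h>
    #include <sys/stat.h>

    static int visit(const char *path, const struct stat *sb,
                     int type, struct FTW *ftw)
    {
        (void)ftw;
        if (type == FTW_F && S_ISREG(sb->st_mode))
            printf("%s %ld\n", path, (long)sb->st_size);
        return 0;                 /* zero means keep walking */
    }

    int main(int argc, char **argv)
    {
        int i;
        for (i = 1; i < argc; i++)
            if (nftw(argv[i], visit, 32, FTW_MOUNT | FTW_PHYS) == -1)
                perror(argv[i]);
        return 0;
    }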

Here is the current list of accepted command line switches:

 -v or --verbose or --verbose=N

Each -v increases "verbosity" by one level; with --verbose=N, N
sets the level explicitly.

 -c or --catalog

This causes finddupe to check the files in the paths specified
against all files in all catalogs, as well as against themselves.
(By default, command line paths are checked only against themselves.)

 -a or --all

This causes finddupe to check for duplicates everywhere in all
files in all catalogs.

 -m --force-md5

This causes finddupe to ensure that MD5 data exists for all files
referenced on the command line. By default, MD5 data is only
generated when it is needed to identify duplication. This feature
is extremely useful for removable media, ensuring that the MD5
data will be available in the future, even if the media itself is
not presently available.
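
For illustration, here is one way to hash a file, using OpenSSL's
EVP interface (an assumption made for this sketch; finddupe
presumably carries its own MD5 code). Link with -lcrypto:

    /* md5file.c - compute the MD5 digest of one file, reading in
     * large blocks so the pass stays I/O-bound.  Returns the
     * digest length (16 for MD5) or -1 on error. */
    #include <openssl/evp.h>
    #include <stdio.h>

    int md5_file(const char *path, unsigned char out[EVP_MAX_MD_SIZE])
    {
        unsigned char buf[65536];
        unsigned int len = 0;
        size_t n;
        EVP_MD_CTX *ctx;
        FILE *f = fopen(path, "rb");

        if (!f)
            return -1;
        ctx = EVP_MD_CTX_new();
        EVP_DigestInit_ex(ctx, EVP_md5(), NULL);
        while ((n = fread(buf, 1, sizeof buf, f)) > 0)
            EVP_DigestUpdate(ctx, buf, n);
        EVP_DigestFinal_ex(ctx, out, &len);
        EVP_MD_CTX_free(ctx);
        fclose(f);
        return (int)len;
    }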

 -p --paranoid

When duplicates are identified (files are of identical size and
have identical MD5 sums) this switch causes finddupe to also do a
byte-wise compare of the two files' contents. (slow)
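
A minimal sketch of such a comparison (illustrative, not
finddupe's actual source): read both files in large blocks and
stop at the first difference.

    /* samebytes.c - byte-wise compare two files whose sizes are
     * already known to match.  Returns 1 if identical, 0 if they
     * differ, -1 on error. */
    #include <stdio.h>
    #include <string.h>

    int same_bytes(const char *a, const char *b)
    {
        char ba[65536], bb[65536];
        size_t na, nb;
        int same = 1;
        FILE *fa = fopen(a, "rb"), *fb = fopen(b, "rb");

        if (!fa || !fb) {
            if (fa) fclose(fa);
            if (fb) fclose(fb);
            return -1;
        }
        do {
            na = fread(ba, 1, sizeof ba, fa);
            nb = fread(bb, 1, sizeof bb, fb);
            if (na != nb || memcmp(ba, bb, na) != 0)
                same = 0;         /* first mismatch ends the loop */
        } while (same && na == sizeof ba);
        fclose(fa);
        fclose(fb);
        return same;
    }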

 -h --show-hard

This causes finddupe to treat files that are hard linked as if
they were duplicates. By default hard links are not considered
duplicates, but are *not* ignored (as in the original finddupe)
and will always be reported in the duplicate list if they have
the same content as files in at least one other inode.
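
Hard links are recognizable because two paths name the very same
inode. A sketch of the test (standard stat(2) usage, though not
finddupe's actual source):

    /* samefile.c - two paths are hard links to each other exactly
     * when both the device and inode numbers match. */
    #include <sys/stat.h>

    int same_inode(const char *a, const char *b)
    {
        struct stat sa, sb;
        if (stat(a, &sa) != 0 || stat(b, &sb) != 0)
            return -1;            /* couldn't stat one of them */
        return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
    }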

 -z N --ignore-less=N

This causes finddupe to ignore files smaller than size N when
looking for duplicates.

 -. --ignore-hidden

This causes finddupe to ignore all files and directories whose
names start with a '.'. Entire branches (all subdirectories) are
also ignored.

 -e --edit

This causes finddupe to allow you to edit the comments for all
volumes of media referenced by the command line paths. If any of
the command line paths explicitly reference a single file,
comments for those files will also be edited.  Note that files
whose comments start with a '.' are *always* ignored when
searching for duplication.

 -n --no-catalogs

This causes finddupe never to read in or write out any catalog
files.  This makes its behavior much closer to the original
finddupe. It also means that MD5 data will have to be generated
from scratch, if needed.

 -rN --reports=N

This causes finddupe to report only the first N groups of
duplicate files.

 -fpathname --catalog-dir=pathname

This causes finddupe to use "pathname" as the directory to store
catalog files in.  By default this is "~/.finddupe".

 -xfilename --exclude=filename

This causes finddupe to ignore files and/or directories of
exactly this filename.
 
 --hardlink-dups

This causes finddupe to attempt to hard link all duplicate files
together. There is no guarantee which file will be kept and which
trashed, so one of them will inherit the stats of the other.
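
A sketch of one safe way to do such a relink (an illustration
under assumptions, not finddupe's actual source; the temporary
suffix is hypothetical): link the kept copy to a temporary name,
then rename it over the duplicate, so the duplicate's pathname
never dangles.

    /* relink.c - replace "dup" with a hard link to "keep".  Both
     * must be on the same filesystem, or link(2) fails. */
    #include <stdio.h>
    #include <unistd.h>

    int hardlink_dup(const char *keep, const char *dup)
    {
        char tmp[4096];

        snprintf(tmp, sizeof tmp, "%s.fdtmp", dup); /* hypothetical name */
        if (link(keep, tmp) != 0)
            return -1;            /* e.g. different filesystems */
        if (rename(tmp, dup) != 0) {
            unlink(tmp);          /* roll back the extra link */
            return -1;
        }
        return 0;
    }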

 -d --debug --debug=N

Finddupe is still in beta testing, and contains many lines of
code useful only for debugging. These switches act like the -v
switch, increasing the verbosity of the debug messages.

  -smj