Read Me
lsdup - List Duplicates.
Copyright 2013 Shayne Riley
GPL v3 - Read COPYING for more information.
usage: lsdup [options] [DIRS]
    --delete                      Delete duplicates.
 -h,--help                        Print the help contents of the program.
    --identify-earliest-modified  The duplicate that was modified the
                                  earliest will be considered original.
    --identify-latest-modified    The duplicate that was modified the
                                  latest will be considered original.
    --identify-least-parents      The duplicate that has the fewest
                                  parent directories will be considered
                                  original.
    --identify-longest-path       The duplicate that has the longest
                                  relative path will be considered
                                  original.
    --identify-most-parents      The duplicate that has the most parent
                                  directories will be considered
                                  original.
    --identify-shortest-path      The duplicate that has the shortest
                                  relative path will be considered
                                  original.
    --ignore                      Ignore any file or directory matching
                                  any of the patterns. * and ? wildcards
                                  may be used.
    --list <arg>                  Write the list of duplicates to this
                                  file. If not specified, the list will
                                  be written to standard output.
    --move <arg>                  Move duplicate files to this directory.
    --override-read-only <arg>    When moving or deleting duplicate
                                  files, ignore any read-only
                                  permissions.
    --perform <arg>               Perform the directions in the list
                                  file.
    --read-only <arg>             DIR... - Any directories in this list
                                  are only compared against, but will
                                  not be moved or deleted.
    --subjects <arg>              DIR... - Must be used in conjunction
                                  with "--read-only". Specifies the
                                  folders which may have duplicates
                                  listed.
Lists files whose contents are equal. The list can then be used to move or
delete duplicates.
To use lsdup, first run it once to create a list of original files and
duplicate files. You can then look through the list, making changes as
needed (for example, changing which files are marked as duplicates and
which as originals). Finally, run lsdup again with the list file, and it
will perform the requested actions on the duplicates.
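A minimal end-to-end session might look like the following. The paths here
are illustrative, and "--delete" is given together with "--perform" so that
duplicates are deleted rather than skipped, per the description of "d" lines
later in this document:
lsdup --list list_of_dupes --identify-earliest-modified /home/user/photos
(review and edit list_of_dupes in a text editor)
lsdup --delete --perform list_of_dupes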
There are two different modes when creating the list of duplicates:
* Comparison mode: directories are put into two groups: read-only and
subjects. All files in the "read-only" group are kept safe from being moved,
deleted, or changed, and are marked as originals. Any files in the "subjects"
group that are found to be equal to files in the first group are marked as
duplicates.
lsdup --list list_of_dupes --read-only /home/user/keepers /home/user/morekeepers --subjects /home/user/mayhaveduplicates /home/user/moredupe
* Single mode: all directories are scanned for duplicates. The original is
singled out from the duplicates based on one of the following criteria,
chosen by the user (an example invocation follows the list):
* Keep earliest-modified: In a set of files of identical content, the file
with the earliest modification date will be marked as original.
* Keep most-recently-modified: In a set of files of identical content, the file
with the most-recent modification date will be marked as original.
* Keep least parents: In a set of files of identical content, the file
with the fewest parent directories will be marked as original.
* Keep most parents: In a set of files of identical content, the file
with the most parent directories will be marked as original.
* Keep shortest path: In a set of files of identical content, the file
with the shortest path (in characters) will be marked as original.
* Keep longest path: In a set of files of identical content, the file
with the longest path (in characters) will be marked as original.
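For example, to scan two directories and keep the copy with the shortest
path (the paths here are illustrative):
lsdup --list list_of_dupes --identify-shortest-path /home/user/music /mnt/backup/music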
The file that is generated may look like the following:
***START EXAMPLE***
= "test\ro\3\a deep dir\linux.iso" 693 MB adef5e36...
d "test\dup\2\deeper folder\linux.iso"
= "test\ro\2\original group fred.txt" 29 bytes 07e2baec...
d "test\dup\1\duplicate group fred3.txt"
d "test\dup\2\duplicate group fred.txt"
d "test\dup\2\duplicate group fred2.txt"
= "test\ro\1\original group 1.txt" 41 bytes 201ec4b9...
m "test\dup\2\duplicate group 1.txt"
= "test\ro\1\original group 2.txt" 41 bytes 21fb5f09...
= "test\ro\2\duplicate group 2 .txt"
d "test\dup\2\duplicate group 2 .txt"
4 duplicate sets
6 duplicate files
693 MB duplicate bytes
***END EXAMPLE***
The file is sorted into groups separated by blank lines, starting with the
group that has the most duplicated data (in bytes) and ending with the least.
Each group represents a duplicate set, where one or more files are considered
original (denoted with "="), and one or more files are considered duplicates
(denoted with "d").
The first line of each group also lists the size of the files in the group,
followed by a snippet of the hash of their shared contents.
Finally, the last three lines indicate how many duplicate sets were found, how
many duplicate files were found (originals are not counted), and how many total
duplicate bytes were found (original files do not have their sizes included).
The list is editable (an example of an edited group follows this list). A
line's meaning depends on its first character:
* "#" is a comment line. It is ignored. You may use it to make notes or to
disable an action.
* "=" is an original file line. The file on this line will not be affected.
* "d" is a duplicate file line. The file may be moved or deleted, depending
on whether the "--move" or "--delete" option was used. If neither is used,
nothing happens (the default action is "skip").
* "m" is a move file line. It can be written as:
m "file"
which moves the file to whatever directory was given on the command line
with "--move (destination)". The other form is:
m "file" > "destination"
which moves the file to the given destination directory.
* "r" is a remove file line. It deletes the file.
* "s" is a skip file line. No action is performed on the file (similar to
how "=" behaves, but without marking the file as original).
Once all of the changes (if any) are made to the list file, it can be passed
to lsdup via the "--perform" option to perform the actions in the list. Since
the paths in the file can be relative, it is best to run lsdup from the same
working directory as when the list was created:
lsdup --perform list_of_dupes
Once all the actions are performed, your files will be deduplicated; any
errors that are encountered will be listed. The list can be run again, and
any files that were already moved or deleted will be skipped.
Why I made this:
I've had multiple computers over the years, and multiple
different backup systems. Over time, files and backups were shifted from one
computer to another, and some were "archived" onto other hard drives, CDs,
DVDs, etc.
It got so bad that finally I had to sit down and sort out the mess of which
files I ultimately wanted to keep, and which I wanted to get rid of forever.
I kept a drive that I considered to be the master: Files that I wanted to
keep were organized in one sub-directory, and files I wanted to eventually
delete were kept in another.
With this strategy in place, I could finally go through my multiple archives
using CloneSpy (Windows only), which let me put the "master" directories
together in group 1 and the directories I still needed to organize into
group 2. It had an option to delete duplicates only from group 2 while
leaving group 1 alone. In this way, I was easily able to figure out which
files I had already organized or "deleted" and which were unique to the
archive.
When it came time to organize my Linux drives, there were many deduplication
tools available, but none had that feature. So I developed a command-line
equivalent.