
lsdup - List Duplicates.

Copyright 2013 Shayne Riley
GPL v3 - Read COPYING for more information.

usage: lsdup [options] [DIRS]
    --delete                       Delete duplicates.
 -h,--help                         Print the help contents of the program.
    --identify-earliest-modified   The duplicate that is earliest modified
                                   will be considered original.
    --identify-latest-modified     The duplicate that is latest modified
                                   will be considered original.
    --identify-least-parents       The duplicate that has the least
                                   parents will be considered original.
    --identify-longest-path        The duplicate that has the longest
                                   relative path will be considered
                                   original.
    --identify-most-parents        The duplicate that has the most parents
                                   will be considered original.
    --identify-shortest-path       The duplicate that has the shortest
                                   relative path will be considered
                                   original.
    --ignore                       Ignore any file or directory matching
                                   any of the patterns. * and ? wildcards
                                   may be used.
    --list <arg>                   Write the list of duplicates to this
                                   file. If not specified, the list will
                                   be written to standard output.
    --move <arg>                   Move duplicate files to this directory.
    --override-read-only <arg>     When moving or deleting duplicate
                                   files, ignore any read-only
                                   permissions.
    --perform <arg>                Perform the directions in the list
                                   file.
    --read-only <arg>              DIR... - Any directories in this list
                                   are only compared against, but will not
                                   be moved or deleted.
    --subjects <arg>               DIR... - Must be used in conjunction
                                   with "--read-only". Specifies the
                                   folders which may have duplicates
                                   listed.

Lists files whose contents are equal. The list can then be used to move or
delete duplicates.

To use lsdup, first run it once to create a list of original files and
duplicate files. You can then look through the list, making changes as needed
(for example, changing which files are duplicates and which are originals).
Finally, run lsdup with the list file, and it will perform the chosen action
on each duplicate.
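
For example, a minimal end-to-end session might look like this (the paths and
the particular combination of options are illustrative; the individual options
are described above and below):

    lsdup --list list_of_dupes --identify-earliest-modified /home/user/photos
    (review and edit list_of_dupes in a text editor)
    lsdup --delete --perform list_of_dupes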

There are two different modes when creating the list of duplicates:
* Comparison mode: directories are put into two groups, read-only and
  subjects. All files in the "read-only" group are marked as originals and
  kept safe from being moved, deleted, or changed. Any files in the
  "subjects" group that are found to be equal to files in the first group
  are marked as duplicates.

    lsdup --list list_of_dupes --read-only /home/user/keepers /home/user/morekeepers --subjects /home/user/mayhaveduplicates /home/user/moredupe 

* Single mode: All directories will be scanned for duplicates. How the original
  is singled out from the duplicates is based on one of the following criteria,
  chosen by the user (an example follows the list):
  * Keep earliest-modified: In a set of files of identical content, the file
    with the earliest modification date will be marked as original.
  * Keep latest-modified: In a set of files of identical content, the file
    with the most recent modification date will be marked as original.
  * Keep least parents: In a set of files of identical content, the file
    with the fewest parent directories will be marked as original.
  * Keep most parents: In a set of files of identical content, the file
    with the most parent directories will be marked as original.
  * Keep shortest path: In a set of files of identical content, the file
    with the shortest relative path (in characters) will be marked as original.
  * Keep longest path: In a set of files of identical content, the file
    with the longest relative path (in characters) will be marked as original.
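
For instance, to scan two directories and keep the copy with the fewest parent
directories as the original (the paths are illustrative):

    lsdup --list list_of_dupes --identify-least-parents /home/user/photos /mnt/backup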
 
The file that is generated may look like the following:
***START EXAMPLE***
 
= "test\ro\3\a deep dir\linux.iso" 693 MB adef5e36...
d "test\dup\2\deeper folder\linux.iso"

= "test\ro\2\original group fred.txt" 29 bytes 07e2baec...
d "test\dup\1\duplicate group fred3.txt"
d "test\dup\2\duplicate group fred.txt"
d "test\dup\2\duplicate group fred2.txt"

= "test\ro\1\original group 1.txt" 41 bytes 201ec4b9...
m "test\dup\2\duplicate group 1.txt"

= "test\ro\1\original group 2.txt" 41 bytes 21fb5f09...
= "test\ro\2\duplicate group 2 .txt"
d "test\dup\2\duplicate group 2 .txt"

4 duplicate sets
6 duplicate files
693 MB duplicate bytes

***END EXAMPLE***

The file is sorted into groups separated by blank lines, starting with the
group that has the most duplicated data (in bytes) and ending with the least.

Each group represents a duplicate set, where one or more files are considered
original (denoted with "="), and one or more files are considered duplicates
(denoted with "d").

The size of each file in the group is listed on the first line of the group,
followed by a snippet of the hash of the contents of the files in that group.

Finally, the last three lines indicate how many duplicate sets were found, how
many duplicate files were found (originals are not counted), and how many total
duplicate bytes were found (original files do not have their sizes included).
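
To tie this to the example above: it contains 4 groups (duplicate sets); the
"d" and "m" lines add up to 6 duplicate files; and the 693 MB linux.iso
duplicate accounts for essentially all of the duplicate bytes.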

The list is editable. A line's meaning is determined by its first character
(an example of an edited group follows this list):
 * "#" marks a comment line. It is ignored. You may use it to make notes or
   to disable an action.
 * "=" marks an original file line. The file on this line will not be
   affected.
 * "d" marks a duplicate file line. The file may be moved or deleted,
   depending on whether the "--move" or "--delete" option was used. If
   neither is used, nothing happens (the default action is "skip").
 * "m" is a move file line. It can read as:
   
   m "original"
   
   which will move the file to whatever directory was given on the commandline
   for "--move (destination)"
   
   the other option is
   
   m "original" > "destination"
   
   which moves the file to the destination directory.
 * "r" is a remove file line. It deletes the file.
 * "s" is a skip file line. No action is performed on the file (similar to
   how "o" behaves, but without saying that the file is original.

Once all of the changes are made (if any) to the list file, it can be passed
to lsdup via the --perform option to perform the actions in the list. Since
the paths in the file can be relative, it is best to run lsdup from the same
directory as the first run:

    lsdup --perform list_of_dupes
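
If the "d" lines should be moved instead of deleted, the move destination can
be given on the same command (the path is illustrative):

    lsdup --move /home/user/dupes --perform list_of_dupes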

Once all the actions are performed, your files will be deduplicated; any
errors that are encountered will be listed. The list can be run again, and
any files that were already moved or deleted will be skipped.


 
 
Why I made this:

I've had multiple computers over the years, and multiple different backup
systems. Over time, files and backups got shifted from one computer to
another, and some got "archived" onto other hard drives, CDs, DVDs, etc.

It got so bad that I finally had to sit down and sort out the mess of which
files I ultimately wanted to keep, and which I wanted to get rid of forever.
I kept a drive that I considered to be the master: files that I wanted to
keep were organized in one sub-directory, and files I wanted to eventually
delete were kept in another.

With this strategy in place, I could finally go through my multiple archives
using CloneSpy (Windows only), which let me group the "master" directories
together in group 1 and the directories I still needed to organize into
group 2. It had an option to delete only from group 2 and leave group 1
alone. In this way, I was easily able to figure out which files I had
already organized or "deleted" and which were unique to the archive.

When it came time to organize my Linux drives, there were many deduplication
tools available, but none had that feature. So I developed a command line
equivalent to it.
  